Found 2 repositories (showing 2)
Direct Preference Optimization (DPO) for large language diffusion models (LLaDA-8B), using a Monte Carlo ELBO-based preference loss, LoRA adapters, and 8-bit quantization for efficient single-GPU training. Achieves a +4% win rate over the baseline on the Anthropic HH-RLHF preference dataset.
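To illustrate the technique named in the description, here is a minimal sketch of a Monte Carlo ELBO-based DPO loss for a masked diffusion LM. It is a reconstruction under assumptions, not the repository's actual code: `model`/`ref_model` are assumed to be HF-style modules returning `.logits`, and `mask_token_id` is the model's mask token. Because exact log-likelihoods are intractable for diffusion LMs, per-sequence ELBO estimates stand in for them inside the standard DPO objective.

```python
import torch
import torch.nn.functional as F

def mc_elbo(model, input_ids, mask_token_id, n_samples=4):
    """Monte Carlo estimate of the masked-diffusion ELBO, a lower bound on
    log p(input_ids): sample a masking ratio t ~ U(0, 1], mask each token
    independently with probability t, then score only the masked positions,
    importance-weighted by 1/t."""
    batch, seq_len = input_ids.shape
    estimates = []
    for _ in range(n_samples):
        t = torch.rand(batch, 1, device=input_ids.device).clamp(min=1e-3)
        mask = torch.rand(batch, seq_len, device=input_ids.device) < t
        noised = input_ids.masked_fill(mask, mask_token_id)
        logits = model(noised).logits  # (B, L, V); HF-style output (assumption)
        logp = torch.log_softmax(logits, dim=-1)
        tok_logp = logp.gather(-1, input_ids.unsqueeze(-1)).squeeze(-1)
        estimates.append((tok_logp * mask / t).sum(dim=-1))
    return torch.stack(estimates).mean(dim=0)  # (B,)

def dpo_elbo_loss(model, ref_model, chosen_ids, rejected_ids,
                  mask_token_id, beta=0.1):
    """DPO loss with Monte Carlo ELBOs standing in for exact log-likelihoods."""
    pol_c = mc_elbo(model, chosen_ids, mask_token_id)
    pol_r = mc_elbo(model, rejected_ids, mask_token_id)
    with torch.no_grad():  # reference model is frozen
        ref_c = mc_elbo(ref_model, chosen_ids, mask_token_id)
        ref_r = mc_elbo(ref_model, rejected_ids, mask_token_id)
    margin = beta * ((pol_c - ref_c) - (pol_r - ref_r))
    return -F.logsigmoid(margin).mean()
```

With LoRA adapters, the frozen base model can double as the reference by disabling the adapters for the `ref_*` passes, which is one way such a setup fits on a single GPU alongside 8-bit weights.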
demo11122
A novel DPO framework that incorporates preference strength into preference optimization. The framework is evaluated on state-of-the-art diffusion models and LLMs, and all experiments in the submitted paper can be reproduced with this code.
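The description does not show the loss itself; the sketch below is one plausible way to fold a per-pair strength score into DPO, via soft labels, and is an illustrative variant rather than this repository's formulation. The log-ratio inputs and the `strength` score in [0, 1] are assumptions.

```python
import torch
import torch.nn.functional as F

def strength_weighted_dpo_loss(chosen_logratio, rejected_logratio,
                               strength, beta=0.1):
    """DPO loss with soft labels derived from preference strength.

    chosen_logratio / rejected_logratio: log pi(y|x) - log pi_ref(y|x)
        per pair, shape (B,).
    strength: per-pair preference strength in [0, 1], shape (B,);
        0 means a near-tie, 1 a maximally confident preference.
    """
    margin = beta * (chosen_logratio - rejected_logratio)
    # Map strength s to a soft label p = (1 + s) / 2: s = 0 gives p = 0.5
    # (a tie contributes symmetrically), while s = 1 recovers standard DPO.
    p = (1.0 + strength) / 2.0
    return -(p * F.logsigmoid(margin)
             + (1.0 - p) * F.logsigmoid(-margin)).mean()
```

Weighting by strength means weakly preferred pairs pull the policy less than strongly preferred ones, instead of every pair receiving the same hard label as in standard DPO.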