Direct Preference Optimization (DPO) for large language diffusion models (LLaDA-8B), using a Monte Carlo ELBO-based preference loss, LoRA adapters, and 8-bit quantization for efficient single-GPU training. Achieves improved alignment, with a +4% win rate over the baseline on Anthropic HH-RLHF preference data.
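The description combines DPO with an ELBO-based likelihood: for diffusion language models the exact log-likelihood is intractable, so the per-response log-probabilities in the DPO objective are replaced by Monte Carlo ELBO estimates. A minimal sketch of that loss, assuming scalar ELBO estimates are already computed for the chosen (w) and rejected (l) responses under the policy and reference models (function and argument names here are illustrative, not from the repository):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def dpo_elbo_loss(elbo_w_policy: float, elbo_w_ref: float,
                  elbo_l_policy: float, elbo_l_ref: float,
                  beta: float = 0.1) -> float:
    """DPO loss with Monte Carlo ELBO estimates standing in for
    exact log-likelihoods (the ELBO lower-bounds log p(y | x)).

    The loss rewards the policy for increasing the margin between
    the chosen and rejected responses relative to the reference model.
    """
    margin = (elbo_w_policy - elbo_w_ref) - (elbo_l_policy - elbo_l_ref)
    return -math.log(sigmoid(beta * margin))

# Larger margin in favor of the chosen response -> smaller loss.
loss_good = dpo_elbo_loss(-10.0, -12.0, -15.0, -13.0)  # margin = +4
loss_flat = dpo_elbo_loss(-12.0, -12.0, -13.0, -13.0)  # margin = 0
```

In practice each ELBO term would itself be a Monte Carlo average over sampled diffusion timesteps and masks, with `beta` controlling how strongly the policy is pulled away from the reference model.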
Stars: 2
Forks: 0
Watchers: 2
Open Issues: 0
Overall repository health assessment
No package.json found; this may not be a Node.js project.
Commits: 3