Code for the paper "BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping" by Zhiheng Xi et al.
Stars: 92
Forks: 6
Watchers: 92
Open Issues: 3
3b8389c: Update policy_loss.py with explicit importance-ratio bounds and global consistency for the bound search across DP ranks
b1cd33c: Delete the paper, as the official paper is available on arXiv