Found 14 repositories (showing 14)
JingMog
[ICLR-2026] Official Implementation of our paper "THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning".
Su-my
The official repository for Trust-Region Adaptive Policy Optimization (TRAPO) – a novel hybrid framework designed to enhance large language models' reasoning abilities by interleaving SFT and RL within each training instance.
wenquanlu
Fully open reproduction of DeepSeek-R1
ShuaiLyu0110
MBPO: Multi-Branch Policy Optimization via tree search for multimodal large language models. A tree-based RL framework enabling branch-level credit assignment to address credit ambiguity in vision-language reasoning.
mcar18
Core idea: train a reinforcement learning agent that improves reasoning prompts for an LLM. Instead of fine-tuning the LLM directly, the agent learns to optimize the reasoning process.
wenquanlu
No description available
flamehaven01
Hybrid Reasoning Policy Optimization (HRPO): a research prototype for hybrid latent reasoning with RL.
GRPO Reasoning Training: RL-based training with Group Relative Policy Optimization for improving reasoning in language models (a minimal sketch of the GRPO advantage computation follows this list)
Sayanc93
A repository for RL-optimizing joint latent (Coconut) + CoT reasoning (GRPO/GSPO) in language models
garg-tejas
Applied Group Relative Policy Optimization (GRPO) to induce structured reasoning in Mistral-7B, achieving improved GSM8K math accuracy through RL.
BrotherAI
Official implementation of "TAPO: Dynamic Teacher and Perturbed Answer Injection for Policy Optimization", a fine-grained RL framework for reasoning alignment in LLMs.
Auenchanters
A headless 2D data center simulation environment designed to benchmark LLMs and RL agents on spatial reasoning, thermodynamics, and Power Usage Effectiveness (PUE) optimization (the PUE formula is sketched after this list)
A dual-process RL framework modeling human reasoning. Arbitrates between MCTS and heuristics to optimize cognitive effort in sparse-reward environments. Validated against human behavioral datasets.
AkshaySyal
End-to-end pipelines for post-training LLMs using Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Online RL (GRPO/PPO). Transforms base models into instruction-following, behavior-aligned assistants with improved reasoning and controllability (see the DPO loss sketch after this list).
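
Several entries above (flamehaven01's GRPO Reasoning Training, garg-tejas's Mistral-7B run, AkshaySyal's pipelines) center on Group Relative Policy Optimization. Below is a minimal sketch of the group-relative advantage computation at GRPO's core, assuming the standard formulation from the DeepSeekMath paper; the function name and scoring scheme are illustrative, and each repository's actual implementation may differ.

import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    # GRPO samples a group of completions per prompt and normalizes
    # each completion's reward against the group mean and std,
    # replacing a learned value critic.
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Toy group of four completions for one prompt, scored 1.0 when the
# final answer is correct and 0.0 otherwise.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # ~ [ 1. -1. -1.  1.]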
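
AkshaySyal's pipeline entry also lists Direct Preference Optimization. The sketch below computes the standard DPO loss for a single preference pair, following the original DPO formulation; the log-probabilities would come from the trained policy and a frozen reference model, and all names here are illustrative rather than taken from that repository.

import numpy as np

def dpo_loss(pi_logp_chosen, pi_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # DPO trains the policy to prefer the chosen response over the
    # rejected one, measured relative to a frozen reference model.
    margin = beta * ((pi_logp_chosen - ref_logp_chosen)
                     - (pi_logp_rejected - ref_logp_rejected))
    return np.log1p(np.exp(-margin))  # equals -log sigmoid(margin)

# Example pair: the policy already slightly favors the chosen response.
print(dpo_loss(-12.0, -15.0, -13.0, -14.0))  # ~0.598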
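
Auenchanters's data center environment benchmarks agents on PUE optimization. PUE itself is a standard metric (total facility power divided by IT equipment power), worked through below; the power values are made up for illustration.

def pue(total_facility_kw, it_equipment_kw):
    # Power Usage Effectiveness: total facility power / IT power.
    # 1.0 is the theoretical ideal; lower is better.
    return total_facility_kw / it_equipment_kw

# A facility drawing 1,500 kW overall with 1,200 kW of IT load:
print(pue(1500.0, 1200.0))  # -> 1.25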