Found 14 repositories (showing 14)
JingMog
[ICLR-2026] Official Implementation of our paper "THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning".
Su-my
The official repository for Trust-Region Adaptive Policy Optimization (TRAPO) – a novel hybrid framework designed to enhance large language models' reasoning abilities by interleaving SFT and RL within each training instance.
wenquanlu
Fully open reproduction of DeepSeek-R1
ShuaiLyu0110
MBPO: Multi-Branch Policy Optimization via tree search for multimodal large language models. A tree-based RL framework enabling branch-level credit assignment to address credit ambiguity in vision-language reasoning.
mcar18
Core idea: train a reinforcement learning agent that improves reasoning prompts for an LLM. Instead of fine-tuning the LLM directly, the agent learns to optimize the reasoning process.
wenquanlu
No description available
flamehaven01
Hybrid Reasoning Policy Optimization (HRPO): a research prototype for hybrid latent reasoning with RL.
GRPO Reasoning Training: RL-based training with Group Relative Policy Optimization for improving reasoning in language models (a minimal sketch of the GRPO advantage computation follows this list)
Sayanc93
A repository for RL-optimizing joint latent (Coconut) + CoT reasoning (GRPO/GSPO) in language models
garg-tejas
Applied Group Relative Policy Optimization (GRPO) to induce structured reasoning in Mistral-7B, achieving improved GSM8K math accuracy through RL.
BrotherAI
Official implementation of "TAPO: Dynamic Teacher and Perturbed Answer Injection for Policy Optimization", a fine-grained RL framework for reasoning alignment in LLMs.
Auenchanters
A headless 2D data center simulation environment designed to benchmark LLMs and RL agents on spatial reasoning, thermodynamics, and Power Usage Effectiveness (PUE) optimization (the PUE formula is sketched after this list)
A dual-process RL framework modeling human reasoning. Arbitrates between MCTS and heuristics to optimize cognitive effort in sparse-reward environments. Validated against human behavioral datasets.
AkshaySyal
End-to-end pipelines for post-training LLMs using Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Online RL (GRPO/PPO). Transforms base models into instruction-following, behavior-aligned assistants with improved reasoning and controllability (see the DPO loss sketch after this list).
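
Several entries above (flamehaven01's GRPO Reasoning Training, garg-tejas's Mistral-7B run, AkshaySyal's pipelines) center on Group Relative Policy Optimization. Below is a minimal sketch of the group-relative advantage computation at GRPO's core, assuming the standard formulation from the DeepSeekMath paper; the function name and scoring scheme are illustrative, and each repository's actual implementation may differ.

import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    # GRPO samples a group of completions per prompt and normalizes
    # each completion's reward against the group mean and std,
    # replacing a learned value critic.
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Toy group of four completions for one prompt, scored 1.0 when the
# final answer is correct and 0.0 otherwise.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # ~ [ 1. -1. -1.  1.]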
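
AkshaySyal's pipeline entry also lists Direct Preference Optimization. The sketch below computes the standard DPO loss for a single preference pair, following the original DPO formulation; the log-probabilities would come from the trained policy and a frozen reference model, and all names here are illustrative rather than taken from that repository.

import numpy as np

def dpo_loss(pi_logp_chosen, pi_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # DPO trains the policy to prefer the chosen response over the
    # rejected one, measured relative to a frozen reference model.
    margin = beta * ((pi_logp_chosen - ref_logp_chosen)
                     - (pi_logp_rejected - ref_logp_rejected))
    return np.log1p(np.exp(-margin))  # equals -log sigmoid(margin)

# Example pair: the policy already slightly favors the chosen response.
print(dpo_loss(-12.0, -15.0, -13.0, -14.0))  # ~0.598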
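
Auenchanters's data center environment benchmarks agents on PUE optimization. PUE itself is a standard metric (total facility power divided by IT equipment power), worked through below; the power values are made up for illustration.

def pue(total_facility_kw, it_equipment_kw):
    # Power Usage Effectiveness: total facility power / IT power.
    # 1.0 is the theoretical ideal; lower is better.
    return total_facility_kw / it_equipment_kw

# A facility drawing 1,500 kW overall with 1,200 kW of IT load:
print(pue(1500.0, 1200.0))  # -> 1.25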