Found 302 repositories (showing 30)
modelscope
Use PEFT or full-parameter training for CPT/SFT/DPO/GRPO on 600+ LLMs (Qwen3.5, DeepSeek-R1, GLM-5, InternLM3, Llama4, ...) and 300+ MLLMs (Qwen3-VL, Qwen3-Omni, InternVL3.5, Ovis2.5, GLM4.5v, Llava, Phi4, ...) (AAAI 2025).
lsdefine
A very simple GRPO implementation for reproducing r1-like LLM thinking.
ARahim3
Fine-tune LLMs on your Mac with Apple Silicon. SFT, DPO, GRPO, Vision, TTS, STT, Embedding, and OCR fine-tuning — natively on MLX. Unsloth-compatible API.
datawhalechina
🎓 A systematic course on building large language models | 🛠️ Covers pre-training data engineering, tokenizers, Transformers, MoE, GPU programming (CUDA/Triton), distributed training, scaling laws, inference optimization, and alignment (SFT/RLHF/GRPO) | 🚀 6 progressive assignments, code-driven, building full-stack understanding of LLMs
Doriandarko
A pure MLX-based training pipeline for fine-tuning LLMs using GRPO on Apple Silicon.
TYH-labs
Zero-friction LLM fine-tuning skill for Claude Code, Gemini CLI & any ACP agent. Unsloth on NVIDIA · TRL+MPS/MLX on Apple Silicon. Automates env setup, LoRA training (SFT, DPO, GRPO, vision), post-hoc GRPO log diagnostics, evaluation, and export end-to-end. Part of the Gaslamp AI platform.
zht8506
Implement popular LLM post-training algorithms (SFT, DFT, DPO, GRPO, etc.) in PyTorch with easy code!
Oxen-AI
This repository contains code for fine-tuning LLMs with GRPO, specifically for Rust programming, using cargo as the feedback signal.
waltonfuture
[NeurIPS 2025] Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO
Ruijian-Zha
🚀 A New DAPO Algorithm for Stock Trading (arXiv:2505.06408) Implementation of our IEEE IDS 2025 accepted algorithm combining Dynamic Sampling Policy Optimization (DAPO), Group Relative Policy Optimization (GRPO), and LLM-driven risk/sentiment signals for efficient and profitable stock trading on the NASDAQ-100 index.
mkurman
Fine-tunes a student LLM using teacher feedback for improved reasoning and answer quality. Implements GRPO with teacher-provided evaluations.
OpenMOSS
Official implementation of BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning. BandPO replaces canonical clipping (PPO/GRPO) with dynamic bounds to resolve exploration bottlenecks and prevent entropy collapse.
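For context on the canonical clipping that BandPO replaces, here is a minimal sketch of the PPO/GRPO per-token clipped surrogate in plain Python (illustrative values only; `eps` is the fixed clip range that BandPO's dynamic, probability-aware bounds would substitute):

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Canonical PPO/GRPO per-token objective: take the pessimistic
    (minimum) of the unclipped and clipped policy-ratio terms."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)

# ratio = pi_new(token) / pi_old(token); advantage from the reward signal.
obj = clipped_surrogate(ratio=1.5, advantage=1.0)  # ratio clipped to 1.2
```

With a positive advantage, the objective caps how far the ratio can push the update (here at 1.2); with a negative advantage, the min picks the more pessimistic clipped term. A fixed `eps` is exactly the bottleneck BandPO's dynamic bounds aim to relax.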
This repository contains a pipeline for fine-tuning Large Language Models (LLMs) for Text-to-SQL conversion using Group Relative Policy Optimization (GRPO).
zz1358m
Code for the SofT-GRPO algorithm on the LLM soft-thinking reasoning pattern.
axolotl-ai-cloud
A fast, local, and secure approach for training LLMs for coding tasks using GRPO with WebAssembly and interpreter feedback.
yaochenzhu
(ICLR'26 + Netflix) Rank-GRPO: Training LLM-based Conversational Recommender Systems with Reinforcement Learning
Infatoshi
Train an LLM to play Mafia via GRPO
iBacklight
PipelineLLM is a systematic LLM post-training learning project, covering the full stack from supervised fine-tuning (SFT) through preference optimization (DPO) and reinforcement learning (RLHF/PPO/GRPO) to continual learning.
EsmaeilNarimissa
No description available
JIA-Lab-research
Scaf-GRPO: Scaffolded Group Relative Policy Optimization for Enhancing LLM Reasoning
Bader-CN
Notes on the complete basic LLM training pipeline (Tokenizer -> PreTraining -> SFT -> DPO -> GRPO)
A tutorial on fine-tuning LLMs with the GRPO algorithm
Minami-su
This repository, deepspeed-grpo-qlora-vllm, provides a complete framework for fine-tuning LLMs using Group Relative Policy Optimization (GRPO) on 4-bit quantized models (QLoRA). It uses DeepSpeed ZeRO-3 for scalable training and integrates with a vLLM server to dynamically serve the fine-tuned LoRA adapters.
Azzedde
Intelligent web discovery agent with LLM-powered planning, multi-source search, smart deduplication, and GRPO preference dataset collection. Autonomously searches, analyzes, and summarizes web content while building training data for model fine-tuning.
rraghavkaushik
A curated collection of NLP and LLM resources. Covers essential papers and blogs on Transformers, Reinforcement Learning (RLHF, DPO, GRPO), Mechanistic Interpretability, Scaling Laws, and MLSys.
SuienS
A toolkit for fine-tuning Large Language Models (LLMs) to generate Manim animation code using Supervised Fine-Tuning (SFT) and Visually Grounded Reinforcement Learning using Group Relative Policy Optimisation (GRPO/GSPO) techniques.
Thrillcrazyer
"Thinking is Process." Leverage Process Mining Technique for LLM Reinforcement Learning. Official Repository of "Reasoning-Aware GRPO using Process Mining"
kechirojp
GRPO (Group Relative Policy Optimization) implementation for Stable Baselines3. Drop-in PPO replacement with instant action comparison. Easy pip install, full API compatibility. GRPO is the algorithm DeepSeek used for LLM training.
RFT with GRPO: RFT adapts LLMs to complex reasoning tasks such as math and coding via RL, letting models develop their own strategies rather than mimicking examples as in SFT. GRPO, an RL algorithm tailored to this setting, excels on tasks with verifiable outcomes and works well with small datasets.
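The group-relative idea behind GRPO can be sketched in a few lines of plain Python (reward values and group size below are illustrative): for each prompt, sample a group of completions, score each with a verifiable reward (e.g. 1.0 if the final answer checks out), and normalize rewards within the group to get per-completion advantages, with no learned value network.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each reward against its own group.

    rewards: scalar rewards for G sampled completions of one prompt.
    Returns one advantage per completion; the group mean serves as the
    baseline instead of a critic/value model.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 4 completions of one math prompt; reward 1.0 when the final
# answer verified as correct, else 0.0 (made-up values).
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct completions get positive advantages and incorrect ones negative, purely from the within-group comparison, which is why the method pairs naturally with verifiable-outcome rewards and small datasets.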
XiaomingX
🚀 Project mission: bridging the gap between algorithm theory and engineering practice. This project is a full-stack deep learning and reinforcement learning algorithm lab designed for Chinese-speaking developers. Through modern PyTorch reimplementations of cutting-edge algorithms such as GPT-2, RLHF, MuZero, and alignment methods (GRPO, Weak-to-Strong), it aims to provide a "what you see is what you get" baseline for learning and research. Core differentiators: Full-stack rewrite: a clean break from unmaintained TensorFlow 1.x / JAX legacy code, fully embracing the PyTorch 2.x ecosystem. Closed loop from theory to practice: every line of core logic carries detailed Chinese comments mapped directly to the equations in the papers. Alignment-forward: early integration of key LLM alignment algorithms such as GRPO (DeepSeek) and Weak-to-Strong (OpenAI).