Found 19 repositories (showing 19)
anthropics
Human preference data for "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback"
Performed supervised fine-tuning (SFT) on Llama 3.1 8B using HH-RLHF, and ranked 10K responses with Llama 3.1 70B to build a safety-optimized dataset
Direct Preference Optimization (DPO) for large language diffusion models (LLaDA-8B), using a Monte Carlo ELBO-based preference loss, LoRA adapters, and 8-bit quantization for efficient single-GPU training. Achieves improved alignment, with a +4% win rate over the baseline on Anthropic HH-RLHF preference data
sp4s-s
No description available
sionic-ai
No description available
cosmic-heart
Mistral 7b - SFT on Alpaca + PEFT + DPO on HH-RLHF.
No description available
This project fine-tunes GPT-2 using Direct Preference Optimization (DPO) on preference pairs from the Anthropic HH-RLHF dataset, improving response quality without explicit reward functions. Training uses GPU acceleration and evaluates model performance via loss and accuracy.
yocim1285754508-dotcom
No description available
Deansinon
HH-RLHF dataset training environment with slime framework
dineshram0212
No description available
aryantiwariji007
This repository is for fine-tuning Mistral 7B on Anthropic's HH-RLHF dataset
Align dialogue models using SFT, ILQL, and PPO on the Anthropic HH-RLHF dataset with trlX
SIBAM890
An OpenEnv RL environment for RLHF preference simulation: train agents to judge LLM responses using gold-standard labels from HH-RLHF, UltraFeedback, and Stanford SHP.
point516
Alignment-tuning the dolly-v2-3b model via the Direct Preference Optimization (DPO) method on Anthropic's hh-rlhf dataset with cloud GPUs.
classyCommits
End-to-end LLM response evaluation pipeline with multi-judge scoring, inter-judge agreement analysis, and Streamlit dashboard — built on Anthropic HH-RLHF
btisler-DS
Quantify how large language models drift into humanistic / politeness-driven behavior over time, using public datasets and derived, text-free features. Measures H-Drift, FEATS affect dimensions, and Omega interrogative geometry across HH-RLHF, WebGPT, CA-1, and more.
Abdullah-Taha9
DPO training on GPT-2. It uses 5,000 samples from the HH-RLHF dataset. The goal is to find the smallest subset that improves model safety (trade-off performance vs. subset size). The project compares an SFT model and a DPO model using a refusal-rate metric.
This project is a PyTorch implementation of the Direct Preference Optimization (DPO) algorithm, a state-of-the-art technique for fine-tuning Large Language Models (LLMs) with human preferences. The base model used is gpt2, and it is fine-tuned on the "Helpful and Harmless" (HH-RLHF) dataset.
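Several entries above train GPT-2 or Mistral with DPO on HH-RLHF preference pairs. For orientation, here is a minimal sketch of the DPO loss in PyTorch (Rafailov et al., 2023); the function name and the `beta=0.1` default are illustrative, not taken from any of the listed repositories:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss from summed per-token log-probs of the chosen and
    rejected responses under the policy and a frozen reference model."""
    # Log-ratio of chosen vs. rejected under each model
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    # Implicit reward margin, scaled by beta; minimized via -log sigmoid
    logits = beta * (pi_logratios - ref_logratios)
    return -F.logsigmoid(logits).mean()
```

The loss shrinks as the policy prefers the chosen response more strongly than the reference does, which is how these projects improve response quality without an explicit reward model.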