Found 19 repositories (showing 19)
anthropics
Human preference data for "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback"
Performed supervised fine-tuning (SFT) on Llama 3.1 8B using HH-RLHF, and ranked 10K responses with Llama 3.1 70B to build a safety-optimized dataset
Direct Preference Optimization (DPO) for large language diffusion models (LLaDA-8B), using a Monte Carlo ELBO-based preference loss, LoRA adapters, and 8-bit quantization for efficient single-GPU training. Achieves improved alignment, with a +4% win rate over the baseline on Anthropic HH-RLHF preference data
sp4s-s
No description available
sionic-ai
No description available
cosmic-heart
Mistral 7b - SFT on Alpaca + PEFT + DPO on HH-RLHF.
No description available
This project fine-tunes GPT-2 using Direct Preference Optimization (DPO) on preference pairs from the Anthropic HH-RLHF dataset, improving response quality without explicit reward functions. Training uses GPU acceleration and evaluates model performance via loss and accuracy.
yocim1285754508-dotcom
No description available
Deansinon
HH-RLHF dataset training environment with slime framework
dineshram0212
No description available
aryantiwariji007
This repository is for fine-tuning Mistral 7B on Anthropic's HH-RLHF dataset
Align dialogue models using SFT, ILQL, and PPO on the Anthropic HH-RLHF dataset with trlX
SIBAM890
An OpenEnv RL environment for RLHF preference simulation: train agents to judge LLM responses using gold-standard labels from HH-RLHF, UltraFeedback, and Stanford SHP.
point516
Alignment-tuning the dolly-v2-3b model via the Direct Preference Optimization (DPO) method on Anthropic's hh-rlhf dataset with cloud GPUs.
classyCommits
End-to-end LLM response evaluation pipeline with multi-judge scoring, inter-judge agreement analysis, and Streamlit dashboard — built on Anthropic HH-RLHF
btisler-DS
Quantify how large language models drift into humanistic / politeness-driven behavior over time, using public datasets and derived, text-free features. Measures H-Drift, FEATS affect dimensions, and Omega interrogative geometry across HH-RLHF, WebGPT, CA-1, and more.
Abdullah-Taha9
DPO training on GPT-2. It uses 5,000 samples from the HH-RLHF dataset. The goal is to find the smallest subset that improves model safety (trade-off performance vs. subset size). The project compares an SFT model and a DPO model using a refusal-rate metric.
This project is a PyTorch implementation of the Direct Preference Optimization (DPO) algorithm, a state-of-the-art technique for fine-tuning Large Language Models (LLMs) with human preferences. The base model used is gpt2, and it is fine-tuned on the "Helpful and Harmless" (HH-RLHF) dataset.
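Several entries above train GPT-2 or Mistral with DPO on HH-RLHF preference pairs. For orientation, here is a minimal sketch of the DPO loss in PyTorch (Rafailov et al., 2023); the function name and the `beta=0.1` default are illustrative, not taken from any of the listed repositories:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss from summed per-token log-probs of the chosen and
    rejected responses under the policy and a frozen reference model."""
    # Log-ratio of chosen vs. rejected under each model
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    # Implicit reward margin, scaled by beta; minimized via -log sigmoid
    logits = beta * (pi_logratios - ref_logratios)
    return -F.logsigmoid(logits).mean()
```

The loss shrinks as the policy prefers the chosen response more strongly than the reference does, which is how these projects improve response quality without an explicit reward model.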