Reinforcement Learning from Human Feedback

Train a reward model on human preferences, then optimise against it: the step that made GPT-style models conversational and able to refuse.

The brief

Pretrained large language models can do many things. They can also do many things you would rather they didn't: produce confidently wrong answers, follow harmful instructions, drift into incoherent monologues, refuse to admit uncertainty, plagiarise training data verbatim. The base model is raw capability; turning it into something useful, honest, and safe enough to ship requires a second training stage absent from classical ML textbooks. Reinforcement Learning from Human Feedback (RLHF), formalised by Christiano et al. in 2017 and made famous by OpenAI's InstructGPT (2022) and ChatGPT (November 2022), is the technique that made GPT-style models conversational, helpful, and willing to refuse. It is a small piece of machinery on top of a large one, yet it determines almost every behavioural property of the deployed system.

RLHF trains a language model to optimise a learned reward signal derived from human preference rankings, in three stages.

Stage 1, supervised fine-tuning (SFT): human contractors write demonstrations of the desired behaviour, and the pretrained base model is fine-tuned on them via standard supervised learning.

Stage 2, reward modelling: the SFT model generates multiple candidate responses per prompt, human raters rank them pairwise, and a separate reward model is trained to predict which response the raters prefer.

Stage 3, RL fine-tuning: the SFT model is optimised against the reward model using Proximal Policy Optimization (PPO) with a KL penalty against the SFT model. The penalty is essential: without it, the policy exploits the reward model and produces nonsensical outputs that nonetheless score high.

What RLHF actually changes is refusal behaviour, conversational style, willingness to admit uncertainty, formatting, sycophancy, and helpfulness; capability ceilings are largely set by pretraining. Its known failure modes are characteristic: reward hacking, sycophancy, overrefusal, length bias, and mode collapse.

Alternatives have emerged. Direct Preference Optimization (DPO; Rafailov et al. 2023) eliminates the separate reward model and trains directly on preference data. Constitutional AI (Anthropic 2022) replaces some human labels with AI critique against a written constitution (RLAIF). Code-generation models can use execution feedback, replacing the learned reward model with automatic correctness checks. Process reward models score individual reasoning steps rather than only final outcomes.
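To make the objectives above concrete, here is a minimal sketch in plain Python of three scalar quantities: a pairwise reward-model loss for stage 2, the KL-penalised reward for stage 3, and the DPO loss. It rests on assumptions the text does not spell out. The reward model is assumed to use the common Bradley-Terry form, the KL term is estimated per sample from sequence log-probabilities, and all function names and the default `beta = 0.1` are illustrative rather than taken from any particular library. A real pipeline would compute these over batches of tensors.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Stage 2 (assumed Bradley-Terry form): -log sigmoid(r_chosen - r_rejected).

    The inputs are the reward model's scalar scores for the response the
    rater preferred and the one the rater rejected.
    """
    return -log_sigmoid(r_chosen - r_rejected)


def kl_penalised_reward(rm_score: float, logp_policy: float,
                        logp_sft: float, beta: float = 0.1) -> float:
    """Stage 3: reward-model score minus a per-sample KL estimate.

    logp_policy and logp_sft are the summed token log-probabilities of the
    same sampled response under the current policy and the frozen SFT model.
    beta (hypothetical default) sets how strongly drift is punished.
    """
    return rm_score - beta * (logp_policy - logp_sft)


def dpo_loss(pol_chosen: float, pol_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair, with no separate reward model.

    All arguments are sequence log-probabilities; the reference model
    plays the role the SFT model plays in the KL penalty above.
    """
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


if __name__ == "__main__":
    # Reward model already ranks the pair correctly, so the loss is small.
    print(reward_model_loss(r_chosen=2.0, r_rejected=0.5))
    # The policy has drifted from the SFT model, so the reward is reduced.
    print(kl_penalised_reward(rm_score=1.0, logp_policy=-5.0, logp_sft=-7.0))
    # The policy favours the chosen response more than the reference does.
    print(dpo_loss(-2.0, -6.0, -3.0, -4.0))
```

When the two reward scores are equal, the pairwise loss is log 2 (about 0.693), the value for a model that cannot tell the responses apart. The DPO loss takes the same value when the policy is identical to the reference. Setting `beta` to zero in `kl_penalised_reward` removes the penalty entirely, which is the regime in which the reward-hacking failure described above appears.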

Why now

Every commercial general-purpose LLM (Claude, ChatGPT, Gemini, Llama-Instruct, Grok, Mistral, DeepSeek-Chat, Qwen) is post-trained with some variant of RLHF or DPO. RLHF data labelling has become a substantial industry, with Surge AI, Scale AI, Invisible Technologies, and Outlier employing thousands of contractors, while Constitutional AI and RLAIF reduce labelling cost at the price of AI judges encoding their own biases. Reasoning models (OpenAI o1/o3, DeepSeek-R1, Claude with extended thinking) introduce test-time chain-of-thought scaling and outcome-based rewards on long reasoning trajectories, a substantial shift in the RL pipeline. Whether a human-labelled preference signal is fundamentally sufficient for capabilities that humans cannot themselves evaluate is the scalable oversight problem.