What is RLHF (Reinforcement Learning from Human Feedback)?

RLHF is a training step that tunes a model toward what people actually prefer: humans rank model outputs, those rankings train a reward model, and the model is then optimized to score well against it. It's a big part of why chat models feel helpful instead of just fluent.

Also known as: reinforcement learning from human feedback

Jun 16, 2026 · Chain of Thought

A pretrained model predicts likely text, which isn’t the same as text people find helpful, honest, or safe. RLHF bridges that gap. Humans compare and rank model outputs; those comparisons train a reward model that predicts human preference; then the language model is optimized to maximize that reward. The result is a model aligned to what raters wanted, not just to raw next-token likelihood.
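
The reward-model step can be made concrete with a minimal sketch. It assumes a Bradley-Terry style pairwise loss, which is a common choice but not one the text above specifies, and it stands in a tiny linear scorer for what would be a neural network in practice. The feature names and the `PAIRS` data are hypothetical and exist only for illustration.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def reward(weights, features):
    """Linear reward model: a stand-in for a neural reward model."""
    return sum(w * f for w, f in zip(weights, features))

def pairwise_loss(weights, chosen, rejected):
    """Bradley-Terry loss: -log P(chosen is preferred over rejected)."""
    margin = reward(weights, chosen) - reward(weights, rejected)
    return -math.log(sigmoid(margin))

def train_reward_model(pairs, dim, lr=0.5, epochs=200):
    """Fit the weights with plain gradient descent on the pairwise loss."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # d(loss)/d(margin) = -(1 - sigmoid(margin))
            grad_scale = -(1.0 - sigmoid(margin))
            for i in range(dim):
                weights[i] -= lr * grad_scale * (chosen[i] - rejected[i])
    return weights

# Hypothetical features per response: [answers_the_question, is_rude].
# Each pair is (response the rater preferred, response the rater rejected).
PAIRS = [
    ([1.0, 0.0], [0.0, 0.0]),  # answering beats not answering
    ([1.0, 0.0], [1.0, 1.0]),  # polite answer beats rude answer
    ([0.0, 0.0], [0.0, 1.0]),  # polite non-answer beats rude non-answer
]

WEIGHTS = train_reward_model(PAIRS, dim=2)
```

After training, the model assigns a positive weight to answering and a negative weight to rudeness, so it ranks unseen responses the way the raters did. In full RLHF the language model is then optimized against this learned score, typically with a policy-gradient method.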

RLHF is a large part of why modern chat models follow instructions and refuse obviously bad requests. It also has limits worth knowing: the model inherits the biases and blind spots of the raters and the reward model, and optimizing too hard against the reward can produce sycophantic or gamed behavior, often called reward hacking. Related techniques, such as preference optimization methods like DPO and AI-generated feedback (RLAIF), are variations on the same idea: train on preferences, not just text.
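
One widely used guard against over-optimizing the reward, not described in the text above, is to subtract a penalty for drifting away from the original (reference) model. The sketch below assumes a per-response KL estimate built from log-probabilities; the function name, the `beta` value, and the example numbers are all hypothetical.

```python
def shaped_reward(reward_score, logp_policy, logp_reference, beta=0.1):
    """Reward model score minus a penalty for drifting from the reference model.

    logp_policy and logp_reference are the log-probabilities that the tuned
    model and the original model assign to the same response.
    """
    kl_estimate = logp_policy - logp_reference  # single-sample KL estimate
    return reward_score - beta * kl_estimate

# A response that scores well but is text the reference model finds very unlikely.
gamed = shaped_reward(reward_score=2.0, logp_policy=-5.0, logp_reference=-40.0)

# A slightly lower-scoring response that stays close to the reference model.
grounded = shaped_reward(reward_score=1.5, logp_policy=-10.0, logp_reference=-12.0)
```

With these numbers the high-scoring but far-drifted response ends up with a lower shaped reward than the grounded one, which is the intended effect: the policy gains less from exploiting quirks of the reward model. A larger `beta` keeps the model closer to its starting behavior at the cost of slower preference gains.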
