Reinforcement Learning from Human Feedback: Deep Dive
3 questions · Intermediate · Difficulty 5-7 · 2 intermediate, 1 advanced · Adapts to your performance

Question 1 of 3 · intermediate (5/10) · compare
RLHF typically trains a reward model on pairwise human preferences before policy optimization. Why preferences rather than absolute ratings?
A. Pairwise preferences generate more training data because each pair has two observations
B. Pairwise preferences are more consistent across annotators than absolute scales, which suffer from scale drift and calibration differences
C. Preferences are mathematically equivalent to a cross-entropy loss, which is the only loss that works for reward modeling
D. Absolute ratings require continuous output from the model, while preferences work with discrete categories only
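Background for this question: below is a minimal sketch of the pairwise (Bradley-Terry) reward-model loss that RLHF pipelines typically use before policy optimization. The reward model `rm` and the tokenized inputs are hypothetical placeholders standing in for whatever model and data pipeline you have; this illustrates the shape of the loss, not any particular library's API.

```python
import torch
import torch.nn.functional as F

def preference_loss(rm, chosen_ids, rejected_ids):
    """Negative log-likelihood of "chosen beats rejected" under Bradley-Terry.

    The model posits P(chosen > rejected) = sigmoid(r_chosen - r_rejected),
    so the per-pair loss is -log sigmoid(r_chosen - r_rejected). Note that
    only the score *difference* enters the loss, so each annotator's absolute
    scale calibration cancels out.

    `rm` is assumed to map a batch of token-id sequences to one scalar
    reward per sequence.
    """
    r_chosen = rm(chosen_ids)      # shape: (batch,), scalar reward per sequence
    r_rejected = rm(rejected_ids)  # shape: (batch,)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

In use, this loss would be minimized in an ordinary training loop over (chosen, rejected) pairs collected from human comparisons, after which the frozen reward model scores rollouts during policy optimization.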