
Reinforcement Learning from Human Feedback: Deep Dive

3 questions · difficulty 5-7 · 2 intermediate, 1 advanced
RLHF typically trains a reward model on pairwise human preferences before policy optimization. Why preferences rather than absolute ratings?
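To make the question concrete, here is a minimal sketch of the Bradley-Terry style pairwise loss commonly used to train RLHF reward models. The class and variable names (RewardModel, chosen/rejected) are illustrative assumptions, not tied to any particular library; the point is that the loss depends only on the *difference* between the two rewards, so annotators never have to produce calibrated absolute scores.

```python
# Sketch: pairwise (Bradley-Terry) preference loss for a toy reward model.
# All names here are hypothetical and chosen for illustration only.
import torch
import torch.nn as nn


class RewardModel(nn.Module):
    """Toy reward model: embeds a response and maps it to a scalar reward."""

    def __init__(self, vocab_size: int = 1000, hidden: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.head = nn.Linear(hidden, 1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Mean-pool token embeddings, then project to a scalar reward.
        return self.head(self.embed(token_ids).mean(dim=1)).squeeze(-1)


def pairwise_preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry: maximize log sigmoid(r_chosen - r_rejected).
    # Only the reward *difference* matters, which is why rankings suffice
    # and absolute rating scales are unnecessary.
    return -nn.functional.logsigmoid(r_chosen - r_rejected).mean()


if __name__ == "__main__":
    model = RewardModel()
    chosen = torch.randint(0, 1000, (4, 16))    # token ids of preferred responses
    rejected = torch.randint(0, 1000, (4, 16))  # token ids of dispreferred responses
    loss = pairwise_preference_loss(model(chosen), model(rejected))
    loss.backward()
    print(f"pairwise loss: {loss.item():.4f}")
```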