
Unlock: Reinforcement Learning from Human Feedback

The RLHF pipeline as math, not folklore: Bradley-Terry preference modeling and its failure modes (sycophancy, intransitivity), KL-shielded PPO with reward-overoptimization mitigation, the DPO implicit-reward identity and its likelihood-displacement failure, online vs. offline preference learning, Nash-LHF, and the LIMA challenge to whether you need RL at all.
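Two of the formulas named above are simple enough to sketch directly. The Bradley-Terry model scores the probability that one response is preferred over another as a sigmoid of a reward difference, and the DPO loss is the Bradley-Terry negative log-likelihood with the implicit reward r(x, y) = β · log(π_θ(y|x) / π_ref(y|x)). A minimal toy sketch (all function names and the scalar log-probability inputs are illustrative, not any library's API):

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def bradley_terry(r_w: float, r_l: float) -> float:
    """P(y_w preferred over y_l) under a Bradley-Terry model
    with scalar rewards r_w, r_l."""
    return sigmoid(r_w - r_l)


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair: Bradley-Terry NLL where the
    implicit reward is beta * (log pi_theta(y|x) - log pi_ref(y|x)).
    The intractable log-partition beta * log Z(x) cancels in the
    reward difference, which is what makes DPO tractable."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(sigmoid(margin))
```

With equal rewards the Bradley-Terry preference probability is exactly 0.5, and raising the chosen response's log-probability relative to the reference lowers the DPO loss, which is the gradient direction DPO training exploits.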
