Advanced Reinforcement Learning Interview Questions #8 - The KL Regularization Trap
When candidates treat the KL penalty as friction instead of a safety tether, they approve training loops that Goodhart themselves into gibberish.
You're in a Senior AI Interview at Anthropic. The interviewer hands you a PPO training log and sets a trap:
"Our Reward scores are climbing, but the KL Divergence term is spiking. A junior engineer suggests setting the KL coefficient (Beta) to zero to unblock the model and maximize the reward faster. Do we approve the PR?"
90% of candidates walk right into the trap.
They see "maximize reward" and think, "Yes! Remove the brakes. Let the model learn."
If they say "Yes," the interview is over.
Here is why that single parameter change destroys a billion-dollar model.
The intuition is simple: We want high rewards. The KL penalty creates a "cost" for letting the policy's output distribution drift too far from the reference model. Therefore, KL is friction. Remove the friction (Beta = 0), and the model should converge on the optimal solution faster.
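To make concrete what Beta actually controls, here is a minimal sketch of the per-token reward PPO optimizes in a standard RLHF setup (the function and variable names are illustrative, not from any particular library):

```python
import torch

def shaped_rewards(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Per-token reward for PPO in a standard RLHF setup.

    rm_score:         scalar reward-model score for the full completion
    policy_logprobs:  log pi_theta(token | context) for each generated token
    ref_logprobs:     the same tokens scored by the frozen reference model
    beta:             the KL coefficient in question
    """
    # Per-token KL estimate: positive wherever the policy has
    # drifted away from the reference distribution.
    kl = policy_logprobs - ref_logprobs

    # The KL penalty applies at every token; the RM score is
    # typically added only at the final token of the completion.
    rewards = -beta * kl
    rewards[-1] += rm_score
    return rewards

# With beta = 0 the KL term vanishes entirely: the RM score becomes
# the whole objective, and nothing anchors the policy to the
# distribution the RM was trained to judge.
```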
It feels like removing a speed limit on a highway.
They aren't removing a speed limit. They are removing the steering wheel.
Reinforcement Learning on LLMs is fragile because Reward Models (RMs) are imperfect proxies. They are trained on finite human preferences.
If you remove the KL penalty, the model stops trying to be a helpful assistant and starts Reward Hacking (Goodhart's Law): it drifts into out-of-distribution text that the RM scores highly but humans would read as gibberish.
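To see the trap quantitatively, here is a toy simulation of Goodhart's Law (every number and curve shape here is invented for illustration; `drift` stands in for how far the policy has moved from the reference):

```python
import numpy as np

# Toy Goodhart setup: the proxy (RM score) keeps rising as the policy
# drifts from the reference, but the true reward peaks and then
# collapses, because the RM only saw preferences near the reference
# distribution.
drift = np.linspace(0, 5, 501)
proxy_reward = 1.0 + 0.8 * drift              # what the RM reports
true_reward = 1.0 + drift - 0.5 * drift**2    # what humans would actually rate

for beta in (0.4, 0.0):
    # PPO maximizes proxy reward minus the KL tether.
    objective = proxy_reward - beta * drift**2
    i = int(np.argmax(objective))
    print(f"beta={beta}: settles at drift={drift[i]:.1f}, "
          f"true reward there = {true_reward[i]:.2f}")
```

With the tether in place, the optimizer stops at drift 1.0, where the proxy and the true reward still agree. With Beta at zero, it chases the proxy to the edge of the grid and the true reward goes sharply negative: reward climbing, model degrading.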