AI Interview Prep

Advanced Reinforcement Learning Interview Questions #8 - The KL Regularization Trap

When candidates treat KL as friction instead of a safety tether, they approve training loops that Goodhart themselves into gibberish.

Hao Hoang
Feb 03, 2026

You're in a Senior AI Interview at Anthropic. The interviewer hands you a PPO training log and sets a trap:

"Our Reward scores are climbing, but the KL Divergence term is spiking. A junior engineer suggests setting the KL coefficient (Beta) to zero to unblock the model and maximize the reward faster. Do we approve the PR?"

90% of candidates walk right into the trap.

They see "maximize reward" and think, "Yes! Remove the brakes. Let the model learn."

If they say "Yes," the interview is over.

Here is why that single parameter change destroys a billion-dollar model.

The intuition is simple: we want high rewards. The KL penalty imposes a "cost" whenever the policy's outputs drift too far from the reference model. Therefore, KL is friction. Remove the friction (Beta = 0), and the model should converge on the optimal solution faster.

It feels like removing a speed limit on a highway.

They aren't removing a speed limit. They are removing the steering wheel.

๐˜™๐˜ฆ๐˜ช๐˜ฏ๐˜ง๐˜ฐ๐˜ณ๐˜ค๐˜ฆ๐˜ฎ๐˜ฆ๐˜ฏ๐˜ต ๐˜“๐˜ฆ๐˜ข๐˜ณ๐˜ฏ๐˜ช๐˜ฏ๐˜จ ๐˜ฐ๐˜ฏ ๐˜“๐˜“๐˜”๐˜ด is fragile because ๐˜™๐˜ฆ๐˜ธ๐˜ข๐˜ณ๐˜ฅ ๐˜”๐˜ฐ๐˜ฅ๐˜ฆ๐˜ญ๐˜ด (๐˜™๐˜”๐˜ด) are imperfect proxies. They are trained on finite human preferences.

If you remove the KL penalty, the model stops trying to be a helpful assistant and starts ๐‘๐ž๐ฐ๐š๐ซ๐ ๐‡๐š๐œ๐ค๐ข๐ง๐  (๐†๐จ๐จ๐๐ก๐š๐ซ๐ญโ€™๐ฌ ๐‹๐š๐ฐ).
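
To make the failure mode concrete, here is a minimal sketch on a toy problem with three candidate outputs. It relies on one standard result: the policy that maximizes E[reward] - Beta * KL(policy || reference) over a finite set is proportional to reference(y) * exp(reward(y) / Beta). Everything else is an assumption made up for illustration: the output labels, the reference probabilities, and the proxy reward scores (including the flaw that lets gibberish score highest) do not come from any real training log.

```python
import math

def kl_regularized_policy(ref_probs, rewards, beta):
    """Maximizer of E_pi[r] - beta * KL(pi || pi_ref) over a finite set of outputs.

    For beta > 0 the optimum is pi(y) proportional to pi_ref(y) * exp(r(y) / beta).
    For beta == 0 nothing ties the policy to the reference, so all probability
    mass moves to the output with the highest proxy reward.
    """
    if beta == 0:
        best = max(range(len(rewards)), key=lambda i: rewards[i])
        return [1.0 if i == best else 0.0 for i in range(len(rewards))]
    # Subtract the max reward before exponentiating for numerical stability.
    r_max = max(rewards)
    weights = [p * math.exp((r - r_max) / beta) for p, r in zip(ref_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]

def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p(y) == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical toy setup (all numbers invented for illustration).
outputs = ["helpful answer", "verbose answer", "gibberish exploit"]
ref_probs = [0.70, 0.29, 0.01]      # reference (pre-RL) policy
proxy_rewards = [1.0, 0.8, 3.0]     # flawed reward model overrates the exploit

unregularized = kl_regularized_policy(ref_probs, proxy_rewards, beta=0.0)
regularized = kl_regularized_policy(ref_probs, proxy_rewards, beta=1.0)

for name, policy in [("beta=0", unregularized), ("beta=1", regularized)]:
    kl = kl_divergence(policy, ref_probs)
    summary = ", ".join(f"{o}: {p:.3f}" for o, p in zip(outputs, policy))
    print(f"{name} | KL from reference = {kl:.3f} | {summary}")
```

In this toy setup, Beta = 0 puts all probability on the exploit, which is also the point of maximum KL from the reference. With Beta = 1, the helpful answer keeps most of the probability mass even though the proxy reward prefers the exploit. The spiking KL term in the interviewer's log is the same signal at scale: the policy is leaving the region where the reward model's scores can be trusted.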
