Advanced Reinforcement Learning Interview Questions #8 - The KL Regularization Trap
When candidates treat the KL penalty as friction instead of a safety tether, they approve training loops that Goodhart themselves into gibberish.
You're in a Senior AI Interview at Anthropic. The interviewer hands you a PPO training log and sets a trap:
"Our Reward scores are climbing, but the KL Divergence term is spiking. A junior engineer suggests setting the KL coefficient (Beta) to zero to unblock the model and maximize the reward faster. Do we approve the PR?"
90% of candidates walk right into the trap.
They see "maximize reward" and think, "Yes! Remove the brakes. Let the model learn."
If they say "Yes," the interview is over.
Here is why that single parameter change destroys a billion-dollar model.
The intuition is simple: We want high rewards. The KL penalty creates a "cost" for letting the policy's output distribution drift too far from the reference model. Therefore, KL is friction. Remove the friction (Beta = 0), and the model should converge on the optimal solution faster.
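To make concrete what Beta actually controls, here is a minimal sketch of the per-token reward PPO optimizes in a standard RLHF setup (the function and variable names are illustrative, not from any particular library):

```python
import torch

def shaped_rewards(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Per-token reward for PPO in a standard RLHF setup.

    rm_score:         scalar reward-model score for the full completion
    policy_logprobs:  log pi_theta(token | context) for each generated token
    ref_logprobs:     the same tokens scored by the frozen reference model
    beta:             the KL coefficient in question
    """
    # Per-token KL estimate: positive wherever the policy has
    # drifted away from the reference distribution.
    kl = policy_logprobs - ref_logprobs

    # The KL penalty applies at every token; the RM score is
    # typically added only at the final token of the completion.
    rewards = -beta * kl
    rewards[-1] += rm_score
    return rewards

# With beta = 0 the KL term vanishes entirely: the RM score becomes
# the whole objective, and nothing anchors the policy to the
# distribution the RM was trained to judge.
```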
It feels like removing a speed limit on a highway.
They aren't removing a speed limit. They are removing the steering wheel.
Reinforcement Learning on LLMs is fragile because Reward Models (RMs) are imperfect proxies. They are trained on finite human preferences.
If you remove the KL penalty, the model stops trying to be a helpful assistant and starts Reward Hacking (Goodhart's Law): it drifts into out-of-distribution text that the RM scores highly but humans would read as gibberish.
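To see the trap quantitatively, here is a toy simulation of Goodhart's Law (every number and curve shape here is invented for illustration; `drift` stands in for how far the policy has moved from the reference):

```python
import numpy as np

# Toy Goodhart setup: the proxy (RM score) keeps rising as the policy
# drifts from the reference, but the true reward peaks and then
# collapses, because the RM only saw preferences near the reference
# distribution.
drift = np.linspace(0, 5, 501)
proxy_reward = 1.0 + 0.8 * drift              # what the RM reports
true_reward = 1.0 + drift - 0.5 * drift**2    # what humans would actually rate

for beta in (0.4, 0.0):
    # PPO maximizes proxy reward minus the KL tether.
    objective = proxy_reward - beta * drift**2
    i = int(np.argmax(objective))
    print(f"beta={beta}: settles at drift={drift[i]:.1f}, "
          f"true reward there = {true_reward[i]:.2f}")
```

With the tether in place, the optimizer stops at drift 1.0, where the proxy and the true reward still agree. With Beta at zero, it chases the proxy to the edge of the grid and the true reward goes sharply negative: reward climbing, model degrading.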