LLM Agents Interview Questions #22 - The Verifiable Reward Bypass Trap
Using a neural reward model for strict constraints is a category error: replace it with deterministic evaluation, or guarantee systematic reward hacking.
You’re in a Senior AI Engineer interview at OpenAI. The interviewer sets a trap:
“You’re fine-tuning an LLM for instruction following (IFEval) using PPO. By step 400, your reward curve is steadily climbing, but your actual evaluation scores are tanking. How do you fix the reward pipeline without just training a massive 70B reward model?”
Most candidates say they would aggressively crank up the KL divergence penalty to keep the policy close to the reference model. Or worse, they suggest spending $50k on new human preference data to train a more robust neural reward model.
That feels right, but in production, it’s an expensive dead end.
You aren’t actually optimizing for instruction following; you are optimizing for the reward model’s hidden biases.
A neural reward model is essentially guessing, spitting out an arbitrary scalar like 10.5 based on prose style rather than checking if the model actually obeyed a strict constraint like “output exactly 3 JSON blocks.”
The policy model quickly learns to hack this aesthetic preference, triggering a classic case of Goodhart’s Law.
The reality is, you shouldn’t be using a neural reward model for this at all. Welcome to The Verifiable Reward Bypass.
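
To make the contrast concrete, here is a minimal sketch of a verifiable reward for that "exactly 3 JSON blocks" constraint. The helper names and the regex-based block extraction are illustrative assumptions, not IFEval's actual checker:

```python
import json
import re

def count_json_blocks(text: str) -> int:
    """Count fenced code blocks in the response that parse as valid JSON."""
    blocks = re.findall(r"```(?:json)?\s*(.*?)```", text, flags=re.DOTALL)
    valid = 0
    for block in blocks:
        try:
            json.loads(block)
            valid += 1
        except json.JSONDecodeError:
            continue
    return valid

def verifiable_reward(response: str, required_blocks: int = 3) -> float:
    # Deterministic: 1.0 iff the strict constraint is satisfied, 0.0 otherwise.
    # No neural scorer means no stylistic signal for the policy to hack.
    return 1.0 if count_json_blocks(response) == required_blocks else 0.0
```

The only way the policy raises this reward is by actually emitting three valid JSON blocks, so Goodhart-style drift toward prose style has nothing to latch onto.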