Advanced Reinforcement Learning Interview Questions #21 - The Happy Path Trap
When reward design mirrors end-user success metrics, the policy converges to safe trajectories and systematically under-explores the brittle edges of the simulation.
You’re in a Machine Learning Engineer interview at OpenAI and the interviewer asks:
“We are building an RL agent to grade student-coded video games (like Breakout). How do you design the reward function to catch the most bugs?”
Most candidates smirk and say:
“Easy. Reward the agent for maximizing the score. If it can beat the game, the code works.”
Wrong. …
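To make the problem concrete, here is a minimal sketch of what the score-only answer amounts to. The `Frame` type, its `ball_out_of_bounds` flag, and the `score_reward` function are hypothetical names invented for this illustration; they are not taken from any real grading harness or from the interviewer's setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    """Hypothetical snapshot of a student's Breakout clone (illustrative only)."""
    score: int
    ball_out_of_bounds: bool  # assumed bug indicator, e.g. ball clipped through a wall


def score_reward(prev: Frame, curr: Frame) -> int:
    """The naive reward: the change in game score between two frames."""
    return curr.score - prev.score


prev = Frame(score=10, ball_out_of_bounds=False)
normal = Frame(score=11, ball_out_of_bounds=False)  # a brick was hit
buggy = Frame(score=10, ball_out_of_bounds=True)    # a bug fired, score unchanged

print(score_reward(prev, normal))  # 1
print(score_reward(prev, buggy))   # 0
```

Under this reward the transition that exposes a bug earns nothing, while ordinary play earns a point. An agent trained on it has no incentive to go looking for the broken states.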


