Advanced Reinforcement Learning Interview Questions #3 - The Covariate Shift Trap
Imitation Learning collapses when the policy drifts off expert trajectories, while online policy gradients deliberately train on their own mistakes.
You’re in a Machine Learning Engineer interview at OpenAI and the lead researcher asks:
“We have a massive dataset of human expert demonstrations for this task. Why shouldn’t we just stick with Imitation Learning (Behavior Cloning)? Why take on the instability of Online Policy Gradients?”
Don’t say: “Because Reinforcement Learning is just better...” or “Because the model might overfit the training data.”
This is too vague and ignores the fundamental mathematical difference.
The reality is that Imitation Learning (IL) has a hard ceiling. It treats the expert as the “ground truth,” meaning your model can, at best, only be as good as the human providing the data.
To explain why we switch to Online Policy Gradients (PG), you need to hit three specific points:
1️⃣ The Performance Ceiling:
IL maximizes the likelihood of expert actions. It’s strictly limited by human capability.
Online PG maximizes the expected sum of rewards. This allows the agent to discover superhuman strategies that the expert never considered.
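To make the contrast concrete, here is a minimal sketch of the two objectives side by side (PyTorch, discrete actions). The network shape, batch contents, and returns are toy placeholders, not a real training loop.

```python
import torch
import torch.nn as nn

# Toy policy network: state -> action logits (sizes are illustrative only).
policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))

# --- Imitation Learning (Behavior Cloning) objective ---
# Maximize the likelihood of the expert's actions on the expert's states.
expert_states = torch.randn(64, 4)            # placeholder batch of expert states
expert_actions = torch.randint(0, 2, (64,))   # placeholder expert action labels
bc_loss = nn.CrossEntropyLoss()(policy(expert_states), expert_actions)

# --- Online Policy Gradient (REINFORCE) objective ---
# Maximize expected return on states the agent itself visits.
agent_states = torch.randn(64, 4)             # placeholder states from the agent's own rollouts
dist = torch.distributions.Categorical(logits=policy(agent_states))
agent_actions = dist.sample()                 # the agent's own actions, not the expert's
returns = torch.randn(64)                     # placeholder rewards-to-go from the environment
pg_loss = -(dist.log_prob(agent_actions) * returns).mean()

print(bc_loss.item(), pg_loss.item())
```

The asymmetry is the whole argument: the BC loss needs expert labels but never sees a reward, so it can only pull the policy toward the expert; the PG loss needs no labels at all, so nothing caps it at human-level performance.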
2️⃣ The “Covariate Shift” Trap:
Expert data usually contains only “correct” paths. If your IL model deviates slightly (which it will), it lands in a state it has never seen before and fails catastrophically.
Because it never saw the expert make a mistake, it doesn’t know how to recover. Online RL forces the agent to experience failures and learn recovery policies.
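The first paper in Related Papers (Ross et al., arXiv:1011.0686) formalizes this trap: under pure Behavior Cloning, errors compound over the horizon because the learner is evaluated on its own state distribution but trained only on the expert’s. Its DAgger remedy makes the mechanism concrete: let the current learner drive, then have the expert label the states the learner actually reaches. Here is a minimal sketch, assuming a gym-style env with reset/step, an expert_action oracle, and a generic fit routine; all three names are placeholders, not a specific library API.

```python
import numpy as np

def dagger(env, expert_action, fit, n_iters=5, horizon=200):
    """Minimal DAgger-style loop: aggregate expert labels on the states
    the learner itself visits, then retrain on the growing dataset."""
    states, actions = [], []
    policy = None
    for _ in range(n_iters):
        state, _ = env.reset()
        for _ in range(horizon):
            # Iteration 0 rolls out the expert; afterwards the learner drives,
            # so the dataset covers the learner's (drifted) state distribution.
            act = expert_action(state) if policy is None else policy(state)
            # The expert labels every visited state, including the "mistake"
            # states that pure Behavior Cloning would never see.
            states.append(state)
            actions.append(expert_action(state))
            state, _, terminated, truncated, _ = env.step(act)
            if terminated or truncated:
                break
        # fit() is any supervised learner that returns a callable policy.
        policy = fit(np.array(states), np.array(actions))
    return policy
```

Either way, you need data from the learner’s own state distribution. DAgger buys it by repeatedly querying the expert; online policy gradients get it for free, because every update already uses the agent’s own rollouts and the rewards (including failures) they produce.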
3️⃣ Optimization vs. Mimicry:
Imitation is trying to paint a picture by tracing over someone else’s lines.
Policy Gradients are learning to paint by seeing what actually sells at the gallery.
The Answer That Gets You Hired: “Imitation Learning minimizes the distance to the expert’s behavior. Policy Gradients maximize the distance to failure. If you want to outperform the human, you must stop mimicking them and start optimizing the reward directly.”
#ReinforcementLearning #MachineLearning #ArtificialIntelligence #DeepLearning #TechInterviews #DataScience #OpenAI
If my work helps you learn faster, build smarter, and level up in AI - consider supporting my journey and staying connected.
Buy me a coffee: https://buymeacoffee.com/haohoang
LinkedIn: https://www.linkedin.com/in/hoang-van-hao/
PayPal Support: https://paypal.me/HaoHoang2808


Related Papers:
- A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. Available at: https://arxiv.org/abs/1011.0686
- End-to-End Driving via Conditional Imitation Learning. Available at: https://arxiv.org/abs/1710.02410
- Stable-BC: Controlling Covariate Shift with Stable Behavior Cloning. Available at: https://arxiv.org/abs/2408.06246
- DARIL: When Imitation Learning outperforms Reinforcement Learning in Surgical Action Planning. Available at: https://arxiv.org/abs/2507.05011