Advanced Reinforcement Learning Interview Questions #6 - The Initialization Gap Trap
A policy isn't done when it succeeds at its task; it's done when its final state is compatible with whatever comes next.
You’re in a final-round interview for a Senior AI Engineer role at NVIDIA Robotics.
The VP of Engineering draws a simple diagram on the whiteboard and sets the trap:
“We trained Policy A (Boil Water) to 99% accuracy. We trained Policy B (Find Pasta) to 99% accuracy. Both work perfectly in isolation. But when we run them in sequence (A → B), the robot fails immediately. Why?”
90% of candidates walk right into the trap.
Most engineers immediately treat this like a software engineering bug. They say:
- “There’s a latency issue in the handover.”
- “The goal string passed from the planner is malformed.”
- “The two policies are competing for GPU resources.”
They assume that if Function_A() works and Function_B() works, then Function_A() + Function_B() must work.
But this isn’t deterministic code. This is probabilistic Deep Learning.
They assume the “End State” of Policy A is identical to the “Start State” of Policy B. It never is.
Policy B (Find Pasta) was likely trained using “Clean Resets”, starting the robot in a perfect, neutral position in front of the pantry.
But when Policy A (Boil Water) finishes, the robot is likely in a “messy” state: arm slightly extended, torso tilted 5 degrees left, holding a hot pot. Policy B has literally never seen this state in its training distribution. It panics and fails.
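Here’s a minimal sketch of that mismatch with toy numbers. The 4-D state vector, the feature names, and the 3-sigma threshold are all illustrative assumptions (not any real robot’s state space); the only point is to measure how far A’s handoff states sit from the resets B was trained on.

```python
import numpy as np

# Toy state: [arm_extension_m, torso_tilt_deg, gripper_load_kg, base_yaw_deg].
# All names and numbers below are illustrative assumptions for the sketch.
rng = np.random.default_rng(0)

# "Clean resets" Policy B (Find Pasta) trained from: tight, neutral poses.
b_training_resets = rng.normal(loc=[0.0, 0.0, 0.0, 0.0],
                               scale=[0.02, 0.5, 1e-3, 1.0],
                               size=(10_000, 4))

# Terminal states Policy A (Boil Water) actually hands over:
# arm extended, torso tilted ~5 degrees, holding a hot pot.
a_terminal_states = rng.normal(loc=[0.35, -5.0, 1.2, 8.0],
                               scale=[0.05, 1.0, 0.1, 2.0],
                               size=(100, 4))

# How far does each handoff state sit from B's training reset distribution?
mu = b_training_resets.mean(axis=0)
sigma = b_training_resets.std(axis=0) + 1e-8
z = np.abs((a_terminal_states - mu) / sigma)
out_of_dist = (z.max(axis=1) > 3.0).mean()
print(f"{out_of_dist:.0%} of A's terminal states are >3 sigma outside B's resets")
```

With numbers like these, essentially every handoff state is far outside anything Policy B has ever seen at t=0. That is the Initialization Gap in one print statement.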
-----
The Solution: We need to identify this immediately as a specific type of Distribution Shift called 𝐓𝐡𝐞 𝐈𝐧𝐢𝐭𝐢𝐚𝐥𝐢𝐳𝐚𝐭𝐢𝐨𝐧 𝐆𝐚𝐩.
In hierarchical systems, modularity is an illusion. To fix this in production, you cannot train modules in isolation. You have two options (sketched in code after this list):
- Chained Inference Training: Train Policy B specifically on the noisy terminal states of Policy A, not just clean resets.
- Overlap Buffering: Ensure the “Success Condition” for Policy A is stricter than the “Entry Condition” for Policy B, creating a safety buffer of state coverage.
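A rough sketch of both fixes. The environment and policy interface (`env.reset_to`, `policy.act`), the state features, and the thresholds are hypothetical placeholders, not any particular framework’s API; the shape of the solution is what matters.

```python
import random

# --- Fix 1: Chained Inference Training -------------------------------------
# Collect the states Policy A actually terminates in, then start Policy B's
# training episodes from them (mixed with clean resets).

def collect_terminal_states(env, policy_a, episodes=1_000):
    """Roll out Policy A and record where it leaves the robot.
    Assumes a hypothetical env where step() returns (state, done)."""
    terminals = []
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            state, done = env.step(policy_a.act(state))
        terminals.append(state)
    return terminals

def chained_reset(env, terminal_states, p_chained=0.7):
    """Start B's training episode from A's messy handoff state most of the time,
    but keep some clean resets so B doesn't forget them."""
    if terminal_states and random.random() < p_chained:
        return env.reset_to(random.choice(terminal_states))  # assumed env method
    return env.reset()

# --- Fix 2: Overlap Buffering ----------------------------------------------
# A's success condition is a strict subset of B's entry condition, so every
# state A declares "done" is one B has been trained to accept.

POLICY_A_SUCCESS = {"arm_extension_max": 0.05, "torso_tilt_deg_max": 1.0}  # stricter
POLICY_B_ENTRY   = {"arm_extension_max": 0.15, "torso_tilt_deg_max": 5.0}  # looser

def a_succeeded(state):
    return (state["arm_extension"] <= POLICY_A_SUCCESS["arm_extension_max"]
            and abs(state["torso_tilt_deg"]) <= POLICY_A_SUCCESS["torso_tilt_deg_max"])

def b_can_start(state):
    return (state["arm_extension"] <= POLICY_B_ENTRY["arm_extension_max"]
            and abs(state["torso_tilt_deg"]) <= POLICY_B_ENTRY["torso_tilt_deg_max"])

# Invariant worth testing in CI: a_succeeded(s) implies b_can_start(s).
handoff = {"arm_extension": 0.04, "torso_tilt_deg": 0.8}  # a state A calls "success"
assert a_succeeded(handoff) and b_can_start(handoff)      # the buffer holds
```

The buffer in Fix 2 is the cheap insurance: even if you never retrain B on chained rollouts, A is not allowed to declare success in a state B wasn’t built to accept.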
𝐓𝐡𝐞 𝐀𝐧𝐬𝐰𝐞𝐫 𝐓𝐡𝐚𝐭 𝐆𝐞𝐭𝐬 𝐘𝐨𝐮 𝐇𝐢𝐫𝐞𝐝:
“A modular policy is only as good as its handoff. I don’t optimize for task success in isolation; I optimize for Terminal State Compatibility, ensuring the output distribution of the first policy fits the input distribution of the second.”
#ReinforcementLearning #RoboticsAI #HierarchicalRL #ProductionML #DistributionShift #PolicyLearning #AIEngineering #LLMSystems
If my work helps you learn faster, build smarter, and level up in AI - consider supporting my journey and staying connected 💪
☕️Buy me a coffee: https://buymeacoffee.com/haohoang
🔗LinkedIn: https://www.linkedin.com/in/hoang-van-hao/
🔗Facebook: https://www.facebook.com/haohoangaie/
🔗X: https://x.com/HaoHoangAI
💰PayPal Support: https://paypal.me/HaoHoang2808


📚 Related Papers:
- Compositional Transfer in Hierarchical Reinforcement Learning. Available at: https://arxiv.org/abs/1906.11228
- Addressing Distribution Shift in Robotic Imitation Learning. Available at: https://starslab.ca/wp-content/papercite-data/pdf/2025_ablett_addressing.pdf
- How to Mitigate the Distribution Shift Problem in Robotics Control: A Robust and Adaptive Approach Based on Offline to Online Imitation Learning. Available at: https://openreview.net/forum?id=FI2mcfPoOc