AI Interview Prep

Advanced Reinforcement Learning Interview Questions #19 - The Small-Batch Policy Gradient Trap

When the batch size N is small and rewards are strictly positive, the gradient raises the probability of failed and successful trajectories alike, turning learning into a variance-dominated coin flip.

Hao Hoang
Feb 14, 2026

You’re in a Senior RL interview at OpenAI. The interviewer sets a trap:

“We collected 6 robot trajectories. 5 failed (low reward). 1 succeeded (high reward). We run a vanilla Policy Gradient update on this small batch. What happens to the gradient?”
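To make the setup concrete, here is a minimal sketch of the weight each trajectory's log-probability gradient receives in a vanilla policy gradient update on such a batch. The reward values (1.0 for each failure, 10.0 for the success) are assumptions chosen for illustration, since the question only says "low" and "high". The mean-reward baseline shown for comparison is a standard variance-reduction technique and is not part of the question itself.

```python
# Hypothetical returns: five failures (low reward) and one success (high reward).
# These numbers are illustrative assumptions, not values from the question.
returns = [1.0, 1.0, 1.0, 1.0, 1.0, 10.0]
n = len(returns)

# Vanilla policy gradient estimate:
#   g = (1/N) * sum_i R_i * grad log pi(tau_i)
# so each trajectory's score function is scaled by R_i / N.
raw_weights = [r / n for r in returns]

# With a mean-reward baseline b, the scale becomes (R_i - b) / N.
baseline = sum(returns) / n
centered_weights = [(r - baseline) / n for r in returns]

print("raw weights:     ", raw_weights)
print("centered weights:", centered_weights)
```

Under these assumed rewards every raw weight is positive, so the update increases the log-probability of all six sampled trajectories, failures included. Only the relative size of the weights favors the success. After subtracting the baseline, the five failures receive negative weights and the success keeps a positive one.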

90% of candidates walk right into the trap.

They say: “The gradient will point towards the successful trajec…
