Advanced Reinforcement Learning Interview Questions #19 - The Small-Batch Policy Gradient Trap

When N is small and rewards are strictly positive, the gradient pushes up both failures and successes, turning learning into a variance-dominated coin flip.

Feb 14, 2026


You’re in a Senior RL interview at OpenAI. The interviewer sets a trap:

“We collected 6 robot trajectories. 5 failed (low reward). 1 succeeded (high reward). We run a vanilla Policy Gradient update on this small batch. What happens to the gradient?”

90% of candidates walk right into the trap.

They say: “The gradient will point towards the successful trajec…
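To make the setup concrete, here is a minimal sketch of the per-trajectory coefficients in the vanilla policy gradient estimator, g = (1/N) Σ R_i ∇log π(τ_i). The reward values (1.0 for each failure, 10.0 for the success) and the helper name `reinforce_weights` are assumptions chosen for illustration, not numbers from the interview question. The mean-reward baseline in the second call is a standard technique added here only for contrast; the excerpt above does not mention it.

```python
# Hypothetical rewards for the 6 trajectories: 5 failures (low but still positive)
# and 1 success. These values are illustrative assumptions.
rewards = [1.0, 1.0, 1.0, 1.0, 1.0, 10.0]


def reinforce_weights(rewards, baseline=0.0):
    """Coefficient multiplying grad log pi(tau_i) for each trajectory in
    g = (1/N) * sum_i (R_i - baseline) * grad log pi(tau_i)."""
    n = len(rewards)
    return [(r - baseline) / n for r in rewards]


# Vanilla estimator: no baseline, so every coefficient has the sign of its reward.
vanilla = reinforce_weights(rewards)

# Contrast case (an assumption, not from the excerpt): subtract the batch mean reward.
mean_reward = sum(rewards) / len(rewards)
centered = reinforce_weights(rewards, baseline=mean_reward)

print("vanilla :", [round(w, 3) for w in vanilla])
print("centered:", [round(w, 3) for w in centered])
```

With strictly positive rewards, every vanilla coefficient is positive, so each sampled trajectory's term raises its own log-probability, failures included; the success only gets a larger push. Once the mean is subtracted, the five failure coefficients turn negative and only the success is reinforced.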

Continue reading this post for free, courtesy of Hao Hoang.


AI Interview Prep
