Advanced Reinforcement Learning Interview Questions #19 - The Small-Batch Policy Gradient Trap
When the batch size N is small and rewards are strictly positive, the vanilla policy gradient raises the probability of failed and successful trajectories alike, turning learning into a variance-dominated coin flip.
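To make that concrete, here is a minimal sketch of the weights a vanilla policy gradient update puts on each sampled trajectory. The batch is six trajectories, five failures and one success, and the reward values (1.0 and 10.0) are made-up assumptions. The only property that matters is that every reward is above zero.

```python
# Hypothetical returns for 6 trajectories: 5 failures, then 1 success.
# The specific values are assumptions; all that matters is that each is > 0.
returns = [1.0, 1.0, 1.0, 1.0, 1.0, 10.0]
n = len(returns)

# Vanilla policy gradient estimate:
#   g = (1/N) * sum_i R_i * grad log pi(tau_i)
# so the coefficient on each trajectory's grad-log-prob term is R_i / N.
weights = [r / n for r in returns]

# Every coefficient is positive, so the update pushes up the
# log-probability of every sampled trajectory, failures included.
all_pushed_up = all(w > 0 for w in weights)

# Fraction of the total weight that lands on the five failed trajectories.
failure_share = sum(weights[:5]) / sum(weights)

print("per-trajectory weights:", weights)
print("all pushed up:", all_pushed_up)
print("share of weight on failures:", round(failure_share, 3))
```

With these assumed numbers the success gets the largest single weight, but the failures are still reinforced, and together they carry about a third of the total weight. Which failures happen to land in a batch this small is down to sampling luck.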
You’re in a Senior RL interview at OpenAI. The interviewer sets a trap:
“We collected 6 robot trajectories. 5 failed (low reward). 1 succeeded (high reward). We run a vanilla Policy Gradient update on this small batch. What happens to the gradient?”
90% of candidates walk right into the trap.
They say: “The gradient will point towards the successful trajec…


