Computer Vision Interview Questions #12 - The Large Batch Generalization Trap
Why linear learning-rate scaling silently kills SGD’s implicit regularization and destroys test accuracy.
You’re in a Senior AI Engineer interview at Google DeepMind and the interviewer asks:
“We just scaled our infrastructure to 4x our batch size (256 to 1024) to speed up training. We followed the 𝘓𝘪𝘯𝘦𝘢𝘳 𝘚𝘤𝘢𝘭𝘪𝘯𝘨 𝘙𝘶𝘭𝘦 and multiplied our 𝘓𝘦𝘢𝘳𝘯𝘪𝘯𝘨 𝘙𝘢𝘵𝘦 by 4. But our test accuracy still degraded. What fundamental property of SGD did we accidentally kill?”
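For reference, the rule itself is mechanical. Here is a minimal PyTorch sketch of what the team did (the base learning rate of 0.1, the stand-in model, and the momentum value are illustrative assumptions, not details from the question):

```python
import torch.nn as nn
import torch.optim as optim

# Linear Scaling Rule: multiply the learning rate by the same factor
# as the batch size (Goyal et al., "Accurate, Large Minibatch SGD").
BASE_BATCH_SIZE, BASE_LR = 256, 0.1  # assumed baseline configuration
new_batch_size = 1024

scale = new_batch_size / BASE_BATCH_SIZE  # 4x
scaled_lr = BASE_LR * scale               # 0.1 -> 0.4

model = nn.Linear(512, 10)  # stand-in for the real network
optimizer = optim.SGD(model.parameters(), lr=scaled_lr, momentum=0.9)
```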
Most candidates say: “We probably should have used the 𝘚𝘲𝘶𝘢𝘳𝘦 𝘙𝘰𝘰𝘵 𝘴𝘤𝘢𝘭𝘪𝘯𝘨 𝘳𝘶𝘭𝘦 instead.”
or
“We didn’t tune the 𝘓𝘦𝘢𝘳𝘯𝘪𝘯𝘨 𝘙𝘢𝘵𝘦 enough.”
This is the trap. Both answers assume the math is wrong, when the real culprit is the dynamics.
The reality is that they didn’t just change the speed of training; they changed the 𝐈𝐦𝐩𝐥𝐢𝐜𝐢𝐭 𝐑𝐞𝐠𝐮𝐥𝐚𝐫𝐢𝐳𝐚𝐭𝐢𝐨𝐧.
When they increase the batch size, they reduce 𝘎𝘳𝘢𝘥𝘪𝘦𝘯𝘵 𝘕𝘰𝘪𝘴𝘦. And in Deep Learning, noise is often a feature, not a bug.
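You can measure that noise directly: the variance of the minibatch gradient estimate shrinks roughly as 1/B, so quadrupling the batch quarters the noise in every step. A toy sketch (the synthetic regression problem is an assumption, purely for illustration):

```python
import torch

torch.manual_seed(0)

# Toy setup: repeatedly estimate the minibatch gradient of a linear
# regression loss and measure its variance at two batch sizes.
N, D = 100_000, 32
X = torch.randn(N, D)
w_true = torch.randn(D)
y = X @ w_true + 0.5 * torch.randn(N)

w = torch.zeros(D, requires_grad=True)  # fixed point in weight space

def minibatch_grad(batch_size: int) -> torch.Tensor:
    idx = torch.randint(0, N, (batch_size,))
    loss = ((X[idx] @ w - y[idx]) ** 2).mean()
    (g,) = torch.autograd.grad(loss, w)
    return g

for B in (256, 1024):
    grads = torch.stack([minibatch_grad(B) for _ in range(200)])
    # Total variance of the gradient estimate across resampled minibatches.
    print(B, grads.var(dim=0).sum().item())
```

The B=1024 estimate comes out with roughly a quarter of the variance of the B=256 one: every step is cleaner, and cleaner is exactly the problem.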
Here is the trade-off: large batches buy you throughput by averaging away gradient noise, but that noise was doing real work. It was nudging SGD out of sharp minima and toward flat ones that generalize. Multiply the learning rate all you want; past a point, you cannot buy that exploration back.