AI Interview Prep

LLM System Design Interview #46 - The ZeRO-1 Bandwidth Illusion

Why assuming optimizer sharding adds network overhead is a fatal interview trap, and how decomposing an All-Reduce guarantees mathematically identical communication cost while slashing VRAM.

Hao Hoang
May 09, 2026

You’re in a Senior ML Systems Engineer interview at OpenAI. The scenario: your cluster is running standard Data Parallelism, and Adam optimizer states are causing a massive VRAM bottleneck. You suggest sharding the optimizer state across GPUs using ZeRO Stage 1. The interviewer sets a trap:

“Doesn’t that cause a massive network bottleneck from constantly transmitting state updates?”

95% of candidates walk right into it.

Most candidates say: “Yes, it adds communication overhead, but we can hide the latency with larger batch sizes or by overlapping computation with communication using custom CUDA streams.”

Wrong. They just failed. That is a patch, not a mathematical solution.


𝐓𝐡𝐞 𝐑𝐞𝐚𝐥𝐢𝐭𝐲:

In standard Data Parallelism (DDP), your Adam optimizer states (FP32 master weights, momentum, and variance) eat up 12-16 bytes per parameter on every single GPU. That is where your VRAM goes to die.
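To see where that number comes from: with mixed-precision Adam, every rank holds FP16 parameters and gradients (2 bytes each) plus FP32 master weights, momentum, and variance (4 bytes each), i.e. 12 bytes of optimizer state per parameter. A back-of-the-envelope sketch (the 7B-parameter model and 64-GPU cluster are illustrative numbers, not from the post):

```python
def per_gpu_bytes(num_params: float, world_size: int, zero1: bool) -> float:
    """Estimate per-GPU training memory for mixed-precision Adam.

    Assumes FP16 params + grads (2 B each) replicated on every rank and
    12 B/param of optimizer state (FP32 master weights, momentum,
    variance). ZeRO-1 shards only the optimizer state.
    """
    replicated = num_params * (2 + 2)      # FP16 params + grads, every rank
    opt_state = num_params * (4 + 4 + 4)   # master weights + m + v
    if zero1:
        opt_state /= world_size            # each rank keeps 1/N of it
    return replicated + opt_state

N_PARAMS = 7e9  # hypothetical 7B-parameter model
WORLD = 64      # hypothetical 64-GPU cluster

for zero1 in (False, True):
    gib = per_gpu_bytes(N_PARAMS, WORLD, zero1) / 2**30
    print(f"ZeRO-1={zero1}: {gib:.1f} GiB per GPU")
# ZeRO-1=False: 104.3 GiB per GPU  (blows past an 80 GiB H100)
# ZeRO-1=True:  27.3 GiB per GPU   (the 12 B/param split 64 ways)
```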

The assumption is that sharding this state across devices means adding extra network steps to constantly sync the updates over your InfiniBand links.

But distributed training isn’t about blindly throwing bandwidth at a problem. If we understand the physics of collective communication operations, we realize the bandwidth penalty is a complete illusion: a Ring All-Reduce is already a Reduce-Scatter followed by an All-Gather, and ZeRO-1 simply makes that decomposition explicit.
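Count the bytes. For M bytes of gradients across N ranks, a Ring All-Reduce moves 2M(N-1)/N bytes per rank, while a Reduce-Scatter and an All-Gather each move M(N-1)/N. A minimal sanity check (the gradient size and rank count are illustrative):

```python
def ring_all_reduce_bytes(M: float, N: int) -> float:
    # Ring All-Reduce: N-1 reduce-scatter steps + N-1 all-gather steps,
    # each moving a 1/N chunk of the M-byte buffer per rank.
    return 2 * M * (N - 1) / N

def reduce_scatter_bytes(M: float, N: int) -> float:
    return M * (N - 1) / N  # N-1 steps, one 1/N chunk each

def all_gather_bytes(M: float, N: int) -> float:
    return M * (N - 1) / N  # same pattern as Reduce-Scatter, reversed

M = 14e9  # e.g. FP16 gradients of a 7B-parameter model, in bytes
N = 64    # ranks

ddp = ring_all_reduce_bytes(M, N)
zero1 = reduce_scatter_bytes(M, N) + all_gather_bytes(M, N)
assert ddp == zero1  # identical wire traffic, to the byte
print(f"per-rank traffic: {ddp / 2**30:.1f} GiB either way")
```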


𝐓𝐡𝐞 𝐒𝐨𝐥𝐮𝐭𝐢𝐨𝐧:

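A Ring All-Reduce is not an atomic primitive: under the hood it is exactly a Reduce-Scatter followed by an All-Gather. ZeRO-1 keeps those two halves and slips the optimizer step between them: reduce-scatter the gradients so each rank owns the reduced slice it needs, update only that shard with Adam, then all-gather the updated parameters. Same bytes on the wire, 1/N the optimizer state. Below is a minimal PyTorch-style sketch of that pattern; the flat-tensor layout, argument names, and skipped bias correction are simplifications for illustration, not the article’s actual code:

```python
import torch
import torch.distributed as dist

def zero1_step(params, grads, m, v, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Illustrative ZeRO-1 update (Adam bias correction omitted).

    params, grads: flat 1-D tensors, replicated on every rank.
    m, v:          Adam moments for THIS rank's shard only (1/N the size).
    """
    world = dist.get_world_size()
    rank = dist.get_rank()
    shard = grads.numel() // world  # assume numel divides world_size

    # 1) Reduce-Scatter: each rank receives the summed gradients for its
    #    own shard only. Per-rank wire cost: M * (N - 1) / N bytes.
    grad_shard = torch.empty(shard, dtype=grads.dtype, device=grads.device)
    dist.reduce_scatter_tensor(grad_shard, grads, op=dist.ReduceOp.SUM)
    grad_shard /= world  # average, as DDP's All-Reduce effectively does

    # 2) Local Adam update on the shard. Momentum, variance, and master
    #    state exist only for this 1/N slice -- the entire VRAM saving.
    param_shard = params[rank * shard : (rank + 1) * shard]
    m.mul_(beta1).add_(grad_shard, alpha=1 - beta1)
    v.mul_(beta2).addcmul_(grad_shard, grad_shard, value=1 - beta2)
    param_shard.addcdiv_(m, v.sqrt().add_(eps), value=-lr)

    # 3) All-Gather: rebuild the full updated parameter vector on every
    #    rank from the N shards. Per-rank wire cost: M * (N - 1) / N bytes.
    dist.all_gather_into_tensor(params, param_shard)

# Steps (1) + (3) move M(N-1)/N + M(N-1)/N = 2M(N-1)/N bytes per rank:
# exactly what DDP's single Ring All-Reduce of the gradients costs.
```

That is the answer that defuses the trap: ZeRO-1 does not add communication on top of DDP’s All-Reduce; it splits the All-Reduce into its two constituent halves and performs the optimizer step in between, so the total traffic is mathematically identical while the optimizer state shrinks by the world size.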