AI Interview Prep

LLM System Design Interview #45 - The FP32 Hidden Tax

Why your training script violently crashes an 80GB A100 the moment you take your first Adam step, and the invisible FP32 optimizer states you must shard to survive a Meta systems interview.

Hao Hoang
May 08, 2026

You’re in a Senior AI Engineer interview at Meta. The interviewer sets a trap:

“You load a 7-billion parameter model onto an 80GB A100 in BF16. You calculate the weights take up a mere 14 gigabytes. But the moment you initialize your Adam optimizer and take a single training step, the script violently crashes with an Out-Of-Memory (OOM) error. Down to the exact byte multipliers, what hidden variables just silently consumed the vast majority of your memory footprint?”

90% of candidates walk right into it.

Most candidates say: “It’s the activations. Forward passes generate massive activation maps that scale with your sequence length, so you need to implement gradient checkpointing to save VRAM.”

Wrong. They just failed.


𝐓𝐡𝐞 𝐑𝐞𝐚𝐥𝐢𝐭𝐲:

Activations definitely grow, but your script crashed the instant you initialized the optimizer and took a step, before sequence length even had a chance to scale out of control. Candidates forget that the model weights are just the tip of the iceberg. To train with AdamW without catastrophic numerical instability, you must maintain the optimizer state in FP32: a master copy of the weights (4 bytes per parameter), Adam’s first moment (4 bytes), and its second moment (4 bytes), stacked on top of your 2-byte BF16 weights and another 2 bytes of BF16 gradients. That is roughly 16 bytes per parameter, so your “14GB” model actually demands on the order of 112GB before a single activation is stored, instantly blowing past the physical limits of an 80GB GPU.
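A quick tally makes those multipliers concrete. The sketch below assumes the standard mixed-precision training recipe (BF16 weights and gradients, FP32 master weights and Adam states, as in the ZeRO paper’s accounting); exact costs vary by framework and config:

```python
# Back-of-the-envelope VRAM tally for training a 7B model with AdamW.
# Assumed per-parameter byte costs for the standard mixed-precision
# recipe -- not measured numbers from the original post.
PARAMS = 7e9  # 7 billion parameters

bytes_per_param = {
    "BF16 weights":        2,  # the 14 GB the candidate counted
    "BF16 gradients":      2,  # materialized by the backward pass
    "FP32 master weights": 4,  # full-precision copy for stable updates
    "FP32 Adam momentum":  4,  # exp_avg
    "FP32 Adam variance":  4,  # exp_avg_sq
}

total_gb = 0.0
for name, nbytes in bytes_per_param.items():
    gb = PARAMS * nbytes / 1e9
    total_gb += gb
    print(f"{name:20s} {gb:6.0f} GB")
print(f"{'total':20s} {total_gb:6.0f} GB  (vs. 80 GB on one A100)")
```

The 12 bytes of FP32 state per parameter are the hidden tax: they cost six times more than the BF16 weights the candidate actually counted.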


𝐓𝐡𝐞 𝐒𝐨𝐥𝐮𝐭𝐢𝐨𝐧:

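The subtitle gives the direction away: stop replicating those FP32 optimizer states on every GPU and shard them across your data-parallel workers, the core idea behind ZeRO stage 1. Here is a minimal sketch using PyTorch’s built-in ZeroRedundancyOptimizer; the build_model() helper and the torchrun launch are illustrative assumptions, not code from the post:

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes a torchrun launch (RANK/LOCAL_RANK/WORLD_SIZE env vars set)
# and a hypothetical build_model() that returns the 7B model in BF16.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = DDP(build_model().cuda(local_rank), device_ids=[local_rank])

# A plain torch.optim.AdamW would replicate its momentum/variance
# tensors on every rank. ZeroRedundancyOptimizer partitions them
# across the world instead: with N GPUs, each rank holds ~1/N of the
# optimizer state, shrinking the dominant term in the 112GB tally.
optimizer = ZeroRedundancyOptimizer(
    model.parameters(),
    optimizer_class=torch.optim.AdamW,
    lr=1e-4,
)
```

Frameworks like DeepSpeed ZeRO-2/3 and PyTorch FSDP push the same idea further, sharding the gradients and the parameters themselves as well.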
