LLM System Design Interview #4 - The Gradient Highway

Why post-norm Transformers break at scale - and how one architectural swap enables stable training for 100B-parameter models.

Nov 05, 2025

You’re in an ML Engineer interview at Google, and the interviewer asks:

“Your team is struggling with training instability and exploding gradients in a new 100B+ model. The original ‘Attention Is All You Need’ paper used post-norm with learning rate warm-up. Why is that a bad idea for deep models, and what’s the one simple architectural change that solves…

Continue reading this post for free, courtesy of Hao Hoang.

AI Interview Prep
