LLM System Design Interview #4 - The Gradient Highway
Why post-norm Transformers break at scale - and how one architectural swap enables stable training for 100B-parameter models.
You’re in a ML Engineer interview at Google and the interviewer asks:
“Your team is struggling with training instability and exploding gradients in a new 100B+ model. The original ‘Attention Is All You Need’ paper used post-norm with learning rate warm-up. Why is that a bad idea for deep models, and what’s the one simple architectural change that solves…


