AI Interview Prep

LLM System Design Interview #4 - The Gradient Highway

Why post-norm Transformers break at scale - and how one architectural swap enables stable training for 100B-parameter models.

Hao Hoang
Nov 05, 2025

You’re in an ML Engineer interview at Google and the interviewer asks:

“Your team is struggling with training instability and exploding gradients in a new 100B+ model. The original ‘Attention Is All You Need’ paper used post-norm with learning rate warm-up. Why is that a bad idea for deep models, and what’s the one simple architectural change that solves…

© 2026 Hao Hoang