AI Interview Prep

LLM System Design Interview #49 - The Vocab Embedding Paradox

How a massive 128k vocabulary quietly breaks scaling math on proxy models, and why counting only non-embedding parameters is the key to safely authorizing a 100B+ run.

Hao Hoang
May 12, 2026

You’re in a Senior Pre-training Engineer interview at DeepMind. The interviewer sets a trap:

“You’ve trained a series of smaller proxy models to project scaling laws for your next 100B+ flagship LLM. However, your parameter-to-loss plot isn’t a straight line in log-log space; it bends noticeably at the low-parameter end. Assuming training was perfectly stable, what basic structural miscalculation is ruining your extrapolation curve?”

95% of candidates walk right into it.

Most candidates say: “The smallest models are underfitting due to a suboptimal learning rate schedule. We need to tune the cool-down phase or adjust the batch size to ensure the proxy models hit their true minimums.”

They just failed the interview.

Tuning the optimizer on a 50M-parameter model won’t straighten your curve. It’s an accounting error, not an optimization error.


𝐓𝐡𝐞 𝐑𝐞𝐚𝐥𝐢𝐭𝐲:

Scaling laws reliably govern the compute-bound operations of your network. What you want to track are the dense, structural layers that actually drive the power-law behavior: the attention weights and MLP blocks.

But vocabulary embeddings? They do not scale the same way.

In a massive 100B-parameter model, the embedding table is a statistical rounding error. But in a 50M-parameter proxy model with a modern 128k vocabulary, the embedding parameters artificially dominate your total network count.

If you plot “total parameters” on the x-axis, the low end gets massively bloated by what is essentially a giant lookup table. You aren’t mapping a failure of scaling laws; you are mapping a failure of arithmetic.
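A back-of-envelope check makes the accounting error concrete. The hidden sizes below (512 for the proxy, 8192 for the flagship) are illustrative assumptions, not dimensions from the article, but the imbalance holds for any realistic choice:

```python
# How much of each model is just the 128k-entry embedding lookup table?
# d_model values are hypothetical; an untied output head would double the count.

VOCAB = 128_000

def embed_fraction(d_model: int, total_params: int) -> float:
    """Fraction of the total parameter count taken up by the token embedding table."""
    embed_params = VOCAB * d_model
    return embed_params / total_params

# 50M-parameter proxy with d_model = 512:
# 128k * 512 = 65.5M embedding params -- larger than the entire "50M" budget.
print(embed_fraction(512, 50_000_000))          # ≈ 1.31

# 100B flagship with d_model = 8192:
# 128k * 8192 ≈ 1.05B embedding params -- roughly 1% of the network.
print(embed_fraction(8192, 100_000_000_000))    # ≈ 0.01
```

At the proxy scale the lookup table swamps the compute-bound layers; at the flagship scale it vanishes into the noise. That asymmetry is exactly what bends the low end of the curve.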


𝐓𝐡𝐞 𝐒𝐨𝐥𝐮𝐭𝐢𝐨𝐧:

To get a mathematically flawless, predictable log-log linear scaling curve, you must isolate the parameters that actually drive compute-bound learning.
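In practice that means subtracting the embedding table before fitting the power law. A minimal sketch of the fit is below; the proxy configurations and loss values are invented for illustration, and it assumes a tied output head (so the vocabulary parameters are subtracted once):

```python
# Fit the scaling law against NON-embedding parameters only.
# All model configs and losses here are made-up illustrative numbers.
import numpy as np

VOCAB = 128_000

def non_embedding_params(total_params: int, d_model: int) -> int:
    # Remove the vocab embedding table; tied input/output weights count once.
    return total_params - VOCAB * d_model

# (total_params, d_model, final_loss) for four hypothetical proxy runs
proxies = [
    (120_000_000,   512,  3.60),
    (350_000_000,   1024, 3.20),
    (900_000_000,   2048, 2.90),
    (2_500_000_000, 3072, 2.65),
]

n    = np.array([non_embedding_params(p, d) for p, d, _ in proxies], dtype=float)
loss = np.array([l for _, _, l in proxies])

# Power law  L = c * N^(-alpha)   =>   log L = log c - alpha * log N
slope, log_c = np.polyfit(np.log(n), np.log(loss), 1)
alpha = -slope

# Extrapolate to the 100B flagship (d_model = 8192, also illustrative)
n_flagship = non_embedding_params(100_000_000_000, 8192)
pred_loss = np.exp(log_c) * n_flagship ** (-alpha)
```

Swap the x-axis from `total_params` to `n` and the low-parameter bend disappears, because the proxies are no longer being credited for tens of millions of lookup-table entries that do no compute-bound learning.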

