LLM System Design Interview #49 - The Vocab Embedding Paradox
How a massive 128k vocabulary quietly breaks scaling math on proxy models, and why counting only non-embedding parameters is the key to safely authorizing a 100B+ run.
You’re in a Senior Pre-training Engineer interview at DeepMind. The interviewer sets a trap:
“You’ve trained a series of smaller proxy models to project scaling laws for your next 100B+ flagship LLM. However, your parameter-to-loss plot isn’t a straight line in log-log space; it’s bending noticeably at the low-parameter end. Assuming training was perfectly stable, what basic structural miscalculation is ruining your extrapolation curve?”
95% of candidates walk right into it.
Most candidates say: “The smallest models are underfitting due to a suboptimal learning rate schedule. We need to tune the cool-down phase or adjust the batch size to ensure the proxy models hit their true minimums.”
They just failed the interview.
Tuning the optimizer on a 50M-parameter model won’t straighten your curve. It’s an accounting error, not an optimization error.
𝐓𝐡𝐞 𝐑𝐞𝐚𝐥𝐢𝐭𝐲:
Scaling laws reliably govern the compute-bound operations of your network. What we are really tracking are the dense, structural layers whose loss actually follows the power law: your attention weights and MLP blocks.
But vocabulary embeddings? They do not scale the same way.
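This is baked into the canonical result. Kaplan et al. (2020) fit their parameter power law with N explicitly defined as non-embedding parameters, roughly:

L(N) ≈ (N_c / N)^α, with α ≈ 0.076

Fold the embedding table into N and the fit visibly degrades, which is exactly what your bent proxy curve is showing you.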
In a massive 100B-parameter model, the embedding table is a rounding error. But in a 50M-parameter proxy model with a modern 128k vocabulary, the embedding parameters artificially dominate your total parameter count.
If you plot “total parameters” on the x-axis, the low end gets massively bloated by what is essentially a giant lookup table. You aren’t mapping a failure of scaling laws; you are mapping a failure of arithmetic.
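Here’s a back-of-the-envelope sketch of the distortion. The d_model values are illustrative assumptions for each size class, not measurements of any particular model family:

```python
# Back-of-the-envelope: what fraction of the *total* parameter count
# is just the token-embedding lookup table at each scale?
# d_model values are illustrative assumptions for each size class.
VOCAB = 128_000  # modern tokenizer, per the setup above

def embedding_params(d_model: int) -> int:
    # Input embedding table only. A tied LM head reuses these weights;
    # an untied output head would double the count.
    return VOCAB * d_model

for name, total, d_model in [
    ("50M proxy",      50e6,    256),
    ("100B flagship", 100e9, 12_288),
]:
    emb = embedding_params(d_model)
    print(f"{name}: {emb/1e6:,.0f}M embedding params = {emb/total:.1%} of total")

# 50M proxy: 33M embedding params = 65.5% of total
# 100B flagship: 1,573M embedding params = 1.6% of total
```

Two-thirds of the proxy’s “size” is a lookup table doing essentially constant work per token. At flagship scale, that same table is noise.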
𝐓𝐡𝐞 𝐒𝐨𝐥𝐮𝐭𝐢𝐨𝐧:
To get a clean, predictable log-log linear scaling curve, you must isolate the parameters that actually drive compute-bound learning: plot only your non-embedding parameter count on the x-axis.
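Concretely, the fix is one line of accounting before you fit anything. A minimal sketch with NumPy; the run results below are invented for illustration, not real measurements:

```python
import numpy as np

# Hypothetical proxy sweep: (total params, embedding params, final loss).
# All numbers invented for illustration.
runs = [
    (50e6,   33e6, 3.90),
    (200e6,  66e6, 3.40),
    (800e6, 131e6, 2.98),
    (3e9,   262e6, 2.62),
]

# The accounting fix: the x-axis is non-embedding parameters only.
n_nonemb = np.array([total - emb for total, emb, _ in runs])
loss     = np.array([final_loss for *_, final_loss in runs])

# Fit log L = log A + slope * log N, a straight line in log-log space.
slope, log_A = np.polyfit(np.log(n_nonemb), np.log(loss), 1)
print(f"fitted exponent: {-slope:.3f}")  # should land near Kaplan et al.'s ~0.076

# Extrapolate to the flagship's *non-embedding* count before authorizing the run.
n_flagship = 100e9 - 1.6e9
print(f"predicted flagship loss: {np.exp(log_A + slope * np.log(n_flagship)):.2f}")
```

Plot total parameters instead and the proxy points slide right by wildly different amounts (about 66% of the count for the smallest model, under 2% for the flagship), which is exactly the bend the interviewer is describing.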