LLM System Design Interview #47 - The Grid Search Trap
The hidden reason grid-searching data mixtures at the 1B scale is computationally reckless - and how exploiting the log-log scaling offset lets you transfer the optimal mixture weights to your 100B run.
You’re in a Senior Pre-training Engineer interview at DeepMind. The interviewer sets a trap:
“Compute is tight, but you need to find the exact optimal ratio of code, web, and book data for a 100B parameter model. How do you empirically determine the perfect mixture without wasting millions of dollars running ablation tests at massive scale?”
95% of candidates walk right into it.
Most candidates say: “I’d train a grid of 1B parameter models on different data mixtures, evaluate them on downstream benchmarks like MMLU and HumanEval, and just pick the mixture that scores the highest. Then we scale that winning ratio up to 100B.”
That is a guaranteed way to burn a $10M hole in your GPU cluster budget.
𝐓𝐡𝐞 𝐑𝐞𝐚𝐥𝐢𝐭𝐲:
Downstream task performance is notoriously noisy and non-linear at small scales.
If you use zero-shot accuracy to evaluate 1B proxy models, you are optimizing for statistical noise, not genuine differences in what the models have learned. On top of that, running a massive grid search over arbitrary mixtures is computationally reckless.
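A quick back-of-the-envelope sketch of that noise (the 1,000-question benchmark and ~30% accuracy below are illustrative numbers, not measurements):

```python
import math

# Back-of-envelope: sampling noise alone on a zero-shot benchmark.
# Suppose a 1B proxy model scores ~30% on a 1,000-question benchmark.
n_questions = 1_000
accuracy = 0.30

# Binomial standard error of the measured accuracy.
std_err = math.sqrt(accuracy * (1 - accuracy) / n_questions)
print(f"standard error: {std_err:.3%}")   # ~1.45 percentage points

# Mixture-driven differences between small proxy runs are often smaller
# than this, so picking the "winning" mixture by benchmark score at this
# scale can amount to picking noise.
```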
The physics of scaling laws tells us something critical: changing your data composition does not change the slope of your scaling curve in log-log space; it only shifts the offset (the y-intercept). If you understand that, you don’t need a massive grid search or a 100B-parameter ablation to find the optimal ratio.
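Here is a tiny sketch of what that claim buys you. The power-law form follows the scaling-law literature, but the slope and per-mixture offsets below are made-up numbers for illustration only:

```python
import numpy as np

# Assumed power-law form: L(C) = exp(b_m) * C**(-alpha),
# i.e. log L = b_m - alpha * log C. The slope alpha is shared across
# mixtures; only the offset b_m depends on the data mixture m.
# Every number below is illustrative.
alpha = 0.05
offsets = {"code_heavy": 3.25, "web_heavy": 3.31, "book_heavy": 3.28}

def predicted_loss(compute_flops, mixture):
    """Predicted validation loss at a given compute budget."""
    return np.exp(offsets[mixture]) * compute_flops ** (-alpha)

# Shared slope means parallel lines in log-log space, so the ranking of
# mixtures at proxy scale (~1e19 FLOPs) matches the ranking at target
# scale (~1e24 FLOPs).
for m in offsets:
    print(f"{m}: {predicted_loss(1e19, m):.3f} -> {predicted_loss(1e24, m):.3f}")
```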
𝐓𝐡𝐞 𝐒𝐨𝐥𝐮𝐭𝐢𝐨𝐧:
1️⃣ Spin up a set of small proxy models (e.g., 50M to 1B parameters) across a few orders of magnitude of compute.
2️⃣ Train them on a handful of distinct, orthogonal data mixtures (e.g., heavy code, heavy web, heavy books).
3️⃣ Ignore downstream benchmarks completely. Evaluate the models strictly on next-token prediction cross-entropy loss against a high-quality, held-out validation set.
4️⃣ Because data composition only affects the scaling offset, you can confidently fit a log-log linear scaling law (shared slope, per-mixture offset) to the cross-entropy loss of each mixture.
5️⃣ Formulate the expected loss as a function of the data mixing weights and solve for the optimal weights at the small scale using standard regression, as sketched below. That mathematically derived minimum then transfers to your 100B run.
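A minimal end-to-end sketch of steps 4 and 5 on synthetic numbers: the probe mixtures, the fabricated losses, and the quadratic surrogate for the offset as a function of mixing weights are illustrative assumptions (the data-mixing-law papers listed below use richer parametric forms), but the shared-slope fit and the simplex-constrained regression are the mechanics the answer relies on:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# --- Stand-in for proxy-run results (every number here is synthetic) ---
# 8 probe mixtures over (code, web, books) and 4 compute budgets in FLOPs.
probe_w = np.array([[0.80, 0.10, 0.10],
                    [0.10, 0.80, 0.10],
                    [0.10, 0.10, 0.80],
                    [0.50, 0.30, 0.20],
                    [0.20, 0.50, 0.30],
                    [0.30, 0.20, 0.50],
                    [0.40, 0.40, 0.20],
                    [0.34, 0.33, 0.33]])
compute = np.array([1e18, 1e19, 1e20, 1e21])

def _true_offset(w):
    # Hidden "ground truth" used only to fabricate demo losses.
    return (3.2 + 0.6 * (w[0] - 0.45) ** 2
                + 0.5 * (w[1] - 0.35) ** 2
                + 0.5 * (w[2] - 0.20) ** 2)

losses = np.array([[np.exp(_true_offset(w)) * c ** (-0.05) for c in compute]
                   for w in probe_w])
losses *= 1.0 + 0.002 * rng.standard_normal(losses.shape)  # measurement noise

# --- Step 4: fit log L = b_m - alpha * log C with one shared slope alpha ---
n_mix, n_c = losses.shape
A = np.zeros((n_mix * n_c, n_mix + 1))
for m in range(n_mix):
    A[m * n_c:(m + 1) * n_c, m] = 1.0            # per-mixture offset b_m
A[:, -1] = -np.tile(np.log(compute), n_mix)      # shared slope alpha
coef, *_ = np.linalg.lstsq(A, np.log(losses).ravel(), rcond=None)
offsets, alpha = coef[:n_mix], coef[-1]

# --- Step 5: regress the offset against mixing weights, then minimise ---
# Quadratic-in-w surrogate for b(w); published mixing laws use other forms.
def features(w):
    w = np.asarray(w)
    return np.concatenate([w, w ** 2])

Phi = np.array([features(w) for w in probe_w])
theta, *_ = np.linalg.lstsq(Phi, offsets, rcond=None)

result = minimize(lambda w: features(w) @ theta,
                  x0=np.ones(3) / 3,
                  bounds=[(0.0, 1.0)] * 3,
                  constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}])

print(f"fitted shared slope alpha: {alpha:.3f}")
# With this synthetic data the optimum is ~(0.45, 0.35, 0.20) by construction.
print("regressed optimal mixture (code, web, books):", result.x.round(3))
```

In a real run you would replace the synthetic loss table with the measured validation cross-entropy from your 50M to 1B proxy models, and sanity-check the fitted law on a held-out probe mixture before trusting the extrapolation.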
𝐓𝐡𝐞 𝐀𝐧𝐬𝐰𝐞𝐫 𝐓𝐡𝐚𝐭 𝐆𝐞𝐭𝐬 𝐘𝐨𝐮 𝐇𝐢𝐫𝐞𝐝:
You use small proxy models to fit scaling laws on next-token validation loss, exploiting the fact that data composition shifts the scaling offset but not the slope. This allows you to regress the optimal data weights at the 1B scale and confidently transfer that mixture to the 100B production run.
#MachineLearning #MLEngineering #LLMs #ScalingLaws #PreTraining #DeepLearning #AIArchitecture


📚 Related Papers:
- Scaling Laws for Neural Language Models. Available at: https://arxiv.org/abs/2001.08361
- DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining. Available at: https://arxiv.org/abs/2305.10429
- Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance. Available at: https://arxiv.org/abs/2403.16952
- BiMix: Bivariate Data Mixing Law for Language Model Pretraining. Available at: https://arxiv.org/abs/2405.14908