LLM System Design Interview #48 - The Dimensionality Trap
Why dumping 10x more data into your model silently flatlines your scaling curve, and how to manipulate effective dimensionality before you burn millions in useless H100 compute.
You’re in a Senior AI Engineer interview at DeepMind. The interviewer sets a trap:
“You scaled your pre-training dataset by 10x, but the error rate barely budged. Your model is massively over-parameterized, so capacity isn’t the issue. What ‘intrinsic’ statistical property of your target task is fundamentally bottlenecking your power-law returns?”
95% of candidates walk right into it.
Most candidates say: “It’s a data-quality issue. The new 10x scrape is full of low-entropy garbage, or the duplication rate is too high. We need to run strict MinHash deduplication, upsample high-quality sources, or bump the learning rate to shake the model out of a local minimum.”
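And to be fair, the patch is trivial to write. Here is a minimal near-duplicate filter using MinHash LSH via the open-source datasketch library (the tiny corpus and the Jaccard threshold are illustrative; any real pipeline would tune both):

```python
# The "patch" in code: near-duplicate filtering with MinHash LSH.
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    """Hash a document's character 5-gram shingles into a MinHash sketch."""
    m = MinHash(num_perm=num_perm)
    for shingle in {text[i:i + 5] for i in range(max(len(text) - 4, 1))}:
        m.update(shingle.encode("utf8"))
    return m

corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumped over the lazy dog",  # near-duplicate
    "scaling laws depend on the intrinsic dimension of the task",
]

lsh = MinHashLSH(threshold=0.5, num_perm=128)  # Jaccard cutoff; tune per corpus
kept = []
for i, doc in enumerate(corpus):
    sketch = minhash(doc)
    if not lsh.query(sketch):         # no near-duplicate already kept
        lsh.insert(f"doc-{i}", sketch)
        kept.append(doc)

print(f"kept {len(kept)} of {len(corpus)} docs")  # expect the near-dup dropped
```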
That is a patch, not a solution. If you blindly blame the data pipeline without understanding the underlying statistical physics, you are going to burn millions of dollars in H100 compute chasing a flat curve.
𝐓𝐡𝐞 𝐑𝐞𝐚𝐥𝐢𝐭𝐲:
The bottleneck isn’t just dirty data. It is the intrinsic dimensionality of the target manifold.
In statistical machine learning, scaling laws follow polynomial decay rates tied directly to the flexibility of the function class and the intrinsic dimension of the data. For a flexible, non-parametric model fitting a complex target, error decays at a rate of roughly O(n^(-1/D)), where D is the intrinsic dimension of the data manifold. When D is large, 10x more data shrinks error by only a factor of 10^(-1/D), which crawls toward 1 as D grows. Your power-law slope flattens out not because the data is “bad,” but because the volume of statistical space you need to cover grows exponentially with D, far outpacing any realistic compute budget.
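To feel how hard the curve flattens, plug numbers into that rate. A back-of-the-envelope sketch, assuming a clean error ∝ n^(-1/D) law (a simplification: real loss curves have irreducible terms and mixed exponents):

```python
# Back-of-the-envelope: if error ~ C * n^(-1/D), then 10x data
# multiplies error by 10^(-1/D). Watch the returns vanish as D grows.
for D in (2, 10, 50, 100):
    factor = 10 ** (-1 / D)
    print(f"D={D:>3}: error x{factor:.3f} ({(1 - factor) * 100:.1f}% reduction)")
# D=  2: error x0.316 (68.4% reduction)
# D= 10: error x0.794 (20.6% reduction)
# D= 50: error x0.955 (4.5% reduction)
# D=100: error x0.977 (2.3% reduction)
```

At D = 100, an order of magnitude more tokens buys you about 2% of error. That is the flat curve the interviewer is describing.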
𝐓𝐡𝐞 𝐒𝐨𝐥𝐮𝐭𝐢𝐨𝐧:
You don’t just dump more raw tokens into the void. You manipulate the effective dimensionality.
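Manipulating D starts with measuring it. A minimal sketch, assuming your corpus is already embedded as a matrix of vectors: the TwoNN estimator (Facco et al., 2017) infers intrinsic dimension from the ratio of each point’s two nearest-neighbor distances. The function name and the synthetic sanity check are illustrative, not part of any specific pipeline.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def twonn_intrinsic_dim(X: np.ndarray) -> float:
    """TwoNN estimator (Facco et al., 2017): the MLE of intrinsic dimension
    from the ratio of 2nd- to 1st-nearest-neighbor distances."""
    # n_neighbors=3 because each point is its own neighbor at distance 0.
    dists, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
    r1, r2 = dists[:, 1], dists[:, 2]
    keep = r1 > 0                      # drop exact duplicates (ratio undefined)
    mu = r2[keep] / r1[keep]
    return keep.sum() / np.log(mu).sum()

# Sanity check: data living on a 2-D plane embedded in 1000-D ambient space.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 2)) @ rng.normal(size=(2, 1000))
print(f"estimated intrinsic dimension: {twonn_intrinsic_dim(X):.1f}")  # ~2
```

If that estimate comes back in the hundreds for your target task, O(n^(-1/D)) says no scrape is big enough; the leverage is in driving the effective D down, not the token count up.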