Machine Learning System Design Interview #43 - The Overfitting Illusion
The hidden trap: skipping a single-batch memorization test to "save compute" can silently burn your entire cluster budget on an architecture incapable of learning.
You’re in a Senior Staff AI Engineer interview at Meta. The interviewer sets a trap:
“Before spinning up a massive distributed training run across a cluster of 512 H100 GPUs, you mandate that your team run a test to deliberately overfit the architecture on a single batch of data. Your team objects, arguing that compute is too expensive to waste on memorizing one batch. How do you justify this?”
95% of candidates walk right into it.
They immediately suggest: “We should rely on standard PyTorch shape validation and static CI/CD syntax checks, then immediately kick off the distributed run with proper validation splits to optimize for generalizability instead of wasting compute on an intentional failure mode.”
𝐓𝐡𝐞 𝐑𝐞𝐚𝐥𝐢𝐭𝐲:
Deep learning pipelines almost never throw traditional syntax errors when they are fundamentally broken. Instead, they fail silently: tensor operations like .view() or .reshape() can produce exactly the expected shape while scrambling token order or spatial layout, and a misplaced minus sign in a custom loss function still runs without raising an error.
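To make the silent failure concrete, here is a minimal sketch. It assumes a toy (batch, seq) tensor invented for illustration; the variable names are hypothetical. Both results have the shape a validator would expect, but only one preserves which token belongs to which sequence.

```python
import torch

# Hypothetical toy batch: 2 sequences of 3 tokens each, laid out as (batch, seq).
x = torch.tensor([[1, 2, 3],
                  [4, 5, 6]])

# Goal: a (seq, batch) layout. Both tensors below have shape (3, 2).
wrong = x.view(3, 2)        # reinterprets memory order; mixes tokens across sequences
right = x.transpose(0, 1)   # actually swaps the two axes

shapes_match = wrong.shape == right.shape    # a shape check passes
values_match = torch.equal(wrong, right)     # the contents differ

print(wrong.tolist())   # [[1, 2], [3, 4], [5, 6]]
print(right.tolist())   # [[1, 4], [2, 5], [3, 6]]
print(shapes_match, values_match)
```

A shape assertion accepts both tensors, which is why shape validation alone cannot catch this class of bug.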
If your network cannot drive training loss to near zero on a single micro-batch, with regularization such as dropout and weight decay switched off, something in your optimization path is broken: the data pipeline, the loss, the gradient flow, or the optimizer setup. Skipping this 2-minute check means risking a $100,000 compute budget on a pipeline that, as wired, cannot learn.
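A single-batch memorization check along these lines might look like the sketch below. Everything here is an assumption for illustration: the helper name `can_overfit_one_batch`, the tiny MLP, the random data, and the step count, learning rate, and tolerance. A real run would plug in the actual model, loss, and one real micro-batch.

```python
import torch
from torch import nn

def can_overfit_one_batch(model, loss_fn, inputs, targets,
                          steps=2000, lr=1e-2, tol=1e-2):
    """Train repeatedly on the same batch; report whether loss falls below tol.

    Assumes a non-negative loss such as cross-entropy.
    """
    model.train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_value = float("inf")
    for _ in range(steps):
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
        loss_value = loss.item()
        if loss_value < tol:
            return True, loss_value
    return False, loss_value

# Hypothetical stand-ins for the real model and a real micro-batch.
torch.manual_seed(0)
inputs = torch.randn(8, 16)            # 8 examples, 16 features
targets = torch.randint(0, 4, (8,))    # 4 classes
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))

ok, final_loss = can_overfit_one_batch(model, nn.CrossEntropyLoss(), inputs, targets)
print(ok)
```

If `ok` comes back False on a batch this small, the next step is to debug the wiring before requesting any cluster time.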
𝐓𝐡𝐞 𝐒𝐨𝐥𝐮𝐭𝐢𝐨𝐧:


