LLM System Design Interview #8 - The Contaminated Benchmark Trap
When 95% on MMLU doesn’t mean you’ve built a smarter model - it means your training data leaked the exam answers. How to detect semantic contamination before your press release backfires.
You’re in a Lead AI Engineer interview at Anthropic and the interviewer asks:
“Our new model just hit 95% on MMLU, beating GPT-4. The marketing team is drafting a press release. As the engineering lead, what’s the first thing you check for that could invalidate this result?”


