Machine Learning System Design Interview #42 - The Base-Rate F1 Trap
Why a phenomenal 0.90 F1-score can quietly mask a completely untrained dummy model, and how to decouple aggregate metrics before they cause a silent production crash.
You’re in a Senior ML Engineer interview at Meta. The interviewer sets a trap:
“An engineer shows you a binary classification model boasting a phenomenal 0.90 F1-score on a newly curated validation set, claiming it’s ready for production deployment. Before even looking at the architecture, you flag this metric as a potential illusion. What hidden data profile characteristic are you suspecting, and how do you prove it?”
95% of candidates walk right into it.
Most candidates say: “A 0.90 F1-score is highly robust against class imbalance, unlike accuracy, so the model is fundamentally solid. To be safe, I’ll just check the confusion matrix, plot the ROC curve, and tune the classification threshold.”
𝐓𝐡𝐞 𝐑𝐞𝐚𝐥𝐢𝐭𝐲:
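They forgot how easily aggregate metrics mask a skewed base rate and, across slices, Simpson’s Paradox. If your newly curated validation set is 90% positive, a completely brainless, untrained dummy model that outputs the positive class at random 90% of the time, ignoring its input entirely, will score about 0.90 F1 in expectation. Its precision equals the base rate (0.90), its recall equals its positive-prediction rate (0.90), and F1 is the harmonic mean of the two. A dummy that always predicts positive does even better, at roughly 0.95.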
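To make the arithmetic concrete, here is a minimal sketch that computes the expected F1 of a label-independent guesser. The helper names (`f1_from_counts`, `expected_dummy_counts`) and the 1,000-example set size are assumptions chosen for illustration; only the 90% base rate and 90% prediction rate come from the scenario above.

```python
def f1_from_counts(tp: float, fp: float, fn: float) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the score is undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def expected_dummy_counts(n: int, base_rate: float, predict_rate: float):
    """Expected confusion-matrix cells for a guesser that ignores its input.

    base_rate:    fraction of examples that are truly positive
    predict_rate: fraction of examples the dummy labels positive
    """
    pos = n * base_rate          # true positives available in the data
    neg = n - pos                # true negatives available in the data
    tp = pos * predict_rate      # positives the dummy happens to flag
    fn = pos - tp                # positives it misses
    fp = neg * predict_rate      # negatives it wrongly flags
    tn = neg - fp                # negatives it happens to leave alone
    return tp, fp, fn, tn


# Hypothetical validation set: 1,000 examples, 90% positive.
tp, fp, fn, tn = expected_dummy_counts(n=1000, base_rate=0.9, predict_rate=0.9)
random_f1 = f1_from_counts(tp, fp, fn)

# Same data, but the dummy always answers "positive".
tp, fp, fn, tn = expected_dummy_counts(n=1000, base_rate=0.9, predict_rate=1.0)
always_positive_f1 = f1_from_counts(tp, fp, fn)

print(f"random 90% guesser F1:  {random_f1:.3f}")
print(f"always-positive F1:     {always_positive_f1:.3f}")
```

Under these assumptions the random guesser lands at 0.900 and the always-positive guesser at about 0.947, so a reported 0.90 only means something once it is compared against a baseline like this.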
You aren’t looking at a production-ready model; you are looking at a baseline illusion. Relying on a single global metric computed over the whole validation set blinds you to systematic failures inside critical data slices and the minority class, and it sets up a silent failure the moment the model meets a real-world class distribution that differs from the one it was validated on.
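One way to see what the global number hides is to score each class separately. The sketch below is an illustration under stated assumptions, not a prescribed procedure: `per_class_f1` is a hypothetical helper, and the labels are laid out to match the random dummy's expected confusion matrix on a 900-positive, 100-negative set (TP=810, FN=90, FP=90, TN=10) rather than sampled.

```python
def per_class_f1(y_true, y_pred):
    """Return {label: F1}, treating each label in turn as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Hypothetical validation set: 900 positives followed by 100 negatives.
y_true = [1] * 900 + [0] * 100
# Dummy predictions arranged to hit its expected counts exactly:
# 810 TP and 90 FN on the positives, then 90 FP and 10 TN on the negatives.
y_pred = [1] * 810 + [0] * 90 + [1] * 90 + [0] * 10

scores = per_class_f1(y_true, y_pred)
macro_f1 = sum(scores.values()) / len(scores)

print(f"positive-class F1: {scores[1]:.2f}")
print(f"negative-class F1: {scores[0]:.2f}")
print(f"macro F1:          {macro_f1:.2f}")
```

With this layout the positive-class F1 is 0.90, the negative-class F1 is 0.10, and the macro average is 0.50: the headline score comes entirely from the majority class.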
𝐓𝐡𝐞 𝐒𝐨𝐥𝐮𝐭𝐢𝐨𝐧:


