Computer Vision Interview Questions #2 – The Redundant Data Trap
Why labeling 500k more images from the same distribution won’t fix overfitting—and how active learning actually moves the decision boundary.
You’re in a Senior Computer Vision interview at Google and the interviewer drops this scenario:
“We trained a high-capacity ResNet on 500k images, but it’s still overfitting. My Product Manager wants to spend $20k to label another 500k random images scraped from the same source. Do you approve the budget?”
Don’t say: “Yes! Deep learning models are data-hungry. To fix high variance, we just need to feed the beast more data.”
That answer is how companies burn millions on compute with zero performance gain.
The reality is that *Big Data* is often just *Redundant Data*.
If your model is overfitting, it has memorized the training set but fails to generalize to the validation set. Adding 500k more images from the exact same distribution (e.g., more sunny highway driving) often provides near-zero **marginal information gain**: most of the new samples are ones the model already classifies confidently, so they barely move the decision boundary.
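A cheaper alternative to labeling random images is uncertainty sampling, the simplest form of active learning: score each unlabeled image by the model's predictive entropy and spend the labeling budget only on the samples it is least sure about. A minimal sketch, assuming softmax outputs from the current model (the function names and toy probabilities below are illustrative, not from any specific library):

```python
import numpy as np

def predictive_entropy(probs):
    # Entropy of each row of class probabilities; higher = more uncertain.
    eps = 1e-12  # guard against log(0)
    return -np.sum(probs * np.log(probs + eps), axis=1)

def select_for_labeling(probs, budget):
    # Rank the unlabeled pool by uncertainty, return the top-`budget` indices.
    scores = predictive_entropy(probs)
    return np.argsort(scores)[::-1][:budget]

# Toy pool of 4 unlabeled images with the model's softmax outputs.
pool_probs = np.array([
    [0.98, 0.01, 0.01],  # confident -> redundant, not worth $ to label
    [0.34, 0.33, 0.33],  # near-uniform -> highly informative
    [0.90, 0.05, 0.05],
    [0.50, 0.45, 0.05],  # two classes competing -> near the boundary
])
picked = select_for_labeling(pool_probs, budget=2)
print(picked)  # -> [1 3]: the two most uncertain samples
```

The same ranking idea works with margin sampling (difference between the top two class probabilities) or ensemble disagreement; the point is that a small, targeted labeling budget beats 500k redundant labels.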


