AI Interview Prep

Computer Vision Interview Questions #2 – The Redundant Data Trap

Why labeling 500k more images from the same distribution won’t fix overfitting—and how active learning actually moves the decision boundary.

Hao Hoang
Jan 03, 2026

You’re in a Senior Computer Vision interview at Google and the interviewer drops this scenario:

“We trained a high-capacity ResNet on 500k images, but it’s still overfitting. My Product Manager wants to spend $20k to label another 500k random images scraped from the same source. Do you approve the budget?”

Don’t say: “Yes! Deep learning models are data-hungry. To fix high variance, we just need to feed the beast more data.”

That answer is how companies burn millions on labeling and compute for little to no performance gain.

The reality is that “Big Data” is often just “Redundant Data.”

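If your model is overfitting, it fits the training set closely (in the worst case by memorizing it) but generalizes poorly to the validation set. Adding 500k more images from the exact same distribution (e.g., more sunny highway driving) often provides near-zero Marginal Information Gain: the new samples mostly repeat patterns the model has already seen, so they do little to move the decision boundary.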
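One way to make "marginal information gain" concrete is to score unlabeled images by how uncertain the current model is about them, then spend the labeling budget on the most uncertain ones instead of a random sample. The sketch below uses predictive entropy as that score. It is only one common active-learning heuristic, not necessarily the approach this post goes on to recommend, and the names `predictive_entropy`, `select_for_labeling`, and the `pool_probs` values are hypothetical, made up for illustration.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of one softmax output vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_for_labeling(pool_probs, budget):
    """Return indices of the `budget` most uncertain pool samples.

    Assumption: high entropy is used as a proxy for "informative".
    """
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: predictive_entropy(pool_probs[i]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical softmax outputs for four unlabeled images (3 classes).
pool_probs = [
    [0.98, 0.01, 0.01],  # confident: probably redundant with training data
    [0.40, 0.35, 0.25],  # uncertain: close to the decision boundary
    [0.90, 0.05, 0.05],  # fairly confident
    [0.34, 0.33, 0.33],  # nearly uniform: most uncertain
]

selected = select_for_labeling(pool_probs, budget=2)
print(selected)  # [3, 1]
```

In this toy pool, the two confident predictions (the "more sunny highway" cases) are skipped, and the budget goes to the two images the model cannot yet separate. A real pipeline would likely also need a diversity term, since the most uncertain samples can themselves be near-duplicates of one another.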

© 2026 Hao Hoang