AI Interview Prep

Computer Vision Interview Questions #24 - The Signal-to-Noise Trap

Why 6 billion image-text pairs can lose to 700k dense captions, and how signal density beats brute-force scale.

Hao Hoang
Jan 25, 2026

You’re in a Senior AI Interview at Google DeepMind. The interviewer sets a trap:

“Our competitor just trained a VLM on 6 billion image-text pairs. We only have the compute budget for 700k images. How do we beat them?”

90% of candidates walk right into the “Scale Trap.”

Most candidates immediately pivot to architectural tweaks or hyper-parameter tuning.

- “We need a larger ViT backbone.”

- “We should train for more epochs since the dataset is small.”

- “We need aggressive data augmentation to artificially expand the 700k.”

The Result: They fail. They cannot augment their way out of an ~8,500x data deficit (6 billion ÷ 700k). They are bringing a knife to a nuclear war.

The interviewer isn’t testing their knowledge of scale. The interviewer is testing the candidate’s understanding of Signal-to-Noise Ratio.
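To see why this is an SNR question, it helps to run the arithmetic. The sketch below is a back-of-envelope Python model; every constant in it (grounded words per caption, alignment rates) is an illustrative assumption, not a measured statistic.

```python
# Back-of-envelope: how much usable signal a fixed compute budget buys.
# Every constant here is an ILLUSTRATIVE ASSUMPTION, not a measured value.

BUDGET_STEPS = 700_000  # the scenario's compute budget: ~700k image-text pairs seen

# Assumed properties of a typical web-scraped alt-text caption:
web_grounded_words = 3   # words that actually describe visible content
web_alignment = 0.5      # fraction of captions that describe the image at all

# Assumed properties of a dense, purpose-written caption:
dense_grounded_words = 80
dense_alignment = 0.95

signal_web = web_grounded_words * web_alignment        # per pair
signal_dense = dense_grounded_words * dense_alignment  # per pair

print(f"signal per web pair:   {signal_web:.1f}")      # 1.5
print(f"signal per dense pair: {signal_dense:.1f}")    # 76.0
print(f"per-step advantage:    {signal_dense / signal_web:.0f}x")  # ~51x

# The budget caps how many pairs you can SEE, not how many exist upstream.
print(f"signal seen in budget (web):   {BUDGET_STEPS * signal_web:,.0f}")
print(f"signal seen in budget (dense): {BUDGET_STEPS * signal_dense:,.0f}")
```

Under these assumed rates, every gradient step on dense data teaches roughly fifty times as much as a step on web pairs, and that multiplier, not corpus size, is what the budget constraint actually prices.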

The problem with 6 billion web-scraped images isn’t the images. It’s the text.

Internet text is incidental.

When a human uploads a photo of a dog on a beach, they caption it: “Living my best life! ☀️” or “Good boy.”

They do not write: “A golden retriever standing on white sand to the left of a blue ocean under a clear sky.”
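That gap can be made visible directly in the data. Below is a deliberately crude Python scorer that counts visually grounded words against a tiny hand-picked vocabulary; a real curation pipeline would use a POS tagger or a grounding model instead, so treat this as a sketch of the idea only.

```python
# Toy "signal density" scorer: what fraction of a caption's words are
# visually grounded? The vocabulary is a tiny illustrative stand-in.

GROUNDED_WORDS = {
    # objects / stuff
    "retriever", "dog", "sand", "ocean", "sky", "beach",
    # colors / attributes
    "golden", "white", "blue", "clear",
    # spatial relations / poses
    "left", "right", "under", "above", "on", "standing",
}

def signal_density(caption: str) -> float:
    """Fraction of words in the caption that are visually grounded."""
    words = [w.strip(".,!?").lower() for w in caption.split()]
    if not words:
        return 0.0
    grounded = sum(w in GROUNDED_WORDS for w in words)
    return grounded / len(words)

web_caption = "Living my best life! ☀️"
dense_caption = ("A golden retriever standing on white sand "
                 "to the left of a blue ocean under a clear sky.")

print(f"web:   {signal_density(web_caption):.2f}")    # 0.00
print(f"dense: {signal_density(dense_caption):.2f}")  # 0.67
```

The absolute scores mean nothing; the gap between them is the signal density the interviewer is probing for.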
