AI Interview Prep

Computer Vision Interview Questions #16 - The Contrastive Hard Negative Trap

How aggressive batch difficulty pushes CLIP from semantic understanding into pixel-level cheating.

Hao Hoang
Jan 17, 2026

You’re in a Senior AI Interview at OpenAI. The interviewer sets a trap:

“Our CLIP model keeps confusing Golden Retrievers with Yellow Labs. To fix it, we’re going to manually curate hard negative batches, forcing these similar breeds into the same training step. Good idea?”

95% of candidates nod “Yes” immediately. They just walked right into the trap.

They continue: “Of course. If the model struggles to differentiate A from B, we must force the two together. Raising the difficulty of the batch (Hard Mining) strengthens the gradient signal, forcing the model to learn fine-grained features. Harder training = More robust model.”
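
To make the proposal concrete, here is a minimal sketch of what that curation might look like. Everything in it, from the function name `curate_hard_batch` to the assumption that embeddings are L2-normalized, is illustrative rather than taken from any real CLIP pipeline; it simply fills each batch with the anchor's nearest neighbors in embedding space, the "maximize difficulty" instinct taken literally.

```python
import torch

def curate_hard_batch(anchor_idx: int, embeddings: torch.Tensor,
                      batch_size: int) -> torch.Tensor:
    """Build a batch from the anchor plus its nearest neighbors in
    embedding space (the hardest available negatives)."""
    # Cosine similarity to every other example (rows assumed L2-normalized).
    sims = embeddings @ embeddings[anchor_idx]
    # Never select the anchor as its own negative.
    sims[anchor_idx] = float("-inf")
    # The most similar examples are the hardest negatives.
    hard_idx = sims.topk(batch_size - 1).indices
    return torch.cat([torch.tensor([anchor_idx]), hard_idx])
```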

This intuition works for Supervised Learning (e.g., ResNet on ImageNet).

It fails catastrophically for Contrastive Foundation Models.
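
The difference lies in the loss. In supervised metric learning, a hinge-based triplet loss zeroes out easy negatives entirely: any negative already beyond the margin contributes no gradient, so hard negatives carry all of the learning signal. Here is a minimal sketch, with illustrative names and a FaceNet-style margin:

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor: torch.Tensor, positive: torch.Tensor,
                 negative: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """Hinge triplet loss. Easy negatives (already farther from the
    anchor than the positive by more than `margin`) produce exactly
    zero loss and zero gradient; only hard negatives drive updates."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()
```

Mining harder negatives keeps that hinge active, which is exactly why the candidate's intuition is sound in this regime.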

When you force a CLIP model to distinguish between two nearly identical concepts in the same batch, you aren’t teaching it “nuance.” You are forcing it to cheat.
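
To see why, look at the objective itself. The sketch below is the standard symmetric InfoNCE loss that CLIP trains with; the variable names and the 0.07 temperature are conventional choices, not details from this post.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch: diagonal pairs are positives,
    and every off-diagonal pair is implicitly a negative."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # (B, B) matrix of image-text similarities.
    logits = image_emb @ text_emb.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)    # match each image to its caption
    loss_t2i = F.cross_entropy(logits.T, targets)  # match each caption to its image
    return (loss_i2t + loss_t2i) / 2
```

Every off-diagonal entry of `logits` is a negative. Put a Golden Retriever and a Yellow Lab with near-identical captions in the same batch, and the loss still demands that each image score far higher with its own caption than with the other's. The only way to satisfy that constraint is to find some separating signal, and the cheapest one is rarely semantic.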
