Computer Vision Interview Questions #16 - The Contrastive Hard Negative Trap
How aggressive batch difficulty pushes CLIP from semantic understanding into pixel-level cheating.
You’re in a Senior AI Interview at OpenAI. The interviewer sets a trap:
“Our CLIP model keeps confusing Golden Retrievers with Yellow Labs. To fix it, we’re going to manually curate hard negative batches, forcing these similar breeds into the same training step. Good idea?”
95% of candidates nod “Yes” immediately. They just walked right into the trap.
Their reasoning goes: “Of course. If the model is struggling to differentiate A from B, we must force them together. By increasing the difficulty of the batch (Hard Mining), the gradient signal will be stronger, forcing the model to learn fine-grained features. Harder training = more robust model.”
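The intuition isn’t baseless. Here is a minimal sketch of batch-hard negative mining as it appears in supervised metric learning (a triplet-style loss), assuming PyTorch; the function name and margin value are illustrative, not taken from any particular codebase:

```python
import torch
import torch.nn.functional as F

def batch_hard_triplet_loss(embeddings, labels, margin=0.2):
    """Batch-hard triplet loss: for each anchor, pick the hardest
    positive (farthest same-class sample) and the hardest negative
    (closest different-class sample) within the batch."""
    # Pairwise Euclidean distances, shape (B, B)
    dist = torch.cdist(embeddings, embeddings, p=2)

    same = labels.unsqueeze(0) == labels.unsqueeze(1)   # (B, B) same-class mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)

    # Hardest positive: farthest same-class sample (excluding self)
    pos_dist = dist.masked_fill(~same | eye, float('-inf')).max(dim=1).values
    # Hardest negative: closest different-class sample
    neg_dist = dist.masked_fill(same, float('inf')).min(dim=1).values

    # Only triplets that violate the margin contribute gradient
    return F.relu(pos_dist - neg_dist + margin).mean()
```

The closer the hardest negative sits to the anchor, the more triplets violate the margin and contribute non-zero gradient, which is exactly the “harder batch = stronger signal” effect the candidate is counting on.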
This intuition works for Supervised Learning (e.g., ResNet on ImageNet).
It fails catastrophically for Contrastive Foundation Models.
When you force a CLIP model to distinguish between two nearly identical concepts in the same batch, you aren’t teaching it “nuance.” You are forcing it to cheat.
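To see why, look at a minimal sketch of the symmetric contrastive (InfoNCE) objective used by CLIP-style models, assuming PyTorch; the function name, shapes, and temperature value are illustrative, not OpenAI’s actual implementation:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: every other text in the batch serves as a
    negative for a given image, and vice versa."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    logits = image_emb @ text_emb.t() / temperature          # (B, B) similarities
    targets = torch.arange(len(logits), device=logits.device)  # diagonal = matching pairs

    loss_i2t = F.cross_entropy(logits, targets)   # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```

The batch itself is the negative set. Curate two near-synonymous captions into the same batch and the loss still demands that each image’s similarity to its own caption beat its similarity to the near-duplicate, and the cheapest feature that separates them is usually a low-level image statistic rather than breed semantics.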