Computer Vision Interview Questions #25: The Contrastive Shortcut Trap
Why CLIP learns just enough to pass, and how a generative decoder forces the encoder to stop being lazy.
You're in a Computer Vision interview at OpenAI. The interviewer sets a trap:
"We are building a zero-shot classifier. We have the budget for a standard CLIP architecture. Why should we burn 25% more VRAM adding a Generative Decoder (CoCa) if we don't need to generate captions?"
90% of candidates walk right into it.
The candidates say: "You add the decoder for Multi-Task Learning. It allows the model to handle captioning tasks if business requirements change later."
The interviewer nods politely, makes a note, and the candidates never hear back. Why? Because they treated the architecture as a feature list, not a representation engine.
You aren't optimizing for versatility. You are optimizing for signal density.
Contrastive Loss (the mechanism behind CLIP) is inherently "lazy." It is a global "vibe check." To minimize loss, the model only needs to learn the minimum features necessary to distinguish a "Dog" from a "Table" in the current batch.
It discards fine-grained details (texture, exact counts, spatial relations) because it doesn't need them to satisfy the contrastive objective.
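To make the "lazy" objective concrete, here is a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss in PyTorch. The function name and fixed temperature below are illustrative (CLIP actually learns its temperature); the point is that the encoders only have to separate each matched pair from the other pairs in the batch, so coarse global features are enough to drive the loss down.

```python
import torch
import torch.nn.functional as F


def clip_style_contrastive_loss(image_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE-style loss used in CLIP-like training.

    image_emb, text_emb: (batch, dim) outputs of the two encoders.
    The encoders only need to rank each matched pair above the OTHER
    pairs in this batch, so coarse, batch-relative features suffice.
    """
    # Normalize so the dot product becomes cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds matched pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

Nothing in this objective rewards the model for knowing there were three dogs, or that the dog was to the left of the table; any feature that separates the pairs in the batch is good enough.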
-----
The Solution: The real reason to add a decoder is to enforce the "Granularity Tax."
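For intuition, here is a hedged sketch of what the combined objective looks like once a generative decoder is attached, in the spirit of CoCa: a weighted sum of the contrastive term and a token-by-token captioning cross-entropy. The signature and weights below are illustrative assumptions, not the exact CoCa implementation; the key point is that the captioning term is what forces the image features to retain fine-grained detail.

```python
import torch
import torch.nn.functional as F


def coca_style_loss(contrastive_loss: torch.Tensor,
                    caption_logits: torch.Tensor,
                    caption_tokens: torch.Tensor,
                    contrastive_weight: float = 1.0,
                    caption_weight: float = 2.0) -> torch.Tensor:
    """Illustrative combined objective: contrastive alignment PLUS captioning.

    contrastive_loss: the CLIP-style loss from the sketch above.
    caption_logits:   (batch, seq_len, vocab) decoder predictions.
    caption_tokens:   (batch, seq_len) ground-truth caption token ids.

    Predicting every token ("three", "striped", "left of") is only
    possible if the image features keep counts, textures, and spatial
    layout -- the "Granularity Tax" the decoder imposes on the encoder.
    """
    captioning_loss = F.cross_entropy(
        caption_logits.reshape(-1, caption_logits.size(-1)),
        caption_tokens.reshape(-1),
    )
    return contrastive_weight * contrastive_loss + caption_weight * captioning_loss
```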

