Computer Vision Interview Questions #15 – The Multimodal Geometry Trap
How contrastive pretraining collapses spatial information - and why LLaVA-style models must use penultimate patch embeddings.
You are in a Senior AI Interview at Meta. The interviewer sets a trap:
“We are building a Multimodal LLM like LLaVA. We need to feed the frozen CLIP image embeddings into our Language Model. Should we use the final [CLS] token?”
90% of candidates walk right into it.
They say: “Yes, absolutely. The [CLS] token is the global representation of the image. It’s a single, efficient vector that minimizes context window usage and captures the ‘essence’ of the image. It worked for BERT, so it works here.”
This answer reveals they understand Classification, but they don’t understand Reasoning.
In CLIP, the [CLS] token is trained with a contrastive loss to match the embedding of a text caption. It is aggressively optimized to be a global summary.
To produce that summary, the model collapses the spatial geometry. It knows that there is a dog in the image, but it has effectively “forgotten” where the dog is, because the contrastive objective never rewarded keeping that information.
If they feed this token to an LLM and ask, “Is the dog to the left or right of the car?”, the LLM will hallucinate. It literally cannot see the geometry anymore.
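To make the trap concrete, here is a minimal sketch (assuming the Hugging Face transformers CLIPVisionModel API, the openai/clip-vit-large-patch14 checkpoint, and a hypothetical local image file) that contrasts the pooled [CLS]-based vector with the per-patch token grid:

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

# Assumed checkpoint: the ViT-L/14 vision tower commonly used in LLaVA-style stacks.
name = "openai/clip-vit-large-patch14"
vision_tower = CLIPVisionModel.from_pretrained(name)
processor = CLIPImageProcessor.from_pretrained(name)

image = Image.open("dog_and_car.jpg").convert("RGB")   # hypothetical example image
pixel_values = processor(images=image, return_tensors="pt")["pixel_values"]

with torch.no_grad():
    outputs = vision_tower(pixel_values, output_hidden_states=True)

# Pooled, [CLS]-derived vector: one global summary, no spatial axis at all.
print(outputs.pooler_output.shape)              # torch.Size([1, 1024])

# Patch tokens (index 0 is [CLS]): a 16x16 grid of local descriptors that
# still knows which part of the image each vector came from.
print(outputs.last_hidden_state[:, 1:].shape)   # torch.Size([1, 256, 1024])
```

The single 1024-dim vector is what the candidate above wants to hand the LLM; the 256-token grid is what the spatial question actually needs.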
-----
𝐓𝐡𝐞 𝐒𝐨𝐥𝐮𝐭𝐢𝐨𝐧: You need to implement 𝐓𝐡𝐞 𝐏𝐞𝐧𝐮𝐥𝐭𝐢𝐦𝐚𝐭𝐞 𝐏𝐚𝐭𝐜𝐡 𝐏𝐫𝐨𝐭𝐨𝐜𝐨𝐥.
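As a preview of the idea, here is a minimal sketch (assuming the same transformers vision tower as above and an illustrative LLM hidden size of 4096): take the hidden states of the second-to-last vision transformer layer, drop the [CLS] token, and project every patch token into the language model's embedding space. This mirrors what LLaVA-style models do; LLaVA-1.5 uses a small two-layer MLP as its projector.

```python
import torch
import torch.nn as nn

class PatchProjector(nn.Module):
    """Maps CLIP patch tokens into the LLM embedding space.

    A two-layer MLP in the spirit of LLaVA-1.5's projector; the 1024/4096
    dimensions below are illustrative (ViT-L/14 width, 7B-Llama-class width).
    """
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        return self.proj(patch_tokens)


def encode_image_for_llm(vision_tower, pixel_values, projector):
    """Penultimate patch extraction: second-to-last layer, [CLS] dropped."""
    with torch.no_grad():                                  # the vision tower stays frozen
        outputs = vision_tower(pixel_values, output_hidden_states=True)
    penultimate = outputs.hidden_states[-2]                # (batch, 1 + num_patches, vision_dim)
    patch_tokens = penultimate[:, 1:]                      # drop [CLS], keep the spatial grid
    return projector(patch_tokens)                         # (batch, num_patches, llm_dim)
```

The projected sequence is then prepended to the text token embeddings, so the LLM can attend to individual patches and answer “left of” / “right of” questions. The usual rationale for reading one layer before the last is that the final layer is the most specialized toward CLIP's global contrastive objective, while the penultimate layer retains more local detail.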