Computer Vision Interview Questions #11 – The CLIP Prompt Variance Trap
Why a single text prompt is a noisy estimate in high-dimensional embedding space, and how centroid stabilization restores zero-shot accuracy.
You’re in a Senior Computer Vision interview at OpenAI. The interviewer sets a trap:
“We just deployed a CLIP model for zero-shot classification. We’re feeding in raw class names like 𝘥𝘰𝘨 or 𝘱𝘭𝘢𝘯𝘦 as text prompts. The accuracy is shaky and the variance is high. Without retraining a single parameter, 𝐡𝐨𝐰 𝐝𝐨 𝐲𝐨𝐮 𝐟𝐢𝐱 𝐭𝐡𝐞 𝐬𝐭𝐚𝐛𝐢𝐥𝐢𝐭𝐲 𝐚𝐧𝐝 𝐛𝐨𝐨𝐬𝐭 𝐈𝐦𝐚𝐠𝐞𝐍𝐞𝐭 𝐚𝐜𝐜𝐮𝐫𝐚𝐜𝐲?”
90% of candidates walk right into the trap.
Most say: “Just change the prompt to 𝘈 𝘱𝘩𝘰𝘵𝘰 𝘰𝘧 𝘢 [𝘤𝘭𝘢𝘴𝘴].”
It’s not wrong, and it does help. But it reveals that they treat 𝐅𝐨𝐮𝐧𝐝𝐚𝐭𝐢𝐨𝐧 𝐌𝐨𝐝𝐞𝐥𝐬 like magic black boxes rather than 𝘩𝘪𝘨𝘩-𝘥𝘪𝘮𝘦𝘯𝘴𝘪𝘰𝘯𝘢𝘭 𝘷𝘦𝘤𝘵𝘰𝘳 𝘴𝘱𝘢𝘤𝘦𝘴, and that they are betting your production metrics on a single point in latent space.
The Senior Engineer knows that in high-dimensional space, a single text embedding is noisy. 𝘋𝘰𝘨 could mean a pet, a hot dog, or a friend. Even 𝘈 𝘱𝘩𝘰𝘵𝘰 𝘰𝘧 𝘢 𝘥𝘰𝘨 is just one vector direction.
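You can measure that noise directly. Here is a minimal sketch (using Hugging Face's openai/clip-vit-base-patch32 checkpoint; the paraphrase list is illustrative, not canonical) that embeds several phrasings of the same concept and checks how far apart they land on the unit sphere:

```python
# Sketch: how much does CLIP's text embedding move when only the phrasing changes?
# Checkpoint and paraphrases below are illustrative assumptions.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Five phrasings of the *same* concept.
paraphrases = [
    "dog",
    "a photo of a dog",
    "a picture of a dog",
    "an image of a dog, a type of pet",
    "a close-up photo of a dog",
]

with torch.no_grad():
    inputs = processor(text=paraphrases, return_tensors="pt", padding=True)
    emb = model.get_text_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)  # project onto the unit sphere

# Pairwise cosine similarities sit well below 1.0: each sentence points in a
# measurably different direction for the "same" class.
print((emb @ emb.T).round(decimals=3))
```

Every one of those prompts is a reasonable choice, yet each points in a different direction, and a zero-shot classifier built on any single one inherits that sentence’s idiosyncrasies.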
To pass this interview, you need to mention 𝐓𝐡𝐞 𝐂𝐞𝐧𝐭𝐫𝐨𝐢𝐝 𝐒𝐭𝐚𝐛𝐢𝐥𝐢𝐳𝐚𝐭𝐢𝐨𝐧 𝐏𝐫𝐨𝐭𝐨𝐜𝐨𝐥.
We don’t want the vector for a specific sentence. We want the mean vector that represents the concept itself, robust to linguistic noise.
-----
𝐓𝐡𝐞 𝐒𝐨𝐥𝐮𝐭𝐢𝐨𝐧:
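Ensemble the prompts and take the centroid: embed several templated sentences per class, L2-normalize each embedding, average them, and renormalize the mean back onto the unit sphere. That mean vector becomes the class’s row in the zero-shot weight matrix. Below is a minimal sketch of the idea, assuming the same Hugging Face checkpoint as above; the five templates and two class names are illustrative placeholders, not a canonical set.

```python
# Sketch of centroid stabilization (prompt ensembling) for zero-shot CLIP.
# Templates and class names are illustrative assumptions.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

TEMPLATES = [
    "a photo of a {}.",
    "a blurry photo of a {}.",
    "a close-up photo of a {}.",
    "a drawing of a {}.",
    "a photo of the large {}.",
]

@torch.no_grad()
def class_centroid(name: str) -> torch.Tensor:
    """Mean of the unit-normalized template embeddings, renormalized:
    the direction of the concept, not of any single sentence."""
    prompts = [t.format(name) for t in TEMPLATES]
    inputs = processor(text=prompts, return_tensors="pt", padding=True)
    emb = model.get_text_features(**inputs)        # (T, D)
    emb = emb / emb.norm(dim=-1, keepdim=True)     # unit vectors
    centroid = emb.mean(dim=0)                     # average direction
    return centroid / centroid.norm()              # back onto the unit sphere

# One centroid per class -> the zero-shot classifier's weight matrix.
classes = ["dog", "plane"]
W = torch.stack([class_centroid(c) for c in classes])  # (C, D)

# At inference: L2-normalize the image features, then logits = image_feats @ W.T
```

Nothing is retrained; only the text-side class vectors change. For scale: the CLIP paper ensembles 80 such templates on ImageNet, which adds roughly 3.5 points of zero-shot accuracy over the single default prompt, and almost 5 points when combined with prompt engineering.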