AI Interview Prep

Computer Vision Interview Questions #17 - The Counting Hallucination Trap

Why caption-only supervision lets VLMs hallucinate counts, and how forcing spatial proof fixes it.

Hao Hoang
Jan 18, 2026

You’re in a Senior AI Interview at OpenAI. The interviewer sets a trap:

“Our VLM constantly hallucinates object counts in crowded images. It says ‘8 people’ when there are only 5. We have zero budget for new data collection. How do you fix this?”

90% of candidates walk right into the trap.

Most candidates say...

“I’d use Chain-of-Thought (CoT) prompting to make it reason step-by-step,” or “I’d use RAG to retrieve similar examples.”

These answers are fine for LLMs, but for VLMs they are dead wrong: you are trying to solve a vision problem with language tools.

The reality is that text is cheap but pixels are expensive.
