AI Interview Prep

Computer Vision Interview Questions #18 – The Compositionality Trap

Why 99% Mean Average Precision fails in the real world, and why bounding boxes can’t reason about relationships.

Hao Hoang
Jan 19, 2026

You are in a Senior Computer Vision interview at Google DeepMind. The Lead Engineer sets a trap:

“Our YOLO model has 99% mAP (mean Average Precision) on People and Fire Hydrants individually. But in production, we saw a person sitting on a fire hydrant, and the model didn’t flag it as anomalous. It just saw two boxes. Why did we fail, and how do you fix it?”
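
To see the trap concretely, here is a minimal sketch of what a detector typically hands you for that frame after non-max suppression. The labels, scores, and box coordinates below are invented for illustration, not real model output:

```python
# Hypothetical post-NMS detector output for the frame in question.
# Coordinates are (x1, y1, x2, y2); all values are made up.
detections = [
    {"label": "person",       "box": (412, 180, 520, 430), "score": 0.97},
    {"label": "fire hydrant", "box": (430, 360, 515, 470), "score": 0.95},
]

# Each detection is individually correct, so per-class mAP looks great.
# But the output is just an unordered list of boxes: nothing in this data
# structure encodes "sitting on", "next to", or any other relationship,
# so there is nothing for a downstream rule to flag as anomalous.
for det in detections:
    print(f'{det["label"]:>12}  score={det["score"]:.2f}  box={det["box"]}')
```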

90% of candidates walk right into the trap.

They say “It’s a data problem. We need to collect 5,000 images of people sitting on fire hydrants and retrain the model on a new class person_on_hydrant.”

They just signed the company up for a lifetime of manual data labeling.

The real world follows a long-tail distribution. You cannot collect a dataset for every combination reality can produce, because the combinations are effectively infinite (a horse wearing a hat, a toaster in a bathtub, a person on a hydrant).
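
For a rough sense of scale, assume the standard 80-class COCO label set and a hypothetical handful of relationship predicates (“on”, “wearing”, “holding”, and so on). A quick back-of-the-envelope count of the classes the naive fix would need:

```python
from math import comb

num_classes = 80      # e.g., the COCO label set
num_predicates = 10   # hypothetical: "on", "in", "wearing", "holding", ...

object_pairs = comb(num_classes, 2)              # unordered class pairs: 3,160
combined_classes = object_pairs * num_predicates # one class per (pair, relation): 31,600

print(f"object pairs: {object_pairs}")
print(f"combined classes with {num_predicates} predicates: {combined_classes}")
```

Under the naive approach, each of those combined classes would need its own few thousand labeled images, and the tail keeps growing as new objects and new relationships show up in production.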

