Computer Vision Interview Questions #18 – The Compositionality Trap
Why 99% Mean Average Precision fails in the real world, and why bounding boxes can’t reason about relationships.
You are in a Senior Computer Vision interview at Google DeepMind. The Lead Engineer sets a trap:
“Our YOLO model has 99% mAP (Mean Average Precision) on *People* and *Fire Hydrants* individually. But in production, we saw a person sitting on a fire hydrant, and the model didn’t flag it as anomalous. It just saw two boxes. Why did we fail, and how do you fix it?”
90% of candidates walk right into the trap.
They say: “It’s a data problem. We need to collect 5,000 images of people sitting on fire hydrants and retrain the model with a new class, `person_on_hydrant`.”
They just signed the company up for a lifetime of manual data labeling.
The real world follows a long-tail distribution: you cannot collect a dataset for each of the infinitely many combinations reality produces (a horse wearing a hat, a toaster in a bathtub, a person on a hydrant).
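To see the trap concretely, here is a minimal sketch (with made-up box coordinates and a hypothetical `is_on_top_of` helper) of what you are forced to write if your model only emits independent bounding boxes: a hand-rolled spatial predicate per relationship. One predicate is easy; one per long-tail combination is the lifetime of manual work the candidate just signed up for.

```python
# Hypothetical sketch: detector output is just two independent boxes,
# so any relationship ("sitting on") must be reconstructed by hand.
# Assumed box format: (x_min, y_min, x_max, y_max), y grows downward.

def is_on_top_of(upper, lower, x_overlap_ratio=0.5, y_gap=20):
    """Crude 'sitting on' check: the bottom edge of `upper` is near the
    top edge of `lower`, with substantial horizontal overlap."""
    ux1, uy1, ux2, uy2 = upper
    lx1, ly1, lx2, ly2 = lower
    # Horizontal overlap between the two boxes
    overlap = max(0, min(ux2, lx2) - max(ux1, lx1))
    min_width = min(ux2 - ux1, lx2 - lx1)
    # Vertical adjacency: bottom of upper box near top of lower box
    adjacent = abs(uy2 - ly1) <= y_gap
    return adjacent and overlap >= x_overlap_ratio * min_width

person = (100, 50, 160, 200)    # made-up detector outputs
hydrant = (105, 205, 155, 300)
print(is_on_top_of(person, hydrant))  # → True
```

Note that nothing in the detector told us this relationship exists; we had to invent it, pick thresholds, and we would have to repeat this for every relation we care about. That combinatorial burden is exactly the compositionality trap.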