Computer Vision Interview Questions #19 – The Fine-Grained Invariance Trap
Why Global Average Pooling silently destroys the only features that matter in fine-grained classification.
You’re in a Senior Computer Vision interview at Google DeepMind. The interviewer sets a trap:
“We need to classify 10,000 distinct car models (Make, Model, Year) for a demographics study. How do you build the model?”
90% of candidates walk right into it. They say:
“I’ll grab a ResNet-50 pre-trained on ImageNet and fine-tune the final layer.”
They think if a ResNet can distinguish a Golden Retriever from a Border Collie, surely it can distinguish a 2018 Honda Accord from a 2019 Honda Accord.
But actually, the interview is effectively over. They just proved they don’t understand the problem.
The reality is that they rely on standard Transfer Learning, assuming that feature extractors learned from general objects (cats, tables, planes) will work for specific sub-categories. They fail to realize that they are fighting the model’s own design.
Standard architectures (like ResNet or VGG) use aggressive downsampling and Global Average Pooling to create Invariance to position and small deformations. They are designed to ignore exactly those details. They want to say “this is a car” regardless of whether the headlight is round or square.
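To make that concrete, here is a minimal sketch in plain Python (no framework) of what Global Average Pooling does to a single feature channel. The two 4x4 activation maps are made-up toy values, not outputs of a real network, and `global_average_pool` is a hypothetical helper written for illustration. The last lines assume the usual 224x224 input and the overall stride of 32 used by ResNet-50.

```python
def global_average_pool(feature_map):
    """Collapse a 2D feature map (list of rows) into a single number."""
    total = sum(sum(row) for row in feature_map)
    count = sum(len(row) for row in feature_map)
    return total / count

# Toy activations for one channel (illustrative values only).
# Same total activation, completely different spatial layout.
map_a = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]  # one compact blob in the center
map_b = [
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
]  # four isolated corner activations

pooled_a = global_average_pool(map_a)
pooled_b = global_average_pool(map_b)
print(pooled_a, pooled_b)  # 0.25 0.25

# Assumed setup: 224x224 input, overall stride 32 (as in ResNet-50).
input_size = 224
total_stride = 32
final_size = input_size // total_stride
print(final_size)  # 7, i.e. a 7x7 grid before pooling
```

After pooling, the classifier sees the same value for both maps, so the layout of the activations is gone. And with a 7x7 grid, each cell summarizes roughly a 32x32 pixel patch of the input, which may be about the size of the headlight or badge that separates two model years.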

