Computer Vision Interview Questions #1 – The Translation Equivariance Efficiency Trap
Why CNNs learn one visual feature once, while dense networks must relearn it at every pixel.
You’re in a Senior Computer Vision Engineer interview at Tesla and the interviewer drops this on you:
“We all know 𝐂𝐍𝐍𝐬 are translation equivariant. But why exactly does that property make them exponentially more data-efficient than a 𝐅𝐮𝐥𝐥𝐲 𝐂𝐨𝐧𝐧𝐞𝐜𝐭𝐞𝐝 𝐧𝐞𝐭𝐰𝐨𝐫𝐤 for processing high-res images?”
Most candidates say: “It means if you shift the input image, the output feature map shifts by the same amount. Also, convolutions use fewer parameters because their kernels are small.”
𝐖𝐡𝐲 𝐭𝐡𝐢𝐬 𝐟𝐚𝐢𝐥𝐬: They just gave a textbook definition of the math. They didn’t answer the engineering question about efficiency.
The real answer isn’t about the math of shifting pixels; it’s about 𝐒𝐭𝐚𝐭𝐢𝐬𝐭𝐢𝐜𝐚𝐥 𝐄𝐟𝐟𝐢𝐜𝐢𝐞𝐧𝐜𝐲 and 𝐈𝐧𝐝𝐮𝐜𝐭𝐢𝐯𝐞 𝐁𝐢𝐚𝐬.
Here is the reality 𝘢 𝘋𝘦𝘯𝘴𝘦 (𝘍𝘶𝘭𝘭𝘺 𝘊𝘰𝘯𝘯𝘦𝘤𝘵𝘦𝘥) network faces:
1️⃣ It has no concept of space. A Dense network treats pixel (0,0) and pixel (100,100) as completely unrelated variables.
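A back-of-the-envelope comparison makes this concrete. The sketch below (my own illustrative numbers, assuming a 224×224 grayscale input and a single 3×3 kernel, neither of which is specified in the post) counts the weights a Dense layer needs to produce a same-sized output map versus the weights one shared convolutional kernel needs:

```python
# Parameter count: Dense vs. Conv for one feature map.
# Assumed sizes for illustration: 224x224 grayscale input, 3x3 kernel.
H = W = 224
n_pixels = H * W                    # 50,176 input pixels

# Dense layer producing an equally sized output: every output unit
# connects to every input pixel, so the same "edge detector" must be
# relearned independently at each of the 50,176 output positions.
dense_params = n_pixels * n_pixels  # weights only, biases ignored

# Conv layer with a single 3x3 kernel: one shared set of 9 weights is
# slid across all positions, so the feature is learned exactly once.
conv_params = 3 * 3

print(f"Dense: {dense_params:,} weights")  # 2,517,630,976
print(f"Conv:  {conv_params} weights")     # 9
print(f"Ratio: {dense_params // conv_params:,}x")
```

Weight sharing is exactly what buys translation equivariance: because the same 9 numbers score every spatial position, a shifted input produces a correspondingly shifted response for free.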