Computer Vision Interview Questions #21 – The Data Scaling Trap
Why scaling vision data fails in the real world, and how semantic handoff beats brute-force perception.
You’re in a Senior Robotics interview at NVIDIA. The interviewer sets a trap:
“We need a robot to open any drawer in any user’s home. We cannot pre-train it on every possible handle shape. How do you build this?”
90% of candidates walk right into the **Data Scaling** trap.
They say: “We need more data. Let’s scrape 10 million images of drawers or build a massive NVIDIA Omniverse simulation with procedurally generated handles. We’ll train an end-to-end ResNet policy that maps pixels directly to motor torques.”
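For concreteness, here is a minimal sketch of what that trap answer looks like in code. The 7-DoF torque head and input shape are illustrative assumptions, not anything from the interview prompt:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class EndToEndPolicy(nn.Module):
    """The trap answer in code: one opaque mapping from pixels to torques."""

    def __init__(self, num_joints: int = 7):  # 7-DoF arm is an assumption
        super().__init__()
        backbone = resnet18(weights=None)
        backbone.fc = nn.Identity()            # expose the 512-d visual features
        self.backbone = backbone
        self.torque_head = nn.Linear(512, num_joints)  # features -> joint torques

    def forward(self, rgb: torch.Tensor) -> torch.Tensor:
        # rgb: (batch, 3, 224, 224) camera frame -> (batch, num_joints) torques
        return self.torque_head(self.backbone(rgb))

policy = EndToEndPolicy()
torques = policy(torch.randn(1, 3, 224, 224))
```

Every new handle shape, texture, or lighting condition has to be absorbed by this single mapping. There is no seam where an interpretable abstraction can be inspected, tested, or swapped out.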
*This fails in production. Why?*
Because reality has an infinite long tail. The moment the robot sees a handle with a weird texture or a lighting condition your sim didn’t cover, the end-to-end black box fails. You cannot brute-force **The Wild**.
Strong candidates aren’t optimizing for *memorization*. They are optimizing for *composability*.
Trying to teach a neural network to memorize the physics of every drawer in existence is a waste of compute. You don’t need a bigger dataset; you need a smarter architecture that separates *Logic* from *Perception*.
-----
The Solution: You implement **The Semantic Handoff**.
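As a rough sketch of what that separation could look like in code: the module names, the `GraspPose` fields, and the waypoint geometry below are illustrative assumptions, not a prescribed implementation. Perception reduces pixels to a small semantic interface, and the logic layer only ever consumes that interface:

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class GraspPose:
    """The semantic interface: where the handle is, not what it looks like."""
    position: np.ndarray   # (3,) handle centroid in the robot frame, meters
    pull_axis: np.ndarray  # (3,) unit vector the drawer slides along

def perceive_handle(rgb: np.ndarray, depth: np.ndarray) -> GraspPose:
    """Perception: the only module that ever sees pixels. Swap in any
    detector (learned, open-vocabulary, classical) without touching the logic."""
    raise NotImplementedError("plug in a handle detector here")

def open_drawer(grasp: GraspPose, pull_distance: float = 0.3) -> list[np.ndarray]:
    """Logic: a fixed, handle-agnostic routine over geometry. A weird
    texture or bad lighting cannot break it, because it never sees pixels."""
    pre_grasp = grasp.position + 0.05 * grasp.pull_axis        # hover in front
    retreat = grasp.position + pull_distance * grasp.pull_axis  # pull it open
    return [pre_grasp, grasp.position, retreat]                # arm waypoints

# Usage with a hand-specified pose (a real system would call perceive_handle):
grasp = GraspPose(np.array([0.6, 0.0, 0.4]), np.array([1.0, 0.0, 0.0]))
waypoints = open_drawer(grasp)
```

The design point is the seam: the perception module can be retrained or replaced for any new handle, while the drawer-opening logic stays fixed and testable.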