Computer Vision Interview Questions #22 - The Interactive Segmentation Trap
Why guessing user intent from one click leads to blurry masks and unstable convergence.
You’re in a final-round Computer Vision interview at OpenAI. The interviewer pulls up an image of a pair of scissors and draws a single dot on the handle.
“Your user clicks here. What mask does your model output?”
90% of candidates walk right into the trap.
“I’d train the model to output the most likely object, the whole scissors,” they say.
Or perhaps, “I’d train it to detect the specific part, the handle, based on the pixel class.”
It sounds decisive. It’s also fatal for their loss curve.
They are assuming the user’s intent is knowable. It isn’t.
A single point is mathematically ambiguous. Does the user want the handle? The blade? Or the entire tool?
If they force the model to converge on one “correct” answer during training, they punish it for producing other, equally valid hypotheses.
If Image A says “dot on handle = scissors” and Image B says “dot on handle = handle,” they create conflicting gradients. The model tries to satisfy both, fails, and converges on a blurry, mediocre average.
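To see why averaging conflicting labels is the failure mode, here is a hedged sketch of one well-known remedy: predict several mask hypotheses per click and supervise only the one closest to the ground truth (the min-over-K loss popularized by models like SAM). The function and variable names below are my own illustration, not a specific published implementation.

```python
import numpy as np

def min_over_hypotheses_loss(pred_masks, gt_mask):
    """Score each predicted mask against the ground truth and
    supervise only the best-matching hypothesis (illustrative
    sketch of a min-over-K loss; names are hypothetical)."""
    eps = 1e-7
    p = np.clip(pred_masks, eps, 1 - eps)          # (K, H, W) probabilities
    # Per-hypothesis binary cross-entropy, averaged over pixels.
    bce = -(gt_mask * np.log(p) + (1 - gt_mask) * np.log(1 - p))
    per_hyp = bce.mean(axis=(1, 2))                # (K,)
    best = int(np.argmin(per_hyp))
    # Only the winning hypothesis would receive gradient; the others
    # stay free to represent "handle", "blade", or "whole scissors".
    return per_hyp[best], best

# Toy example: the click lands on the handle; the model proposes three masks.
gt = np.zeros((4, 4))
gt[:2, :2] = 1.0                                   # ground truth: "handle"
hyps = np.stack([
    np.full((4, 4), 0.9),                          # "whole scissors"
    np.where(gt == 1, 0.9, 0.1),                   # "handle" (close match)
    np.full((4, 4), 0.1),                          # "mostly background"
])
loss, winner = min_over_hypotheses_loss(hyps, gt)
print(winner)  # → 1: the "handle" hypothesis wins and alone is penalized
```

Because only the closest hypothesis is penalized on each example, Image A's "scissors" label and Image B's "handle" label train different heads instead of fighting over one, so the conflicting-gradient averaging never happens.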
-----
𝐓𝐡𝐞 𝐒𝐨𝐥𝐮𝐭𝐢𝐨𝐧: You don’t guess. You architect for uncertainty using 𝐓𝐡𝐞 𝐀𝐦𝐛𝐢𝐠𝐮𝐢𝐭𝐲 𝐃𝐞𝐜𝐨𝐮𝐩𝐥𝐢𝐧𝐠.