AI Interview Prep

Computer Vision Interview Questions #7 – The Receptive Field Trap

Why replacing a 7×7 convolution with three 3×3 layers isn’t about parameters — it’s about nonlinear expressivity.

Hao Hoang
Jan 08, 2026

You are in a Senior Computer Vision interview at OpenAI. The interviewer sets a classic trap:

“In VGGNet, we replace a single 7×7 convolution with a stack of three 3×3 convolutions. Why?”

90% of candidates walk right into the trap.

They grab the whiteboard marker and start doing arithmetic.

They say: “It’s about efficiency and parameter reduction. A 7×7 filter has 49 weights (7² = 49). Three 3×3 filters have 27 weights (3 × 3² = 27). So we get the same 7×7 receptive field with roughly 45% fewer parameters. It’s a memory optimization.”
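The arithmetic itself checks out, and it scales to real layers too. Here is a minimal sketch of the count, assuming the usual case where the convolution maps C input channels to C output channels (the channel width 256 is just an example value):

```python
def conv_params(k, c_in, c_out, bias=False):
    """Number of weights in a k×k convolution mapping c_in -> c_out channels."""
    return k * k * c_in * c_out + (c_out if bias else 0)

C = 256  # example channel width; any value gives the same ratio
single = conv_params(7, C, C)        # one 7×7 layer: 49·C² weights
stacked = 3 * conv_params(3, C, C)   # three 3×3 layers: 27·C² weights
print(single, stacked, 1 - stacked / single)  # ~0.449, i.e. ~45% fewer weights
```

The ratio 1 − 27/49 ≈ 45% is independent of the channel count, which is why the whiteboard arithmetic with single-channel filters generalizes.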

The interviewer nods politely. They didn’t get the job.

But the VGG authors aren’t optimizing for storage. They are optimizing for expressivity.

If the goal were purely parameter reduction, there are a dozen ways to factorize matrices. That answer ignores the single most important component of Deep Learning: the activation function.

A single 7×7 layer applies one linear transformation followed by one ReLU. It gets exactly one chance to introduce non-linearity across that entire 7×7 pixel patch.

-----

The Solution: The Senior Engineer realizes that by stacking three layers, you aren’t just covering space, you are injecting non-linearity. You are utilizing The
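A quick sanity check on the receptive-field side of the claim: for stride-1 convolutions, each stacked k×k layer grows the receptive field by k − 1, so three 3×3 layers reach the same 7×7 coverage as one 7×7 layer, while interleaving a ReLU after every layer. A minimal sketch (the helper name is mine, not from the article):

```python
def receptive_field(kernels):
    """Receptive field of a stack of stride-1 convolutions: 1 + sum(k - 1)."""
    rf = 1
    for k in kernels:
        rf += k - 1
    return rf

print(receptive_field([3, 3, 3]))  # 7 -> same spatial coverage as a single 7×7
print(receptive_field([7]))        # 7 -> but the stack gets three ReLUs, not one
```

Same coverage, three non-linearities instead of one: that is the expressivity argument in miniature.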
