Computer Vision Interview Questions #10 – The Early vs Slow Fusion Trap
The hidden activation-memory cost of keeping time alive in deep video networks.
You’re in a Computer Vision Engineer interview at Meta and the interviewer drops this on you:
“We’re debating between 𝘌𝘢𝘳𝘭𝘺 𝘍𝘶𝘴𝘪𝘰𝘯 and 𝘚𝘭𝘰𝘸 𝘍𝘶𝘴𝘪𝘰𝘯 for our new video understanding model. Everyone knows 𝘚𝘭𝘰𝘸 𝘍𝘶𝘴𝘪𝘰𝘯 captures motion better, but what is the specific computational consequence of maintaining that temporal dimension through multiple layers that kills our training budget?”
Don’t say: “𝘚𝘭𝘰𝘸 𝘍𝘶𝘴𝘪𝘰𝘯 is slower because 3D convolutions are just more complex than 2D convolutions.”
Technically true, but it misses the actual bottleneck.
The real killer isn’t the operation complexity alone; it’s the feature map volume explosion.
When you do 𝘌𝘢𝘳𝘭𝘺 𝘍𝘶𝘴𝘪𝘰𝘯, you collapse the temporal dimension (T) immediately, in the very first layer: all T frames get combined in one shot, essentially turning the video into an image. Every subsequent feature map is just H x W x C.
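A minimal PyTorch sketch of that collapse (the clip size, channel counts, and kernel sizes here are illustrative assumptions, not anyone's production config):

```python
import torch
import torch.nn as nn

T, H, W = 16, 112, 112             # assumed clip length and resolution
clip = torch.randn(1, 3, T, H, W)  # (batch, RGB, time, height, width)

# Early fusion: fold time into channels, so a plain 2D conv sees 3*T input channels
x = clip.flatten(1, 2)             # (1, 3*T, H, W) = (1, 48, 112, 112)
early_stem = nn.Conv2d(3 * T, 64, kernel_size=7, stride=2, padding=3)
feat = early_stem(x)
print(feat.shape)                  # torch.Size([1, 64, 56, 56]) -- no time axis left
```

After that first conv, the network is an ordinary 2D image network; nothing downstream ever sees T again.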
In 𝘚𝘭𝘰𝘸 𝘍𝘶𝘴𝘪𝘰𝘯, you keep that T dimension alive deep into the network, merging frames only a few at a time with small temporal kernels.
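Here's the same sketch with a slow fusion stem, again with made-up sizes: note how every intermediate feature map still carries a time axis.

```python
import torch
import torch.nn as nn

clip = torch.randn(1, 3, 16, 112, 112)  # (batch, RGB, T, H, W), sizes assumed

# Slow fusion: small temporal kernels and strides merge frames a few at a time
slow_stem = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=(3, 7, 7), stride=(2, 2, 2), padding=(1, 3, 3)),
    nn.ReLU(),
    nn.Conv3d(64, 128, kernel_size=(3, 3, 3), stride=(2, 2, 2), padding=(1, 1, 1)),
    nn.ReLU(),
)

x = clip
for layer in slow_stem:
    x = layer(x)
    if isinstance(layer, nn.Conv3d):
        print(x.shape)
# torch.Size([1, 64, 8, 56, 56])   <- T = 8 still alive
# torch.Size([1, 128, 4, 28, 28])  <- T = 4 still alive
```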
1️⃣ 𝐓𝐡𝐞 𝐌𝐞𝐦𝐨𝐫𝐲 𝐓𝐫𝐚𝐩: You aren’t just storing weights; backprop forces you to cache activations for every surviving time step at every layer. Each slow fusion feature map therefore costs roughly T times the memory of its 2D counterpart, and that factor multiplies across layers and batch size, as the back-of-envelope below shows.
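A quick comparison in fp32, using the toy shapes from the sketches above (one sample, activations only, no weights or optimizer state):

```python
# Activations that backprop must cache, in fp32 (4 bytes per element).
# Shapes are the illustrative ones from the sketches above, not real configs.
def mbytes(shape):
    n = 1
    for d in shape:
        n *= d
    return n * 4 / 1e6

early = [(1, 64, 56, 56), (1, 128, 28, 28)]        # time collapsed at layer 1
slow = [(1, 64, 8, 56, 56), (1, 128, 4, 28, 28)]   # T = 8, then T = 4, still alive

print(f"early fusion: {sum(mbytes(s) for s in early):.1f} MB")  # ~1.2 MB
print(f"slow fusion:  {sum(mbytes(s) for s in slow):.1f} MB")   # ~8.0 MB
```

That's roughly 8 MB versus 1.2 MB for just two shallow layers and a batch of one. Scale it by a real batch size and network depth, and it's the activation memory, not the FLOPs, that blows the training budget.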


