Computer Vision Interview Questions #10 – The Early vs Slow Fusion Trap
The hidden activation-memory cost of keeping time alive in deep video networks.
You’re in a Computer Vision Engineer interview at Meta and the interviewer drops this on you:
“We’re debating between 𝘌𝘢𝘳𝘭𝘺 𝘍𝘶𝘴𝘪𝘰𝘯 and 𝘚𝘭𝘰𝘸 𝘍𝘶𝘴𝘪𝘰𝘯 for our new video understanding model. Everyone knows 𝘚𝘭𝘰𝘸 𝘍𝘶𝘴𝘪𝘰𝘯 captures motion better, but what is the specific computational consequence of maintaining that temporal dimension through multiple layers that kills our training budget?”
Don’t say: “𝘚𝘭𝘰𝘸 𝘍𝘶𝘴𝘪𝘰𝘯 is slower because 3D convolutions are just more complex than 2D convolutions.”
Technically true, but it misses the actual bottleneck.
The real killer isn’t the operation complexity alone; it’s the feature map volume explosion.
When you do 𝘌𝘢𝘳𝘭𝘺 𝘍𝘶𝘴𝘪𝘰𝘯, you collapse the temporal dimension (T) immediately, in the very first layer: all T frames get combined in one shot, essentially turning the video into an image. Every subsequent feature map is just H x W x C.
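A minimal PyTorch sketch of that collapse (the clip size, channel counts, and kernel sizes here are illustrative assumptions, not anyone's production config):

```python
import torch
import torch.nn as nn

T, H, W = 16, 112, 112             # assumed clip length and resolution
clip = torch.randn(1, 3, T, H, W)  # (batch, RGB, time, height, width)

# Early fusion: fold time into channels, so a plain 2D conv sees 3*T input channels
x = clip.flatten(1, 2)             # (1, 3*T, H, W) = (1, 48, 112, 112)
early_stem = nn.Conv2d(3 * T, 64, kernel_size=7, stride=2, padding=3)
feat = early_stem(x)
print(feat.shape)                  # torch.Size([1, 64, 56, 56]) -- no time axis left
```

After that first conv, the network is an ordinary 2D image network; nothing downstream ever sees T again.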
In 𝘚𝘭𝘰𝘸 𝘍𝘶𝘴𝘪𝘰𝘯, you keep that T dimension alive deep into the network, merging frames only a few at a time with small temporal kernels.
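Here's the same sketch with a slow fusion stem, again with made-up sizes: note how every intermediate feature map still carries a time axis.

```python
import torch
import torch.nn as nn

clip = torch.randn(1, 3, 16, 112, 112)  # (batch, RGB, T, H, W), sizes assumed

# Slow fusion: small temporal kernels and strides merge frames a few at a time
slow_stem = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=(3, 7, 7), stride=(2, 2, 2), padding=(1, 3, 3)),
    nn.ReLU(),
    nn.Conv3d(64, 128, kernel_size=(3, 3, 3), stride=(2, 2, 2), padding=(1, 1, 1)),
    nn.ReLU(),
)

x = clip
for layer in slow_stem:
    x = layer(x)
    if isinstance(layer, nn.Conv3d):
        print(x.shape)
# torch.Size([1, 64, 8, 56, 56])   <- T = 8 still alive
# torch.Size([1, 128, 4, 28, 28])  <- T = 4 still alive
```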
1️⃣ 𝐓𝐡𝐞 𝐌𝐞𝐦𝐨𝐫𝐲 𝐓𝐫𝐚𝐩: You aren’t just storing weights; backprop forces you to cache activations for every surviving time step at every layer. Each slow fusion feature map therefore costs roughly T times the memory of its 2D counterpart, and that factor multiplies across layers and batch size, as the back-of-envelope below shows.
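A quick comparison in fp32, using the toy shapes from the sketches above (one sample, activations only, no weights or optimizer state):

```python
# Activations that backprop must cache, in fp32 (4 bytes per element).
# Shapes are the illustrative ones from the sketches above, not real configs.
def mbytes(shape):
    n = 1
    for d in shape:
        n *= d
    return n * 4 / 1e6

early = [(1, 64, 56, 56), (1, 128, 28, 28)]        # time collapsed at layer 1
slow = [(1, 64, 8, 56, 56), (1, 128, 4, 28, 28)]   # T = 8, then T = 4, still alive

print(f"early fusion: {sum(mbytes(s) for s in early):.1f} MB")  # ~1.2 MB
print(f"slow fusion:  {sum(mbytes(s) for s in slow):.1f} MB")   # ~8.0 MB
```

That's roughly 8 MB versus 1.2 MB for just two shallow layers and a batch of one. Scale it by a real batch size and network depth, and it's the activation memory, not the FLOPs, that blows the training budget.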


