Computer Vision Interview Questions #23 - The Flamingo Architecture Trap
The exact components you need to add vision to a frozen LLM - without paying the fine-tuning cost.
You’re in a Senior AI Interview at Google DeepMind. The interviewer sets a trap:
“We have a 70B parameter LLM. We need it to ‘see’ images. But here’s the constraint: We have zero budget to fine-tune the 70B weights, and we can’t afford to destroy the model’s existing reasoning capabilities.”
90% of candidates walk right into the trap.
They say: “Easy. Just turn the images into tokens and concatenate them with the text prompt.”
Here is why that answer fails the interview:
1️⃣ 𝐂𝐨𝐧𝐭𝐞𝐱𝐭 𝐄𝐱𝐩𝐥𝐨𝐬𝐢𝐨𝐧: Raw image patches flood the context window, leaving little room for the reasoning the prompt was actually meant to do (the quick arithmetic after this list makes it concrete).
2️⃣ 𝐃𝐞𝐬𝐭𝐫𝐮𝐜𝐭𝐢𝐯𝐞 𝐈𝐧𝐭𝐞𝐫𝐟𝐞𝐫𝐞𝐧𝐜𝐞: Even with a parameter-efficient method like LoRA, training on image-text data risks shifting the core LLM’s distribution and degrading its existing reasoning unless the new pathways are initialized to be a no-op.
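A quick back-of-the-envelope check shows how fast the naive approach burns context. The sketch below is illustrative only: the 14-pixel patch size, the image resolutions, the image count, and the 4,096-token context length are assumptions for the sake of the arithmetic, not the numbers of any specific production model.

```python
# Back-of-the-envelope: how many tokens do raw ViT patches cost?
# All numbers below (patch size, resolutions, context length, image count)
# are illustrative assumptions.

def patch_tokens(image_size: int, patch_size: int = 14) -> int:
    """Number of patch tokens a ViT-style encoder produces for a square image."""
    return (image_size // patch_size) ** 2

context_window = 4096      # assumed LLM context length
images_in_prompt = 5       # e.g. a few-shot multimodal prompt

for res in (224, 448, 896):
    per_image = patch_tokens(res)
    total = per_image * images_in_prompt
    print(f"{res}px -> {per_image} tokens/image, "
          f"{total} tokens for {images_in_prompt} images "
          f"({100 * total / context_window:.0f}% of a {context_window}-token window)")
```

At 448px, five images already consume more tokens than the entire assumed window; the prompt dies before any reasoning happens.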
The interviewer isn’t looking for a “𝘱𝘳𝘰𝘮𝘱𝘵 𝘦𝘯𝘨𝘪𝘯𝘦𝘦𝘳𝘪𝘯𝘨” hack. They are testing whether the candidate knows how to adapt the model structurally, not just at the input level, while keeping 99% of the compute graph frozen.
To pass, you must identify the two specific “surgical” components from the Flamingo architecture:
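In the Flamingo paper (Alayrac et al., 2022), those two components are:

1️⃣ A 𝐏𝐞𝐫𝐜𝐞𝐢𝐯𝐞𝐫 𝐑𝐞𝐬𝐚𝐦𝐩𝐥𝐞𝐫 that takes however many features the frozen vision encoder produces and compresses them into a small, fixed set of visual tokens (64 in the paper). That caps the visual budget and solves the context explosion.

2️⃣ 𝐆𝐚𝐭𝐞𝐝 𝐜𝐫𝐨𝐬𝐬-𝐚𝐭𝐭𝐞𝐧𝐭𝐢𝐨𝐧 (GATED XATTN-DENSE) layers interleaved between the frozen LM blocks, with tanh gates initialized to zero, so at the start of training the model is numerically identical to the original LLM. That solves the destructive interference: only these new layers and the resampler are trained, and every original weight stays frozen.

Here is a minimal PyTorch sketch of both pieces. The dimensions, head counts, and number of latents are illustrative assumptions rather than the paper’s exact hyperparameters, and it omits details such as the resampler’s feed-forward layers, per-layer norms, and attention masking.

```python
# Minimal sketch of Flamingo's two "surgical" additions around a frozen LLM.
# Hyperparameters are illustrative, not the paper's.
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Compress a variable number of vision-encoder features into a fixed,
    small set of visual tokens via learned latent queries."""
    def __init__(self, dim: int = 1024, num_latents: int = 64,
                 num_heads: int = 8, depth: int = 2):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.layers = nn.ModuleList([
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(depth)
        ])
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, num_patches, dim) -- num_patches can vary
        x = self.latents.unsqueeze(0).expand(visual_feats.size(0), -1, -1)
        for attn in self.layers:
            # Latents attend to [latents; visual features], as in the paper
            kv = torch.cat([x, visual_feats], dim=1)
            x = x + attn(x, kv, kv, need_weights=False)[0]
        return self.norm(x)  # (batch, num_latents, dim) -- fixed length


class GatedCrossAttentionBlock(nn.Module):
    """Cross-attention + FFN inserted before a frozen LM block.
    tanh gates start at 0, so at init the block is a no-op and the
    LLM's behaviour is untouched."""
    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.attn_gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0 -> no-op
        self.ffn_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_tokens: torch.Tensor,
                visual_tokens: torch.Tensor) -> torch.Tensor:
        # Text queries attend to the resampled visual tokens
        attn_out = self.attn(text_tokens, visual_tokens, visual_tokens,
                             need_weights=False)[0]
        x = text_tokens + torch.tanh(self.attn_gate) * attn_out
        x = x + torch.tanh(self.ffn_gate) * self.ffn(x)
        return x  # same shape as text_tokens -> feed into the frozen LM block
```

Because tanh(0) = 0, the gated block passes text tokens through unchanged at initialization, so the frozen LLM’s reasoning is preserved by construction; gradient updates only ever touch the handful of new parameters, which is exactly what the interviewer’s zero-fine-tuning constraint demands.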