AI Interview Prep

Computer Vision Interview Questions #23 - The Flamingo Architecture Trap

The exact components you need to add vision to a frozen LLM - without paying the fine-tuning cost.

Hao Hoang
Jan 24, 2026

You’re in a Senior AI Interview at Google DeepMind. The interviewer sets a trap:

“We have a 70B parameter LLM. We need it to ‘see’ images. But here’s the constraint: We have zero budget to fine-tune the 70B weights, and we can’t afford to destroy the model’s existing reasoning capabilities.”

90% of candidates walk right into the trap.

They say: “Easy. Just turn the images into tokens and concatenate them with the text prompt.”
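
In code, that naive answer amounts to something like the sketch below (dimensions and names are assumed for illustration; this is not from the post):

```python
import torch
import torch.nn as nn

# Naive plan: project every ViT patch embedding into the LM's embedding
# space and prepend the result to the text prompt as "soft tokens".
# 1024 (vision dim) and 4096 (LM hidden dim) are assumed values.
proj = nn.Linear(1024, 4096)

def naive_multimodal_input(patch_feats: torch.Tensor,
                           text_embeds: torch.Tensor) -> torch.Tensor:
    # patch_feats: (batch, n_patches, 1024); text_embeds: (batch, seq, 4096)
    image_tokens = proj(patch_feats)  # one soft token per patch
    return torch.cat([image_tokens, text_embeds], dim=1)
```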

Here is why that answer fails the interview:

1️⃣ 𝐂𝐨𝐧𝐭𝐞𝐱𝐭 𝐄𝐱𝐩𝐥𝐨𝐬𝐢𝐨𝐧: Raw image patches will flood your context window, leaving no room for actual reasoning (see the token math sketched after this list).

2️⃣ 𝐃𝐞𝐬𝐭𝐫𝐮𝐜𝐭𝐢𝐯𝐞 𝐈𝐧𝐭𝐞𝐫𝐟𝐞𝐫𝐞𝐧𝐜𝐞: Even if you use LoRA, you risk shifting the distribution of the core LLM too much if you aren’t careful with initialization.
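
To put a number on point 1️⃣, here is a quick back-of-envelope sketch (the patch size and resolutions are my own illustrative assumptions, typical of ViT-style encoders):

```python
# How many "image tokens" does naive patch concatenation cost?
def patch_tokens(image_size: int, patch_size: int = 14) -> int:
    return (image_size // patch_size) ** 2

for size in (224, 448, 896):
    print(f"{size}x{size} image -> {patch_tokens(size)} tokens")
# 224x224 ->  256 tokens per image
# 448x448 -> 1024 tokens per image
# 896x896 -> 4096 tokens per image
# Interleave a handful of images and the window is mostly pixels, not reasoning.
```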

The interviewer isn’t looking for a “𝘱𝘳𝘰𝘮𝘱𝘵 𝘦𝘯𝘨𝘪𝘯𝘦𝘦𝘳𝘪𝘯𝘨” hack. They are testing whether the candidate knows how to adapt the model structurally, not just at the input level, while keeping 99% of the compute graph frozen.
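
Before naming those components, it helps to see what “keeping the graph frozen” looks like mechanically. A minimal PyTorch sketch (the helpers are hypothetical, purely for illustration):

```python
import torch.nn as nn

def freeze_base_model(llm: nn.Module) -> None:
    # Freeze every original weight; only modules added afterwards will train.
    for p in llm.parameters():
        p.requires_grad_(False)

def trainable_fraction(model: nn.Module) -> float:
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable / total  # should stay in the low single digits
```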

To pass, you must identify the two specific “surgical” components from the Flamingo architecture: the Perceiver Resampler, which compresses a variable number of visual features into a small, fixed set of latent tokens, and the gated cross-attention (GATED XATTN-DENSE) layers interleaved between the frozen LM blocks, whose tanh gates are initialized to zero so the original model is untouched at the start of training.
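
Here is a minimal PyTorch sketch of both pieces, loosely following the Flamingo paper (Alayrac et al., 2022). Dimensions are assumed, and layer norms plus several details from the paper are omitted, so treat it as a shape-level illustration rather than a faithful implementation:

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Component 1: squeeze any number of patch features into a fixed
    set of latent tokens (the paper uses 64)."""
    def __init__(self, d_model=1024, n_latents=64, n_heads=16, depth=6):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, d_model) * 0.02)
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(depth)
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, n_patches, d_model) from a frozen vision encoder
        x = self.latents.expand(patch_feats.size(0), -1, -1)
        for attn in self.layers:
            # Latents attend to the patches and to themselves, as in the paper.
            kv = torch.cat([patch_feats, x], dim=1)
            out, _ = attn(query=x, key=kv, value=kv)
            x = x + out
        return x  # (batch, n_latents, d_model), regardless of input length

class GatedXAttnBlock(nn.Module):
    """Component 2: cross-attention inserted between frozen LM blocks.
    The tanh gates start at zero, so at step 0 the block is an identity
    and the frozen LLM's output distribution is exactly preserved."""
    def __init__(self, d_model=4096, n_heads=32):
        super().__init__()
        self.xattn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.attn_gate = nn.Parameter(torch.zeros(1))  # tanh(0) == 0
        self.ff_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text: torch.Tensor,
                vision_latents: torch.Tensor) -> torch.Tensor:
        out, _ = self.xattn(query=text, key=vision_latents,
                            value=vision_latents)
        text = text + torch.tanh(self.attn_gate) * out
        return text + torch.tanh(self.ff_gate) * self.ff(text)
```

The zero-initialized gates are what defuse point 2️⃣ above: the vision pathway can only open gradually during training, so the model’s existing reasoning is preserved by construction at initialization.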
