LLM System Design Interview #42 - The Global Memory Trap
Why treating your GPU like a pure calculator creates a hidden latency nightmare, and how kernel tiling and operator fusion keep your data trapped in ultra-fast SRAM where it belongs.
You’re in a Senior AI Engineer interview at DeepMind. The interviewer sets a trap:
“Your training job is unacceptably slow, so you secure the budget to upgrade to a new cluster with 5x the raw teraFLOPs. However, your end-to-end throughput barely increases by 1.2x. What fundamental hardware scaling trend did you fail to account for before upgrading?”
90% of candidates walk right into it.
Most candidates say: “We must be hitting a dataloader bottleneck on the CPU side, or PCIe transfer speeds are choking the pipeline. I would optimize our asynchronous data fetching and increase the batch size to make sure we are properly saturating the new CUDA cores.”
They just failed.
𝐓𝐡𝐞 𝐑𝐞𝐚𝐥𝐢𝐭𝐲:
They are blindly chasing compute without understanding the physics of the “Memory Wall.”
Over the last decade, GPU compute (teraFLOPs) has grown by orders of magnitude, while global memory (HBM) bandwidth has improved at only a fraction of that pace.
You didn’t buy a faster training run; you bought a faster processor that now spends roughly 85% of its time sitting completely idle, waiting for bytes to move from comparatively slow HBM into the Streaming Multiprocessors (SMs).
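A quick sanity check you can do in the interview itself: treat the scenario’s own numbers as an Amdahl’s-law problem and back out how much of the original step time was actually compute-limited. The 5x and 1.2x come straight from the question; the formula is the only assumption.

```python
# Amdahl-style back-of-envelope: only the compute-limited fraction f of step
# time speeds up by 5x; what f is consistent with the observed 1.2x gain?
compute_speedup = 5.0      # raw teraFLOPs uplift from the new cluster
observed_speedup = 1.2     # measured end-to-end throughput gain

# observed_speedup = 1 / ((1 - f) + f / compute_speedup), solved for f:
f = (1 - 1 / observed_speedup) / (1 - 1 / compute_speedup)
print(f"compute-limited fraction of the old step: {f:.0%}")      # ~21%
print(f"fraction spent waiting on memory/IO:      {1 - f:.0%}")  # ~79%
```

In other words, roughly four-fifths of every training step was never touching the FLOPs you just paid 5x more for.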
You are stuck on the memory-bound slope of the hardware Roofline model.
Throwing H100 compute at an unoptimized, low-arithmetic-intensity workload is exactly like putting a Formula 1 engine inside a car with a garden-hose fuel line.
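To see why, put two common kernels on the roofline. The sketch below uses approximate published H100 SXM figures (about 989 dense BF16 TFLOPs and 3.35 TB/s of HBM3 bandwidth; treat both as illustrative, not exact) to compare an elementwise add against a large matmul.

```python
# Back-of-envelope roofline check (illustrative spec numbers, not a benchmark).
PEAK_FLOPS = 989e12            # approx. H100 SXM dense BF16 tensor-core peak, FLOP/s
HBM_BANDWIDTH = 3.35e12        # approx. H100 HBM3 bandwidth, bytes/s
RIDGE = PEAK_FLOPS / HBM_BANDWIDTH   # ~295 FLOP/byte needed to become compute-bound

def roofline(name, flops, bytes_moved):
    intensity = flops / bytes_moved            # FLOPs per byte of HBM traffic
    t_compute = flops / PEAK_FLOPS             # time if purely compute-limited
    t_memory = bytes_moved / HBM_BANDWIDTH     # time if purely bandwidth-limited
    bound = "compute-bound" if intensity > RIDGE else "memory-bound"
    print(f"{name}: {intensity:.1f} FLOP/B ({bound}), "
          f"memory time / compute time = {t_memory / t_compute:.2f}")

# Elementwise add of two 1-GiB bf16 tensors: 1 FLOP per element,
# but two reads plus one write of HBM traffic (2 bytes per element each).
n = 512 * 1024 * 1024
roofline("elementwise add", flops=n, bytes_moved=3 * n * 2)

# Dense bf16 matmul at N=8192: 2*N^3 FLOPs over roughly 3*N^2*2 bytes of traffic.
N = 8192
roofline("8192^3 matmul", flops=2 * N**3, bytes_moved=3 * N * N * 2)
```

The elementwise add is limited almost entirely by HBM bandwidth, so extra FLOPs are wasted on it; only the matmul-like, high-intensity work benefits from the upgrade. And because the ridge point climbs with every GPU generation, the same low-intensity kernels only get relatively worse on newer hardware.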
𝐓𝐡𝐞 𝐒𝐨𝐥𝐮𝐭𝐢𝐨𝐧:
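This is where kernel tiling and operator fusion come in: raise the arithmetic intensity of those low-intensity ops by keeping intermediates in on-chip SRAM instead of round-tripping them through HBM. Below is a minimal sketch of the idea as a row-wise fused softmax in Triton (the kernel and wrapper names are mine, and it assumes a contiguous input whose rows each fit in a single tile): one kernel reads a row from global memory once, does the max, exponentiation, and normalization entirely on-chip, and writes the result once, instead of launching several kernels that each materialize an intermediate tensor in HBM.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_softmax_kernel(out_ptr, in_ptr, n_cols, BLOCK_SIZE: tl.constexpr):
    row = tl.program_id(0)                       # one program instance per row
    offs = tl.arange(0, BLOCK_SIZE)
    mask = offs < n_cols
    # One read from HBM; the row now lives in registers / on-chip SRAM.
    x = tl.load(in_ptr + row * n_cols + offs, mask=mask, other=-float("inf"))
    # Entire softmax computed on-chip: no intermediate ever touches global memory.
    x = x - tl.max(x, axis=0)
    num = tl.exp(x)
    y = num / tl.sum(num, axis=0)
    # One write back to HBM.
    tl.store(out_ptr + row * n_cols + offs, y, mask=mask)

def fused_softmax(x: torch.Tensor) -> torch.Tensor:
    n_rows, n_cols = x.shape                     # assumes a contiguous 2D tensor
    out = torch.empty_like(x)
    block = triton.next_power_of_2(n_cols)       # whole row as one tile (assumed to fit)
    fused_softmax_kernel[(n_rows,)](out, x, n_cols, BLOCK_SIZE=block)
    return out

x = torch.randn(4096, 1024, device="cuda")
print(torch.allclose(fused_softmax(x), torch.softmax(x, dim=-1), atol=1e-5))
```

Fuse enough of the chain this way (the same tiling-plus-fusion idea, scaled up across the whole attention block, is what FlashAttention-style kernels do) and the workload climbs off the memory-bound slope, which is when those 5x teraFLOPs finally show up in your step time.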