LLM System Design Interview #40 - The Expert Capacity Paradox
Why batched MoE inference silently produces non-deterministic outputs even at zero temperature, and how to trade FLOPs for drop-free routing to restore mathematical determinism.
You’re in a Senior ML Engineer interview at DeepMind. The interviewer sets a trap:
You deploy a massive Mixture of Experts (MoE) model for batch inference. Temperature is strictly set to 0. A major enterprise client files a furious bug report: “Sending the exact same prompt yields slightly different outputs depending on the time of day.” Assuming zero hardware faults or floating-point non-determinism, what is silently altering the forward pass?
95% of candidates walk right into it.
Most candidates say: “It must be a KV cache corruption issue. Since Temperature 0 is mathematically deterministic, the continuous batching engine must be leaking state, or there’s an alignment bug in your padding masks.”
Wrong. They just failed.
𝐓𝐡𝐞 𝐑𝐞𝐚𝐥𝐢𝐭𝐲:
That is a debugging wild goose chase. The KV cache is perfectly fine. The real culprit is the fundamental system physics of how MoE routers handle cross-batch traffic.
In an MoE, tokens are routed to specific experts (MLPs) distributed across physical GPUs. But those GPUs have strict compute and VRAM limits. To prevent catastrophic OOMs or massive latency spikes, inference engines enforce an expert capacity: a hard cap on the number of tokens a single expert can process per forward pass, derived from a configurable “capacity factor”.
If a random batch happens to contain a high concentration of tokens that all strongly prefer Expert #4, Expert #4 hits its physical limit.
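To make the cap concrete, here is a minimal sketch of the standard capacity formula (the form used in Switch Transformers and GShard); the function name and the example numbers are illustrative, not any specific engine’s API:

```python
# Sketch of the standard expert-capacity formula (Switch Transformers / GShard
# style); illustrative names, often rounded up in real implementations.
def expert_capacity(tokens_in_batch: int, num_experts: int,
                    capacity_factor: float = 1.25) -> int:
    # Each expert gets a fixed number of "slots" per forward pass; anything
    # routed beyond this cap has to be dropped or re-routed.
    return int(capacity_factor * tokens_in_batch / num_experts)

# e.g. a 4096-token batch over 64 experts at capacity_factor=1.25 -> 80 slots
print(expert_capacity(4096, 64))  # 80
```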
𝐓𝐡𝐞 𝐒𝐨𝐥𝐮𝐭𝐢𝐨𝐧:
When an expert overflows during batch inference, the system executes Token Dropping. Here is exactly what is happening to your client’s prompts (a minimal code sketch follows the list):
1️⃣ The overloaded expert physically cannot compute the overflow, so it rejects the excess tokens.
2️⃣ Those dropped tokens bypass the expert MLP entirely. Their expert contribution is zeroed out, so the residual stream just carries them forward unchanged.
3️⃣ Because your client’s prompt is batched with other users’ random queries, the “competition” for experts changes every millisecond.
4️⃣ Your client’s token might get processed successfully at 9:00 AM, but get dropped at 9:05 AM because another user’s prompt temporarily hogged that specific expert.
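The sketch below shows the mechanics in PyTorch-like pseudocode: top-1 routing with a hard capacity cap, where overflow tokens skip the expert and ride the residual. Names like `router`, `experts`, and `capacity_factor` are illustrative assumptions, not a real engine’s API.

```python
# Minimal sketch (assumed names, not production code): top-1 MoE routing with
# a hard capacity cap and silent token dropping.
import torch

def moe_forward(x, router, experts, capacity_factor=1.25):
    """x: [num_tokens, d_model] flattened batch; experts: list of expert MLPs."""
    num_tokens, _ = x.shape
    num_experts = len(experts)
    capacity = int(capacity_factor * num_tokens / num_experts)  # slots per expert

    probs = router(x).softmax(dim=-1)      # [num_tokens, num_experts]
    gate, expert_idx = probs.max(dim=-1)   # top-1: each token picks one expert

    out = x.clone()  # residual stream: a dropped token leaves this layer unchanged
    for e, expert in enumerate(experts):
        token_ids = (expert_idx == e).nonzero(as_tuple=True)[0]
        kept = token_ids[:capacity]  # overflow is silently dropped; *which* tokens
                                     # overflow depends on the rest of the batch
        if kept.numel() > 0:
            out[kept] = x[kept] + gate[kept].unsqueeze(-1) * expert(x[kept])
    return out
```

Note where the non-determinism enters: your client’s token is deterministic, but the `[:capacity]` truncation depends on how many of its batch-mates happen to prefer the same expert at that moment.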
To fix this in production, you have to trade compute for determinism: increase the expert capacity factor (wasting FLOPs on padding), or implement drop-free routing optimizations and sequence-wise load balancing like those found in modern architectures (e.g., DeepSeek-V3).
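A back-of-the-envelope illustration (hypothetical numbers) of that trade: with a static cap, zero drops is only guaranteed if every expert could absorb the entire batch, which is why dropless kernels (MegaBlocks-style variable-size expert batches) and better load balancing are the practical routes.

```python
# Hypothetical numbers illustrating the FLOPs-for-determinism trade-off.
num_tokens, num_experts = 4096, 64

capacity_typical  = int(1.25 * num_tokens / num_experts)  # 80 slots per expert
capacity_dropfree = num_tokens  # worst case: the whole batch routes to one expert

print(f"slots per expert: {capacity_typical} -> {capacity_dropfree} "
      f"(~{capacity_dropfree / capacity_typical:.0f}x worst-case provisioning)")
```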
𝐓𝐡𝐞 𝐀𝐧𝐬𝐰𝐞𝐫 𝐓𝐡𝐚𝐭 𝐆𝐞𝐭𝐬 𝐘𝐨𝐮 𝐇𝐢𝐫𝐞𝐝:
In MoEs, batch composition dictates expert load; if an expert overflows due to cross-batch competition, tokens are dropped and pass unchanged through the residual connection, injecting stochasticity into an otherwise deterministic T=0 pipeline.
#MachineLearning #MLEngineering #LLM #MoE #AIInfrastructure #DeepLearning #SystemDesign


📚 Related Papers:
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. Available at: https://arxiv.org/abs/2101.03961
- GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. Available at: https://arxiv.org/abs/2006.16668
- MegaBlocks: Efficient Sparse Training with Mixture-of-Experts. Available at: https://arxiv.org/abs/2211.15841