LLM System Design Interview #50 - The Rejection Sampling Paradox
Why your expected 2x inference speedup is sitting at exactly 0%, and how domain-specific alignment and dynamic lookahead actually fix the speculative decoding bottleneck.
You’re in a Senior AI Engineer interview at DeepMind. The interviewer sets a trap:
“You deployed a 70B target model with a 1B draft model for speculative decoding. Accuracy is identical, but your expected 2x speedup is sitting at exactly 0%. Why?”
95% of candidates walk right into it.
Most candidates immediately suggest:
“The 1B draft model is too slow and bottlenecking the system. We need to quantize the draft model to INT4, strip out layers, or put it on a dedicated H100 instance so it can generate tokens faster than the 70B model can verify them.”
𝐓𝐡𝐞 𝐑𝐞𝐚𝐥𝐢𝐭𝐲:
The draft model’s speed isn’t the problem; its statistical alignment is.
Speculative decoding guarantees outputs that exactly match the target model’s distribution through rejection sampling. If the 1B model’s probability distribution diverges wildly from the 70B model’s, the target rejects at the first divergent draft token and discards everything drafted after it, which in practice means throwing away almost the entire draft sequence.
You are now paying the compute overhead of running the draft model, plus the full memory-bound penalty of autoregressively generating each token on the 70B model anyway. You have essentially built a highly complex system just to do redundant work. The invisible metric killing your latency is a near-zero Token Acceptance Rate.
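For intuition, here is a minimal sketch of the verification step (plain NumPy; the function and variable names are illustrative, but the accept/reject rule itself is the one from the speculative decoding papers linked below). Each drafted token is accepted with probability min(1, p_target/p_draft); the first rejection discards everything after it and resamples from the residual distribution, which is exactly why a misaligned draft buys you nothing:

```python
import numpy as np

def verify_draft(draft_tokens, q_probs, p_probs, rng=None):
    """Rejection-sampling verification step of speculative decoding.

    draft_tokens: K token ids proposed by the draft model
    q_probs:      (K, vocab) draft-model distributions at each drafted position
    p_probs:      (K+1, vocab) target-model distributions from one parallel pass
    Returns the accepted tokens; the output provably follows the target distribution.
    """
    rng = rng or np.random.default_rng()
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_probs[i][tok], q_probs[i][tok]
        if rng.random() < min(1.0, p / q):       # accept with probability min(1, p/q)
            out.append(tok)
        else:
            # First rejection: resample from the residual max(0, p - q) and stop.
            residual = np.maximum(p_probs[i] - q_probs[i], 0.0)
            out.append(int(rng.choice(len(residual), p=residual / residual.sum())))
            return out                            # later draft tokens are discarded
    # All K drafts accepted: the target's extra distribution yields one bonus token.
    out.append(int(rng.choice(p_probs.shape[1], p=p_probs[-1])))
    return out
```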
𝐓𝐡𝐞 𝐒𝐨𝐥𝐮𝐭𝐢𝐨𝐧: You don’t need a faster draft model. You need a mathematically aligned draft model.
1️⃣ Measure the Acceptance Rate: If your token acceptance rate drops below roughly 30%, speculative decoding becomes a net negative on latency: you keep paying for draft steps and verification passes while accepting almost nothing. The exact break-even depends on your lookahead K and the draft/target cost ratio (see the calculator after this list). Monitor this metric relentlessly.
2️⃣ Distillation is Mandatory: Never just pick a random 1B model off the HuggingFace shelf. Distill the specific 70B target model’s logits directly into the 1B draft model to minimize the KL divergence between their distributions (a minimal training-step sketch follows this list).
3️⃣ Domain-Specific Alignment: If your 70B model is serving Python code, but your 1B model was only trained on general web text, the draft will hallucinate code syntax and get rejected. Fine-tune the draft on the exact same data distribution.
4️⃣ Dynamic Lookahead (K): Don’t stubbornly generate 5 draft tokens every time. Adjust the lookahead based on the real-time acceptance rate of the current sequence: high acceptance, draft further ahead; low acceptance, pull K back toward 1 (a simple controller sketch appears below).
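On point 1️⃣: the ~30% figure is a rule of thumb, not a law. Under the cost model in “Fast Inference from Transformers via Speculative Decoding” (linked below), the break-even depends on the per-token acceptance rate α, the lookahead K, and the draft-to-target cost ratio c. A back-of-the-envelope calculator, assuming i.i.d. acceptance per token and that verifying K tokens costs one target pass:

```python
def expected_speedup(alpha: float, k: int, c: float) -> float:
    """Expected walltime improvement of speculative decoding vs. plain decoding.

    alpha: per-token acceptance rate, assumed i.i.d. across positions (0 <= alpha < 1)
    k:     number of draft tokens per verification cycle (lookahead)
    c:     cost of one draft step relative to one target step (e.g. ~0.05 for 1B vs 70B)
    """
    expected_tokens = (1 - alpha ** (k + 1)) / (1 - alpha)  # tokens gained per target pass
    cycle_cost = c * k + 1                                  # k draft steps + 1 target pass
    return expected_tokens / cycle_cost

# With k=5 and c=0.05: alpha=0.8 gives ~2.95x, alpha=0.3 only ~1.14x,
# and alpha=0.1 drops to ~0.89x, i.e. slower than not speculating at all.
print(expected_speedup(0.8, 5, 0.05), expected_speedup(0.3, 5, 0.05), expected_speedup(0.1, 5, 0.05))
```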
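On point 2️⃣: a minimal distillation-step sketch, assuming HuggingFace-style causal LMs that expose `.logits`; forward KL at a temperature is just one of the divergence choices studied in DistillSpec (linked below):

```python
import torch
import torch.nn.functional as F

def distill_step(draft_model, target_model, input_ids, optimizer, T: float = 1.0):
    """One step of logit distillation: minimize KL(target || draft) at every position."""
    with torch.no_grad():
        teacher_logits = target_model(input_ids).logits   # (B, S, V), frozen 70B target
    student_logits = draft_model(input_ids).logits         # (B, S, V), trainable 1B draft

    # F.kl_div expects log-probs for the student and probs for the teacher.
    loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Running this same loop over the data your 70B model actually serves (e.g. the Python corpus from point 3️⃣) handles domain alignment and distillation in one pass.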
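On point 4️⃣: a generic controller sketch (the class name and thresholds are illustrative, not any particular library’s API): keep an exponential moving average of the acceptance rate and grow or shrink K after each verification cycle.

```python
class DynamicLookahead:
    """Adapts the number of draft tokens K from the observed acceptance rate."""

    def __init__(self, k_init=5, k_min=1, k_max=10, ema=0.9,
                 grow_above=0.8, shrink_below=0.4):
        self.k = k_init
        self.k_min, self.k_max = k_min, k_max
        self.ema = ema
        self.grow_above, self.shrink_below = grow_above, shrink_below
        self.acceptance = 1.0  # start optimistic

    def update(self, accepted: int, drafted: int) -> int:
        """Call after each verification cycle; returns K for the next cycle."""
        rate = accepted / max(drafted, 1)
        self.acceptance = self.ema * self.acceptance + (1 - self.ema) * rate
        if self.acceptance > self.grow_above:
            self.k = min(self.k + 1, self.k_max)   # high confidence: draft further ahead
        elif self.acceptance < self.shrink_below:
            self.k = max(self.k - 1, self.k_min)   # low confidence: pull back
        return self.k
```

Call update(accepted, drafted) right after each verification cycle and use the returned K to size the next draft.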
𝐓𝐡𝐞 𝐀𝐧𝐬𝐰𝐞𝐫 𝐓𝐡𝐚𝐭 𝐆𝐞𝐭𝐬 𝐘𝐨𝐮 𝐇𝐢𝐫𝐞𝐝:
“A 0% speedup in speculative decoding means our Token Acceptance Rate has bottomed out due to high KL divergence; the 70B model is constantly rejecting the 1B model’s outputs, so we need to distill the target’s logits directly into the draft model to align their probability distributions.”
#MachineLearning #MLEngineering #LLMs #SpeculativeDecoding #AIInfrastructure #DeepLearning #SystemDesign


📚 Related Papers:
- Fast Inference from Transformers via Speculative Decoding. Available at: https://arxiv.org/abs/2211.17192
- Accelerating Large Language Model Decoding with Speculative Sampling. Available at: https://arxiv.org/abs/2302.01318
- DistillSpec: Improving Speculative Decoding via Knowledge Distillation. Available at: https://arxiv.org/abs/2310.08461
- Accelerating Speculative Decoding with Block Diffusion Draft Trees. Available at: https://arxiv.org/abs/2604.12989