Machine Learning System Design Interview #26 - The Inference Bottleneck Illusion
Why obsessing over model compression makes you miss your real RecSys latency killer, and how attacking network I/O and feature fetching actually gets you under a strict 100ms SLA.
You’re in a Senior ML Engineer interview at Meta. The interviewer sets a trap: “You’ve built a two-tower recommendation system balancing high recall and high precision. The problem? It takes 400ms to run the pipeline, but product demands a strict 100ms SLA. Where do you cut latency without destroying the user experience?”
95% of candidates walk right into it.
Most candidates immediately suggest compressing the ranking model: INT8 quantization, knowledge distillation, or trimming layers from their deep cross network. They assume the heavy matrix math of ranking inference is the primary bottleneck.
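To make the trap concrete, here is roughly what that answer looks like in code: a minimal post-training dynamic quantization sketch in PyTorch. The RankingMLP model and its layer sizes are hypothetical stand-ins, not part of the original scenario.

```python
# Illustrative only: the "compress the ranker" answer, as a PyTorch sketch.
# RankingMLP and its layer sizes are hypothetical stand-ins for a real ranker.
import torch
import torch.nn as nn

class RankingMLP(nn.Module):
    def __init__(self, in_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = RankingMLP().eval()

# Post-training dynamic INT8 quantization of the Linear layers.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# This shrinks the weights and speeds up the matmuls -- but it does nothing
# about the time the pipeline spends fetching features over the network.
scores = quantized(torch.randn(1000, 512))
```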
They just failed.
𝐓𝐡𝐞 𝐑𝐞𝐚𝐥𝐢𝐭𝐲:
Model inference is almost never the real latency killer in a modern RecSys pipeline.
The real bottleneck is network I/O and feature fetching.
When you retrieve thousands of candidates from your Approximate Nearest Neighbor (ANN) index, you have to fetch real-time user and item features from a remote Key-Value store to feed your ranker.
If your online feature joins are poorly batched, your P99 latency will easily blow past 400ms on network hops alone.
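Here is a minimal sketch of the difference, assuming a hypothetical async feature-store client (FeatureStoreClient, get, and batch_get are illustrative names, not a specific product's API): one network round trip per candidate versus a handful of batched multi-gets.

```python
# Illustrative sketch: per-candidate fetches vs. batched multi-gets.
# FeatureStoreClient / get / batch_get are hypothetical names, not a real API.
import asyncio
from typing import Any

class FeatureStoreClient:
    async def get(self, key: str) -> dict[str, Any]:
        await asyncio.sleep(0.002)          # ~2ms simulated network round trip
        return {"key": key}

    async def batch_get(self, keys: list[str]) -> list[dict[str, Any]]:
        await asyncio.sleep(0.004)          # one slightly larger round trip
        return [{"key": k} for k in keys]

async def fetch_naive(store: FeatureStoreClient, candidate_ids: list[str]):
    # One network hop per candidate: 2000 candidates * ~2ms blows way past 100ms.
    return [await store.get(f"item:{cid}") for cid in candidate_ids]

async def fetch_batched(store: FeatureStoreClient, candidate_ids: list[str],
                        batch: int = 500):
    # A few parallel multi-gets: latency looks like one round trip, not thousands.
    chunks = [candidate_ids[i:i + batch]
              for i in range(0, len(candidate_ids), batch)]
    results = await asyncio.gather(*(store.batch_get(c) for c in chunks))
    return [row for chunk in results for row in chunk]

if __name__ == "__main__":
    ids = [str(i) for i in range(2000)]
    asyncio.run(fetch_batched(FeatureStoreClient(), ids))
```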
You can quantize your ranking model all day, but that only shaves off about 10ms while the system is still stuck waiting on I/O and memory bandwidth.
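A back-of-envelope budget makes the point; every number below is an illustrative assumption, not a measurement from any real pipeline.

```python
# Back-of-envelope latency budget; all numbers are illustrative assumptions.
budget_ms = {
    "ann_candidate_retrieval": 30,
    "feature_fetch_and_join": 320,   # remote KV lookups, network hops, serialization
    "ranking_model_inference": 30,
    "post_ranking_and_response": 20,
}
total = sum(budget_ms.values())      # 400ms end to end
after_quantization = total - 10      # INT8 buys roughly 10ms of inference time
print(total, after_quantization)     # 400 -> 390: still ~4x over a 100ms SLA
```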
𝐓𝐡𝐞 𝐒𝐨𝐥𝐮𝐭𝐢𝐨𝐧:


