AI Interview Prep

Machine Learning System Design Interview #26 - The Inference Bottleneck Illusion

Why obsessing over model compression silently ignores your true RecSys latency killer, and how attacking network I/O and feature fetching actually hits your strict 100ms SLA.

Hao Hoang
May 14, 2026

You’re in a Senior ML Engineer interview at Meta. The interviewer sets a trap: “You’ve built a two-tower recommendation system balancing high recall and high precision. The problem? It takes 400ms to run the pipeline, but product demands a strict 100ms SLA. Where do you cut latency without destroying the user experience?”

95% of candidates walk right into it.

Most candidates immediately suggest compressing the ranking model. They talk about INT8 quantization, knowledge distillation, or reducing the number of layers in their deep cross network. They assume the heavy math of the ranking inference is the primary bottleneck.

They just failed.


The Reality:

Model inference is almost never the silent killer in a modern RecSys pipeline.

The real bottleneck is network I/O and feature fetching.

When you retrieve thousands of candidates from your Approximate Nearest Neighbor (ANN) index, you have to fetch real-time user and item features from a remote Key-Value store to feed your ranker.

If your online feature joins are poorly batched, your P99 latency will easily blow past 400ms on network hops alone.
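To make the batching point concrete, here is a minimal sketch. It assumes a Redis-style remote store that exposes a per-key `get` and a batched `mget`; `FakeKVStore` and all the numbers are illustrative stand-ins, not a real client, and they count network round trips rather than measuring wall-clock time.

```python
class FakeKVStore:
    """Hypothetical remote feature store: every call counts as one network round trip."""

    def __init__(self, features):
        self._features = features
        self.round_trips = 0

    def get(self, key):
        self.round_trips += 1          # one hop per key
        return self._features[key]

    def mget(self, keys):
        self.round_trips += 1          # one hop for the whole batch
        return [self._features[k] for k in keys]


def fetch_naive(store, candidate_ids):
    # N candidates -> N network hops: latency grows linearly with candidate count.
    return [store.get(c) for c in candidate_ids]


def fetch_batched(store, candidate_ids, batch_size=256):
    # ceil(N / batch_size) hops: latency stays nearly flat as N grows.
    feats = []
    for i in range(0, len(candidate_ids), batch_size):
        feats.extend(store.mget(candidate_ids[i:i + batch_size]))
    return feats


features = {i: [0.0] * 8 for i in range(1000)}   # 1,000 ANN candidates, 8 floats each
candidate_ids = list(range(1000))

naive_store = FakeKVStore(features)
fetch_naive(naive_store, candidate_ids)

batched_store = FakeKVStore(features)
fetch_batched(batched_store, candidate_ids)

print(naive_store.round_trips, batched_store.round_trips)  # 1000 vs 4
```

At roughly 1ms per round trip, that is the difference between ~1 second of pure network time and ~4ms, which is exactly why the join strategy, not the ranker's FLOPs, decides whether you make the SLA.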

You can quantize your ranking model all day, but you’ll only shave off 10ms if your system is still bottlenecked waiting on I/O and memory bandwidth.
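A back-of-the-envelope Amdahl's-law check makes this concrete. The budget below uses made-up numbers chosen only to be consistent with the 400ms figure above; the split between stages is an assumption for illustration.

```python
# Illustrative latency budget (ms) for the 400ms pipeline; the split is assumed.
budget = {
    "ann_retrieval": 30,
    "feature_fetch_io": 340,   # network hops + KV reads dominate the critical path
    "ranker_inference": 15,
    "post_processing": 15,
}
total_ms = sum(budget.values())

# Even a 3x inference speedup (e.g., aggressive INT8 quantization)
# only removes inference_time * (1 - 1/3) from the total.
speedup = 3
after_quant_ms = total_ms - budget["ranker_inference"] * (1 - 1 / speedup)

print(total_ms, after_quant_ms)  # 400 390.0
```

Tripling inference speed buys you 10ms against a 300ms gap to the SLA; attacking the 340ms of I/O is the only move that matters.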


The Solution:

