AI Interview Prep

Advanced Reinforcement Learning Interview Questions #4 - The LLM-as-a-Judge Trap

Outsourcing preference labels to a stronger model just distills its bias, while execution-verified self-play produces cleaner gradients and scales without a teacher ceiling.

Hao Hoang
Jan 30, 2026

You’re in a Machine Learning interview at DeepSeek AI and the lead researcher asks:

“We want to train a reasoning model using Direct Preference Optimization (DPO), but we have zero budget for human annotators. How do we procedurally generate high-quality Winner vs. Loser pairs from the model’s own generations?”

Most candidates say: “We should use a stronger model like GPT-4 to score the outputs and create labels (LLM-as-a-Judge).”

Why this fails: It’s expensive, slow, and fundamentally limited by the teacher model’s ceiling. You aren’t teaching reasoning; you’re just distilling bias.

The real bottleneck isn’t “who judges the answer,” it’s how we isolate the error.

You don’t need humans. You need Execution Feedback.

In domains like Math or Coding, Ground Truth is deterministic (the code runs, or the answer is 8). We leverage this to create Self-Generated Preference Pairs.
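To make the idea concrete, here is a minimal Python sketch of turning execution results into winner/loser pairs. It is an illustration under stated assumptions, not the author's production recipe: the names `PreferencePair`, `passes_tests`, and `build_pairs` are hypothetical, the candidates are hard-coded strings standing in for samples drawn from the model, and the bare `exec` call stands in for what would need to be a sandboxed runner with timeouts in any real pipeline.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str    # candidate that passed verification (the "winner")
    rejected: str  # candidate that failed verification (the "loser")


def passes_tests(candidate_src: str, tests: List[Tuple[tuple, object]],
                 fn_name: str = "solve") -> bool:
    """Execute candidate code and check it against (args, expected) cases.

    ASSUMPTION: exec() on model output is unsafe and has no timeout;
    it is used here only to keep the sketch self-contained.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)
        fn = namespace[fn_name]
        return all(fn(*args) == expected for args, expected in tests)
    except Exception:
        # Syntax errors, crashes, and missing functions all count as failures.
        return False


def build_pairs(prompt: str, candidates: List[str],
                verify: Callable[[str], bool],
                max_pairs: int = 4) -> List[PreferencePair]:
    """Pair every verified candidate with every failed one, up to max_pairs."""
    results = [(c, verify(c)) for c in candidates]  # verify each candidate once
    winners = [c for c, ok in results if ok]
    losers = [c for c, ok in results if not ok]
    # If all candidates pass or all fail, there is no contrast and no pairs.
    pairs = [PreferencePair(prompt, w, l) for w, l in product(winners, losers)]
    return pairs[:max_pairs]


# Hypothetical usage: three sampled "generations" for one coding prompt.
prompt = "Write solve(a, b) that returns the sum of a and b."
candidates = [
    "def solve(a, b):\n    return a + b",   # correct
    "def solve(a, b):\n    return a - b",   # runs, but wrong answer
    "def solve(a, b) return a + b",         # does not even parse
]
tests = [((3, 5), 8), ((0, 0), 0), ((-1, 1), 0)]

pairs = build_pairs(prompt, candidates, lambda c: passes_tests(c, tests))
for p in pairs:
    print(repr(p.chosen), ">", repr(p.rejected))
```

Note that the labels come entirely from running the code, so no judge model is involved. Prompts where every sample passes or every sample fails contribute no pairs in this sketch.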

Here is the production recipe:
