Advanced Reinforcement Learning Interview Questions #22 - The Information Density Trap
Maximizing numeric “richness” with 1–10 scores backfires because inconsistent human baselines corrupt the signal before the model ever sees it.
You’re in a Senior RLHF interview at OpenAI. The VP of Engineering sets a trap:
“We have a $50k budget for human labeling. We need a reward model for ‘helpfulness.’ Do we pay humans to score responses on a 1-10 scale, or rank pairs (A > B)?”
90% of candidates walk right into the Scalar Trap.
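The trap hinges on how each labeling scheme turns into a training signal. Pairwise rankings are typically fit with a Bradley-Terry style loss, where only the *difference* between two rewards matters, so each annotator's personal baseline cancels out. Here is a minimal sketch (plain NumPy, illustrative names) showing why two annotators with wildly different 1-10 baselines still yield an identical pairwise signal:

```python
import numpy as np

def bradley_terry_loss(r_chosen, r_rejected):
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).

    Only the reward *difference* enters the loss, so a constant offset in
    any annotator's scale cancels -- the core argument for rankings over
    raw 1-10 scores.
    """
    margin = np.asarray(r_chosen, dtype=float) - np.asarray(r_rejected, dtype=float)
    # log1p(exp(-x)) is a numerically stable form of -log(sigmoid(x))
    return float(np.mean(np.log1p(np.exp(-margin))))

# Hypothetical annotators: one harsh (scores cluster near 3), one generous (near 8).
harsh = {"A": 3.0, "B": 2.0}
generous = {"A": 8.0, "B": 7.0}

# Raw scalars disagree wildly, but both encode the same preference A > B
# with the same margin, so the pairwise loss is identical.
assert bradley_terry_loss(harsh["A"], harsh["B"]) == \
       bradley_terry_loss(generous["A"], generous["B"])
```

Averaged raw scores, by contrast, would put the harsh annotator's best response below the generous annotator's worst one, corrupting the reward model's target.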


