AI Interview Prep

Advanced Reinforcement Learning Interview Questions #22 - The Information Density Trap

Maximizing numeric “richness” with 1–10 scores backfires because inconsistent human baselines corrupt the signal before the model ever sees it.

Hao Hoang
Feb 17, 2026

You’re in a Senior RLHF interview at OpenAI. The VP of Engineering sets a trap:

“We have a $50k budget for human labeling. We need a reward model for ‘helpfulness.’ Do we pay humans to score responses on a 1-10 scale, or rank pairs (A > B)?”

90% of candidates walk right into the Scalar Trap.
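
To make the scalar-versus-pairwise tradeoff concrete, here is a minimal simulation sketch in Python. Everything in it is an illustrative assumption rather than anything from the interview scenario: the `simulate` function, the Gaussian per-annotator baseline offset (`bias_sd`), the perception noise (`noise_sd`), and the setup in which the two responses of a pair receive their 1-10 scores from different annotators, while a pairwise comparison is made by a single annotator. Under those assumptions, the annotator's personal baseline cancels inside a pairwise judgment but leaks into pooled scalar scores.

```python
import random


def simulate(n_pairs=5000, n_annotators=20, bias_sd=2.0, noise_sd=0.5, seed=0):
    """Compare how often scalar scores vs. pairwise ranks recover the true order.

    All parameters are hypothetical and chosen only for illustration.
    Returns (scalar_accuracy, pairwise_accuracy).
    """
    rng = random.Random(seed)
    # Each simulated annotator has a personal baseline offset on the 1-10 scale.
    biases = [rng.gauss(0.0, bias_sd) for _ in range(n_annotators)]

    def perceive(quality, annotator):
        # What the annotator "feels": true quality + personal baseline + noise.
        return quality + biases[annotator] + rng.gauss(0.0, noise_sd)

    def scalar_score(quality, annotator):
        # Scalar labeling: round the perception and clip it to the 1-10 scale.
        return min(10, max(1, round(perceive(quality, annotator))))

    scalar_correct = 0.0
    pairwise_correct = 0.0
    for _ in range(n_pairs):
        quality_a = rng.uniform(3.0, 8.0)
        quality_b = rng.uniform(3.0, 8.0)
        a_is_better = quality_a > quality_b

        # Scalar protocol: A and B are scored by two different annotators.
        ann_a, ann_b = rng.sample(range(n_annotators), 2)
        score_a = scalar_score(quality_a, ann_a)
        score_b = scalar_score(quality_b, ann_b)
        if score_a == score_b:
            scalar_correct += 0.5  # a tie carries no ordering signal
        elif (score_a > score_b) == a_is_better:
            scalar_correct += 1.0

        # Pairwise protocol: one annotator sees both, so the baseline cancels.
        judge = rng.randrange(n_annotators)
        prefers_a = perceive(quality_a, judge) > perceive(quality_b, judge)
        if prefers_a == a_is_better:
            pairwise_correct += 1.0

    return scalar_correct / n_pairs, pairwise_correct / n_pairs


if __name__ == "__main__":
    scalar_acc, pairwise_acc = simulate()
    print(f"scalar ordering accuracy:   {scalar_acc:.3f}")
    print(f"pairwise ordering accuracy: {pairwise_acc:.3f}")
```

In this toy setup the pairwise labels recover the true ordering more often than the pooled 1-10 scores. The size of the gap depends entirely on the assumed `bias_sd` and `noise_sd`, so treat it as an illustration of the mechanism, not as a measurement of real annotators.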
