Advanced Reinforcement Learning Interview Questions #22 - The Information Density Trap
Maximizing numeric “richness” with 1–10 scores backfires because inconsistent human baselines corrupt the signal before the model ever sees it.
You’re in a Senior RLHF interview at OpenAI. The VP of Engineering sets a trap:
“We have a $50k budget for human labeling. We need a reward model for ‘helpfulness.’ Do we pay humans to score responses on a 1-10 scale, or rank pairs (A > B)?”
90% of candidates walk right into the Scalar Trap.
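The trap hinges on how each labeling scheme turns into a training signal. Pairwise rankings are typically fit with a Bradley-Terry style loss, where only the *difference* between two rewards matters, so each annotator's personal baseline cancels out. Here is a minimal sketch (plain NumPy, illustrative names) showing why two annotators with wildly different 1-10 baselines still yield an identical pairwise signal:

```python
import numpy as np

def bradley_terry_loss(r_chosen, r_rejected):
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).

    Only the reward *difference* enters the loss, so a constant offset in
    any annotator's scale cancels -- the core argument for rankings over
    raw 1-10 scores.
    """
    margin = np.asarray(r_chosen, dtype=float) - np.asarray(r_rejected, dtype=float)
    # log1p(exp(-x)) is a numerically stable form of -log(sigmoid(x))
    return float(np.mean(np.log1p(np.exp(-margin))))

# Hypothetical annotators: one harsh (scores cluster near 3), one generous (near 8).
harsh = {"A": 3.0, "B": 2.0}
generous = {"A": 8.0, "B": 7.0}

# Raw scalars disagree wildly, but both encode the same preference A > B
# with the same margin, so the pairwise loss is identical.
assert bradley_terry_loss(harsh["A"], harsh["B"]) == \
       bradley_terry_loss(generous["A"], generous["B"])
```

Averaged raw scores, by contrast, would put the harsh annotator's best response below the generous annotator's worst one, corrupting the reward model's target.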


