Advanced Reinforcement Learning Interview Questions #20 - The Static CoT Trap
Training on reasoning traces without a compute-aware reward produces a pattern imitator, not an agent that allocates inference compute dynamically.
You’re in a Principal AI Engineer interview at a top AI lab and the interviewer asks:
“We’re building a reasoning model like DeepSeek R1. We want the model to burn test-time compute exploring solutions for complex math, but answer instantly for ‘2+2’. How do you formulate the RL objective to achieve this adaptive behavior?”
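One common way to frame an answer is a reward that pays for correctness but charges for reasoning tokens, with the per-token charge shrinking as problem difficulty grows. Below is a minimal sketch of that idea; the function name, the difficulty signal, and the constants are all hypothetical illustrations, not DeepSeek R1's actual objective.

```python
def compute_aware_reward(is_correct: bool,
                         num_reasoning_tokens: int,
                         difficulty: float,
                         alpha: float = 0.5) -> float:
    """Scalar reward for one rollout (hypothetical formulation).

    Correctness pays +1. Each reasoning token incurs a cost that is
    scaled inversely with difficulty in (0, 1], so long chains are
    cheap on hard problems and expensive on easy ones.
    """
    correctness = 1.0 if is_correct else 0.0
    per_token_cost = alpha * (1.0 - difficulty)      # easy -> high cost
    length_penalty = per_token_cost * (num_reasoning_tokens / 1000.0)
    return correctness - length_penalty

# A 2000-token chain on an easy problem (difficulty 0.1) is punished,
# while the same chain on a hard problem (difficulty 0.9) is tolerated.
r_easy_long  = compute_aware_reward(True, 2000, difficulty=0.1)
r_easy_short = compute_aware_reward(True, 5,    difficulty=0.1)
r_hard_long  = compute_aware_reward(True, 2000, difficulty=0.9)
```

Under this shaping, a policy maximizing expected reward learns to answer '2+2' immediately (any extra tokens only subtract reward) while still spending tokens where difficulty makes them nearly free.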


