Advanced Reinforcement Learning Interview Questions #5 - The Success-Only Dataset Trap
Filtering for correct answers discards optimal reasoning steps, preventing models from stitching together better policies than any single trace.
You’re in a Research Scientist interview at Google DeepMind, and the lead researcher throws you a curveball:
“I have a dataset of reasoning traces, but they’re all flawed.
- Trace A starts with perfect logic but hallucinates the final step (Fail).
- Trace B starts with a mista…


