Do We Need to Verify Step by Step? Rethinking Process Supervision from a Theoretical Perspective
Best AI papers explained - A podcast by Enoch H. Kang

This research examines two fundamental paradigms for supervising reinforcement learning: process supervision, which gives fine-grained, step-by-step reward feedback, and outcome supervision, which provides only a cumulative reward at the end of a task. The paper challenges the conventional belief that outcome supervision is inherently harder to learn from, demonstrating that, under certain data conditions, it is no more statistically challenging than process supervision. It further explores how advantage functions, but not necessarily Q-functions, can serve as optimal process reward models when a verifier or rollout capability is available, offering new perspectives on data collection and algorithm design for large language models. A key technical contribution, the "Change of Trajectory Measure Lemma", bridges return-based trajectory measures and step-level distribution shifts; the analysis is then extended to preference-based reinforcement learning, improving on previous analyses.
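
The idea of an advantage function acting as a step-level reward is easiest to see in a small sketch. The following is a minimal, hypothetical Python illustration (not the paper's implementation): given a rollout policy and a binary outcome verifier, the value of a partial solution is estimated by Monte Carlo completions, and the process reward for a candidate step is the resulting advantage estimate V(prefix + step) − V(prefix). All function names and the toy task are assumptions made for illustration.

```python
import random
from typing import Callable, List

def estimate_value(
    prefix: List[str],
    rollout_policy: Callable[[List[str]], List[str]],
    verifier: Callable[[List[str]], float],
    num_rollouts: int = 8,
) -> float:
    """Monte Carlo value of a partial solution: complete it several times
    and average the verifier's final-outcome reward."""
    returns = []
    for _ in range(num_rollouts):
        completion = rollout_policy(prefix)            # sample the remaining steps
        returns.append(verifier(prefix + completion))  # 1.0 if the final answer verifies, else 0.0
    return sum(returns) / num_rollouts

def advantage_process_reward(
    prefix: List[str],
    step: str,
    rollout_policy: Callable[[List[str]], List[str]],
    verifier: Callable[[List[str]], float],
    num_rollouts: int = 8,
) -> float:
    """Step-level process reward as an advantage estimate:
    A(prefix, step) ~= V(prefix + [step]) - V(prefix).
    A positive value means the step made a verified final answer more likely."""
    v_after = estimate_value(prefix + [step], rollout_policy, verifier, num_rollouts)
    v_before = estimate_value(prefix, rollout_policy, verifier, num_rollouts)
    return v_after - v_before

if __name__ == "__main__":
    # Toy illustration: steps are digits of a target string; the verifier
    # checks whether the completed string equals "42".
    target = "42"

    def rollout_policy(prefix: List[str]) -> List[str]:
        # Hypothetical stand-in for sampling the rest of a solution from an LLM.
        return [random.choice("0123456789") for _ in range(len(target) - len(prefix))]

    def verifier(trajectory: List[str]) -> float:
        return 1.0 if "".join(trajectory) == target else 0.0

    good_step = advantage_process_reward([], "4", rollout_policy, verifier, num_rollouts=200)
    bad_step = advantage_process_reward([], "7", rollout_policy, verifier, num_rollouts=200)
    print(f"advantage of step '4': {good_step:+.3f}")  # typically positive
    print(f"advantage of step '7': {bad_step:+.3f}")   # typically non-positive
```

The design choice mirrors the episode's point: a raw Q- or value estimate rewards any step on an already-promising prefix, whereas the advantage isolates how much the step itself changed the chance of a verified outcome, which is what makes it a sensible process reward when rollouts and a verifier are available.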