🔗 Source: arXiv

Learning to Reason Without External Rewards

Mechanism: Replaces verifiable rewards in Group Relative Policy Optimization (GRPO) with the model’s self-certainty as the sole intrinsic reward signal. Self-certainty is the KL divergence between a uniform distribution over the vocabulary and the model’s next-token distribution, averaged over the generated tokens.
Nuance: Eliminates dependency on gold solutions, test cases, or domain-specific verifiers by optimizing process-oriented internal feedback rather than outcome-based external rewards, enabling training purely on unlabeled queries.
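The mechanism above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the divergence direction KL(U || p), per-token logits of shape (tokens, vocabulary), and the standard GRPO group normalization (reward minus group mean, divided by group standard deviation). The function names `self_certainty` and `group_relative_advantages` are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def self_certainty(logits):
    """Average over tokens of KL(U || p) for logits of shape (T, V).

    Assumption: the uniform distribution U is the first argument of the KL.
    KL(U || p) = sum_j (1/V) * (log(1/V) - log p_j) = -log V - mean_j log p_j
    """
    log_p = log_softmax(np.asarray(logits, dtype=float))
    vocab_size = log_p.shape[-1]
    kl_per_token = -np.log(vocab_size) - log_p.mean(axis=-1)
    return float(kl_per_token.mean())

def group_relative_advantages(rewards, eps=1e-8):
    # GRPO-style normalization within one group of sampled completions.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Toy group: three "completions" for one query, with increasingly peaked logits.
rng = np.random.default_rng(0)
group_logits = [rng.normal(scale=s, size=(5, 8)) for s in (0.1, 1.0, 3.0)]
rewards = [self_certainty(l) for l in group_logits]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

No label, test case, or verifier appears anywhere in the reward computation: a completion is favored only because its token distributions are further from uniform than those of the other samples in its group.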

Matches GRPO performance on in-domain mathematical benchmarks (GSM8K, MATH500) without any labeled data or ground truth.
Achieves superior out-of-domain generalization: a 65% relative improvement on LiveCodeBench, versus no improvement for GRPO, and a 76% gain on CRUXEval-O.
Enables base LLMs to spontaneously generate coherent reasoning chains and structured code from purely unlabeled data, demonstrating emergent instruction-following capabilities.

Relies entirely on self-certainty being a well-calibrated proxy for correctness; confidence may diverge from task success on ambiguous queries or when the model is poorly calibrated.
Lacks explicit verification guarantees, potentially allowing suboptimal but highly confident trajectories to be reinforced during training without external correction.