Intrinsic RL for LLMs
🔗 Source: arXiv
Learning to Reason Without External Rewards
🚀 Technical Novelty
- Mechanism: Replaces the verifiable reward in Group Relative Policy Optimization (GRPO) with the model's own self-certainty, defined as the KL divergence from a uniform distribution over the vocabulary to the model's next-token distribution, averaged over the generated tokens. This score is the sole reward signal.
- Nuance: Eliminates dependency on gold solutions, test cases, or domain-specific verifiers by optimizing process-oriented internal feedback rather than outcome-based external rewards, enabling training purely on unlabeled queries.
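The reward described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes per-token logits are available for each sampled response, uses the KL(U ‖ p) form of self-certainty, and applies the usual GRPO-style normalization of rewards within a group. The function names `self_certainty` and `group_relative_advantages` are hypothetical, and the policy-gradient update itself is omitted.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over one token's vocabulary logits.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def self_certainty(token_logits):
    """Mean over tokens of KL(U || p), with U uniform over the vocabulary.

    token_logits: list of per-token logit lists for one sampled response.
    """
    kls = []
    for logits in token_logits:
        v = len(logits)
        log_p = log_softmax(logits)
        # KL(U || p) = sum_j (1/V) * (log(1/V) - log p_j)
        #            = -log V - mean_j log p_j
        kls.append(-math.log(v) - sum(log_p) / v)
    return sum(kls) / len(kls)

def group_relative_advantages(rewards, eps=1e-8):
    # GRPO-style normalization: center and scale rewards within one group
    # of responses sampled for the same query.
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

# Toy group: the same logit pattern at three sharpness levels (assumed data).
base = [[2.0, 0.5, -1.0, 0.0], [1.0, 1.5, -0.5, 0.0]]
group = [[[s * x for x in tok] for tok in base] for s in (0.1, 1.0, 4.0)]
rewards = [self_certainty(resp) for resp in group]
advantages = group_relative_advantages(rewards)
```

In this toy group the sharper (more peaked) distributions receive higher self-certainty and therefore larger advantages, with no labels or verifier involved. The same property is what the limitations below refer to: a confidently wrong response is rewarded just as readily.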
💡 Yield
- Matches GRPO performance on in-domain mathematical benchmarks (GSM8K, MATH500) without any labeled data or ground truth.
- Generalizes better out of domain: a 65% relative improvement on LiveCodeBench, where GRPO shows no gain, and a 76% gain on CRUXEval-O.
- Enables base LLMs to spontaneously generate coherent reasoning chains and structured code from purely unlabeled data, demonstrating emergent instruction-following capabilities.
⚠️ Limitations
- Depends entirely on self-certainty being a well-calibrated proxy for correctness; where the model is confidently wrong or the task is ambiguous, the reward may diverge from actual task success.
- Lacks explicit verification guarantees, potentially allowing suboptimal but highly confident trajectories to be reinforced during training without external correction.