🔗 Source: arXiv

A Theory of Generalization in Deep Learning

🚀 Technical Novelty

  • Mechanism: Partitions the empirical neural tangent kernel’s output space into a “signal channel” (where SGD accumulates coherent population signal via fast linear drift) and a “reservoir” (where label noise is trapped in test-invisible directions). Derives an exact population-risk objective from a single training run, implemented as a signal-to-noise-ratio (SNR) preconditioner on Adam.
  • Nuance: Unlike frozen-kernel analyses or vacuous capacity bounds, this framework operates in the full feature-learning regime with an evolving kernel (O(1) operator-norm drift), mechanistically unifying benign overfitting, double descent, implicit bias, and grokking within a single output-space decomposition.

💡 Yield

  • Proves generalization survives full feature learning and maps classical phenomena to signal/reservoir dynamics.
  • Derives a validation-free population-risk objective that reduces to an SNR preconditioner adding only one state vector.
  • Empirically accelerates grokking by 5×, suppresses memorization in PINNs/INRs (2.36× faster convergence), and improves DPO fine-tuning under noisy preferences (1.13–1.16× accuracy gain, 3× less reward drift).
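
To make the idea of an SNR preconditioner on Adam with one extra state vector concrete, here is a minimal sketch. It is not the paper's update rule, which this summary does not spell out. It assumes a per-coordinate coherence score, the squared bias-corrected mean gradient divided by the bias-corrected second moment, smoothed into a single additional vector `gate` that scales the usual Adam step. The names `snr_adam_step`, `init_state`, `gate`, and `beta3` are hypothetical.

```python
import math


def init_state(n):
    """Adam's two moment vectors plus ONE extra vector (the smoothed SNR gate)."""
    return {"t": 0, "m": [0.0] * n, "v": [0.0] * n, "gate": [0.0] * n}


def snr_adam_step(params, grads, state, lr=1e-3, beta1=0.9, beta2=0.999,
                  beta3=0.9, eps=1e-8):
    """One Adam step scaled by a hypothetical per-coordinate SNR gate.

    ASSUMPTION: the gate formula below is illustrative only; it is not
    taken from the paper.
    """
    state["t"] += 1
    t = state["t"]
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        # Standard Adam moment estimates with bias correction.
        m = state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        v = state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        # Coherence in [0, 1]: near 1 when gradients agree in sign across
        # steps, near 0 when they mostly cancel (noise-dominated).
        raw = min(1.0, m_hat * m_hat / (v_hat + eps))
        s = state["gate"][i] = beta3 * state["gate"][i] + (1 - beta3) * raw
        s_hat = s / (1 - beta3 ** t)
        # Adam update, damped in low-SNR coordinates.
        new_params.append(p - lr * s_hat * m_hat / (math.sqrt(v_hat) + eps))
    return new_params


# Coordinate 0 sees a consistent gradient; coordinate 1 sees alternating noise.
state = init_state(2)
p = [0.0, 0.0]
for step in range(200):
    noisy = 1.0 if step % 2 == 0 else -1.0
    p = snr_adam_step(p, [1.0, noisy], state, lr=0.01)
print(p)
```

In this toy run the coherent coordinate keeps moving at close to the full Adam step size, while the coordinate whose gradient flips sign every step is almost frozen. That is the qualitative behavior one would expect from a preconditioner that favors signal over noise.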

⚠️ Limitations

  • Requires tracking per-example Jacobians and a cumulative dissipation Gramian during training, which may scale poorly with extreme model widths or complex architectures.
  • Core theoretical derivations rely on squared loss dynamics, though the framework claims broader applicability.
  • Empirical validation focuses on PINNs, implicit neural representations, and DPO rather than large-scale language models.
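
A rough back-of-envelope sketch of the first limitation's storage cost follows. It assumes dense storage, one Jacobian row per (example, output) pair, and a Gramian that is square in the number of such rows. Both the helper `tracking_memory_bytes` and these shape assumptions are hypothetical, since the summary does not give the exact dimensions.

```python
def tracking_memory_bytes(n_examples, n_params, n_outputs=1, bytes_per_float=4):
    """Estimate dense storage for per-example Jacobians and a Gramian.

    ASSUMPTION: Jacobian is (n_examples * n_outputs) x n_params and the
    Gramian is (n_examples * n_outputs) squared; the paper may store less.
    """
    rows = n_examples * n_outputs
    jacobian_bytes = rows * n_params * bytes_per_float
    gramian_bytes = rows * rows * bytes_per_float
    return jacobian_bytes, gramian_bytes


jac, gram = tracking_memory_bytes(n_examples=1000, n_params=1_000_000)
print(f"Jacobians: {jac / 1e9:.1f} GB, Gramian: {gram / 1e6:.1f} MB")
```

Under these assumptions the Jacobian term grows linearly with parameter count, so it dominates for wide models, while the Gramian grows quadratically with the number of tracked examples.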