Theory of Deep Generalization
🔗 Source: arXiv
A Theory of Generalization in Deep Learning
🚀 Technical Novelty
- Mechanism: Uses a cumulative dissipation Gramian to partition output space into a signal channel (where loss dissipates) and a reservoir (where noise is trapped), enabling exact population-risk estimation from a single training run.
- Nuance: Extends frozen-kernel NTK theory to the full feature-learning regime with an evolving kernel, mechanistically unifying benign overfitting, double descent, implicit bias, and grokking within a single bias-variance decomposition.
💡 Yield
- Derives an exact population-risk objective computable from per-example Jacobians during training, requiring only one extra state vector in Adam.
- Empirically accelerates grokking by 5×, suppresses memorization in PINNs/INRs, and improves DPO fine-tuning under noisy preferences while staying closer to the reference policy.
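
To make "per-example Jacobians" concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's method: the two-parameter model `a * tanh(b * x)` and the helper names are hypothetical, and the Gram matrix shown is only the standard single-step Gram matrix of Jacobians (the empirical NTK). This summary does not define how the paper accumulates its dissipation Gramian over training, so that step is left out.

```python
import math

def model(params, x):
    """Hypothetical toy scalar model: f(x) = a * tanh(b * x)."""
    a, b = params
    return a * math.tanh(b * x)

def per_example_jacobian(params, x):
    """Analytic gradient of the model output w.r.t. (a, b) for ONE input x."""
    a, b = params
    t = math.tanh(b * x)
    # d f / d a = tanh(b x);  d f / d b = a * x * (1 - tanh(b x)^2)
    return [t, a * x * (1.0 - t * t)]

def gram_matrix(params, xs):
    """Gram matrix of per-example Jacobians: K[i][j] = <J_i, J_j>.

    This is the single-step empirical NTK, used here only as an
    illustration; it is NOT the paper's cumulative dissipation Gramian.
    """
    jac = [per_example_jacobian(params, x) for x in xs]
    return [[sum(u * v for u, v in zip(ji, jj)) for jj in jac] for ji in jac]

params = (1.5, 0.5)
xs = [-1.0, 0.0, 2.0]
K = gram_matrix(params, xs)
for row in K:
    print(["%.4f" % v for v in row])
```

The Gram matrix has one row and column per example, so its size grows quadratically with the batch, which is the memory concern raised under Limitations.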
⚠️ Limitations
- The core signal/reservoir decomposition is derived under squared loss, though the paper claims broader applicability.
- Requires computing per-example Jacobians and maintaining a cumulative dissipation Gramian, which may scale poorly with extremely large batch sizes or in memory-constrained setups.
- Theoretical guarantees assume a C² parameterization and specific gradient-flow/SGD dynamics; practical performance depends on the SNR threshold condition (μ_k² > σ_k²/(b-1)).
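
The threshold condition above can be checked numerically. The sketch below is only an illustration under assumptions this summary does not confirm: that b is the batch size, and that μ_k and σ_k² are the mean and variance of some per-example quantity along mode k. The function names are hypothetical, and sample estimates stand in for the population values. The second helper is plain algebra: μ² > σ²/(b-1) rearranges to b > 1 + σ²/μ².

```python
import math
from statistics import mean, variance

def passes_snr_threshold(samples):
    """Check mu^2 > sigma^2 / (b - 1) for one mode.

    `samples` holds b per-example values for that mode (assumed meaning).
    mu and sigma^2 are estimated from the samples themselves.
    """
    b = len(samples)
    if b < 2:
        raise ValueError("need at least two samples")
    mu = mean(samples)
    sigma2 = variance(samples)  # unbiased sample variance
    return mu * mu > sigma2 / (b - 1)

def min_batch_size(mu, sigma2):
    """Smallest integer b with mu^2 > sigma^2 / (b - 1), i.e. b > 1 + sigma^2 / mu^2."""
    if mu == 0:
        raise ValueError("no finite batch size satisfies the condition when mu == 0")
    return math.floor(sigma2 / (mu * mu)) + 2

print(passes_snr_threshold([1.0, 1.1, 0.9, 1.0]))    # consistent values
print(passes_snr_threshold([1.0, -1.0, 1.0, -1.0]))  # zero-mean values
print(min_batch_size(0.5, 1.0))
```

Under these assumptions, a mode with a weak mean relative to its variance needs a larger batch before it clears the threshold.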