EAGLE Speculative Sampling
🔗 Source: arXiv
EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
🚀 Technical Novelty
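- Mechanism: A lightweight draft model autoregressively predicts second-to-top-layer features (the hidden states fed to the LM head) rather than discrete tokens. Its input also includes the token sequence shifted one step ahead, so each feature prediction is conditioned on the token that was actually sampled, which resolves the feature ambiguity that sampling introduces.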
- Nuance: Unlike Medusa or Lookahead, which predict spaced tokens or rely on n-grams and Jacobi iteration and reach lower acceptance accuracy (~0.6), EAGLE drafts at the continuous feature level with shifted-token conditioning. It reaches ~0.8 acceptance accuracy without fine-tuning the backbone model.
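The shifted-token conditioning described above can be sketched as a toy draft step. This is an illustration under stated assumptions, not the paper's implementation: the sizes, the random weights, and the names `draft_step`, `draft_chain`, `matvec`, and `rand_matrix` are hypothetical, and a single linear map stands in for the actual draft head. The point is only the input layout: the feature at position t is concatenated with the embedding of the token already sampled for position t+1.

```python
import random

HIDDEN, VOCAB = 4, 6  # toy sizes, chosen for illustration only
rng = random.Random(0)

def rand_matrix(rows, cols):
    """Random weights standing in for trained parameters."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

# Frozen pieces reused from the target LLM (random stand-ins here).
embedding = rand_matrix(VOCAB, HIDDEN)  # token id -> embedding
lm_head = rand_matrix(VOCAB, HIDDEN)    # feature -> logits

# Assumed draft head: maps [feature ; shifted-token embedding] to the next feature.
draft_head = rand_matrix(HIDDEN, 2 * HIDDEN)

def draft_step(feature_t, token_next):
    """Predict the feature at position t+1 from the feature at position t and the
    token already sampled for position t+1; the LM head then gives logits for
    the token at position t+2."""
    x = feature_t + embedding[token_next]  # list concatenation, length 2 * HIDDEN
    feature_next = matvec(draft_head, x)
    logits = matvec(lm_head, feature_next)
    return feature_next, logits

def draft_chain(feature, token, steps):
    """Draft `steps` tokens autoregressively at the feature level."""
    drafted = []
    for _ in range(steps):
        feature, logits = draft_step(feature, token)
        token = max(range(VOCAB), key=logits.__getitem__)  # greedy pick, for determinism
        drafted.append(token)
    return drafted
```

Calling `draft_step` twice with the same feature but different shifted tokens yields different predicted features, which is the ambiguity the token input is meant to remove.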
💡 Yield
- Reports a 2.7x-3.5x latency speedup and doubled throughput across LLaMA2, Vicuna, and Mixtral models, while provably preserving the target model's exact output distribution under both greedy and non-greedy decoding.
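The distribution guarantee rests on the standard speculative sampling acceptance rule. Below is a minimal single-token sketch of that rule, assuming toy target and draft distributions `p` and `q` over a three-token vocabulary. The names `verify_token` and `empirical_output` are hypothetical, and tree-structured drafting is left out.

```python
import random

def verify_token(p, q, drafted, rng):
    """Accept `drafted` (sampled from q) with probability min(1, p/q);
    on rejection, resample from the residual max(p - q, 0)."""
    if rng.random() < min(1.0, p[drafted] / q[drafted]):
        return drafted
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    # rng.choices normalizes the weights, so the residual need not sum to 1.
    return rng.choices(range(len(p)), weights=residual)[0]

def empirical_output(p, q, trials=100_000, seed=0):
    """Draft from q, verify against p, and return output token frequencies."""
    rng = random.Random(seed)
    counts = [0] * len(p)
    for _ in range(trials):
        drafted = rng.choices(range(len(q)), weights=q)[0]
        counts[verify_token(p, q, drafted, rng)] += 1
    return [c / trials for c in counts]

p = [0.6, 0.3, 0.1]  # target distribution (toy values)
q = [0.2, 0.5, 0.3]  # draft distribution (toy values)
freqs = empirical_output(p, q)
```

Up to sampling noise, `freqs` should match `p` rather than `q`. A poor draft lowers the acceptance rate and therefore the speedup, but it does not change the output distribution.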
⚠️ Limitations
- The speedup ratio shrinks as batch size grows because of GPU memory and compute constraints.
- The draft model needs a small amount of training data (2-4B tokens or a fixed ShareGPT subset), though this one-time cost is amortized to a negligible level over repeated inference.