🔗 Source: arXiv

EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees

🚀 Technical Novelty

  • Mechanism: Builds the speculative draft tree dynamically during generation, treating the draft model’s confidence scores as proxies for token acceptance rates and expanding only the branches most likely to be accepted in the current context.
  • Nuance: Prior SOTA methods (e.g., EAGLE-1, Medusa) rely on fixed, position-dependent draft trees or relaxed acceptance conditions. EAGLE-2 instead adapts the tree shape to context difficulty at inference time while strictly preserving the original LLM’s output distribution, and it requires no extra training.
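
The mechanism above can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation. It assumes a node's estimated acceptance is the product of draft confidences along its path, that only the `top_k` best nodes per layer are expanded, and that a final rerank keeps `budget` candidates for verification. The names `build_dynamic_tree`, `toy_draft`, `top_k`, and `budget` are hypothetical, and the toy lookup table stands in for a real draft model.

```python
import math
from typing import Callable, Dict, List, Tuple

Path = Tuple[str, ...]
# Hypothetical draft interface: token path -> {next token: confidence}
DraftFn = Callable[[Path], Dict[str, float]]


def build_dynamic_tree(draft: DraftFn, depth: int, top_k: int,
                       budget: int) -> List[Tuple[Path, float]]:
    """Grow a draft tree guided by confidence; return (path, value) pairs."""
    frontier: List[Tuple[Path, float]] = [((), 0.0)]  # (path, cumulative log-confidence)
    candidates: List[Tuple[Path, float]] = []
    for _ in range(depth):
        children = []
        for path, logp in frontier:
            for token, conf in draft(path).items():
                if conf > 0.0:
                    children.append((path + (token,), logp + math.log(conf)))
        if not children:
            break
        candidates.extend(children)
        # Expand step: only the top_k most promising nodes of this layer grow further.
        children.sort(key=lambda c: c[1], reverse=True)
        frontier = children[:top_k]
    # Rerank step: keep the `budget` nodes with the highest estimated acceptance.
    # A child's value never exceeds its parent's, and the sort is stable, so
    # every kept node is preceded by its ancestors.
    candidates.sort(key=lambda c: c[1], reverse=True)
    return [(path, math.exp(logp)) for path, logp in candidates[:budget]]


def toy_draft(path: Path) -> Dict[str, float]:
    """Stand-in draft model with made-up confidences (assumption)."""
    table = {
        (): {"the": 0.9, "a": 0.1},
        ("the",): {"cat": 0.6, "dog": 0.4},
        ("a",): {"cat": 0.5, "dog": 0.5},
    }
    return table.get(path, {})


for path, value in build_dynamic_tree(toy_draft, depth=2, top_k=1, budget=3):
    print(path, round(value, 2))
```

With these toy confidences the low-confidence branch `a` is never expanded, and the three kept candidates are `the`, `the cat`, and `the dog`. A fixed tree would spend the same budget on every position regardless of confidence.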

💡 Yield

  • Achieves 3.05x–4.26x average speedup (up to 5x on code generation) across Vicuna and LLaMA model families, increasing average acceptance length to ~4–5.5 tokens per drafting-verification cycle.
  • Delivers lossless acceleration with no additional training and no parameter changes to either the base LLM or the draft model.
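
To make "lossless" concrete, the sketch below shows the standard single-token speculative-sampling acceptance rule: accept a drafted token with probability min(1, p/q), otherwise resample from the normalized positive part of p - q. It is offered as background under stated assumptions. The summary above does not spell out EAGLE-2's tree verification procedure, which handles multiple candidates per position, and the helper names and toy distributions here are hypothetical.

```python
import random
from typing import Dict

Dist = Dict[str, float]  # token -> probability


def residual(p: Dist, q: Dist) -> Dist:
    """Normalized max(p - q, 0): the distribution used after a rejection."""
    raw = {t: max(p[t] - q.get(t, 0.0), 0.0) for t in p}
    total = sum(raw.values())
    # total == 0 only when p == q, in which case a rejection never occurs.
    return {t: v / total for t, v in raw.items()} if total > 0 else dict(p)


def verify_token(token: str, p: Dist, q: Dist, rng: random.Random) -> str:
    """Accept the drafted token w.p. min(1, p/q); otherwise resample."""
    if rng.random() < min(1.0, p.get(token, 0.0) / q[token]):
        return token
    res = residual(p, q)
    return rng.choices(list(res), weights=list(res.values()))[0]


def output_distribution(p: Dist, q: Dist) -> Dist:
    """Exact distribution of the emitted token when the draft is sampled from q."""
    accept = {t: min(q.get(t, 0.0), p[t]) for t in p}  # q(t) * min(1, p(t)/q(t))
    reject_mass = 1.0 - sum(accept.values())
    res = residual(p, q)
    return {t: accept[t] + reject_mass * res[t] for t in p}


# Toy target (p) and draft (q) distributions, made up for illustration.
p = {"a": 0.5, "b": 0.3, "c": 0.2}
q = {"a": 0.7, "b": 0.2, "c": 0.1}
print(output_distribution(p, q))
print(verify_token("a", p, q, random.Random(0)))
```

`output_distribution(p, q)` reproduces `p` up to floating-point error for any draft `q`, which is the sense in which verification leaves the target model's output distribution unchanged. The draft model's quality affects only how often tokens are accepted, and therefore the speedup.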

⚠️ Limitations

  • Speedup is lower on world-knowledge QA and summarization tasks because the draft models are trained exclusively on SFT data rather than large-scale pretraining corpora.
  • Speedup ratios are hardware-dependent, and the method’s efficacy relies heavily on the draft model’s calibration quality across different architectures.