🔗 Source: arXiv

EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees

Mechanism: Replaces static draft trees in speculative sampling with context-aware dynamic structures that expand or contract per token based on the draft model’s confidence scores.
Nuance: Prior state-of-the-art methods use a fixed tree shape, which implicitly assumes that a draft token’s acceptance rate depends only on its position. EAGLE-2 instead exploits the draft model’s calibration: its confidence scores approximate acceptance probabilities closely enough to shape the tree per context, without relaxing acceptance conditions or retraining.
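The confidence-driven tree shaping described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `Node`, `build_dynamic_tree`, `toy_draft`, and the parameters `depth`, `top_k`, and `total_tokens` are hypothetical names, and the draft model is replaced by a hard-coded lookup table. Each node's value is the product of draft confidences along its path, used as a stand-in for its acceptance probability.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Node:
    token: str
    value: float  # product of draft confidences from the root to this node
    parent: int   # index of the parent in the node list (-1 for the root)


# Hypothetical draft interface: prefix tokens -> [(next_token, confidence), ...]
DraftFn = Callable[[Sequence[str]], List[Tuple[str, float]]]


def path_tokens(nodes: List[Node], idx: int) -> List[str]:
    """Tokens on the path from the root to nodes[idx], root first."""
    out = []
    while idx != -1:
        out.append(nodes[idx].token)
        idx = nodes[idx].parent
    return out[::-1]


def build_dynamic_tree(draft_fn: DraftFn, root_token: str, depth: int,
                       top_k: int, total_tokens: int):
    nodes = [Node(root_token, 1.0, -1)]
    frontier = [0]
    for _ in range(depth):
        start = len(nodes)
        # Expand: query the draft model only for the current frontier nodes.
        for idx in frontier:
            for tok, conf in draft_fn(path_tokens(nodes, idx)):
                nodes.append(Node(tok, nodes[idx].value * conf, idx))
        if len(nodes) == start:
            break  # nothing new was drafted
        new = sorted(range(start, len(nodes)), key=lambda i: -nodes[i].value)
        frontier = new[:top_k]  # only the most promising nodes grow further
    # Rerank: keep the best draft tokens overall. A child's value never exceeds
    # its parent's and parents have smaller indices, so the kept set stays a tree.
    ranked = sorted(range(1, len(nodes)), key=lambda i: (-nodes[i].value, i))
    return nodes, sorted(ranked[:total_tokens])


def toy_draft(prefix: Sequence[str]) -> List[Tuple[str, float]]:
    # Made-up confidences for illustration only.
    table = {
        ("<s>",): [("the", 0.9), ("a", 0.1)],
        ("<s>", "the"): [("cat", 0.6), ("dog", 0.4)],
        ("<s>", "a"): [("cat", 0.5), ("dog", 0.5)],
    }
    return table.get(tuple(prefix), [])


nodes, keep = build_dynamic_tree(toy_draft, "<s>", depth=2, top_k=1, total_tokens=3)
kept_tokens = [nodes[i].token for i in keep]
print(kept_tokens)
```

With these toy confidences the tree grows under the confident "the" branch and drops the low-confidence "a" branch, so the kept tokens are "the", "cat", and "dog". A static tree would spend the same budget on both branches regardless of context.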

Achieves a 3.05x–4.26x speedup over vanilla autoregressive decoding and runs 20%–40% faster than EAGLE-1 across Vicuna, LLaMA2, and LLaMA3 models on six diverse tasks, while leaving the target model’s output distribution exactly unchanged (lossless acceleration).
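The lossless guarantee comes from the acceptance rule of speculative sampling, which EAGLE-2 leaves untouched. The sketch below shows the standard single-token form of that rule (accept a drafted token with probability min(1, p/q), otherwise resample from the normalized residual max(p − q, 0)). It is an illustration only: the function name `verify_token` and the dictionary-based distributions are assumptions, and EAGLE-2 itself verifies a whole tree of candidates rather than one token at a time.

```python
import random
from typing import Dict, Tuple


def verify_token(token: str, p_target: Dict[str, float],
                 q_draft: Dict[str, float], rng: random.Random) -> Tuple[str, bool]:
    """Return (emitted_token, accepted) for one drafted token.

    p_target and q_draft are next-token distributions of the target and draft
    models; `token` is assumed to have been sampled from q_draft.
    """
    p = p_target.get(token, 0.0)
    q = q_draft[token]
    if rng.random() < min(1.0, p / q):
        return token, True
    # Rejected: resample from the residual distribution max(p - q, 0).
    # Its mass is positive here because a rejection implies p < q for `token`.
    residual = {t: max(p_target[t] - q_draft.get(t, 0.0), 0.0) for t in p_target}
    tokens = list(residual)
    weights = [residual[t] for t in tokens]
    return rng.choices(tokens, weights=weights)[0], False


# Empirical check with made-up distributions: emitted tokens should follow p_target.
rng = random.Random(0)
p_target = {"a": 0.7, "b": 0.3}
q_draft = {"a": 0.4, "b": 0.6}
trials = 20000
count_a = 0
for _ in range(trials):
    drafted = rng.choices(list(q_draft), weights=list(q_draft.values()))[0]
    emitted, _ = verify_token(drafted, p_target, q_draft, rng)
    count_a += emitted == "a"
freq_a = count_a / trials
print(round(freq_a, 2))
```

In this example "a" is emitted either by direct acceptance (0.4) or by resampling after a rejected "b" (0.6 × 0.5), which sums to the target probability 0.7. The draft distribution changes only how often tokens are accepted, not what distribution the output follows.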

Performance degrades on knowledge-heavy QA and summarization tasks because of training-data bias in the draft model (trained on SFT data only, without pretraining data). Reported speedups are also hardware-dependent, and the overhead of draft-model forward passes is not measured in isolation.