Entropy-Based Draft Stopping
🔗 Source: arXiv
AdaEDL: Early Draft Stopping for Speculative Decoding of Large Language Models via an Entropy-based Lower Bound on Token Acceptance Probability
🚀 Technical Novelty
- Mechanism: Dynamically ends the drafting phase of speculative decoding once an entropy-based lower bound on the token acceptance probability, computed from the draft model’s output distribution, falls below a stopping threshold.
- Nuance: Unlike prior adaptive methods that rely on static confidence thresholds or require training task-specific predictors, AdaEDL is entirely training-free and leverages distributional uncertainty (entropy) for robust, plug-and-play acceleration.
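The stopping rule described above can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: the bound form `1 - sqrt(gamma * H)`, the function names, and the use of a fixed `lam` are assumptions made for the example (the summary only states that the bound is entropy-based, that γ ≈ 0.2, and that the threshold λ is adjusted dynamically).

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def acceptance_lower_bound(probs, gamma=0.2):
    """Assumed bound form: 1 - sqrt(gamma * H), clamped at 0.

    `gamma` is the entropy scaling factor; 0.2 is the value quoted
    in the summary. The exact functional form is an assumption.
    """
    return max(0.0, 1.0 - math.sqrt(gamma * entropy(probs)))


def should_stop_drafting(probs, lam, gamma=0.2):
    """Stop drafting when the estimated bound drops below lam.

    `lam` is held fixed here for simplicity; a dynamic update of the
    threshold is not modeled in this sketch.
    """
    return acceptance_lower_bound(probs, gamma) < lam


# A confident (peaked) draft distribution versus an uncertain (uniform) one.
peaked = [0.97, 0.01, 0.01, 0.01]
uniform = [0.25, 0.25, 0.25, 0.25]

print(should_stop_drafting(peaked, lam=0.5))   # keeps drafting
print(should_stop_drafting(uniform, lam=0.5))  # stops drafting
```

With these assumed values, the peaked distribution keeps the bound well above λ = 0.5, while the uniform distribution over four tokens pushes it below the threshold, so drafting would stop and the target model would verify the tokens drafted so far.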
💡 Yield
- Achieves 10% to 57% inference speedups over speculative decoding with a static draft length, and up to 10% over other training-free baselines, across multiple datasets and model pairs.
- Maintains consistent gains even at high sampling temperatures, where traditional speculative decoding becomes slower than plain autoregressive decoding.
- Eliminates the need for dataset-specific fine-tuning or hyperparameter search, enabling seamless integration into existing LLM inference pipelines.
⚠️ Limitations
- Sensitive to the initial stopping threshold (λ), particularly for short-generation tasks, where too few drafting rounds occur for the dynamically updated threshold to converge.
- Relies on a fixed entropy scaling factor (γ ≈ 0.2) in current implementations; optimal γ may vary across models and requires further investigation.
- Efficiency gains are inherently bounded by the draft model’s computational cost and alignment with the target model, limiting absolute speedups in highly mismatched setups.
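To make the last limitation concrete, the following sketch uses the standard expected-speedup analysis for speculative decoding with a static draft length, not anything specific to AdaEDL. It assumes each drafted token is accepted independently with probability `alpha`, that `k` tokens are drafted per round, and that one draft-model step costs a fraction `c` of a target-model step; all three names are illustrative.

```python
def expected_tokens_per_round(alpha, k):
    """Expected tokens produced per draft-and-verify round.

    Assumes i.i.d. acceptance with probability `alpha` and `k` drafted
    tokens; the verification pass always contributes one token.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def expected_speedup(alpha, k, c):
    """Expected wall-clock speedup over autoregressive decoding.

    One round costs k draft steps (each `c` target-steps) plus one
    target-model verification step.
    """
    return expected_tokens_per_round(alpha, k) / (k * c + 1.0)


# Well-aligned, cheap draft model versus a mismatched, expensive one.
print(round(expected_speedup(alpha=0.8, k=4, c=0.1), 2))
print(round(expected_speedup(alpha=0.3, k=4, c=0.5), 2))
```

Under these assumptions the first configuration yields a speedup above 2x, while the second falls below 1x, meaning speculative decoding would be slower than plain autoregressive decoding. An adaptive stopping rule can reduce wasted draft steps, but it cannot change `alpha` or `c`, which is why absolute gains stay bounded in mismatched setups.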