Entropy-Based Adaptive Speculative Decoding
🔗 Source: arXiv
AdaEDL: Early Draft Stopping for Speculative Decoding of Large Language Models via an Entropy-based Lower Bound on Token Acceptance Probability
🚀 Technical Novelty
- Mechanism: Uses the draft model’s output entropy to approximate a lower bound on the token acceptance probability, so the drafting loop can stop early, on the fly, once acceptance looks unlikely.
- Nuance: Replaces static draft lengths and max-confidence thresholds by evaluating the full probability distribution via entropy, eliminating the need for training task-specific predictors while maintaining robustness across varying sampling temperatures.
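
As a rough illustration of the mechanism above, the sketch below turns the entropy of a draft distribution into an acceptance estimate and compares it with a stopping threshold. It assumes the bound has the form 1 - sqrt(γ·H), which this summary does not spell out. The function names and the value λ = 0.6 are likewise hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def acceptance_lower_bound(probs, gamma=0.2):
    """ASSUMED form of the bound: 1 - sqrt(gamma * H(draft distribution))."""
    # Clamp at 0 so very flat distributions never yield a negative estimate.
    return max(0.0, 1.0 - math.sqrt(gamma * entropy(probs)))

def should_stop_drafting(probs, lam, gamma=0.2):
    """Stop drafting when the estimated acceptance bound drops below lam."""
    return acceptance_lower_bound(probs, gamma) < lam

# Two draft distributions with the same top-1 probability (0.5).
two_way = [0.5, 0.5]            # mass split between two candidates
long_tail = [0.5] + [0.1] * 5   # remaining mass spread over five candidates

lam = 0.6  # hypothetical threshold
print(should_stop_drafting(two_way, lam))    # False: keep drafting
print(should_stop_drafting(long_tail, lam))  # True: stop and verify
```

A max-confidence rule sees 0.5 in both cases and cannot tell them apart. The entropy-based estimate separates them because it looks at the whole distribution.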
💡 Yield
- Outperforms static draft-length speculative decoding by 10% to 57% and other training-free methods by up to 10% across multiple datasets and model pairs.
- Maintains inference speed gains even at high sampling temperatures (up to 1.7), where baseline speculative decoding becomes slower than plain autoregressive decoding.
- Requires zero training or dataset-specific fine-tuning, acting as a plug-and-play acceleration layer for existing LLM systems.
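
One way to read the temperature result is that higher sampling temperatures flatten the draft distribution and raise its entropy, so an entropy-based rule would tend to end drafting sooner. That reading is an interpretation rather than a claim from the source. The small sketch below shows only the underlying arithmetic, using made-up logits.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

logits = [4.0, 2.0, 1.0, 0.5, 0.0]  # hypothetical draft-model logits

for t in (0.5, 1.0, 1.7):
    probs = softmax(logits, t)
    print(f"T={t}: max_prob={max(probs):.3f}, entropy={entropy(probs):.3f}")
```

As the temperature rises, the top-1 probability falls and the entropy grows, so the entropy signal tracks the extra uncertainty that sampling introduces.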
⚠️ Limitations
- Sensitive to the initial stopping threshold (λ), particularly for short generations where insufficient drafting rounds prevent dynamic threshold convergence.
- Relies on the draft model’s distributional alignment with the target model; gains diminish if the draft model is untrained or poorly correlated with the target.
- The entropy scaling factor (γ) was fixed at 0.2 in experiments, leaving dataset-specific tuning or dynamic adaptation to future work.
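
The sensitivity to the initial λ can be pictured with a toy model. The update below is a generic exponential moving average toward a hypothetical steady-state value. It is not the update rule from the paper, and `lam0`, `target`, and `beta` are invented purely for illustration.

```python
def simulate_threshold(lam0, target, beta, rounds):
    """Toy update: lam <- (1 - beta) * lam + beta * target, once per drafting round."""
    lam = lam0
    for _ in range(rounds):
        lam = (1.0 - beta) * lam + beta * target
    return lam

# All values are hypothetical: a poorly chosen start and an assumed steady state.
lam0, target, beta = 0.9, 0.4, 0.1

for rounds in (3, 50):
    lam = simulate_threshold(lam0, target, beta, rounds)
    print(f"after {rounds} rounds: lam={lam:.3f}")
```

Under these assumptions a short generation with only a few drafting rounds still runs with a threshold close to its initial value, while a long generation has time to settle near the steady state.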