🔗 Source: arXiv

AdaEDL: Early Draft Stopping for Speculative Decoding of Large Language Models via an Entropy-based Lower Bound on Token Acceptance Probability

Mechanism: Stops the drafting phase of speculative decoding early once a lower bound on the expected token acceptance probability, computed from the entropy of the draft model's output distribution, falls below a stopping threshold (λ).
Nuance: Unlike prior adaptive methods that rely on static confidence thresholds or require training task-specific predictors, AdaEDL is entirely training-free and leverages distributional uncertainty (entropy) for robust, plug-and-play acceleration.
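
A minimal sketch of an entropy-based stopping check follows. It assumes the lower bound takes the form 1 - sqrt(γ · H), where H is the entropy of the draft model's next-token distribution. That functional form, the helper names, and the example λ value are illustrative assumptions and are not taken from this summary. Only γ ≈ 0.2 and the existence of a threshold λ come from the text.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def acceptance_lower_bound(draft_logits, gamma=0.2):
    """ASSUMED bound: 1 - sqrt(gamma * H(draft distribution)).

    The value can go negative when the entropy is very high; it is only
    used for comparison against the threshold.
    """
    h = entropy(softmax(draft_logits))
    return 1.0 - math.sqrt(gamma * h)


def should_stop_drafting(draft_logits, lam, gamma=0.2):
    """Stop drafting when the estimated bound drops below the threshold lam."""
    return acceptance_lower_bound(draft_logits, gamma) < lam


# A confident (peaked) draft distribution keeps drafting;
# a uniform (maximally uncertain) one triggers an early stop.
print(should_stop_drafting([10.0, 0.0, 0.0, 0.0], lam=0.5))  # peaked
print(should_stop_drafting([0.0, 0.0, 0.0, 0.0], lam=0.5))   # uniform
```

In a real decoding loop the check would run after each drafted token, with λ adjusted between rounds. The adjustment rule is not described here, so it is left out of the sketch.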

Achieves 10% to 57% inference speedups over speculative decoding with a static draft length, and up to 10% over other training-free baselines, across multiple datasets and model pairs.
Maintains consistent performance gains even at high sampling temperatures where traditional speculative decoding collapses below autoregressive baselines.
Eliminates the need for dataset-specific fine-tuning or hyperparameter search, enabling seamless integration into existing LLM inference pipelines.

Sensitive to the initial stopping threshold (λ), particularly for short-generation tasks, where too few drafting rounds occur for the dynamically updated threshold to converge.
Relies on a fixed entropy scaling factor (γ ≈ 0.2) in current implementations; optimal γ may vary across models and requires further investigation.
Efficiency gains are inherently bounded by the draft model’s computational cost and alignment with the target model, limiting absolute speedups in highly mismatched setups.
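
The last limitation can be made concrete with the standard analysis of static-length speculative decoding, used here as a rough illustration and not as AdaEDL's own formula. It assumes each drafted token is accepted independently with probability alpha, a draft length of k tokens, and a draft-to-target cost ratio c per forward pass. The names alpha, k, and c are introduced for this sketch only.

```python
def expected_speedup(alpha, k, c):
    """Expected wall-clock speedup over autoregressive decoding.

    alpha: per-token acceptance probability (assumed i.i.d.), 0 <= alpha < 1
    k:     number of tokens drafted per round
    c:     cost of one draft forward pass relative to one target pass
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    # Expected tokens produced per round, including the target's own token.
    expected_tokens = (1.0 - alpha ** (k + 1)) / (1.0 - alpha)
    # Cost of one round in units of a target pass: k draft passes + 1 verify.
    round_cost = k * c + 1.0
    return expected_tokens / round_cost


# Well-aligned, cheap draft model vs. a poorly aligned or expensive one.
print(round(expected_speedup(alpha=0.8, k=4, c=0.1), 3))
print(round(expected_speedup(alpha=0.3, k=4, c=0.1), 3))
print(round(expected_speedup(alpha=0.3, k=4, c=0.3), 3))
```

With low acceptance and a relatively costly draft model the ratio drops below 1, meaning slower than plain autoregressive decoding. Stopping drafts early reduces wasted draft passes but cannot raise alpha or lower c.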