Future-Aware Speculative Decoding

🔗 Source: arXiv

ConFu: Contemplate the Future for Better Speculative Sampling

🚀 Technical Novelty

Mechanism: Uses contemplate tokens, learnable soft prompts, and a dynamic Mixture-of-Experts to extract the target model’s reasoning state as an auxiliary input for draft token generation.
Nuance: Unlike EAGLE, which conditions the draft only on the current prefix and so drifts away from the target over successive draft steps, ConFu explicitly anticipates the target’s semantic trajectory, aligning draft distributions with future states at negligible overhead.
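
Why drift matters can be seen in the standard speculative sampling accept/reject rule, which is generic to this family of methods and is not ConFu's own code. The sketch below is a minimal illustration under stated assumptions: the three-token vocabulary, the distributions `p_target`, `q_drifted`, and `q_aligned`, and the function names are all hypothetical. It shows that the expected acceptance probability equals the overlap between the draft and target distributions, so a draft that tracks the target more closely is accepted more often.

```python
import random


def acceptance_rate(p, q):
    """Expected probability that a token drawn from draft q is accepted
    against target p under the standard rule: sum_x min(p(x), q(x))."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def speculative_step(p, q, rng):
    """One accept/reject step of standard speculative sampling.

    Draws x from the draft distribution q and accepts it with probability
    min(1, p[x] / q[x]). On rejection, resamples from the residual
    max(0, p - q), which random.choices normalizes implicitly.
    Returns (token, accepted).
    """
    vocab = range(len(p))
    x = rng.choices(vocab, weights=q)[0]
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(vocab, weights=residual)[0], False


# Hypothetical toy distributions over a 3-token vocabulary.
p_target = [0.6, 0.3, 0.1]     # what the target model would sample from
q_drifted = [0.2, 0.3, 0.5]    # a draft that has drifted from the target
q_aligned = [0.5, 0.35, 0.15]  # a draft that stays close to the target

if __name__ == "__main__":
    print("drifted draft:", acceptance_rate(p_target, q_drifted))
    print("aligned draft:", acceptance_rate(p_target, q_aligned))
```

Whichever draft is used, the tokens that come out of `speculative_step` are distributed according to `p_target`; only the fraction accepted without resampling changes, which is why better alignment improves speed without changing output quality.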

💡 Yield

Delivers 8–11% average gains in token acceptance rates and generation throughput over EAGLE-3 across Llama-3 3B/8B models on SpecBench.
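
To see why acceptance rate and throughput tend to move together, a rough model helps. The sketch below uses the well-known closed form for a single chain of `gamma` drafted tokens with an independent per-token acceptance probability `alpha`. That is a simplification: EAGLE-style drafters verify a token tree rather than a chain, and the function name and the `alpha` values in the loop are illustrative assumptions, not figures from the paper.

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens emitted per target-model verification pass.

    Assumes a chain of `gamma` drafted tokens, each accepted independently
    with probability `alpha`, plus the one token the target always
    contributes: (1 - alpha**(gamma + 1)) / (1 - alpha).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Limit of the closed form: every draft token is accepted.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    # Illustrative acceptance probabilities only.
    for alpha in (0.6, 0.7, 0.8):
        print(alpha, expected_tokens_per_step(alpha, gamma=5))
```

Under this model, more tokens per verification pass means fewer target-model passes for the same output length, provided the drafting cost stays small.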

⚠️ Limitations

Inserting contemplate tokens adds compute proportional to the draft tree size; reducing this overhead so the method scales well is left to future work.