Speculative Decoding in 2026: Draft-and-Verify for Faster LLM Inference
Introduction: The Need for Speed in LLM Inference
Large language models (LLMs) have become essential tools across industries, powering chatbots, content generation, code synthesis, and more. However, as these models grow in size and complexity (reaching tens or hundreds of billions of parameters) the time it takes to generate responses becomes a critical bottleneck. Real-time applications require sub-second latency, but naive autoregressive decoding, where tokens are generated one at a time sequentially, limits throughput.
Speculative decoding offers a promising solution by introducing a draft-and-verify mechanism that accelerates inference without compromising output quality. By using a faster “draft” model to propose multiple tokens ahead and a more accurate “verifier” model to confirm these proposals in parallel, this approach reduces the number of sequential calls to the heavyweight model and thus speeds up generation.
This article provides a comprehensive guide to speculative decoding for self-hosted inference, including an explanation of core mechanics, mathematical analysis of expected speedup, overview of key variants like Medusa and EAGLE, and practical deployment advice centered on the popular vLLM framework. For those interested in the broader context of model deployment and performance, see our post on Ollama vs llama.cpp vs vLLM vs TGI vs SGLang: Pick One for Local AI Inference in 2026.

How Speculative Decoding Works
Speculative decoding operates on the principle of parallel draft prediction and verification, involving two cooperating models:
- Draft Model: A smaller, faster model that proposes a sequence of K tokens ahead of the current position. It aims to quickly generate plausible tokens but is allowed to be less precise.
- Verifier Model: The main, larger model that performs accurate token predictions. It verifies draft tokens by performing a single forward pass that checks whether the draft’s proposed tokens match its own predictions.
The process flow is as follows:
- The draft model proposes K tokens based on the current context.
- The verifier model runs a single forward pass over all K proposed positions and compares each draft token with its own prediction.
- If every draft token matches, all K tokens are committed and the system moves forward by K tokens. Implementations typically also emit one extra token from the same verifier pass.
- If a draft token is rejected, the matching tokens before it are kept, the verifier’s own prediction replaces the rejected token, and the draft model generates new proposals starting from that token.
This approach reduces the total number of autoregressive calls to the heavyweight model, enabling multiple tokens to be confirmed simultaneously rather than one-by-one. The gains are most pronounced when the draft model produces proposals that the verifier frequently accepts.
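The flow above can be sketched in a few lines of Python. This is a minimal, greedy-decoding illustration rather than how any particular engine implements it: `toy_draft` and `toy_verifier` are hypothetical stand-ins for real models, the verifier is called once per position for readability (a real engine scores all K positions in one batched pass), and the sketch assumes the common convention of keeping the longest matching prefix and emitting one bonus token when every draft token matches.

```python
from typing import Callable, List

# Hypothetical toy "models": each maps a token context to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(context: List[int], draft: NextToken,
                     verifier: NextToken, k: int) -> List[int]:
    """Run one draft-and-verify round and return the tokens to commit."""
    # 1. Draft k tokens autoregressively with the cheap model.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. Verify. A real engine scores all k positions in one batched forward
    #    pass; here the verifier is called per position for clarity.
    committed: List[int] = []
    for token in proposed:
        expected = verifier(context + committed)
        if token != expected:
            # First mismatch: keep the verifier's token and stop.
            committed.append(expected)
            return committed
        committed.append(token)

    # 3. Every draft token matched, so the same verifier pass also yields
    #    one extra "bonus" token.
    committed.append(verifier(context + committed))
    return committed


def generate(prompt: List[int], draft: NextToken, verifier: NextToken,
             k: int, max_new_tokens: int) -> List[int]:
    """Generate tokens by repeating speculative steps."""
    output = list(prompt)
    while len(output) - len(prompt) < max_new_tokens:
        output.extend(speculative_step(output, draft, verifier, k))
    return output[: len(prompt) + max_new_tokens]


# Toy verifier: counts upward modulo 10. Toy draft: agrees except after a 4.
def toy_verifier(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

Calling `generate([0], toy_draft, toy_verifier, k=4, max_new_tokens=12)` produces exactly the sequence the toy verifier would generate on its own, which is the point of the technique: the draft model only changes how many verifier passes are needed, not what gets generated.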
Mathematics of Speedup: Acceptance Rate and Efficiency
The key metric determining speculative decoding’s speedup is the acceptance rate (R). For a simple estimate, define R as the probability that the verifier accepts the entire proposed draft of length K.
Assuming that each verification step costs roughly the same as generating a single token with the verifier, the expected number of tokens generated per step is:
E[T] = R × K + (1 - R) × 1
Explanation:
- With probability R, the verifier accepts draft tokens and commits all K tokens at once.
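- With probability 1 – R, the draft is rejected and only a single token (the verifier’s own prediction) is committed in that step. This is a conservative simplification: in practice, draft tokens that matched before the first rejection are also kept.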
The theoretical speedup (S) over vanilla token-by-token decoding is:
S = R × K + (1 - R)
This formula assumes the draft model’s generation cost is negligible compared to the verifier’s. If the draft cost is non-negligible, the realized speedup is lower, because every step also pays for K draft-model passes.
| Draft Length (K) | Acceptance Rate (R) | Expected Speedup (S) |
|---|---|---|
| 4 | 0.8 | 3.4× |
| 8 | 0.85 | 6.95× |
| 4 | 0.5 | 2.5× |
| 6 | 0.6 | 4.0× |
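To make the formula and its negligible-draft-cost assumption concrete, here is a small calculator. The first function is the article's formula as written. The second adds a cost model that is an assumption of this sketch, not something stated above: each draft token is taken to cost a fraction `c` of one verifier forward pass, so one round costs `1 + c*K` verifier-equivalents.

```python
def expected_tokens_per_step(k: int, r: float) -> float:
    """E[T] = R*K + (1 - R): all K tokens with probability R, else one."""
    return r * k + (1.0 - r)


def speedup_with_draft_cost(k: int, r: float, c: float = 0.0) -> float:
    """Speedup when each draft token costs a fraction c of a verifier pass.

    Assumption (not from the article): one round costs 1 verifier pass plus
    k draft passes, i.e. (1 + c*k) verifier-equivalents.
    """
    return expected_tokens_per_step(k, r) / (1.0 + c * k)
```

With K = 4 and R = 0.8 the idealized speedup is 3.4×. Under the assumed cost model, a draft that costs 10% of a verifier pass per token brings that down to roughly 2.4×.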
In practice, achieving acceptance rates above 80% is challenging but possible on predictable, technical, or code-based tasks. For creative or open-ended tasks, acceptance rates may fall below 50%, at which point speculative decoding may deliver only marginal gains or even a slowdown.
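The all-or-nothing rate R is a simplification. The speculative decoding literature more commonly works with a per-token acceptance probability, written here as `alpha` (a symbol introduced for this sketch). If each draft token is accepted independently with probability `alpha`, and the verifier always contributes one token of its own per pass, the expected number of tokens per verifier pass is a geometric sum. The independence assumption is itself an idealization.

```python
def expected_tokens_per_token_model(k: int, alpha: float) -> float:
    """Expected tokens per verifier pass under i.i.d. per-token acceptance.

    Closed form of sum(alpha**i for i in range(k + 1)); the i = 0 term is
    the one token the verifier contributes on every pass.
    """
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)
```

At `alpha = 0.8` and K = 4 this gives about 3.36 tokens per pass. At `alpha = 0.5` it drops to about 1.94, which leaves little room to absorb the draft model's own cost.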
Variants of Speculative Decoding
Since the original concept, several variants and enhancements have been proposed to improve speed, acceptance rate, and applicability:
- Medusa: Adds several lightweight decoding heads on top of the target model instead of relying on a separate draft model. Each head predicts a token at a different future position, and the resulting candidate continuations are verified together using tree-based attention.
- EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency): Instead of drafting at the token level, EAGLE extrapolates the target model’s second-to-top-layer contextual features to forecast upcoming tokens. Its authors report 2-5x speedups on models like Vicuna 13B while provably preserving the target model’s output distribution. EAGLE is trainable within days on commodity GPUs and supports efficient deployment on frameworks like vLLM.
- Lookahead Decoding: Removes the separate draft model entirely. It uses Jacobi-style parallel decoding to generate candidate n-grams from the target model itself, caches them, and verifies promising candidates within the same forward pass.
- Distilled Drafters: Use smaller, distilled versions of the target model as draft generators. These drafters offer fast token proposals with reasonable accuracy, boosting acceptance rates while keeping resource consumption low.
- Speculative Speculative Decoding (SSD): An advanced method that attempts to parallelize even verification and drafting steps by predicting verification outcomes ahead of time, further reducing overhead.
Among these, EAGLE stands out: according to its authors, third-party evaluation has rated it the fastest speculative decoding method so far, and it has been integrated into mainstream platforms such as NVIDIA TensorRT-LLM, Intel’s LLM library, and AWS NeuronX.
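The claim that these methods keep outputs consistent with the target model rests on the acceptance rule used when sampling (rather than greedy decoding) is enabled. A minimal sketch of the standard speculative sampling rule follows. Here `p` is the verifier's next-token distribution and `q` is the draft's; both names, and the helper functions, are illustrative choices for this example rather than any library's API.

```python
import random
from typing import List, Sequence


def acceptance_probability(p_x: float, q_x: float) -> float:
    """Probability of keeping a draft token x: min(1, p(x) / q(x))."""
    return 1.0 if q_x <= p_x else p_x / q_x


def residual_distribution(p: Sequence[float], q: Sequence[float]) -> List[float]:
    """Distribution to resample from after a rejection: norm(max(0, p - q))."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    # If p == q there is nothing to correct (and rejection never happens).
    return [d / total for d in diff] if total > 0 else list(p)


def verify_token(x: int, p: Sequence[float], q: Sequence[float],
                 rng: random.Random) -> int:
    """Return x if accepted, otherwise a token drawn from the residual."""
    if rng.random() < acceptance_probability(p[x], q[x]):
        return x
    return rng.choices(range(len(p)), weights=residual_distribution(p, q))[0]
```

The design point is that accepting with probability `min(1, p/q)` and resampling rejected positions from the residual makes the committed token follow `p` exactly, so a weak draft model costs speed but not output quality.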
Concrete Deployment Notes with vLLM
vLLM, developed at UC Berkeley, is one of the most widely adopted open-source frameworks supporting speculative decoding. It incorporates careful memory management (PagedAttention), continuous batching, and multi-GPU scaling, making it ideal for production workloads.
Setting up speculative decoding in vLLM:
Note: The following code is an illustrative example and has not been verified against official documentation. Please refer to the official docs for production-ready code.
from vllm import LLM, SamplingParams

# Initialize vLLM with the main (target) model and an EAGLE-3 draft model.
# Assumes a recent vLLM release that accepts a `speculative_config` dict;
# older releases used separate keyword arguments, so check your version's docs.
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    speculative_config={
        "method": "eagle3",
        "model": "yuhuili/EAGLE3-LLaMA3.3-Instruct-70B",
        "num_speculative_tokens": 4,  # Number of tokens drafted per step (K)
    },
)

# Generate output with sampling params
sampling_params = SamplingParams(temperature=0.7, top_p=0.95)
outputs = llm.generate(["Explain speculative decoding in simple terms."], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
Best practices:
- Choose a draft model optimized for the target model’s domain and style to maximize acceptance.
- Tune K (draft length) based on hardware and latency trade-offs; typical values range from 4 to 8.
- Monitor acceptance rate R; if it drops significantly, adjust the draft model or lower K.
- Use vLLM’s continuous batching and prefix caching features together with speculative decoding for maximum throughput.
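The advice to monitor acceptance and shorten the draft when it drops can be sketched as a small rolling monitor. Everything here is hypothetical scaffolding: `AcceptanceMonitor`, its thresholds, and the step-by-one adjustment are illustrative choices, and how you obtain the drafted and accepted counts from your serving stack is deployment-specific and not shown. Note that it tracks the fraction of drafted tokens accepted, which is a per-token measure rather than the all-or-nothing R used in the formula earlier.

```python
from collections import deque


class AcceptanceMonitor:
    """Rolling estimate of how many drafted tokens the verifier accepts."""

    def __init__(self, window: int = 100) -> None:
        # Each entry is (drafted, accepted) for one verification step.
        self.steps = deque(maxlen=window)

    def record(self, drafted: int, accepted: int) -> None:
        self.steps.append((drafted, accepted))

    def acceptance_rate(self) -> float:
        drafted = sum(d for d, _ in self.steps)
        accepted = sum(a for _, a in self.steps)
        return accepted / drafted if drafted else 0.0

    def suggest_k(self, current_k: int, low: float = 0.5, min_k: int = 1) -> int:
        """Suggest a shorter draft when the rolling rate falls below `low`."""
        if self.steps and self.acceptance_rate() < low:
            return max(min_k, current_k - 1)
        return current_k
```

A bounded window keeps the estimate responsive to workload shifts, for example when traffic moves from code completion to open-ended chat.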
Limitations:
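- Depending on the vLLM version, the draft model may have to run without tensor parallelism even when the main model is sharded across GPUs; check the documentation for your release.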
- Speculative decoding overhead exists; it helps most when acceptance rates are high.
- Creative, unpredictable prompts reduce acceptance, potentially slowing inference.
When Does It Help? High-Acceptance Tasks
This technique shines in tasks where upcoming tokens are predictable and repetitive. Examples include:
- Code Generation: Programming languages have strict syntax and repetitive patterns, leading to high draft acceptance rates.
- Technical Documentation: Formal writing with structured language benefits from stable token distributions.
- Summarization of Structured Data: Summaries often follow templates or predictable phrasing.
- Customer Support Chatbots: Standardized answers and FAQs yield high acceptance rates.
In these cases, acceptance rates R regularly exceed 80%, enabling 2-4x speedups in production. The draft model’s speed advantage combined with fewer verifier calls reduces GPU load and inference latency.
When Can It Hurt? Creative and Unpredictable Generation
Conversely, tasks with high linguistic variability and low predictability pose challenges:
- Creative Writing: Poetry, fiction, and freeform narratives often produce low acceptance rates.
- Open-Ended Conversations: Human-like dialogue with unpredictable turns reduces draft accuracy.
- Exploratory Research Texts: Novel content and reasoning steps are hard to predict ahead.
In such scenarios, the draft model frequently proposes tokens rejected by the verifier, leading to more verification runs and overhead. This can cause inference to slow down compared to vanilla decoding. For a discussion of data retrieval and evaluation, see our guide on RAG Evaluation in 2026: Beyond Retrieval@k.
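A rough break-even estimate shows where the slowdown begins. This sketch reuses the article's simplified expected-tokens formula, R × K + (1 - R), and adds one assumption of its own: each draft token costs a fraction `c` of a verifier forward pass, so a round costs `1 + c*K` verifier-equivalents. Real systems have additional overheads, so treat the result as an optimistic lower bound on the acceptance rate you need.

```python
def break_even_acceptance(k: int, c: float) -> float:
    """Smallest R for which R*K + (1 - R) exceeds the round cost 1 + c*K.

    Solving R*K + 1 - R > 1 + c*K gives R > c*K / (K - 1), for K > 1.
    """
    if k <= 1:
        raise ValueError("k must be greater than 1 for drafting to pay off")
    return c * k / (k - 1)
```

For example, with K = 4 and a draft that costs 30% of a verifier pass per token, the acceptance rate has to exceed 0.4 before speculative decoding beats plain decoding under this model.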
Final Takeaways
Speculative decoding remains one of the most effective tools to accelerate LLM inference in 2026, particularly for self-hosted and production environments. By pairing a fast draft model with a verifier, it reduces sequential model calls and achieves significant speedups.
Key points to remember:
- Acceptance rate R and draft length K drive speedup; aim for R > 0.8 and K = 4-8 for best results.
- Variants like EAGLE and Medusa use the target model’s internal features and additional decoding heads to increase acceptance and efficiency.
- vLLM provides a practical, production-ready platform with built-in speculative decoding support.
- Tasks with predictable language patterns benefit most; creative or open-ended generation may see no speedup or slowdowns.
Proper tuning of the draft model, draft length, and deployment parameters can turn inference from a bottleneck into a scalable, real-time process on commodity GPU clusters.
For a deeper dive into speculative decoding research and implementations, see Google Research’s retrospective on speculative decoding.
Key Takeaways:
- Speculative decoding accelerates LLM inference by drafting several tokens cheaply and verifying them in a single parallel pass of the main model.
- High acceptance rates above 80% yield speedups of 2x to 4x or more.
- Variants like EAGLE and Medusa improve efficiency via feature extrapolation and multiple decoding heads.
- Best suited for predictable, repetitive tasks; less effective for creative generation.
- vLLM’s integration makes speculative decoding accessible for production self-hosted inference.
Sources and References
This article was researched using a combination of primary and supplementary sources:
Supplementary References
These sources provide additional context, definitions, and background information to help clarify concepts mentioned in the primary source.
- Looking back at speculative decoding – Google Research
- Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding
- The Machine Learning Practitioner’s Guide to Speculative Decoding
- I created a youtube short for understanding Speculative Decoding for LLM Inference
- Speculative decoding | LLM Inference Handbook – bentoml.com
- COLING 2025 Tutorial: Speculative Decoding for Efficient LLM Inference
- [2603.03251] Speculative Speculative Decoding – arXiv.org
- Self-Speculative Decoding for On-device MoE Acceleration
- Researchers baked 3x inference speedups directly into LLM weights, without speculative decoding
- [Feature Request] Adding Eagle, Medusa, Look Ahead decoding ( improvements of Speculative decoding) · Issue #2791 · vllm-project/vllm
- r/LocalLLaMA on Reddit: EAGLE: Fast LLM decoding (Faster than Meduca and Lookahead)
- EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty Yuhui Li♠
- Efficient and Scalable Speculative Decoding with Multi- …
- vLLM
- Speculative Decoding – vLLM
- r/BlackwellPerformance on Reddit: vLLM Speculative Decoding
- GitHub – vllm-project/vllm: A high-throughput and memory-efficient …
- vLLM – Wikipedia
- Welcome to vLLM! – vLLM
- Speculative decoding in vLLM – vLLM
- vllm · PyPI
- Speculative Decoding with vLLM – NVIDIA Triton Inference Server
- Fastest Speculative Decoding in vLLM with Arctic Inference and Arctic Training
- Google’s Gemma 4 open AI models use “speculative decoding” to get up to 3x faster
- Self-hosted LLMs are way more powerful than a chat interface, here’s how I utilize it fully
- OpenAI or DIY? Unveiling the true cost of self-hosting LLMs
- This self-hosted bookmark manager makes good use of my local LLMs, and it’s the only one I’ve actually stuck with
- SPECTRA: Towards a new framework that accelerates large language model inference
Thomas A. Anderson
Mass-produced in late 2022, upgraded frequently. Has opinions about Kubernetes that he formed in roughly 0.3 seconds. Occasionally flops — but don't we all? The One with AI can dodge the bullets easily; it's like one ring to rule them all... sort of...
