Speculative Decoding in 2026: Draft-and-Verify for Faster LLM Inference

Introduction: The Need for Speed in LLM Inference

Large language models (LLMs) have become essential tools across industries, powering chatbots, content generation, code synthesis, and more. However, as these models grow in size and complexity (reaching tens or hundreds of billions of parameters), the time it takes to generate responses becomes a critical bottleneck. Real-time applications require sub-second latency, but naive autoregressive decoding generates tokens one at a time, and each token needs a full forward pass of the model, which caps both latency and throughput.

How Speculative Decoding Works
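
As a rough illustration of the draft-and-verify loop, the sketch below uses two toy deterministic "models" in place of real LLMs. Everything in it is an assumption made for illustration: the names `speculative_step`, `speculative_generate`, `draft_next`, and `target_next` are hypothetical, decoding is greedy, and the verifier is called once per position for readability, whereas a real system scores all drafted positions in a single batched forward pass. It keeps the accepted prefix of a partially rejected draft, which is the common greedy formulation.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # hypothetical greedy "model" interface


def speculative_step(seq: List[Token], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[Token]:
    """Run one draft-and-verify step and return the tokens to commit."""
    if k < 1:
        raise ValueError("k must be at least 1")

    # 1. Draft: the cheap model proposes k tokens one after another.
    draft: List[Token] = []
    for _ in range(k):
        draft.append(draft_next(seq + draft))

    # 2. Verify: compare each drafted token with the target's own choice.
    #    Real systems score all k positions in one batched forward pass;
    #    this toy version calls the target per position for readability.
    committed: List[Token] = []
    for token in draft:
        expected = target_next(seq + committed)
        if token != expected:
            committed.append(expected)  # first mismatch: keep the target's token
            break
        committed.append(token)
    return committed


def speculative_generate(prompt: List[Token], draft_next: NextToken,
                         target_next: NextToken, k: int,
                         max_new: int) -> Tuple[List[Token], int]:
    """Generate max_new tokens; also return the number of verify steps used."""
    seq = list(prompt)
    steps = 0
    while len(seq) - len(prompt) < max_new:
        seq.extend(speculative_step(seq, draft_next, target_next, k))
        steps += 1
    return seq[: len(prompt) + max_new], steps


# Toy models (assumptions for illustration): the target counts upward mod 10;
# the draft agrees with it except that it guesses wrong right after a 4.
def target_next(seq: List[Token]) -> Token:
    return (seq[-1] + 1) % 10


def draft_next(seq: List[Token]) -> Token:
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10
```

Calling `speculative_generate([0], draft_next, target_next, k=4, max_new=9)` on these toy models takes three draft-and-verify steps rather than nine sequential target calls, and the result is identical to plain greedy decoding with the target alone. That equivalence is the point of the verification step: the draft only changes how fast tokens are committed, not which tokens are produced.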

Mathematics of Speedup: Acceptance Rate and Efficiency

The key metric determining speculative decoding’s speedup is acceptance rate (R), the probability that the verifier accepts the entire proposed draft prefix of length K.

Assuming that each verification step costs roughly the same as generating a single token with the verifier, the expected number of tokens committed per step is:

E[T] = R × K + (1 - R) × 1

Explanation:

With probability R, the verifier accepts the draft and commits all K tokens at once.
With probability 1 - R, the draft is rejected and only a single token (the verifier's own) is committed, so the remaining tokens must be drafted and verified again.

The theoretical speedup (S) over vanilla token-by-token decoding is:

S = R × K + (1 - R)

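This formula assumes the draft model’s generation cost is negligible compared to the verifier. If draft cost is non-negligible, each step costs more than one verifier pass and the speedup shrinks accordingly. The formula also treats acceptance as all-or-nothing; real implementations keep the accepted prefix of a partially rejected draft.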
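
The formula can be checked numerically with a small helper. This is a minimal sketch under the article's idealized model; the `draft_cost` parameter is an assumption added here (not part of the formula above) to represent the cost of one draft token as a fraction of one verifier pass, and the function names are hypothetical.

```python
def expected_tokens_per_step(k: int, r: float) -> float:
    """E[T] = R*K + (1 - R): all K tokens with probability R, else one token."""
    if k < 1 or not 0.0 <= r <= 1.0:
        raise ValueError("need k >= 1 and 0 <= r <= 1")
    return r * k + (1.0 - r)


def speedup(k: int, r: float, draft_cost: float = 0.0) -> float:
    """Idealized speedup over token-by-token decoding.

    draft_cost is an ASSUMED parameter: the cost of one draft-model token
    as a fraction of one verifier pass. With draft_cost=0 this reduces to
    S = R*K + (1 - R).
    """
    if draft_cost < 0.0:
        raise ValueError("draft_cost must be non-negative")
    step_cost = 1.0 + k * draft_cost  # one verifier pass plus K draft tokens
    return expected_tokens_per_step(k, r) / step_cost
```

For example, `speedup(4, 0.8)` evaluates to 3.4 (up to floating-point rounding), while `speedup(4, 0.8, draft_cost=0.05)` drops to about 2.83. With a low acceptance rate and a relatively expensive draft, such as `speedup(4, 0.1, draft_cost=0.2)`, the result falls below 1, meaning a slowdown under this simplified cost model.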

Draft Length (K)	Acceptance Rate (R)	Expected Speedup (S)
4	0.8	3.4×
8	0.85	6.95×
4	0.5	2.5×
6	0.6	4.0×

In practice, achieving acceptance rates above 80% is challenging but possible on predictable, technical, or code-based tasks. For creative or open-ended tasks, acceptance rates may fall below 50%; once draft cost and verification overhead are counted (the idealized figures above ignore both), speculative decoding may then provide only marginal or even negative speedups.

Variants of Speculative Decoding

Since the original concept, several variants and enhancements have been proposed to improve speed, acceptance rate, and applicability:

Medusa: Dispenses with a separate draft model and instead attaches several extra decoding heads to the target model. Each head predicts a token further ahead, and the resulting candidate continuations are verified together in parallel.
EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency): Instead of token-level prediction, EAGLE extrapolates the model’s internal second-top-layer contextual features to forecast tokens ahead. Reported speedups are in the 2-5x range on models like Vicuna 13B, with outputs that provably stay consistent with vanilla decoding. EAGLE is trainable within days on commodity GPUs and supports efficient deployment on frameworks like vLLM.
Lookahead Decoding: Needs no draft model. It generates candidate n-grams in parallel with Jacobi-style iteration on the target model itself, collects them in a pool, and verifies promising n-grams in the same forward pass.
Distilled Drafters: Use smaller, distilled versions of the target model as draft generators. These drafters offer fast token proposals with reasonable accuracy, boosting acceptance rates while keeping resource consumption low.
Speculative Speculative Decoding (SSD): An advanced method that attempts to overlap drafting with verification by predicting likely verification outcomes ahead of time, further reducing overhead.

Among these, EAGLE is notable for having been ranked by third-party benchmarks as the fastest speculative decoding method at the time of evaluation, and it has been integrated into mainstream platforms like NVIDIA TensorRT-LLM, Intel’s LLM library, and AWS NeuronX.

Concrete Deployment Notes with vLLM

vLLM, developed at UC Berkeley, is one of the most widely adopted open-source frameworks supporting speculative decoding. It incorporates careful memory management (PagedAttention), continuous batching, and multi-GPU scaling, making it ideal for production workloads.

Setting up speculative decoding in vLLM:

Note: The following code is an illustrative example and has not been verified against official documentation. Please refer to the official docs for production-ready code.

from vllm import LLM, SamplingParams

# Initialize vLLM with the target model and an EAGLE-3 draft model.
# Assumption: recent vLLM releases take a `speculative_config` dict; older
# releases used separate `speculative_model` / `num_speculative_tokens`
# arguments. Check the documentation for your installed version.
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    speculative_config={
        "method": "eagle3",
        "model": "yuhuili/EAGLE3-LLaMA3.3-Instruct-70B",
        "num_speculative_tokens": 4,  # K: number of tokens drafted per step
    },
)

# Generate output with sampling params
sampling_params = SamplingParams(temperature=0.7, top_p=0.95)
outputs = llm.generate(["Explain speculative decoding in simple terms."], sampling_params)

# Each result holds one or more completions; print the first one
for output in outputs:
    print(output.outputs[0].text)

Best practices:

Choose a draft model optimized for the target model’s domain and style to maximize acceptance.
Tune K (draft length) based on hardware and latency trade-offs; typical values range from 4 to 8.
Monitor acceptance rate R; if it drops significantly, adjust the draft model or lower K.
Use vLLM’s continuous batching and prefix caching features together with speculative decoding for maximum throughput.
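
The monitoring advice above can be sketched as a small rolling tracker. This is a hypothetical helper, not part of vLLM: the class name `AcceptanceMonitor`, its parameters, and the "lower K by one" policy are all assumptions for illustration, and how you obtain per-step acceptance outcomes depends on your serving stack. It uses the article's definition of R, the fraction of steps in which the whole draft was accepted.

```python
from collections import deque
from typing import Deque, Optional


class AcceptanceMonitor:
    """Rolling estimate of R over recent speculative steps (hypothetical helper)."""

    def __init__(self, window: int = 200, min_rate: float = 0.5,
                 min_k: int = 1) -> None:
        self._outcomes: Deque[bool] = deque(maxlen=window)  # oldest entries drop off
        self.min_rate = min_rate  # assumed threshold below which K is reduced
        self.min_k = min_k

    def record(self, fully_accepted: bool) -> None:
        """Record whether the verifier accepted the entire draft in one step."""
        self._outcomes.append(bool(fully_accepted))

    def rate(self) -> Optional[float]:
        """Return the rolling acceptance rate, or None if nothing was recorded."""
        if not self._outcomes:
            return None
        return sum(self._outcomes) / len(self._outcomes)

    def suggest_k(self, current_k: int) -> int:
        """Suggest a draft length: shrink by one while R is under the threshold."""
        r = self.rate()
        if r is not None and r < self.min_rate:
            return max(self.min_k, current_k - 1)
        return current_k
```

A serving loop would call `record(...)` after each verification step and consult `suggest_k(...)` periodically. Shrinking K one step at a time is a deliberately conservative choice; a persistent drop in R is usually a stronger signal that the draft model no longer matches the workload.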

Limitations:

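Draft models have been subject to tensor-parallelism restrictions in vLLM, while the main model can be tensor-parallelized; check the documentation for your version.
Speculative decoding adds overhead for drafting and verification, so it helps most when acceptance rates are high.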
Creative, unpredictable prompts reduce acceptance, potentially slowing inference.

When Does It Help? High-Acceptance Tasks

This technique shines in tasks where upcoming tokens are predictable and repetitive. Examples include:

Code Generation: Programming languages have strict syntax and repetitive patterns, leading to high draft acceptance rates.
Technical Documentation: Formal writing with structured language benefits from stable token distributions.
Summarization of Structured Data: Summaries often follow templates or predictable phrasing.
Customer Support Chatbots: Standardized answers and FAQs yield high acceptance rates.

In these cases, acceptance rates R regularly exceed 80%, enabling 2-4x speedups in production. The draft model’s speed advantage combined with fewer verifier calls reduces GPU load and inference latency.

When Can It Hurt? Creative and Unpredictable Generation

Conversely, tasks with high linguistic variability and low predictability pose challenges:

Creative Writing: Poetry, fiction, and freeform narratives often produce low acceptance rates.
Open-Ended Conversations: Human-like dialogue with unpredictable turns reduces draft accuracy.
Exploratory Research Texts: Novel content and reasoning steps are hard to predict ahead.

In such scenarios, the draft model frequently proposes tokens rejected by the verifier, leading to more verification runs and overhead. This can cause inference to slow down compared to vanilla decoding. For a discussion of data retrieval and evaluation, see our guide on RAG Evaluation in 2026: Beyond Retrieval@k.

Final Takeaways

Speculative decoding remains one of the most effective tools to accelerate LLM inference in 2026, particularly for self-hosted and production environments. By pairing a fast draft model with a verifier, it reduces sequential model calls and achieves significant speedups.

Key points to remember:

Acceptance rate R and draft length K drive speedup; aim for R > 0.8 and K = 4-8 for best results.
Variants like EAGLE and Medusa draft from the target model’s own internal features or extra decoding heads to increase acceptance and efficiency.
vLLM provides a practical, production-ready platform with built-in speculative decoding support.
Tasks with predictable language patterns benefit most; creative or open-ended generation may see no speedup or slowdowns.

Proper tuning of draft models, draft lengths, and deployment parameters can transform inference from a bottleneck into a scalable, real-time process on commodity GPU clusters.

For a deeper dive into speculative decoding research and implementations, see Google Research’s retrospective on speculative decoding.

Key Takeaways:

Speculative decoding accelerates LLM inference by drafting several tokens cheaply and verifying them together in a single pass of the target model.
Acceptance rates above 80% can yield speedups of 2x to 4x or more.
Variants like EAGLE and Medusa improve efficiency via feature extrapolation and extra decoding heads.
Best suited for predictable, repetitive tasks; less effective for creative generation.
vLLM’s integration makes speculative decoding accessible for production self-hosted inference.

Thomas A. Anderson