Speculative Decoding in Production LLM Inference 2026: EAGLE-3, Medusa, vLLM, 3× Throughput

Q: How is acceptance rate measured in production and what should the alert threshold be for detecting head drift or workload shift?

Acceptance rate is measured per speculative step as accepted_tokens / drafted_tokens, then averaged across steps and reported over a rolling window. The vLLM telemetry exposes this directly: each scheduler iteration emits a metric tuple of (sequence_id, drafted_tokens_this_step, accepted_tokens_this_step), and the aggregation layer rolls these up into time series per workload, per model version, and per context-length bucket. The right window length depends on traffic volume: a high-throughput production deployment can use a 5-minute rolling window and get smooth statistics, while a low-volume deployment may need a 1-hour or 6-hour window to collect enough samples per bucket.

The alert threshold should be relative to a baseline rather than absolute: the right framing is not "α dropped below 0.7" but "α dropped more than 5 percentage points below its 7-day moving baseline for this bucket". That framing catches both gradual drift (the head's training distribution diverging from the production traffic distribution over months) and sudden shifts (a model version change, a target re-quantisation, a fine-tune push). The bucketed dimensions that matter most are: context-length bucket (catches long-context head failure); temperature bucket (catches regressions on sampling-heavy workloads); model-version bucket (catches a deployment that pushed a new target without re-validating the head); and workload-tag bucket (catches a customer onboarding with a different prompt distribution).

The downstream metric is realised speedup, computed as observed_tokens_per_second / vanilla_equivalent_tokens_per_second, where the vanilla equivalent is estimated from the target forward-pass timing. It is the operational gauge that translates acceptance-rate movement into business impact. An alert on realised speedup dropping below a configurable floor (say 1.5×, below which speculation is no longer worth its operational complexity) is the right business-level alarm to pair with the technical acceptance-rate alarms.
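
The rolling-window measurement and the baseline-relative alert described above can be sketched in a few lines of Python. This is a minimal illustration, not vLLM's aggregation layer: `BucketMonitor`, `should_alert`, and `speedup_below_floor` are hypothetical names, one monitor instance is assumed per bucket, and the 7-day baseline is assumed to be computed elsewhere and passed in.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Optional, Tuple


@dataclass
class BucketMonitor:
    """Rolling acceptance-rate tracker for one bucket (hypothetical helper)."""

    window_seconds: float = 300.0  # 5-minute window; widen for low-volume buckets
    max_drop: float = 0.05         # alert when alpha falls > 5 points below baseline
    samples: Deque[Tuple[float, float]] = field(default_factory=deque)

    def record(self, timestamp: float, drafted: int, accepted: int) -> None:
        """Store accepted/drafted for one speculative step."""
        if drafted <= 0:
            return  # a step that drafted nothing carries no acceptance signal
        self.samples.append((timestamp, accepted / drafted))
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        """Drop steps that have aged out of the rolling window."""
        cutoff = now - self.window_seconds
        while self.samples and self.samples[0][0] < cutoff:
            self.samples.popleft()

    def rolling_alpha(self, now: float) -> Optional[float]:
        """Mean per-step acceptance rate over the window, or None if empty."""
        self._evict(now)
        if not self.samples:
            return None
        return sum(rate for _, rate in self.samples) / len(self.samples)

    def should_alert(self, now: float, baseline_alpha: float) -> bool:
        """True when rolling alpha sits more than max_drop below the baseline."""
        alpha = self.rolling_alpha(now)
        return alpha is not None and (baseline_alpha - alpha) > self.max_drop


def speedup_below_floor(observed_tps: float, vanilla_tps: float,
                        floor: float = 1.5) -> bool:
    """Business-level alarm: realised speedup under the configured floor."""
    return observed_tps / vanilla_tps < floor
```

An empty window returns `None` rather than 0.0, so a quiet bucket does not trigger a false drift alert; a real deployment would likely also require a minimum sample count before alerting.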

Q: When does speculative decoding hurt throughput rather than help it, and how does the team detect and disable it for those workloads automatically?

Speculative decoding hurts rather than helps when the speedup formula evaluates below 1.0 at the operating α, K, and c. From the formula (1 − α^(K+1)) / ((1 − α) × (K × c + 1)), this happens when α is low and K is large: at α = 0.4 and K = 8 with c = 0.1, the formula gives roughly 0.93, so speculation is net-negative. At α = 0.3 with c = 0.1, K = 8 gives about 0.79, which is clearly negative, while K = 4 gives about 1.02, break-even at best. The pathological case is a workload where the draft model is systematically poorly aligned with the target (a customer using a niche language or domain the draft was not trained for, a prompt structure that confuses the draft head, an unusually high sampling temperature).

The team detects these workloads by monitoring per-workload realised speedup and per-workload waste ratio (drafted_tokens_wasted / total_drafted_tokens). A waste ratio above 0.5 combined with realised speedup below 1.2 is the operational signature of a workload that speculation is failing. The automated disable mechanism in vLLM 0.8+ is two-tier: dynamic K reduces K toward 1 when rolling α drops, and a hard fallback to vanilla decoding fires when K = 1 still produces a realised speedup below 1.0 for an extended window.

The team should also expose a per-request hint mechanism (a `disable_speculation: true` flag the caller can set) so callers who know their workload is speculation-hostile can opt out without waiting for the runtime to detect the regression. The serving topology should include a vanilla-only lane for workloads that have been identified as persistently speculation-hostile: routing those workloads at the gateway level is cheaper than letting them ride through the speculative scheduler and waste draft tokens.
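
To make the break-even reasoning concrete, the formula and the detection signature can be written out as a short Python sketch. The function names are hypothetical, and `best_k` is an illustrative stand-in for a dynamic-K policy based only on the formula, not vLLM's actual scheduler logic. It assumes α is the per-token acceptance rate, K the number of drafted tokens per step, and c the cost of one draft pass relative to one target pass. The 0.5 and 1.2 thresholds are the ones quoted in the text.

```python
from typing import Tuple


def expected_speedup(alpha: float, k: int, c: float) -> float:
    """Evaluate (1 - alpha^(K+1)) / ((1 - alpha) * (K*c + 1))."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    if k < 0 or c < 0.0:
        raise ValueError("k and c must be non-negative")
    return (1.0 - alpha ** (k + 1)) / ((1.0 - alpha) * (k * c + 1.0))


def best_k(alpha: float, c: float, k_max: int = 8) -> Tuple[int, float]:
    """Pick the K in [0, k_max] with the highest predicted speedup.

    K = 0 stands for vanilla decoding (the formula is exactly 1.0 there),
    so a result of K = 0 means speculation should be switched off.
    """
    k = max(range(k_max + 1), key=lambda n: expected_speedup(alpha, n, c))
    return k, expected_speedup(alpha, k, c)


def is_speculation_hostile(total_drafted: int, drafted_wasted: int,
                           realised_speedup: float,
                           max_waste: float = 0.5,
                           min_speedup: float = 1.2) -> bool:
    """Signature from the text: waste ratio > 0.5 and realised speedup < 1.2."""
    if total_drafted <= 0:
        return False  # nothing drafted, so nothing to judge
    waste_ratio = drafted_wasted / total_drafted
    return waste_ratio > max_waste and realised_speedup < min_speedup
```

Including K = 0 in the search makes the vanilla fallback fall out of the same comparison. At K = 1 the formula reduces to (1 + α) / (1 + c), so under this model speculation cannot pay off at any K once α falls below c. The formula ignores scheduling and memory overhead, which is why the measured realised speedup, not the predicted one, should drive the final disable decision.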