Green AI: Cut Inference Cost 80% with Quantisation, Distillation, Speculative Decoding (2026)
April 28, 2026 · 20 min read
Tags: Green AI, LLM inference cost, quantisation, GPTQ, AWQ, INT4 quantisation, INT8 quantisation, FP8, speculative decoding, EAGLE-2, Medusa, distillation, continuous batching, paged attention, vLLM, TGI, TensorRT-LLM, SGLang, prefix caching, KV cache, model routing, spot GPU, MIG, carbon-aware scheduling, inference cost optimisation

Satyam
AI and cloud architect. Helping teams build systems that scale to millions.