LLM Quantization for Production Inference (2026)

Q: How much accuracy do you lose when quantizing an LLM?

For 8-bit quantization (FP8 or INT8) the accuracy loss is typically near zero and often within measurement noise on standard benchmarks, which is why 8-bit is considered close to free on supported hardware. For 4-bit weight quantization using modern methods like AWQ or GPTQ, the loss is usually in the 1 to 3 percent range on aggregate benchmarks, an excellent trade for roughly a 4x reduction in weight memory relative to FP16. Below 4-bit, at 3-bit and especially 2-bit, accuracy degrades sharply and unpredictably and the model is often unusable for production. The crucial caveat is that aggregate benchmark numbers hide where the damage lands: quantization error tends to concentrate in the hardest queries (long-tail, multilingual, code, and multi-step reasoning tasks), so a model that loses only 2 percent on average can lose far more on your specific hard cases. Always validate on your own evaluation set with production-like inputs rather than trusting a single public benchmark figure.
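
To make the "aggregate hides the damage" point concrete, here is a minimal sketch of a per-slice regression report. The slice names and all counts are invented for illustration, and the helper names are hypothetical; in practice the records would come from running your baseline and quantized models over your own evaluation set.

```python
from collections import defaultdict

def accuracy_by_slice(results):
    """results: iterable of (slice_name, is_correct) pairs -> {slice: accuracy}."""
    totals = defaultdict(lambda: [0, 0])  # slice -> [num_correct, num_total]
    for slice_name, is_correct in results:
        totals[slice_name][0] += int(is_correct)
        totals[slice_name][1] += 1
    return {name: correct / total for name, (correct, total) in totals.items()}

def overall_accuracy(results):
    """Accuracy over all records, ignoring slice labels."""
    return sum(int(ok) for _, ok in results) / len(results)

def drop_in_points(before, after):
    """Accuracy drop in percentage points, rounded to one decimal."""
    return round(100 * (before - after), 1)

def make_results(counts):
    """Expand {slice: (num_correct, num_total)} into per-example records."""
    records = []
    for name, (correct, total) in counts.items():
        records += [(name, True)] * correct + [(name, False)] * (total - correct)
    return records

# Hypothetical eval outcomes, invented for illustration only.
baseline = make_results({"general": (72, 80), "code": (8, 10), "multilingual": (8, 10)})
quantized = make_results({"general": (72, 80), "code": (7, 10), "multilingual": (7, 10)})

aggregate_drop = drop_in_points(overall_accuracy(baseline), overall_accuracy(quantized))
base_slices = accuracy_by_slice(baseline)
quant_slices = accuracy_by_slice(quantized)
slice_drops = {
    name: drop_in_points(base_slices[name], quant_slices[name]) for name in base_slices
}

print("aggregate drop (points):", aggregate_drop)
print("per-slice drop (points):", slice_drops)
```

With these made-up numbers the aggregate drop is 2 points while the code and multilingual slices each drop 10 points, which is the pattern a single headline figure would hide.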

Q: What is the difference between PTQ and QAT?

Post-training quantization (PTQ) takes a model that has already been fully trained and quantizes its weights directly, optionally using a small calibration dataset of a few hundred samples to choose the scaling factors that map high-precision values into the low-bit range. It requires no retraining, runs in minutes to hours, and is what the overwhelming majority of teams use because it is cheap and usually good enough at 8-bit and 4-bit. Quantization-aware training (QAT) instead simulates the effects of quantization during the training or fine-tuning process, inserting fake-quantization operations so the model learns weights that are robust to being quantized later. QAT recovers more accuracy, particularly at aggressive low bit-widths where PTQ struggles, but it costs a full training or fine-tuning run and the associated compute and data. The practical rule is to start with PTQ at your target bit-width, measure on your eval set, and only escalate to QAT if PTQ loses more accuracy than your budget allows.
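
As a rough illustration of the mechanics, the sketch below shows symmetric round-to-nearest quantization with an absmax scale, which is one simple way a PTQ scaling factor can be chosen, and the quantize-then-dequantize "fake quantization" step that QAT inserts into the forward pass. The function names and weight values are hypothetical, real methods such as AWQ and GPTQ are considerably more sophisticated, and the gradient handling QAT needs in the backward pass is omitted.

```python
def absmax_scale(values, bits=8):
    """Scale that maps the largest magnitude onto the largest signed integer code."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    return peak / qmax if peak > 0 else 1.0

def quantize(values, scale, bits=8):
    """Round to the nearest integer code and clamp to the symmetric range."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]

def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]

def fake_quantize(values, bits=8):
    """Quantize then immediately dequantize, so values stay floats but carry
    the rounding error the deployed low-bit model would see."""
    scale = absmax_scale(values, bits)
    return dequantize(quantize(values, scale, bits), scale)

def max_abs_error(original, approx):
    return max(abs(a - b) for a, b in zip(original, approx))

# Hypothetical weights, invented for illustration only.
weights = [0.42, -1.30, 0.07, 0.91, -0.55]

err_8bit = max_abs_error(weights, fake_quantize(weights, bits=8))
err_4bit = max_abs_error(weights, fake_quantize(weights, bits=4))

print("max abs error at 8-bit:", err_8bit)
print("max abs error at 4-bit:", err_4bit)
```

The 4-bit error is much larger than the 8-bit error on the same weights, which is why the escalation from PTQ to QAT usually only becomes worth considering at aggressive bit-widths.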

Q: What is the difference between AWQ, GPTQ, and GGUF?

All three produce low-bit (commonly 4-bit) models but they target different priorities and runtimes. AWQ, or Activation-aware Weight Quantization, identifies the small fraction of weights that are most salient to the model output and protects them during quantization, which yields strong 4-bit accuracy and fast GPU inference kernels, making it a popular choice for cost-bound GPU serving. GPTQ uses second-order (Hessian-based) information to quantize weights while minimising the introduced error layer by layer, and it has been a reliable 4-bit workhorse for GPU inference for some time. GGUF is the file format used by llama.cpp and is designed for efficient inference on CPUs and consumer GPUs, offering a family of quality levels such as Q4_K_M, Q5_K_M, and Q8_0 that trade size against accuracy, which makes it the default for local, edge, and CPU deployments. In short: AWQ and GPTQ for server-side GPU inference, GGUF for CPU and local runtimes, and you should benchmark the specific variant on your hardware because both accuracy and kernel speed differ.

Q: Does quantization make inference faster or just smaller?

It usually does both, and the speed benefit is often underappreciated. The obvious benefit is memory: 4-bit weights occupy roughly a quarter of the memory of FP16, so larger models fit on fewer or smaller GPUs. The less obvious benefit is throughput. Autoregressive token generation is typically bound by memory bandwidth rather than raw compute, because for each token the hardware must read the model weights from GPU memory; with fewer bytes per weight, less data moves per token and generation speeds up. There is a second, compounding effect at the serving layer: because quantized weights free GPU memory, you can devote more memory to the key-value cache, which lets the server batch more concurrent requests, and higher batch sizes raise aggregate throughput and lower cost per token substantially. The exact gains depend on hardware, kernel quality, and batch size, and at very small batch sizes the dequantization overhead can erode some of the win, so you should benchmark on your serving stack rather than assuming a fixed speedup.
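
The memory and bandwidth argument can be checked with back-of-the-envelope arithmetic. This is a sketch under stated assumptions: the 8-billion-parameter model, 24 GB GPU, and 1 TB/s bandwidth figures are hypothetical, the decode bound assumes every weight is read exactly once per token and ignores KV cache reads, compute, and kernel overhead, and real 4-bit formats store scales alongside the weights, so their effective bits per weight are somewhat above 4.

```python
def weight_bytes(num_params, bits_per_weight):
    """Memory occupied by the weights alone."""
    return num_params * bits_per_weight / 8

def decode_tokens_per_sec_bound(num_params, bits_per_weight, bandwidth_bytes_per_sec):
    """Upper bound on single-stream decode speed if generation is purely
    memory-bandwidth bound and all weights are read once per token."""
    return bandwidth_bytes_per_sec / weight_bytes(num_params, bits_per_weight)

def kv_cache_budget_bytes(gpu_bytes, num_params, bits_per_weight):
    """GPU memory left over for the KV cache after loading the weights."""
    return gpu_bytes - weight_bytes(num_params, bits_per_weight)

# Hypothetical deployment, invented for illustration only.
NUM_PARAMS = 8_000_000_000      # 8B-parameter model
GPU_BYTES = 24_000_000_000      # 24 GB of GPU memory
BANDWIDTH = 1_000_000_000_000   # 1 TB/s memory bandwidth

for label, bits in (("FP16", 16), ("4-bit", 4)):
    gb = weight_bytes(NUM_PARAMS, bits) / 1e9
    tps = decode_tokens_per_sec_bound(NUM_PARAMS, bits, BANDWIDTH)
    kv_gb = kv_cache_budget_bytes(GPU_BYTES, NUM_PARAMS, bits) / 1e9
    print(f"{label}: weights {gb:.0f} GB, decode bound {tps:.1f} tok/s, KV budget {kv_gb:.0f} GB")
```

Under these assumptions the weights shrink from 16 GB to 4 GB, the bandwidth-bound decode ceiling rises fourfold, and the memory available for the KV cache grows from 8 GB to 20 GB. Measured gains on a real serving stack will be lower than these ceilings.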

Q: Is FP8 better than INT8 for LLM inference?

It depends on your hardware and workload, and both are excellent 8-bit options that lose very little accuracy. FP8 is a floating-point 8-bit format with native hardware support on NVIDIA H100 and H200-class GPUs and their successors; because it preserves a floating-point exponent it handles the wide dynamic range of activations gracefully, which makes it well suited to quantizing both weights and activations with near-FP16 accuracy and very high throughput on supported GPUs. INT8 is an integer 8-bit format with much broader hardware support, including older GPUs and many accelerators, so it is the more portable choice when you are not on the latest NVIDIA hardware. If you are running on H100/H200-class GPUs, FP8 is usually the better default for high-throughput serving; if you need broad hardware compatibility or are on older accelerators, INT8 is the safer choice. As always, validate the specific format on your model and eval set, since the gap between them is small and workload-dependent.
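
The dynamic-range point can be illustrated with a small simulation. This is a sketch, not a hardware-accurate model: it assumes the E4M3 flavour of FP8 (4 exponent bits, 3 mantissa bits, largest finite value 448), applies no scaling on the FP8 side, and uses a single per-tensor absmax scale for INT8. The activation values are invented, with one deliberate outlier.

```python
import math

E4M3_MAX = 448.0      # largest finite value in the assumed E4M3 format
E4M3_MIN_EXP = -6     # exponent of the smallest normal binade
MANTISSA_BITS = 3

def round_to_e4m3(x):
    """Round x to the nearest value on a simulated E4M3 grid, saturating at the max."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so the binade exponent is e - 1.
    exp = max(math.frexp(mag)[1] - 1, E4M3_MIN_EXP)
    step = 2.0 ** (exp - MANTISSA_BITS)  # spacing between neighbours in this binade
    return sign * min(round(mag / step) * step, E4M3_MAX)

def round_to_int8(x, scale):
    """Symmetric INT8 quantize-dequantize with a shared per-tensor scale."""
    code = max(-127, min(127, round(x / scale)))
    return code * scale

def relative_error(x, approx):
    return abs(approx - x) / abs(x)

# Hypothetical activations with one large outlier, invented for illustration only.
activations = [0.02, 0.5, 3.0, 200.0]
int8_scale = max(abs(a) for a in activations) / 127

fp8_errors = [relative_error(a, round_to_e4m3(a)) for a in activations]
int8_errors = [relative_error(a, round_to_int8(a, int8_scale)) for a in activations]

for a, f, i in zip(activations, fp8_errors, int8_errors):
    print(f"value {a:>7}: FP8 rel. error {f:.3f}, INT8 rel. error {i:.3f}")
```

In this toy case the outlier forces a coarse INT8 step, so the two smallest values round to zero, while the simulated FP8 grid keeps every value within a few percent. Production INT8 pipelines reduce this effect with per-channel or per-token scales, so treat the gap shown here as an illustration of the mechanism rather than a measured difference.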

Q: How big a calibration dataset do I need for quantization?

Post-training quantization methods that use calibration typically need only a small dataset, commonly in the range of 128 to 512 samples, because calibration is used to estimate the distribution of activations and weights so the quantizer can choose good per-channel scaling factors, not to retrain the model. The size matters far less than the representativeness: the calibration samples should resemble your real production traffic in domain, language, and format. A model calibrated on generic English web text can quantize poorly for a system that serves legal documents, source code, or non-English inputs, because the activation statistics differ and the chosen scales will be miscalibrated for the inputs you actually see. So the practical guidance is to use a few hundred samples drawn from, or closely matching, your production distribution, and then confirm the result on a separate evaluation set. If you observe uneven degradation on a particular input type, adding representative samples of that type to the calibration set often helps.
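
Why representativeness matters more than size can be sketched with a toy per-channel calibration. The helper names and activation values below are hypothetical, and absmax range estimation is only one simple choice among those used in practice; the point is that ranges learned from mismatched data clip the inputs you actually serve.

```python
def calibrate_ranges(samples):
    """Per-channel absmax over the calibration samples.
    samples: list of activation vectors, one value per channel."""
    num_channels = len(samples[0])
    return [max(abs(s[c]) for s in samples) for c in range(num_channels)]

def scales_from_ranges(ranges, bits=8):
    """Per-channel scaling factors for symmetric quantization."""
    qmax = 2 ** (bits - 1) - 1
    return [r / qmax if r > 0 else 1.0 for r in ranges]

def clipped_fraction(samples, ranges):
    """Fraction of values that exceed the calibrated range and would be clipped."""
    total = clipped = 0
    for sample in samples:
        for value, limit in zip(sample, ranges):
            total += 1
            clipped += abs(value) > limit
    return clipped / total

# Hypothetical activations, invented for illustration only.
generic_calibration = [[0.5, 1.0, 0.2], [0.4, 1.2, 0.3], [0.6, 0.9, 0.1]]
production_traffic = [[0.5, 1.1, 2.5], [0.3, 1.0, 3.0]]  # channel 2 runs much hotter

generic_ranges = calibrate_ranges(generic_calibration)
matched_ranges = calibrate_ranges(production_traffic)

mismatch_clipping = clipped_fraction(production_traffic, generic_ranges)
matched_clipping = clipped_fraction(production_traffic, matched_ranges)

print("per-channel scales from generic data:", scales_from_ranges(generic_ranges))
print("clipped with generic calibration:", mismatch_clipping)
print("clipped with matched calibration:", matched_clipping)
```

Here a third of the production values fall outside the ranges learned from generic data, and none do once the calibration set matches production. Adding samples of the poorly handled input type widens exactly the channels that were being clipped.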

Q: When should I NOT quantize a model?

There are a few situations where quantization is the wrong call or should be applied cautiously. First, when accuracy is contractual or safety-critical and you cannot tolerate even a one to two percent regression on your hardest cases, the risk may outweigh the cost saving, or you should restrict yourself to near-lossless 8-bit formats rather than 4-bit. Second, when the model is already small — for example a 3-billion-parameter model that already fits comfortably on your GPU — the memory saving may not justify the accuracy risk, and you may be better served by a well-chosen small model at full precision. Third, when you are strictly latency-bound at very low batch sizes on hardware where the full-precision or FP8 model already fits in memory, since the throughput advantage of quantization shrinks in that regime and dequantization overhead can even hurt. In all cases the decision should be data-driven: measure the quantized model against your own evaluation set and latency and cost budgets, and quantize only when the trade clearly favours it.

Q: Can I combine quantization with other inference optimizations?

Yes, and combining them is how production teams reach the best cost and latency, but each lossy technique must be evaluated as you stack it. Quantization composes naturally with serving-layer optimizations that are lossless, such as paged-attention key-value cache management and continuous batching: quantized weights free GPU memory, which enlarges the feasible KV cache, which raises the achievable batch size and throughput, so these reinforce each other. Quantization also pairs well with speculative decoding, where a small quantized draft model proposes tokens that a larger model verifies, accelerating generation without changing what the larger model would have produced, provided the verification step is exact. Where you must be careful is stacking multiple lossy compression techniques (for example aggressive quantization on top of heavy pruning or distillation), because the accuracy losses compound and can silently push the model below your quality bar. The discipline is to introduce one lossy change at a time, re-run your evaluation set after each, and gate deployment on the result, while freely combining the lossless serving optimizations on top.
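
The one-change-at-a-time discipline can be sketched as a simple gate. Everything named here is hypothetical: the change names, the penalty table, and the toy evaluator stand in for a real run of your evaluation set, and the toy treats losses as additive, whereas real losses can compound in less predictable ways.

```python
def stack_lossy_changes(changes, evaluate, baseline_score, max_drop):
    """Try lossy changes in order, keeping each one only if the cumulative
    drop from the baseline stays within max_drop.
    evaluate: callable taking the list of active changes and returning a score."""
    accepted, log = [], []
    for change in changes:
        trial = accepted + [change]
        drop = baseline_score - evaluate(trial)
        ok = drop <= max_drop
        log.append((change, round(drop, 2), "accepted" if ok else "rejected"))
        if ok:
            accepted = trial
    return accepted, log

# Hypothetical per-change accuracy penalties in points, invented for illustration.
HYPOTHETICAL_PENALTY = {"int4_weights": 1.5, "pruning_30pct": 2.0, "fp8_kv_cache": 0.3}
BASELINE_SCORE = 90.0

def toy_evaluate(active_changes):
    """Stand-in for running the real evaluation set on the modified model."""
    return BASELINE_SCORE - sum(HYPOTHETICAL_PENALTY[c] for c in active_changes)

accepted, log = stack_lossy_changes(
    ["int4_weights", "pruning_30pct", "fp8_kv_cache"],
    toy_evaluate,
    baseline_score=BASELINE_SCORE,
    max_drop=2.0,
)

for entry in log:
    print(entry)
print("deployable stack:", accepted)
```

With a 2-point budget, the toy run keeps 4-bit weights, rejects pruning because the combined drop would exceed the budget, and still has room for the smaller third change. Measuring against the original baseline rather than the previous step is what stops small losses from accumulating unnoticed.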