GPUs dominate self-hosted inference cost. Quantization cuts memory 2-4x for a small accuracy hit: FP8, INT8, AWQ, GPTQ, GGUF, PTQ vs QAT, and when not to do it.