How to Deploy LLMs on Kubernetes: Production Guide (2026)
April 14, 2026 · 21 min read

Frequently Asked Questions
- What Kubernetes resources do I need to deploy an LLM?
- Which model serving framework should I use — vLLM, TensorRT-LLM, or Ollama?
- How do I autoscale LLM workloads on Kubernetes?
- How do I handle model updates without downtime on Kubernetes?
- How do I monitor LLM performance on Kubernetes?
- Should I scale LLM deployments to zero when idle?
- How do I serve multiple LLM models on the same Kubernetes cluster?
- What networking configuration does LLM serving require on Kubernetes?

Satyam — AI and cloud architect. Helps teams build systems that scale to millions of users.