# Enterprise LLM Gateway Architecture: Routing, Rate Limiting, and Observability

April 6, 2026 · 19 min read

Tags: llm gateway, ai gateway, llm routing, rate limiting, ai observability, semantic caching, enterprise ai, litellm, ai infrastructure, production ai

## Frequently Asked Questions

- What is an LLM gateway and how is it different from a standard API gateway?
- How much latency does an LLM gateway add to requests?
- How does token-based rate limiting work in practice?
- What is semantic caching and when does it deliver meaningful savings?
- How should organisations handle provider outages with an LLM gateway?
- Which open-source LLM gateway should we use?
- How do we manage API keys securely with an LLM gateway?
- What should a CTO track on an AI gateway dashboard?

By Satyam, AI and cloud architect. Helps teams build systems that scale to millions of users.