# Cost Engineering for LLM Features: From $100k to $1M Monthly Spend (2026)

May 14, 2026 · 19 min read · Satyam, AI & cloud architect helping build systems that scale to millions of users

Tags: llm cost engineering, finops for llm, unit economics, semantic cache, prompt routing, prompt compaction, kv cache reuse, prefill decode separation, speculative decoding, self hosted inference, vllm, reserved vs spot capacity, batch inference, cost attribution, budget gate, ai architecture, 2026

## Frequently Asked Questions

- Why does the $100k to $1M monthly LLM spend transition need a deliberate architecture rather than incremental optimisation?
- What does the five-layer architecture deliver, and why is the order (budget gate, semantic cache, router, compactor, inference) significant? See the sketch after this list.
- How is the semantic cache different from the provider-side prompt cache, and why does a mature deployment use both?
- How does the prompt compactor produce 15-25% savings, and what does the engineering investment actually look like?
- What does the 10k RPM unit-economics drill-down actually demonstrate about cost-engineering value?
- When should self-hosted inference (vLLM, SGLang, TensorRT-LLM) replace provider APIs, and what is the breakeven shape?
- What are the unglamorous cost levers that the headline conversation misses, and how much do they actually save?
- Why are unbudgeted features and unbounded output length the two most common avoidable causes of cost surprises?
- How do procurement decisions (reserved vs spot vs on-demand) interact with the application-layer architecture?
- What does the maturity ladder look like for cost engineering, and where do most LLM products sit in early 2026?
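The second question names a specific layer order: budget gate, semantic cache, router, compactor, inference. The intuition the ordering encodes is that each layer is cheaper to run than the one after it, so a request should fall through the layers in order of increasing cost. The following is a minimal Python sketch of that pipeline under stated assumptions; all function names, the exact-match cache, and the routing threshold are illustrative stand-ins, not the article's implementation.

```python
# Hypothetical sketch of the five-layer order: each layer is cheaper
# than the next, so it runs first and short-circuits when it can.
from dataclasses import dataclass


@dataclass
class Request:
    tenant: str
    prompt: str
    max_output_tokens: int = 512  # bounded output: a cheap cost lever


def budget_gate(req: Request, spent: dict, budget: dict) -> bool:
    # Layer 1: reject over-budget tenants before any tokens are spent.
    return spent.get(req.tenant, 0.0) < budget.get(req.tenant, float("inf"))


def semantic_cache_lookup(req: Request, cache: dict):
    # Layer 2: a hit avoids the model call entirely. Real systems match
    # on embedding similarity; exact match keeps this sketch simple.
    return cache.get(req.prompt)


def route(req: Request) -> str:
    # Layer 3: send cheap-to-serve prompts to a cheaper model.
    # The 2000-character threshold is an arbitrary illustrative value.
    return "small-model" if len(req.prompt) < 2000 else "large-model"


def compact(prompt: str) -> str:
    # Layer 4: shrink the prompt before paying prefill cost. Truncation
    # stands in for real compaction such as history summarisation.
    return prompt[-4000:]


def infer(model: str, prompt: str, max_tokens: int) -> str:
    # Layer 5: the only step that actually spends money (stubbed here).
    return f"[{model} response to {len(prompt)} chars, <= {max_tokens} tokens]"


def handle(req: Request, spent: dict, budget: dict, cache: dict) -> str:
    if not budget_gate(req, spent, budget):
        return "budget exceeded"  # free rejection, zero tokens spent
    if (hit := semantic_cache_lookup(req, cache)) is not None:
        return hit  # free answer from cache
    model = route(req)
    answer = infer(model, compact(req.prompt), req.max_output_tokens)
    cache[req.prompt] = answer
    return answer
```

Run in this order, a request that fails the budget gate costs nothing, a cache hit costs an index lookup, and only requests that survive all four upstream layers reach paid inference, with a compacted prompt and a bounded output length.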