Beyond NVIDIA: The 2026 AI Accelerator Landscape (Groq, Cerebras, Trainium, TPU, MI300, Tenstorrent)

April 25, 2026 · 19 min read

Tags: AI accelerators beyond NVIDIA, Groq LPU, Cerebras WSE-3, AWS Trainium2, AWS Inferentia2, Google TPU v5p, AMD MI300X, Tenstorrent Wormhole, LLM inference cost, Llama-70B benchmarks, CUDA alternatives, ROCm maturity, AI silicon 2026, multi-vendor AI infrastructure, tokens per second per dollar

Frequently Asked Questions

- Should I move all my AI workloads off NVIDIA in 2026?
- What does Groq actually offer that NVIDIA does not?
- Is Cerebras worth considering outside specialised research labs?
- How mature is AMD ROCm in 2026 compared to CUDA?
- When should I pick AWS Trainium over NVIDIA on AWS?
- Does TPU make sense for teams not already on Google Cloud?
- Is Tenstorrent ready for production deployments?
- How do I avoid getting locked into a single non-NVIDIA accelerator?
- Which accelerator wins on dollars per million output tokens for Llama-70B in 2026?
- What about quantisation: does it change the accelerator decision?

Author: Satyam. AI and cloud architect; helps teams build systems that scale to millions of users.