A systematic breakdown of the eight failure-mode categories that cause the majority of LLM production incidents (prompt reliability, retrieval quality, hallucination, latency, agent safety, guardrails, observability, and cost governance), with root causes, detection signals, and architectural responses for each.