# AI Incident Response Runbook: RCA for LLM Failures (2026)

May 12, 2026 · 26 min read

Tags: ai incident response, llm reliability, sre for llm, on-call, blameless rca, postmortem, detection signals, eval drift, guardrail trips, containment primitives, model rollback, prompt rollback, kill switch, incident commander, ai observability, ai architecture

## Frequently Asked Questions

- Why does the standard SRE incident response model not transfer cleanly to LLM systems?
- How do you grade LLM incident severity in a way that operates better than a generic three-tier model?
- What detection signals do you actually need to instrument for LLM systems beyond the standard infrastructure metrics?
- What containment primitives should be operable during the first 30-60 minutes of an LLM incident?
- What does the RCA template look like when extended for LLM failure classes, and how does it differ from a standard SRE postmortem template?
- How do you handle vendor-side incidents where the proximate cause is an Anthropic, OpenAI, or Google model update you do not control?
- How do you handle prompt injection and jailbreak incidents that are simultaneously security incidents and reliability incidents?
- How do you structure the on-call rota for LLM systems, given that the skill set differs from traditional SRE on-call?
- How do you communicate LLM incidents to affected users and to the public status page in a way that is both honest and appropriately bounded?
- What does Stage 4 maturity look like for LLM incident response, and what business outcomes does it produce?

**Satyam**, AI and cloud architect. Helps teams build systems that scale to millions.