When AI Fails in Production: A TPM Playbook

AI incidents look different in the ticket queue but they behave like any production incident once you strip the novelty. Something user-visible broke, you need scope, mitigation, communication, and a follow-up that prevents repeat. I run them with the same severity model as a payment outage, because trust outages compound the same way.

Step one is classify the failure mode: wrong output, unavailable service, policy violation, or data leak risk. Each mode has a different owner and a different immediate action. Treating every AI bug as a model bug sends the wrong team into the room.

Step two is containment. Kill switch, feature flag, cohort rollback, or prompt guardrail. I decide which lever exists before launch, not during the fire. At enterprise scale, the rollback path matters more than the root cause deck.

Step three is communication. Executives want certainty you may not have. I give them what I know, what I am doing in the next hour, and when the next update lands. Vague calm reads as cover-up. Specific uncertainty reads as control.

Step four is the learning loop. Post-incident review asks what signal I missed, not who missed it. Model incidents repeat when eval suites only test happy paths. I expand eval coverage every time production finds a class of failure I had not simulated.

Arsenii Samoilov is a Senior Technical Program Manager with 19+ years at Intuit, Atlassian, Adobe, Salesforce, Roku, and Apple.