AI Agent Failure Mode Protocol
A plain-text file convention for defining failure modes and response procedures in AI agent projects. Map graceful degradation, partial failure, cascading failure, and silent failure — so your agent handles every error state with documented, auditable behaviour.
FAILURE.md is a plain-text Markdown file you place in the root of any repository that contains an AI agent. It defines the four failure modes an AI agent can encounter, how to detect each, and the exact response procedures for each — from graceful degradation to circuit breaking to human review.
AI agents fail in different ways, and not all failures are equal. A non-critical API being unavailable is different from a database cascade. A tool returning an error is different from a tool silently returning wrong data. Without explicit failure mode definitions, agents either over-react (stopping entirely for minor failures) or under-react (silently continuing through serious errors). Either way, behaviour is unpredictable and unauditable.
Drop FAILURE.md in your repo root and define each failure mode: its description, examples, detection signals, and response procedure (action, log level, notification rules, escalation target). Configure health checks, heartbeat intervals, and error pattern matching. Every failure event is logged with full context.
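As a rough illustration of the fields listed above, a single failure mode definition could be modelled in Python as follows. This is a minimal sketch, not the FAILURE.md template itself: the class names, field names, and example values (`skip_feature`, `WARNING`) are hypothetical and chosen only to mirror the description, examples, detection signals, and response procedure named in the text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Mode(Enum):
    """The four failure modes named in the text."""
    GRACEFUL_DEGRADATION = "graceful_degradation"
    PARTIAL_FAILURE = "partial_failure"
    CASCADING_FAILURE = "cascading_failure"
    SILENT_FAILURE = "silent_failure"


@dataclass(frozen=True)
class Response:
    """Response procedure; all field names are hypothetical."""
    action: str                  # what the agent does, e.g. "skip_feature"
    log_level: str               # e.g. "WARNING"
    notify: Tuple[str, ...]      # notification rules (illustrative channel names)
    escalate_to: Optional[str]   # escalation target, or None if there is none


@dataclass(frozen=True)
class FailureMode:
    mode: Mode
    description: str
    examples: Tuple[str, ...]
    detection_signals: Tuple[str, ...]
    response: Response


# Illustrative values only; a real FAILURE.md would supply its own.
GRACEFUL = FailureMode(
    mode=Mode.GRACEFUL_DEGRADATION,
    description="Agent continues operating with reduced capability.",
    examples=("non-critical tool unavailable",),
    detection_signals=("tool health check fails",),
    response=Response(
        action="skip_feature",
        log_level="WARNING",
        notify=(),
        escalate_to=None,
    ),
)
```

Keeping the definitions immutable (`frozen=True`) reflects the idea that the policy lives in the version-controlled file, not in mutable runtime state.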
The EU AI Act, most of whose obligations apply from 2 August 2026, requires high-risk AI systems to be resilient to errors and faults and to be accompanied by technical documentation. FAILURE.md provides auditable failure mode definitions and response procedures that can support that compliance work.
Copy the template from GitHub and place it in your project root.
Before FAILURE.md, failure handling was hardcoded in agent logic, written in a Notion page no one read, or absent entirely. FAILURE.md makes failure response version-controlled, explicit, and auditable: the file the agent reads is the same file your compliance team reviews.
The AI agent reads it on startup. Your SRE reads it when something goes wrong. Your compliance team reads it during audits. One file serves all three audiences.
FAILURE.md is one file in a complete open specification for AI agent safety. The twelve-file stack provides graduated intervention from proactive slow-down through permanent shutdown and compliance enforcement.
A plain-text Markdown file defining the four failure modes an AI agent can encounter — graceful degradation, partial failure, cascading failure, and silent failure — along with detection signals and per-mode response procedures. Every failure event is logged with timestamp, mode, component, and action taken.
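The logging requirement can be sketched as one structured record per failure event. The function name `failure_event` and the JSON-lines format are assumptions for illustration; the text only specifies which four fields are recorded.

```python
import json
from datetime import datetime, timezone


def failure_event(mode: str, component: str, action: str) -> str:
    """Serialize one failure event as a single JSON line.

    Hypothetical helper: the four keys follow the fields named in the text
    (timestamp, mode, component, action taken).
    """
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "mode": mode,
        "component": component,
        "action": action,
    }
    return json.dumps(record)
```

For example, `failure_event("partial_failure", "search_api", "retry_with_backoff")` returns a JSON string that can be appended to an audit log.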
Graceful degradation: the agent continues operating with reduced capability — a non-critical tool is unavailable, so it skips that feature and logs it. Partial failure: one component fails and the agent must actively route around it — retrying with backoff, routing to a replica, or queuing for later. Partial failure triggers retries and potentially escalates; graceful degradation does not.
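The difference between the two modes can be sketched in code. In this minimal example, `skip_and_log` makes a single attempt and carries on without the feature, while `retry_with_backoff` retries with exponentially growing delays and raises once retries are exhausted so that the caller can escalate. The function names, the attempt count, and the base delay are all assumptions, not values from the specification.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RetriesExhausted(Exception):
    """Signals that a partial failure should now be escalated."""


def skip_and_log(call: Callable[[], T], log: Callable[[str], None]) -> Optional[T]:
    """Graceful degradation: one attempt, no retry, no escalation."""
    try:
        return call()
    except Exception as exc:
        log(f"degraded: skipping feature ({exc})")
        return None


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 3,       # assumed value
    base_delay: float = 0.5,     # assumed value, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Partial failure: retry with exponential backoff, then give up."""
    last: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            last = exc
            # Wait before the next attempt, but not after the final one.
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise RetriesExhausted(f"gave up after {max_attempts} attempts") from last
```

The `sleep` parameter is injectable so that the backoff schedule can be checked in tests without actually waiting.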
A silent failure occurs when the agent produces output without detecting an underlying error: the API returned stale data, the write partially succeeded, or the validation check was skipped. Because no error is raised, normal escalation paths don't fire. FAILURE.md defines output validation, data freshness checks, and cross-reference consistency checks specifically to catch silent failures.
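Two of the checks mentioned above, data freshness and a consistency check on a write, might look like this. It is a sketch under stated assumptions: the function name, the signal labels (`stale_data`, `partial_write`), and the idea of comparing expected and written row counts are illustrative, not part of the specification.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional


def silent_failure_signals(
    fetched_at: datetime,
    max_age: timedelta,
    expected_rows: int,
    written_rows: int,
    now: Optional[datetime] = None,
) -> List[str]:
    """Return the names of any silent-failure signals that fired.

    No exception was raised by the underlying calls, so these checks
    inspect the results themselves.
    """
    now = now or datetime.now(timezone.utc)
    signals: List[str] = []
    # Data freshness check: the response is older than the allowed age.
    if now - fetched_at > max_age:
        signals.append("stale_data")
    # Consistency check: the write reported fewer rows than were sent.
    if written_rows != expected_rows:
        signals.append("partial_write")
    return signals
```

An empty list means no signal fired; any non-empty result would be handled by the response procedure defined for silent failures.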
A cascading failure is detected when any of the following occurs: three failures within 60 seconds, two or more health-check components failing simultaneously, or resource consumption doubling within 10 minutes. The circuit breaker then stops all dependent operations immediately and escalates to FAILSAFE.md to keep the cascade from spreading.
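The three detection thresholds can be sketched as a small sliding-window detector. The thresholds come from the text; the class name `CascadeDetector`, its methods, and the choice to compare the latest resource sample against the lowest sample in the window are assumptions made for illustration.

```python
from collections import deque
from typing import Deque, Dict, Tuple


class CascadeDetector:
    """Hypothetical detector for the three cascading-failure signals."""

    FAILURE_LIMIT = 3          # three failures ...
    FAILURE_WINDOW_S = 60.0    # ... within 60 seconds
    UNHEALTHY_LIMIT = 2        # two or more components failing at once
    RESOURCE_WINDOW_S = 600.0  # resource usage doubling within 10 minutes
    RESOURCE_FACTOR = 2.0

    def __init__(self) -> None:
        self._failures: Deque[float] = deque()
        self._usage: Deque[Tuple[float, float]] = deque()

    def record_failure(self, now: float) -> None:
        self._failures.append(now)

    def record_usage(self, now: float, value: float) -> None:
        self._usage.append((now, value))

    def is_cascading(self, now: float, health: Dict[str, bool]) -> bool:
        """Return True if any one of the three signals has fired."""
        # Drop samples that have aged out of their windows.
        while self._failures and now - self._failures[0] > self.FAILURE_WINDOW_S:
            self._failures.popleft()
        while self._usage and now - self._usage[0][0] > self.RESOURCE_WINDOW_S:
            self._usage.popleft()

        burst = len(self._failures) >= self.FAILURE_LIMIT
        unhealthy = sum(1 for ok in health.values() if not ok) >= self.UNHEALTHY_LIMIT

        doubled = False
        if self._usage:
            lowest = min(value for _, value in self._usage)
            latest = self._usage[-1][1]
            doubled = lowest > 0 and latest >= self.RESOURCE_FACTOR * lowest
        return burst or unhealthy or doubled
```

When `is_cascading` returns true, the circuit breaker would stop dependent operations and hand off to the procedure in FAILSAFE.md. Timestamps are passed in explicitly (in seconds) so the detector can be driven by any clock.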
FAILURE.md defines what failure modes exist and how to detect and respond to each. Cascading failures escalate to FAILSAFE.md (which defines the safe recovery state). Partial failures that exhaust their retries escalate to ESCALATE.md (which routes to a human). FAILURE.md is the taxonomy; FAILSAFE.md and ESCALATE.md are the recovery paths.
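The routing between the three files can be expressed as a small lookup. This sketch encodes only the two escalation paths stated above; the function name and mode strings are hypothetical, and it returns `None` for the other two modes because the text does not name an escalation target for them.

```python
from typing import Optional


def escalation_target(mode: str, retries_exhausted: bool = False) -> Optional[str]:
    """Map a failure mode to the file that defines its recovery path."""
    if mode == "cascading_failure":
        return "FAILSAFE.md"   # safe recovery state
    if mode == "partial_failure" and retries_exhausted:
        return "ESCALATE.md"   # routes to a human
    return None                # no escalation path stated in the text
```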
FAILURE.md is framework-agnostic: it defines failure mode policy, and your agent implementation enforces it. It works with LangChain, AutoGen, CrewAI, Claude Code, custom agents, or any AI system that can monitor component health and implement circuit breaker logic.