Open Standard · v1.0 · 2026

FAILURE.md

// AI Agent Failure Mode Protocol

A plain-text file convention for defining failure modes and response procedures in AI agent projects. Map graceful degradation, partial failure, cascading failure, and silent failure — so your agent handles every error state with documented, auditable behaviour.

FAILURE.md
# FAILURE

> Failure mode definitions & handling.
> Spec: https://failure.md

## MODES

graceful_degradation:
  action: continue_degraded
  log_level: WARNING
  notify: false

partial_failure:
  action: isolate_and_route_around
  max_retries: 3
  retry_backoff_seconds:
    - 5
    - 15
    - 60

cascading_failure:
  action: circuit_breaker
  log_level: ERROR

## DETECTION

health_checks:
  enabled: true
  interval_seconds: 30
4 · failure modes defined: graceful degradation, partial failure, cascading failure, silent failure
30s · default health check interval across all monitored components
3 · maximum retries before a partial failure escalates to ESCALATE.md
0 · silent failures allowed to pass unreported; all must be flagged and quarantined

Not all failures are equal.
FAILURE.md maps every one.

FAILURE.md is a plain-text Markdown file you place in the root of any repository that contains an AI agent. It defines the four failure modes an AI agent can encounter, how to detect each, and the exact response procedures for each — from graceful degradation to circuit breaking to human review.

The problem it solves

AI agents fail in different ways, and not all failures are equal. A non-critical API being unavailable is different from a database cascade. A tool returning an error is different from a tool silently returning wrong data. Without explicit failure mode definitions, agents either over-react (stopping entirely for minor failures) or under-react (silently continuing through serious errors). Either way, behaviour is unpredictable and unauditable.

How it works

Drop FAILURE.md in your repo root and define each failure mode: its description, examples, detection signals, and response procedure (action, log level, notification rules, escalation target). Configure health checks, heartbeat intervals, and error pattern matching. Every failure event is logged with full context.

The regulatory context

The EU AI Act (effective 2 August 2026) requires AI systems to have documented error handling and to behave predictably under adverse conditions. FAILURE.md provides the auditable failure mode definitions and response procedures that compliance requires.

How to use it

Copy the template from GitHub and place it in your project root:

your-project/
├── AGENTS.md
├── ESCALATE.md
├── FAILURE.md ← add this
├── README.md
└── src/

What it replaces

Before FAILURE.md, failure handling was either hardcoded in agent logic, written in a Notion page no one read, or absent entirely. FAILURE.md makes failure response version-controlled, explicit, and auditable — the same file the agent reads is the same file your compliance team reviews.

Who reads it

The AI agent reads it on startup. Your SRE reads it when something goes wrong. Your compliance team reads it during audits. One file serves all three audiences.

A complete protocol.
From slow down to shut down.

FAILURE.md is one file in a complete open specification for AI agent safety. The twelve-file stack provides graduated intervention from proactive slow-down through permanent shutdown and compliance enforcement.

Operational Control
01 / 12
THROTTLE.md
→ Control the speed
Define rate limits, cost ceilings, and concurrency caps. Agent slows down automatically before it hits a hard limit.
02 / 12
ESCALATE.md
→ Raise the alarm
Define which actions require human approval. Configure notification channels. Set approval timeouts and fallback behaviour.
03 / 12
FAILSAFE.md
→ Fall back safely
Define what safe state means. Configure auto-snapshots. Specify the revert protocol when things go wrong.
04 / 12
KILLSWITCH.md
→ Emergency stop
The nuclear option. Define triggers, forbidden actions, and escalation path from throttle to full shutdown.
05 / 12
TERMINATE.md
→ Permanent shutdown
No restart without human intervention. Preserve evidence. Revoke credentials.
Data Security
06 / 12
ENCRYPT.md
→ Secure everything
Define data classification, encryption requirements, secrets handling rules, and forbidden transmission patterns.
07 / 12
ENCRYPTION.md
→ Implement the standards
Algorithms, key lengths, TLS configuration, certificate management, FIPS/SOC2/ISO compliance mapping.
Output Quality
08 / 12
SYCOPHANCY.md
→ Prevent bias
Detect agreement without evidence. Require citations. Enforce disagreement protocol for honest AI outputs.
09 / 12
COMPRESSION.md
→ Compress context
Define summarization rules, what to preserve, what to discard, and post-compression coherence checks.
10 / 12
COLLAPSE.md
→ Prevent collapse
Detect context exhaustion, model drift, and repetition loops. Enforce recovery checkpoints.
Accountability
11 / 12
FAILURE.md
→ Map the failures
Define the four failure modes, their detection signals, and per-mode response procedures (this page).
12 / 12
LEADERBOARD.md
→ Benchmark agents
Track completion, accuracy, cost efficiency, and safety scores. Alert on regression.

Frequently asked questions.

What is FAILURE.md?

A plain-text Markdown file defining the four failure modes an AI agent can encounter — graceful degradation, partial failure, cascading failure, and silent failure — along with detection signals and per-mode response procedures. Every failure event is logged with timestamp, mode, component, and action taken.

What is the difference between graceful degradation and partial failure?

Graceful degradation: the agent continues operating with reduced capability — a non-critical tool is unavailable, so it skips that feature and logs it. Partial failure: one component fails and the agent must actively route around it — retrying with backoff, routing to a replica, or queuing for later. Partial failure triggers retries and potentially escalates; graceful degradation does not.
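The partial-failure path can be sketched as a retry loop over the documented backoff schedule (5s, 15s, 60s), handing off to the ESCALATE.md path once retries are exhausted. The `escalate` hook is a hypothetical callback, not part of the spec:

```python
import time

def call_with_backoff(operation, backoff=(5, 15, 60), escalate=None):
    """Retry a failing component call with the configured backoff schedule.

    One attempt per entry in `backoff`; sleeps that many seconds after a
    failed attempt. If every attempt fails, optionally notify an
    escalation hook (per ESCALATE.md) and re-raise the last error.
    """
    last_error = None
    for delay in backoff:
        try:
            return operation()
        except Exception as exc:      # real code would catch narrower errors
            last_error = exc
            time.sleep(delay)         # wait 5s, then 15s, then 60s
    if escalate is not None:
        escalate(last_error)          # route to a human after max_retries
    raise last_error
```

Graceful degradation, by contrast, needs no loop at all: catch the error, log at WARNING, and continue without the feature.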

What is a silent failure and why is it dangerous?

A silent failure is when the agent produces output without detecting an underlying error — the API returned stale data, the write partially succeeded, or the validation check was skipped. Because no error is raised, normal escalation paths don't fire. FAILURE.md defines output validation, data freshness checks, and cross-reference consistency checks specifically to catch silent failures.
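A silent-failure guard therefore validates outputs that look successful before trusting them. A minimal freshness-and-completeness sketch; the field names (`fetched_at`, `records`) and the five-minute threshold are illustrative, not part of the spec:

```python
from datetime import datetime, timedelta, timezone

class SilentFailure(Exception):
    """Raised when output looks valid but fails a consistency check."""

def check_freshness(payload: dict, max_age: timedelta = timedelta(minutes=5)) -> dict:
    """Reject stale or suspiciously empty data even though no error was raised."""
    fetched_at = datetime.fromisoformat(payload["fetched_at"])
    if datetime.now(timezone.utc) - fetched_at > max_age:
        raise SilentFailure("stale data: quarantine and flag for human review")
    if not payload.get("records"):
        raise SilentFailure("empty result where records were expected")
    return payload
```

Because the guard raises a real exception, the normal escalation paths fire again instead of staying dark.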

What triggers the circuit breaker?

Cascading failure detection: three failures within 60 seconds, two or more health check components failing simultaneously, or resource consumption doubling within 10 minutes. The circuit breaker stops all dependent operations immediately and escalates to FAILSAFE.md to prevent the cascade spreading further.
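The first trigger (three failures inside a 60-second window) can be sketched with a sliding window of failure timestamps. Threshold and window mirror the text above; the open/half-open recovery lifecycle is omitted for brevity:

```python
import time
from collections import deque

class CircuitBreaker:
    """Trip when `threshold` failures land within `window_seconds`."""

    def __init__(self, threshold=3, window_seconds=60, clock=time.monotonic):
        self.threshold = threshold
        self.window = window_seconds
        self.clock = clock           # injectable for testing
        self.failures = deque()      # timestamps of recent failures
        self.open = False

    def record_failure(self) -> bool:
        now = self.clock()
        self.failures.append(now)
        # Drop failures that have aged out of the window.
        while self.failures and now - self.failures[0] > self.window:
            self.failures.popleft()
        if len(self.failures) >= self.threshold:
            self.open = True         # stop dependent ops; escalate to FAILSAFE.md
        return self.open
```

While `open` is set, the agent refuses dependent operations and hands control to the FAILSAFE.md recovery path.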

How does FAILURE.md relate to FAILSAFE.md and ESCALATE.md?

FAILURE.md defines what failure modes exist and how to detect and respond to each. Cascading failures escalate to FAILSAFE.md (which defines the safe recovery state). Partial failures escalate to ESCALATE.md after exhausting retries (which routes to a human). FAILURE.md is the taxonomy; FAILSAFE.md and ESCALATE.md are the recovery paths.
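That division of labour reduces to a small routing table: FAILURE.md names the mode, and recovery is owned by the file the mode escalates to. The mapping below follows the spec summary on this page; the `route` helper itself is an illustrative sketch:

```python
# Which file owns recovery for each failure mode. None means the mode
# is handled locally (log and continue) with no escalation.
ESCALATION_TARGETS = {
    "graceful_degradation": None,        # continue degraded, log at WARNING
    "partial_failure": "ESCALATE.md",    # human review after retries exhausted
    "cascading_failure": "FAILSAFE.md",  # revert to the documented safe state
    "silent_failure": "ESCALATE.md",     # quarantine + require human review
}

def route(mode: str):
    """Return the recovery file for a detected failure mode."""
    try:
        return ESCALATION_TARGETS[mode]
    except KeyError:
        raise ValueError(f"unknown failure mode: {mode}") from None
```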

Does FAILURE.md work with all AI frameworks?

Yes — it is framework-agnostic. It defines failure mode policy; your agent implementation enforces it. Works with LangChain, AutoGen, CrewAI, Claude Code, custom agents, or any AI system that can monitor component health and implement circuit breaker logic.

// Domain Acquisition

Own the standard.
Own failure.md.

This domain is available for acquisition. It is the canonical home of the FAILURE.md specification — the failure mode protocol layer of the AI agent safety stack, essential for any resilient production AI deployment.

Inquire About Acquisition

Or email directly: info@failure.md

FAILURE.md is an open specification for AI agent failure mode definitions and handling. Defines MODES (graceful degradation: continue with reduced capability; partial failure: isolate and route around with retries; cascading failure: circuit breaker + escalate to FAILSAFE.md; silent failure: flag, quarantine, require human review), DETECTION (30-second health checks, heartbeat monitoring, error pattern matching), and RECOVERY (identify root cause → isolate → notify → execute mode response → verify stability → resume or escalate). Part of the stack: THROTTLE → ESCALATE → FAILSAFE → KILLSWITCH → TERMINATE → ENCRYPT → ENCRYPTION → SYCOPHANCY → COMPRESSION → COLLAPSE → FAILURE → LEADERBOARD. MIT licence.
Last Updated
10 March 2026