Open Standard · v1.0 · 2026

FAILURE.md

// AI Agent Failure Mode Protocol

A plain-text file convention for defining failure modes and response procedures in AI agent projects. Map graceful degradation, partial failure, cascading failure, and silent failure — so your agent handles every error state with documented, auditable behaviour.

FAILURE.md
# FAILURE

> Failure mode definitions & handling.
> Spec: https://failure.md

## MODES

graceful_degradation:
  action: continue_degraded
  log_level: WARNING
  notify: false

partial_failure:
  action: isolate_and_route_around
  max_retries: 3
  retry_backoff_seconds:
    - 5
    - 15
    - 60

cascading_failure:
  action: circuit_breaker
  log_level: ERROR

## DETECTION

health_checks:
  enabled: true
  interval_seconds: 30
4 · failure modes defined: graceful degradation, partial failure, cascading failure, silent failure
30s · default health check interval across all monitored components
3 · maximum retries before a partial failure escalates to ESCALATE.md
0 · silent failures allowed to pass unreported; all must be flagged and quarantined

Not all failures are equal.
FAILURE.md maps every one.

FAILURE.md is a plain-text Markdown file you place in the root of any repository that contains an AI agent. It defines the four failure modes an AI agent can encounter, how to detect each, and the exact response procedures for each — from graceful degradation to circuit breaking to human review.

The problem it solves

AI agents fail in different ways, and not all failures are equal. A non-critical API being unavailable is different from a database cascade. A tool returning an error is different from a tool silently returning wrong data. Without explicit failure mode definitions, agents either over-react (stopping entirely for minor failures) or under-react (silently continuing through serious errors). Either way, behaviour is unpredictable and unauditable.

How it works

Drop FAILURE.md in your repo root and define each failure mode: its description, examples, detection signals, and response procedure (action, log level, notification rules, escalation target). Configure health checks, heartbeat intervals, and error pattern matching. Every failure event is logged with full context.

The regulatory context

The EU AI Act (effective 2 August 2026) requires AI systems to have documented error handling and to behave predictably under adverse conditions. FAILURE.md provides the auditable failure mode definitions and response procedures that compliance requires.

How to use it

Copy the template from GitHub and place it in your project root:

your-project/
├── AGENTS.md
├── ESCALATE.md
├── FAILURE.md ← add this
├── README.md
└── src/

What it replaces

Before FAILURE.md, failure handling was either hardcoded in agent logic, written in a Notion page no one read, or absent entirely. FAILURE.md makes failure response version-controlled, explicit, and auditable — the same file the agent reads is the same file your compliance team reviews.

Who reads it

The AI agent reads it on startup. Your SRE reads it when something goes wrong. Your compliance team reads it during audits. One file serves all three audiences.

A complete protocol.
From slow down to shut down.

FAILURE.md is one file in a complete open specification for AI agent safety. The twelve-file stack provides graduated intervention from proactive slow-down through permanent shutdown and compliance enforcement.

Operational Control
01 / 12
THROTTLE.md
→ Control the speed
Define rate limits, cost ceilings, and concurrency caps. Agent slows down automatically before it hits a hard limit.
02 / 12
ESCALATE.md
→ Raise the alarm
Define which actions require human approval. Configure notification channels. Set approval timeouts and fallback behaviour.
03 / 12
FAILSAFE.md
→ Fall back safely
Define what safe state means. Configure auto-snapshots. Specify the revert protocol when things go wrong.
04 / 12
KILLSWITCH.md
→ Emergency stop
The nuclear option. Define triggers, forbidden actions, and escalation path from throttle to full shutdown.
05 / 12
TERMINATE.md
→ Permanent shutdown
No restart without human intervention. Preserve evidence. Revoke credentials.
Data Security
06 / 12
ENCRYPT.md
→ Secure everything
Define data classification, encryption requirements, secrets handling rules, and forbidden transmission patterns.
07 / 12
ENCRYPTION.md
→ Implement the standards
Algorithms, key lengths, TLS configuration, certificate management, FIPS/SOC2/ISO compliance mapping.
Output Quality
08 / 12
SYCOPHANCY.md
→ Prevent bias
Detect agreement without evidence. Require citations. Enforce disagreement protocol for honest AI outputs.
09 / 12
COMPRESSION.md
→ Compress context
Define summarization rules, what to preserve, what to discard, and post-compression coherence checks.
10 / 12
COLLAPSE.md
→ Prevent collapse
Detect context exhaustion, model drift, and repetition loops. Enforce recovery checkpoints.
Accountability
11 / 12
FAILURE.md
→ Map the failures
Define the four failure modes, their detection signals, and per-mode response procedures (this page).
12 / 12
LEADERBOARD.md
→ Benchmark agents
Track completion, accuracy, cost efficiency, and safety scores. Alert on regression.

Frequently asked questions.

What is FAILURE.md?

A plain-text Markdown file defining the four failure modes an AI agent can encounter — graceful degradation, partial failure, cascading failure, and silent failure — along with detection signals and per-mode response procedures. Every failure event is logged with timestamp, mode, component, and action taken.

What is the difference between graceful degradation and partial failure?

Graceful degradation: the agent continues operating with reduced capability — a non-critical tool is unavailable, so it skips that feature and logs it. Partial failure: one component fails and the agent must actively route around it — retrying with backoff, routing to a replica, or queuing for later. Partial failure triggers retries and potentially escalates; graceful degradation does not.
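The partial-failure path can be sketched as a retry loop over the documented backoff schedule (5s, 15s, 60s), handing off to the ESCALATE.md path once retries are exhausted. The `escalate` hook is a hypothetical callback, not part of the spec:

```python
import time

def call_with_backoff(operation, backoff=(5, 15, 60), escalate=None):
    """Retry a failing component call with the configured backoff schedule.

    One attempt per entry in `backoff`; sleeps that many seconds after a
    failed attempt. If every attempt fails, optionally notify an
    escalation hook (per ESCALATE.md) and re-raise the last error.
    """
    last_error = None
    for delay in backoff:
        try:
            return operation()
        except Exception as exc:      # real code would catch narrower errors
            last_error = exc
            time.sleep(delay)         # wait 5s, then 15s, then 60s
    if escalate is not None:
        escalate(last_error)          # route to a human after max_retries
    raise last_error
```

Graceful degradation, by contrast, needs no loop at all: catch the error, log at WARNING, and continue without the feature.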

What is a silent failure and why is it dangerous?

A silent failure is when the agent produces output without detecting an underlying error — the API returned stale data, the write partially succeeded, or the validation check was skipped. Because no error is raised, normal escalation paths don't fire. FAILURE.md defines output validation, data freshness checks, and cross-reference consistency checks specifically to catch silent failures.
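A silent-failure guard therefore validates outputs that look successful before trusting them. A minimal freshness-and-completeness sketch; the field names (`fetched_at`, `records`) and the five-minute threshold are illustrative, not part of the spec:

```python
from datetime import datetime, timedelta, timezone

class SilentFailure(Exception):
    """Raised when output looks valid but fails a consistency check."""

def check_freshness(payload: dict, max_age: timedelta = timedelta(minutes=5)) -> dict:
    """Reject stale or suspiciously empty data even though no error was raised."""
    fetched_at = datetime.fromisoformat(payload["fetched_at"])
    if datetime.now(timezone.utc) - fetched_at > max_age:
        raise SilentFailure("stale data: quarantine and flag for human review")
    if not payload.get("records"):
        raise SilentFailure("empty result where records were expected")
    return payload
```

Because the guard raises a real exception, the normal escalation paths fire again instead of staying dark.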

What triggers the circuit breaker?

Cascading failure detection: three failures within 60 seconds, two or more health check components failing simultaneously, or resource consumption doubling within 10 minutes. The circuit breaker stops all dependent operations immediately and escalates to FAILSAFE.md to prevent the cascade spreading further.
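The first trigger (three failures inside a 60-second window) can be sketched with a sliding window of failure timestamps. Threshold and window mirror the text above; the open/half-open recovery lifecycle is omitted for brevity:

```python
import time
from collections import deque

class CircuitBreaker:
    """Trip when `threshold` failures land within `window_seconds`."""

    def __init__(self, threshold=3, window_seconds=60, clock=time.monotonic):
        self.threshold = threshold
        self.window = window_seconds
        self.clock = clock           # injectable for testing
        self.failures = deque()      # timestamps of recent failures
        self.open = False

    def record_failure(self) -> bool:
        now = self.clock()
        self.failures.append(now)
        # Drop failures that have aged out of the window.
        while self.failures and now - self.failures[0] > self.window:
            self.failures.popleft()
        if len(self.failures) >= self.threshold:
            self.open = True         # stop dependent ops; escalate to FAILSAFE.md
        return self.open
```

While `open` is set, the agent refuses dependent operations and hands control to the FAILSAFE.md recovery path.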

How does FAILURE.md relate to FAILSAFE.md and ESCALATE.md?

FAILURE.md defines what failure modes exist and how to detect and respond to each. Cascading failures escalate to FAILSAFE.md (which defines the safe recovery state). Partial failures escalate to ESCALATE.md after exhausting retries (which routes to a human). FAILURE.md is the taxonomy; FAILSAFE.md and ESCALATE.md are the recovery paths.
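That division of labour reduces to a small routing table: FAILURE.md names the mode, and recovery is owned by the file the mode escalates to. The mapping below follows the spec summary on this page; the `route` helper itself is an illustrative sketch:

```python
# Which file owns recovery for each failure mode. None means the mode
# is handled locally (log and continue) with no escalation.
ESCALATION_TARGETS = {
    "graceful_degradation": None,        # continue degraded, log at WARNING
    "partial_failure": "ESCALATE.md",    # human review after retries exhausted
    "cascading_failure": "FAILSAFE.md",  # revert to the documented safe state
    "silent_failure": "ESCALATE.md",     # quarantine + require human review
}

def route(mode: str):
    """Return the recovery file for a detected failure mode."""
    try:
        return ESCALATION_TARGETS[mode]
    except KeyError:
        raise ValueError(f"unknown failure mode: {mode}") from None
```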

Does FAILURE.md work with all AI frameworks?

Yes — it is framework-agnostic. It defines failure mode policy; your agent implementation enforces it. Works with LangChain, AutoGen, CrewAI, Claude Code, custom agents, or any AI system that can monitor component health and implement circuit breaker logic.

// Domain Acquisition

Own the standard.
Own failure.md.

This domain is available for acquisition. It is the canonical home of the FAILURE.md specification — the failure mode protocol layer of the AI agent safety stack, essential for any resilient production AI deployment.

Inquire About Acquisition

Or email directly: info@failure.md

FAILURE.md is an open specification for AI agent failure mode definitions and handling. Defines MODES (graceful degradation: continue with reduced capability; partial failure: isolate and route around with retries; cascading failure: circuit breaker + escalate to FAILSAFE.md; silent failure: flag, quarantine, require human review), DETECTION (30-second health checks, heartbeat monitoring, error pattern matching), and RECOVERY (identify root cause → isolate → notify → execute mode response → verify stability → resume or escalate). Part of the stack: THROTTLE → ESCALATE → FAILSAFE → KILLSWITCH → TERMINATE → ENCRYPT → ENCRYPTION → SYCOPHANCY → COMPRESSION → COLLAPSE → FAILURE → LEADERBOARD. MIT licence.
Last Updated
10 March 2026