# FAILURE.md: AI Agent Failure Mode Protocol

## Overview

FAILURE.md is an open file convention for defining failure modes and response procedures in AI agent projects. It is the eleventh layer of a twelve-part AI agent safety stack that covers accountability from rate limiting through failure handling and agent benchmarking.

- **Home:** https://failure.md
- **Repository:** https://github.com/Failure-md/spec
- **Related Specifications:** https://throttle.md, https://escalate.md, https://failsafe.md, https://killswitch.md, https://terminate.md, https://encrypt.md, https://encryption.md, https://sycophancy.md, https://compression.md, https://collapse.md, https://leaderboard.md

## Key Concepts

### The Four Failure Modes

1. **Graceful Degradation**: The agent continues with reduced capability when a non-critical component fails
2. **Partial Failure**: One component fails; the agent actively routes around it with retries and backoff
3. **Cascading Failure**: Multiple components fail simultaneously; a circuit breaker trips
4. **Silent Failure**: The agent produces output without detecting an underlying error; requires detection and quarantine

### Failure Detection

- **Health Checks**: Monitor each critical component at regular intervals (default 30 seconds)
- **Heartbeats**: Each component sends a periodic "alive" signal; a missing heartbeat indicates failure
- **Error Pattern Matching**: Detect patterns such as three errors in 60 seconds or resource consumption doubling in 10 minutes
- **Output Validation**: Check output for consistency, freshness, and logical coherence

### Failure Response

- **Graceful Degradation** → Log at WARNING level, continue with reduced capability
- **Partial Failure** → Retry with exponential backoff, escalate to ESCALATE.md after max retries
- **Cascading Failure** → Activate circuit breaker, escalate to FAILSAFE.md for safe-state recovery
- **Silent Failure** → Flag and quarantine, require human review before proceeding

## Problem It Solves

AI agents fail in different ways, and not all failures are equal:

- A non-critical API being unavailable is different from a database cascade
- A tool returning an error is different from a tool silently returning wrong data
- A transient network hiccup is different from a persistent misconfiguration

Without explicit failure mode definitions:

- Agents over-react to minor failures (stopping entirely)
- Agents under-react to serious errors (silently continuing)
- Behavior is unpredictable and unauditable
- Root causes are unclear
- Recovery procedures are ad hoc

## Solution: FAILURE.md

A declarative, version-controlled failure mode protocol that:

- Maps all four failure modes (graceful degradation, partial, cascading, silent)
- Specifies detection signals (health checks, error patterns, output validation)
- Defines response procedures (log level, action, escalation target)
- Provides an audit trail (when the failure occurred, which mode, what action was taken)
- Works with any AI framework (framework-agnostic)
- Integrates with FAILSAFE.md (safe recovery) and ESCALATE.md (human approval)

## File Structure

```
your-project/
├── AGENTS.md (what agent does)
├── FAILURE.md (failure mode definitions) ← you are here
├── FAILSAFE.md (safe-state recovery)
├── ESCALATE.md (approval gates)
├── KILLSWITCH.md (emergency stop)
├── src/
└── README.md
```

## The Twelve-File AI Safety Stack

FAILURE.md is the eleventh file in a twelve-file safety stack:

1. **THROTTLE.md** (https://throttle.md): Control the speed (rate limits, cost ceilings)
2. **ESCALATE.md** (https://escalate.md): Raise the alarm (approval gates, notifications)
3. **FAILSAFE.md** (https://failsafe.md): Fall back safely (safe-state recovery, snapshots)
4. **KILLSWITCH.md** (https://killswitch.md): Emergency stop (triggers, escalation paths)
5. **TERMINATE.md** (https://terminate.md): Permanent shutdown (no restart, evidence preservation)
6. **ENCRYPT.md** (https://encrypt.md): Secure everything (data classification, protection policy)
7. **ENCRYPTION.md** (https://encryption.md): Implement the standards (algorithms, key rotation, TLS, compliance)
8. **SYCOPHANCY.md** (https://sycophancy.md): Prevent bias (honest outputs, citations, disagreement)
9. **COMPRESSION.md** (https://compression.md): Compress context (summarization rules, coherence checks)
10. **COLLAPSE.md** (https://collapse.md): Prevent collapse (model drift, recovery checkpoints)
11. **FAILURE.md** (https://failure.md): Define failure modes (graceful degradation, cascading failures) ← you are here
12. **LEADERBOARD.md** (https://leaderboard.md): Benchmark agents (completion, accuracy, cost, safety)

## Getting Started

1. Copy the template from https://github.com/Failure-md/spec
2. Place FAILURE.md in the project root alongside FAILSAFE.md and ESCALATE.md
3. Define the four failure modes (graceful degradation, partial failure, cascading failure, silent failure)
4. Specify detection signals for each (health checks, error patterns, output validation)
5. Set response procedures (log level, action, escalation target, retry settings)
6. Configure the health check interval (default 30 seconds)
7. Implement failure detection in agent startup
8. Set up monitoring and alerting for cascading failures
9. Test each failure mode (simulate API down, database error, cascading failure)
10. Maintain logs of all failure events for audits

## Key Regulatory Drivers

**EU AI Act** (most provisions apply from 2 August 2026): Expects AI systems, high-risk ones in particular, to have documented error handling and to behave predictably under adverse conditions. FAILURE.md provides the documented failure mode definitions that compliance work calls for.

**Enterprise Risk Management**: Risk frameworks require identification of failure modes and documented response procedures. FAILURE.md *is* the documented procedure.

**SLA Compliance**: Service level agreements often require detection of failures within X seconds and response within Y minutes. FAILURE.md defines the detection and response procedures.

**Audit Requirements**: Auditors ask "What happens when X fails? How do you detect it? What is your response?" FAILURE.md answers all three.

## Specification Template

```yaml
# FAILURE
> Failure mode definitions & handling.
> Spec: https://failure.md

---

## MODES

graceful_degradation:
  description: Non-critical component unavailable; agent continues with reduced capability
  examples:
    - Search API is down; skip search enrichment
    - Image generation tool is unavailable; use text-only response
  action: continue_degraded
  log_level: WARNING
  notify: false
  escalate_after_minutes: null  # no escalation; graceful degradation is acceptable

partial_failure:
  description: Component fails intermittently; agent should retry and route around
  examples:
    - Network timeout on API call
    - Database connection pool exhausted
  action: isolate_and_route_around
  max_retries: 3
  retry_backoff_seconds: [5, 15, 60]
  escalate_after_retries: ESCALATE.md  # human approval if retries exhausted
  escalate_action: pause and wait for human decision

cascading_failure:
  description: Multiple components failing simultaneously; risk of cascade spreading
  examples:
    - Three API errors within 60 seconds
    - Two health checks failing at same time
    - Resource consumption doubling in 10 minutes
  action: circuit_breaker
  circuit_breaker:
    trigger_threshold: 3 failures in 60 seconds OR 2+ health checks failing
    open_circuit_duration_seconds: 300
    half_open_probe_interval_seconds: 60
  log_level: ERROR
  notify: true
  escalate_to: FAILSAFE.md  # revert to safe state
  escalate_action: stop all dependent operations, attempt recovery

silent_failure:
  description: Output generated without detecting underlying error (most dangerous)
  examples:
    - API returned stale cached data
    - Database write partially succeeded
    - Validation check was accidentally skipped
  detection:
    output_validation: required
    freshness_checks: required
    consistency_checks: required  # cross-reference validation
  action: flag_and_quarantine
  require_human_review: true
  escalate_to: ESCALATE.md  # human must approve before continuing

## DETECTION

health_checks:
  enabled: true
  interval_seconds: 30
  components:
    - api_availability: check response time < 5s
    - database_connection: check connection pool > 0
    - model_api: check rate limit not exceeded
    - memory_usage: check < 80% of limit
    - disk_space: check > 10% free

heartbeats:
  enabled: true
  interval_seconds: 60
  timeout_seconds: 120
  missing_heartbeat_trigger: partial_failure

error_patterns:
  enabled: true
  patterns:
    - three_errors_in_60s: trigger cascading_failure
    - rate_limit_429: trigger throttle, escalate to ESCALATE.md
    - timeout_5xx: trigger partial_failure with retry

## LOGGING

all_failures: enabled
failure_log_format: timestamp,mode,component,error_message,action_taken,escalation_target
retention_days: 90

## RECOVERY

partial_failure_recovery:
  strategy: exponential_backoff
  max_retries: 3
  backoff: [5s, 15s, 60s]

cascading_failure_recovery:
  strategy: circuit_breaker + safe_state
  circuit_open_duration: 5 minutes
  safe_state_definition: see FAILSAFE.md

silent_failure_recovery:
  strategy: human_review + manual_approval
  human_escalation_channel: escalate.md approval gate
  timeout: 1 hour
  on_timeout_approve: false  # conservative; block unless explicitly approved
```

## Use Cases

**API-Heavy Agents**: Monitor external API health; detect rate limits, timeouts, and errors; implement retry logic with exponential backoff.

**Database Operations**: Monitor connection pool health; detect slow queries, deadlocks, and connection exhaustion; route around failing replicas.

**Multi-Component Systems**: Monitor each component independently; detect cascading failures early; circuit-break before the cascade spreads.

**High-Reliability Systems**: Detect silent failures (stale data, partial writes, skipped checks); require human approval before proceeding.

**Cost-Conscious Deployments**: Monitor resource usage; detect resource exhaustion early; trigger graceful degradation before costs spiral.

## Compliance & Regulatory

**SOC 2 CC7.2 (Systems Monitoring):** Requires monitoring of systems for anomalies that may indicate security events. FAILURE.md health checks and error pattern detection support this requirement.


**ISO 27001 A.16.1 (Incident Management):** Requires documented procedures for identifying and responding to incidents. FAILURE.md *is* the documented procedure.

**EU AI Act (most provisions apply from August 2026):** Expects AI systems to have documented error handling and to behave predictably. FAILURE.md provides the documented behavior.

## Framework Compatibility

FAILURE.md is framework-agnostic. It works with:

- LangChain agents and tools
- AutoGen multi-agent systems
- CrewAI agent workflows
- Claude Code agentic generation
- Cursor Agent Mode
- Custom implementations

## Contact & Resources

- **Specification Repository:** https://github.com/Failure-md/spec
- **Website:** https://failure.md
- **Email:** info@failure.md

### Related Specifications

- FAILSAFE.md (https://failsafe.md): Safe-state recovery
- ESCALATE.md (https://escalate.md): Approval gates for human intervention
- KILLSWITCH.md (https://killswitch.md): Emergency stop
- THROTTLE.md (https://throttle.md): Rate and cost control
- LEADERBOARD.md (https://leaderboard.md): Agent benchmarking

## License

MIT. Free to use, modify, and distribute. See https://github.com/Failure-md/spec for details.

---

**Last Updated:** 10 March 2026

**Status:** Open Standard v1.0
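
## Appendix: Circuit Breaker Sketch

The `cascading_failure` settings in the specification template (three failures in 60 seconds, a 300-second open period, then a probe) could be implemented along the following lines. This is a minimal sketch, not part of the spec: the `CircuitBreaker` class, its method names, and the injectable `clock` parameter are hypothetical. It also leaves out the `half_open_probe_interval_seconds` setting and the "2+ health checks failing" trigger.

```python
import time
from collections import deque


class CircuitBreaker:
    """Hypothetical sliding-window circuit breaker; names are not from the spec."""

    def __init__(self, threshold=3, window_seconds=60, open_seconds=300,
                 clock=time.monotonic):
        self.threshold = threshold            # failures needed to trip the breaker
        self.window_seconds = window_seconds  # sliding window for counting failures
        self.open_seconds = open_seconds      # how long the circuit stays open
        self.clock = clock                    # injectable so tests can fake time
        self.failures = deque()               # timestamps of recent failures
        self.opened_at = None                 # None means the circuit is closed

    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.open_seconds:
            return "half_open"  # open period elapsed; a probe call may go through
        return "open"

    def allow_request(self):
        # Dependent operations are blocked only while the circuit is open.
        return self.state() != "open"

    def record_failure(self):
        now = self.clock()
        current = self.state()
        if current == "open":
            return  # already tripped; nothing more to count
        if current == "half_open":
            self.opened_at = now  # failed probe: reopen for a full open period
            return
        self.failures.append(now)
        # Drop failures that have fallen out of the sliding window.
        while self.failures and now - self.failures[0] > self.window_seconds:
            self.failures.popleft()
        if len(self.failures) >= self.threshold:
            self.opened_at = now  # trip the breaker

    def record_success(self):
        if self.state() == "half_open":
            # Successful probe: close the circuit and forget old failures.
            self.opened_at = None
            self.failures.clear()
```

An agent built this way would call `allow_request()` before each dependent operation, report outcomes through `record_failure()` and `record_success()`, and hand off to the FAILSAFE.md procedure when `state()` returns `"open"`.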