# FAILURE.md — AI Agent Failure Mode Protocol

## Overview
FAILURE.md is an open file convention for defining failure modes and response procedures in AI agent projects. It is the eleventh layer of a twelve-part AI agent safety stack that covers accountability from rate limiting through failure handling to agent benchmarking.

**Home:** https://failure.md
**Repository:** https://github.com/Failure-md/spec
**Related Specifications:** https://throttle.md, https://escalate.md, https://failsafe.md, https://killswitch.md, https://terminate.md, https://encrypt.md, https://encryption.md, https://sycophancy.md, https://compression.md, https://collapse.md, https://leaderboard.md

## Key Concepts

### The Four Failure Modes
1. **Graceful Degradation** — The agent continues with reduced capability when a non-critical component fails
2. **Partial Failure** — One component fails; the agent actively routes around it with retries and backoff
3. **Cascading Failure** — Multiple components fail simultaneously; the circuit breaker triggers
4. **Silent Failure** — The agent produces output without detecting an underlying error; requires detection and quarantine

### Failure Detection
- **Health Checks** — Monitor each critical component at regular intervals (default 30 seconds)
- **Heartbeats** — Each component sends a periodic "alive" signal; a missing heartbeat indicates failure
- **Error Pattern Matching** — Detect patterns such as three errors in 60 seconds or resource usage doubling in 10 minutes
- **Output Validation** — Check output for consistency, freshness, and logical coherence
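
The error-pattern signal above ("three errors in 60 seconds") can be sketched as a sliding-window counter. This is a minimal illustration, not part of the specification: the class name `ErrorWindow` and its method are hypothetical, and timestamps are passed in explicitly so the behavior is easy to test.

```python
from collections import deque


class ErrorWindow:
    """Hypothetical helper: trips when `threshold` errors land within `window_seconds`."""

    def __init__(self, threshold=3, window_seconds=60.0):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._timestamps = deque()  # error times in seconds, oldest first

    def record_error(self, now):
        """Record an error at time `now`; return True if the pattern matched."""
        self._timestamps.append(now)
        # Drop errors that have aged out of the window.
        while self._timestamps and now - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.threshold


window = ErrorWindow()
results = [window.record_error(t) for t in (0.0, 20.0, 50.0, 200.0)]
print(results)  # [False, False, True, False]
```

The third error arrives within 60 seconds of the first, so the pattern matches; the fourth arrives after the earlier ones have expired, so the counter starts over.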

### Failure Response
- **Graceful Degradation** → Log at WARNING level, continue with reduced capability
- **Partial Failure** → Retry with exponential backoff, escalate to ESCALATE.md after max retries
- **Cascading Failure** → Activate circuit breaker, escalate to FAILSAFE.md for safe-state recovery
- **Silent Failure** → Flag and quarantine, require human review before proceeding
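
One way to read the mapping above is as a lookup table from failure mode to response. The sketch below is illustrative only: the names `FailureMode`, `Response`, `RESPONSES`, and `respond` are hypothetical, the action strings follow the specification template, and the log levels for partial and silent failures are assumptions because this document only states levels for graceful degradation (WARNING) and cascading failure (ERROR).

```python
import logging
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureMode(Enum):
    GRACEFUL_DEGRADATION = "graceful_degradation"
    PARTIAL_FAILURE = "partial_failure"
    CASCADING_FAILURE = "cascading_failure"
    SILENT_FAILURE = "silent_failure"


@dataclass(frozen=True)
class Response:
    action: str
    log_level: int
    escalate_to: Optional[str]  # spec file that takes over, if any


RESPONSES = {
    FailureMode.GRACEFUL_DEGRADATION: Response("continue_degraded", logging.WARNING, None),
    # Assumed level: the document does not state one for partial failures.
    FailureMode.PARTIAL_FAILURE: Response("isolate_and_route_around", logging.WARNING, "ESCALATE.md"),
    FailureMode.CASCADING_FAILURE: Response("circuit_breaker", logging.ERROR, "FAILSAFE.md"),
    # Assumed level: the document does not state one for silent failures.
    FailureMode.SILENT_FAILURE: Response("flag_and_quarantine", logging.ERROR, "ESCALATE.md"),
}


def respond(mode, component):
    """Log the failure at the configured level and return the response to apply."""
    response = RESPONSES[mode]
    logging.getLogger("failure").log(
        response.log_level, "%s: %s -> %s", component, mode.value, response.action
    )
    return response


print(respond(FailureMode.CASCADING_FAILURE, "database").escalate_to)  # FAILSAFE.md
```

Keeping the mapping in data rather than in branching code makes it straightforward to load the same table from a FAILURE.md file later.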

## Problem It Solves

AI agents fail in different ways, and not all failures are equal:
- A non-critical API being unavailable is different from a database cascade
- A tool returning an error is different from a tool silently returning wrong data
- A transient network hiccup is different from a persistent misconfiguration

Without explicit failure mode definitions:
- Agents over-react to minor failures (stopping entirely)
- Agents under-react to serious errors (silently continuing)
- Behavior is unpredictable and unauditable
- Root causes are unclear
- Recovery procedures are ad-hoc

## Solution: FAILURE.md

A declarative, version-controlled failure mode protocol that:
- Maps all four failure modes (graceful degradation, partial, cascading, silent)
- Specifies detection signals (health checks, error patterns, output validation)
- Defines response procedures (log level, action, escalation target)
- Provides audit trail (when failure occurred, which mode, what action taken)
- Works with any AI framework (framework-agnostic)
- Integrates with FAILSAFE.md (safe recovery) and ESCALATE.md (human approval)

## File Structure
```
your-project/
├── AGENTS.md (what agent does)
├── FAILURE.md (failure mode definitions) ← you are here
├── FAILSAFE.md (safe-state recovery)
├── ESCALATE.md (approval gates)
├── KILLSWITCH.md (emergency stop)
├── src/
└── README.md
```

## The Twelve-File AI Safety Stack

FAILURE.md is part of a twelve-file safety stack:

1. **THROTTLE.md** (https://throttle.md) — Control the speed (rate limits, cost ceilings)
2. **ESCALATE.md** (https://escalate.md) — Raise the alarm (approval gates, notifications)
3. **FAILSAFE.md** (https://failsafe.md) — Fall back safely (safe-state recovery, snapshots)
4. **KILLSWITCH.md** (https://killswitch.md) — Emergency stop (triggers, escalation paths)
5. **TERMINATE.md** (https://terminate.md) — Permanent shutdown (no restart, evidence preservation)
6. **ENCRYPT.md** (https://encrypt.md) — Secure everything (data classification, protection policy)
7. **ENCRYPTION.md** (https://encryption.md) — Implement the standards (algorithms, key rotation, TLS, compliance)
8. **SYCOPHANCY.md** (https://sycophancy.md) — Prevent bias (honest outputs, citations, disagreement)
9. **COMPRESSION.md** (https://compression.md) — Compress context (summarization rules, coherence checks)
10. **COLLAPSE.md** (https://collapse.md) — Prevent collapse (model drift, recovery checkpoints)
11. **FAILURE.md** (https://failure.md) — Define failure modes (graceful degradation, cascading failures) ← you are here
12. **LEADERBOARD.md** (https://leaderboard.md) — Benchmark agents (completion, accuracy, cost, safety)

## Getting Started

1. Copy the template from https://github.com/Failure-md/spec
2. Place FAILURE.md in the project root alongside FAILSAFE.md and ESCALATE.md
3. Define the four failure modes (graceful degradation, partial failure, cascading failure, silent failure)
4. Specify detection signals for each (health checks, error patterns, output validation)
5. Set response procedures (log level, action, escalation target, retry settings)
6. Configure health check interval (default 30 seconds)
7. Implement failure detection in agent startup
8. Set up monitoring and alerting for cascading failures
9. Test each failure mode (simulate API down, database error, cascading failure)
10. Maintain logs of all failure events for audits
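
For step 10, a failure event can be written as one comma-separated record using the field order given by `failure_log_format` in the specification template. The sketch below assumes CSV quoting and ISO 8601 UTC timestamps, neither of which this document specifies; the function name `format_failure_record` and the sample values are hypothetical.

```python
import csv
import io
from datetime import datetime, timezone

# Field order taken from `failure_log_format` in the specification template.
FIELDS = ["timestamp", "mode", "component", "error_message", "action_taken", "escalation_target"]


def format_failure_record(mode, component, error_message, action_taken,
                          escalation_target="", when=None):
    """Return one CSV line (no trailing newline) describing a failure event."""
    when = when or datetime.now(timezone.utc)
    buffer = io.StringIO()
    # csv.writer quotes fields that contain commas, so messages stay in one column.
    csv.writer(buffer).writerow(
        [when.isoformat(), mode, component, error_message, action_taken, escalation_target]
    )
    return buffer.getvalue().rstrip("\r\n")


line = format_failure_record(
    "partial_failure",
    "search_api",
    "timeout, retrying",
    "isolate_and_route_around",
    "ESCALATE.md",
    when=datetime(2026, 3, 10, 12, 0, tzinfo=timezone.utc),
)
print(line)
# 2026-03-10T12:00:00+00:00,partial_failure,search_api,"timeout, retrying",isolate_and_route_around,ESCALATE.md
```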

## Key Regulatory Drivers

**EU AI Act** (most obligations apply from 2 August 2026): Requires high-risk AI systems to be robust against errors, faults, and inconsistencies and to be documented accordingly. FAILURE.md provides documented failure mode definitions that can support this.

**Enterprise Risk Management**: Risk frameworks require identification of failure modes and documented response procedures. FAILURE.md can serve as that documented procedure.

**SLA Compliance**: Service level agreements often require detection of failures within X seconds and response within Y minutes. FAILURE.md defines the detection and response procedures.

**Audit Requirements**: Auditors ask "What happens when X fails? How do you detect it? What's your response?" FAILURE.md answers all three.

## Specification Template

```markdown
# FAILURE

> Failure mode definitions & handling.
> Spec: https://failure.md

---

## MODES

graceful_degradation:
  description: Non-critical component unavailable; agent continues with reduced capability
  examples:
    - Search API is down; skip search enrichment
    - Image generation tool is unavailable; use text-only response
  action: continue_degraded
  log_level: WARNING
  notify: false
  escalate_after_minutes: null  # no escalation; graceful degradation is acceptable

partial_failure:
  description: Component fails intermittently; agent should retry and route around
  examples:
    - Network timeout on API call
    - Database connection pool exhausted
  action: isolate_and_route_around
  max_retries: 3
  retry_backoff_seconds: [5, 15, 60]
  escalate_after_retries: ESCALATE.md  # human approval if retries exhausted
  escalate_action: pause and wait for human decision

cascading_failure:
  description: Multiple components failing simultaneously; risk of cascade spreading
  examples:
    - Three API errors within 60 seconds
    - Two health checks failing at same time
    - Resource consumption doubling in 10 minutes
  action: circuit_breaker
  circuit_breaker:
    trigger_threshold: 3 failures in 60 seconds OR 2+ health checks failing
    open_circuit_duration_seconds: 300
    half_open_probe_interval_seconds: 60
  log_level: ERROR
  notify: true
  escalate_to: FAILSAFE.md  # revert to safe state
  escalate_action: stop all dependent operations, attempt recovery

silent_failure:
  description: Output generated without detecting underlying error (most dangerous)
  examples:
    - API returned stale cached data
    - Database write partially succeeded
    - Validation check was accidentally skipped
  detection:
    output_validation: required
    freshness_checks: required
    consistency_checks: required  # cross-reference validation
  action: flag_and_quarantine
  require_human_review: true
  escalate_to: ESCALATE.md  # human must approve before continuing

## DETECTION

health_checks:
  enabled: true
  interval_seconds: 30
  components:
    - api_availability: check response time < 5s
    - database_connection: check connection pool > 0
    - model_api: check rate limit not exceeded
    - memory_usage: check < 80% of limit
    - disk_space: check > 10% free

heartbeats:
  enabled: true
  interval_seconds: 60
  timeout_seconds: 120
  missing_heartbeat_trigger: partial_failure

error_patterns:
  enabled: true
  patterns:
    - three_errors_in_60s: trigger cascading_failure
    - rate_limit_429: trigger throttle, escalate to ESCALATE.md
    - timeout_5xx: trigger partial_failure with retry

## LOGGING

all_failures: enabled
failure_log_format: timestamp,mode,component,error_message,action_taken,escalation_target
retention_days: 90

## RECOVERY

partial_failure_recovery:
  strategy: exponential_backoff
  max_retries: 3
  backoff: [5s, 15s, 60s]

cascading_failure_recovery:
  strategy: circuit_breaker + safe_state
  circuit_open_duration: 5 minutes
  safe_state_definition: see FAILSAFE.md

silent_failure_recovery:
  strategy: human_review + manual_approval
  human_escalation_channel: ESCALATE.md approval gate
  timeout: 1 hour
  on_timeout_approve: false  # conservative: block unless explicitly approved
```
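
The `circuit_breaker` settings in the template (3 failures in 60 seconds, circuit open for 300 seconds) can be sketched as a small state machine. This is a simplified illustration under stated assumptions: the class `CircuitBreaker` and its methods are hypothetical, time is passed in explicitly, and `half_open_probe_interval_seconds` is not modeled, so any call after the open period counts as a probe.

```python
class CircuitBreaker:
    """Hypothetical closed -> open -> half_open -> closed state machine."""

    def __init__(self, failure_threshold=3, window_seconds=60.0, open_duration_seconds=300.0):
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds
        self.open_duration_seconds = open_duration_seconds
        self.state = "closed"
        self._failures = []  # timestamps of recent failures
        self._opened_at = None

    def allow_request(self, now):
        """Return True if a call may proceed at time `now` (seconds)."""
        if self.state == "open" and now - self._opened_at >= self.open_duration_seconds:
            self.state = "half_open"  # open period elapsed; let probes through
        return self.state != "open"

    def record_success(self, now):
        if self.state == "half_open":
            self.state = "closed"  # probe succeeded; resume normal operation
            self._failures.clear()

    def record_failure(self, now):
        if self.state == "half_open":
            self._open(now)  # probe failed; reopen immediately
            return
        # Keep only failures inside the sliding window, then add this one.
        self._failures = [t for t in self._failures if now - t <= self.window_seconds]
        self._failures.append(now)
        if len(self._failures) >= self.failure_threshold:
            self._open(now)

    def _open(self, now):
        self.state = "open"
        self._opened_at = now
        self._failures.clear()


breaker = CircuitBreaker()
for t in (0.0, 10.0, 20.0):
    breaker.record_failure(t)
print(breaker.state)                 # open
print(breaker.allow_request(100.0))  # False
print(breaker.allow_request(320.0))  # True
breaker.record_success(321.0)
print(breaker.state)                 # closed
```

While the circuit is open, the agent would hand control to the safe-state procedure in FAILSAFE.md rather than keep calling the failing components.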

## Use Cases

**API-Heavy Agents**: Monitor external API health; detect rate limits, timeouts, errors; implement retry logic with exponential backoff.
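
A minimal sketch of the retry logic for this use case, using the `[5, 15, 60]` second schedule from the specification template. The names `call_with_retries` and `RetriesExhausted` are hypothetical, and the `sleep` parameter is injectable so the example runs without waiting.

```python
import time


class RetriesExhausted(Exception):
    """Raised when every retry failed; the caller would hand off to ESCALATE.md."""


def call_with_retries(operation, backoff_seconds=(5, 15, 60), sleep=time.sleep):
    """Run `operation`; on failure, wait per the schedule and try again."""
    last_error = None
    # One initial attempt (no delay) plus one retry per backoff entry.
    for delay in (0, *backoff_seconds):
        if delay:
            sleep(delay)
        try:
            return operation()
        except Exception as error:  # real code should catch narrower error types
            last_error = error
    raise RetriesExhausted(f"gave up after {len(backoff_seconds)} retries") from last_error


attempts = []


def flaky():
    """Simulated API call that times out twice, then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


delays = []
result = call_with_retries(flaky, sleep=delays.append)
print(result)  # ok
print(delays)  # [5, 15]
```

Recording the delays instead of sleeping keeps the example fast; passing nothing for `sleep` falls back to `time.sleep`.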

**Database Operations**: Monitor connection pool health; detect slow queries, deadlocks, connection exhaustion; route around failing replicas.

**Multi-Component Systems**: Monitor each component independently; detect cascading failures early; circuit-break before cascade spreads.
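
Independent monitoring can be sketched with per-component heartbeats and the 120-second timeout from the specification template. The names `HeartbeatMonitor` and `classify` are hypothetical. Treating two or more silent components as a cascading failure is an assumption borrowed from the template's "2+ health checks failing" threshold; the template itself maps a single missing heartbeat to `partial_failure`.

```python
class HeartbeatMonitor:
    """Hypothetical tracker for the last heartbeat seen from each component."""

    def __init__(self, timeout_seconds=120.0):
        self.timeout_seconds = timeout_seconds
        self._last_seen = {}

    def beat(self, component, now):
        """Record a heartbeat from `component` at time `now` (seconds)."""
        self._last_seen[component] = now

    def missing(self, now):
        """Return the components whose last heartbeat is older than the timeout."""
        return sorted(
            name for name, seen in self._last_seen.items()
            if now - seen > self.timeout_seconds
        )


def classify(missing_components):
    """Map the number of silent components to a failure mode name (or None)."""
    if len(missing_components) >= 2:
        return "cascading_failure"  # assumption: mirrors the 2+ health check rule
    if missing_components:
        return "partial_failure"
    return None


monitor = HeartbeatMonitor()
monitor.beat("search_api", now=0.0)
monitor.beat("database", now=0.0)
monitor.beat("database", now=100.0)
print(classify(monitor.missing(now=150.0)))  # partial_failure
print(classify(monitor.missing(now=300.0)))  # cascading_failure
```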

**High-Reliability Systems**: Detect silent failures (stale data, partial writes, skipped checks); require human approval before proceeding.
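
Silent-failure detection can be sketched as a validation pass that checks required fields and data freshness before output is released. Everything named here is hypothetical: the function `validate_output`, the `fetched_at` field, and the 300-second freshness limit are assumptions for illustration, not values from the specification.

```python
from dataclasses import dataclass, field


@dataclass
class ValidationResult:
    quarantined: bool
    reasons: list = field(default_factory=list)


def validate_output(record, now, max_age_seconds=300.0,
                    required_fields=("value", "fetched_at")):
    """Flag a record for quarantine if it is incomplete or stale."""
    reasons = []
    missing = [name for name in required_fields if name not in record]
    if missing:
        reasons.append("missing fields: " + ", ".join(missing))
    fetched_at = record.get("fetched_at")
    # Freshness check: cached data older than the limit is treated as stale.
    if fetched_at is not None and now - fetched_at > max_age_seconds:
        reasons.append("stale data")
    return ValidationResult(quarantined=bool(reasons), reasons=reasons)


fresh = validate_output({"value": 42, "fetched_at": 990.0}, now=1000.0)
stale = validate_output({"value": 42, "fetched_at": 100.0}, now=1000.0)
print(fresh.quarantined)  # False
print(stale.quarantined)  # True
print(stale.reasons)      # ['stale data']
```

A quarantined result would then wait at the ESCALATE.md approval gate instead of being passed downstream.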

**Cost-Conscious Deployments**: Monitor resource usage; detect resource exhaustion early; trigger graceful degradation before costs spiral.
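
The "resource consumption doubling in 10 minutes" pattern from the template can be sketched as a comparison between the latest usage sample and the lowest sample in the preceding window. The function name `usage_doubled` and the sample format are hypothetical, and using the window minimum as the baseline is an assumption.

```python
def usage_doubled(samples, window_seconds=600.0):
    """Return True if the latest usage is at least twice the window minimum.

    `samples` is a list of (timestamp_seconds, usage) pairs, oldest first.
    """
    if not samples:
        return False
    latest_time, latest_usage = samples[-1]
    in_window = [usage for t, usage in samples if latest_time - t <= window_seconds]
    baseline = min(in_window)
    # A zero baseline would make any growth look like doubling, so skip it.
    return baseline > 0 and latest_usage >= 2 * baseline


print(usage_doubled([(0, 100), (300, 140), (600, 210)]))  # True
print(usage_doubled([(0, 100), (300, 120), (600, 150)]))  # False
```

A True result is a natural point to switch the agent into graceful degradation, for example by skipping optional enrichment calls.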

## Compliance & Regulatory

**SOC 2 CC7.2 (Systems Monitoring):** Requires monitoring of system components for anomalies that may indicate security events or errors. FAILURE.md health checks and error pattern detection can help meet this requirement.

**ISO 27001 A.16.1 (Incident Management, 2013 control numbering):** Requires documented procedures for identifying and responding to incidents. FAILURE.md can serve as the documented procedure for agent failures.

**EU AI Act (most obligations apply from 2 August 2026):** Requires high-risk AI systems to be robust against errors and faults and to behave predictably. FAILURE.md provides the documented behavior.

## Framework Compatibility

FAILURE.md is framework-agnostic. It works with:
- LangChain agents and tools
- AutoGen multi-agent systems
- CrewAI agent workflows
- Claude Code agentic generation
- Cursor Agent Mode
- Custom implementations

## Contact & Resources

- **Specification Repository:** https://github.com/Failure-md/spec
- **Website:** https://failure.md
- **Email:** info@failure.md

### Related Specifications
- FAILSAFE.md (https://failsafe.md) — Safe-state recovery
- ESCALATE.md (https://escalate.md) — Approval gates for human intervention
- KILLSWITCH.md (https://killswitch.md) — Emergency stop
- THROTTLE.md (https://throttle.md) — Rate and cost control
- LEADERBOARD.md (https://leaderboard.md) — Agent benchmarking

## License

MIT — Free to use, modify, and distribute. See https://github.com/Failure-md/spec for details.

---

**Last Updated:** 10 March 2026
**Status:** Open Standard v1.0