
0031. Circuit Breaker Pattern for External API Calls

Date: 2025-08-14

Status

Accepted

Context

Our system relies heavily on the Stripe API for billing operations. During Stripe outages or degraded performance, our application would:

  • Continue attempting failed requests
  • Experience cascading failures
  • Show errors to users with no graceful degradation
  • Potentially overwhelm Stripe's API during recovery

We needed a way to:

  • Fail fast when external services are down
  • Provide graceful degradation
  • Automatically recover when services are restored
  • Prevent thundering herd during recovery

Decision

We implemented the Circuit Breaker pattern for all Stripe API calls with three states:

  1. CLOSED: Normal operation, requests pass through
  2. OPEN: Service is down, requests fail immediately
  3. HALF_OPEN: Testing if service has recovered

Configuration

{
  failureThreshold: 0.5,    // Open circuit at 50% failure rate
  resetTimeout: 30000,      // Try recovery after 30 seconds
  requestTimeout: 10000,    // Individual request timeout (10 seconds)
  volumeThreshold: 5,       // Minimum 5 requests before calculating the failure rate
  errorFilter: (error) => {
    // Only count 5xx errors, not user errors (4xx)
    return error.statusCode >= 500
  }
}
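
A minimal sketch of a breaker that honours these options is shown below. The class shape, method names, and counter handling are illustrative assumptions, not our production implementation; a real breaker would also track the failure rate over a rolling window rather than the cumulative counters used here.

type BreakerOptions = {
  failureThreshold: number              // failure rate (0..1) that opens the circuit
  resetTimeout: number                  // ms to wait in OPEN before probing recovery
  requestTimeout: number                // ms allowed per individual request
  volumeThreshold: number               // minimum requests before the rate is evaluated
  errorFilter: (error: any) => boolean  // which errors count as failures
}

type CircuitState = 'CLOSED' | 'OPEN' | 'HALF_OPEN'

class CircuitBreaker {
  private state: CircuitState = 'CLOSED'
  private failures = 0
  private requests = 0
  private openedAt = 0

  constructor(private readonly options: BreakerOptions) {}

  get currentState(): CircuitState {
    return this.state
  }

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    if (this.state === 'OPEN') {
      if (Date.now() - this.openedAt < this.options.resetTimeout) {
        throw new Error('Circuit open: failing fast')  // fail fast while OPEN
      }
      this.state = 'HALF_OPEN'                         // allow a probe request
    }

    try {
      const result = await this.withTimeout(operation(), this.options.requestTimeout)
      this.onSuccess()
      return result
    } catch (error) {
      this.onFailure(error)
      throw error
    }
  }

  private onSuccess(): void {
    this.requests++
    if (this.state === 'HALF_OPEN') {
      // A successful probe closes the circuit and clears the counters.
      this.state = 'CLOSED'
      this.failures = 0
      this.requests = 0
    }
  }

  private onFailure(error: unknown): void {
    this.requests++
    if (!this.options.errorFilter(error)) return  // e.g. 4xx errors do not trip the breaker
    this.failures++

    const enoughTraffic = this.requests >= this.options.volumeThreshold
    const failureRate = this.failures / this.requests
    if (this.state === 'HALF_OPEN' ||
        (enoughTraffic && failureRate >= this.options.failureThreshold)) {
      this.state = 'OPEN'
      this.openedAt = Date.now()
    }
  }

  private withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
    const timeout = new Promise<never>((_, reject) =>
      setTimeout(() => reject(new Error('Request timed out')), ms)
    )
    return Promise.race([promise, timeout])
  }
}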

Integration with Retry Logic

The circuit breaker wraps the retry logic, providing defense in depth:

circuitBreaker.execute(() => retryWithExponentialBackoff(stripeOperation))
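
For illustration, the retry helper being wrapped might look roughly like this; the attempt count and backoff delays below are assumptions, not the values we run in production.

async function retryWithExponentialBackoff<T>(
  operation: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 200
): Promise<T> {
  let lastError: unknown
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation()
    } catch (error) {
      lastError = error
      if (attempt < maxAttempts - 1) {
        // Exponential backoff: 200 ms, 400 ms, 800 ms, ...
        await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt))
      }
    }
  }
  throw lastError
}

Because the breaker wraps the retried call, one "request" from the breaker's point of view is the whole retry sequence: short blips are absorbed by the retries, while a sustained outage exhausts them, registers as a failure, and eventually opens the circuit.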

Consequences

Positive

  1. Fault Isolation: Stripe failures don't crash our system
  2. Fast Failures: Users get immediate feedback instead of waiting
  3. Automatic Recovery: Service resumes when Stripe recovers
  4. Resource Protection: Prevents overwhelming failed services
  5. Monitoring: Clear metrics on external service health

Negative

  1. Complexity: Additional state management
  2. False Positives: Might open circuit on temporary issues
  3. Configuration Tuning: Requires monitoring to optimize thresholds

Metrics Observed

  • Stripe outage impact: 100% errors → 5% errors (graceful degradation)
  • Recovery time: 5-10 minutes → 30 seconds
  • User experience: Hard failures → "Service temporarily unavailable"

Alternatives Considered

  1. Retry Only

    • Rejected: No protection against extended outages
  2. Rate Limiting

    • Rejected: Doesn't address service failures
  3. Fallback to Cache

    • Partially implemented: Used for read operations only (see the sketch after this list)
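
A hypothetical shape for that read-path fallback, building on the breaker and retry sketches above and using the official stripe-node client, is shown here; the cache, function, and breaker instance names are illustrative.

// Serve the last cached value when a read fails (e.g. with the circuit open).
const customerCache = new Map<string, unknown>()

async function getCustomerWithFallback(customerId: string): Promise<unknown> {
  try {
    const customer = await customersBreaker.execute(() =>
      retryWithExponentialBackoff(() => stripe.customers.retrieve(customerId))
    )
    customerCache.set(customerId, customer)    // refresh the cache on every successful read
    return customer
  } catch (error) {
    const cached = customerCache.get(customerId)
    if (cached !== undefined) return cached    // graceful degradation for reads
    throw error                                // cache misses (and all writes) still fail fast
  }
}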

Implementation Details

  • Separate circuit breakers for different Stripe resources
  • Fallback responses for critical operations
  • Health check endpoint exposing circuit states (see the sketch after this list)
  • Metrics integration for monitoring
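
Building on the CircuitBreaker sketch above, the per-resource breakers and the health check endpoint might look roughly like this; the Express app, route path, and resource split are assumptions for illustration.

import express from 'express'

// Hypothetical registry: one breaker per Stripe resource, so trouble with one
// resource (e.g. invoices) does not open the circuit for charges.
const stripeBreakerOptions = {
  failureThreshold: 0.5,
  resetTimeout: 30000,
  requestTimeout: 10000,
  volumeThreshold: 5,
  errorFilter: (error: any) => error.statusCode >= 500,
}

const breakers = new Map<string, CircuitBreaker>([
  ['charges', new CircuitBreaker(stripeBreakerOptions)],
  ['invoices', new CircuitBreaker(stripeBreakerOptions)],
  ['subscriptions', new CircuitBreaker(stripeBreakerOptions)],
])

const app = express()

// Health check endpoint exposing the current state of each circuit,
// which dashboards and alerts can read.
app.get('/health/circuits', (_req, res) => {
  const states: Record<string, string> = {}
  for (const [resource, breaker] of breakers) {
    states[resource] = breaker.currentState
  }
  res.json(states)
})

app.listen(3000)  // port chosen for illustration only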

References