# 0031. Circuit Breaker Pattern for External API Calls
Date: 2025-08-14
## Status
Accepted
## Context
Our system relies heavily on the Stripe API for billing operations. During Stripe outages or degraded performance, our application would:
- Continue attempting failed requests
- Experience cascading failures
- Show errors to users with no graceful degradation
- Potentially overwhelm Stripe's API during recovery
We needed a way to:
- Fail fast when external services are down
- Provide graceful degradation
- Automatically recover when services are restored
- Prevent thundering herd during recovery
## Decision
We implemented the Circuit Breaker pattern for all Stripe API calls with three states:
- CLOSED: Normal operation; requests pass through
- OPEN: The service is treated as down; requests fail immediately without calling Stripe
- HALF_OPEN: A limited number of probe requests test whether the service has recovered
### Configuration
```ts
{
  failureThreshold: 0.5,   // Open circuit at 50% failure rate
  resetTimeout: 30000,     // Try recovery after 30 seconds
  requestTimeout: 10000,   // Individual request timeout (ms)
  volumeThreshold: 5,      // Minimum 5 requests before calculating failure rate
  errorFilter: (error) => {
    // Only count 5xx errors, not user errors (4xx)
    return error.statusCode >= 500
  }
}
```
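
For context, a minimal sketch of a breaker that implements the three states and honors the configuration above could look like the following. This is illustrative only: the class and method names are assumptions rather than our production code, it keeps cumulative counts instead of a rolling window, and it omits timer cleanup.

```ts
type State = "CLOSED" | "OPEN" | "HALF_OPEN";

interface BreakerConfig {
  failureThreshold: number;             // failure rate (0..1) that opens the circuit
  resetTimeout: number;                 // ms to wait before probing in HALF_OPEN
  requestTimeout: number;               // ms before a single call counts as failed
  volumeThreshold: number;              // minimum calls before the rate is evaluated
  errorFilter: (error: any) => boolean; // true if the error counts toward the rate
}

class CircuitBreaker {
  private state: State = "CLOSED";
  private failures = 0;
  private total = 0;
  private openedAt = 0;

  constructor(private config: BreakerConfig) {}

  getState(): State {
    return this.state;
  }

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    if (this.state === "OPEN") {
      // Fail fast until the reset timeout elapses, then allow a single probe.
      if (Date.now() - this.openedAt < this.config.resetTimeout) {
        throw new Error("Circuit open: failing fast");
      }
      this.state = "HALF_OPEN";
    }

    try {
      const result = await this.withTimeout(operation());
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure(error);
      throw error;
    }
  }

  private withTimeout<T>(promise: Promise<T>): Promise<T> {
    return Promise.race([
      promise,
      new Promise<T>((_, reject) =>
        setTimeout(() => reject(new Error("Request timed out")), this.config.requestTimeout)
      ),
    ]);
  }

  private onSuccess(): void {
    this.total++;
    if (this.state === "HALF_OPEN") {
      // The probe succeeded: close the circuit and start counting fresh.
      this.state = "CLOSED";
      this.failures = 0;
      this.total = 0;
    }
  }

  private onFailure(error: unknown): void {
    this.total++;
    if (!this.config.errorFilter(error)) return; // 4xx-style user errors don't trip the breaker

    this.failures++;
    const enoughVolume = this.total >= this.config.volumeThreshold;
    const rate = this.failures / this.total;

    if (this.state === "HALF_OPEN" || (enoughVolume && rate >= this.config.failureThreshold)) {
      this.state = "OPEN";
      this.openedAt = Date.now();
      this.failures = 0;
      this.total = 0;
    }
  }
}
```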
### Integration with Retry Logic
The circuit breaker wraps the retry logic, providing defense in depth:
```ts
circuitBreaker.execute(() => retryWithExponentialBackoff(stripeOperation))
```
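
`retryWithExponentialBackoff` is referenced above but not shown. A hedged sketch, assuming a small fixed number of attempts and jittered delays (both numbers are assumptions, not values from this ADR):

```ts
// Illustrative retry helper; the attempt count and base delay are assumptions.
async function retryWithExponentialBackoff<T>(
  operation: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 250
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt === maxAttempts - 1) break; // no delay after the final attempt
      // Exponential backoff with jitter: ~250ms, ~500ms, ~1000ms, ...
      const delay = baseDelayMs * 2 ** attempt * (0.5 + Math.random() / 2);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

Keeping the breaker on the outside means a call that exhausts its retries still counts as a single failure toward the failure rate, so a hard outage trips the breaker quickly instead of multiplying load by the retry count.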
## Consequences
### Positive
- Fault Isolation: Stripe failures don't crash our system
- Fast Failures: Users get immediate feedback instead of waiting
- Automatic Recovery: Service resumes when Stripe recovers
- Resource Protection: Prevents overwhelming failed services
- Monitoring: Clear metrics on external service health
### Negative
- Complexity: Additional state management
- False Positives: Might open circuit on temporary issues
- Configuration Tuning: Requires monitoring to optimize thresholds
## Metrics Observed
- Stripe outage impact: 100% errors → 5% errors (graceful degradation)
- Recovery time: 5-10 minutes → 30 seconds
- User experience: Hard failures → "Service temporarily unavailable"
## Alternatives Considered
- **Retry Only**
  - Rejected: No protection against extended outages
- **Rate Limiting**
  - Rejected: Doesn't address service failures
- **Fallback to Cache**
  - Partially implemented: Used for read operations only (see the sketch below)
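
For the read-only cache fallback, a hedged sketch of what such a path might look like; the customer type, the cache, and the Stripe fetch helper are illustrative stand-ins, not our actual API surface:

```ts
// Hypothetical read path: refresh the cache on success, serve possibly-stale
// data when the circuit is open or the call fails. Writes get no such fallback.
type Customer = { id: string; email: string };

declare const circuitBreaker: { execute<T>(op: () => Promise<T>): Promise<T> }; // breaker from above
declare function fetchCustomerFromStripe(id: string): Promise<Customer>;        // illustrative stand-in

const customerCache = new Map<string, Customer>();

async function getCustomer(id: string): Promise<Customer | undefined> {
  try {
    const customer = await circuitBreaker.execute(() => fetchCustomerFromStripe(id));
    customerCache.set(id, customer);
    return customer;
  } catch {
    return customerCache.get(id); // last known value, possibly stale
  }
}
```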
## Implementation Details
- Separate circuit breakers for different Stripe resources (per-resource breakers and the health check payload are sketched below)
- Fallback responses for critical operations
- Health check endpoint exposing circuit states
- Metrics integration for monitoring
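
As an illustration of the per-resource breakers and the health check endpoint, the registry below builds on the `CircuitBreaker` sketch above; the resource names, the `stripeBreakerConfig` constant, and the payload shape are assumptions:

```ts
// Hypothetical per-resource registry built on the CircuitBreaker sketch above.
declare const stripeBreakerConfig: BreakerConfig; // the configuration shown earlier

const stripeBreakers = {
  charges: new CircuitBreaker(stripeBreakerConfig),
  invoices: new CircuitBreaker(stripeBreakerConfig),
  subscriptions: new CircuitBreaker(stripeBreakerConfig),
};

// Payload for the health check endpoint: one circuit state per Stripe resource.
function circuitHealth(): Record<string, string> {
  return Object.fromEntries(
    Object.entries(stripeBreakers).map(([resource, breaker]) => [resource, breaker.getState()])
  );
}
// e.g. { charges: "CLOSED", invoices: "OPEN", subscriptions: "CLOSED" }
```

Splitting breakers by resource keeps a failure in one Stripe surface (for example, invoicing) from tripping the circuit for unrelated calls.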