Source: ocean/docs/adr/0032-retry-logic-implementation.md
0032. Retry Logic with Exponential Backoff
Date: 2025-08-14
Status
Accepted
Context
External API calls and database operations can fail due to temporary issues:
- Network timeouts
- Rate limiting (429 errors)
- Temporary service unavailability (503 errors)
- Database deadlocks
- Connection pool exhaustion
Without retry logic, these temporary failures resulted in:
- Poor user experience
- Manual retry requirements
- Lost transactions
- Unnecessary support tickets
Decision
We implemented a comprehensive retry system with exponential backoff and jitter to handle transient failures gracefully.
Core Design
retry(operation, {
  maxRetries: 3,
  initialDelay: 1000, // Start with 1 second
  maxDelay: 30000, // Cap at 30 seconds
  backoffFactor: 2, // Double each retry
  jitter: true, // Prevent thundering herd
  retryIf: (error) => {
    // Custom retry conditions
    return isRetryableError(error)
  },
})
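The ADR does not show the body of `retry`, so the following is only a minimal sketch of how a helper with these options might work. The `wait` option, the wrapper error, and its `attempts` property are assumptions added for illustration, and applying the `maxDelay` cap after jitter is one possible reading of "Cap at 30 seconds".

```javascript
// Hypothetical sketch of a retry helper; option names mirror the config above.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function retry(operation, options = {}) {
  const {
    maxRetries = 3,
    initialDelay = 1000,
    maxDelay = 30000,
    backoffFactor = 2,
    jitter = true,
    retryIf = () => true,
    wait = sleep, // assumption: injectable so callers and tests can skip real waiting
  } = options;

  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      // Give up when retries are exhausted or the error is not retryable.
      if (attempt >= maxRetries || !retryIf(error)) {
        const attempts = attempt + 1;
        const wrapped = new Error(
          `Operation failed after ${attempts} attempt(s)`,
          { cause: error } // keep the original error for debugging
        );
        throw Object.assign(wrapped, { attempts });
      }
      // Exponential growth, optional jitter, then the maxDelay cap.
      const base = initialDelay * backoffFactor ** attempt;
      const jittered = jitter ? base * (0.5 + Math.random()) : base;
      await wait(Math.min(jittered, maxDelay));
    }
  }
}
```

With `maxRetries: 3` the operation runs at most four times: the first attempt plus three retries.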
Retry Strategies
- Stripe Operations: 3 retries, skip client errors (except 429)
- Database Operations: 3 retries, only on connection/deadlock errors
- Neon Provisioning: 3 retries, handle rate limits
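As a rough illustration, the three strategies above could be expressed as per-operation `retryIf` predicates. Everything here is hypothetical: the function names, the `retryStrategies` object, and the error shapes (`statusCode`, `status`, `code`) are assumptions, as is the choice of a PostgreSQL database (whose deadlock SQLSTATE is `40P01`) and the Node.js connection error codes.

```javascript
// Hypothetical predicates; the error shapes used here are assumptions.
const isClientError = (status) => status >= 400 && status < 500;

// Stripe: skip client errors except 429; retry everything else
// (server errors, or network failures that carry no status at all).
function isRetryableStripeError(error) {
  const status = error.statusCode;
  if (status === 429) return true;
  if (typeof status === "number" && isClientError(status)) return false;
  return true;
}

// Database: retry only on deadlocks and connection-level failures.
const RETRYABLE_DB_CODES = new Set([
  "40P01", // PostgreSQL deadlock_detected
  "ECONNRESET", // Node.js: connection reset by peer
  "ETIMEDOUT", // Node.js: connection timed out
]);

function isRetryableDbError(error) {
  return RETRYABLE_DB_CODES.has(error.code);
}

// Neon provisioning: retry when the API reports a rate limit.
function isRetryableProvisioningError(error) {
  return error.status === 429;
}

// One entry per operation type, each with 3 retries as listed above.
const retryStrategies = {
  stripe: { maxRetries: 3, retryIf: isRetryableStripeError },
  database: { maxRetries: 3, retryIf: isRetryableDbError },
  neonProvisioning: { maxRetries: 3, retryIf: isRetryableProvisioningError },
};
```

Keeping the predicates separate from the retry loop means each integration can tighten or loosen its conditions without touching shared code.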
Jitter Implementation
Adds randomness to prevent synchronized retries by scaling each delay by a random factor between 0.5 and 1.5:
delay = baseDelay * (0.5 + Math.random())
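To show how the jitter formula combines with exponential backoff, here is a small sketch of a delay calculation. The function name `computeDelay`, the injectable `random` parameter, and the decision to apply the `maxDelay` cap after jitter are assumptions; only the formula itself comes from this ADR.

```javascript
// Hypothetical helper: delay in ms before retry number `attempt` (0-based).
function computeDelay(attempt, options = {}, random = Math.random) {
  const {
    initialDelay = 1000,
    maxDelay = 30000,
    backoffFactor = 2,
    jitter = true,
  } = options;

  // Exponential growth: 1000, 2000, 4000, ... with the defaults.
  const baseDelay = initialDelay * backoffFactor ** attempt;

  // Jitter formula from the ADR: scale by a factor in [0.5, 1.5).
  const jittered = jitter ? baseDelay * (0.5 + random()) : baseDelay;

  // Assumption: the cap applies to the final, jittered value.
  return Math.min(jittered, maxDelay);
}

// A fixed random source makes the arithmetic easy to follow.
console.log(computeDelay(0, {}, () => 0.5)); // 1000
console.log(computeDelay(2, {}, () => 0)); // 2000 (lowest jittered value of 4000)
console.log(computeDelay(10, {}, () => 0.5)); // 30000 (capped)
```

With the defaults, the first three retries wait somewhere in the ranges 500 to 1500 ms, 1000 to 3000 ms, and 2000 to 6000 ms, so clients that failed at the same moment are unlikely to retry at the same moment.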
Consequences
Positive
- Resilience: 95% of transient failures handled automatically
- User Experience: Seamless recovery from temporary issues
- Reduced Load: Jitter prevents thundering herd
- Flexibility: Per-operation retry configuration
- Observability: Clear error context with attempt counts
Negative
- Latency: Failed operations take longer to fail definitively
- Complexity: Retry conditions must be carefully configured
- Resource Usage: Failed operations consume more resources
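The latency cost can be estimated from the Core Design settings. The sketch below adds up the waiting time for an operation that fails on every attempt. It counts backoff delays only (not the time each attempt itself takes) and assumes the jitter factor range of 0.5 to 1.5 with the cap applied after jitter. The function name `delayBounds` is hypothetical.

```javascript
// Hypothetical helper: total backoff wait (ms) when every attempt fails.
function delayBounds(options = {}) {
  const {
    maxRetries = 3,
    initialDelay = 1000,
    maxDelay = 30000,
    backoffFactor = 2,
  } = options;

  let min = 0;
  let nominal = 0;
  let max = 0;
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const base = initialDelay * backoffFactor ** attempt;
    nominal += Math.min(base, maxDelay); // no jitter
    min += Math.min(base * 0.5, maxDelay); // lowest jitter factor
    max += Math.min(base * 1.5, maxDelay); // upper bound of the jitter factor
  }
  return { min, nominal, max };
}

console.log(delayBounds()); // { min: 3500, nominal: 7000, max: 10500 }
```

So with the default configuration a permanently failing call spends roughly 3.5 to 10.5 seconds waiting before it fails definitively, which is worth keeping in mind for request paths with tight timeouts.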
Results
- Stripe API failures: 15% → 0.5% (after retries)
- Database deadlocks: 100% failure → 95% success (after retries)
- User-reported errors: Decreased by 80%
Alternatives Considered
- Fixed Delay Retry
  - Rejected: Can cause thundering herd problem
- Infinite Retry
  - Rejected: Can mask permanent failures
- Client-Side Retry Only
  - Rejected: Inconsistent implementation across clients
Best Practices Established
- Never retry client errors (400-499), except rate limits (429)
- Always add jitter for distributed systems
- Log all retry attempts for debugging
- Set reasonable maximum delays
- Make retry behavior configurable
Integration Points
- Circuit breaker wraps retry logic
- Database operations use specific retry logic
- External APIs have tailored retry strategies
- Provisioning services have extended retry windows
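The first integration point, a circuit breaker wrapping the retry logic, might look roughly like the sketch below. This ADR does not describe the breaker itself, so the class, its option names (`failureThreshold`, `resetTimeout`, `now`), and the default values are all assumptions made for illustration.

```javascript
// Hypothetical minimal circuit breaker; not the project's actual implementation.
class CircuitBreaker {
  constructor({ failureThreshold = 5, resetTimeout = 60000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetTimeout = resetTimeout;
    this.now = now; // injectable clock, useful for tests
    this.failures = 0;
    this.openedAt = null;
  }

  isOpen() {
    if (this.openedAt === null) return false;
    // Once resetTimeout has elapsed, let a trial call through (half-open).
    return this.now() - this.openedAt < this.resetTimeout;
  }

  async call(fn) {
    if (this.isOpen()) {
      throw new Error("Circuit open: call rejected");
    }
    try {
      const result = await fn();
      // Success closes the circuit and clears the failure count.
      this.failures = 0;
      this.openedAt = null;
      return result;
    } catch (error) {
      this.failures += 1;
      if (this.failures >= this.failureThreshold) {
        this.openedAt = this.now();
      }
      throw error;
    }
  }
}

// Usage sketch: the breaker is the outer layer, so `fn` is a function that
// already retries internally, for example:
//   breaker.call(() => retry(operation, { maxRetries: 3 }))
```

With the breaker on the outside, it counts one failure only after a call has exhausted its retries, and while the circuit is open no retries are attempted at all.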