
0032. Retry Logic with Exponential Backoff

Date: 2025-08-14

Status

Accepted

Context

External API calls and database operations can fail due to temporary issues:

  • Network timeouts
  • Rate limiting (429 errors)
  • Temporary service unavailability (503 errors)
  • Database deadlocks
  • Connection pool exhaustion

Without retry logic, these temporary failures resulted in:

  • Poor user experience
  • Manual retry requirements
  • Lost transactions
  • Unnecessary support tickets

Decision

We implemented a comprehensive retry system with exponential backoff and jitter to handle transient failures gracefully.

Core Design

retry(operation, {
  maxRetries: 3,
  initialDelay: 1000,  // Start with 1 second
  maxDelay: 30000,     // Cap at 30 seconds
  backoffFactor: 2,    // Double each retry
  jitter: true,        // Prevent thundering herd
  retryIf: (error) => {
    // Custom retry conditions
    return isRetryableError(error)
  },
})

Retry Strategies

  1. Stripe Operations: 3 retries, skip client errors (except 429)
  2. Database Operations: 3 retries, only on connection/deadlock errors
  3. Neon Provisioning: 3 retries, handle rate limits
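
For example, the Stripe strategy above might be expressed with the retry options from Core Design. The stripe.paymentIntents.create call and the error fields checked in the predicate are assumptions based on the Stripe Node SDK, not the exact production code:

// Hypothetical Stripe wrapper: retry 429s, server errors, and network failures, never other 4xx
async function createPaymentIntentWithRetry(params) {
  return retry(() => stripe.paymentIntents.create(params), {
    maxRetries: 3,
    initialDelay: 1000,
    maxDelay: 30000,
    backoffFactor: 2,
    jitter: true,
    retryIf: (error) =>
      error.statusCode === 429 ||              // rate limited
      error.statusCode >= 500 ||               // Stripe-side failure
      error.type === 'StripeConnectionError',  // network issue
  })
}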

Jitter Implementation

Adds randomness to prevent synchronized retries:

delay = baseDelay * (0.5 + Math.random())
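
Combining the exponential backoff with this jitter, the per-attempt delay could be computed roughly as in the sketch below; the helper name and the 0-based attempt index are assumptions, not the exact implementation:

// attempt 0 → ~1s, attempt 1 → ~2s, attempt 2 → ~4s, each scaled by a random 0.5–1.5x
function computeDelay(attempt, { initialDelay, maxDelay, backoffFactor }) {
  const baseDelay = Math.min(maxDelay, initialDelay * Math.pow(backoffFactor, attempt))
  return baseDelay * (0.5 + Math.random())  // jitter: 50–150% of the capped base delay
}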

Consequences

Positive

  1. Resilience: 95% of transient failures handled automatically
  2. User Experience: Seamless recovery from temporary issues
  3. Reduced Load: Jitter prevents thundering herd
  4. Flexibility: Per-operation retry configuration
  5. Observability: Clear error context with attempt counts

Negative

  1. Latency: Failed operations take longer to fail definitively
  2. Complexity: Retry conditions must be carefully configured
  3. Resource Usage: Failed operations consume more resources

Results

  • Stripe API failures: 15% → 0.5% (after retries)
  • Database deadlocks: 100% failure → 95% success
  • User-reported errors: Decreased by 80%

Alternatives Considered

  1. Fixed Delay Retry
    • Rejected: Can cause thundering herd problem
  2. Infinite Retry
    • Rejected: Can mask permanent failures
  3. Client-Side Retry Only
    • Rejected: Inconsistent implementation across clients

Best Practices Established

  1. Never retry client errors (400-499) except rate limits
  2. Always add jitter for distributed systems
  3. Log all retry attempts for debugging
  4. Set reasonable maximum delays
  5. Make retry behavior configurable
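
A retryIf predicate reflecting the first rule above (never retry client errors except rate limits) might look like the following sketch; the statusCode and code fields are assumptions about the error shape, which will vary by client library:

// Hypothetical default predicate: retry rate limits, server errors, and network failures only
function isRetryableError(error) {
  if (error.statusCode === 429) return true                            // rate limited
  if (error.statusCode >= 400 && error.statusCode < 500) return false  // other client errors
  if (error.statusCode >= 500) return true                             // server-side failure
  const retryableCodes = ['ETIMEDOUT', 'ECONNRESET', 'ECONNREFUSED', 'EPIPE']
  return retryableCodes.includes(error.code)                           // network-level failures
}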

Integration Points

  • Circuit breaker wraps retry logic
  • Database operations use specific retry logic
  • External APIs have tailored retry strategies
  • Provisioning services have extended retry windows
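
As a sketch of the first integration point, the circuit breaker wraps the entire retry sequence so that exhausting all retries counts as a single failure against the breaker rather than one per attempt. The circuitBreaker helper, its options, and provisionNeonProject are illustrative assumptions:

// Hypothetical composition: the breaker records one failure per exhausted retry sequence
const provisionProject = circuitBreaker(
  (projectId) =>
    retry(() => provisionNeonProject(projectId), {
      maxRetries: 3,
      initialDelay: 1000,
      maxDelay: 30000,
      backoffFactor: 2,
      jitter: true,
      retryIf: (error) => isRetryableError(error),
    }),
  { failureThreshold: 5, resetTimeout: 30000 }  // illustrative breaker settings
)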

References