Source: ocean/docs/adr/0035-atomic-provisioning.md

ADR 0035: Atomic Resource Provisioning

Status

Implemented

Context

The previous provisioning system used Promise.allSettled with an individual catch block per resource, so provisioning could partially succeed when an external service (Stripe, Neon) was down. Organizations were left with incomplete resources that required manual intervention or complex recovery logic.

The fire-and-forget approach meant:

  • Organizations could have Stripe customers but no Neon database
  • No visibility into provisioning failures
  • No automatic recovery mechanism
  • Poor user experience when services were temporarily unavailable

Decision

Implement atomic provisioning with the following components:

  1. Pre-flight checks: Verify service availability before attempting provisioning
  2. Circuit breakers: Prevent cascading failures and expose each service's state
  3. Transactional provisioning: All-or-nothing resource creation
  4. Rollback capability: Clean up partial resources on failure
  5. Event tracking: Detailed provisioning state in database
  6. Manual retry: GraphQL mutation for admins to retry provisioning

Implementation

1. Circuit Breaker Integration

const stripeCircuitBreaker = createStripeCircuitBreaker(() => {
  throw new Error('Stripe service is currently unavailable')
})
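To illustrate the mechanism behind a breaker like the one above, here is a minimal sketch of the closed / open / half-open cycle. The class SimpleCircuitBreaker, its constructor parameters, and the fire method are hypothetical names chosen for this example; they are not the API of createStripeCircuitBreaker, whose internals this ADR does not show.

```typescript
// Hypothetical minimal circuit breaker, for illustration only.
type BreakerState = 'closed' | 'open' | 'half-open'

class SimpleCircuitBreaker {
  private failures = 0
  private openedAt = 0
  private current: BreakerState = 'closed'

  constructor(
    private readonly failureThreshold: number,
    private readonly resetTimeoutMs: number,
    private readonly now: () => number = Date.now, // injectable clock for tests
  ) {}

  get state(): BreakerState {
    // Once the cool-down has elapsed, allow a single trial call through.
    if (this.current === 'open' && this.now() - this.openedAt >= this.resetTimeoutMs) {
      this.current = 'half-open'
    }
    return this.current
  }

  async fire<T>(action: () => Promise<T>, fallback: () => T): Promise<T> {
    // While open, skip the remote call entirely and use the fallback.
    if (this.state === 'open') return fallback()
    try {
      const result = await action()
      this.failures = 0
      this.current = 'closed'
      return result
    } catch (err) {
      this.failures += 1
      // A failed trial call, or too many consecutive failures, opens the breaker.
      if (this.current === 'half-open' || this.failures >= this.failureThreshold) {
        this.current = 'open'
        this.openedAt = this.now()
      }
      throw err
    }
  }
}
```

A fallback that throws, as in the snippet above, turns an open breaker into an immediate, descriptive error rather than a slow timeout against the failing service.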

2. Pre-flight Checks

export async function preflightChecks(): Promise<{
  canProvision: boolean
  unavailableServices: string[]
}> {
  // Check circuit breaker states
  // Return early if services are down
}
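As a sketch of how the body of such a check might be filled in: the helper below collects the names of services whose breaker is open and reports whether provisioning can proceed. The ServiceBreaker interface, its isOpen method, and the function name checkBreakers are assumptions for this example. The real preflightChecks takes no arguments and is async, so it presumably reads the breakers from module scope.

```typescript
// Hypothetical view of a breaker: a service name plus an "is it open?" query.
interface ServiceBreaker {
  name: string
  isOpen(): boolean
}

interface PreflightResult {
  canProvision: boolean
  unavailableServices: string[]
}

function checkBreakers(breakers: ServiceBreaker[]): PreflightResult {
  // A service counts as unavailable while its breaker is open.
  const unavailableServices = breakers
    .filter((breaker) => breaker.isOpen())
    .map((breaker) => breaker.name)

  // Provisioning is allowed only when every service is reachable.
  return { canProvision: unavailableServices.length === 0, unavailableServices }
}
```

Returning the list of unavailable services, rather than a bare boolean, lets the caller tell the user which dependency is blocking provisioning.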

3. Atomic Provisioning Flow

export async function atomicProvisionResources() {
  // 1. Pre-flight checks
  // 2. Create resources sequentially
  // 3. Rollback on any failure
  // 4. Track state in provisioning_events table
}
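Steps 2 and 3 of this flow can be sketched as a list of create/rollback pairs that is executed in order and unwound in reverse when a step fails. The names ProvisioningStep, ProvisioningResult, and runAtomically are hypothetical and chosen for this example; pre-flight checks and event tracking are left out to keep the sketch short.

```typescript
// Each step pairs a create action with the cleanup that undoes it.
interface ProvisioningStep {
  name: string
  create(): Promise<void>
  rollback(): Promise<void>
}

interface ProvisioningResult {
  success: boolean
  failedStep?: string
  rolledBack: string[]
}

async function runAtomically(steps: ProvisioningStep[]): Promise<ProvisioningResult> {
  const completed: ProvisioningStep[] = []
  for (const step of steps) {
    try {
      await step.create()
      completed.push(step)
    } catch {
      const rolledBack: string[] = []
      // Undo in reverse order so dependent resources are removed first.
      for (const done of completed.reverse()) {
        try {
          await done.rollback()
          rolledBack.push(done.name)
        } catch {
          // A failed rollback leaves an orphaned resource; a real system
          // would record it so that someone can clean it up manually.
        }
      }
      return { success: false, failedStep: step.name, rolledBack }
    }
  }
  return { success: true, rolledBack: [] }
}
```

Running the steps sequentially rather than in parallel means the set of resources that needs cleanup is always known exactly: everything in completed, and nothing else.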

4. Provisioning Events Table

CREATE TABLE provisioning_events (
  id UUID PRIMARY KEY,
  organization_id UUID NOT NULL,
  event_type TEXT NOT NULL,
  status TEXT NOT NULL,
  metadata JSONB DEFAULT '{}',
  created_at TIMESTAMPTZ DEFAULT NOW()
);
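Because the table is append-only in spirit (one row per attempt or state change), the current provisioning state of an organization is simply its most recent row. The sketch below shows that lookup over rows already loaded into memory. The ProvisioningEvent interface is an assumed camelCase mapping of the columns above, and latestEvent is a hypothetical helper; the ADR does not list the allowed values of event_type or status, so they are typed as plain strings.

```typescript
// Assumed in-memory shape of one provisioning_events row.
interface ProvisioningEvent {
  id: string
  organizationId: string
  eventType: string
  status: string
  metadata: Record<string, unknown>
  createdAt: Date
}

function latestEvent(
  events: ProvisioningEvent[],
  organizationId: string,
): ProvisioningEvent | undefined {
  return events
    .filter((event) => event.organizationId === organizationId)
    .reduce<ProvisioningEvent | undefined>(
      // Keep whichever event has the later created_at timestamp.
      (latest, event) =>
        latest === undefined || event.createdAt.getTime() > latest.createdAt.getTime()
          ? event
          : latest,
      undefined,
    )
}
```

In practice the same question would more likely be answered in SQL with an ORDER BY created_at DESC LIMIT 1 query filtered on organization_id.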

5. Manual Retry Mutation

mutation RetryProvisioning($organizationId: ID!) {
  retryProvisioning(organizationId: $organizationId) {
    success
    errors
  }
}
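A resolver behind this mutation might look roughly like the sketch below: it rejects non-admin callers, re-runs provisioning, and maps the outcome onto the success and errors fields. Everything here is an assumption made for illustration: the function name retryProvisioningResolver, the isAdmin flag on the context, the injected provision callback, and the choice to model errors as a list of strings (the mutation selects errors without sub-fields, but the schema is not shown in this ADR).

```typescript
// Hypothetical request context and payload shape.
interface RetryContext {
  isAdmin: boolean
}

interface RetryResult {
  success: boolean
  errors: string[]
}

async function retryProvisioningResolver(
  args: { organizationId: string },
  context: RetryContext,
  provision: (organizationId: string) => Promise<void>,
): Promise<RetryResult> {
  // Manual retry is an admin-only operation.
  if (!context.isAdmin) {
    return { success: false, errors: ['Not authorized'] }
  }
  try {
    await provision(args.organizationId)
    return { success: true, errors: [] }
  } catch (err) {
    // Surface the failure reason in the payload instead of throwing.
    const message = err instanceof Error ? err.message : String(err)
    return { success: false, errors: [message] }
  }
}
```

Reporting failures in the payload keeps the response shape the same for success and failure, so an admin UI can display the reason directly.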

Consequences

Positive

  • No partial provisioning: Organizations have all resources or none
  • Better reliability: Circuit breakers stop the system from hammering services that are already down
  • Visibility: Complete audit trail of provisioning attempts
  • Recovery: Easy manual retry when services recover
  • User experience: Clear error messages when provisioning is blocked

Negative

  • Delayed provisioning: Users must wait if any service is down
  • Complexity: More code to maintain
  • Database overhead: Additional provisioning_events table

Trade-offs

We chose reliability over availability. Users may need to wait for provisioning, but they won't end up with broken organizations requiring support intervention.

Metrics

  • Provisioning success rate
  • Circuit breaker state changes
  • Time to successful provisioning
  • Number of manual retries needed

Future Considerations

  1. Automatic retry with exponential backoff
  2. Webhook notifications for provisioning status
  3. Alternative provisioning strategies (queue-based)
  4. Service health dashboard for users
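
For item 1, a minimal sketch of what automatic retry with exponential backoff could look like. Nothing here is implemented in the current system; the names backoffDelayMs and retryWithBackoff, the default base delay of 1 second, and the 60 second cap are all assumptions chosen for illustration. The sketch expects maxAttempts to be at least 1.

```typescript
// Delay doubles with each attempt and is capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 60000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt)
}

async function retryWithBackoff<T>(
  action: () => Promise<T>,
  maxAttempts: number,
  // Injectable sleep so the retry loop can be tested without real delays.
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await action()
    } catch (err) {
      lastError = err
      // Wait before the next attempt, but not after the final one.
      if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt))
    }
  }
  throw lastError
}
```

A production version would usually add random jitter to each delay so that many organizations retrying at once do not hit a recovering service in lockstep.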