Source:
ocean/docs/adr/0035-atomic-provisioning.md
ADR 0035: Atomic Resource Provisioning
Status
Implemented
Context
The previous provisioning system ran each provisioning step inside Promise.allSettled with its own catch block, so a failed step was swallowed while the other steps completed. When an external service (Stripe, Neon) was down, this left organizations with incomplete resources that required manual intervention or complex recovery logic.
The fire-and-forget approach meant:
- Organizations could have Stripe customers but no Neon database
- No visibility into provisioning failures
- No automatic recovery mechanism
- Poor user experience when services were temporarily unavailable
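The failure mode described above can be reproduced in a few lines. This is a minimal sketch, not the original code: `createStripeCustomer`, `createNeonDatabase`, and `legacyProvision` are hypothetical stand-ins, and the Neon call is hard-coded to fail to simulate an outage.

```typescript
// Hypothetical stand-ins for the real Stripe and Neon calls.
async function createStripeCustomer(orgId: string): Promise<string> {
  return `cus_${orgId}`
}

async function createNeonDatabase(orgId: string): Promise<string> {
  // Simulates an outage of the external service.
  throw new Error(`Neon unavailable for ${orgId}`)
}

// Each step swallows its own error, so the caller never learns that
// the organization ended up with only some of its resources.
async function legacyProvision(orgId: string): Promise<(string | null)[]> {
  const results = await Promise.allSettled([
    createStripeCustomer(orgId).catch(() => null),
    createNeonDatabase(orgId).catch(() => null),
  ])
  return results.map((r) => (r.status === 'fulfilled' ? r.value : null))
}
```

Calling `legacyProvision` here resolves normally even though the database was never created, which is the partial state this ADR sets out to eliminate.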
Decision
Implement atomic provisioning with the following components:
- Pre-flight checks: Verify service availability before attempting provisioning
- Circuit breakers: Prevent cascading failures and provide service state
- Transactional provisioning: All-or-nothing resource creation
- Rollback capability: Clean up partial resources on failure
- Event tracking: Detailed provisioning state in database
- Manual retry: GraphQL mutation for admins to retry provisioning
Implementation
1. Circuit Breaker Integration
const stripeCircuitBreaker = createStripeCircuitBreaker(() => {
  throw new Error('Stripe service is currently unavailable')
})
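The ADR does not show what `createStripeCircuitBreaker` does internally. As an illustration of the general pattern only, here is a minimal circuit breaker sketch. The `CircuitBreaker` class, its thresholds, and the injectable clock are all assumptions made for this example, not the project's implementation.

```typescript
type BreakerState = 'closed' | 'open' | 'half-open'

class CircuitBreaker<T> {
  private failures = 0
  private openedAt = 0
  private readonly fallback: () => T
  private readonly failureThreshold: number
  private readonly resetTimeoutMs: number
  private readonly now: () => number

  constructor(
    fallback: () => T,
    failureThreshold = 3,
    resetTimeoutMs = 30000,
    now: () => number = Date.now,
  ) {
    this.fallback = fallback
    this.failureThreshold = failureThreshold
    this.resetTimeoutMs = resetTimeoutMs
    this.now = now
  }

  get state(): BreakerState {
    if (this.failures < this.failureThreshold) return 'closed'
    // After the reset timeout, one trial call is allowed through.
    return this.now() - this.openedAt >= this.resetTimeoutMs ? 'half-open' : 'open'
  }

  async fire(action: () => Promise<T>): Promise<T> {
    // While open, skip the remote call entirely and use the fallback.
    if (this.state === 'open') return this.fallback()
    try {
      const result = await action()
      this.failures = 0 // any success closes the breaker
      return result
    } catch (err) {
      this.failures += 1
      if (this.failures >= this.failureThreshold) this.openedAt = this.now()
      throw err
    }
  }
}
```

A fallback that throws, as in the snippet above, turns an open breaker into an immediate, descriptive error instead of a slow timeout against a service that is already down.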
2. Pre-flight Checks
export async function preflightChecks(): Promise<{
  canProvision: boolean
  unavailableServices: string[]
}> {
  // Check circuit breaker states
  // Return early if services are down
}
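One way the body of such a check could look is sketched below. The `ServiceBreaker` interface and the `checkBreakers` helper are hypothetical; only the shape of the result (`canProvision`, `unavailableServices`) comes from the signature above.

```typescript
// Hypothetical minimal view of a circuit breaker: only what the check needs.
interface ServiceBreaker {
  name: string
  isOpen(): boolean
}

interface PreflightResult {
  canProvision: boolean
  unavailableServices: string[]
}

// Provisioning is allowed only when no breaker is open.
function checkBreakers(breakers: ServiceBreaker[]): PreflightResult {
  const unavailableServices = breakers.filter((b) => b.isOpen()).map((b) => b.name)
  return { canProvision: unavailableServices.length === 0, unavailableServices }
}
```

Returning the names of the unavailable services, rather than a bare boolean, is what lets the caller show a specific error message when provisioning is blocked.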
3. Atomic Provisioning Flow
export async function atomicProvisionResources() {
  // 1. Pre-flight checks
  // 2. Create resources sequentially
  // 3. Rollback on any failure
  // 4. Track state in provisioning_events table
}
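Steps 2 and 3 of the outline above can be sketched as a loop that remembers which resources were created and undoes them in reverse order on failure. The `ProvisionStep` shape and the `provisionAllOrNothing` name are assumptions for illustration. Pre-flight checks and event tracking are left out to keep the example short.

```typescript
// Hypothetical step shape: each resource knows how to create and undo itself.
interface ProvisionStep {
  name: string
  create(): Promise<void>
  rollback(): Promise<void>
}

interface ProvisionOutcome {
  success: boolean
  errors: string[]
}

async function provisionAllOrNothing(steps: ProvisionStep[]): Promise<ProvisionOutcome> {
  const completed: ProvisionStep[] = []
  try {
    for (const step of steps) {
      await step.create()
      completed.push(step)
    }
    return { success: true, errors: [] }
  } catch (err) {
    const errors = [err instanceof Error ? err.message : String(err)]
    // Undo in reverse order; a failed rollback is recorded, not rethrown,
    // so the remaining resources still get cleaned up.
    for (const step of completed.reverse()) {
      try {
        await step.rollback()
      } catch (rollbackErr) {
        errors.push(`rollback of ${step.name} failed: ${String(rollbackErr)}`)
      }
    }
    return { success: false, errors }
  }
}
```

Creating resources sequentially rather than in parallel keeps the rollback list exact: at the moment of failure, the code knows precisely which resources exist.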
4. Provisioning Events Table
CREATE TABLE provisioning_events (
  id UUID PRIMARY KEY,
  organization_id UUID NOT NULL,
  event_type TEXT NOT NULL,
  status TEXT NOT NULL,
  metadata JSONB DEFAULT '{}',
  created_at TIMESTAMPTZ DEFAULT NOW()
);
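As a rough illustration of how application code might write to this table, the sketch below builds a parameterized INSERT. The `ProvisioningEvent` type, the status values, and the `buildInsertEvent` helper are assumptions, since the ADR does not list the allowed `event_type` or `status` values. Because `id` has no default in the schema above, the caller supplies it.

```typescript
// Assumed status values; the ADR only says the column is TEXT.
type ProvisioningStatus = 'started' | 'succeeded' | 'failed' | 'rolled_back'

interface ProvisioningEvent {
  id: string
  organizationId: string
  eventType: string
  status: ProvisioningStatus
  metadata?: Record<string, unknown>
}

// Returns query text and positional values; created_at is left to its default.
function buildInsertEvent(event: ProvisioningEvent): { text: string; values: unknown[] } {
  return {
    text:
      'INSERT INTO provisioning_events (id, organization_id, event_type, status, metadata) ' +
      'VALUES ($1, $2, $3, $4, $5::jsonb)',
    values: [
      event.id,
      event.organizationId,
      event.eventType,
      event.status,
      JSON.stringify(event.metadata ?? {}),
    ],
  }
}
```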
5. Manual Retry Mutation
mutation RetryProvisioning($organizationId: ID!) {
  retryProvisioning(organizationId: $organizationId) {
    success
    errors
  }
}
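On the server side, a resolver for this mutation might look roughly like the sketch below. The `RetryDeps` interface and `retryProvisioningResolver` are hypothetical, the admin check is reduced to a boolean, and `errors` is assumed to be a list of strings. Only the `success` and `errors` fields come from the mutation above.

```typescript
// Hypothetical dependencies, injected so the resolver stays easy to test.
interface RetryDeps {
  isAdmin: boolean
  preflight(): Promise<{ canProvision: boolean; unavailableServices: string[] }>
  provision(organizationId: string): Promise<{ success: boolean; errors: string[] }>
}

async function retryProvisioningResolver(
  organizationId: string,
  deps: RetryDeps,
): Promise<{ success: boolean; errors: string[] }> {
  if (!deps.isAdmin) return { success: false, errors: ['Not authorized'] }
  // Re-run the pre-flight checks so a retry never starts against a down service.
  const check = await deps.preflight()
  if (!check.canProvision) {
    return {
      success: false,
      errors: check.unavailableServices.map((s) => `${s} is currently unavailable`),
    }
  }
  return deps.provision(organizationId)
}
```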
Consequences
Positive
- No partial provisioning: Organizations have all resources or none
- Better reliability: Circuit breakers stop requests from hammering services that are already down
- Visibility: Complete audit trail of provisioning attempts
- Recovery: Easy manual retry when services recover
- User experience: Clear error messages when provisioning is blocked
Negative
- Delayed provisioning: Users must wait if any service is down
- Complexity: More code to maintain
- Database overhead: Additional provisioning_events table
Trade-offs
We chose reliability over availability. Users may need to wait for provisioning, but they won't end up with broken organizations requiring support intervention.
Metrics
- Provisioning success rate
- Circuit breaker state changes
- Time to successful provisioning
- Number of manual retries needed
Future Considerations
- Automatic retry with exponential backoff
- Webhook notifications for provisioning status
- Alternative provisioning strategies (queue-based)
- Service health dashboard for users
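For the first item in the list above, automatic retry with exponential backoff could be sketched as follows. The function names, the 1 second base delay, the 60 second cap, and the limit of five attempts are illustrative assumptions, not decisions recorded in this ADR.

```typescript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 60000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt)
}

async function retryWithBackoff<T>(
  action: () => Promise<T>,
  maxAttempts = 5,
  // The sleep function is injectable so tests do not have to wait in real time.
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => {
      setTimeout(resolve, ms)
    }),
): Promise<T> {
  let lastError: unknown
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await action()
    } catch (err) {
      lastError = err
      // No point sleeping after the final attempt.
      if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt))
    }
  }
  throw lastError
}
```

A production version would likely add random jitter to the delay so that many organizations retrying at once do not hit a recovering service at the same moment.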