Source: ocean/docs/adr/027-cleanup-automation-implementation.md | ✏️ Edit on GitHub

ADR-027: Cleanup Automation Implementation

Status

Accepted

Context

Our multi-tenant SaaS platform provisions external resources (Stripe customers, Neon databases) during organization creation. The baseline analysis identified that we had no automated cleanup for:

Failed provisioning attempts leaving orphaned resources
Unverified accounts that never complete signup
Stuck provisioning processes that timeout

This leads to:

Resource waste (paying for unused Neon databases)
Cluttered Stripe customer list
Potential security issues with abandoned accounts
Manual cleanup burden on operations team

Decision

We implemented a comprehensive automated cleanup system with the following components:

1. Database Layer

Function-based approach: Using PostgreSQL functions for complex queries
Event-driven tracking: All cleanup attempts recorded in provisioning_events
Monitoring views: Real-time visibility into cleanup needs

2. Edge Functions

cleanup-failed-provisions: Dedicated function for failed provisions
Enhanced scheduled-cleanup: Unified handler for all cleanup types
Existing cleanup-resources: Reused for actual resource deletion

3. Safety Features

Dry-run mode: Test cleanup logic without deletion
Configurable thresholds: Prevent premature cleanup
Audit trail: Complete history of all cleanup attempts
Batch limiting: Prevent overwhelming external APIs

Architecture

Database Functions

-- Identifies failed provisions
get_failed_provisions(hours_threshold, include_incomplete)
  -> Returns organizations with failed/stuck provisioning

-- Finds unverified accounts
get_cleanup_candidates(hours_threshold)
  -> Returns users who never verified email

-- Performs cleanup
cleanup_unverified_accounts(hours_threshold, dry_run)
  -> Deletes users and empty organizations

-- Records attempts
record_cleanup_attempt(org_id, type, success, resources, error)
  -> Creates audit trail in provisioning_events

Request Flow

Cron Job / Manual Trigger
    ↓
scheduled-cleanup (router)
    ├─→ cleanup-failed-provisions (failed provisions)
    │       ↓
    │   get_failed_provisions()
    │       ↓
    │   cleanup-resources (delete Stripe/Neon)
    │       ↓
    │   record_cleanup_attempt()
    │
    └─→ get_cleanup_candidates() (unverified accounts)
            ↓
        cleanup_unverified_accounts()
            ↓
        record_cleanup_attempt()

Implementation Details

Provisioning State Detection

We identify failed provisions through multiple signals:

Explicit failure: provisioning_failed event
Stuck state: provisioning_started but no completion after threshold
Partial resources: Has Stripe XOR Neon (should have both)
Rollback state: provisioning_rollback_started events

Cleanup Strategy

Conservative defaults: 24-hour threshold for failed provisions
External first: Delete Stripe/Neon before database records
Atomic operations: Use transactions where possible
Graceful degradation: Continue on individual failures

Scheduling Options

We support multiple scheduling approaches:

Supabase pg_cron: Native PostgreSQL scheduling
GitHub Actions: CI/CD based scheduling
External services: Cron-job.org, Vercel Cron, etc.

Consequences

Positive

Cost savings: Automatic cleanup of unused Neon databases
Operational efficiency: No manual cleanup needed
Better security: Abandoned accounts are removed
Audit compliance: Complete cleanup history
Flexible scheduling: Multiple deployment options
Safe testing: Dry-run mode prevents accidents

Negative

Additional complexity: More moving parts to monitor
Potential for data loss: If thresholds too aggressive
External API dependencies: Cleanup can fail if APIs down
Monitoring overhead: Need to watch cleanup metrics

Neutral

Database migrations: New functions added to schema
No breaking changes: Enhances existing system
Resource usage: Minimal compute for cleanup jobs

Risk Mitigation

Conservative thresholds: Start with 24-72 hour delays
Dry-run testing: Always test before production
Monitoring alerts: Alert on cleanup failures
Manual override: Can still cleanup individually
Rollback plan: Cleanup functions can be disabled

Monitoring

Key metrics to track:

Failed provisions count
Cleanup success rate
Resources deleted (Stripe/Neon)
Error rates by type
Oldest unresolved provision

SQL monitoring query:

SELECT * FROM cleanup_monitoring;

Future Enhancements

Smart thresholds: ML-based detection of abandonment
Partial cleanup: Clean only specific resources
Notification system: Email before deletion
Recovery mechanism: Restore recently deleted
Cost tracking: Calculate savings from cleanup

References

Issue Resolution Project: /docs/ISSUE_RESOLUTION_PROJECT.md
Cleanup Documentation: /docs/cleanup-automation.md
Migration: /supabase/migrations/20250827_add_cleanup_automation.sql
Edge Functions:
- /supabase/functions/cleanup-failed-provisions/
- /supabase/functions/scheduled-cleanup/

Review Date

September 2025 - Review cleanup effectiveness and adjust thresholds

Status​

Context​

Decision​

1. Database Layer​

2. Edge Functions​

3. Safety Features​

Architecture​

Database Functions​

Request Flow​

Implementation Details​

Provisioning State Detection​

Cleanup Strategy​

Scheduling Options​

Consequences​

Positive​

Negative​

Neutral​

Risk Mitigation​

Monitoring​

Future Enhancements​

References​

Review Date​