Skip to main content

Source: ocean/docs/adr/027-cleanup-automation-implementation.md | ✏️ Edit on GitHub

ADR-027: Cleanup Automation Implementation

Status

Accepted

Context

Our multi-tenant SaaS platform provisions external resources (Stripe customers, Neon databases) during organization creation. The baseline analysis identified that we had no automated cleanup for:

  1. Failed provisioning attempts leaving orphaned resources
  2. Unverified accounts that never complete signup
  3. Stuck provisioning processes that timeout

This leads to:

  • Resource waste (paying for unused Neon databases)
  • Cluttered Stripe customer list
  • Potential security issues with abandoned accounts
  • Manual cleanup burden on operations team

Decision

We implemented a comprehensive automated cleanup system with the following components:

1. Database Layer

  • Function-based approach: Using PostgreSQL functions for complex queries
  • Event-driven tracking: All cleanup attempts recorded in provisioning_events
  • Monitoring views: Real-time visibility into cleanup needs

2. Edge Functions

  • cleanup-failed-provisions: Dedicated function for failed provisions
  • Enhanced scheduled-cleanup: Unified handler for all cleanup types
  • Existing cleanup-resources: Reused for actual resource deletion

3. Safety Features

  • Dry-run mode: Test cleanup logic without deletion
  • Configurable thresholds: Prevent premature cleanup
  • Audit trail: Complete history of all cleanup attempts
  • Batch limiting: Prevent overwhelming external APIs

Architecture

Database Functions

-- Identifies failed provisions
get_failed_provisions(hours_threshold, include_incomplete)
-> Returns organizations with failed/stuck provisioning

-- Finds unverified accounts
get_cleanup_candidates(hours_threshold)
-> Returns users who never verified email

-- Performs cleanup
cleanup_unverified_accounts(hours_threshold, dry_run)
-> Deletes users and empty organizations

-- Records attempts
record_cleanup_attempt(org_id, type, success, resources, error)
-> Creates audit trail in provisioning_events

Request Flow

Cron Job / Manual Trigger

scheduled-cleanup (router)
├─→ cleanup-failed-provisions (failed provisions)
│ ↓
│ get_failed_provisions()
│ ↓
│ cleanup-resources (delete Stripe/Neon)
│ ↓
│ record_cleanup_attempt()

└─→ get_cleanup_candidates() (unverified accounts)

cleanup_unverified_accounts()

record_cleanup_attempt()

Implementation Details

Provisioning State Detection

We identify failed provisions through multiple signals:

  1. Explicit failure: provisioning_failed event
  2. Stuck state: provisioning_started but no completion after threshold
  3. Partial resources: Has Stripe XOR Neon (should have both)
  4. Rollback state: provisioning_rollback_started events

Cleanup Strategy

  1. Conservative defaults: 24-hour threshold for failed provisions
  2. External first: Delete Stripe/Neon before database records
  3. Atomic operations: Use transactions where possible
  4. Graceful degradation: Continue on individual failures

Scheduling Options

We support multiple scheduling approaches:

  1. Supabase pg_cron: Native PostgreSQL scheduling
  2. GitHub Actions: CI/CD based scheduling
  3. External services: Cron-job.org, Vercel Cron, etc.

Consequences

Positive

  1. Cost savings: Automatic cleanup of unused Neon databases
  2. Operational efficiency: No manual cleanup needed
  3. Better security: Abandoned accounts are removed
  4. Audit compliance: Complete cleanup history
  5. Flexible scheduling: Multiple deployment options
  6. Safe testing: Dry-run mode prevents accidents

Negative

  1. Additional complexity: More moving parts to monitor
  2. Potential for data loss: If thresholds too aggressive
  3. External API dependencies: Cleanup can fail if APIs down
  4. Monitoring overhead: Need to watch cleanup metrics

Neutral

  1. Database migrations: New functions added to schema
  2. No breaking changes: Enhances existing system
  3. Resource usage: Minimal compute for cleanup jobs

Risk Mitigation

  1. Conservative thresholds: Start with 24-72 hour delays
  2. Dry-run testing: Always test before production
  3. Monitoring alerts: Alert on cleanup failures
  4. Manual override: Can still cleanup individually
  5. Rollback plan: Cleanup functions can be disabled

Monitoring

Key metrics to track:

  • Failed provisions count
  • Cleanup success rate
  • Resources deleted (Stripe/Neon)
  • Error rates by type
  • Oldest unresolved provision

SQL monitoring query:

SELECT * FROM cleanup_monitoring;

Future Enhancements

  1. Smart thresholds: ML-based detection of abandonment
  2. Partial cleanup: Clean only specific resources
  3. Notification system: Email before deletion
  4. Recovery mechanism: Restore recently deleted
  5. Cost tracking: Calculate savings from cleanup

References

  • Issue Resolution Project: /docs/ISSUE_RESOLUTION_PROJECT.md
  • Cleanup Documentation: /docs/cleanup-automation.md
  • Migration: /supabase/migrations/20250827_add_cleanup_automation.sql
  • Edge Functions:
    • /supabase/functions/cleanup-failed-provisions/
    • /supabase/functions/scheduled-cleanup/

Review Date

September 2025 - Review cleanup effectiveness and adjust thresholds