Source:
ocean/docs/adr/027-cleanup-automation-implementation.md| ✏️ Edit on GitHub
ADR-027: Cleanup Automation Implementation
Status
Accepted
Context
Our multi-tenant SaaS platform provisions external resources (Stripe customers, Neon databases) during organization creation. The baseline analysis identified that we had no automated cleanup for:
- Failed provisioning attempts leaving orphaned resources
- Unverified accounts that never complete signup
- Stuck provisioning processes that timeout
This leads to:
- Resource waste (paying for unused Neon databases)
- Cluttered Stripe customer list
- Potential security issues with abandoned accounts
- Manual cleanup burden on operations team
Decision
We implemented a comprehensive automated cleanup system with the following components:
1. Database Layer
- Function-based approach: Using PostgreSQL functions for complex queries
- Event-driven tracking: All cleanup attempts recorded in
provisioning_events - Monitoring views: Real-time visibility into cleanup needs
2. Edge Functions
cleanup-failed-provisions: Dedicated function for failed provisions- Enhanced
scheduled-cleanup: Unified handler for all cleanup types - Existing
cleanup-resources: Reused for actual resource deletion
3. Safety Features
- Dry-run mode: Test cleanup logic without deletion
- Configurable thresholds: Prevent premature cleanup
- Audit trail: Complete history of all cleanup attempts
- Batch limiting: Prevent overwhelming external APIs
Architecture
Database Functions
-- Identifies failed provisions
get_failed_provisions(hours_threshold, include_incomplete)
-> Returns organizations with failed/stuck provisioning
-- Finds unverified accounts
get_cleanup_candidates(hours_threshold)
-> Returns users who never verified email
-- Performs cleanup
cleanup_unverified_accounts(hours_threshold, dry_run)
-> Deletes users and empty organizations
-- Records attempts
record_cleanup_attempt(org_id, type, success, resources, error)
-> Creates audit trail in provisioning_events
Request Flow
Cron Job / Manual Trigger
↓
scheduled-cleanup (router)
├─→ cleanup-failed-provisions (failed provisions)
│ ↓
│ get_failed_provisions()
│ ↓
│ cleanup-resources (delete Stripe/Neon)
│ ↓
│ record_cleanup_attempt()
│
└─→ get_cleanup_candidates() (unverified accounts)
↓
cleanup_unverified_accounts()
↓
record_cleanup_attempt()
Implementation Details
Provisioning State Detection
We identify failed provisions through multiple signals:
- Explicit failure:
provisioning_failedevent - Stuck state:
provisioning_startedbut no completion after threshold - Partial resources: Has Stripe XOR Neon (should have both)
- Rollback state:
provisioning_rollback_startedevents
Cleanup Strategy
- Conservative defaults: 24-hour threshold for failed provisions
- External first: Delete Stripe/Neon before database records
- Atomic operations: Use transactions where possible
- Graceful degradation: Continue on individual failures
Scheduling Options
We support multiple scheduling approaches:
- Supabase pg_cron: Native PostgreSQL scheduling
- GitHub Actions: CI/CD based scheduling
- External services: Cron-job.org, Vercel Cron, etc.
Consequences
Positive
- Cost savings: Automatic cleanup of unused Neon databases
- Operational efficiency: No manual cleanup needed
- Better security: Abandoned accounts are removed
- Audit compliance: Complete cleanup history
- Flexible scheduling: Multiple deployment options
- Safe testing: Dry-run mode prevents accidents
Negative
- Additional complexity: More moving parts to monitor
- Potential for data loss: If thresholds too aggressive
- External API dependencies: Cleanup can fail if APIs down
- Monitoring overhead: Need to watch cleanup metrics
Neutral
- Database migrations: New functions added to schema
- No breaking changes: Enhances existing system
- Resource usage: Minimal compute for cleanup jobs
Risk Mitigation
- Conservative thresholds: Start with 24-72 hour delays
- Dry-run testing: Always test before production
- Monitoring alerts: Alert on cleanup failures
- Manual override: Can still cleanup individually
- Rollback plan: Cleanup functions can be disabled
Monitoring
Key metrics to track:
- Failed provisions count
- Cleanup success rate
- Resources deleted (Stripe/Neon)
- Error rates by type
- Oldest unresolved provision
SQL monitoring query:
SELECT * FROM cleanup_monitoring;
Future Enhancements
- Smart thresholds: ML-based detection of abandonment
- Partial cleanup: Clean only specific resources
- Notification system: Email before deletion
- Recovery mechanism: Restore recently deleted
- Cost tracking: Calculate savings from cleanup
References
- Issue Resolution Project:
/docs/ISSUE_RESOLUTION_PROJECT.md - Cleanup Documentation:
/docs/cleanup-automation.md - Migration:
/supabase/migrations/20250827_add_cleanup_automation.sql - Edge Functions:
/supabase/functions/cleanup-failed-provisions//supabase/functions/scheduled-cleanup/
Review Date
September 2025 - Review cleanup effectiveness and adjust thresholds