Observability Guide for Tenant Provisioning
Overview
This guide explains the cost-optimized observability setup for monitoring tenant provisioning using Sentry, Prometheus, and Grafana Cloud.
Architecture
Edge Functions
├── Sentry (Errors & Exceptions)
├── Structured Logs (JSON format)
└── Prometheus Metrics (/metrics endpoint)
↓
Grafana Cloud
↓
Dashboards & Alerts
Cost Optimization Strategies
1. Sentry Configuration
- Sampling: Only 10% of successful transactions, 100% of errors
- Filtering: Skip CORS errors, validation errors
- Data Scrubbing: Remove passwords, secrets, PII
- Cost: Free tier (5K errors/month) should be sufficient
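The filtering and scrubbing rules above can be sketched as two plain functions. This is a minimal illustration, not the project's actual configuration: the `CapturedEvent` shape, the key list, and the message patterns are all assumptions. In the Sentry JavaScript SDKs, logic like this is typically wired in through the `beforeSend` and `tracesSampleRate` options of `Sentry.init`.

```typescript
// Hypothetical event shape; a real Sentry event carries many more fields.
interface CapturedEvent {
  message: string;
  extra?: Record<string, unknown>;
}

// Keys treated as sensitive. This list is an assumption; adjust it to your data.
const SENSITIVE_KEYS = ["password", "secret", "token", "email"];

// Messages matching these patterns are dropped (assumed wording for CORS and validation noise).
const IGNORED_PATTERNS = [/cors/i, /validation/i];

// Returns null to drop the event, or a scrubbed copy to send.
function filterAndScrub(event: CapturedEvent): CapturedEvent | null {
  if (IGNORED_PATTERNS.some((p) => p.test(event.message))) return null;
  const extra: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(event.extra ?? {})) {
    const sensitive = SENSITIVE_KEYS.some((k) => key.toLowerCase().includes(k));
    extra[key] = sensitive ? "[scrubbed]" : value;
  }
  return { ...event, extra };
}

// Keep every error, but only 10% of successful transactions.
function sampleRate(isError: boolean): number {
  return isError ? 1.0 : 0.1;
}
```

Dropping events before they leave the function is what keeps the count under the free-tier quota; filtering configured only in the Sentry UI may still count against it, depending on the plan.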
2. Prometheus Metrics
- Low Cardinality: Use fixed labels (region, service) not dynamic IDs
- Aggregation: Collect summaries, not individual tenant metrics
- Scrape Interval: 5 minutes instead of 15 seconds
- Cost: Free tier (10K series) easily sufficient
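To see why fixed labels matter, it helps to estimate the series count: each distinct combination of label values is a separate time series, so the upper bound is the product of the value counts. The sketch below uses an assumed list of four regions and a hypothetical 1,000 tenants.

```typescript
// Upper bound on series for one metric: the product of possible values per label.
function seriesCount(labelValues: Record<string, string[]>): number {
  return Object.values(labelValues).reduce((n, values) => n * values.length, 1);
}

// Fixed labels only: 2 statuses x 3 services x 4 regions = 24 series.
const fixed = seriesCount({
  status: ["success", "failed"],
  service: ["neon", "graphql", "ssl"],
  region: ["us-east-1", "us-west-2", "eu-west-1", "ap-southeast-1"], // assumed region list
});

// A per-tenant label multiplies that bound by the number of tenants.
const tenants = Array.from({ length: 1000 }, (_, i) => `org-${i}`); // hypothetical tenants
const withTenantLabel = fixed * tenants.length; // 24,000 series, well past a 10K limit
```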
3. Health Checks
- Sampling: Only 20% of health checks in production
- Batching: Aggregate results hourly
- Cost: Reduces metric volume by 80%
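One way to combine sampling with hourly batching is sketched below. The names `shouldRecord` and `HourlyAggregator` are hypothetical, and the aggregator keeps its state in memory, which is an assumption: in a short-lived edge function the counts would more likely need to live in a database table or similar store.

```typescript
type HealthStatus = "healthy" | "degraded" | "unhealthy";

// Decide whether to record this check; `random` is injectable so tests are deterministic.
function shouldRecord(rate: number, random: () => number = Math.random): boolean {
  return random() < rate;
}

// Counts per UTC hour and status, held until flushed.
class HourlyAggregator {
  private buckets = new Map<string, Record<HealthStatus, number>>();

  record(status: HealthStatus, at: Date): void {
    const hour = at.toISOString().slice(0, 13); // e.g. "2024-01-01T10"
    const counts = this.buckets.get(hour) ?? { healthy: 0, degraded: 0, unhealthy: 0 };
    counts[status] += 1;
    this.buckets.set(hour, counts);
  }

  // Return everything collected so far and start over.
  flush(): Map<string, Record<HealthStatus, number>> {
    const out = this.buckets;
    this.buckets = new Map();
    return out;
  }
}
```

A caller would gate each check with `shouldRecord(0.2)` and push the flushed counts to the metrics pipeline once per hour.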
Metrics Collected
Core Business Metrics
# Provisioning attempts
tenant_provisioning_total{status="success|failed", service="neon|graphql|ssl", region="us-east-1"}
# Provisioning duration (histogram)
tenant_provisioning_duration_seconds{service="neon"}
# API performance
neon_api_calls_total{method="createProject", status="success|error"}
# Health checks (sampled)
tenant_health_check_total{status="healthy|degraded|unhealthy", check_type="connectivity"}
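For readers unfamiliar with the histogram type used for provisioning duration, the sketch below shows the general idea: Prometheus histograms keep cumulative bucket counts plus a running sum and count. The class name and the bucket boundaries are assumptions chosen for illustration, not values taken from this project.

```typescript
// Bucket upper bounds in seconds (assumed; pick bounds that bracket your real durations).
const BUCKET_BOUNDS = [1, 5, 15, 60, 300];

class DurationHistogram {
  private buckets = BUCKET_BOUNDS.map((le) => ({ le, count: 0 }));
  private sum = 0;
  private count = 0;

  observe(seconds: number): void {
    this.sum += seconds;
    this.count += 1;
    // Buckets are cumulative: an observation counts toward every bound it fits under.
    for (const bucket of this.buckets) {
      if (seconds <= bucket.le) bucket.count += 1;
    }
  }

  snapshot(): { buckets: Record<string, number>; sum: number; count: number } {
    const out: Record<string, number> = {};
    for (const bucket of this.buckets) out[String(bucket.le)] = bucket.count;
    out["+Inf"] = this.count; // the +Inf bucket always equals the total count
    return { buckets: out, sum: this.sum, count: this.count };
  }
}
```

Because only the bucket counts are stored, the cost is a fixed number of series per label set regardless of how many provisions are observed.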
What We DON'T Track (to save costs)
- ❌ Per-tenant metrics (would explode cardinality)
- ❌ Real-time request counts (use logs instead)
- ❌ Detailed performance metrics (sample instead)
Implementation
1. Edge Function Setup
// Every edge function starts with:
// (trackOperation is assumed to be exported from the same shared module)
import { initSentry, Logger, tenantMetrics, trackOperation } from '../_shared/observability.ts'

initSentry('function-name')
const logger = new Logger({ functionName: 'function-name' })

// Track operations
await trackOperation('provision_neon', async () => {
  // Your provisioning code
})

// Log important events (organizationId and region are assumed to be in scope)
logger.info('Tenant provisioned', { organizationId, region })

// Track metrics
tenantMetrics.provisioningCounter.inc({
  status: 'success',
  service: 'neon',
  region,
})
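The shared `Logger` is not shown in this guide. As a rough idea of what a structured JSON logger of this kind might look like, here is a hypothetical stand-in called `JsonLogger`; the field names (`timestamp`, `level`, `message`) are assumptions and may differ from the real module.

```typescript
type Level = "info" | "warn" | "error";
type Fields = Record<string, unknown>;

// Hypothetical stand-in for the Logger in ../_shared/observability.ts.
class JsonLogger {
  private base: Fields;
  private write: (line: string) => void;
  private now: () => Date;

  constructor(base: Fields, write: (line: string) => void = console.log, now: () => Date = () => new Date()) {
    this.base = base;
    this.write = write;
    this.now = now;
  }

  // One JSON object per line, so log search can filter on individual fields.
  private emit(level: Level, message: string, fields: Fields = {}): void {
    const entry = { timestamp: this.now().toISOString(), level, message, ...this.base, ...fields };
    this.write(JSON.stringify(entry));
  }

  info(message: string, fields?: Fields): void {
    this.emit("info", message, fields);
  }

  error(message: string, fields?: Fields): void {
    this.emit("error", message, fields);
  }
}
```

Emitting `organizationId` and `level` as top-level fields is what makes per-tenant questions answerable from logs, without adding tenant IDs as metric labels.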
2. Expose Metrics Endpoint
Each function exposes /metrics:
https://YOUR_PROJECT.supabase.co/functions/v1/provision-tenant-resources/metrics
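What the endpoint returns is the Prometheus text exposition format. The sketch below shows a minimal in-memory counter that renders itself in that format. The `Counter` class here is a hypothetical illustration, not the implementation behind `tenantMetrics`, and it skips label-value escaping for brevity.

```typescript
type Labels = Record<string, string>;

class Counter {
  private name: string;
  private help: string;
  private values = new Map<string, number>();

  constructor(name: string, help: string) {
    this.name = name;
    this.help = help;
  }

  inc(labels: Labels, by = 1): void {
    // Sort label names so the same label set always maps to the same series.
    // Note: values are not escaped; real code must escape backslash, quote, and newline.
    const key = Object.keys(labels).sort().map((k) => `${k}="${labels[k]}"`).join(",");
    this.values.set(key, (this.values.get(key) ?? 0) + by);
  }

  // Render HELP and TYPE headers followed by one sample line per series.
  render(): string {
    const lines = [`# HELP ${this.name} ${this.help}`, `# TYPE ${this.name} counter`];
    this.values.forEach((value, key) => lines.push(`${this.name}{${key}} ${value}`));
    return lines.join("\n") + "\n";
  }
}
```

In an edge function, the rendered string could be returned as the response body with the header `Content-Type: text/plain; version=0.0.4`, which is the content type Prometheus scrapers expect for this format.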
3. Grafana Cloud Setup
- Get credentials from the Grafana Cloud portal
- Run the setup script:
export GRAFANA_CLOUD_API_KEY="your-key"
export GRAFANA_CLOUD_URL="https://prometheus-us-central1.grafana.net"
export GRAFANA_CLOUD_USER="123456"
./scripts/setup-grafana-monitoring.sh
Monitoring Dashboards
1. Provisioning Overview
- Success rate gauge
- Provisioning duration (P50, P95, P99)
- Failed provisions by service
- Active tenants by region
2. System Health
- Neon API availability
- Edge function error rate
- Sentry event volume
- Cost tracking metrics
Alerts
Critical Alerts (Page immediately)
- Provisioning success rate < 90%
- Neon API down > 5 minutes
- Any SSL provisioning failures
Warning Alerts (Notify team)
- Provisioning duration > 5 minutes (P95)
- High retry rate
- Approaching rate limits
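In practice these thresholds would be configured as Grafana alert rules. To make the conditions concrete, the sketch below expresses the quantifiable ones as a plain function over a metrics snapshot; the `Snapshot` fields and the `evaluateAlerts` name are hypothetical, and the retry-rate and rate-limit warnings are left out because the guide gives no thresholds for them.

```typescript
// Hypothetical snapshot of the values the alert rules look at.
interface Snapshot {
  provisioningSuccess: number;
  provisioningFailed: number;
  sslFailures: number;
  neonDownMinutes: number;
  p95DurationSeconds: number;
}

type Alert = { severity: "critical" | "warning"; reason: string };

function evaluateAlerts(s: Snapshot): Alert[] {
  const alerts: Alert[] = [];
  const total = s.provisioningSuccess + s.provisioningFailed;
  // With no attempts there is no rate to judge, so skip instead of dividing by zero.
  if (total > 0 && s.provisioningSuccess / total < 0.9) {
    alerts.push({ severity: "critical", reason: "Provisioning success rate < 90%" });
  }
  if (s.neonDownMinutes > 5) {
    alerts.push({ severity: "critical", reason: "Neon API down > 5 minutes" });
  }
  if (s.sslFailures > 0) {
    alerts.push({ severity: "critical", reason: "SSL provisioning failure" });
  }
  if (s.p95DurationSeconds > 300) {
    alerts.push({ severity: "warning", reason: "P95 provisioning duration > 5 minutes" });
  }
  return alerts;
}
```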
Debugging Guide
1. Tenant Provisioning Failed
- Check Sentry for error details
- Search logs: organizationId:"xxx" level:error
- Check metrics: tenant_provisioning_total{status="failed"}
- Review the provisioning_events table
2. Performance Issues
- Check P95 latency in Grafana
- Look for Neon API latency spikes
- Review transaction traces in Sentry
- Check for resource limits
3. Cost Spikes
- Review cardinality in Grafana (Billing → Usage)
- Check Sentry event volume
- Verify sampling is working
- Look for metric explosion
Cost Monitoring
Monthly Budget (estimated)
- Sentry: $0 (free tier)
- Grafana Cloud: $0-20 (free tier usually sufficient)
- Total: < $20/month for most cases
When You'll Hit Limits
- Sentry: >5K errors/month (add more filtering)
- Grafana: >10K unique series (reduce cardinality)
- Logs: >50GB/month (sample more aggressively)
Best Practices
- Always Sample Success: Full tracking only for errors
- Use Fixed Labels: Never use dynamic IDs as labels
- Aggregate Early: Sum/average before sending metrics
- Set Retention: 7 days for details, 30 days for summaries
- Monitor Costs: Set up billing alerts in all services
Emergency Procedures
High Cost Alert
- Immediately lower sampling rates (keep a smaller fraction of events)
- Drop non-essential metrics
- Reduce scrape frequency
- Clear historical data if needed
Monitoring Outage
- Errors still go to Supabase logs
- Check Edge Function logs directly
- Use SQL queries for backup monitoring
- Fall back to manual checks
Future Optimizations
- OpenTelemetry Migration: When Deno support improves
- Custom Aggregator: Build tenant metrics aggregator
- ClickHouse: For long-term metric storage
- Adaptive Sampling: Increase during incidents