
Source: ocean/docs/observability-guide.md

Observability Guide for Tenant Provisioning

Overview

This guide explains the cost-optimized observability setup for monitoring tenant provisioning using Sentry, Prometheus, and Grafana Cloud.

Architecture

Edge Functions
├── Sentry (Errors & Exceptions)
├── Structured Logs (JSON format)
└── Prometheus Metrics (/metrics endpoint)
        ↓
  Grafana Cloud
        ↓
Dashboards & Alerts

Cost Optimization Strategies

1. Sentry Configuration

  • Sampling: Only 10% of successful transactions, 100% of errors
  • Filtering: Skip CORS errors, validation errors
  • Data Scrubbing: Remove passwords, secrets, PII
  • Cost: Free tier (5K errors/month) should be sufficient
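The filtering and scrubbing bullets can be sketched as a `beforeSend` hook (a real Sentry SDK option; the event shape, pattern list, and key list below are illustrative assumptions, not the project's actual configuration):

```typescript
// Minimal sketch of a Sentry beforeSend hook that drops noisy errors and
// scrubs sensitive fields before events leave the edge function.
// The SentryEvent type here is a simplified stand-in for the SDK's event type.

type SentryEvent = {
  message?: string
  extra?: Record<string, unknown>
}

const SENSITIVE_KEYS = ['password', 'secret', 'token', 'apikey']
const IGNORED_PATTERNS = [/CORS/i, /validation/i]

export function beforeSend(event: SentryEvent): SentryEvent | null {
  // Filtering: drop CORS and validation errors entirely (never sent, never billed)
  if (event.message && IGNORED_PATTERNS.some((p) => p.test(event.message!))) {
    return null
  }
  // Data scrubbing: redact passwords, secrets, and PII from the extra context
  if (event.extra) {
    for (const key of Object.keys(event.extra)) {
      if (SENSITIVE_KEYS.some((s) => key.toLowerCase().includes(s))) {
        event.extra[key] = '[redacted]'
      }
    }
  }
  return event
}
```

A hook like this would be passed as `beforeSend` in `Sentry.init(...)`, alongside `tracesSampleRate: 0.1` for the 10% transaction sampling; errors are always captured, so only successful transactions are sampled.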

2. Prometheus Metrics

  • Low Cardinality: Use fixed labels (region, service) not dynamic IDs
  • Aggregation: Collect summaries, not individual tenant metrics
  • Scrape Interval: 5 minutes instead of 15 seconds
  • Cost: Free tier (10K series) easily sufficient
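The low-cardinality rule can be illustrated with a counter whose series key is built only from the fixed labels, so dynamic tenant IDs can never mint new time series (the class and label names below are a sketch, not the project's actual metrics library):

```typescript
// Sketch of a low-cardinality counter: series are keyed only by fixed labels.
// A tenant ID passed by mistake is not part of FixedLabels, so it is rejected
// at the type level and can never contribute to the series count.

type FixedLabels = { status: string; service: string; region: string }

export class LowCardinalityCounter {
  private series = new Map<string, number>()

  private key(labels: FixedLabels): string {
    return `status=${labels.status},service=${labels.service},region=${labels.region}`
  }

  inc(labels: FixedLabels, by = 1): void {
    const k = this.key(labels)
    this.series.set(k, (this.series.get(k) ?? 0) + by)
  }

  get(labels: FixedLabels): number {
    return this.series.get(this.key(labels)) ?? 0
  }

  seriesCount(): number {
    return this.series.size
  }
}
```

With this shape, 10,000 tenants provisioned in one region against one service produce a single series, which is how the setup stays well under the 10K free-tier series limit.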

3. Health Checks

  • Sampling: Only 20% of health checks in production
  • Batching: Aggregate results hourly
  • Cost: Reduces metric volume by 80%
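The sampling and batching steps above can be sketched as two small helpers (the 0.2 rate comes from the guide; the bucket shape and injectable RNG are assumptions for illustration and testability):

```typescript
// Sketch of health-check sampling plus hourly batching. Sampling keeps 20% of
// checks; batching collapses an hour of results into one count per status.

type HealthStatus = 'healthy' | 'degraded' | 'unhealthy'

export function shouldSample(rate = 0.2, rng: () => number = Math.random): boolean {
  return rng() < rate
}

export function batchHourly(
  results: { status: HealthStatus; at: Date }[],
): Map<string, number> {
  const buckets = new Map<string, number>()
  for (const r of results) {
    const hour = r.at.toISOString().slice(0, 13) // e.g. "2024-01-01T12"
    const key = `${hour}:${r.status}`
    buckets.set(key, (buckets.get(key) ?? 0) + 1)
  }
  return buckets
}
```

An hour of checks then becomes at most three metric updates (one per status), which combined with the 20% sampling yields the roughly 80% volume reduction noted above.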

Metrics Collected

Core Business Metrics

# Provisioning attempts
tenant_provisioning_total{status="success|failed", service="neon|graphql|ssl", region="us-east-1"}

# Provisioning duration (histogram)
tenant_provisioning_duration_seconds{service="neon"}

# API performance
neon_api_calls_total{method="createProject", status="success|error"}

# Health checks (sampled)
tenant_health_check_total{status="healthy|degraded|unhealthy", check_type="connectivity"}
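The duration histogram above is what feeds the P50/P95/P99 panels and latency alerts. A standard PromQL query for the per-service P95 looks like this (note: with the 5-minute scrape interval, `rate()` needs a window wider than one scrape, hence `[15m]`):

```promql
# P95 provisioning duration per service over the last 15 minutes
histogram_quantile(0.95,
  sum by (service, le) (rate(tenant_provisioning_duration_seconds_bucket[15m])))
```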

What We DON'T Track (to save costs)

  • ❌ Per-tenant metrics (would explode cardinality)
  • ❌ Real-time request counts (use logs instead)
  • ❌ Detailed performance metrics (sample instead)

Implementation

1. Edge Function Setup

// Every edge function starts with:
import { initSentry, Logger, tenantMetrics, trackOperation } from '../_shared/observability.ts'

initSentry('function-name')
const logger = new Logger({ functionName: 'function-name' })

// Track operations
await trackOperation('provision_neon', async () => {
  // Your provisioning code
})

// Log important events
logger.info('Tenant provisioned', { organizationId, region })

// Track metrics
tenantMetrics.provisioningCounter.inc({
  status: 'success',
  service: 'neon',
  region,
})

2. Expose Metrics Endpoint

Each function exposes /metrics:

https://YOUR_PROJECT.supabase.co/functions/v1/provision-tenant-resources/metrics
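Serving that endpoint amounts to rendering the in-memory counters in the Prometheus text exposition format. A minimal sketch (the metric name matches the guide; the storage shape and function name are assumptions):

```typescript
// Sketch: serialize a counter's series into the Prometheus text exposition
// format served at /metrics. Keys are pre-rendered label sets, e.g.
// 'status="success",service="neon",region="us-east-1"'.

export function renderMetrics(
  name: string,
  help: string,
  series: Map<string, number>,
): string {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`]
  for (const [labels, value] of series) {
    lines.push(`${name}{${labels}} ${value}`)
  }
  return lines.join('\n') + '\n'
}
```

In the edge function, a request whose path ends in `/metrics` would return this string with `Content-Type: text/plain; version=0.0.4`.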

3. Grafana Cloud Setup

  1. Get credentials from Grafana Cloud portal

  2. Run setup script:

    export GRAFANA_CLOUD_API_KEY="your-key"
    export GRAFANA_CLOUD_URL="https://prometheus-us-central1.grafana.net"
    export GRAFANA_CLOUD_USER="123456"
    ./scripts/setup-grafana-monitoring.sh
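Under the hood, pushing metrics to Grafana Cloud typically means a `remote_write` block like the following sketch (the exact configuration the script generates may differ; the values map to the environment variables above):

```yaml
remote_write:
  - url: https://prometheus-us-central1.grafana.net/api/prom/push
    basic_auth:
      username: "123456"      # GRAFANA_CLOUD_USER
      password: "your-key"    # GRAFANA_CLOUD_API_KEY
```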

Monitoring Dashboards

1. Provisioning Overview

  • Success rate gauge
  • Provisioning duration (P50, P95, P99)
  • Failed provisions by service
  • Active tenants by region

2. System Health

  • Neon API availability
  • Edge function error rate
  • Sentry event volume
  • Cost tracking metrics

Alerts

Critical Alerts (Page immediately)

  • Provisioning success rate < 90%
  • Neon API down > 5 minutes
  • Any SSL provisioning failures

Warning Alerts (Notify team)

  • Provisioning duration > 5 minutes (P95)
  • High retry rate
  • Approaching rate limits
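The "success rate < 90%" page could be expressed as a Prometheus alerting rule along these lines (a sketch built from the metrics defined above; the rule name, window, and `for` duration are assumptions):

```yaml
groups:
  - name: tenant-provisioning
    rules:
      - alert: ProvisioningSuccessRateLow
        expr: |
          sum(rate(tenant_provisioning_total{status="success"}[15m]))
            / sum(rate(tenant_provisioning_total[15m])) < 0.90
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: Tenant provisioning success rate below 90%
```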

Debugging Guide

1. Tenant Provisioning Failed

  1. Check Sentry for error details
  2. Search logs: organizationId:"xxx" level:error
  3. Check metrics: tenant_provisioning_total{status="failed"}
  4. Review provisioning_events table

2. Performance Issues

  1. Check P95 latency in Grafana
  2. Look for Neon API latency spikes
  3. Review transaction traces in Sentry
  4. Check for resource limits

3. Cost Spikes

  1. Review cardinality in Grafana (Billing → Usage)
  2. Check Sentry event volume
  3. Verify sampling is working
  4. Look for metric explosion

Cost Monitoring

Monthly Budget (estimated)

  • Sentry: $0 (free tier)
  • Grafana Cloud: $0-20 (free tier usually sufficient)
  • Total: < $20/month for most cases

When You'll Hit Limits

  • Sentry: >5K errors/month (add more filtering)
  • Grafana: >10K unique series (reduce cardinality)
  • Logs: >50GB/month (sample more aggressively)

Best Practices

  1. Always Sample Success: Full tracking only for errors
  2. Use Fixed Labels: Never use dynamic IDs as labels
  3. Aggregate Early: Sum/average before sending metrics
  4. Set Retention: 7 days for details, 30 days for summaries
  5. Monitor Costs: Set up billing alerts in all services

Emergency Procedures

High Cost Alert

  1. Immediately increase sampling rates
  2. Drop non-essential metrics
  3. Reduce scrape frequency
  4. Clear historical data if needed

Monitoring Outage

  1. Errors still go to Supabase logs
  2. Check Edge Function logs directly
  3. Use SQL queries for backup monitoring
  4. Fall back to manual checks

Future Optimizations

  1. OpenTelemetry Migration: When Deno support improves
  2. Custom Aggregator: Build tenant metrics aggregator
  3. ClickHouse: For long-term metric storage
  4. Adaptive Sampling: Increase during incidents