Source: ocean/docs/OPERATIONS_RUNBOOK.md

Operations Runbook

Quick Reference: Debug guide for common issues, deployment procedures, and emergency responses.

Common Issues & Solutions

Authentication Issues

"Database error saving new user"

Debug steps:

# 1. Check trigger function
psql postgresql://postgres:postgres@localhost:54322/postgres -c '\df handle_new_user'

# 2. Test user creation manually
pnpm run debug:user "test@example.com" "Test Org"

-- 3. Check provisioning events (SQL, run inside psql)
SELECT * FROM provisioning_events
WHERE organization_id = 'org-id'
ORDER BY created_at DESC;

# 4. Check Supabase logs
docker logs supabase_db_ocean --tail 100 | grep ERROR

JWT Token Invalid

# Check token expiry
jwt decode <token>

// Force refresh session (JavaScript, run in application code)
const { data, error } = await supabase.auth.refreshSession()

# Verify Supabase keys match
echo $VITE_SUPABASE_PUBLISHABLE_KEY
grep PUBLISHABLE .env.local
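
When the `jwt` CLI is not available, token expiry can also be checked with a few lines of TypeScript. The sketch below is a minimal illustration and not part of the Ocean codebase: `decodeJwtPayload` and `isExpired` are hypothetical helper names, and the code only decodes the payload. It does not verify the signature, so it is suitable for debugging only.

```typescript
// Hypothetical debugging helpers (not part of the Ocean codebase).
// They decode a JWT payload and check its `exp` claim WITHOUT verifying the signature.

interface JwtPayload {
  exp?: number; // expiry, in seconds since the Unix epoch (RFC 7519)
  sub?: string;
  [claim: string]: unknown;
}

function decodeJwtPayload(token: string): JwtPayload {
  const parts = token.split(".");
  if (parts.length !== 3) {
    throw new Error("Expected a JWT with three dot-separated segments");
  }
  // JWT segments are base64url encoded; convert to standard base64 and re-pad.
  const base64 = parts[1].replace(/-/g, "+").replace(/_/g, "/");
  const padded = base64.padEnd(Math.ceil(base64.length / 4) * 4, "=");
  // atob yields a binary string; decode it as UTF-8 so non-ASCII claims survive.
  const bytes = Uint8Array.from(atob(padded), (c) => c.charCodeAt(0));
  return JSON.parse(new TextDecoder().decode(bytes)) as JwtPayload;
}

function isExpired(payload: JwtPayload, nowMs: number = Date.now()): boolean {
  // A token without an `exp` claim is treated as not expired.
  return payload.exp !== undefined && payload.exp * 1000 <= nowMs;
}
```

If `isExpired(decodeJwtPayload(token))` returns `true`, refreshing the session as shown above is the likely fix.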

RLS Policy Blocking Access

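-- Check current user (NULL in a direct psql session until claims are set)
SELECT auth.uid();

-- Test RLS policies as a specific user
-- SET LOCAL only takes effect inside a transaction
BEGIN;
SET LOCAL role TO 'authenticated';
-- auth.uid() reads the "sub" claim from the request.jwt.claims JSON setting;
-- replace user-uuid with a real user UUID
SET LOCAL request.jwt.claims TO '{"sub": "user-uuid", "role": "authenticated"}';
SELECT * FROM organizations;
ROLLBACK;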

-- Debug specific policy
SELECT * FROM pg_policies WHERE tablename = 'organizations';

Database Issues

Migration Conflicts

# Fix duplicate timestamps
./scripts/fix-migration-names.sh

# Check migration status
supabase migration list

# Reset and reapply
supabase db reset --no-seed

Connection Pool Exhausted

-- Check active connections
SELECT count(*) FROM pg_stat_activity;

-- Kill idle connections
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND state_change < NOW() - INTERVAL '5 minutes';

Slow Queries

-- Enable query logging
ALTER DATABASE postgres SET log_statement = 'all';
ALTER DATABASE postgres SET log_duration = on;

-- Find slow queries
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

-- Explain query plan
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM organizations WHERE owner_id = 'uuid';

Stripe Integration Issues

Webhook Signature Verification Failed

# Verify webhook secret
echo $STRIPE_WEBHOOK_SECRET

# Test webhook locally
stripe listen --forward-to localhost:54321/functions/v1/handle-stripe-webhook

# Check webhook logs in Stripe dashboard
# https://dashboard.stripe.com/test/webhooks
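
To see why a signature check fails, it can help to reproduce the verification by hand. Stripe signs the string `<timestamp>.<raw request body>` with HMAC-SHA256 using the webhook secret and sends the result in the `Stripe-Signature` header as `t=...,v1=...`. The sketch below is for debugging only; `verifyStripeSignature` is a hypothetical helper, and production code should keep relying on the official library's `stripe.webhooks.constructEvent`.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical debugging helper (not part of the Ocean codebase).
function verifyStripeSignature(
  rawBody: string,
  signatureHeader: string,
  webhookSecret: string,
  toleranceSeconds: number = 300,
  nowSeconds: number = Math.floor(Date.now() / 1000),
): boolean {
  // Header format: "t=<unix timestamp>,v1=<hex signature>[,v1=<hex signature>...]"
  let timestamp = "";
  const v1Signatures: string[] = [];
  for (const item of signatureHeader.split(",")) {
    const [key, value] = item.trim().split("=", 2);
    if (key === "t" && value) timestamp = value;
    if (key === "v1" && value) v1Signatures.push(value);
  }
  if (!timestamp || v1Signatures.length === 0) return false;

  // Reject events outside the tolerance window to limit replay attacks.
  if (Math.abs(nowSeconds - Number(timestamp)) > toleranceSeconds) return false;

  // Recompute the expected signature over "<timestamp>.<raw body>".
  const expected = createHmac("sha256", webhookSecret)
    .update(`${timestamp}.${rawBody}`, "utf8")
    .digest();

  // Constant-time comparison against every v1 signature in the header.
  return v1Signatures.some((sig) => {
    const candidate = Buffer.from(sig, "hex");
    return candidate.length === expected.length && timingSafeEqual(candidate, expected);
  });
}
```

A frequent cause of mismatches is verifying a body that was parsed and re-serialized as JSON; the signature only matches the exact raw bytes Stripe sent. Another is using the dashboard endpoint secret while testing with `stripe listen`, which prints its own `whsec_...` secret.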

Subscription Creation Failed

// Debug Stripe API calls
import Stripe from 'stripe'

const stripe = new Stripe(process.env.STRIPE_SECRET_KEY, {
  apiVersion: '2024-11-20.acacia', // late-2024 API versions carry a release-name suffix
  maxNetworkRetries: 2,
  telemetry: false,
})

// Enable debug logging
stripe.on('request', (event) => {
  console.log('Stripe Request:', event)
})

Payment Method Attachment Failed

# Check customer exists
stripe customers retrieve cus_xxx

# List payment methods
stripe payment_methods list --customer cus_xxx

# Manually attach
stripe payment_methods attach pm_xxx --customer cus_xxx

Neon Database Issues

Tenant Database Connection Failed

Debug commands:

# Test connection string
psql "postgresql://user:pass@xxx.neon.tech/neondb?sslmode=require"

-- Check provisioning status (SQL, run inside psql)
SELECT * FROM organization_databases
WHERE organization_id = 'org-id';

# Verify Neon project exists
curl -H "Authorization: Bearer $NEON_API_KEY" \
https://console.neon.tech/api/v2/projects

Performance Issues

High API Response Times

# Check Edge Function logs
supabase functions logs graphql-v2 --limit 100

# Monitor cold starts
grep -c "cold start" logs.txt

# Check bundle size
pnpm run analyze:bundle

# Profile specific endpoints
curl -w "@curl-format.txt" -o /dev/null -s \
https://ocean.goldfish.io/api/graphql

Frontend Performance Degradation

# Run Lighthouse audit
npx lighthouse https://ocean.goldfish.io

# Check bundle size regression
pnpm run perf:check

# Analyze specific chunks
npx source-map-explorer dist/assets/*.js

Deployment Procedures

Standard Production Deploy

Pre-deployment checklist:

  • All tests passing locally
  • TypeScript compilation successful
  • No ESLint errors
  • Bundle size within limits
  • Migrations tested locally

Emergency Hotfix Deploy

# 1. Create hotfix branch
git checkout -b hotfix/critical-issue

# 2. Make fix and test
# ... make changes ...
pnpm run validate
pnpm test

# 3. Deploy directly to production
git push origin hotfix/critical-issue
# Create PR and merge with admin override

# 4. Monitor deployment
pnpm run health:check

Database Migration Deploy

# 1. Test migration locally
supabase migration new fix_issue
# ... write migration ...
supabase db reset --no-seed

# 2. Review migration
cat supabase/migrations/xxx_fix_issue.sql

# 3. Deploy to staging first
git push origin feature/migration

# 4. Apply to production
supabase migration up --project-ref prod-ref

Rollback Procedures

Application Rollback

# Interactive rollback with safety checks
pnpm run rollback:prod

# Manual rollback via Vercel
vercel ls ocean-platform --prod
vercel promote [deployment-id]

# Verify rollback
curl https://ocean.goldfish.io/health

Database Rollback

-- Create restore point before risky operations
BEGIN;
SAVEPOINT before_changes;

-- Make changes
ALTER TABLE organizations ADD COLUMN risky_field TEXT;

-- If issues arise, roll back
ROLLBACK TO SAVEPOINT before_changes;

-- If good, commit
COMMIT;

Stripe Configuration Rollback

# Stripe changes cannot be rolled back directly
# Revert them manually via the dashboard or API

# Revert subscription price
stripe subscriptions update sub_xxx \
--items[0][price]=old_price_id \
--proration_behavior=none

Monitoring & Alerts

Health Check Endpoints

# Frontend health
curl https://ocean.goldfish.io/health

# API health
curl https://ocean.goldfish.io/api/health

# GraphQL health
curl -X POST https://ocean.goldfish.io/api/graphql \
-H "Content-Type: application/json" \
-d '{"query": "{ system_status { healthy } }"}'
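
The three checks above can be combined into one script, which is convenient right after a deploy or rollback. This is a minimal sketch under stated assumptions: `checkHealth`, `HEALTH_TARGETS`, and the `FetchLike` type are hypothetical names, the URLs and GraphQL query are copied from the curl commands above, and only the HTTP status is inspected (the `healthy` field in the GraphQL response is not parsed).

```typescript
// Minimal subset of the Fetch API, so a fake implementation can be injected in tests.
interface RequestOptions {
  method?: string;
  headers?: Record<string, string>;
  body?: string;
}
type FetchLike = (
  url: string,
  init?: RequestOptions,
) => Promise<{ ok: boolean; status: number }>;

interface HealthTarget {
  name: string;
  url: string;
  init?: RequestOptions;
}

interface HealthResult {
  name: string;
  healthy: boolean;
  detail: string;
}

// Endpoints taken from the curl commands in this section.
const HEALTH_TARGETS: HealthTarget[] = [
  { name: "frontend", url: "https://ocean.goldfish.io/health" },
  { name: "api", url: "https://ocean.goldfish.io/api/health" },
  {
    name: "graphql",
    url: "https://ocean.goldfish.io/api/graphql",
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ query: "{ system_status { healthy } }" }),
    },
  },
];

// Checks all targets in parallel; network errors are reported, not thrown.
async function checkHealth(
  targets: HealthTarget[],
  fetchImpl: FetchLike,
): Promise<HealthResult[]> {
  return Promise.all(
    targets.map(async (target): Promise<HealthResult> => {
      try {
        const res = await fetchImpl(target.url, target.init);
        return { name: target.name, healthy: res.ok, detail: `HTTP ${res.status}` };
      } catch (err) {
        const detail = err instanceof Error ? err.message : String(err);
        return { name: target.name, healthy: false, detail };
      }
    }),
  );
}
```

On a runtime with a global `fetch` (for example Node 18 or later), calling `checkHealth(HEALTH_TARGETS, fetch)` resolves to one result per endpoint.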

Log Locations

Quick log access:

# Vercel logs
vercel logs --prod --follow

# Supabase Edge Function logs
supabase functions logs graphql-v2 --project-ref prod-ref

# Local database logs
docker logs supabase_db_ocean --follow

Alert Response Procedures

High Error Rate Alert

  1. Check Sentry for error spike details
  2. Identify affected endpoints/components
  3. Check recent deployments
  4. Roll back if deployment-related
  5. Apply hotfix if code issue
  6. Scale resources if load-related

Database Connection Alert

  1. Check Supabase status page
  2. Verify connection pool settings
  3. Check for long-running queries
  4. Kill idle connections
  5. Restart connection pool if needed
  6. Contact Supabase support if persistent

Payment Processing Alert

  1. Check Stripe dashboard for failures
  2. Verify webhook processing
  3. Check payment method issues
  4. Review recent Stripe changes
  5. Contact Stripe support if needed

Security Incident Response

Suspected API Key Leak

# 1. Rotate immediately
# Supabase: Dashboard > Settings > API
# Stripe: Dashboard > Developers > API keys

# 2. Update environment variables
vercel env pull
# Update keys
vercel env add SUPABASE_SERVICE_ROLE_KEY

# 3. Audit usage
# Check Supabase logs for unusual activity
# Check Stripe logs for unexpected charges

# 4. Update monitoring
pnpm run security:check

SQL Injection Attempt

-- Check for suspicious queries
SELECT query, calls
FROM pg_stat_statements
WHERE query LIKE '%UNION%'
OR query LIKE '%<script>%'
ORDER BY calls DESC;

-- Review RLS policies
SELECT * FROM pg_policies;

-- Ensure parameterized queries everywhere
-- Never use string concatenation for queries
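
The difference between concatenation and parameterization is easiest to see side by side. The sketch below builds a query in the `{ text, values }` shape that node-postgres accepts in `client.query(...)`; the function names are hypothetical, and the `id` and `name` columns on `organizations` are an assumption based on the GraphQL example elsewhere in this runbook.

```typescript
// Shape accepted by node-postgres: SQL text with $n placeholders plus a values array.
interface QueryConfig {
  text: string;
  values: unknown[];
}

// UNSAFE (shown only for contrast): user input becomes part of the SQL text.
function unsafeOrganizationsByOwner(ownerId: string): string {
  return "SELECT id, name FROM organizations WHERE owner_id = '" + ownerId + "'";
}

// SAFE: the SQL text is constant; user input travels separately as a bound value.
function organizationsByOwner(ownerId: string): QueryConfig {
  return {
    text: "SELECT id, name FROM organizations WHERE owner_id = $1",
    values: [ownerId],
  };
}
```

With the parameterized form, an input such as `' OR '1'='1` is sent to Postgres as a plain value and can never change the structure of the statement.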

DDoS Attack

Debugging Tools

Database Inspection

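# Connect to production (read-only)
psql "$PRODUCTION_DATABASE_URL"

-- Useful queries (run inside psql; comments sit on their own lines
-- because psql meta-commands treat trailing text as arguments)
-- List tables
\dt
-- Table details
\d+ organizations
-- List functions
\df
-- Check indexes
SELECT * FROM pg_indexes;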

API Testing

# GraphQL introspection
curl -X POST http://localhost:54321/functions/v1/graphql-v2 \
-H "Content-Type: application/json" \
-d '{"query": "{ __schema { types { name } } }"}'

# Test with authentication
curl -X POST http://localhost:54321/functions/v1/graphql-v2 \
-H "Authorization: Bearer $JWT_TOKEN" \
-H "Content-Type: application/json" \
-d '{"query": "{ my_organizations { id name } }"}'

Performance Profiling

# Bundle analysis
pnpm run analyze:bundle

-- Database query analysis (SQL)
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;

// Edge Function timing (TypeScript)
console.time('operation')
// ... code ...
console.timeEnd('operation')
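
For async code paths, a small wrapper avoids forgetting the matching `console.timeEnd` when an operation throws. This is a minimal sketch, not existing project code: `timed` is a hypothetical helper, and the optional `log` parameter exists only so the output can be captured.

```typescript
// Hypothetical helper: runs an async operation and logs its duration,
// even when the operation throws.
async function timed<T>(
  label: string,
  operation: () => Promise<T>,
  log: (message: string) => void = console.log,
): Promise<T> {
  const start = performance.now();
  try {
    return await operation();
  } finally {
    const elapsedMs = performance.now() - start;
    log(`${label}: ${elapsedMs.toFixed(1)} ms`);
  }
}
```

Inside an Edge Function this could wrap a single database call or downstream request, for example `await timed('load organizations', () => loadOrganizations())`, where `loadOrganizations` stands in for whatever operation is being profiled.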

Disaster Recovery

Full Database Corruption

  1. Stop all writes immediately

  2. Backup current state (even if corrupted)

  3. Restore from Supabase backups:

    # Contact Supabase support for point-in-time restore
    # Backups available: 7 days (Pro plan)

  4. Verify data integrity after restore

  5. Run test suite before reopening

Complete Service Outage

  1. Update status page immediately
  2. Check all dependencies:
    • Vercel status
    • Supabase status
    • Stripe status
    • Neon status
  3. Implement read-only mode if partial service possible
  4. Communicate with users via email/social
  5. Post-mortem after resolution

Critical Data Loss

  1. Identify scope of data loss
  2. Check all backup sources:
    • Supabase backups
    • Neon backups
    • Stripe data (source of truth for billing)
  3. Restore from most recent backup
  4. Reconcile any gaps manually
  5. Implement additional backup strategy
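
For the reconciliation step, a simple set difference shows which records exist in the source of truth but are missing after the restore. The sketch assumes the IDs (for example Stripe subscription IDs and the IDs present in the restored database) have already been exported into arrays; `findMissingIds` is a hypothetical helper, and how the IDs are fetched is out of scope here.

```typescript
// Hypothetical helper: returns the IDs present in the source of truth
// but absent from the restored data, de-duplicated and sorted.
function findMissingIds(sourceOfTruthIds: string[], restoredIds: string[]): string[] {
  const restored = new Set(restoredIds);
  const missing = new Set(sourceOfTruthIds.filter((id) => !restored.has(id)));
  return Array.from(missing).sort();
}
```

Each returned ID is a gap to reconcile manually, for instance by re-creating the corresponding billing record from Stripe data.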

Contact Information

Service Providers

  • Supabase Support: support.supabase.io
  • Vercel Support: vercel.com/support
  • Stripe Support: support.stripe.com
  • Neon Support: neon.tech/support

Internal Escalation

Since this is a two-person operation:

  1. Check documented solutions first
  2. Search error in GitHub issues
  3. Check service provider status pages
  4. Contact service provider support
  5. Post in relevant Discord/Slack communities