Source:
ocean/docs/adr/0048-logical-replication-implementation-review.md| ✏️ Edit on GitHub
ADR-0048: Logical Replication Implementation Review and Improvements
Date: 2025-01-02
Status
Accepted
Context
We have implemented PostgreSQL logical replication to synchronize data from the Ebisu master database (Crunchy Bridge) to tenant databases (Neon) as described in ADR-0047. A comprehensive review of the implementation has identified several critical issues that need to be addressed before production deployment.
Decision
We will proceed with the current implementation as a foundation but must address the following critical issues before production use:
Critical Issues to Fix
-
Security Hardening
- Use PostgreSQL password files or environment variables instead of embedding passwords in connection strings
- Add validation for all vault encryption/decryption operations
- Implement proper credential rotation mechanisms
-
Error Handling and Validation
- Add comprehensive input validation (UUID formats, action types)
- Implement proper error handling for all external service calls
- Add connection string format validation
-
Concurrency Control
- Implement distributed locking for tenant provisioning operations
- Add idempotency checks to prevent duplicate subscriptions
- Handle race conditions in concurrent edge function executions
-
Monitoring and Observability
- Add replication slot monitoring
- Implement structured logging with correlation IDs
- Export metrics for monitoring systems
- Track WAL lag in bytes, not just time
Acceptable Limitations for MVP
-
Manual Network Configuration
- Firewall rules must be configured manually in Crunchy Bridge and Neon dashboards
- This is acceptable as it's a one-time setup task
-
Sequential Monitoring
- The monitor checks tenants sequentially, which may be slow at scale
- Acceptable for MVP with <100 tenants, will need optimization later
-
Basic Recovery
- Manual intervention required for failed replications
- Acceptable for MVP with proper alerting in place
Implementation Priority
Phase 1: Security and Reliability (Immediate)
// Example: Secure connection string handling
async function createSubscription(
db: DrizzleDb,
tenantId: string,
config: ReplicationConfig
): Promise<void> {
// Validate inputs
if (!isValidUUID(tenantId)) {
throw new ValidationError('Invalid tenant ID format')
}
// Use environment variable for password
const pgPassword = Deno.env.get('EBISU_REPLICATOR_PASSWORD')
if (!pgPassword) {
throw new ConfigurationError('Replicator password not configured')
}
// Create subscription with proper error handling
try {
await db.execute(sql`
CREATE SUBSCRIPTION ${sql.identifier(subscriptionName)}
CONNECTION ${sql`'host=${config.host} port=${config.port} dbname=${config.database} user=replicator'`}
PUBLICATION ebisu_master_data
WITH (password_required = false)
`)
} catch (error) {
if (error.code === '42710') {
// duplicate_object
logger.info('Subscription already exists', { tenantId })
return
}
throw new ReplicationError('Failed to create subscription', { cause: error })
}
}
Phase 2: Monitoring Enhancements (Next Sprint)
// Example: Enhanced monitoring
interface ReplicationMetrics {
tenantId: string
lagBytes: bigint
lagTime: number
slotSize: bigint
state: 'streaming' | 'catchup' | 'stopped'
}
async function collectReplicationMetrics(): Promise<ReplicationMetrics[]> {
// Implementation with proper metrics collection
}
Phase 3: Automation and Scaling (Future)
- Implement automatic retry with exponential backoff
- Add support for parallel monitoring
- Create schema evolution tools
- Implement zero-downtime credential rotation
Consequences
Positive
- Current implementation provides a solid foundation
- Core functionality works correctly when properly configured
- Monitoring and alerting are in place
- Clear path for improvements identified
Negative
- Cannot be deployed to production without Phase 1 security fixes
- Manual network configuration adds operational overhead
- Limited scalability without Phase 3 improvements
Risks
- Data Loss Risk: Low - logical replication is reliable once established
- Security Risk: High until Phase 1 is complete - credentials exposed in logs
- Operational Risk: Medium - manual intervention required for failures
Action Items
-
Immediate (Before any production use):
- Implement secure credential handling
- Add comprehensive input validation
- Add distributed locking for provisioning
- Update documentation with security considerations
-
Short-term (Within 2 weeks):
- Enhance monitoring with slot size tracking
- Add structured logging
- Implement basic retry logic
- Create runbook for common issues
-
Medium-term (Within 1 month):
- Implement automated recovery mechanisms
- Add performance optimizations
- Create integration test suite
- Implement credential rotation
Review Notes
The implementation successfully demonstrates the logical replication pattern but requires security hardening before production use. The architecture is sound, and the identified issues are fixable without major restructuring.
Decision Makers
- ryan-taylor
- Engineering Team