Source:
ebisu/docs/adr/0058-rfmo-data-integrity-findings.md
ADR-058: RFMO Data Integrity Findings and Resolution
Status
Investigated and Resolved
Context
Investigation revealed massive data loss in the RFMO import pipeline:
- Overall: 33.7% data loss (34,905 cleaned → 23,132 imported)
- ICCAT: 59.7% loss (14,619 → 5,884)
- NEAFC: 68.8% loss (2,236 → 698)
- WCPFC: 15.9% loss (3,436 → 2,891)
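The loss percentages above follow from the cleaned and imported counts. A minimal sketch of that arithmetic, assuming loss is defined as the share of cleaned records that never reached the import target (the function name `loss_pct` is illustrative and not part of the pipeline):

```python
def loss_pct(cleaned: int, imported: int) -> float:
    """Share of cleaned records that did not reach the import target, in percent."""
    if cleaned <= 0:
        raise ValueError("cleaned count must be positive")
    return 100.0 * (cleaned - imported) / cleaned


# (cleaned, imported) counts as reported in the Context section.
counts = {
    "Overall": (34_905, 23_132),
    "ICCAT": (14_619, 5_884),
    "NEAFC": (2_236, 698),
    "WCPFC": (3_436, 2_891),
}

for source, (cleaned, imported) in counts.items():
    lost = cleaned - imported
    print(f"{source}: {lost:,} records lost ({loss_pct(cleaned, imported):.2f}%)")
```

Running a check like this after every import makes silent loss visible instead of relying on the import "completing" as a success signal.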
Root Cause Analysis
- Premature Vessel Matching: System attempted to match vessels during individual RFMO imports
- Name Collision Issues: Multiple vessels with same name incorrectly merged (e.g., 559 ICCAT vessels had name conflicts)
- Import Hangs: Batch processing failed when encountering duplicate vessel_uuid values
- Wrong Mental Model: System designed as traditional MDM (master data management) instead of an intelligence platform
Decision
Implemented data integrity checking and discovered the need for staged intelligence import (see ADR-056).
Key Findings
ICCAT Analysis
- 14,617 records but only 6,578 distinct vessel_uuid values
- Same vessel name mapped to up to 45 different ICCAT serial numbers
- Vessels like "Salwa" had different owners but same vessel_uuid
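The ICCAT findings above are the kind of result a grouping check produces. The following is a rough sketch only, not the contents of the actual integrity script. Apart from `vessel_uuid`, the field names (`vessel_name`, `serial`, `owner`) and all sample values are assumptions made for illustration.

```python
from collections import defaultdict


def integrity_report(records):
    """Summarise uuid reuse and name collisions in a list of vessel records."""
    serials_by_name = defaultdict(set)
    owners_by_uuid = defaultdict(set)
    for rec in records:
        serials_by_name[rec["vessel_name"]].add(rec["serial"])
        owners_by_uuid[rec["vessel_uuid"]].add(rec["owner"])
    return {
        "records": len(records),
        "distinct_uuids": len(owners_by_uuid),
        # Names that map to more than one serial number, with the count.
        "names_with_multiple_serials": {
            name: len(serials)
            for name, serials in serials_by_name.items()
            if len(serials) > 1
        },
        # A uuid shared by records with different owners suggests a bad merge.
        "uuids_with_multiple_owners": sorted(
            uuid for uuid, owners in owners_by_uuid.items() if len(owners) > 1
        ),
    }


# Hypothetical sample rows; serials and owners are made up.
sample = [
    {"vessel_uuid": "u-1", "vessel_name": "Salwa", "serial": "SERIAL-1", "owner": "Owner A"},
    {"vessel_uuid": "u-1", "vessel_name": "Salwa", "serial": "SERIAL-2", "owner": "Owner B"},
    {"vessel_uuid": "u-2", "vessel_name": "Other", "serial": "SERIAL-3", "owner": "Owner C"},
]

report = integrity_report(sample)
print(report)
```

A large gap between `records` and `distinct_uuids`, together with a non-empty `uuids_with_multiple_owners`, is the pattern described for ICCAT: distinct vessels collapsed onto one identifier.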
Cross-RFMO Issues
- Import order dependency (first RFMO becomes "truth")
- No cross-source validation before matching
- Intelligence value lost through premature deduplication
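The import order dependency can be illustrated with a toy example. This is a sketch under stated assumptions: it supposes that matching keyed on vessel name alone and that the first imported record won, which simplifies whatever the real matcher did. The functions `match_on_import` and `stage_all`, the `flag` field, and the sample records are all hypothetical.

```python
def match_on_import(batches):
    """Merge by vessel name while importing; the first source seen becomes 'truth'."""
    vessels = {}
    for source, records in batches:
        for rec in records:
            # setdefault keeps the first record for a name and drops later ones.
            vessels.setdefault(rec["vessel_name"], {**rec, "source": source})
    return vessels


def stage_all(batches):
    """Keep every record, tagged with its source; matching is deferred."""
    return [
        {**rec, "source": source}
        for source, records in batches
        for rec in records
    ]


# Hypothetical records: three reports that share one vessel name.
iccat = [
    {"vessel_name": "EXAMPLE VESSEL", "flag": "AAA"},
    {"vessel_name": "EXAMPLE VESSEL", "flag": "BBB"},
]
wcpfc = [{"vessel_name": "EXAMPLE VESSEL", "flag": "CCC"}]

iccat_first = match_on_import([("ICCAT", iccat), ("WCPFC", wcpfc)])
wcpfc_first = match_on_import([("WCPFC", wcpfc), ("ICCAT", iccat)])
staged = stage_all([("ICCAT", iccat), ("WCPFC", wcpfc)])

print(iccat_first["EXAMPLE VESSEL"]["source"])  # source that won when ICCAT ran first
print(wcpfc_first["EXAMPLE VESSEL"]["source"])  # source that won when WCPFC ran first
print(len(iccat_first), len(staged))            # merged vessel count vs. staged record count
```

With matching at import time, the surviving record changes with import order and two of the three reports disappear. With staging, all three remain available for later cross-source validation.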
Implementation
- Created /scripts/check_rfmo_data_integrity.sh for comprehensive pipeline analysis
- Discovered staged approach preserves 100% of data
- Proved hypothesis that system was losing critical intelligence
Metrics
Before (Traditional Matching):
- ICCAT: 5,884 vessels imported (59.7% loss)
- Data quality: Unknown due to lost records
After (Staged Intelligence):
- ICCAT: 14,617 records preserved (0% loss)
- Data quality: 80.8% completeness score
- Ready for intelligent cross-source analysis
Lessons Learned
- Always measure data pipeline loss - assumptions about "working" imports were wrong
- Question the approach when seeing high failure rates
- Intelligence requires different thinking than traditional data management