Source: ebisu/docs/adr/0058-rfmo-data-integrity-findings.md

ADR-058: RFMO Data Integrity Findings and Resolution

Status

Investigated and Resolved

Context

An investigation of the RFMO import pipeline found that roughly a third of cleaned records never made it through import:

  • Overall: 33.7% data loss (34,905 cleaned → 23,132 imported)
  • ICCAT: 59.7% loss (14,619 → 5,884)
  • NEAFC: 68.8% loss (2,236 → 698)
  • WCPFC: 15.9% loss (3,436 → 2,891)
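
The loss figures above follow from a simple ratio, (cleaned - imported) / cleaned. A minimal sketch of that arithmetic, using only the counts listed above; the function and variable names are illustrative and are not taken from the project's scripts:

```python
def loss_rate(cleaned: int, imported: int) -> float:
    """Fraction of cleaned records that did not survive the import."""
    if cleaned <= 0:
        raise ValueError("cleaned must be a positive record count")
    return (cleaned - imported) / cleaned


# (cleaned, imported) counts copied from the list above.
COUNTS = {
    "Overall": (34905, 23132),
    "ICCAT": (14619, 5884),
    "NEAFC": (2236, 698),
    "WCPFC": (3436, 2891),
}

# Loss expressed as a percentage per source.
LOSS_PERCENT = {
    name: 100 * loss_rate(cleaned, imported)
    for name, (cleaned, imported) in COUNTS.items()
}

for name, pct in LOSS_PERCENT.items():
    print(f"{name}: {pct:.2f}% loss")
```

Computed this way, each value agrees with the listed percentage to within 0.1 percentage point.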

Root Cause Analysis

  1. Premature Vessel Matching: The system attempted to match vessels during each individual RFMO import
  2. Name Collision Issues: Distinct vessels that shared a name were incorrectly merged (e.g., 559 ICCAT vessels had name conflicts)
  3. Import Hangs: Batch processing failed when it encountered duplicate vessel_uuid values
  4. Wrong Mental Model: The system was designed as a traditional MDM (master data management) system rather than as an intelligence platform
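
To make the name-collision failure concrete, here is a minimal sketch of how matching on vessel name alone can collapse distinct vessels. It is not the project's matcher: the `match_by_name` function, the record fields, and the sample rows (serial numbers and owners included) are hypothetical and chosen only for illustration.

```python
import uuid


def match_by_name(records):
    """Naive matcher: records with the same normalized name share one vessel_uuid."""
    uuid_by_name = {}
    matched = []
    for rec in records:
        key = rec["name"].strip().upper()
        if key not in uuid_by_name:
            uuid_by_name[key] = str(uuid.uuid4())
        matched.append({**rec, "vessel_uuid": uuid_by_name[key]})
    return matched


# Hypothetical rows: three differently owned vessels that happen to share a name.
SAMPLE = [
    {"name": "Salwa", "serial": "SERIAL-1", "owner": "Owner A"},
    {"name": "SALWA", "serial": "SERIAL-2", "owner": "Owner B"},
    {"name": "Salwa ", "serial": "SERIAL-3", "owner": "Owner C"},
    {"name": "Other", "serial": "SERIAL-4", "owner": "Owner D"},
]

matched = match_by_name(SAMPLE)

# An import keyed on vessel_uuid keeps one row per uuid, so rows are silently dropped.
imported = {rec["vessel_uuid"]: rec for rec in matched}

print(f"{len(matched)} records in, {len(imported)} vessels out")
```

In this toy input, four source records become two vessels, which is the same shape of loss as the ICCAT numbers, at a much smaller scale.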

Decision

We implemented data integrity checks on the import pipeline; the results showed the need for a staged intelligence import (see ADR-056).

Key Findings

ICCAT Analysis

  • 14,617 records but only 6,578 distinct vessel_uuid values
  • A single vessel name mapped to as many as 45 different ICCAT serial numbers
  • Vessels named "Salwa", for example, had different owners but shared one vessel_uuid
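
Checks of this kind can be expressed as a few grouped counts. The sketch below assumes each record is a dict with `name`, `serial`, `owner`, and `vessel_uuid` keys; those field names, the `integrity_summary` helper, and the sample rows are assumptions for illustration, not the actual schema or script.

```python
from collections import defaultdict


def integrity_summary(records):
    """Count records, distinct vessel_uuids, and the collision patterns listed above."""
    serials_by_name = defaultdict(set)
    owners_by_uuid = defaultdict(set)
    for rec in records:
        serials_by_name[rec["name"]].add(rec["serial"])
        owners_by_uuid[rec["vessel_uuid"]].add(rec["owner"])
    return {
        "records": len(records),
        "distinct_vessel_uuids": len(owners_by_uuid),
        # Largest number of serial numbers attached to any one name.
        "max_serials_per_name": max(
            (len(serials) for serials in serials_by_name.values()), default=0
        ),
        # vessel_uuids that carry more than one owner, a sign of a bad merge.
        "uuids_with_multiple_owners": sorted(
            u for u, owners in owners_by_uuid.items() if len(owners) > 1
        ),
    }


# Hypothetical rows for illustration only.
ROWS = [
    {"name": "SALWA", "serial": "SERIAL-1", "owner": "Owner A", "vessel_uuid": "uuid-1"},
    {"name": "SALWA", "serial": "SERIAL-2", "owner": "Owner B", "vessel_uuid": "uuid-1"},
    {"name": "OTHER", "serial": "SERIAL-3", "owner": "Owner C", "vessel_uuid": "uuid-2"},
]

summary = integrity_summary(ROWS)
print(summary)
```

A large gap between `records` and `distinct_vessel_uuids`, together with any entries in `uuids_with_multiple_owners`, would point to the kind of over-merging described above.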

Cross-RFMO Issues

  • Import order dependency: whichever RFMO is imported first becomes the "truth"
  • No cross-source validation before matching
  • Intelligence value lost through premature deduplication
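
The import-order dependency can be shown with a small "first record wins" merge. This is a sketch under stated assumptions, not the pipeline's merge logic: `first_wins_merge`, the `flag` field, and the sample values are hypothetical.

```python
def first_wins_merge(sources):
    """Merge sources in order; the first record seen for a name is kept as the truth."""
    truth = {}
    for source_name, records in sources:
        for rec in records:
            # setdefault keeps the existing entry, so later sources cannot correct it.
            truth.setdefault(rec["name"], {**rec, "source": source_name})
    return truth


# Hypothetical conflicting reports about the same vessel name.
ICCAT_ROWS = [{"name": "SALWA", "flag": "AAA"}]
WCPFC_ROWS = [{"name": "SALWA", "flag": "BBB"}]

iccat_first = first_wins_merge([("ICCAT", ICCAT_ROWS), ("WCPFC", WCPFC_ROWS)])
wcpfc_first = first_wins_merge([("WCPFC", WCPFC_ROWS), ("ICCAT", ICCAT_ROWS)])

print(iccat_first["SALWA"]["flag"], wcpfc_first["SALWA"]["flag"])
```

The same two inputs give different "truths" depending only on import order, and the disagreement between sources, which is itself useful intelligence, is discarded either way.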

Implementation

  1. Created /scripts/check_rfmo_data_integrity.sh for comprehensive pipeline analysis
  2. Found that the staged approach preserves 100% of the data
  3. Confirmed the hypothesis that the system was losing critical intelligence
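
The staged design itself is specified in ADR-056 and is not reproduced here. As a generic illustration of why staging avoids loss, the sketch below appends every source record unchanged, tagged with its origin, and then verifies that staged counts equal input counts. All names (`stage_records`, `verify_no_loss`, the staging row layout) are hypothetical.

```python
from collections import Counter


def stage_records(source, records, staging):
    """Append every record as-is, tagged with its source; no matching, no deduplication."""
    for row_number, rec in enumerate(records, start=1):
        staging.append({"source": source, "row_number": row_number, "raw": dict(rec)})
    return len(records)


def verify_no_loss(expected_by_source, staging):
    """Return, per source, whether the staged row count equals the input row count."""
    staged = Counter(row["source"] for row in staging)
    return {
        source: staged.get(source, 0) == expected
        for source, expected in expected_by_source.items()
    }


# Hypothetical input: duplicate names are kept because nothing is merged at this stage.
staging = []
expected = {
    "ICCAT": stage_records("ICCAT", [{"name": "SALWA"}, {"name": "SALWA"}], staging),
    "NEAFC": stage_records("NEAFC", [{"name": "OTHER"}], staging),
}

check = verify_no_loss(expected, staging)
print(check)
```

Matching then becomes a later, separate step that reads from staging, so a bad matching rule can be rerun without having destroyed any source records.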

Metrics

Before (Traditional Matching):

  • ICCAT: 5,884 vessels imported (59.7% loss)
  • Data quality: Unknown due to lost records

After (Staged Intelligence):

  • ICCAT: 14,617 records preserved (0% loss)
  • Data quality: 80.8% completeness score
  • Ready for intelligent cross-source analysis
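
This record does not define how the completeness score is computed. One common definition, assumed here purely for illustration, is the share of expected field slots that hold a non-empty value; the `completeness_score` function and the field names below are hypothetical.

```python
def completeness_score(records, expected_fields):
    """Share of expected field slots holding a non-empty value, from 0.0 to 1.0."""
    total = len(records) * len(expected_fields)
    if total == 0:
        return 0.0
    filled = sum(
        1
        for rec in records
        for field in expected_fields
        if rec.get(field) not in (None, "")
    )
    return filled / total


# Hypothetical records: each one is missing one of three expected fields.
EXAMPLE = [
    {"name": "A", "imo": "", "flag": "X"},
    {"name": "B", "imo": "1234567", "flag": None},
]

score = completeness_score(EXAMPLE, ["name", "imo", "flag"])
print(f"{100 * score:.1f}% complete")
```

A score like this can only be computed meaningfully once all records are retained, which is consistent with data quality being listed as unknown in the "Before" case.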

Lessons Learned

  1. Always measure data pipeline loss: the assumption that imports were "working" was wrong
  2. Question the approach itself when failure rates are high
  3. Intelligence work requires a different way of thinking than traditional data management