Source:
ebisu/docs/adr/0001-complete-vessel-import-scripts.md| ✏️ Edit on GitHub
ADR-0001: Complete Vessel Import Scripts for RFMO Data Sources
Status
Proposed
Context
The Ebisu database is designed to integrate vessel data from 40+ sources to create comprehensive vessel profiles. Currently, only 3 of 13 RFMO vessel importers are implemented (NPFC, NEAFC, SPRFMO), leaving significant gaps in vessel coverage. The vessel identity resolution system is complex, requiring careful matching across sources to avoid duplicates while preserving source attribution.
Current State
- Implemented: 3/13 RFMO importers (NPFC, NEAFC, SPRFMO)
- Data Available: 7 RFMOs have CSV/XLSX files ready
- Missing Data: 3 RFMOs (GFCM, SEAFO PDF only, CCAMLR)
- Other Categories: BADDIE, COUNTRY, COUNTRY_EU, CIVIL_SOCIETY, INTERGOV sources pending
Vessel Matching Complexity
The system uses hierarchical matching:
- IMO (7-digit) - strongest identifier
- IRCS (International Radio Call Sign)
- MMSI (9-digit Maritime Mobile Service Identity)
- Vessel Name + Flag - weakest identifier
Decision
We will complete the vessel import scripts following these principles
-
Use NPFC as the Template
- Most comprehensive implementation (33 columns)
- Well-tested matching logic
- Clear separation of cleaning/loading
-
Implement in Priority Order
- Phase 1: Major tuna RFMOs (ICCAT, IOTC, WCPFC, IATTC)
- Phase 2: Regional RFMOs (CCSBT, FFA, NAFO)
- Phase 3: Data extraction needed (GFCM, SEAFO)
- Future: Other vessel categories per Emily's timeline
-
Standardize Vessel Matching
- Create shared vessel matching functions
- Implement consistent duplicate detection
- Add confidence scoring for matches
- Log matching decisions for audit
-
Maintain Data Integrity
- Preserve all source data
- Track conflicts without resolution
- Enable source attribution queries
- Support temporal analysis
Implementation Approach
1. Create Shared Infrastructure
-- Standardized vessel matching function
CREATE OR REPLACE FUNCTION find_or_create_vessel(
p_vessel_name TEXT,
p_imo TEXT,
p_ircs TEXT,
p_mmsi TEXT,
p_flag_uuid UUID,
p_source_id INTEGER
) RETURNS UUID AS $$
-- Implement hierarchical matching logic
$$ LANGUAGE plpgsql;
2. Script Structure for Each RFMO
scripts/import/vessels/
├── clean_[rfmo]_vessels.sh # CSV cleaning and standardization
├── setup_[rfmo]_loading.sql # RFMO-specific mappings
└── load_[rfmo]_vessels.sh # Database import logic
3. Data Quality Tracking
- Log match confidence scores
- Track validation warnings
- Monitor duplicate rates
- Report unmapped values
Consequences
Positive
- Complete vessel coverage from major RFMOs
- Consistent vessel identity resolution
- Improved data quality tracking
- Reusable import infrastructure
- Better AI/ML readiness through standardization
Negative
- Time investment for 10+ importers
- Risk of introducing duplicates if matching fails
- Maintenance burden for multiple scripts
- Need to handle evolving data formats
Risks & Mitigation
-
Duplicate Vessels
- Mitigation: Implement pre-import duplicate analysis
- Add vessel similarity scoring
- Create duplicate merge capabilities
-
Data Conflicts
- Mitigation: Enhanced conflict tracking
- Source quality scoring
- Temporal precedence rules
-
Schema Evolution
- Mitigation: Flexible JSONB attributes
- Versioned import scripts
- Schema migration strategy
Alternatives Considered
-
Wait for Emily's Implementation
- Rejected: Need complete vessel data for AI readiness
-
Generic CSV Importer
- Rejected: Each RFMO has unique formats and mappings
-
Manual Data Entry
- Rejected: Not scalable for 40+ sources
Implementation Timeline
- Week 1: Shared infrastructure and ICCAT
- Week 2: IOTC, WCPFC, IATTC
- Week 3: CCSBT, FFA, NAFO
- Week 4: Testing and quality assurance
Success Metrics
- All RFMOs with data files have working importers
- < 5% duplicate rate across sources
- 95%+ vessel matching accuracy
- Complete audit trail for identity resolution
References
- Existing implementations: NPFC, NEAFC, SPRFMO scripts
- Database schema:
/migrations/0000_schema.sql - Vessel tables documentation:
/docs_general/database-schema.md