Source: ebisu/docs/adr/0001-complete-vessel-import-scripts.md | ✏️ Edit on GitHub

ADR-0001: Complete Vessel Import Scripts for RFMO Data Sources

Status

Proposed

Context

The Ebisu database is designed to integrate vessel data from 40+ sources to create comprehensive vessel profiles. Currently, only 3 of 13 RFMO vessel importers are implemented (NPFC, NEAFC, SPRFMO), leaving significant gaps in vessel coverage. The vessel identity resolution system is complex, requiring careful matching across sources to avoid duplicates while preserving source attribution.

Current State

Implemented: 3/13 RFMO importers (NPFC, NEAFC, SPRFMO)
Data Available: 7 RFMOs have CSV/XLSX files ready
Missing Data: 3 RFMOs (GFCM, SEAFO PDF only, CCAMLR)
Other Categories: BADDIE, COUNTRY, COUNTRY_EU, CIVIL_SOCIETY, INTERGOV sources pending

Vessel Matching Complexity

The system uses hierarchical matching:

IMO (7-digit) - strongest identifier
IRCS (International Radio Call Sign)
MMSI (9-digit Maritime Mobile Service Identity)
Vessel Name + Flag - weakest identifier

Decision

We will complete the vessel import scripts following these principles

Use NPFC as the Template
- Most comprehensive implementation (33 columns)
- Well-tested matching logic
- Clear separation of cleaning/loading
Implement in Priority Order
- Phase 1: Major tuna RFMOs (ICCAT, IOTC, WCPFC, IATTC)
- Phase 2: Regional RFMOs (CCSBT, FFA, NAFO)
- Phase 3: Data extraction needed (GFCM, SEAFO)
- Future: Other vessel categories per Emily's timeline
Standardize Vessel Matching
- Create shared vessel matching functions
- Implement consistent duplicate detection
- Add confidence scoring for matches
- Log matching decisions for audit
Maintain Data Integrity
- Preserve all source data
- Track conflicts without resolution
- Enable source attribution queries
- Support temporal analysis

Implementation Approach

1. Create Shared Infrastructure

-- Standardized vessel matching function
CREATE OR REPLACE FUNCTION find_or_create_vessel(
    p_vessel_name TEXT,
    p_imo TEXT,
    p_ircs TEXT,
    p_mmsi TEXT,
    p_flag_uuid UUID,
    p_source_id INTEGER
) RETURNS UUID AS $$
    -- Implement hierarchical matching logic
$$ LANGUAGE plpgsql;

2. Script Structure for Each RFMO

scripts/import/vessels/
├── clean_[rfmo]_vessels.sh      # CSV cleaning and standardization
├── setup_[rfmo]_loading.sql     # RFMO-specific mappings
└── load_[rfmo]_vessels.sh       # Database import logic

3. Data Quality Tracking

Log match confidence scores
Track validation warnings
Monitor duplicate rates
Report unmapped values

Consequences

Positive

Complete vessel coverage from major RFMOs
Consistent vessel identity resolution
Improved data quality tracking
Reusable import infrastructure
Better AI/ML readiness through standardization

Negative

Time investment for 10+ importers
Risk of introducing duplicates if matching fails
Maintenance burden for multiple scripts
Need to handle evolving data formats

Risks & Mitigation

Duplicate Vessels
- Mitigation: Implement pre-import duplicate analysis
- Add vessel similarity scoring
- Create duplicate merge capabilities
Data Conflicts
- Mitigation: Enhanced conflict tracking
- Source quality scoring
- Temporal precedence rules
Schema Evolution
- Mitigation: Flexible JSONB attributes
- Versioned import scripts
- Schema migration strategy

Alternatives Considered

Wait for Emily's Implementation
- Rejected: Need complete vessel data for AI readiness
Generic CSV Importer
- Rejected: Each RFMO has unique formats and mappings
Manual Data Entry
- Rejected: Not scalable for 40+ sources

Implementation Timeline

Week 1: Shared infrastructure and ICCAT
Week 2: IOTC, WCPFC, IATTC
Week 3: CCSBT, FFA, NAFO
Week 4: Testing and quality assurance

Success Metrics

All RFMOs with data files have working importers
< 5% duplicate rate across sources
95%+ vessel matching accuracy
Complete audit trail for identity resolution

References

Existing implementations: NPFC, NEAFC, SPRFMO scripts
Database schema: /migrations/0000_schema.sql
Vessel tables documentation: /docs_general/database-schema.md

Status​

Context​

Current State​

Vessel Matching Complexity​

Decision​

We will complete the vessel import scripts following these principles​

Implementation Approach​

1. Create Shared Infrastructure​

2. Script Structure for Each RFMO​

3. Data Quality Tracking​

Consequences​

Positive​

Negative​

Risks & Mitigation​

Alternatives Considered​

Implementation Timeline​

Success Metrics​

References​