Skip to main content

Source: ebisu/docs/adr/0001-complete-vessel-import-scripts.md | ✏️ Edit on GitHub

ADR-0001: Complete Vessel Import Scripts for RFMO Data Sources

Status

Proposed

Context

The Ebisu database is designed to integrate vessel data from 40+ sources to create comprehensive vessel profiles. Currently, only 3 of 13 RFMO vessel importers are implemented (NPFC, NEAFC, SPRFMO), leaving significant gaps in vessel coverage. The vessel identity resolution system is complex, requiring careful matching across sources to avoid duplicates while preserving source attribution.

Current State

  • Implemented: 3/13 RFMO importers (NPFC, NEAFC, SPRFMO)
  • Data Available: 7 RFMOs have CSV/XLSX files ready
  • Missing Data: 3 RFMOs (GFCM, SEAFO PDF only, CCAMLR)
  • Other Categories: BADDIE, COUNTRY, COUNTRY_EU, CIVIL_SOCIETY, INTERGOV sources pending

Vessel Matching Complexity

The system uses hierarchical matching:

  1. IMO (7-digit) - strongest identifier
  2. IRCS (International Radio Call Sign)
  3. MMSI (9-digit Maritime Mobile Service Identity)
  4. Vessel Name + Flag - weakest identifier

Decision

We will complete the vessel import scripts following these principles

  1. Use NPFC as the Template

    • Most comprehensive implementation (33 columns)
    • Well-tested matching logic
    • Clear separation of cleaning/loading
  2. Implement in Priority Order

    • Phase 1: Major tuna RFMOs (ICCAT, IOTC, WCPFC, IATTC)
    • Phase 2: Regional RFMOs (CCSBT, FFA, NAFO)
    • Phase 3: Data extraction needed (GFCM, SEAFO)
    • Future: Other vessel categories per Emily's timeline
  3. Standardize Vessel Matching

    • Create shared vessel matching functions
    • Implement consistent duplicate detection
    • Add confidence scoring for matches
    • Log matching decisions for audit
  4. Maintain Data Integrity

    • Preserve all source data
    • Track conflicts without resolution
    • Enable source attribution queries
    • Support temporal analysis

Implementation Approach

1. Create Shared Infrastructure

-- Standardized vessel matching function
CREATE OR REPLACE FUNCTION find_or_create_vessel(
p_vessel_name TEXT,
p_imo TEXT,
p_ircs TEXT,
p_mmsi TEXT,
p_flag_uuid UUID,
p_source_id INTEGER
) RETURNS UUID AS $$
-- Implement hierarchical matching logic
$$ LANGUAGE plpgsql;

2. Script Structure for Each RFMO

scripts/import/vessels/
├── clean_[rfmo]_vessels.sh # CSV cleaning and standardization
├── setup_[rfmo]_loading.sql # RFMO-specific mappings
└── load_[rfmo]_vessels.sh # Database import logic

3. Data Quality Tracking

  • Log match confidence scores
  • Track validation warnings
  • Monitor duplicate rates
  • Report unmapped values

Consequences

Positive

  • Complete vessel coverage from major RFMOs
  • Consistent vessel identity resolution
  • Improved data quality tracking
  • Reusable import infrastructure
  • Better AI/ML readiness through standardization

Negative

  • Time investment for 10+ importers
  • Risk of introducing duplicates if matching fails
  • Maintenance burden for multiple scripts
  • Need to handle evolving data formats

Risks & Mitigation

  1. Duplicate Vessels

    • Mitigation: Implement pre-import duplicate analysis
    • Add vessel similarity scoring
    • Create duplicate merge capabilities
  2. Data Conflicts

    • Mitigation: Enhanced conflict tracking
    • Source quality scoring
    • Temporal precedence rules
  3. Schema Evolution

    • Mitigation: Flexible JSONB attributes
    • Versioned import scripts
    • Schema migration strategy

Alternatives Considered

  1. Wait for Emily's Implementation

    • Rejected: Need complete vessel data for AI readiness
  2. Generic CSV Importer

    • Rejected: Each RFMO has unique formats and mappings
  3. Manual Data Entry

    • Rejected: Not scalable for 40+ sources

Implementation Timeline

  1. Week 1: Shared infrastructure and ICCAT
  2. Week 2: IOTC, WCPFC, IATTC
  3. Week 3: CCSBT, FFA, NAFO
  4. Week 4: Testing and quality assurance

Success Metrics

  • All RFMOs with data files have working importers
  • < 5% duplicate rate across sources
  • 95%+ vessel matching accuracy
  • Complete audit trail for identity resolution

References

  • Existing implementations: NPFC, NEAFC, SPRFMO scripts
  • Database schema: /migrations/0000_schema.sql
  • Vessel tables documentation: /docs_general/database-schema.md