Source: ebisu/docs/adr/0064-vessel-import-directory-organization.md

ADR-0064: Vessel Import Directory Organization

Status: Accepted
Date: 2025-01-11
Stakeholders: Development team, Data team

Context

The vessel import system had become difficult to manage because its many data sources update at different frequencies. We needed a standardized approach to:

  • Organize data files by source type and individual source
  • Handle updates independently for each source
  • Maintain version history
  • Prevent accidental reimports
  • Track data lineage

Decision

Implement a hierarchical directory structure with standardized import scripts for each data source.

Directory Structure

import/vessels/vessel_data/
├── RFMO/ # Regional Fisheries Management Organizations
├── COUNTRY/ # National vessel registries
├── INTERGOV/ # Inter-governmental organizations
├── BADDIE/ # Sanctions and IUU lists
└── CIVIL_SOCIETY/ # NGO and civil society sources

Each source has:
SOURCE_NAME/
├── raw/ # Original files as received
├── cleaned/ # Processed CSV files ready for import
└── archive/ # Historical versions

Import Script Pattern

Each data source has its own import script that:

  1. Automatically finds the latest file in /raw/
  2. Checks file hash to prevent reimports
  3. Creates import batch and lineage records
  4. Loads to staging tables
  5. Converts to intelligence reports
  6. Updates cross-source confirmations
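The steps above can be sketched as a small shell script. This is a sketch of steps 1–2 only, run against a temp directory standing in for a real raw/ directory; the hash-ledger file, demo files, and all paths are illustrative assumptions, not the project's actual scripts:

```shell
#!/usr/bin/env bash
# Sketch of steps 1-2 of the per-source import pattern, using a temp
# directory as a stand-in for a real raw/ directory. The hash ledger,
# demo files, and paths are assumptions, not the real load script.
set -euo pipefail

demo=$(mktemp -d)
RAW_DIR="$demo/raw"
HASH_LEDGER="$demo/.imported_hashes"
mkdir -p "$RAW_DIR"
touch "$HASH_LEDGER"
printf 'old\n' > "$RAW_DIR/PNA_TUNA_2025-01-08.csv"
printf 'new\n' > "$RAW_DIR/PNA_TUNA_2025-02-08.csv"

# 1. Latest file in raw/: the YYYY-MM-DD suffix makes lexical sort match date order.
latest=$(ls "$RAW_DIR"/PNA_TUNA_*.csv | sort | tail -n 1)

# 2. Hash check to prevent accidental reimports.
hash=$(sha256sum "$latest" | awk '{print $1}')
if grep -q "$hash" "$HASH_LEDGER"; then
    echo "skipped $latest"
else
    # 3-6. Batch/lineage records, staging load, intelligence-report
    # conversion, and cross-source confirmation updates would run here.
    echo "$hash  $latest" >> "$HASH_LEDGER"
    echo "imported $latest"
fi
```

Because the date sits at the end of the filename, "latest" falls out of a plain lexical sort with no date parsing needed.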

File Naming Convention

  • Include the collection date: SOURCE_NAME_YYYY-MM-DD.ext
  • Preserve the original filename structure where possible
  • Indicate the data collection date clearly
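Applied on arrival, the convention amounts to stamping the source name and collection date onto the file; the values below are placeholders:

```shell
# Sketch: building a conventional filename from a source name and the
# data collection date. SOURCE and the sample date are placeholders.
SOURCE="PNA_TUNA"
collected="2025-02-08"   # the data collection date, not the download date
stamped="${SOURCE}_${collected}.csv"
echo "$stamped"   # → PNA_TUNA_2025-02-08.csv
```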

Implementation

1. Directory Creation

mkdir -p vessel_data/{TYPE}/{SOURCE}/{raw,cleaned,archive}
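The same command can bootstrap several sources in one loop. In this sketch the base path is a temp directory, and source names other than PNA_TUNA are hypothetical examples, not the project's actual source list:

```shell
# Sketch: creating the raw/cleaned/archive tree for several sources.
# The base path is a temp dir; ICCAT and IUU_COMBINED are hypothetical.
base=$(mktemp -d)/vessel_data
for src in RFMO/ICCAT INTERGOV/PNA_TUNA BADDIE/IUU_COMBINED; do
    mkdir -p "$base/$src"/{raw,cleaned,archive}
done
find "$base" -mindepth 3 -type d | sort
```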

2. Import Script Template

/scripts/import/vessels/data/{TYPE}/load_{source}.sh

3. Status Monitoring

/scripts/import/vessels/check_import_status.sh

Consequences

Positive

  • Independent Updates: Each source can be updated without affecting others
  • Clear Organization: Easy to find data for any source
  • Version Control: Archive directory preserves history
  • Automated Processing: Scripts handle most import logic
  • Audit Trail: Complete lineage tracking

Negative

  • More Scripts: One script per source (maintenance overhead)
  • Directory Sprawl: Many subdirectories to manage
  • Storage: Keeping raw and cleaned versions uses more space

Neutral

  • Learning Curve: Team needs to understand new structure
  • Migration Work: Existing data needs reorganization

Example Workflow

Adding New Data

  1. Download latest vessel list from source
  2. Place in /raw/ directory with date
  3. Run import script
  4. Move previous version to /archive/

Monthly Update Example

# 1. Add new PNA TUNA file
cp PNA_TUNA_2025-02-08.csv import/vessels/vessel_data/INTERGOV/PNA_TUNA/raw/

# 2. Run import
docker exec -i ebisu-importer /app/scripts/import/vessels/data/INTERGOV/load_pna_tuna.sh

# 3. Archive old file
mv import/vessels/vessel_data/INTERGOV/PNA_TUNA/raw/PNA_TUNA_2025-01-08.csv \
import/vessels/vessel_data/INTERGOV/PNA_TUNA/archive/

Security Considerations

  • No Credentials: Never store API keys or passwords in import directories
  • File Validation: Always verify file integrity before import
  • Access Control: Limit write access to import directories
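A minimal sketch of the file-validation point, assuming a SHA-256 checksum is recorded alongside each file when it is received (the checksum file and paths here are illustrative):

```shell
# Sketch: verifying file integrity before import by checking a recorded
# SHA-256 checksum. The demo file and checksum path are illustrative.
demo=$(mktemp -d)
printf 'vessel,imo\nFoo,1234567\n' > "$demo/PNA_TUNA_2025-02-08.csv"

# On receipt: record the checksum next to the file.
sha256sum "$demo/PNA_TUNA_2025-02-08.csv" > "$demo/PNA_TUNA_2025-02-08.csv.sha256"

# Before import: -c exits non-zero if the file has changed since receipt.
if (cd "$demo" && sha256sum -c PNA_TUNA_2025-02-08.csv.sha256 >/dev/null); then
    echo "integrity ok"
else
    echo "integrity FAILED; refusing to import" >&2
fi
```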

Monitoring

The check_import_status.sh script provides:

  • Recent import history
  • Source coverage statistics
  • Cross-source confirmation metrics
  • Data quality indicators

References

  • ADR-0061: Intelligence Confirmations Are Not Duplicates
  • ADR-0062: Phase 1 Data Isolation Architecture
  • ADR-0063: Missing Vessel Registries Import Strategy
  • IMPORT_GUIDE.md: Detailed import instructions