# ADR-003: Vessel Data Organization and Processing Structure

- **Status:** Accepted
- **Date:** 2025-08-15
- **Authors:** Development Team
- **Reviewers:** Data Architecture Team

## Context
The Ebisu database project needs to process and import 40+ vessel datasets from various sources including RFMOs (Regional Fisheries Management Organizations), national registries, and other vessel databases. This requires a well-organized structure for managing the complex vessel data pipeline with clear separation of concerns.
## Decision
We will organize vessel-related import scripts and data files using a domain-driven approach with mirrored categorization between scripts and data for consistency and ease of navigation.
### Approved Directory Structures

**Scripts Organization:**

```
scripts/import/vessels/
├── create_sources_vessels.sql         # Vessel sources schema
├── load_sources_vessels.sh            # Vessel sources import
├── README_vessel_sources_import.md    # Documentation
└── vessel_data/                       # 40+ dataset processing organized by category
    ├── RFMO/                          # RFMO vessel datasets
    │   ├── clean_iccat_vessels.sh
    │   ├── load_iccat_vessels.sh
    │   ├── clean_iotc_vessels.sh
    │   ├── load_iotc_vessels.sh
    │   └── ... [other RFMO datasets]
    ├── COUNTRY/                       # National vessel registries
    │   ├── clean_usa_national_registry.sh
    │   ├── load_usa_national_registry.sh
    │   ├── clean_canada_vessel_registry.sh
    │   ├── load_canada_vessel_registry.sh
    │   └── ... [other country datasets]
    ├── OTHER/                         # Organizations, MSC, certification bodies
    │   ├── clean_msc_certified_vessels.sh
    │   ├── load_msc_certified_vessels.sh
    │   ├── clean_noaa_fisheries.sh
    │   ├── load_noaa_fisheries.sh
    │   └── ... [other organization datasets]
    └── BADDIE/                        # IUU and enforcement datasets
        ├── clean_global_iuu_list.sh
        ├── load_global_iuu_list.sh
        ├── clean_rfmo_iuu_combined.sh
        ├── load_rfmo_iuu_combined.sh
        └── ... [other enforcement datasets]
```
**Data Organization (Mirrored Structure):**

```
/import/vessels/
├── original_sources_vessels.csv   # Vessel source metadata
└── vessel_data/                   # 40+ datasets organized by category
    ├── RFMO/                      # RFMO vessel datasets
    │   ├── raw/                   # Original unprocessed files
    │   ├── cleaned/               # Post-cleaning, pre-import
    │   └── staged/                # Ready for database import
    ├── COUNTRY/                   # National vessel registries
    │   ├── raw/
    │   ├── cleaned/
    │   └── staged/
    ├── OTHER/                     # Organizations, MSC, etc.
    │   ├── raw/
    │   ├── cleaned/
    │   └── staged/
    └── BADDIE/                    # IUU and enforcement datasets
        ├── raw/
        ├── cleaned/
        └── staged/
```
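Because every category carries the same three phase folders, the data layout can be scaffolded in one command; a convenience sketch assuming a bash-compatible shell (brace expansion) and write access to `/import/vessels`:

```bash
# Create every category/phase directory pair in the mirrored data layout.
mkdir -p /import/vessels/vessel_data/{RFMO,COUNTRY,OTHER,BADDIE}/{raw,cleaned,staged}
```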
### Integration with Existing Architecture

- **Schema Management**: Main vessel table schemas remain in `migrations/0002_vessel_tables_migration.sql`
- **Domain Separation**: Vessels are isolated from other data domains (species, reference data, etc.)
- **Workflow Phases**: Supports the defined 4-phase vessel data workflow:
  1. Import + load vessel sources
  2. Clean vessel data (40+ datasets)
  3. Import + load vessel data (40+ datasets) to database
  4. Ensure all FK/PK relationships and UUIDs are created
## Rationale

### Why This Structure Works

- **Domain Isolation**: Clear separation between vessel data and other import domains
- **Team Ownership**: Vessel team owns and maintains the entire `/vessels/` folder
- **Categorical Organization**: 40+ datasets organized by source type for easy navigation:
  - `RFMO/` - Regional Fisheries Management Organizations (ICCAT, IOTC, WCPFC, etc.)
  - `COUNTRY/` - National vessel registries (USA, Canada, EU, etc.)
  - `OTHER/` - Organizations, certification bodies, MSC (NOAA, MSC, NGOs, etc.)
  - `BADDIE/` - IUU vessel lists and enforcement data (blacklists, violations, etc.)
- **Scalability**: Accommodates 40+ datasets with clear categorization
- **Clear Dependencies**: Sources must be loaded before individual vessel datasets
- **Existing Integration**: Leverages the existing `migrations/` folder for schema management
- **Enterprise Readiness**: Follows enterprise best practices for large-scale data processing
### Why We Rejected Alternatives

**Flat Directory Structure**: Rejected having all 40+ scripts in a single `vessel_data/` folder because it becomes unmanageable and difficult to navigate when looking for specific datasets.

**Processing Phase Folders (`clean/`, `load/`, `validate/`)**: Rejected organizing by processing phase because dataset type matters more for navigation: developers need to find scripts by data source, not by processing step.

**Setup/Schema Folders**: Rejected because the existing `migrations/` folder already handles vessel table schemas. Adding another schema management location would create confusion and duplication.

**Mixed Processing Folder**: Rejected a single folder mixing cleaning and loading scripts because it becomes unmanageable with 40+ datasets and doesn't clearly separate workflow phases.
## Implementation Guidelines

### File Naming Conventions

- **Sources**: `load_sources_vessels.sh`, `create_sources_vessels.sql`
- **Individual Datasets**: Organized by category with consistent naming (a pairing check is sketched below):
  - **RFMO**: `RFMO/clean_[rfmo_name]_vessels.sh`, `RFMO/load_[rfmo_name]_vessels.sh`
    - Examples: `clean_iccat_vessels.sh`, `clean_iotc_vessels.sh`
  - **COUNTRY**: `COUNTRY/clean_[country]_[registry_type].sh`, `COUNTRY/load_[country]_[registry_type].sh`
    - Examples: `clean_usa_national_registry.sh`, `clean_canada_vessel_registry.sh`
  - **OTHER**: `OTHER/clean_[organization]_[dataset].sh`, `OTHER/load_[organization]_[dataset].sh`
    - Examples: `clean_msc_certified_vessels.sh`, `clean_noaa_fisheries.sh`
  - **BADDIE**: `BADDIE/clean_[enforcement_type].sh`, `BADDIE/load_[enforcement_type].sh`
    - Examples: `clean_global_iuu_list.sh`, `clean_rfmo_iuu_combined.sh`
- **Documentation**: `README_vessel_sources_import.md` for comprehensive documentation
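A small check script can enforce the clean/load pairing above. This is a hypothetical helper, not part of the approved structure, and it assumes it is run from `scripts/import/vessels/`:

```bash
#!/usr/bin/env bash
# Hypothetical lint helper: report any clean_*.sh script that lacks a
# matching load_*.sh counterpart in the same category folder.
set -euo pipefail

for category in RFMO COUNTRY OTHER BADDIE; do
  for script in "vessel_data/$category"/clean_*.sh; do
    [ -e "$script" ] || continue          # skip categories with no matches
    partner="${script/clean_/load_}"      # clean_foo.sh -> load_foo.sh
    [ -e "$partner" ] || echo "MISSING: $partner"
  done
done
```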
### Processing Workflow

1. **Phase 1**: Run the vessel sources import once:
   ```bash
   ./load_sources_vessels.sh
   ```
2. **Phase 2a**: Clean datasets by category:
   ```bash
   # Subshells keep the working directory stable; the loop runs each script in turn.
   (cd vessel_data/RFMO    && for s in clean_*.sh; do ./"$s"; done)
   (cd vessel_data/COUNTRY && for s in clean_*.sh; do ./"$s"; done)
   (cd vessel_data/OTHER   && for s in clean_*.sh; do ./"$s"; done)
   (cd vessel_data/BADDIE  && for s in clean_*.sh; do ./"$s"; done)
   ```
3. **Phase 2b**: Load datasets by category (a driver wrapping Phases 2a-2b is sketched below):
   ```bash
   (cd vessel_data/RFMO    && for s in load_*.sh; do ./"$s"; done)
   (cd vessel_data/COUNTRY && for s in load_*.sh; do ./"$s"; done)
   (cd vessel_data/OTHER   && for s in load_*.sh; do ./"$s"; done)
   (cd vessel_data/BADDIE  && for s in load_*.sh; do ./"$s"; done)
   ```
4. **Phase 3**: Validate relationships and data integrity across all categories
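Running the phases by hand stops being practical at 40+ datasets. A driver along these lines could wrap Phases 2a-2b while isolating per-dataset failures; this is an illustrative sketch, not a committed script:

```bash
#!/usr/bin/env bash
# Illustrative pipeline driver: run clean then load for every category,
# recording failures instead of aborting so one bad dataset does not
# block the remaining ones.
set -uo pipefail    # deliberately no -e: failures are trapped per script

failures=()
for phase in clean load; do                        # clean everything before loading
  for category in RFMO COUNTRY OTHER BADDIE; do
    for script in "vessel_data/$category/$phase"_*.sh; do
      [ -e "$script" ] || continue                 # skip empty globs
      "$script" || failures+=("$script")
    done
  done
done

printf 'Failed: %s\n' "${failures[@]:-none}"
[ "${#failures[@]}" -eq 0 ]    # exit non-zero if any dataset failed
```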
### Development Standards

- **Error Handling**: All scripts must use `set -euo pipefail` for robust error handling
- **Logging**: Consistent timestamped logging using shared utility functions
- **Documentation**: Each script includes header comments explaining purpose and dependencies (illustrated in the skeleton below)
- **Testing**: Individual datasets can be tested independently without full pipeline runs
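A minimal skeleton showing how these standards fit together; the shared `common.sh` utility and its `log` function are hypothetical placeholders for whatever the team's actual logging helper is:

```bash
#!/usr/bin/env bash
# clean_iccat_vessels.sh
# Purpose:      Normalize the raw ICCAT vessel export into its cleaned/ form.
# Dependencies: raw input file present; shared logging utilities loaded below.
set -euo pipefail

# Hypothetical shared helper providing a timestamped log() function.
source "$(dirname "$0")/../../lib/common.sh"

log "Cleaning ICCAT vessels..."
# ... dataset-specific cleaning steps ...
log "ICCAT vessels cleaned."
```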
## Technical Considerations

### Database Integration

- **Foreign Keys**: All vessel data references `original_sources_vessels` via UUID foreign keys
- **RFMO Integration**: Automatic mapping from source names to `rfmos` table UUIDs
- **Country Integration**: Automatic mapping from alpha3 codes to `country_iso` table UUIDs (see the load sketch below)
- **Source Types**: Enum validation for vessel source types (RFMO, COUNTRY, ORGANIZATION, etc.)
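A sketch of how a `load_*.sh` script might resolve those UUID mappings during import, assuming PostgreSQL via `psql`. `$DATABASE_URL`, the staging table, and all column names are illustrative assumptions; only `original_sources_vessels`, `rfmos`, and `country_iso` come from this ADR:

```bash
# Illustrative load fragment (assumed schema details): insert staged rows
# while resolving UUID foreign keys from names and alpha3 codes.
psql "$DATABASE_URL" <<'SQL'
INSERT INTO vessels (source_id, rfmo_id, country_id, vessel_name)
SELECT s.id,          -- FK to original_sources_vessels (UUID)
       r.id,          -- mapped from the source's RFMO name
       c.id,          -- mapped from the vessel's alpha3 flag code
       st.vessel_name
FROM staging_vessels st
JOIN original_sources_vessels s ON s.source_name = st.source_name
LEFT JOIN rfmos       r ON r.name   = st.rfmo_name
LEFT JOIN country_iso c ON c.alpha3 = st.flag_alpha3;
SQL
```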
### Data Organization and Lineage

- **Mirrored Structure**: Data files organized identically to processing scripts for consistency
- **Processing Phases**: Clear separation of raw → cleaned → staged data phases
- **Data Immutability**: Raw files preserved exactly as received from sources (see the sketch below)
- **Traceability**: Full lineage tracking from source files through processing to database
- **Category Alignment**: Data categories match vessel source type enums and script organization
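Concretely, a clean script only ever reads from `raw/` and writes to `cleaned/`. A sketch assuming GNU sed and an illustrative file name:

```bash
# Raw input is treated as read-only; cleaning writes a new file so the
# original stays byte-identical to what the source provided.
RAW=/import/vessels/vessel_data/RFMO/raw/iccat_vessels.csv        # assumed name
CLEANED=/import/vessels/vessel_data/RFMO/cleaned/iccat_vessels.csv

# Example transformation: strip a UTF-8 BOM and normalize CRLF line endings.
sed '1s/^\xEF\xBB\xBF//' "$RAW" | tr -d '\r' > "$CLEANED"
```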
### Data Quality

- **Source Tracking**: Every piece of vessel data is linked to its original source
- **Validation Scripts**: Comprehensive FK/PK relationship validation after imports (example below)
- **Error Isolation**: Individual dataset failures don't affect other datasets
- **Progress Monitoring**: Clear visibility into processing status across 40+ datasets
- **Quality Metrics**: Validation reports and error logs for each dataset
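As an example of such a validation check, an orphaned-FK count could look like this (assuming PostgreSQL; the `vessels` table and column names are assumptions):

```bash
# Post-import sanity check: vessel rows whose source_id does not resolve
# to a row in original_sources_vessels should number zero.
psql "$DATABASE_URL" -c "
  SELECT count(*) AS orphaned_rows
  FROM vessels v
  LEFT JOIN original_sources_vessels s ON s.id = v.source_id
  WHERE s.id IS NULL;"
```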
## Consequences

### Positive

- **Clear Ownership**: Vessel team has full control over both scripts and data organization
- **Easy Navigation**: 40+ datasets organized by logical categories (RFMO, COUNTRY, OTHER, BADDIE) in both scripts and data
- **Mirrored Consistency**: Data files organized identically to processing scripts for intuitive navigation
- **Scalable**: Structure handles current and future vessel datasets efficiently with clear categorization
- **Maintainable**: Easy to add, modify, or troubleshoot individual datasets within their category
- **Parallel Development**: Multiple team members can work on different categories/datasets simultaneously
- **Error Isolation**: Dataset failures are contained within categories and don't break the entire pipeline
- **Data Lineage**: Clear traceability from raw data through processing phases to database
- **Enterprise Ready**: Follows patterns that scale to enterprise-level complexity
- **Intuitive**: New team members can quickly understand where to find both scripts and data for specific dataset types

### Negative

- **Initial Setup**: Requires creating both script and data directory structures and organizing existing files
- **Learning Curve**: Team needs to understand the new organization, workflow phases, and data processing stages
- **Coordination**: Must ensure proper sequencing of vessel sources before individual datasets
- **Storage Management**: Need to monitor disk usage across multiple data processing phases
- **Data Duplication**: Raw, cleaned, and staged versions of datasets require additional storage space

### Neutral

- **Flexibility**: Future reorganization is possible if needs change significantly
- **Documentation**: Requires maintaining documentation as structure evolves
## Monitoring and Success Metrics

- **Import Success Rate**: Percentage of the 40+ datasets that import successfully
- **Processing Time**: Time to complete the full vessel data pipeline
- **Error Rate**: Number of FK/PK relationship validation errors
- **Developer Productivity**: Time to add new vessel datasets to the pipeline
## Related Documents

- `migrations/0002_vessel_tables_migration.sql` - Main vessel table schemas
- `scripts/import/vessels/README_vessel_sources_import.md` - Implementation documentation
- `drizzle-schemas/vessels/sources.ts` - TypeScript schema definitions
## Future Considerations

- **Parallel Processing**: May add orchestration for parallel dataset cleaning/loading
- **Data Quality Dashboard**: Potential integration with monitoring systems
- **Automated Testing**: Could add automated validation for new dataset additions
- **Performance Optimization**: May optimize for specific bottlenecks as scale increases
This ADR establishes the foundation for scalable, maintainable vessel data processing that can grow with the project's needs while maintaining clarity and team productivity.