# ADR-003: Vessel Data Organization and Processing Structure

- **Status:** Accepted
- **Date:** 2025-08-15
- **Authors:** Development Team
- **Reviewers:** Data Architecture Team

## Context
The Ebisu database project needs to process and import 40+ vessel datasets from various sources including RFMOs (Regional Fisheries Management Organizations), national registries, and other vessel databases. This requires a well-organized structure for managing the complex vessel data pipeline with clear separation of concerns.
## Decision
We will organize vessel-related import scripts and data files using a domain-driven approach with mirrored categorization between scripts and data for consistency and ease of navigation.
### Approved Directory Structures

**Scripts Organization:**

```
scripts/import/vessels/
├── create_sources_vessels.sql         # Vessel sources schema
├── load_sources_vessels.sh            # Vessel sources import
├── README_vessel_sources_import.md    # Documentation
└── vessel_data/                       # 40+ dataset processing organized by category
    ├── RFMO/                          # RFMO vessel datasets
    │   ├── clean_iccat_vessels.sh
    │   ├── load_iccat_vessels.sh
    │   ├── clean_iotc_vessels.sh
    │   ├── load_iotc_vessels.sh
    │   └── ... [other RFMO datasets]
    ├── COUNTRY/                       # National vessel registries
    │   ├── clean_usa_national_registry.sh
    │   ├── load_usa_national_registry.sh
    │   ├── clean_canada_vessel_registry.sh
    │   ├── load_canada_vessel_registry.sh
    │   └── ... [other country datasets]
    ├── OTHER/                         # Organizations, MSC, certification bodies
    │   ├── clean_msc_certified_vessels.sh
    │   ├── load_msc_certified_vessels.sh
    │   ├── clean_noaa_fisheries.sh
    │   ├── load_noaa_fisheries.sh
    │   └── ... [other organization datasets]
    └── BADDIE/                        # IUU and enforcement datasets
        ├── clean_global_iuu_list.sh
        ├── load_global_iuu_list.sh
        ├── clean_rfmo_iuu_combined.sh
        ├── load_rfmo_iuu_combined.sh
        └── ... [other enforcement datasets]
```
**Data Organization (Mirrored Structure):**

```
/import/vessels/
├── original_sources_vessels.csv   # Vessel source metadata
└── vessel_data/                   # 40+ datasets organized by category
    ├── RFMO/                      # RFMO vessel datasets
    │   ├── raw/                   # Original unprocessed files
    │   ├── cleaned/               # Post-cleaning, pre-import
    │   └── staged/                # Ready for database import
    ├── COUNTRY/                   # National vessel registries
    │   ├── raw/
    │   ├── cleaned/
    │   └── staged/
    ├── OTHER/                     # Organizations, MSC, etc.
    │   ├── raw/
    │   ├── cleaned/
    │   └── staged/
    └── BADDIE/                    # IUU and enforcement datasets
        ├── raw/
        ├── cleaned/
        └── staged/
```
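Because every category carries the same three phase folders, the data layout can be scaffolded in one command; a convenience sketch assuming a bash-compatible shell (brace expansion) and write access to `/import/vessels`:

```bash
# Create every category/phase directory pair in the mirrored data layout.
mkdir -p /import/vessels/vessel_data/{RFMO,COUNTRY,OTHER,BADDIE}/{raw,cleaned,staged}
```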
### Integration with Existing Architecture

- **Schema Management**: Main vessel table schemas remain in `migrations/0002_vessel_tables_migration.sql`
- **Domain Separation**: Vessels are isolated from other data domains (species, reference data, etc.)
- **Workflow Phases**: Supports the defined 4-phase vessel data workflow:
  1. Import + load vessel sources
  2. Clean vessel data (40+ datasets)
  3. Import + load vessel data (40+ datasets) to database
  4. Ensure all FK/PK relationships and UUIDs are created
## Rationale

### Why This Structure Works

- **Domain Isolation**: Clear separation between vessel data and other import domains
- **Team Ownership**: Vessel team owns and maintains the entire `/vessels/` folder
- **Categorical Organization**: 40+ datasets organized by source type for easy navigation:
  - `RFMO/` - Regional Fisheries Management Organizations (ICCAT, IOTC, WCPFC, etc.)
  - `COUNTRY/` - National vessel registries (USA, Canada, EU, etc.)
  - `OTHER/` - Organizations, certification bodies, MSC (NOAA, MSC, NGOs, etc.)
  - `BADDIE/` - IUU vessel lists and enforcement data (blacklists, violations, etc.)
- **Scalability**: Accommodates 40+ datasets with clear categorization
- **Clear Dependencies**: Sources must be loaded before individual vessel datasets
- **Existing Integration**: Leverages the existing `migrations/` folder for schema management
- **Enterprise Readiness**: Follows enterprise best practices for large-scale data processing
### Why We Rejected Alternatives

**Flat Directory Structure**: Rejected having all 40+ scripts in a single `vessel_data/` folder because it becomes unmanageable and difficult to navigate when looking for specific datasets.

**Processing Phase Folders (`clean/`, `load/`, `validate/`)**: Rejected organizing by processing phase because dataset type matters more for navigation: developers need to find scripts by data source, not by processing step.

**Setup/Schema Folders**: Rejected because the existing `migrations/` folder already handles vessel table schemas. Adding another schema management location would create confusion and duplication.

**Mixed Processing Folder**: Rejected a single folder mixing cleaning and loading scripts because it becomes unmanageable with 40+ datasets and doesn't clearly separate workflow phases.
## Implementation Guidelines

### File Naming Conventions

- **Sources**: `load_sources_vessels.sh`, `create_sources_vessels.sql`
- **Individual Datasets**: Organized by category with consistent naming (a pairing check is sketched below):
  - **RFMO**: `RFMO/clean_[rfmo_name]_vessels.sh`, `RFMO/load_[rfmo_name]_vessels.sh`
    - Examples: `clean_iccat_vessels.sh`, `clean_iotc_vessels.sh`
  - **COUNTRY**: `COUNTRY/clean_[country]_[registry_type].sh`, `COUNTRY/load_[country]_[registry_type].sh`
    - Examples: `clean_usa_national_registry.sh`, `clean_canada_vessel_registry.sh`
  - **OTHER**: `OTHER/clean_[organization]_[dataset].sh`, `OTHER/load_[organization]_[dataset].sh`
    - Examples: `clean_msc_certified_vessels.sh`, `clean_noaa_fisheries.sh`
  - **BADDIE**: `BADDIE/clean_[enforcement_type].sh`, `BADDIE/load_[enforcement_type].sh`
    - Examples: `clean_global_iuu_list.sh`, `clean_rfmo_iuu_combined.sh`
- **Documentation**: `README_vessel_sources_import.md` for comprehensive documentation
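A small check script can enforce the clean/load pairing above. This is a hypothetical helper, not part of the approved structure, and it assumes it is run from `scripts/import/vessels/`:

```bash
#!/usr/bin/env bash
# Hypothetical lint helper: report any clean_*.sh script that lacks a
# matching load_*.sh counterpart in the same category folder.
set -euo pipefail

for category in RFMO COUNTRY OTHER BADDIE; do
  for script in "vessel_data/$category"/clean_*.sh; do
    [ -e "$script" ] || continue          # skip categories with no matches
    partner="${script/clean_/load_}"      # clean_foo.sh -> load_foo.sh
    [ -e "$partner" ] || echo "MISSING: $partner"
  done
done
```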
### Processing Workflow

1. **Phase 1**: Run the vessel sources import once:
   ```bash
   ./load_sources_vessels.sh
   ```
2. **Phase 2a**: Clean datasets by category:
   ```bash
   # Subshells keep the working directory stable; the loop runs each script in turn.
   (cd vessel_data/RFMO    && for s in clean_*.sh; do ./"$s"; done)
   (cd vessel_data/COUNTRY && for s in clean_*.sh; do ./"$s"; done)
   (cd vessel_data/OTHER   && for s in clean_*.sh; do ./"$s"; done)
   (cd vessel_data/BADDIE  && for s in clean_*.sh; do ./"$s"; done)
   ```
3. **Phase 2b**: Load datasets by category (a driver wrapping Phases 2a-2b is sketched below):
   ```bash
   (cd vessel_data/RFMO    && for s in load_*.sh; do ./"$s"; done)
   (cd vessel_data/COUNTRY && for s in load_*.sh; do ./"$s"; done)
   (cd vessel_data/OTHER   && for s in load_*.sh; do ./"$s"; done)
   (cd vessel_data/BADDIE  && for s in load_*.sh; do ./"$s"; done)
   ```
4. **Phase 3**: Validate relationships and data integrity across all categories
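Running the phases by hand stops being practical at 40+ datasets. A driver along these lines could wrap Phases 2a-2b while isolating per-dataset failures; this is an illustrative sketch, not a committed script:

```bash
#!/usr/bin/env bash
# Illustrative pipeline driver: run clean then load for every category,
# recording failures instead of aborting so one bad dataset does not
# block the remaining ones.
set -uo pipefail    # deliberately no -e: failures are trapped per script

failures=()
for phase in clean load; do                        # clean everything before loading
  for category in RFMO COUNTRY OTHER BADDIE; do
    for script in "vessel_data/$category/$phase"_*.sh; do
      [ -e "$script" ] || continue                 # skip empty globs
      "$script" || failures+=("$script")
    done
  done
done

printf 'Failed: %s\n' "${failures[@]:-none}"
[ "${#failures[@]}" -eq 0 ]    # exit non-zero if any dataset failed
```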
### Development Standards

- **Error Handling**: All scripts must use `set -euo pipefail` for robust error handling
- **Logging**: Consistent timestamped logging using shared utility functions
- **Documentation**: Each script includes header comments explaining purpose and dependencies (illustrated in the skeleton below)
- **Testing**: Individual datasets can be tested independently without full pipeline runs
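A minimal skeleton showing how these standards fit together; the shared `common.sh` utility and its `log` function are hypothetical placeholders for whatever the team's actual logging helper is:

```bash
#!/usr/bin/env bash
# clean_iccat_vessels.sh
# Purpose:      Normalize the raw ICCAT vessel export into its cleaned/ form.
# Dependencies: raw input file present; shared logging utilities loaded below.
set -euo pipefail

# Hypothetical shared helper providing a timestamped log() function.
source "$(dirname "$0")/../../lib/common.sh"

log "Cleaning ICCAT vessels..."
# ... dataset-specific cleaning steps ...
log "ICCAT vessels cleaned."
```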
## Technical Considerations

### Database Integration

- **Foreign Keys**: All vessel data references `original_sources_vessels` via UUID foreign keys
- **RFMO Integration**: Automatic mapping from source names to `rfmos` table UUIDs
- **Country Integration**: Automatic mapping from alpha3 codes to `country_iso` table UUIDs (see the load sketch below)
- **Source Types**: Enum validation for vessel source types (RFMO, COUNTRY, ORGANIZATION, etc.)
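A sketch of how a `load_*.sh` script might resolve those UUID mappings during import, assuming PostgreSQL via `psql`. `$DATABASE_URL`, the staging table, and all column names are illustrative assumptions; only `original_sources_vessels`, `rfmos`, and `country_iso` come from this ADR:

```bash
# Illustrative load fragment (assumed schema details): insert staged rows
# while resolving UUID foreign keys from names and alpha3 codes.
psql "$DATABASE_URL" <<'SQL'
INSERT INTO vessels (source_id, rfmo_id, country_id, vessel_name)
SELECT s.id,          -- FK to original_sources_vessels (UUID)
       r.id,          -- mapped from the source's RFMO name
       c.id,          -- mapped from the vessel's alpha3 flag code
       st.vessel_name
FROM staging_vessels st
JOIN original_sources_vessels s ON s.source_name = st.source_name
LEFT JOIN rfmos       r ON r.name   = st.rfmo_name
LEFT JOIN country_iso c ON c.alpha3 = st.flag_alpha3;
SQL
```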
### Data Organization and Lineage

- **Mirrored Structure**: Data files organized identically to processing scripts for consistency
- **Processing Phases**: Clear separation of raw → cleaned → staged data phases
- **Data Immutability**: Raw files preserved exactly as received from sources (see the sketch below)
- **Traceability**: Full lineage tracking from source files through processing to database
- **Category Alignment**: Data categories match vessel source type enums and script organization
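Concretely, a clean script only ever reads from `raw/` and writes to `cleaned/`. A sketch assuming GNU sed and an illustrative file name:

```bash
# Raw input is treated as read-only; cleaning writes a new file so the
# original stays byte-identical to what the source provided.
RAW=/import/vessels/vessel_data/RFMO/raw/iccat_vessels.csv        # assumed name
CLEANED=/import/vessels/vessel_data/RFMO/cleaned/iccat_vessels.csv

# Example transformation: strip a UTF-8 BOM and normalize CRLF line endings.
sed '1s/^\xEF\xBB\xBF//' "$RAW" | tr -d '\r' > "$CLEANED"
```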
### Data Quality

- **Source Tracking**: Every piece of vessel data is linked to its original source
- **Validation Scripts**: Comprehensive FK/PK relationship validation after imports (example below)
- **Error Isolation**: Individual dataset failures don't affect other datasets
- **Progress Monitoring**: Clear visibility into processing status across 40+ datasets
- **Quality Metrics**: Validation reports and error logs for each dataset
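As an example of such a validation check, an orphaned-FK count could look like this (assuming PostgreSQL; the `vessels` table and column names are assumptions):

```bash
# Post-import sanity check: vessel rows whose source_id does not resolve
# to a row in original_sources_vessels should number zero.
psql "$DATABASE_URL" -c "
  SELECT count(*) AS orphaned_rows
  FROM vessels v
  LEFT JOIN original_sources_vessels s ON s.id = v.source_id
  WHERE s.id IS NULL;"
```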
## Consequences

### Positive

- **Clear Ownership**: Vessel team has full control over both scripts and data organization
- **Easy Navigation**: 40+ datasets organized by logical categories (RFMO, COUNTRY, OTHER, BADDIE) in both scripts and data
- **Mirrored Consistency**: Data files organized identically to processing scripts for intuitive navigation
- **Scalable**: Structure handles current and future vessel datasets efficiently with clear categorization
- **Maintainable**: Easy to add, modify, or troubleshoot individual datasets within their category
- **Parallel Development**: Multiple team members can work on different categories/datasets simultaneously
- **Error Isolation**: Dataset failures are contained within categories and don't break the entire pipeline
- **Data Lineage**: Clear traceability from raw data through processing phases to database
- **Enterprise Ready**: Follows patterns that scale to enterprise-level complexity
- **Intuitive**: New team members can quickly understand where to find both scripts and data for specific dataset types

### Negative

- **Initial Setup**: Requires creating both script and data directory structures and organizing existing files
- **Learning Curve**: Team needs to understand the new organization, workflow phases, and data processing stages
- **Coordination**: Must ensure proper sequencing of vessel sources before individual datasets
- **Storage Management**: Need to monitor disk usage across multiple data processing phases
- **Data Duplication**: Raw, cleaned, and staged versions of datasets require additional storage space

### Neutral

- **Flexibility**: Future reorganization is possible if needs change significantly
- **Documentation**: Requires maintaining documentation as structure evolves
## Monitoring and Success Metrics

- **Import Success Rate**: Percentage of the 40+ datasets that import successfully
- **Processing Time**: Time to complete the full vessel data pipeline
- **Error Rate**: Number of FK/PK relationship validation errors
- **Developer Productivity**: Time to add new vessel datasets to the pipeline
## Related Documents

- `migrations/0002_vessel_tables_migration.sql` - Main vessel table schemas
- `scripts/import/vessels/README_vessel_sources_import.md` - Implementation documentation
- `drizzle-schemas/vessels/sources.ts` - TypeScript schema definitions
## Future Considerations

- **Parallel Processing**: May add orchestration for parallel dataset cleaning/loading
- **Data Quality Dashboard**: Potential integration with monitoring systems
- **Automated Testing**: Could add automated validation for new dataset additions
- **Performance Optimization**: May optimize for specific bottlenecks as scale increases
This ADR establishes the foundation for scalable, maintainable vessel data processing that can grow with the project's needs while maintaining clarity and team productivity.