Source: `ebisu/docs/guides/import/vessel-data-organization-guide.md`
# Vessel Data Organization Best Practices

This guide describes enterprise best practices for organizing vessel data files under `/import/vessels/` so that the data layout matches the modular script structure.
## Recommended Data Directory Structure

### Option 1: Mirrored Structure (Recommended)

Mirror the script organization for consistency and easy navigation:
```
/import/vessels/
├── original_sources_vessels.csv      # Vessel source metadata
└── vessel_data/                      # 40+ datasets organized by category
    ├── RFMO/                         # RFMO vessel datasets
    │   ├── raw/                      # Original unprocessed files
    │   │   ├── iccat_vessels_2024.csv
    │   │   ├── iotc_vessels_2024.csv
    │   │   ├── wcpfc_vessels_2024.csv
    │   │   └── nafo_vessels_2024.csv
    │   ├── cleaned/                  # Post-cleaning, pre-import
    │   │   ├── iccat_vessels_cleaned.csv
    │   │   ├── iotc_vessels_cleaned.csv
    │   │   └── wcpfc_vessels_cleaned.csv
    │   └── staged/                   # Ready for database import
    │       ├── iccat_vessels_staged.csv
    │       └── iotc_vessels_staged.csv
    ├── COUNTRY/                      # National vessel registries
    │   ├── raw/
    │   │   ├── usa_national_registry_2024.csv
    │   │   ├── canada_vessel_registry_2024.csv
    │   │   └── eu_fishing_fleet_2024.csv
    │   ├── cleaned/
    │   │   ├── usa_national_registry_cleaned.csv
    │   │   └── canada_vessel_registry_cleaned.csv
    │   └── staged/
    │       └── usa_national_registry_staged.csv
    ├── OTHER/                        # Organizations, MSC, certification bodies
    │   ├── raw/
    │   │   ├── msc_certified_vessels_2024.csv
    │   │   ├── noaa_fisheries_2024.csv
    │   │   └── wwf_vessel_data_2024.csv
    │   ├── cleaned/
    │   └── staged/
    └── BADDIE/                       # IUU and enforcement datasets
        ├── raw/
        │   ├── global_iuu_list_2024.csv
        │   ├── iccat_iuu_vessels_2024.csv
        │   └── interpol_purple_notices_2024.csv
        ├── cleaned/
        └── staged/
```
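The mirrored layout above can be bootstrapped with a short script. This is a minimal sketch: the category and phase names come from the tree above, while the root path is a parameter rather than hard-coded.

```python
from pathlib import Path

# Categories and processing phases from the mirrored structure above.
CATEGORIES = ["RFMO", "COUNTRY", "OTHER", "BADDIE"]
PHASES = ["raw", "cleaned", "staged"]

def create_vessel_tree(root: str) -> list[Path]:
    """Create <root>/vessel_data/<CATEGORY>/<phase>/ for every combination."""
    created = []
    for category in CATEGORIES:
        for phase in PHASES:
            path = Path(root) / "vessel_data" / category / phase
            path.mkdir(parents=True, exist_ok=True)  # idempotent: safe to re-run
            created.append(path)
    return created
```

Because `exist_ok=True` makes directory creation idempotent, the script can be re-run safely when new categories are added.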
### Option 2: Processing-Phase Structure (Alternative)

Organize by processing phase if the workflow matters more than the source type:
```
/import/vessels/
├── original_sources_vessels.csv      # Vessel source metadata
└── vessel_data/
    ├── 01_raw/                       # Original files from all sources
    │   ├── RFMO/
    │   ├── COUNTRY/
    │   ├── OTHER/
    │   └── BADDIE/
    ├── 02_cleaned/                   # Post-cleaning files
    │   ├── RFMO/
    │   ├── COUNTRY/
    │   ├── OTHER/
    │   └── BADDIE/
    └── 03_staged/                    # Ready for import
        ├── RFMO/
        ├── COUNTRY/
        ├── OTHER/
        └── BADDIE/
```
## Best Practices

### 1. File Naming Conventions
**Consistent pattern:** `[organization]_[dataset_type]_[year/date]_[status].csv`

**Examples:**

- Raw files: `iccat_vessels_2024.csv`, `usa_national_registry_20240815.csv`
- Cleaned files: `iccat_vessels_cleaned.csv`, `usa_national_registry_cleaned.csv`
- Staged files: `iccat_vessels_staged.csv`, `usa_national_registry_staged.csv`

**Special cases:**

- Versioned: `iccat_vessels_2024_v2.csv` (if multiple versions exist)
- Dated: `global_iuu_list_20240815.csv` (for frequently updated lists)
- Combined: `rfmo_iuu_combined_2024.csv` (for merged datasets)
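A naming convention is easiest to enforce mechanically. The regex below is a hypothetical approximation of the pattern above (it treats the status segment as one of a year, a date, `cleaned`, or `staged`, with an optional version suffix); adjust it if the real convention differs.

```python
import re

# Hypothetical pattern approximating
# [organization]_[dataset_type]_[year/date]_[status].csv from the examples above.
FILENAME_RE = re.compile(
    r"^[a-z0-9]+(_[a-z0-9]+)*"        # organization + dataset-type words
    r"_(\d{4}|\d{8}|cleaned|staged)"  # year, date, or processing status
    r"(_v\d+)?"                       # optional version suffix, e.g. _v2
    r"\.csv$"
)

def follows_convention(filename: str) -> bool:
    """True if the filename matches the naming convention."""
    return FILENAME_RE.fullmatch(filename) is not None
```

Running such a check in CI or at import time catches stray files before they reach the cleaning scripts.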
### 2. Data Processing Phases

**Phase 1: Raw Data**

- Original files exactly as received from sources
- Never modify raw files; treat them as immutable
- Include source URLs and download dates in metadata
- Preserve original file names when possible
**Phase 2: Cleaned Data**

- Post-cleaning, standardized format
- Consistent column names and data types
- Duplicate removal and data validation complete
- Ready for matching and deduplication across sources
**Phase 3: Staged Data**

- Final format for database import
- All foreign-key relationships resolved
- UUIDs generated where needed
- Split into appropriate table structures (`vessels`, `vessel_info`, etc.)
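One way to enforce the "raw files are immutable" rule from Phase 1 is to strip write permission after download. This is a sketch assuming a POSIX filesystem; permission bits are advisory protection, not a substitute for backups.

```python
import stat
from pathlib import Path

def lock_raw_files(raw_dir: str) -> int:
    """Strip write permission from every file under raw/; returns the count locked."""
    count = 0
    for path in Path(raw_dir).rglob("*"):
        if path.is_file():
            mode = path.stat().st_mode
            # Remove write bits for owner, group, and others (e.g. 644 -> 444).
            path.chmod(mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
            count += 1
    return count
```

Run this as the final step of each download job so cleaning scripts cannot accidentally edit raw inputs in place.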
### 3. Metadata and Documentation

**Data provenance file:** create a `_metadata.json` in each category folder:
```json
{
  "category": "RFMO",
  "datasets": {
    "iccat_vessels_2024.csv": {
      "source_url": "https://www.iccat.int/VesselList/...",
      "download_date": "2024-08-15",
      "source_contact": "iccat@iccat.int",
      "file_size_mb": 12.5,
      "record_count": 8543,
      "last_updated": "2024-07-30",
      "processing_status": "cleaned",
      "notes": "Contains Atlantic bluefin tuna vessels"
    }
  }
}
```
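Provenance files are only useful if they stay complete, so a periodic check helps. The sketch below flags datasets missing key fields; `REQUIRED_FIELDS` is an illustrative subset of the keys in the example above, not a fixed schema.

```python
import json

# Illustrative subset of the provenance keys shown above; adjust to taste.
REQUIRED_FIELDS = {"source_url", "download_date", "record_count", "processing_status"}

def check_metadata(metadata_path: str) -> dict:
    """Return {dataset_name: [missing field names]} for a category _metadata.json."""
    with open(metadata_path) as fh:
        meta = json.load(fh)
    problems = {}
    for name, info in meta.get("datasets", {}).items():
        missing = sorted(REQUIRED_FIELDS - info.keys())
        if missing:
            problems[name] = missing
    return problems
```

An empty result means every dataset entry carries the required provenance fields.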
**README files:** include a `README.md` in each category explaining:

- Data sources and update frequencies
- Processing requirements and dependencies
- Known data quality issues
- Contact information for data providers
### 4. Data Quality and Validation

**Validation files:** store validation results alongside the data:

- `[dataset]_validation_report.json`: data quality metrics
- `[dataset]_errors.log`: processing errors and warnings
- `[dataset]_statistics.json`: record counts, completion rates, etc.
**Example validation report:**

```json
{
  "dataset": "iccat_vessels_2024.csv",
  "validation_date": "2024-08-15T10:30:00Z",
  "total_records": 8543,
  "valid_records": 8456,
  "error_records": 87,
  "warnings": 245,
  "quality_score": 0.98,
  "common_issues": [
    "Missing IMO numbers: 87 records",
    "Invalid flag codes: 12 records",
    "Duplicate vessel names: 245 records"
  ]
}
```
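A report in this shape can be assembled from the counts a cleaning script already tracks. This sketch uses one plausible definition of `quality_score` (the valid-record fraction); substitute whatever scoring the team standardizes on.

```python
import json
from datetime import datetime, timezone

def build_validation_report(dataset: str, total: int, errors: int,
                            warnings: int, issues: list) -> dict:
    """Assemble a report in the shape shown above. quality_score is the
    valid-record fraction rounded to two decimals (one plausible definition)."""
    valid = total - errors
    return {
        "dataset": dataset,
        "validation_date": datetime.now(timezone.utc).isoformat(),
        "total_records": total,
        "valid_records": valid,
        "error_records": errors,
        "warnings": warnings,
        "quality_score": round(valid / total, 2) if total else 0.0,
        "common_issues": issues,
    }
```

Serialize the result with `json.dump(report, fh, indent=2)` next to the dataset it describes.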
### 5. Backup and Archival Strategy

**Versioning:** keep previous versions for rollback capability:

```
/import/vessels/archive/
├── 2024-07/
├── 2024-06/
└── 2024-05/
```

**Compression:** use compression for archived data:

- `tar.gz` for folder archives
- `zip` for individual large CSV files
- Document compression ratios for storage planning
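A monthly `tar.gz` snapshot fits this layout directly. The sketch below archives the whole `vessel_data` tree under a `YYYY-MM` name; a real retention policy might archive only `staged/` output, so treat the scope as an assumption.

```python
import tarfile
from pathlib import Path

def archive_month(vessel_data: str, archive_root: str, month: str) -> Path:
    """Pack the vessel_data tree into <archive_root>/<YYYY-MM>.tar.gz."""
    archive_dir = Path(archive_root)
    archive_dir.mkdir(parents=True, exist_ok=True)
    out = archive_dir / f"{month}.tar.gz"
    with tarfile.open(out, "w:gz") as tar:
        # Store everything under the month name so extractions stay self-contained.
        tar.add(vessel_data, arcname=month)
    return out
```

Logging the archive size against the source tree size gives the compression ratios the storage-planning bullet asks for.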
### 6. Access Control and Security

**Sensitive data:** some vessel data may be sensitive:

- IUU lists may contain enforcement-sensitive information
- National registries may have privacy restrictions
- RFMO data may have commercial sensitivity

**Recommended approach:**

- Use environment variables for sensitive file paths
- Implement file permission controls (`chmod 600` for sensitive files)
- Consider encryption for highly sensitive datasets
- Document data classification levels
## Integration with Scripts

### Path References in Scripts

**Absolute paths:** use consistent absolute paths in scripts:

```bash
# Raw data input
RAW_DATA_PATH="/import/vessels/vessel_data/RFMO/raw"
CLEANED_DATA_PATH="/import/vessels/vessel_data/RFMO/cleaned"
STAGED_DATA_PATH="/import/vessels/vessel_data/RFMO/staged"

# Process ICCAT vessels
python3 clean_iccat.py \
    --input "$RAW_DATA_PATH/iccat_vessels_2024.csv" \
    --output "$CLEANED_DATA_PATH/iccat_vessels_cleaned.csv"
```
**Environment variables:** for flexibility across environments:

```bash
# In script header
VESSEL_DATA_ROOT="${VESSEL_DATA_ROOT:-/import/vessels/vessel_data}"
CATEGORY="${CATEGORY:-RFMO}"

# In processing
INPUT_FILE="$VESSEL_DATA_ROOT/$CATEGORY/raw/iccat_vessels_2024.csv"
OUTPUT_FILE="$VESSEL_DATA_ROOT/$CATEGORY/cleaned/iccat_vessels_cleaned.csv"
```
### Configuration Files

**Dataset configuration:** create a config file for each dataset:

```yaml
# configs/iccat_vessels.yaml
dataset:
  name: "ICCAT Vessels"
  category: "RFMO"
  source_url: "https://www.iccat.int/VesselList/"
  update_frequency: "MONTHLY"

files:
  raw: "RFMO/raw/iccat_vessels_2024.csv"
  cleaned: "RFMO/cleaned/iccat_vessels_cleaned.csv"
  staged: "RFMO/staged/iccat_vessels_staged.csv"

processing:
  cleaning_script: "scripts/import/vessels/vessel_data/RFMO/clean_iccat_vessels.sh"
  loading_script: "scripts/import/vessels/vessel_data/RFMO/load_iccat_vessels.sh"

validation:
  required_columns: ["vessel_name", "flag_state", "imo"]
  unique_columns: ["imo"]
  max_error_rate: 0.05
```
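The `validation` block of such a config can drive a generic checker. Below is a standard-library sketch; the `config` dict stands in for the parsed YAML (e.g. from `yaml.safe_load`), and "error rate" is interpreted here as the fraction of rows missing a required value, which is an assumption.

```python
import csv

def validate_csv(path: str, config: dict) -> list:
    """Check a cleaned CSV against the config's validation rules."""
    rules = config["validation"]
    errors = []
    with open(path, newline="") as fh:
        reader = csv.DictReader(fh)
        header = reader.fieldnames or []
        for col in rules["required_columns"]:
            if col not in header:
                errors.append(f"missing required column: {col}")
        rows = list(reader)
    for col in rules["unique_columns"]:
        values = [r.get(col) for r in rows if r.get(col)]
        if len(values) != len(set(values)):
            errors.append(f"duplicate values in unique column: {col}")
    if rows:
        # Assumed definition: rows lacking any required value count as errors.
        bad = sum(1 for r in rows
                  if any(not r.get(c) for c in rules["required_columns"] if c in header))
        if bad / len(rows) > rules["max_error_rate"]:
            errors.append(f"error rate {bad / len(rows):.2f} exceeds {rules['max_error_rate']}")
    return errors
```

An empty return value means the file passes all three rule types; anything else should block promotion to `staged/`.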
## Monitoring and Maintenance

### Disk Space Management

- Monitor `/import/vessels/` disk usage
- Implement automated cleanup of old processed files
- Set retention policies for each processing phase
- Alert when disk usage exceeds thresholds

### Data Freshness Tracking

- Track last update dates for each dataset
- Alert when datasets become stale
- Automate downloads of frequently updated datasets
- Maintain update schedules aligned with source refresh cycles
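Both checks above reduce to a few lines of standard-library code. This sketch uses file modification time as the freshness signal, which is a simplification; the `last_updated` field in `_metadata.json` is the more authoritative source when populated.

```python
import shutil
import time
from pathlib import Path

def disk_usage_pct(path: str) -> float:
    """Percent of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

def stale_datasets(directory: str, max_age_days: float) -> list:
    """List CSVs whose modification time is older than max_age_days."""
    cutoff = time.time() - max_age_days * 86400
    return sorted(p.name for p in Path(directory).glob("*.csv")
                  if p.stat().st_mtime < cutoff)
```

A cron job that alerts when `disk_usage_pct` crosses a threshold or `stale_datasets` is non-empty covers the basic monitoring bullets.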
## Recommendation: Use Option 1 (Mirrored Structure)

**Why Option 1 is best for this use case:**

- **Consistency:** data organization mirrors script organization
- **Navigation:** easy to find data files when working on specific scripts
- **Team efficiency:** RFMO specialists know exactly where to find RFMO data
- **Scalability:** easy to add new datasets to the appropriate category
- **Maintenance:** clear separation makes backup and archival easier
- **Documentation:** each category can carry its own specific documentation

This approach makes 40+ vessel datasets far more manageable while maintaining clear data lineage and processing workflows.