Vessel Data Organization Best Practices

This guide outlines enterprise best practices for organizing vessel data files within the /import/vessels/ directory to match the modular script structure.

Option 1: Mirrored Structure (Recommended)

Mirror the script organization for consistency and easy navigation:

/import/vessels/
├── original_sources_vessels.csv        # Vessel source metadata
└── vessel_data/                        # 40+ datasets organized by category
    ├── RFMO/                           # RFMO vessel datasets
    │   ├── raw/                        # Original unprocessed files
    │   │   ├── iccat_vessels_2024.csv
    │   │   ├── iotc_vessels_2024.csv
    │   │   ├── wcpfc_vessels_2024.csv
    │   │   └── nafo_vessels_2024.csv
    │   ├── cleaned/                    # Post-cleaning, pre-import
    │   │   ├── iccat_vessels_cleaned.csv
    │   │   ├── iotc_vessels_cleaned.csv
    │   │   └── wcpfc_vessels_cleaned.csv
    │   └── staged/                     # Ready for database import
    │       ├── iccat_vessels_staged.csv
    │       └── iotc_vessels_staged.csv
    ├── COUNTRY/                        # National vessel registries
    │   ├── raw/
    │   │   ├── usa_national_registry_2024.csv
    │   │   ├── canada_vessel_registry_2024.csv
    │   │   └── eu_fishing_fleet_2024.csv
    │   ├── cleaned/
    │   │   ├── usa_national_registry_cleaned.csv
    │   │   └── canada_vessel_registry_cleaned.csv
    │   └── staged/
    │       └── usa_national_registry_staged.csv
    ├── OTHER/                          # Organizations, MSC, certification bodies
    │   ├── raw/
    │   │   ├── msc_certified_vessels_2024.csv
    │   │   ├── noaa_fisheries_2024.csv
    │   │   └── wwf_vessel_data_2024.csv
    │   ├── cleaned/
    │   └── staged/
    └── BADDIE/                         # IUU and enforcement datasets
        ├── raw/
        │   ├── global_iuu_list_2024.csv
        │   ├── iccat_iuu_vessels_2024.csv
        │   └── interpol_purple_notices_2024.csv
        ├── cleaned/
        └── staged/

Option 2: Processing-Phase Structure (Alternative)

Organize by processing phase when the processing workflow matters more than the source type:

/import/vessels/
├── original_sources_vessels.csv        # Vessel source metadata
└── vessel_data/
    ├── 01_raw/                         # Original files from all sources
    │   ├── RFMO/
    │   ├── COUNTRY/
    │   ├── OTHER/
    │   └── BADDIE/
    ├── 02_cleaned/                     # Post-cleaning files
    │   ├── RFMO/
    │   ├── COUNTRY/
    │   ├── OTHER/
    │   └── BADDIE/
    └── 03_staged/                      # Ready for import
        ├── RFMO/
        ├── COUNTRY/
        ├── OTHER/
        └── BADDIE/

Best Practices

1. File Naming Conventions

Consistent Pattern: [organization]_[dataset_type]_[year/date]_[status].csv (raw files typically carry the year or date; cleaned and staged files carry the status suffix). A sketch for checking file names against this pattern follows the examples.

Examples:

  • Raw Files: iccat_vessels_2024.csv, usa_national_registry_20240815.csv
  • Cleaned Files: iccat_vessels_cleaned.csv, usa_national_registry_cleaned.csv
  • Staged Files: iccat_vessels_staged.csv, usa_national_registry_staged.csv

Special Cases:

  • Versioned: iccat_vessels_2024_v2.csv (if multiple versions exist)
  • Dated: global_iuu_list_20240815.csv (for frequently updated lists)
  • Combined: rfmo_iuu_combined_2024.csv (for merged datasets)
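
A minimal sketch for enforcing the convention at intake, assuming a bash environment; the regex and the RFMO/raw directory are illustrative and should be adjusted to the suffixes and paths your workflow actually uses:

# Flag CSV files whose names do not follow
# [organization]_[dataset_type]_[year/date]_[status].csv
pattern='^[a-z0-9]+_[a-z0-9_]+_((20[0-9]{2})([01][0-9][0-3][0-9])?|cleaned|staged)(_v[0-9]+)?\.csv$'

for f in /import/vessels/vessel_data/RFMO/raw/*.csv; do
    name=$(basename "$f")
    if [[ ! "$name" =~ $pattern ]]; then
        echo "WARNING: $name does not follow the naming convention" >&2
    fi
done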

2. Data Processing Phases

Phase 1: Raw Data

  • Original files exactly as received from sources
  • Never modify raw files; treat them as immutable (a sketch for locking down new arrivals follows this list)
  • Include source URLs and download dates in metadata
  • Preserve original file names when possible
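
A minimal sketch of the immutability rule, assuming a bash environment with sha256sum available; the checksums.sha256 manifest name is illustrative:

# Record provenance and remove write access when a raw file arrives
RAW_DIR="/import/vessels/vessel_data/RFMO/raw"
NEW_FILE="$RAW_DIR/iccat_vessels_2024.csv"

sha256sum "$NEW_FILE" >> "$RAW_DIR/checksums.sha256"   # checksum manifest for later verification
chmod 444 "$NEW_FILE"                                   # read-only: raw files are never modified

# Verify later that nothing has been altered
sha256sum --check "$RAW_DIR/checksums.sha256"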

Phase 2: Cleaned Data

  • Post-cleaning, standardized format
  • Consistent column names and data types
  • Duplicate removal and data validation complete
  • Ready for matching and deduplication across sources

Phase 3: Staged Data

  • Final format for database import
  • All foreign-key (FK) relationships resolved
  • UUIDs generated where needed
  • Split into appropriate table structures (vessels, vessel_info, etc.)

3. Metadata and Documentation

Data Provenance File: Create _metadata.json in each category folder:

{
  "category": "RFMO",
  "datasets": {
    "iccat_vessels_2024.csv": {
      "source_url": "https://www.iccat.int/VesselList/...",
      "download_date": "2024-08-15",
      "source_contact": "iccat@iccat.int",
      "file_size_mb": 12.5,
      "record_count": 8543,
      "last_updated": "2024-07-30",
      "processing_status": "cleaned",
      "notes": "Contains Atlantic bluefin tuna vessels"
    }
  }
}

README Files: Include README.md in each category explaining:

  • Data sources and update frequencies
  • Processing requirements and dependencies
  • Known data quality issues
  • Contact information for data providers

4. Data Quality and Validation

Validation Files: Store validation results alongside data:

  • [dataset]_validation_report.json - Data quality metrics
  • [dataset]_errors.log - Processing errors and warnings
  • [dataset]_statistics.json - Record counts, completion rates, etc.

Example Validation Report:

{
  "dataset": "iccat_vessels_2024.csv",
  "validation_date": "2024-08-15T10:30:00Z",
  "total_records": 8543,
  "valid_records": 8456,
  "error_records": 87,
  "warnings": 245,
  "quality_score": 0.98,
  "common_issues": [
    "Missing IMO numbers: 87 records",
    "Invalid flag codes: 12 records",
    "Duplicate vessel names: 245 records"
  ]
}
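
A hedged sketch of how such a report can gate the staging step, assuming jq and awk are available; the report filename and the 0.05 threshold simply mirror the examples in this guide:

# Refuse to stage a dataset whose validation report exceeds the error threshold
REPORT="iccat_vessels_2024_validation_report.json"
MAX_ERROR_RATE=0.05   # mirrors max_error_rate in the dataset config

errors=$(jq -r '.error_records' "$REPORT")
total=$(jq -r '.total_records' "$REPORT")
error_rate=$(awk -v e="$errors" -v t="$total" 'BEGIN { print e / t }')

if awk -v r="$error_rate" -v m="$MAX_ERROR_RATE" 'BEGIN { exit !(r <= m) }'; then
    echo "Validation passed (error rate $error_rate); proceeding to staging"
else
    echo "Validation failed (error rate $error_rate exceeds $MAX_ERROR_RATE); skipping staging" >&2
    exit 1
fi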

5. Backup and Archival Strategy

Versioning: Keep previous versions for rollback capability:

/import/vessels/archive/
├── 2024-07/
├── 2024-06/
└── 2024-05/

Compression: Compress archived data (a minimal sketch follows this list):

  • tar.gz for folder archives
  • zip for individual large CSV files
  • Document compression ratios for storage planning
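
A minimal archival sketch combining the monthly layout above with tar.gz compression; it assumes GNU date and tar, and uses the RFMO/staged example paths from this guide:

# Archive last month's RFMO staged files into the monthly archive folder
ARCHIVE_ROOT="/import/vessels/archive"
MONTH=$(date -d "last month" +%Y-%m)    # GNU date syntax; adjust for BSD/macOS
mkdir -p "$ARCHIVE_ROOT/$MONTH"

tar -czf "$ARCHIVE_ROOT/$MONTH/RFMO_staged.tar.gz" \
    -C /import/vessels/vessel_data/RFMO staged/

# Compare sizes to document the compression ratio for storage planning
du -sh /import/vessels/vessel_data/RFMO/staged "$ARCHIVE_ROOT/$MONTH/RFMO_staged.tar.gz"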

6. Access Control and Security

Sensitive Data: Some vessel data may be sensitive:

  • IUU lists may contain enforcement-sensitive information
  • National registries may have privacy restrictions
  • RFMO data may have commercial sensitivity

Recommended Approach:

  • Use environment variables for sensitive file paths
  • Implement file permission controls (600 for sensitive files; see the sketch after this list)
  • Consider encryption for highly sensitive datasets
  • Document data classification levels
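
A minimal permissions sketch, assuming a POSIX shell; the SENSITIVE_DATA_DIR variable and the BADDIE default are illustrative, and the gpg line is only a placeholder for whatever encryption your data classification requires:

# Restrict enforcement-sensitive datasets to the owning service account
SENSITIVE_DIR="${SENSITIVE_DATA_DIR:-/import/vessels/vessel_data/BADDIE}"

find "$SENSITIVE_DIR" -type f -name "*.csv" -exec chmod 600 {} +    # owner read/write only
find "$SENSITIVE_DIR" -type d -exec chmod 700 {} +                  # owner-only traversal

# Optional encryption at rest for highly sensitive files (requires gpg and a configured recipient)
# gpg --encrypt --recipient data-steward@example.org "$SENSITIVE_DIR/raw/global_iuu_list_2024.csv"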

Integration with Scripts

Path References in Scripts

Absolute Paths: Use consistent absolute paths in scripts:

# Paths for RFMO data at each processing phase
RAW_DATA_PATH="/import/vessels/vessel_data/RFMO/raw"
CLEANED_DATA_PATH="/import/vessels/vessel_data/RFMO/cleaned"
STAGED_DATA_PATH="/import/vessels/vessel_data/RFMO/staged"

# Process ICCAT vessels
python3 clean_iccat.py \
    --input "$RAW_DATA_PATH/iccat_vessels_2024.csv" \
    --output "$CLEANED_DATA_PATH/iccat_vessels_cleaned.csv"

Environment Variables: For flexibility across environments:

# In script header
VESSEL_DATA_ROOT="${VESSEL_DATA_ROOT:-/import/vessels/vessel_data}"
CATEGORY="${CATEGORY:-RFMO}"

# In processing
INPUT_FILE="$VESSEL_DATA_ROOT/$CATEGORY/raw/iccat_vessels_2024.csv"
OUTPUT_FILE="$VESSEL_DATA_ROOT/$CATEGORY/cleaned/iccat_vessels_cleaned.csv"

Configuration Files

Dataset Configuration: Create config files for each dataset:

# configs/iccat_vessels.yaml
dataset:
  name: "ICCAT Vessels"
  category: "RFMO"
  source_url: "https://www.iccat.int/VesselList/"
  update_frequency: "MONTHLY"

files:
  raw: "RFMO/raw/iccat_vessels_2024.csv"
  cleaned: "RFMO/cleaned/iccat_vessels_cleaned.csv"
  staged: "RFMO/staged/iccat_vessels_staged.csv"

processing:
  cleaning_script: "scripts/import/vessels/vessel_data/RFMO/clean_iccat_vessels.sh"
  loading_script: "scripts/import/vessels/vessel_data/RFMO/load_iccat_vessels.sh"

validation:
  required_columns: ["vessel_name", "flag_state", "imo"]
  unique_columns: ["imo"]
  max_error_rate: 0.05
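
A hedged sketch of driving a run from this config, assuming the Go yq (v4) CLI is installed; the --input/--output flags are assumptions and should match whatever interface your cleaning scripts actually expose:

# Resolve paths and scripts from the dataset config instead of hard-coding them
CONFIG="configs/iccat_vessels.yaml"
VESSEL_DATA_ROOT="${VESSEL_DATA_ROOT:-/import/vessels/vessel_data}"

RAW_FILE="$VESSEL_DATA_ROOT/$(yq '.files.raw' "$CONFIG")"
CLEANED_FILE="$VESSEL_DATA_ROOT/$(yq '.files.cleaned' "$CONFIG")"
CLEANING_SCRIPT=$(yq '.processing.cleaning_script' "$CONFIG")

# Flag names are assumptions; match them to the cleaning script's interface
bash "$CLEANING_SCRIPT" --input "$RAW_FILE" --output "$CLEANED_FILE"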

Monitoring and Maintenance

Disk Space Management

  • Monitor /import/vessels/ disk usage
  • Implement automated cleanup of old processed files
  • Set retention policies for each processing phase
  • Alert when disk usage exceeds thresholds (a minimal sketch follows this list)
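
A minimal threshold-alert sketch, assuming GNU df; wire the warning into whatever alerting channel the team already uses:

# Warn when the import volume crosses a usage threshold
THRESHOLD=80   # percent; tune to your environment
USAGE=$(df --output=pcent /import/vessels | tail -n 1 | tr -dc '0-9')

if [ "$USAGE" -ge "$THRESHOLD" ]; then
    echo "WARNING: /import/vessels is at ${USAGE}% capacity" >&2
    # hook in mail, Slack webhook, or monitoring agent here
fi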

Data Freshness Tracking

  • Track last update dates for each dataset
  • Alert when datasets become stale (a staleness-check sketch follows this list)
  • Automate download of frequently updated datasets
  • Maintain update schedules aligned with source refresh cycles
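
A minimal staleness check, assuming GNU find; the 35-day window is an illustrative value slightly beyond a monthly refresh cycle:

# List raw datasets that have not been refreshed recently
STALE_DAYS=35
find /import/vessels/vessel_data -path '*/raw/*.csv' -mtime +"$STALE_DAYS" \
    -printf '%TY-%Tm-%Td  %p\n' | sort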

Recommendation: Use Option 1 (Mirrored Structure)

Why Option 1 is Best for Your Use Case:

  1. Consistency: Data organization mirrors script organization
  2. Navigation: Easy to find data files when working on specific scripts
  3. Team Efficiency: RFMO specialists know exactly where to find RFMO data
  4. Scalability: Easy to add new datasets to appropriate categories
  5. Maintenance: Clear separation makes backup and archival easier
  6. Documentation: Each category can have its own specific documentation

This approach will make your 40+ vessel datasets much more manageable while maintaining clear data lineage and processing workflows.