Vessel Data Organization Best Practices

This guide outlines enterprise best practices for organizing vessel data files within the /import/vessels/ directory to match the modular script structure.

Option 1: Mirrored Structure (Recommended)

Mirror the script organization for consistency and easy navigation:

/import/vessels/
├── original_sources_vessels.csv        # Vessel source metadata
└── vessel_data/                        # 40+ datasets organized by category
    ├── RFMO/                           # RFMO vessel datasets
    │   ├── raw/                        # Original unprocessed files
    │   │   ├── iccat_vessels_2024.csv
    │   │   ├── iotc_vessels_2024.csv
    │   │   ├── wcpfc_vessels_2024.csv
    │   │   └── nafo_vessels_2024.csv
    │   ├── cleaned/                    # Post-cleaning, pre-import
    │   │   ├── iccat_vessels_cleaned.csv
    │   │   ├── iotc_vessels_cleaned.csv
    │   │   └── wcpfc_vessels_cleaned.csv
    │   └── staged/                     # Ready for database import
    │       ├── iccat_vessels_staged.csv
    │       └── iotc_vessels_staged.csv
    ├── COUNTRY/                        # National vessel registries
    │   ├── raw/
    │   │   ├── usa_national_registry_2024.csv
    │   │   ├── canada_vessel_registry_2024.csv
    │   │   └── eu_fishing_fleet_2024.csv
    │   ├── cleaned/
    │   │   ├── usa_national_registry_cleaned.csv
    │   │   └── canada_vessel_registry_cleaned.csv
    │   └── staged/
    │       └── usa_national_registry_staged.csv
    ├── OTHER/                          # Organizations, MSC, certification bodies
    │   ├── raw/
    │   │   ├── msc_certified_vessels_2024.csv
    │   │   ├── noaa_fisheries_2024.csv
    │   │   └── wwf_vessel_data_2024.csv
    │   ├── cleaned/
    │   └── staged/
    └── BADDIE/                         # IUU and enforcement datasets
        ├── raw/
        │   ├── global_iuu_list_2024.csv
        │   ├── iccat_iuu_vessels_2024.csv
        │   └── interpol_purple_notices_2024.csv
        ├── cleaned/
        └── staged/

Option 2: Processing-Phase Structure (Alternative)

Organize by processing phase when the processing workflow matters more than the source type:

/import/vessels/
├── original_sources_vessels.csv        # Vessel source metadata
└── vessel_data/
    ├── 01_raw/                         # Original files from all sources
    │   ├── RFMO/
    │   ├── COUNTRY/
    │   ├── OTHER/
    │   └── BADDIE/
    ├── 02_cleaned/                     # Post-cleaning files
    │   ├── RFMO/
    │   ├── COUNTRY/
    │   ├── OTHER/
    │   └── BADDIE/
    └── 03_staged/                      # Ready for import
        ├── RFMO/
        ├── COUNTRY/
        ├── OTHER/
        └── BADDIE/

Best Practices

1. File Naming Conventions

Consistent Pattern: [organization]_[dataset_type]_[year/date]_[status].csv (raw files typically carry the year or date; cleaned and staged files carry the status suffix). A sketch for checking file names against this pattern follows the examples.

Examples:

  • Raw Files: iccat_vessels_2024.csv, usa_national_registry_20240815.csv
  • Cleaned Files: iccat_vessels_cleaned.csv, usa_national_registry_cleaned.csv
  • Staged Files: iccat_vessels_staged.csv, usa_national_registry_staged.csv

Special Cases:

  • Versioned: iccat_vessels_2024_v2.csv (if multiple versions exist)
  • Dated: global_iuu_list_20240815.csv (for frequently updated lists)
  • Combined: rfmo_iuu_combined_2024.csv (for merged datasets)
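
A minimal sketch for enforcing the convention at intake, assuming a bash environment; the regex and the RFMO/raw directory are illustrative and should be adjusted to the suffixes and paths your workflow actually uses:

# Flag CSV files whose names do not follow
# [organization]_[dataset_type]_[year/date]_[status].csv
pattern='^[a-z0-9]+_[a-z0-9_]+_((20[0-9]{2})([01][0-9][0-3][0-9])?|cleaned|staged)(_v[0-9]+)?\.csv$'

for f in /import/vessels/vessel_data/RFMO/raw/*.csv; do
    name=$(basename "$f")
    if [[ ! "$name" =~ $pattern ]]; then
        echo "WARNING: $name does not follow the naming convention" >&2
    fi
done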

2. Data Processing Phases

Phase 1: Raw Data

  • Original files exactly as received from sources
  • Never modify raw files; treat them as immutable (a sketch for locking down new arrivals follows this list)
  • Include source URLs and download dates in metadata
  • Preserve original file names when possible
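
A minimal sketch of the immutability rule, assuming a bash environment with sha256sum available; the checksums.sha256 manifest name is illustrative:

# Record provenance and remove write access when a raw file arrives
RAW_DIR="/import/vessels/vessel_data/RFMO/raw"
NEW_FILE="$RAW_DIR/iccat_vessels_2024.csv"

sha256sum "$NEW_FILE" >> "$RAW_DIR/checksums.sha256"   # checksum manifest for later verification
chmod 444 "$NEW_FILE"                                   # read-only: raw files are never modified

# Verify later that nothing has been altered
sha256sum --check "$RAW_DIR/checksums.sha256"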

Phase 2: Cleaned Data

  • Post-cleaning, standardized format
  • Consistent column names and data types
  • Duplicate removal and data validation complete
  • Ready for matching and deduplication across sources

Phase 3: Staged Data

  • Final format for database import
  • All foreign-key (FK) relationships resolved
  • UUIDs generated where needed
  • Split into appropriate table structures (vessels, vessel_info, etc.)

3. Metadata and Documentation

Data Provenance File: Create _metadata.json in each category folder:

{
  "category": "RFMO",
  "datasets": {
    "iccat_vessels_2024.csv": {
      "source_url": "https://www.iccat.int/VesselList/...",
      "download_date": "2024-08-15",
      "source_contact": "iccat@iccat.int",
      "file_size_mb": 12.5,
      "record_count": 8543,
      "last_updated": "2024-07-30",
      "processing_status": "cleaned",
      "notes": "Contains Atlantic bluefin tuna vessels"
    }
  }
}

README Files: Include README.md in each category explaining:

  • Data sources and update frequencies
  • Processing requirements and dependencies
  • Known data quality issues
  • Contact information for data providers

4. Data Quality and Validation

Validation Files: Store validation results alongside data:

  • [dataset]_validation_report.json - Data quality metrics
  • [dataset]_errors.log - Processing errors and warnings
  • [dataset]_statistics.json - Record counts, completion rates, etc.

Example Validation Report:

{
  "dataset": "iccat_vessels_2024.csv",
  "validation_date": "2024-08-15T10:30:00Z",
  "total_records": 8543,
  "valid_records": 8456,
  "error_records": 87,
  "warnings": 245,
  "quality_score": 0.98,
  "common_issues": [
    "Missing IMO numbers: 87 records",
    "Invalid flag codes: 12 records",
    "Duplicate vessel names: 245 records"
  ]
}
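
A hedged sketch of how such a report can gate the staging step, assuming jq and awk are available; the report filename and the 0.05 threshold simply mirror the examples in this guide:

# Refuse to stage a dataset whose validation report exceeds the error threshold
REPORT="iccat_vessels_2024_validation_report.json"
MAX_ERROR_RATE=0.05   # mirrors max_error_rate in the dataset config

errors=$(jq -r '.error_records' "$REPORT")
total=$(jq -r '.total_records' "$REPORT")
error_rate=$(awk -v e="$errors" -v t="$total" 'BEGIN { print e / t }')

if awk -v r="$error_rate" -v m="$MAX_ERROR_RATE" 'BEGIN { exit !(r <= m) }'; then
    echo "Validation passed (error rate $error_rate); proceeding to staging"
else
    echo "Validation failed (error rate $error_rate exceeds $MAX_ERROR_RATE); skipping staging" >&2
    exit 1
fi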

5. Backup and Archival Strategy

Versioning: Keep previous versions for rollback capability:

/import/vessels/archive/
├── 2024-07/
├── 2024-06/
└── 2024-05/

Compression: Compress archived data (a minimal sketch follows this list):

  • tar.gz for folder archives
  • zip for individual large CSV files
  • Document compression ratios for storage planning
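
A minimal archival sketch combining the monthly layout above with tar.gz compression; it assumes GNU date and tar, and uses the RFMO/staged example paths from this guide:

# Archive last month's RFMO staged files into the monthly archive folder
ARCHIVE_ROOT="/import/vessels/archive"
MONTH=$(date -d "last month" +%Y-%m)    # GNU date syntax; adjust for BSD/macOS
mkdir -p "$ARCHIVE_ROOT/$MONTH"

tar -czf "$ARCHIVE_ROOT/$MONTH/RFMO_staged.tar.gz" \
    -C /import/vessels/vessel_data/RFMO staged/

# Compare sizes to document the compression ratio for storage planning
du -sh /import/vessels/vessel_data/RFMO/staged "$ARCHIVE_ROOT/$MONTH/RFMO_staged.tar.gz"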

6. Access Control and Security

Sensitive Data: Some vessel data may be sensitive:

  • IUU lists may contain enforcement-sensitive information
  • National registries may have privacy restrictions
  • RFMO data may have commercial sensitivity

Recommended Approach:

  • Use environment variables for sensitive file paths
  • Implement file permission controls (600 for sensitive files; see the sketch after this list)
  • Consider encryption for highly sensitive datasets
  • Document data classification levels
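
A minimal permissions sketch, assuming a POSIX shell; the SENSITIVE_DATA_DIR variable and the BADDIE default are illustrative, and the gpg line is only a placeholder for whatever encryption your data classification requires:

# Restrict enforcement-sensitive datasets to the owning service account
SENSITIVE_DIR="${SENSITIVE_DATA_DIR:-/import/vessels/vessel_data/BADDIE}"

find "$SENSITIVE_DIR" -type f -name "*.csv" -exec chmod 600 {} +    # owner read/write only
find "$SENSITIVE_DIR" -type d -exec chmod 700 {} +                  # owner-only traversal

# Optional encryption at rest for highly sensitive files (requires gpg and a configured recipient)
# gpg --encrypt --recipient data-steward@example.org "$SENSITIVE_DIR/raw/global_iuu_list_2024.csv"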

Integration with Scripts

Path References in Scripts

Absolute Paths: Use consistent absolute paths in scripts:

# Paths for RFMO data at each processing phase
RAW_DATA_PATH="/import/vessels/vessel_data/RFMO/raw"
CLEANED_DATA_PATH="/import/vessels/vessel_data/RFMO/cleaned"
STAGED_DATA_PATH="/import/vessels/vessel_data/RFMO/staged"

# Process ICCAT vessels
python3 clean_iccat.py \
    --input "$RAW_DATA_PATH/iccat_vessels_2024.csv" \
    --output "$CLEANED_DATA_PATH/iccat_vessels_cleaned.csv"

Environment Variables: For flexibility across environments:

# In script header
VESSEL_DATA_ROOT="${VESSEL_DATA_ROOT:-/import/vessels/vessel_data}"
CATEGORY="${CATEGORY:-RFMO}"

# In processing
INPUT_FILE="$VESSEL_DATA_ROOT/$CATEGORY/raw/iccat_vessels_2024.csv"
OUTPUT_FILE="$VESSEL_DATA_ROOT/$CATEGORY/cleaned/iccat_vessels_cleaned.csv"

Configuration Files

Dataset Configuration: Create config files for each dataset:

# configs/iccat_vessels.yaml
dataset:
  name: "ICCAT Vessels"
  category: "RFMO"
  source_url: "https://www.iccat.int/VesselList/"
  update_frequency: "MONTHLY"

files:
  raw: "RFMO/raw/iccat_vessels_2024.csv"
  cleaned: "RFMO/cleaned/iccat_vessels_cleaned.csv"
  staged: "RFMO/staged/iccat_vessels_staged.csv"

processing:
  cleaning_script: "scripts/import/vessels/vessel_data/RFMO/clean_iccat_vessels.sh"
  loading_script: "scripts/import/vessels/vessel_data/RFMO/load_iccat_vessels.sh"

validation:
  required_columns: ["vessel_name", "flag_state", "imo"]
  unique_columns: ["imo"]
  max_error_rate: 0.05
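
A hedged sketch of driving a run from this config, assuming the Go yq (v4) CLI is installed; the --input/--output flags are assumptions and should match whatever interface your cleaning scripts actually expose:

# Resolve paths and scripts from the dataset config instead of hard-coding them
CONFIG="configs/iccat_vessels.yaml"
VESSEL_DATA_ROOT="${VESSEL_DATA_ROOT:-/import/vessels/vessel_data}"

RAW_FILE="$VESSEL_DATA_ROOT/$(yq '.files.raw' "$CONFIG")"
CLEANED_FILE="$VESSEL_DATA_ROOT/$(yq '.files.cleaned' "$CONFIG")"
CLEANING_SCRIPT=$(yq '.processing.cleaning_script' "$CONFIG")

# Flag names are assumptions; match them to the cleaning script's interface
bash "$CLEANING_SCRIPT" --input "$RAW_FILE" --output "$CLEANED_FILE"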

Monitoring and Maintenance

Disk Space Management

  • Monitor /import/vessels/ disk usage
  • Implement automated cleanup of old processed files
  • Set retention policies for each processing phase
  • Alert when disk usage exceeds thresholds (a minimal sketch follows this list)
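
A minimal threshold-alert sketch, assuming GNU df; wire the warning into whatever alerting channel the team already uses:

# Warn when the import volume crosses a usage threshold
THRESHOLD=80   # percent; tune to your environment
USAGE=$(df --output=pcent /import/vessels | tail -n 1 | tr -dc '0-9')

if [ "$USAGE" -ge "$THRESHOLD" ]; then
    echo "WARNING: /import/vessels is at ${USAGE}% capacity" >&2
    # hook in mail, Slack webhook, or monitoring agent here
fi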

Data Freshness Tracking

  • Track last update dates for each dataset
  • Alert when datasets become stale (a staleness-check sketch follows this list)
  • Automate download of frequently updated datasets
  • Maintain update schedules aligned with source refresh cycles
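
A minimal staleness check, assuming GNU find; the 35-day window is an illustrative value slightly beyond a monthly refresh cycle:

# List raw datasets that have not been refreshed recently
STALE_DAYS=35
find /import/vessels/vessel_data -path '*/raw/*.csv' -mtime +"$STALE_DAYS" \
    -printf '%TY-%Tm-%Td  %p\n' | sort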

Recommendation: Use Option 1 (Mirrored Structure)

Why Option 1 is Best for Your Use Case:

  1. Consistency: Data organization mirrors script organization
  2. Navigation: Easy to find data files when working on specific scripts
  3. Team Efficiency: RFMO specialists know exactly where to find RFMO data
  4. Scalability: Easy to add new datasets to appropriate categories
  5. Maintenance: Clear separation makes backup and archival easier
  6. Documentation: Each category can have its own specific documentation

This approach will make your 40+ vessel datasets much more manageable while maintaining clear data lineage and processing workflows.