Skip to main content

Source: ebisu/docs/guides/import/vessel-sources-readme.md | ✏️ Edit on GitHub

Vessel Sources Import Scripts

This directory contains scripts to create and populate the original_sources_vessels table, which tracks data sources for vessel information including RFMO registries, national registries, and other vessel databases.

Directory Structure

scripts/import/vessels/
├── create_sources_vessels.sql # SQL schema creation
├── load_sources_vessels.sh # Data import script
├── README_vessel_sources_import.md # This documentation
└── vessel_data/ # 40+ dataset processing organized by category
├── RFMO/ # RFMO vessel datasets
│ ├── clean_iccat_vessels.sh
│ ├── load_iccat_vessels.sh
│ ├── clean_iotc_vessels.sh
│ ├── load_iotc_vessels.sh
│ └── ... [other RFMO datasets]
├── COUNTRY/ # National vessel registries
│ ├── clean_usa_national_registry.sh
│ ├── load_usa_national_registry.sh
│ ├── clean_canada_vessel_registry.sh
│ ├── load_canada_vessel_registry.sh
│ └── ... [other country datasets]
├── OTHER/ # Organizations, MSC, etc.
│ ├── clean_msc_certified_vessels.sh
│ ├── load_msc_certified_vessels.sh
│ ├── clean_noaa_fisheries.sh
│ ├── load_noaa_fisheries.sh
│ └── ... [other organization datasets]
└── BADDIE/ # IUU and enforcement datasets
├── clean_global_iuu_list.sh
├── load_global_iuu_list.sh
├── clean_rfmo_iuu_combined.sh
├── load_rfmo_iuu_combined.sh
└── ... [other enforcement datasets]

Note: Main vessel table schemas are handled by the existing migrations/0002_vessel_tables_migration.sql file.

1. create_sources_vessels.sql

Creates the PostgreSQL table and associated functions for vessel source tracking.

Features:

  • UUID primary keys for unique source identification
  • Multi-type source support using PostgreSQL arrays
  • RFMO integration with rfmo_id field
  • Comprehensive indexing for performance
  • Helper functions for source management
  • Views for source analysis

2. load_sources_vessels.sh

Bash script that processes the CSV file and loads data into the database.

Features:

  • Python-based CSV processing with data cleaning
  • Multi-type source conversion (single type → array)
  • URL combination (source_url[0] + source_url[1] → JSON array)
  • RFMO ID handling
  • Data validation and error handling
  • Progress reporting and verification

3. original_sources_vessels_rfmos.csv

Input CSV file containing vessel source metadata.

3. original_sources_vessels.csv

Input CSV file containing vessel source metadata.

Expected Columns:

  • source_shortname - Short identifier (e.g., "ICCAT_VESSELS", "USA_NATIONAL_REGISTRY")
  • source_fullname - Full descriptive name
  • source_type - Source type from enum: RFMO, RFMO-IUU, COUNTRY, COUNTRY-IUU, INTERGOVERNMENTAL, ORGANIZATION, CERTIFICATION BODY, CIVIL SOCIETY, MSC
  • refresh_date - When data was last refreshed
  • source_url[0] - Primary URL
  • source_url[1] - Secondary URL (optional)
  • update_frequency - How often data updates (MONTHLY, ANNUALLY, etc.)
  • size_approx - Approximate number of records
  • status - Load status (will be set to PENDING initially)
  • rfmo_id - RFMO identifier (will be mapped to UUID automatically)
  • country_id - Country alpha3 code (e.g., "USA") for country sources (will be mapped to UUID)

Usage

Prerequisites

  • PostgreSQL database with write access
  • Python 3 with csv, json modules
  • Environment variables set:
    • POSTGRES_HOST
    • POSTGRES_USER
    • POSTGRES_PASSWORD
    • POSTGRES_DB

Step 1: Create the Table Structure

# Run the SQL script to create table and functions
psql -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB -f create_enhanced_original_sources_vessels.sql

Step 2: Load the Data

# Ensure CSV file is in /import/ directory
cp original_sources_vessels_rfmos.csv /import/

# Run the load script
./load_original_sources_vessels.sh

Database Schema

CREATE TYPE vessel_source_type_enum AS ENUM (
'RFMO', 'RFMO-IUU', 'COUNTRY', 'COUNTRY-IUU',
'INTERGOVERNMENTAL', 'ORGANIZATION', 'CERTIFICATION BODY',
'CIVIL SOCIETY', 'MSC'
);

CREATE TABLE original_sources_vessels (
source_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
source_shortname TEXT UNIQUE NOT NULL,
source_fullname TEXT NOT NULL,
version_year INTEGER,
source_types vessel_source_type_enum[] NOT NULL, -- Enum array for type safety
refresh_date DATE,
source_urls JSONB, -- Combined URLs as JSON array
update_frequency vessel_update_frequency_enum,
size_approx INTEGER,
status vessel_source_status_enum DEFAULT 'PENDING',
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
rfmo_id UUID REFERENCES rfmos(id), -- FK to rfmos table
country_id UUID REFERENCES country_iso(id), -- FK to country_iso table
metadata JSONB -- Additional flexible metadata
);

RFMO & Country Integration

The system automatically maps vessel sources to reference tables based on source type:

Automatic RFMO Mapping

  • When: Only when source_type contains "RFMO"
  • How: Extracts RFMO acronym from source_shortname (e.g., "ICCAT_VESSELS" → "ICCAT")
  • Lookup: Matches rfmo_acronym in rfmos table to get UUID
  • Result: Sets rfmo_id foreign key to link vessel source with RFMO organization

Automatic Country Mapping

  • When: Only when source_type contains "COUNTRY"
  • How: Uses country_id column value as alpha3 code (e.g., "USA")
  • Lookup: Matches alpha3_code in country_iso table to get UUID
  • Result: Sets country_id foreign key to link vessel source with country

Integration Examples

  • ICCAT_VESSELS + source_type="RFMO" → Links to ICCAT in rfmos table
  • USA_NATIONAL_REGISTRY + source_type="COUNTRY" + country_id="USA" → Links to USA in country_iso table
  • NOAA_FISHERIES + source_type="ORGANIZATION" → No automatic mapping
  • MSC_CERTIFIED + source_type="MSC" → No automatic mapping

Vessel Source Type Enum

The system supports 9 standardized source types via enum:

  • RFMO - Regional Fisheries Management Organization vessel lists
  • RFMO-IUU - RFMO Illegal, Unreported, Unregulated fishing lists
  • COUNTRY - National vessel registries and databases
  • COUNTRY-IUU - National IUU vessel lists
  • INTERGOVERNMENTAL - International organization databases
  • ORGANIZATION - Non-governmental organization databases
  • CERTIFICATION BODY - Certification scheme vessel lists
  • CIVIL SOCIETY - NGO and advocacy group databases
  • MSC - Marine Stewardship Council certified vessel lists

Helper Functions

The scripts create several utility functions:

Source Management

  • mark_vessel_source_loaded(shortname, record_count, refresh_date) - Mark source as successfully loaded
  • update_vessel_source_record_count(shortname, table_name) - Update record count from actual table

Source Queries

  • get_vessel_sources_by_type(source_type) - Get all sources of specific enum type
  • get_all_vessel_source_types() - Get breakdown of all source types with counts
  • vessel_source_has_type(shortname, type) - Check if source has specific enum type

Mapping Functions

  • get_rfmo_id_from_source(shortname) - Extract RFMO acronym and lookup UUID
  • get_country_id_from_alpha3(alpha3_code) - Map country alpha3 code to UUID

Views

active_vessel_sources_summary

Summary of active sources by type:

SELECT * FROM active_vessel_sources_summary;

vessel_source_details

Detailed information with freshness status:

SELECT * FROM vessel_source_details;

Example Queries

Find all RFMO vessel sources with organization details

SELECT ovs.source_shortname, ovs.source_fullname, r.rfmo_acronym, r.rfmo_name
FROM original_sources_vessels ovs
JOIN rfmos r ON ovs.rfmo_id = r.id
WHERE 'RFMO'::vessel_source_type_enum = ANY(ovs.source_types);

Find all country vessel sources with country details

SELECT ovs.source_shortname, ovs.source_fullname, c.alpha3_code, c.short_name_en
FROM original_sources_vessels ovs
JOIN country_iso c ON ovs.country_id = c.id
WHERE 'COUNTRY'::vessel_source_type_enum = ANY(ovs.source_types);

Find sources with multiple types

SELECT source_shortname, source_types, array_length(source_types, 1) as type_count
FROM original_sources_vessels
WHERE array_length(source_types, 1) > 1;

Get sources by RFMO organization

SELECT r.rfmo_acronym, COUNT(*) as source_count, 
string_agg(ovs.source_shortname, ', ') as sources
FROM original_sources_vessels ovs
JOIN rfmos r ON ovs.rfmo_id = r.id
GROUP BY r.rfmo_acronym
ORDER BY r.rfmo_acronym;

Get sources by country

SELECT c.alpha3_code, c.short_name_en, COUNT(*) as source_count,
string_agg(ovs.source_shortname, ', ') as sources
FROM original_sources_vessels ovs
JOIN country_iso c ON ovs.country_id = c.id
GROUP BY c.alpha3_code, c.short_name_en
ORDER BY c.alpha3_code;

Get sources by freshness

SELECT source_shortname, freshness_status, refresh_date
FROM vessel_source_details
ORDER BY refresh_date DESC;

Troubleshooting

Common Issues

  1. CSV Format Errors

    • Ensure CSV uses proper comma separation
    • Check for special characters in text fields
    • Verify date format is YYYY-MM-DD or MM/DD/YYYY
  2. Missing Required Fields

    • source_shortname and source_fullname are required
    • source_type must not be empty
    • Check for NULL/NaN values in required columns
  3. URL Processing Issues

    • URLs with special characters may need escaping
    • Both source_url[0] and source_url[1] are optional
    • Invalid URLs are logged but don't stop processing
  4. Enum Validation Issues

    • Source types must be one of the 9 valid enum values
    • Invalid source types are logged and skipped during processing
    • Check CSV for typos in source_type values
    • Enum values are case-sensitive and must be uppercase
  5. RFMO Mapping Issues

    • RFMO mapping only occurs when source_type contains "RFMO"
    • RFMO acronym is extracted from the first part of source_shortname
    • If RFMO acronym not found in rfmos table, rfmo_id remains NULL
    • Check that rfmos table is populated before running vessel sources import
    • Verify RFMO acronyms match exactly (case-sensitive)
  6. Country Mapping Issues

    • Country mapping only occurs when source_type contains "COUNTRY"
    • Country alpha3 code is read directly from country_id column in CSV
    • Alpha3 codes must be exactly 3 alphabetic characters (e.g., "USA", "CAN")
    • If alpha3 code not found in country_iso table, country_id remains NULL
    • Check that country_iso table is populated with proper alpha3_code values
    • Verify alpha3 codes match exactly (case-sensitive)

Log Analysis

The script provides detailed logging:

  • Row processing status
  • Data cleaning results
  • Import verification
  • Query capability testing

Check the console output for warnings about skipped rows or data cleaning issues.

Integration

After loading vessel sources, they can be referenced by:

  • Vessel import scripts
  • Data quality checks
  • Source tracking in vessel tables
  • RFMO-specific processing workflows

The source_id UUID can be used as foreign key in vessel-related tables to track data provenance.