
Source: ebisu/docs/reference/schemas/reference_data/reference_data_documentation.md

Reference Data System Documentation

Overview

The Reference Data System provides a centralized, source-tracked approach to managing taxonomic, geographic, and equipment reference data for marine and fisheries applications. It is built on UUID primary keys and comprehensive data lineage tracking.


🎯 Core Components

1. Central Source Management

  • File: original_sources.csv
  • Table: original_sources
  • Purpose: Centralized tracking of all external data sources with multi-type support

2. Processing Scripts

  • load_original_sources.sh: Multi-type source definitions with array support
  • clean_reference_data.py: Data cleaning and column normalization
  • import_reference_data.sql: UUID-based import with source tracking
  • load_reference_data.sh: Orchestration script with validation

🔄 Process Flow

Phase 1: Source Foundation

load_original_sources.sh
  • Creates original_sources table with TEXT[] multi-type support
  • Processes CSV with URL JSON arrays and status tracking
  • Enables queries like: 'GEAR_DATA' = ANY(source_types)
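As a sketch of what "processes CSV with URL JSON arrays" involves, the snippet below parses a hypothetical two-column extract of original_sources.csv into Python lists; the column names and URLs are illustrative, not taken from the real file:

```python
import csv
import io
import json

# Hypothetical extract of original_sources.csv: a source short name plus a
# JSON array of URLs embedded in a single CSV field.
raw = io.StringIO(
    'source_shortname,source_urls\n'
    'FAO_ISSCFG,"[""https://www.fao.org/a"", ""https://www.fao.org/b""]"\n'
)

urls_by_source = {}
for row in csv.DictReader(raw):
    # json.loads turns the embedded JSON array into a Python list, which
    # maps naturally onto a PostgreSQL JSONB or TEXT[] column.
    urls_by_source[row["source_shortname"]] = json.loads(row["source_urls"])
```

Note the doubled quotes inside the CSV field: that is standard CSV escaping for the JSON string delimiters, and `csv.DictReader` unescapes them before `json.loads` runs.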

Phase 2: Data Cleaning

clean_reference_data.py
  • Removes manual ID generation (lets the database handle UUIDs)
  • Normalizes columns and handles data quality issues
  • Preserves leading zeros in codes (e.g., FAO gear codes)
  • Expands relationships (semicolon-delimited to individual records)
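Two of the cleaning rules above can be sketched as follows; the function and field names (gear_code, the semicolon-delimited related field) are illustrative, not taken from clean_reference_data.py:

```python
def clean_row(row):
    """Drop any pre-generated id and keep codes as strings."""
    # The database assigns UUIDs, so incoming ids are discarded.
    row = {k: v for k, v in row.items() if k != "id"}
    # Treating codes as strings (never int) preserves leading zeros,
    # e.g. a CBP gear code like "051".
    row["gear_code"] = str(row["gear_code"])
    return row

def expand_relationships(code, related):
    """Split a semicolon-delimited field into one record per relationship."""
    return [(code, r.strip()) for r in related.split(";") if r.strip()]
```

For example, `expand_relationships("01.1", "PS; SV;")` yields two relationship records instead of one packed string.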

Phase 3: Enhanced Import

import_reference_data.sql
  • Generates UUID primary keys via gen_random_uuid()
  • Maps to source_id foreign keys where applicable
  • Updates original_sources with status, size, and refresh_date
  • Creates performance indexes on UUIDs and source relationships

Phase 4: Orchestration

load_reference_data.sh
  • Validates prerequisites (checks that the original_sources table exists)
  • Coordinates execution of cleaning and import phases
  • Provides comprehensive logging and verification
  • Reports final mappings and potential issues
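A minimal sketch of the orchestration logic, re-expressed in Python for illustration (the real script is shell; only the phase names follow the scripts listed above):

```python
# Ordered phases that load_reference_data.sh coordinates.
PHASES = ["clean_reference_data.py", "import_reference_data.sql"]

def prerequisites_met(existing_tables):
    """The orchestrator refuses to run unless the source registry exists."""
    return "original_sources" in existing_tables

def plan(existing_tables):
    """Return the ordered phases to execute, or raise if prerequisites fail."""
    if not prerequisites_met(existing_tables):
        raise RuntimeError(
            "original_sources table missing - run load_original_sources.sh first"
        )
    return list(PHASES)
```

The key design point mirrored here is fail-fast validation: the import phases never start against a database that lacks the source registry they write back to.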

📊 Database Structure

External Reference Tables (Source Tracked)

| Table | Source | Purpose | Records |
|---|---|---|---|
| country_iso | ISO | Country codes (ISO 3166) | ~249 |
| fao_major_areas | FAO Major Area | Fishing areas | ~26 |
| gear_types_fao | FAO ISSCFG | FAO gear classifications | ~71 |
| gear_types_cbp | CBP Gear | CBP gear codes | ~41 |
| vessel_types | FAO ISSCFV | Vessel classifications | ~63 |

Internal Reference Tables (Company Data)

| Table | Purpose | Records |
|---|---|---|
| vessel_hull_material | Hull material types | ~14 |
| rfmos | Regional fisheries orgs | ~26 |

Relationship Tables (UUID Mapped)

| Table | Purpose | Mapping |
|---|---|---|
| gear_types_relationship_fao_cbp | FAO ↔ CBP gear mapping | ~69 relationships |
| country_iso_eu | EU membership status | ~249 countries |

Database Benefits

1. UUID Primary Keys

  • Distributed system ready - no ID conflicts across databases
  • Security - non-sequential, non-guessable identifiers
  • Performance - indexed UUID lookups are fast
  • Future-proof - works with replication and sharding
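A small illustration of the "non-sequential, non-guessable" point, using Python's uuid4() as a client-side analogue of PostgreSQL's gen_random_uuid():

```python
import uuid

# uuid4() produces 122 random bits, so ids carry no ordering information
# and cannot be enumerated the way sequential integer keys can.
a, b = uuid.uuid4(), uuid.uuid4()
assert a != b           # collisions are vanishingly unlikely
assert a.version == 4   # random UUID, not sequence-derived
```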

2. Source Lineage Tracking

  • Data provenance - every record traces to its original source
  • Refresh tracking - know when each source was last updated
  • Size estimation - automatic record counting per source
  • Status management - sources move through PENDING, LOADED, and FAILED states
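The status lifecycle can be written down as a transition table. The FAILED → PENDING retry edge and LOADED → PENDING refresh edge below are assumptions for illustration, not documented behavior:

```python
# Allowed status transitions for a row in original_sources.
# Retry (FAILED -> PENDING) and refresh (LOADED -> PENDING) edges
# are assumptions, not confirmed by the loader scripts.
TRANSITIONS = {
    "PENDING": {"LOADED", "FAILED"},
    "FAILED": {"PENDING"},
    "LOADED": {"PENDING"},
}

def can_transition(old, new):
    """True if a source's status may move from old to new."""
    return new in TRANSITIONS.get(old, set())
```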

3. Multi-Type Source Support

  • Complex sources - single source can have multiple data types
  • Flexible queries - find all sources with GEAR_DATA
  • Array operations - 'SPECIES_DATA' = ANY(source_types)

4. Optimized Relationships

  • UUID foreign keys - fast joins without sequential ID dependencies
  • Cascade deletes - referential integrity maintained
  • Performance indexes - all major lookup patterns indexed

🔍 Query Capabilities

Source Analysis

-- Find all gear-related sources
SELECT source_shortname, source_types
FROM original_sources
WHERE 'GEAR_DATA' = ANY(source_types);

-- Get source health status
SELECT source_shortname, status, refresh_date, size_approx
FROM original_sources
WHERE status = 'LOADED'
ORDER BY refresh_date DESC;

Cross-Reference Lookups

-- Find country with source info
SELECT ci.alpha_3_code, ci.short_name_en, os.source_fullname
FROM country_iso ci
JOIN original_sources os ON ci.source_id = os.source_id
WHERE ci.alpha_2_code = 'US';

-- Gear type relationships with sources
SELECT fao.fao_isscfg_name, cbp.cbp_gear_name,
       os1.source_shortname AS fao_source,
       os2.source_shortname AS cbp_source
FROM gear_types_relationship_fao_cbp rel
JOIN gear_types_fao fao ON rel.fao_gear_id = fao.id
JOIN gear_types_cbp cbp ON rel.cbp_gear_id = cbp.id
JOIN original_sources os1 ON fao.source_id = os1.source_id
JOIN original_sources os2 ON cbp.source_id = os2.source_id;

Data Quality Monitoring

-- Sources needing refresh (older than 6 months)
SELECT source_shortname, refresh_date,
       CURRENT_DATE - refresh_date AS days_old
FROM original_sources
WHERE refresh_date < CURRENT_DATE - INTERVAL '6 months'
ORDER BY refresh_date;

-- Record count validation
SELECT source_shortname, size_approx,
       CASE
         WHEN size_approx > 1000 THEN 'Large dataset'
         WHEN size_approx > 100 THEN 'Medium dataset'
         ELSE 'Small dataset'
       END AS dataset_size
FROM original_sources
WHERE status = 'LOADED';

🚀 Future Enhancements

1. Automated Refresh System

  • Schedule-based updates using update_frequency field
  • Change detection via ETL pipelines
  • Notification system for failed refreshes
  • Delta processing for large datasets

2. Data Quality Framework

  • Validation rules per source type
  • Quality scores based on completeness and consistency
  • Anomaly detection for unusual data patterns
  • Data drift monitoring across refresh cycles

3. API Integration

  • RESTful endpoints for source metadata
  • GraphQL schema for complex relationship queries
  • Webhook notifications for source status changes
  • Real-time sync with external source systems

4. Advanced Analytics

  • Data lineage visualization - track data flow through transformations
  • Usage analytics - which sources are queried most frequently
  • Performance metrics - query performance per source
  • Dependency mapping - impact analysis for source changes

5. Multi-Environment Support

  • Environment-specific sources (dev/staging/prod)
  • Source promotion workflows
  • A/B testing capabilities for source updates
  • Rollback mechanisms for problematic updates

6. Extended Metadata

  • Data schemas - store column definitions and constraints
  • Licensing information - usage rights and restrictions
  • Contact information - source maintainer details
  • SLA tracking - uptime and reliability metrics

🔧 Maintenance Tasks

Daily

  • Monitor original_sources status for failures
  • Check log files for import errors

Weekly

  • Review sources with refresh_date older than expected frequency
  • Validate relationship table integrity

Monthly

  • Analyze size_approx trends for unusual growth/shrinkage
  • Update internal reference data (hull materials, RFMOs)
  • Review and cleanup old temporary tables

Quarterly

  • Audit source mappings for new data requirements
  • Performance analysis of UUID indexes
  • Update documentation for new sources or processes

📈 Performance Considerations

Optimizations Implemented

  • UUID indexes on all primary and foreign keys
  • Composite indexes on frequently queried columns
  • GIN indexes on JSONB and array columns
  • Partial indexes on status and date ranges

Query Best Practices

  • Use UUID foreign keys for joins (faster than string lookups)
  • Leverage array operators for multi-type source queries
  • Utilize JSONB operators for URL array searches
  • Consider materialized views for complex multi-source queries

Scaling Strategies

  • Horizontal partitioning by source_type for large datasets
  • Read replicas for analytical workloads
  • Connection pooling for high-concurrency access
  • Caching layer for frequently accessed reference data
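The caching-layer strategy can be sketched in-process with a memoized lookup; the data store stub and function name below are hypothetical:

```python
from functools import lru_cache

# Stand-in for the country_iso table; a real implementation would query
# the database instead of this dict.
COUNTRY_ISO = {"US": "United States of America", "FR": "France"}

@lru_cache(maxsize=1024)
def fetch_country(alpha_2_code):
    # Reference data changes rarely (only on source refresh), so a
    # memoized lookup avoids repeated round-trips for hot codes.
    return COUNTRY_ISO.get(alpha_2_code)
```

Because reference tables are small and nearly static, even a simple per-process cache like this removes most read traffic; the trade-off is that entries must be invalidated (e.g. `fetch_country.cache_clear()`) after a source refresh.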

🛡️ Security & Compliance

Data Protection

  • Non-sequential UUIDs hinder ID enumeration attacks
  • Source URL validation prevents injection attacks
  • Access controls via database roles and permissions
  • Audit logging for all source modifications

Compliance Ready

  • Data lineage for regulatory requirements
  • Source attribution for licensing compliance
  • Retention policies via automated cleanup
  • Change tracking for audit trails