Source:
ebisu/docs/reference/schemas/reference_data/reference_data_documentation.md| ✏️ Edit on GitHub
Reference Data System Documentation
Overview
The Reference Data System provides a centralized, source-tracked approach to managing taxonomic, geographic, and equipment reference data for marine and fisheries applications. Built with UUID primary keys and comprehensive data lineage tracking.
🎯 Core Components
1. Central Source Management
- File:
original_sources.csv - Table:
original_sources - Purpose: Centralized tracking of all external data sources with multi-type support
2. Processing Scripts
load_original_sources.sh: Multi-type source definitions with array supportclean_reference_data.py: Data cleaning and column normalizationimport_reference_data.sql: UUID-based import with source trackingload_reference_data.sh: Orchestration script with validation
🔄 Process Flow
Phase 1: Source Foundation
load_original_sources.sh
- Creates
original_sourcestable with TEXT[] multi-type support - Processes CSV with URL JSON arrays and status tracking
- Enables queries like:
'GEAR_DATA' = ANY(source_types)
Phase 2: Data Cleaning
clean_reference_data.py
- Removes manual ID generation (lets database handle UUIDs)
- Normalizes columns and handles data quality issues
- Preserves leading zeros in codes (e.g., FAO gear codes)
- Expands relationships (semicolon-delimited to individual records)
Phase 3: Enhanced Import
import_reference_data.sql
- Generates UUID primary keys via
gen_random_uuid() - Maps to source_id foreign keys where applicable
- Updates original_sources with status, size, and refresh_date
- Creates performance indexes on UUIDs and source relationships
Phase 4: Orchestration
load_reference_data.sh
- Validates prerequisites (original_sources existence)
- Coordinates execution of cleaning and import phases
- Provides comprehensive logging and verification
- Reports final mappings and potential issues
📊 Database Structure
External Reference Tables (Source Tracked)
| Table | Source | Purpose | Records |
|---|---|---|---|
country_iso | ISO | Country codes (ISO 3166) | ~249 |
fao_major_areas | FAO Major Area | Fishing areas | ~26 |
gear_types_fao | FAO ISSCFG | FAO gear classifications | ~71 |
gear_types_cbp | CBP Gear | CBP gear codes | ~41 |
vessel_types | FAO ISSCFV | Vessel classifications | ~63 |
Internal Reference Tables (Company Data)
| Table | Purpose | Records |
|---|---|---|
vessel_hull_material | Hull material types | ~14 |
rfmos | Regional fisheries orgs | ~26 |
Relationship Tables (UUID Mapped)
| Table | Purpose | Mapping |
|---|---|---|
gear_types_relationship_fao_cbp | FAO ↔ CBP gear mapping | ~69 relationships |
country_iso_eu | EU membership status | ~249 countries |
⚡ Database Benefits
1. UUID Primary Keys
- Distributed system ready - no ID conflicts across databases
- Security - non-sequential, non-guessable identifiers
- Performance - indexed UUID lookups are fast
- Future-proof - works with replication and sharding
2. Source Lineage Tracking
- Data provenance - every record traces to its original source
- Refresh tracking - know when each source was last updated
- Size estimation - automatic record counting per source
- Status management - PENDING → LOADED → FAILED states
3. Multi-Type Source Support
- Complex sources - single source can have multiple data types
- Flexible queries - find all sources with
GEAR_DATA - Array operations -
'SPECIES_DATA' = ANY(source_types)
4. Optimized Relationships
- UUID foreign keys - fast joins without sequential ID dependencies
- Cascade deletes - referential integrity maintained
- Performance indexes - all major lookup patterns indexed
🔍 Query Capabilities
Source Analysis
-- Find all gear-related sources
SELECT source_shortname, source_types
FROM original_sources
WHERE 'GEAR_DATA' = ANY(source_types);
-- Get source health status
SELECT source_shortname, status, refresh_date, size_approx
FROM original_sources
WHERE status = 'LOADED'
ORDER BY refresh_date DESC;
Cross-Reference Lookups
-- Find country with source info
SELECT ci.alpha_3_code, ci.short_name_en, os.source_fullname
FROM country_iso ci
JOIN original_sources os ON ci.source_id = os.source_id
WHERE ci.alpha_2_code = 'US';
-- Gear type relationships with sources
SELECT fao.fao_isscfg_name, cbp.cbp_gear_name,
os1.source_shortname as fao_source,
os2.source_shortname as cbp_source
FROM gear_types_relationship_fao_cbp rel
JOIN gear_types_fao fao ON rel.fao_gear_id = fao.id
JOIN gear_types_cbp cbp ON rel.cbp_gear_id = cbp.id
JOIN original_sources os1 ON fao.source_id = os1.source_id
JOIN original_sources os2 ON cbp.source_id = os2.source_id;
Data Quality Monitoring
-- Sources needing refresh (older than 6 months)
SELECT source_shortname, refresh_date,
CURRENT_DATE - refresh_date as days_old
FROM original_sources
WHERE refresh_date < CURRENT_DATE - INTERVAL '6 months'
ORDER BY refresh_date;
-- Record count validation
SELECT source_shortname, size_approx,
CASE
WHEN size_approx > 1000 THEN 'Large dataset'
WHEN size_approx > 100 THEN 'Medium dataset'
ELSE 'Small dataset'
END as dataset_size
FROM original_sources
WHERE status = 'LOADED';
🚀 Future Enhancements
1. Automated Refresh System
- Schedule-based updates using
update_frequencyfield - Change detection via ETL pipelines
- Notification system for failed refreshes
- Delta processing for large datasets
2. Data Quality Framework
- Validation rules per source type
- Quality scores based on completeness and consistency
- Anomaly detection for unusual data patterns
- Data drift monitoring across refresh cycles
3. API Integration
- RESTful endpoints for source metadata
- GraphQL schema for complex relationship queries
- Webhook notifications for source status changes
- Real-time sync with external source systems
4. Advanced Analytics
- Data lineage visualization - track data flow through transformations
- Usage analytics - which sources are queried most frequently
- Performance metrics - query performance per source
- Dependency mapping - impact analysis for source changes
5. Multi-Environment Support
- Environment-specific sources (dev/staging/prod)
- Source promotion workflows
- A/B testing capabilities for source updates
- Rollback mechanisms for problematic updates
6. Extended Metadata
- Data schemas - store column definitions and constraints
- Licensing information - usage rights and restrictions
- Contact information - source maintainer details
- SLA tracking - uptime and reliability metrics
🔧 Maintenance Tasks
Daily
- Monitor
original_sourcesstatus for failures - Check log files for import errors
Weekly
- Review sources with
refresh_dateolder than expected frequency - Validate relationship table integrity
Monthly
- Analyze
size_approxtrends for unusual growth/shrinkage - Update internal reference data (hull materials, RFMOs)
- Review and cleanup old temporary tables
Quarterly
- Audit source mappings for new data requirements
- Performance analysis of UUID indexes
- Update documentation for new sources or processes
📈 Performance Considerations
Optimizations Implemented
- UUID indexes on all primary and foreign keys
- Composite indexes on frequently queried columns
- GIN indexes on JSONB and array columns
- Partial indexes on status and date ranges
Query Best Practices
- Use UUID foreign keys for joins (faster than string lookups)
- Leverage array operators for multi-type source queries
- Utilize JSONB operators for URL array searches
- Consider materialized views for complex multi-source queries
Scaling Strategies
- Horizontal partitioning by source_type for large datasets
- Read replicas for analytical workloads
- Connection pooling for high-concurrency access
- Caching layer for frequently accessed reference data
🛡️ Security & Compliance
Data Protection
- UUID obfuscation prevents ID enumeration attacks
- Source URL validation prevents injection attacks
- Access controls via database roles and permissions
- Audit logging for all source modifications
Compliance Ready
- Data lineage for regulatory requirements
- Source attribution for licensing compliance
- Retention policies via automated cleanup
- Change tracking for audit trails