Source:
ebisu/docs/reference/schemas/asfis_species/asfis_cleaning_documentation.md| ✏️ Edit on GitHub
ASFIS Data Cleaning & Processing Documentation
📋 Overview
The ASFIS (Aquatic Sciences and Fisheries Information System) data processing pipeline transforms raw FAO fisheries data into a clean, structured database optimized for trade data analysis and annual updates. This system handles the complexities of multi-species trade codes and taxonomic hierarchies while maintaining data integrity and performance.
🔧 Scripts & Components
Core Processing Scripts
| Script | Purpose | Input | Output |
|---|---|---|---|
clean_asfis_data.py | Edge case preprocessing | Raw ASFIS CSV | Preprocessed CSV with edge cases resolved |
clean_asfis.py | Rule-based cleaning & validation | Preprocessed CSV | Schema-compliant cleaned CSV |
import_asfis.sql | PostgreSQL import & setup | Cleaned CSV | Populated asfis_species table |
clean_and_load_asfis.sh | Orchestration pipeline | Raw CSV | Complete database import |
Supporting Infrastructure
| Component | Purpose | Function |
|---|---|---|
original_sources.csv | Centralized metadata | Tracks data sources, versions, refresh dates |
load_original_sources.sh | Source registration | Populates original_sources table |
| Enhanced error handling | Validation & recovery | Prevents partial imports, enables rollback |
🔍 Data Cleaning Process
Phase 1: Edge Case Preprocessing (clean_asfis_data.py)
Handles 130+ edge cases including:
Complex Taxonomic Names:
# Example edge cases resolved:
"Siluriformes (=Siluroidei)" → "Siluridae" (Family)
"Alosa alosa, A. fallax" → Two species: "Alosa alosa" + "Alosa fallax"
"Epinephelus fuscoguttatus x E. lanceolatus" → Hybrid: Parent species separated
"Osmerus spp, Hypomesus spp" → Multiple genera: "Osmerus" + "Hypomesus"
Subgenus Notation Cleanup:
"Holothuria (Holothuria) dakarensis" → "Holothuria dakarensis" (Species)
"Uroteuthis (Photololigo) duvaucelii" → "Uroteuthis duvaucelii" (Species)
"Cambarellus (Cambarellus) patzcuarensis" → "Cambarellus patzcuarensis" (Species)
Taxonomic Rank Detection:
- Species: 2 words, not ending in "spp"
- Genus: 1 word + "spp"
- Family: 1 word ending in "dae"
- Order: 1 word ending in "formes"
- Class: 1 word ending in "ia" or "phyceae"
- Subspecies: 3 words
Output: Expands 13,708 records to 13,733 records (+25 multi-species entries)
Phase 2: Rule-based Cleaning (clean_asfis.py)
Applies data quality rules:
- Schema validation against expected column types
- Null value handling with appropriate defaults
- Text normalization for consistent formatting
- Data type conversion (numeric fields, booleans)
- Character encoding standardization (UTF-8)
Quality Metrics Applied:
✅ Null counts tracked for all columns
✅ Required fields validated (scientificName, Taxonomic_Code)
✅ Alpha3_Code length validation (exactly 3 characters)
✅ ISSCAAP_Group numeric validation
Phase 3: PostgreSQL Import (import_asfis.sql)
Database setup and optimization:
- Table creation with proper column types and constraints
- Index creation for performance optimization
- Source tracking integration with original_sources table
- Data validation with statistical reporting
🗂️ Database Structure
Primary Table: asfis_species
CREATE TABLE asfis_species (
-- Primary key (UUID prevents constraint violations)
asfis_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
-- Trade data structure
"ISSCAAP_Group" NUMERIC, -- FAO statistical classification
"Taxonomic_Code" TEXT NOT NULL, -- FAO taxonomic identifier
"Alpha3_Code" TEXT, -- 3-letter trade code (shared across species)
"taxonRank" TEXT, -- Species, Genus, Family, etc.
"scientificName" TEXT NOT NULL, -- Scientific name for WoRMS/ITIS mapping
-- Multilingual common names
"English_name" TEXT,
"French_name" TEXT,
"Spanish_name" TEXT,
"Arabic_name" TEXT,
"Chinese_name" TEXT,
"Russian_name" TEXT,
-- Taxonomic metadata
"Author" TEXT, -- Taxonomic authority
"Family" TEXT, -- For hierarchical cascading
"Order_or_higher_taxa" TEXT, -- For hierarchical cascading
"FishStat_Data" BOOLEAN, -- Included in FAO statistics
-- Source tracking (critical for annual updates)
source_id UUID REFERENCES original_sources(source_id),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
Performance Optimization Indexes
-- Core indexes for trade data queries
CREATE INDEX idx_asfis_taxonomic_code ON asfis_species("Taxonomic_Code");
CREATE INDEX idx_asfis_alpha3_code ON asfis_species("Alpha3_Code") WHERE "Alpha3_Code" IS NOT NULL;
CREATE INDEX idx_asfis_species_scientific_name ON asfis_species("scientificName");
-- Multilingual search indexes
CREATE INDEX idx_asfis_species_english_name ON asfis_species("English_name") WHERE "English_name" IS NOT NULL;
-- Taxonomic hierarchy indexes (for cascading)
CREATE INDEX idx_asfis_species_family ON asfis_species("Family") WHERE "Family" IS NOT NULL;
CREATE INDEX idx_asfis_species_order ON asfis_species("Order_or_higher_taxa") WHERE "Order_or_higher_taxa" IS NOT NULL;
-- Composite index for trade cascading logic
CREATE INDEX idx_asfis_trade_cascade ON asfis_species("Alpha3_Code", "taxonRank") WHERE "Alpha3_Code" IS NOT NULL;
Source Tracking Integration
-- Links to centralized source management
CREATE TABLE original_sources (
source_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
source_shortname TEXT UNIQUE NOT NULL, -- 'ASFIS'
source_type TEXT NOT NULL, -- 'SPECIES_DATA'
source_url TEXT, -- FAO download URL
refresh_date DATE, -- Last data refresh from FAO
status TEXT DEFAULT 'PENDING', -- 'LOADED', 'PENDING', 'FAILED'
size_approx INTEGER, -- Record count
last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
🎯 Trade Data Implementation Benefits
Multi-Species Trade Code Support
The database structure naturally supports all three ASFIS trade patterns:
Pattern 1: Direct Species Mapping (1:1)
-- Single species per trade code
SELECT * FROM asfis_species WHERE "Alpha3_Code" = 'YFT';
-- Returns: Thunnus albacares (Yellowfin Tuna)
Pattern 2: Multi-Species Mapping (1:Many)
-- Multiple species sharing same trade code
SELECT * FROM asfis_species WHERE "Alpha3_Code" = 'HKC';
-- Returns: Merluccius capensis AND Merluccius paradoxus (Cape Hakes)
Pattern 3: Taxonomic Cascading (1:Hierarchy)
-- Higher taxonomic ranks requiring cascading to species
SELECT * FROM asfis_species WHERE "Alpha3_Code" = 'CAX';
-- Returns: Ariidae (family level)
-- Future: Will cascade to ALL species within Ariidae family via WoRMS/ITIS
WoRMS/ITIS Integration Readiness
The scientificName field is specifically cleaned and normalized for taxonomic database matching:
-- Ready for immediate integration
SELECT
a."Alpha3_Code",
a."scientificName" as asfis_name,
w."scientificName" as worms_name,
w.aphia_id,
i.complete_name as itis_name,
i.tsn
FROM asfis_species a
LEFT JOIN worms_taxonomic_units w ON a."scientificName" = w."scientificName"
LEFT JOIN itis_taxonomic_units i ON a."scientificName" = i.complete_name
WHERE a."taxonRank" IN ('Species', 'Subspecies');
📈 Annual Updates Benefits
Optimized for Incremental Updates
The current structure is designed to support the incremental update strategy:
Source Tracking Enabled
-- Every record links to its data source and version
SELECT
COUNT(*) as total_records,
MAX(updated_at) as last_update,
COUNT(DISTINCT source_id) as data_sources
FROM asfis_species;
Ready for Historical Archiving
-- Structure supports historical table creation
CREATE TABLE asfis_species_historical (
LIKE asfis_species INCLUDING ALL,
archived_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
archived_reason TEXT DEFAULT 'REMOVED_FROM_SOURCE',
data_year INTEGER
);
UUID Primary Keys Prevent Conflicts
- ✅ No constraint violations during UPSERT operations
- ✅ Stable references for API endpoints
- ✅ Clean incremental update logic
Taxonomic_Code + Scientific_Name Composite Keys
-- Enables clean UPSERT logic for incremental updates
CREATE UNIQUE INDEX asfis_species_composite_key
ON asfis_species ("Taxonomic_Code", "scientificName");
-- Annual update UPSERT pattern:
ON CONFLICT ("Taxonomic_Code", "scientificName") DO UPDATE SET
"Alpha3_Code" = EXCLUDED."Alpha3_Code",
"English_name" = EXCLUDED."English_name",
updated_at = CURRENT_TIMESTAMP;
Data Quality & Validation
The cleaning process ensures data quality for reliable annual updates:
- Consistent taxonomic ranks enable automated processing
- Validated Alpha3_Code format (exactly 3 characters)
- Clean scientific names ready for external database matching
- Source tracking enables change detection between years
🚀 Current Capabilities
Trade Data Queries
-- Find all species for a trade code
SELECT "scientificName", "English_name", "Family"
FROM asfis_species
WHERE "Alpha3_Code" = 'COD';
-- Find trade code for a species
SELECT "Alpha3_Code", "English_name", "ISSCAAP_Group"
FROM asfis_species
WHERE "scientificName" LIKE '%Gadus morhua%';
-- Group by trade code assignment type
SELECT
CASE
WHEN COUNT(*) = 1 AND "taxonRank" = 'Species' THEN 'Direct Species'
WHEN COUNT(*) > 1 AND "taxonRank" = 'Species' THEN 'Multi-Species'
WHEN "taxonRank" IN ('Family', 'Order', 'Class') THEN 'Taxonomic Hierarchy'
ELSE 'Other'
END as assignment_type,
COUNT(DISTINCT "Alpha3_Code") as trade_codes,
COUNT(*) as total_records
FROM asfis_species
WHERE "Alpha3_Code" IS NOT NULL
GROUP BY 1;
Multilingual Support
-- Search across all language names
SELECT "Alpha3_Code", "scientificName", "English_name", "French_name", "Spanish_name"
FROM asfis_species
WHERE "English_name" ILIKE '%tuna%'
OR "French_name" ILIKE '%thon%'
OR "Spanish_name" ILIKE '%atún%';
Statistical Analysis
-- ISSCAAP group distribution
SELECT "ISSCAAP_Group", COUNT(*) as species_count
FROM asfis_species
WHERE "ISSCAAP_Group" IS NOT NULL
GROUP BY "ISSCAAP_Group"
ORDER BY species_count DESC;
-- Taxonomic rank distribution
SELECT "taxonRank", COUNT(*) as count,
ROUND(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM asfis_species), 2) as percentage
FROM asfis_species
WHERE "taxonRank" IS NOT NULL
GROUP BY "taxonRank"
ORDER BY count DESC;
🔮 Future Tasks
High Priority: Annual Updates Implementation
Phase 1: Incremental Update Infrastructure (Next Month)
- Historical table creation - Archive removed records
- Change detection logic - Identify what changed between years
- UPSERT functions - Update existing, insert new, archive removed
- Backup procedures - Safe rollback capability
- Validation workflows - Verify update success
Phase 2: Update Orchestration (Following Month)
- Annual update script - Automated pipeline for new ASFIS releases
- Preview functionality - Review changes before applying
- Staging environment - Test updates on copy of production data
- Notification system - Alert when updates complete
- Performance monitoring - Track update times and optimization opportunities
Phase 3: Advanced Features (Ongoing)
- Delta processing - Process only changed records for efficiency
- Data quality monitoring - Automated checks for data integrity
- API integration - Direct FAO data feeds when available
- Change history analysis - Track how trade codes evolve over time
Medium Priority: Performance Optimization
Query Performance
- Materialized views for common trade data queries
- Partial indexes for frequently filtered columns
- Query optimization based on actual usage patterns
- Connection pooling for high-volume API usage
Storage Optimization
- Table partitioning by data year for large historical datasets
- Compression for archived historical data
- Index maintenance automated optimization schedules
Low Priority: Enhanced Functionality
Advanced Trade Data Features
- Trade code evolution tracking - How codes change over time
- Related species suggestions - Based on taxonomic proximity
- Trade volume integration - Link to actual fisheries statistics
- Regional trade code variants - Handle country-specific variations
Integration Enhancements
- Direct WoRMS API integration - Real-time taxonomic validation
- ITIS API integration - Automated North American species verification
- GBIF integration - Additional species occurrence data
- FishBase integration - Comprehensive fish species information
📊 Current System Status
Data Metrics (Post-Fix)
- ✅ 13,733 ASFIS species records (duplication bug resolved)
- ✅ 13,708 unique Taxonomic_Code values (proper normalization)
- ✅ 13,708 unique Alpha3_Code values (trade codes ready)
- ✅ 25 multi-species entries (edge cases properly handled)
- ✅ 100% source tracking coverage (all records linked to sources)
Performance Characteristics
- ⚡ Sub-second queries on indexed columns
- 📊 ~21MB total table size (optimized storage)
- 🔍 11 specialized indexes (comprehensive query coverage)
- 🔗 Clean foreign key relationships (referential integrity)
Data Quality Indicators
- ✅ 91.52% species-level records (highest precision)
- ✅ 5.43% genus-level records (moderate precision)
- ✅ 2.19% family-level records (hierarchical cascading candidates)
- ✅ 0.86% higher taxonomic ranks (specialized trade categories)
The ASFIS data processing system is now optimized for both current trade data analysis and future annual updates, providing a solid foundation for sustainable fisheries data management.