Skip to main content

Source: ebisu/docs/reference/schemas/asfis_species/asfis_cleaning_documentation.md | ✏️ Edit on GitHub

ASFIS Data Cleaning & Processing Documentation

📋 Overview

The ASFIS (Aquatic Sciences and Fisheries Information System) data processing pipeline transforms raw FAO fisheries data into a clean, structured database optimized for trade data analysis and annual updates. This system handles the complexities of multi-species trade codes and taxonomic hierarchies while maintaining data integrity and performance.


🔧 Scripts & Components

Core Processing Scripts

ScriptPurposeInputOutput
clean_asfis_data.pyEdge case preprocessingRaw ASFIS CSVPreprocessed CSV with edge cases resolved
clean_asfis.pyRule-based cleaning & validationPreprocessed CSVSchema-compliant cleaned CSV
import_asfis.sqlPostgreSQL import & setupCleaned CSVPopulated asfis_species table
clean_and_load_asfis.shOrchestration pipelineRaw CSVComplete database import

Supporting Infrastructure

ComponentPurposeFunction
original_sources.csvCentralized metadataTracks data sources, versions, refresh dates
load_original_sources.shSource registrationPopulates original_sources table
Enhanced error handlingValidation & recoveryPrevents partial imports, enables rollback

🔍 Data Cleaning Process

Phase 1: Edge Case Preprocessing (clean_asfis_data.py)

Handles 130+ edge cases including:

Complex Taxonomic Names:

# Example edge cases resolved:
"Siluriformes (=Siluroidei)""Siluridae" (Family)
"Alosa alosa, A. fallax" → Two species: "Alosa alosa" + "Alosa fallax"
"Epinephelus fuscoguttatus x E. lanceolatus" → Hybrid: Parent species separated
"Osmerus spp, Hypomesus spp" → Multiple genera: "Osmerus" + "Hypomesus"

Subgenus Notation Cleanup:

"Holothuria (Holothuria) dakarensis""Holothuria dakarensis" (Species)
"Uroteuthis (Photololigo) duvaucelii""Uroteuthis duvaucelii" (Species)
"Cambarellus (Cambarellus) patzcuarensis""Cambarellus patzcuarensis" (Species)

Taxonomic Rank Detection:

  • Species: 2 words, not ending in "spp"
  • Genus: 1 word + "spp"
  • Family: 1 word ending in "dae"
  • Order: 1 word ending in "formes"
  • Class: 1 word ending in "ia" or "phyceae"
  • Subspecies: 3 words

Output: Expands 13,708 records to 13,733 records (+25 multi-species entries)

Phase 2: Rule-based Cleaning (clean_asfis.py)

Applies data quality rules:

  • Schema validation against expected column types
  • Null value handling with appropriate defaults
  • Text normalization for consistent formatting
  • Data type conversion (numeric fields, booleans)
  • Character encoding standardization (UTF-8)

Quality Metrics Applied:

✅ Null counts tracked for all columns
✅ Required fields validated (scientificName, Taxonomic_Code)
✅ Alpha3_Code length validation (exactly 3 characters)
✅ ISSCAAP_Group numeric validation

Phase 3: PostgreSQL Import (import_asfis.sql)

Database setup and optimization:

  • Table creation with proper column types and constraints
  • Index creation for performance optimization
  • Source tracking integration with original_sources table
  • Data validation with statistical reporting

🗂️ Database Structure

Primary Table: asfis_species

CREATE TABLE asfis_species (
-- Primary key (UUID prevents constraint violations)
asfis_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),

-- Trade data structure
"ISSCAAP_Group" NUMERIC, -- FAO statistical classification
"Taxonomic_Code" TEXT NOT NULL, -- FAO taxonomic identifier
"Alpha3_Code" TEXT, -- 3-letter trade code (shared across species)
"taxonRank" TEXT, -- Species, Genus, Family, etc.
"scientificName" TEXT NOT NULL, -- Scientific name for WoRMS/ITIS mapping

-- Multilingual common names
"English_name" TEXT,
"French_name" TEXT,
"Spanish_name" TEXT,
"Arabic_name" TEXT,
"Chinese_name" TEXT,
"Russian_name" TEXT,

-- Taxonomic metadata
"Author" TEXT, -- Taxonomic authority
"Family" TEXT, -- For hierarchical cascading
"Order_or_higher_taxa" TEXT, -- For hierarchical cascading
"FishStat_Data" BOOLEAN, -- Included in FAO statistics

-- Source tracking (critical for annual updates)
source_id UUID REFERENCES original_sources(source_id),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Performance Optimization Indexes

-- Core indexes for trade data queries
CREATE INDEX idx_asfis_taxonomic_code ON asfis_species("Taxonomic_Code");
CREATE INDEX idx_asfis_alpha3_code ON asfis_species("Alpha3_Code") WHERE "Alpha3_Code" IS NOT NULL;
CREATE INDEX idx_asfis_species_scientific_name ON asfis_species("scientificName");

-- Multilingual search indexes
CREATE INDEX idx_asfis_species_english_name ON asfis_species("English_name") WHERE "English_name" IS NOT NULL;

-- Taxonomic hierarchy indexes (for cascading)
CREATE INDEX idx_asfis_species_family ON asfis_species("Family") WHERE "Family" IS NOT NULL;
CREATE INDEX idx_asfis_species_order ON asfis_species("Order_or_higher_taxa") WHERE "Order_or_higher_taxa" IS NOT NULL;

-- Composite index for trade cascading logic
CREATE INDEX idx_asfis_trade_cascade ON asfis_species("Alpha3_Code", "taxonRank") WHERE "Alpha3_Code" IS NOT NULL;

Source Tracking Integration

-- Links to centralized source management
CREATE TABLE original_sources (
source_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
source_shortname TEXT UNIQUE NOT NULL, -- 'ASFIS'
source_type TEXT NOT NULL, -- 'SPECIES_DATA'
source_url TEXT, -- FAO download URL
refresh_date DATE, -- Last data refresh from FAO
status TEXT DEFAULT 'PENDING', -- 'LOADED', 'PENDING', 'FAILED'
size_approx INTEGER, -- Record count
last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

🎯 Trade Data Implementation Benefits

Multi-Species Trade Code Support

The database structure naturally supports all three ASFIS trade patterns:

Pattern 1: Direct Species Mapping (1:1)

-- Single species per trade code
SELECT * FROM asfis_species WHERE "Alpha3_Code" = 'YFT';
-- Returns: Thunnus albacares (Yellowfin Tuna)

Pattern 2: Multi-Species Mapping (1:Many)

-- Multiple species sharing same trade code
SELECT * FROM asfis_species WHERE "Alpha3_Code" = 'HKC';
-- Returns: Merluccius capensis AND Merluccius paradoxus (Cape Hakes)

Pattern 3: Taxonomic Cascading (1:Hierarchy)

-- Higher taxonomic ranks requiring cascading to species
SELECT * FROM asfis_species WHERE "Alpha3_Code" = 'CAX';
-- Returns: Ariidae (family level)
-- Future: Will cascade to ALL species within Ariidae family via WoRMS/ITIS

WoRMS/ITIS Integration Readiness

The scientificName field is specifically cleaned and normalized for taxonomic database matching:

-- Ready for immediate integration
SELECT
a."Alpha3_Code",
a."scientificName" as asfis_name,
w."scientificName" as worms_name,
w.aphia_id,
i.complete_name as itis_name,
i.tsn
FROM asfis_species a
LEFT JOIN worms_taxonomic_units w ON a."scientificName" = w."scientificName"
LEFT JOIN itis_taxonomic_units i ON a."scientificName" = i.complete_name
WHERE a."taxonRank" IN ('Species', 'Subspecies');

📈 Annual Updates Benefits

Optimized for Incremental Updates

The current structure is designed to support the incremental update strategy:

Source Tracking Enabled

-- Every record links to its data source and version
SELECT
COUNT(*) as total_records,
MAX(updated_at) as last_update,
COUNT(DISTINCT source_id) as data_sources
FROM asfis_species;

Ready for Historical Archiving

-- Structure supports historical table creation
CREATE TABLE asfis_species_historical (
LIKE asfis_species INCLUDING ALL,
archived_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
archived_reason TEXT DEFAULT 'REMOVED_FROM_SOURCE',
data_year INTEGER
);

UUID Primary Keys Prevent Conflicts

  • ✅ No constraint violations during UPSERT operations
  • ✅ Stable references for API endpoints
  • ✅ Clean incremental update logic

Taxonomic_Code + Scientific_Name Composite Keys

-- Enables clean UPSERT logic for incremental updates
CREATE UNIQUE INDEX asfis_species_composite_key
ON asfis_species ("Taxonomic_Code", "scientificName");

-- Annual update UPSERT pattern:
ON CONFLICT ("Taxonomic_Code", "scientificName") DO UPDATE SET
"Alpha3_Code" = EXCLUDED."Alpha3_Code",
"English_name" = EXCLUDED."English_name",
updated_at = CURRENT_TIMESTAMP;

Data Quality & Validation

The cleaning process ensures data quality for reliable annual updates:

  • Consistent taxonomic ranks enable automated processing
  • Validated Alpha3_Code format (exactly 3 characters)
  • Clean scientific names ready for external database matching
  • Source tracking enables change detection between years

🚀 Current Capabilities

Trade Data Queries

-- Find all species for a trade code
SELECT "scientificName", "English_name", "Family"
FROM asfis_species
WHERE "Alpha3_Code" = 'COD';

-- Find trade code for a species
SELECT "Alpha3_Code", "English_name", "ISSCAAP_Group"
FROM asfis_species
WHERE "scientificName" LIKE '%Gadus morhua%';

-- Group by trade code assignment type
SELECT
CASE
WHEN COUNT(*) = 1 AND "taxonRank" = 'Species' THEN 'Direct Species'
WHEN COUNT(*) > 1 AND "taxonRank" = 'Species' THEN 'Multi-Species'
WHEN "taxonRank" IN ('Family', 'Order', 'Class') THEN 'Taxonomic Hierarchy'
ELSE 'Other'
END as assignment_type,
COUNT(DISTINCT "Alpha3_Code") as trade_codes,
COUNT(*) as total_records
FROM asfis_species
WHERE "Alpha3_Code" IS NOT NULL
GROUP BY 1;

Multilingual Support

-- Search across all language names
SELECT "Alpha3_Code", "scientificName", "English_name", "French_name", "Spanish_name"
FROM asfis_species
WHERE "English_name" ILIKE '%tuna%'
OR "French_name" ILIKE '%thon%'
OR "Spanish_name" ILIKE '%atún%';

Statistical Analysis

-- ISSCAAP group distribution
SELECT "ISSCAAP_Group", COUNT(*) as species_count
FROM asfis_species
WHERE "ISSCAAP_Group" IS NOT NULL
GROUP BY "ISSCAAP_Group"
ORDER BY species_count DESC;

-- Taxonomic rank distribution
SELECT "taxonRank", COUNT(*) as count,
ROUND(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM asfis_species), 2) as percentage
FROM asfis_species
WHERE "taxonRank" IS NOT NULL
GROUP BY "taxonRank"
ORDER BY count DESC;

🔮 Future Tasks

High Priority: Annual Updates Implementation

Phase 1: Incremental Update Infrastructure (Next Month)

  • Historical table creation - Archive removed records
  • Change detection logic - Identify what changed between years
  • UPSERT functions - Update existing, insert new, archive removed
  • Backup procedures - Safe rollback capability
  • Validation workflows - Verify update success

Phase 2: Update Orchestration (Following Month)

  • Annual update script - Automated pipeline for new ASFIS releases
  • Preview functionality - Review changes before applying
  • Staging environment - Test updates on copy of production data
  • Notification system - Alert when updates complete
  • Performance monitoring - Track update times and optimization opportunities

Phase 3: Advanced Features (Ongoing)

  • Delta processing - Process only changed records for efficiency
  • Data quality monitoring - Automated checks for data integrity
  • API integration - Direct FAO data feeds when available
  • Change history analysis - Track how trade codes evolve over time

Medium Priority: Performance Optimization

Query Performance

  • Materialized views for common trade data queries
  • Partial indexes for frequently filtered columns
  • Query optimization based on actual usage patterns
  • Connection pooling for high-volume API usage

Storage Optimization

  • Table partitioning by data year for large historical datasets
  • Compression for archived historical data
  • Index maintenance automated optimization schedules

Low Priority: Enhanced Functionality

Advanced Trade Data Features

  • Trade code evolution tracking - How codes change over time
  • Related species suggestions - Based on taxonomic proximity
  • Trade volume integration - Link to actual fisheries statistics
  • Regional trade code variants - Handle country-specific variations

Integration Enhancements

  • Direct WoRMS API integration - Real-time taxonomic validation
  • ITIS API integration - Automated North American species verification
  • GBIF integration - Additional species occurrence data
  • FishBase integration - Comprehensive fish species information

📊 Current System Status

Data Metrics (Post-Fix)

  • 13,733 ASFIS species records (duplication bug resolved)
  • 13,708 unique Taxonomic_Code values (proper normalization)
  • 13,708 unique Alpha3_Code values (trade codes ready)
  • 25 multi-species entries (edge cases properly handled)
  • 100% source tracking coverage (all records linked to sources)

Performance Characteristics

  • Sub-second queries on indexed columns
  • 📊 ~21MB total table size (optimized storage)
  • 🔍 11 specialized indexes (comprehensive query coverage)
  • 🔗 Clean foreign key relationships (referential integrity)

Data Quality Indicators

  • 91.52% species-level records (highest precision)
  • 5.43% genus-level records (moderate precision)
  • 2.19% family-level records (hierarchical cascading candidates)
  • 0.86% higher taxonomic ranks (specialized trade categories)

The ASFIS data processing system is now optimized for both current trade data analysis and future annual updates, providing a solid foundation for sustainable fisheries data management.