
Ebisu Import Architecture

Overview

This document explains how Ebisu's import system implements the intelligence-platform principle that multiple reports are confirmations, not duplicates.

Core Architecture

Two-Layer Deduplication Strategy

1. File-Level Deduplication (Infrastructure)

  • Purpose: Prevent processing the same file twice
  • Method: SHA-256 hash of entire file
  • Table: data_lineage
  • Benefit: Saves processing resources
# Example from import script
FILE_HASH=$(sha256sum "$INPUT_FILE" | cut -d' ' -f1)
EXISTING_LINEAGE=$(execute_sql "
    SELECT lineage_id FROM data_lineage
    WHERE source_file_hash = '$FILE_HASH'
")

2. Intelligence Confirmations (Analysis)

  • Purpose: Track how many sources report the same vessel
  • Method: vessel_key_hash (IMO + name + flag)
  • Table: vessel_identity_confirmations
  • Benefit: Increases data confidence
-- Example: Finding vessels confirmed by multiple sources
SELECT
    vessel_name,
    confirmation_count,
    confirming_source_names
FROM vessel_identity_confirmations
WHERE confirmation_count > 1;

Import Process Flow

1. File Validation
├── Check file hash (prevent duplicate processing)
└── Validate data quality

2. Batch Creation
├── Create import batch record
├── Link to previous batch (for change detection)
└── Record file metadata
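
A minimal sketch of the batch-record insert; batch_id, source_file_hash, and previous_batch_id are assumed column names, while rfmo_shortname, import_date, and is_current appear in the monitoring queries later in this document:

-- Sketch: create a batch record linked to its predecessor
-- (column names beyond the monitoring queries are assumptions)
INSERT INTO intelligence_import_batches (
    batch_id, rfmo_shortname, import_date,
    source_file_hash, previous_batch_id, is_current
) VALUES (
    gen_random_uuid(), 'NAFO', now(),
    :file_hash, :previous_batch_id, true
);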

3. Raw Data Import
├── Load to staging table
├── Convert to intelligence_reports
├── Preserve ALL raw data in JSONB
└── Generate vessel_key_hash
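
A plausible sketch of vessel_key_hash generation, assuming an MD5 over normalized IMO, name, and flag (the column names and normalization rules are assumptions):

-- Sketch: derive vessel_key_hash from normalized identity fields
UPDATE intelligence_reports
SET vessel_key_hash = md5(
    coalesce(trim(imo_number), '')         || '|' ||
    coalesce(upper(trim(vessel_name)), '') || '|' ||
    coalesce(upper(trim(flag_code)), '')
)
WHERE import_batch_id = :batch_id;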

4. Intelligence Extraction
├── Extract structured vessel_intelligence
├── Calculate completeness scores
└── Maintain 1:1 with reports
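
One way the completeness score could be computed, assuming it is the share of populated key identity fields (the field list and equal weighting are assumptions):

-- Sketch: completeness = share of key fields that are non-null
UPDATE vessel_intelligence
SET completeness_score = (
    (imo_number  IS NOT NULL)::int +
    (vessel_name IS NOT NULL)::int +
    (flag_code   IS NOT NULL)::int +
    (call_sign   IS NOT NULL)::int
)::numeric / 4
WHERE import_batch_id = :batch_id;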

5. Confirmation Tracking
├── Update vessel_identity_confirmations
├── Count confirming sources
└── Calculate confidence scores
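
The aggregation behind vessel_identity_confirmations presumably resembles the following; the source_name column and the linear confidence formula are assumptions, though the formula matches the worked example below (three sources → 0.3):

-- Sketch: one confirmation row per vessel key, across all sources
SELECT
    vessel_key_hash,
    count(DISTINCT source_name)     AS confirmation_count,
    array_agg(DISTINCT source_name) AS confirming_source_names,
    least(count(DISTINCT source_name) * 0.1, 1.0) AS confidence_score
FROM vessel_intelligence
GROUP BY vessel_key_hash;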

6. Change Detection
├── Compare with previous batch
├── Log changes to intelligence_change_log
└── Score risk of changes
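
A sketch of the batch-to-batch comparison: a FULL OUTER JOIN classifies each vessel as NEW, UPDATED, or REMOVED (the raw_data and import_batch_id column names are assumptions):

-- Sketch: classify vessels between two batches;
-- unchanged vessels fall through with a NULL change_type
WITH cur AS (
    SELECT vessel_key_hash, raw_data FROM vessel_intelligence
    WHERE import_batch_id = :current_batch_id
), prev AS (
    SELECT vessel_key_hash, raw_data FROM vessel_intelligence
    WHERE import_batch_id = :previous_batch_id
)
SELECT
    vessel_key_hash,
    CASE
        WHEN prev.vessel_key_hash IS NULL THEN 'NEW'
        WHEN cur.vessel_key_hash  IS NULL THEN 'REMOVED'
        WHEN cur.raw_data IS DISTINCT FROM prev.raw_data THEN 'UPDATED'
    END AS change_type
FROM cur
FULL OUTER JOIN prev USING (vessel_key_hash);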

Table Relationships

intelligence_reports
    ↓ 1:1
vessel_intelligence
    ↓ vessel_key_hash
vessel_identity_confirmations (aggregated view)
    ↓
cross-source analysis
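
Tracing a single vessel through the layers; report_id as the 1:1 join key and the source_name column are assumptions:

-- Sketch: follow one vessel from raw reports to its confirmation row
SELECT r.source_name, v.vessel_name, c.confirmation_count
FROM intelligence_reports r
JOIN vessel_intelligence v USING (report_id)
JOIN vessel_identity_confirmations c
  ON c.vessel_key_hash = v.vessel_key_hash
WHERE v.vessel_name = :vessel_name;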

Key Functions

rebuild_vessel_confirmations()

Rebuilds confirmation tracking from all sources:

-- Aggregates all reports by vessel_key_hash
-- Counts confirming sources
-- Calculates confidence scores
SELECT * FROM rebuild_vessel_confirmations();

update_vessel_confirmations(batch_id)

Updates confirmations for a specific import:

-- Incremental update after new import
-- Tracks new and updated confirmations
SELECT * FROM update_vessel_confirmations('batch-uuid');

detect_vessel_changes(current_batch, previous_batch)

Identifies changes between imports:

-- Compares vessel data between batches
-- Identifies NEW, UPDATED, REMOVED vessels
-- Calculates risk scores for changes
SELECT * FROM detect_vessel_changes('current-batch-uuid', 'previous-batch-uuid');

Import Script Template

The enhanced import template (template_enhanced_staged_import.sh) implements all these principles:

  1. File Deduplication Check
     # Lines 47-72: Check if exact file already imported
  2. Incremental Import Support
     # Lines 78-102: Detect previous batch for changes
  3. Raw Data Preservation
     # Lines 303-333: Store complete raw data as JSONB
  4. Confirmation Updates
     # Lines 503-513: Update cross-source confirmations
  5. Change Detection
     # Lines 441-496: Detect and log changes

Best Practices

DO

  • ✅ Preserve all raw data in JSONB
  • ✅ Track file lineage with hashes
  • ✅ Update confirmations after each import
  • ✅ Use temporal tracking (valid_from/valid_to; see the sketch after this list)
  • ✅ Calculate confidence based on confirmations
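
For the temporal-tracking item, a minimal sketch of the close-out-then-insert pattern; applying it to vessel_intelligence and the :key_hash/:raw_data parameters are assumptions:

-- Sketch: close the previous record, then insert the new one,
-- so no report is ever deleted or overwritten
UPDATE vessel_intelligence
SET valid_to = now()
WHERE vessel_key_hash = :key_hash
  AND valid_to IS NULL;

INSERT INTO vessel_intelligence (vessel_key_hash, raw_data, valid_from, valid_to)
VALUES (:key_hash, :raw_data, now(), NULL);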

DON'T

  • ❌ Skip reports because "already have this vessel"
  • ❌ Merge data from different sources
  • ❌ Delete old reports when updating
  • ❌ Treat confirmations as duplicates
  • ❌ Modify raw data during import

Example: Multi-Source Confirmation

When NAFO, NEAFC, and ICCAT all report vessel "ATLANTIC STAR":

  1. Three separate intelligence_reports records created
  2. Three separate vessel_intelligence records created
  3. One vessel_identity_confirmations record shows:
    • confirmation_count: 3
    • confirming_source_names: ['NAFO', 'NEAFC', 'ICCAT']
    • confidence_score: 0.3 (high confidence)
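
The aggregated row can then be inspected directly, using the same columns as the queries above:

-- Look up the confirmation row for this vessel
SELECT vessel_name, confirmation_count,
       confirming_source_names, confidence_score
FROM vessel_identity_confirmations
WHERE vessel_name = 'ATLANTIC STAR';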

Monitoring Import Health

Check Import Status

SELECT
    rfmo_shortname,
    import_date,
    raw_records_count,
    intelligence_extracted_count,
    is_current
FROM intelligence_import_batches
ORDER BY import_date DESC;

Check Confirmation Distribution

SELECT
    confirmation_count,
    COUNT(*) AS vessel_count
FROM vessel_identity_confirmations
GROUP BY confirmation_count
ORDER BY confirmation_count;

Check Source Overlaps

-- Which sources confirm each other most
SELECT
source_1,
source_2,
shared_vessels
FROM source_overlap_analysis
WHERE shared_vessels > 100
ORDER BY shared_vessels DESC;

Troubleshooting

"Duplicate key violation"

  • This is CORRECT behavior when reimporting the same file
  • The system is preventing duplicate processing
  • Either skip the file or force a new version

"No confirmations found"

  • Check vessel_key_hash generation
  • Ensure consistent IMO/name/flag formatting
  • Run rebuild_vessel_confirmations()

"Data loss detected"

  • Check that the staging table structure matches the data
  • Verify CSV parsing handles embedded newlines
  • Check for encoding issues
