# Ebisu Import Architecture

## Overview
This document explains how Ebisu's import system implements the intelligence-platform principle that multiple reports about the same vessel are confirmations, not duplicates.
## Core Architecture

### Two-Layer Deduplication Strategy
#### 1. File-Level Deduplication (Infrastructure)

- **Purpose**: Prevent processing the same file twice
- **Method**: SHA-256 hash of the entire file
- **Table**: `data_lineage`
- **Benefit**: Saves processing resources
```bash
# Example from import script
FILE_HASH=$(sha256sum "$INPUT_FILE" | cut -d' ' -f1)
EXISTING_LINEAGE=$(execute_sql "
    SELECT lineage_id FROM data_lineage
    WHERE source_file_hash = '$FILE_HASH'
")
```
#### 2. Intelligence Confirmations (Analysis)

- **Purpose**: Track how many sources report the same vessel
- **Method**: `vessel_key_hash` (IMO + name + flag)
- **Table**: `vessel_identity_confirmations`
- **Benefit**: Increases data confidence
```sql
-- Example: Finding vessels confirmed by multiple sources
SELECT
    vessel_name,
    confirmation_count,
    confirming_source_names
FROM vessel_identity_confirmations
WHERE confirmation_count > 1;
```
## Import Process Flow
```
1. File Validation
   ├── Check file hash (prevent duplicate processing)
   └── Validate data quality

2. Batch Creation
   ├── Create import batch record
   ├── Link to previous batch (for change detection)
   └── Record file metadata

3. Raw Data Import
   ├── Load to staging table
   ├── Convert to intelligence_reports
   ├── Preserve ALL raw data in JSONB
   └── Generate vessel_key_hash

4. Intelligence Extraction
   ├── Extract structured vessel_intelligence
   ├── Calculate completeness scores
   └── Maintain 1:1 with reports

5. Confirmation Tracking
   ├── Update vessel_identity_confirmations
   ├── Count confirming sources
   └── Calculate confidence scores

6. Change Detection
   ├── Compare with previous batch
   ├── Log changes to intelligence_change_log
   └── Score risk of changes
```
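
The `vessel_key_hash` generated in step 3 is what ties steps 4-6 together. As a minimal sketch of how such a hash could be derived, assuming MD5 over pipe-separated, normalized identity fields and hypothetical staging column names (`imo_number`, `flag_code`, `staging_vessels`); the production normalization rules may differ:

```sql
-- Hypothetical sketch: derive a stable vessel key from identity fields.
-- Assumes uppercase/trim normalization and '|' as a separator; the
-- real import scripts may normalize differently.
SELECT md5(
    coalesce(trim(imo_number), '')         || '|' ||
    coalesce(upper(trim(vessel_name)), '') || '|' ||
    coalesce(upper(trim(flag_code)), '')
) AS vessel_key_hash
FROM staging_vessels;
```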
## Table Relationships
```
intelligence_reports
    ↓ 1:1
vessel_intelligence
    ↓ vessel_key_hash
vessel_identity_confirmations (aggregated view)
    ↓
Cross-source analysis
```
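
To make the chain concrete, a query walking from individual reports down to the aggregated confirmations might look like the sketch below; the join keys `report_id` and `vessel_key_hash` are assumptions inferred from the 1:1 and hash relationships shown above:

```sql
-- Hypothetical sketch: traverse reports → intelligence → confirmations.
-- Assumes vessel_intelligence carries report_id (1:1) and vessel_key_hash.
SELECT r.report_id,
       vi.vessel_name,
       c.confirmation_count,
       c.confidence_score
FROM intelligence_reports r
JOIN vessel_intelligence vi ON vi.report_id = r.report_id
JOIN vessel_identity_confirmations c
  ON c.vessel_key_hash = vi.vessel_key_hash;
```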
## Key Functions

### `rebuild_vessel_confirmations()`
Rebuilds confirmation tracking from all sources:
```sql
-- Aggregates all reports by vessel_key_hash
-- Counts confirming sources
-- Calculates confidence scores
SELECT * FROM rebuild_vessel_confirmations();
```
### `update_vessel_confirmations(batch_id)`
Updates confirmations for a specific import:
```sql
-- Incremental update after new import
-- Tracks new and updated confirmations
SELECT * FROM update_vessel_confirmations('batch-uuid');
```
### `detect_vessel_changes(current_batch, previous_batch)`
Identifies changes between imports:
```sql
-- Compares vessel data between batches
-- Identifies NEW, UPDATED, REMOVED vessels
-- Calculates risk scores for changes
SELECT * FROM detect_vessel_changes('current-batch-uuid', 'previous-batch-uuid');
```
## Import Script Template

The enhanced import template (`template_enhanced_staged_import.sh`) implements all of these principles:
1. **File Deduplication Check** (lines 47-72): check if the exact file was already imported
2. **Incremental Import Support** (lines 78-102): detect the previous batch for change detection
3. **Raw Data Preservation** (lines 303-333): store complete raw data as JSONB
4. **Confirmation Updates** (lines 503-513): update cross-source confirmations
5. **Change Detection** (lines 441-496): detect and log changes (a combined sketch of steps 4-5 follows this list)
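
Assuming placeholder batch UUIDs, the SQL issued by those last two steps might look like the following sketch; the exact invocation inside the template may differ:

```sql
-- Hypothetical sketch of the post-load steps in an import run.
-- Placeholder UUIDs stand in for the batch IDs the script tracks.

-- Step 4: update cross-source confirmations for the batch just loaded.
SELECT * FROM update_vessel_confirmations('new-batch-uuid');

-- Step 5: detect and log changes against the previous current batch.
SELECT * FROM detect_vessel_changes('new-batch-uuid', 'previous-batch-uuid');
```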
## Best Practices

### DO
- ✅ Preserve all raw data in JSONB
- ✅ Track file lineage with hashes
- ✅ Update confirmations after each import
- ✅ Use temporal tracking (valid_from/valid_to; see the sketch after this list)
- ✅ Calculate confidence based on confirmations
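
For the temporal-tracking practice, here is a minimal sketch of the `valid_from`/`valid_to` pattern, assuming the common convention that a NULL `valid_to` marks the current row (the actual schema may use a different sentinel):

```sql
-- Hypothetical sketch: close out the old row instead of updating in place,
-- then insert the new version. Assumes valid_to IS NULL means "current".
UPDATE vessel_intelligence
SET valid_to = now()
WHERE vessel_key_hash = 'abc123'
  AND valid_to IS NULL;

INSERT INTO vessel_intelligence (vessel_key_hash, vessel_name, valid_from)
VALUES ('abc123', 'ATLANTIC STAR', now());
```

This is also why the DON'T list below forbids deleting old reports: superseded rows are closed out, never removed.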
### DON'T
- ❌ Skip reports because "already have this vessel"
- ❌ Merge data from different sources
- ❌ Delete old reports when updating
- ❌ Treat confirmations as duplicates
- ❌ Modify raw data during import
## Example: Multi-Source Confirmation
When NAFO, NEAFC, and ICCAT all report vessel "ATLANTIC STAR":

- Three separate `intelligence_reports` records are created
- Three separate `vessel_intelligence` records are created
- One `vessel_identity_confirmations` record shows:
  - `confirmation_count: 3`
  - `confirming_source_names: ['NAFO', 'NEAFC', 'ICCAT']`
  - `confidence_score: 0.3` (high confidence)
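
That aggregated row can be pulled back directly with the columns used in the earlier examples:

```sql
-- Fetch the aggregated confirmation row for the example vessel.
SELECT vessel_name,
       confirmation_count,
       confirming_source_names,
       confidence_score
FROM vessel_identity_confirmations
WHERE vessel_name = 'ATLANTIC STAR';
```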
## Monitoring Import Health

### Check Import Status
```sql
SELECT
    rfmo_shortname,
    import_date,
    raw_records_count,
    intelligence_extracted_count,
    is_current
FROM intelligence_import_batches
ORDER BY import_date DESC;
```
### Check Confirmation Distribution
```sql
SELECT
    confirmation_count,
    COUNT(*) AS vessel_count
FROM vessel_identity_confirmations
GROUP BY confirmation_count
ORDER BY confirmation_count;
```
### Check Source Overlaps
```sql
-- Which sources confirm each other most
SELECT
    source_1,
    source_2,
    shared_vessels
FROM source_overlap_analysis
WHERE shared_vessels > 100
ORDER BY shared_vessels DESC;
```
## Troubleshooting

### "Duplicate key violation"

- This is CORRECT behavior when re-importing the same file
- The system is preventing duplicate processing
- Either skip the file or force a new version
"No confirmations found"
- Check
vessel_key_hashgeneration - Ensure consistent IMO/name/flag formatting
- Run
rebuild_vessel_confirmations()
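
If key generation is the suspect, a diagnostic along these lines can surface identities that hash inconsistently (the `imo_number` column name is an assumption):

```sql
-- Hypothetical diagnostic: an IMO mapping to multiple key hashes
-- usually indicates inconsistent name/flag formatting upstream.
SELECT imo_number,
       count(DISTINCT vessel_key_hash) AS hash_variants
FROM vessel_intelligence
GROUP BY imo_number
HAVING count(DISTINCT vessel_key_hash) > 1;
```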
"Data loss detected"
- Check staging table structure matches data
- Verify CSV parsing handles embedded newlines
- Check for encoding issues
## References

- INTELLIGENCE_ARCHITECTURE.md
- ADR-0061: Intelligence Confirmations Are Not Duplicates
- ADR-0062: Phase 1 Data Isolation Architecture