Source: ebisu/docs/adr/0061-intelligence-confirmations-not-duplicates.md
ADR-0061: Intelligence Confirmations Are Not Duplicates
Status: Accepted
Date: 2025-01-11
Stakeholders: Development team, Intelligence analysts
Context
During Phase 1 implementation, there was potential confusion between:
- File deduplication - Preventing the same file from being imported twice
- Intelligence confirmations - Multiple sources reporting the same vessel
This distinction is CRITICAL for an intelligence platform, where multiple reports of the same entity increase confidence rather than create redundancy.
Decision
We explicitly separate these two concepts:
1. File-Level Deduplication (Good)
- Prevent reimporting the exact same file (same SHA-256 hash)
- Tracked in the data_lineage table
- Purpose: Avoid processing waste and data corruption
2. Intelligence Confirmations (Critical)
- Multiple sources reporting the same vessel are CONFIRMATIONS
- Each source maintains its own records in intelligence_reports
- Cross-source confirmations tracked in vessel_identity_confirmations
- More sources = higher confidence score
Architecture
-- Each source has its own data
intelligence_reports
├── source_id (RFMO/source identifier)
├── raw_vessel_data (complete original data)
├── vessel_key_hash (for cross-source matching)
└── data_hash (includes source-specific fields)
-- Confirmation tracking
vessel_identity_confirmations
├── vessel_key_hash (IMO + name + flag)
├── confirming_sources[] (array of source IDs)
├── confirmation_count (number of sources)
└── confidence_score (based on confirmations)
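The split between vessel_key_hash and data_hash can be illustrated with a minimal sketch of how a source-independent key might be derived from IMO + name + flag. The function names (`sha256_stdin`, `vessel_key_hash`), the `|` separator, and the normalization rules (uppercasing, squeezing repeated spaces) are assumptions made for illustration; the ADR does not specify them.

```shell
#!/usr/bin/env bash
# Sketch only: sha256_stdin and vessel_key_hash are hypothetical names, and the
# normalization below is an assumption, not the platform's documented rule.

# Print the SHA-256 hex digest of whatever arrives on stdin.
sha256_stdin() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum | awk '{print $1}'
  else
    # macOS and some BSDs ship shasum instead of sha256sum
    shasum -a 256 | awk '{print $1}'
  fi
}

# Build a key from IMO, name, and flag only (no source-specific fields),
# normalize it, and hash it so every source produces the same value.
vessel_key_hash() {
  local imo="$1" name="$2" flag="$3"
  printf '%s|%s|%s' "$imo" "$name" "$flag" \
    | tr '[:lower:]' '[:upper:]' \
    | tr -s ' ' \
    | sha256_stdin
}
```

Because the key leaves out source-specific fields, two sources that spell the same vessel slightly differently (for example `Ocean  Star` versus `OCEAN STAR`) would still map to one vessel_identity_confirmations row under these assumed rules, while their intelligence_reports rows stay separate.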
Implementation
File Deduplication
# Check if file already imported
EXISTING_LINEAGE=$(execute_sql "
    SELECT lineage_id
    FROM data_lineage
    WHERE source_file_hash = '$FILE_HASH'
")
if [[ -n "$EXISTING_LINEAGE" ]]; then
    log_warning "This exact file was already imported"
    # Optionally exit or continue with new version
fi
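The check above assumes `$FILE_HASH` already holds the file's SHA-256 digest. A minimal sketch of how that value might be computed and validated before it is interpolated into the SQL string follows; `file_sha256` and `is_sha256_hex` are hypothetical helper names, not functions known to exist in the import scripts.

```shell
#!/usr/bin/env bash
# Sketch only: file_sha256 and is_sha256_hex are hypothetical helpers.

# Print the SHA-256 hex digest of the file at $1.
# Reading via stdin keeps the file name out of the tool's output.
file_sha256() {
  local path="$1"
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum < "$path" | awk '{print $1}'
  else
    # macOS and some BSDs ship shasum instead of sha256sum
    shasum -a 256 < "$path" | awk '{print $1}'
  fi
}

# Succeed only if $1 is exactly 64 lowercase hex characters, so the value
# cannot carry quotes or other SQL syntax into the query string.
is_sha256_hex() {
  [[ "$1" =~ ^[0-9a-f]{64}$ ]]
}
```

A caller might then write `FILE_HASH=$(file_sha256 "$INPUT_FILE")` followed by `is_sha256_hex "$FILE_HASH" || exit 1`, where `$INPUT_FILE` is an assumed variable name. Note that the hash covers the file's bytes only: a different file from another source that mentions the same vessel produces a different hash and is imported normally.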
Intelligence Confirmations
-- Multiple reports = confirmations
SELECT
    vessel_name,
    confirmation_count,
    confirming_source_names,
    confidence_score
FROM vessel_identity_confirmations
WHERE confirmation_count > 1
ORDER BY confirmation_count DESC;
Consequences
Positive
- No intelligence data is ever lost
- Cross-source validation increases data confidence
- Can track how many sources confirm each vessel
- Proper audit trail of all reports
Negative
- More storage required (but intelligence requires this)
- Must carefully distinguish file-level deduplication (allowed) from record-level deduplication of vessel data (not allowed)
Neutral
- Requires clear documentation and training
- Import scripts must handle both concepts
Examples
Correct: Multiple Sources Confirming
NAFO reports vessel "OCEAN STAR" IMO 1234567
NEAFC reports vessel "OCEAN STAR" IMO 1234567
→ 2 intelligence_reports records
→ 1 vessel_identity_confirmations record with confirmation_count = 2
→ Higher confidence score
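The counting in this example can be sketched outside the database as follows. This is an illustration under stated assumptions, not the platform's implementation: `count_confirmations` is a hypothetical function, the tab-separated input format is invented for the sketch, and the vessel key here is just IMO and name because the example gives no flag.

```shell
#!/usr/bin/env bash
# Sketch only: count_confirmations is a hypothetical helper.
# Input (stdin): one report per line, "source_id<TAB>vessel_key".
# Output: "vessel_key<TAB>confirmation_count<TAB>confirming sources".
count_confirmations() {
  # sort -u drops repeated (source, vessel) pairs, so a source that reports
  # the same vessel twice still counts as one confirming source.
  LC_ALL=C sort -u | awk -F'\t' '
    {
      count[$2]++
      sources[$2] = (sources[$2] == "" ? $1 : sources[$2] "," $1)
    }
    END {
      for (key in count)
        printf "%s\t%d\t%s\n", key, count[key], sources[key]
    }
  ' | LC_ALL=C sort
}

# The two reports from the example above:
printf 'NAFO\t1234567|OCEAN STAR\nNEAFC\t1234567|OCEAN STAR\n' | count_confirmations
```

For the two reports shown, this prints a single line for the vessel with a count of 2 and the sources `NAFO,NEAFC`. Both input lines are kept as separate reports; only the count is aggregated per vessel.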
Correct: Preventing File Re-import
Import NAFO_vessels_2024-12.csv (hash: abc123...)
Re-run same import
→ System detects same file hash
→ Prevents duplicate import
→ Protects data integrity
Incorrect: Treating Confirmations as Duplicates
❌ NEVER DO THIS:
"NAFO already reported this vessel, skip NEAFC's report"
→ This loses valuable confirmation data
References
- ADR-0056: Intelligence Platform Principles
- ADR-0059: PostgreSQL 17 Native Architecture
- CLAUDE.md: Critical principle about no duplicates in intelligence