Source: ebisu/docs/adr/0061-intelligence-confirmations-not-duplicates.md

ADR-0061: Intelligence Confirmations Are Not Duplicates

Status: Accepted
Date: 2025-01-11
Stakeholders: Development team, Intelligence analysts

Context

During Phase 1 implementation, two concepts were at risk of being confused:

  1. File deduplication - Preventing the same file from being imported twice
  2. Intelligence confirmations - Multiple sources reporting the same vessel

This distinction is CRITICAL for an intelligence platform, where multiple reports of the same entity increase confidence rather than create redundancy.

Decision

We explicitly separate these two concepts:

1. File-Level Deduplication (Good)

  • Prevent reimporting the exact same file (same SHA-256 hash)
  • Tracked in the data_lineage table
  • Purpose: Avoid processing waste and data corruption
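
As a minimal sketch of the file-level check, the hash could be computed along these lines. It assumes GNU coreutils `sha256sum` is available; the function name `compute_file_hash` is hypothetical and not taken from the import scripts.

```shell
# Hypothetical helper (not from the import scripts): print the SHA-256
# hex digest of the file given as $1.
# Assumes GNU coreutils `sha256sum` is on PATH.
compute_file_hash() {
    # Reading via stdin keeps the output format fixed ("<digest>  -")
    # regardless of unusual characters in the file name.
    sha256sum < "$1" | cut -d' ' -f1
}

# Usage: FILE_HASH=$(compute_file_hash "NAFO_vessels_2024-12.csv")
```

Because the digest depends only on the file's bytes, renaming a file does not defeat the check, while any change to its content produces a new hash.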

2. Intelligence Confirmations (Critical)

  • Multiple sources reporting the same vessel are CONFIRMATIONS
  • Each source maintains its own records in intelligence_reports
  • Cross-source confirmations tracked in vessel_identity_confirmations
  • More sources = higher confidence score
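
The rule that more sources means higher confidence rests on counting distinct sources per vessel. A minimal sketch follows, assuming a hypothetical input format of `source_id|vessel_key` lines on stdin; the function name `count_confirmations` is illustrative. This ADR does not specify how confidence_score is derived from the count, so the sketch stops at the count.

```shell
# Sketch: count distinct confirming sources per vessel key.
# Input  (stdin): lines of "source_id|vessel_key"   (hypothetical format)
# Output (stdout): "vessel_key|confirmation_count|source1,source2,..."
count_confirmations() {
    awk -F'|' '
        # Handle each (vessel, source) pair only once, so a source that
        # re-reports the same vessel does not inflate the count.
        !seen[$2 SUBSEP $1]++ {
            count[$2]++
            sources[$2] = (sources[$2] == "" ? $1 : sources[$2] "," $1)
        }
        END {
            for (key in count) print key "|" count[key] "|" sources[key]
        }
    ' | sort
}

# Usage:
#   printf '%s\n' 'NAFO|1234567' 'NEAFC|1234567' | count_confirmations
```

Note that no input line is dropped from storage here: the function only summarizes reports, which mirrors the separation between intelligence_reports (every record kept) and vessel_identity_confirmations (the aggregate).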

Architecture

-- Each source has its own data
intelligence_reports
├── source_id (RFMO/source identifier)
├── raw_vessel_data (complete original data)
├── vessel_key_hash (for cross-source matching)
└── data_hash (includes source-specific fields)

-- Confirmation tracking
vessel_identity_confirmations
├── vessel_key_hash (IMO + name + flag)
├── confirming_sources[] (array of source IDs)
├── confirmation_count (number of sources)
└── confidence_score (based on confirmations)
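
For cross-source matching to work, the same vessel must yield the same vessel_key_hash regardless of which source reported it. The sketch below shows one way this might be done. The normalization rules (uppercasing, whitespace collapsing), the `|` delimiter, the choice of SHA-256, and the function name are all assumptions for illustration; the ADR only states that the key combines IMO, name, and flag.

```shell
# Hypothetical sketch: derive a source-independent vessel key hash.
# $1 = IMO number, $2 = vessel name, $3 = flag
# Assumes GNU coreutils `sha256sum`, and that no field contains "|".
vessel_key_hash() {
    printf '%s|%s|%s' "$1" "$2" "$3" \
        | tr '[:lower:]' '[:upper:]' \
        | tr -s '[:space:]' ' ' \
        | sed 's/ *| */|/g; s/^ *//; s/ *$//' \
        | sha256sum | cut -d' ' -f1
}

# Usage: both calls print the same digest despite formatting differences.
#   vessel_key_hash 1234567 "OCEAN STAR" PA
#   vessel_key_hash 1234567 "  ocean   star " pa
```

Source-specific fields are deliberately left out of this key; they belong in data_hash, which is expected to differ between sources.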

Implementation

File Deduplication

# Check whether this exact file was already imported
EXISTING_LINEAGE=$(execute_sql "
    SELECT lineage_id
    FROM data_lineage
    WHERE source_file_hash = '$FILE_HASH'
")

if [[ -n "$EXISTING_LINEAGE" ]]; then
    log_warning "This exact file was already imported"
    # Optionally exit or continue with new version
fi
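
Since the hash is interpolated directly into the SQL string, it may be worth validating its shape before the query runs. A minimal sketch, assuming the hash is stored as 64 lowercase hex characters; the function name `is_sha256_hex` is hypothetical.

```shell
# Hypothetical guard: succeed only if $1 is exactly 64 lowercase hex
# characters, the form produced by `sha256sum`.
is_sha256_hex() {
    case "$1" in
        # Reject anything containing a character outside 0-9 and a-f.
        *[!0123456789abcdef]*) return 1 ;;
    esac
    # Only hex characters remain; now require the exact digest length.
    [ "${#1}" -eq 64 ]
}

# Usage, before building the query:
#   is_sha256_hex "$FILE_HASH" || { echo "invalid file hash" >&2; exit 1; }
```

A value that passes this check cannot contain quotes or other SQL metacharacters, so the interpolation above stays safe even if the hash comes from an untrusted manifest.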

Intelligence Confirmations

-- Multiple reports = confirmations
SELECT
    vessel_name,
    confirmation_count,
    confirming_source_names,
    confidence_score
FROM vessel_identity_confirmations
WHERE confirmation_count > 1
ORDER BY confirmation_count DESC;

Consequences

Positive

  • No intelligence report is discarded as a duplicate; every source's record is retained
  • Cross-source validation increases data confidence
  • Can track how many sources confirm each vessel
  • Proper audit trail of all reports

Negative

  • More storage required (an accepted cost of retaining every source's report)
  • Must carefully distinguish file-level deduplication from record-level confirmations

Neutral

  • Requires clear documentation and training
  • Import scripts must handle both concepts

Examples

Correct: Multiple Sources Confirming

NAFO reports vessel "OCEAN STAR" IMO 1234567
NEAFC reports vessel "OCEAN STAR" IMO 1234567
→ 2 intelligence_reports records
→ 1 vessel_identity_confirmations record with confirmation_count = 2
→ Higher confidence score

Correct: Preventing File Re-import

Import NAFO_vessels_2024-12.csv (hash: abc123...)
Re-run same import
→ System detects same file hash
→ Prevents duplicate import
→ Protects data integrity

Incorrect: Treating Confirmations as Duplicates

❌ NEVER DO THIS:
"NAFO already reported this vessel, skip NEAFC's report"
→ This loses valuable confirmation data

References

  • ADR-0056: Intelligence Platform Principles
  • ADR-0059: PostgreSQL 17 Native Architecture
  • CLAUDE.md: Critical principle about no duplicates in intelligence