Source: ebisu/docs/adr/0062-phase1-data-isolation-architecture.md

ADR-0062: Phase 1 Data Isolation Architecture

Status: Accepted
Date: 2025-01-11
Stakeholders: Development team, Data architects

Context

Phase 1 (Raw Intelligence Collection) must maintain complete data isolation between sources while enabling cross-source analysis. Each source's data must remain independent and unmodified.

Decision

Implement a three-tier data isolation architecture:

Tier 1: Source Isolation

Each source's raw data is stored in intelligence_reports
No cross-contamination between sources
Complete preservation of original data

Tier 2: Structured Extraction

vessel_intelligence holds the structured data extracted from each report
Each extracted record keeps its link to the originating source report
No merging or deduplication across sources

Tier 3: Cross-Source Analysis

vessel_identity_confirmations for confirmation tracking
intelligence_change_log for temporal analysis
Analysis layers that never modify source data
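
The tiers above state that analysis never modifies source data, but this ADR does not say how that rule is enforced. One possible approach, sketched here under stated assumptions, is to run the analysis layer under a database role that can only read the Tier 1 and Tier 2 tables. The role name cross_source_analysis is hypothetical, and the sketch assumes that the four tables already exist and that the analysis layer is the writer for both Tier 3 tables.

```sql
-- Hypothetical role for the cross-source analysis layer (not defined in this ADR).
CREATE ROLE cross_source_analysis NOLOGIN;

-- Tiers 1 and 2: the analysis layer may read source data but not change it.
GRANT SELECT ON intelligence_reports, vessel_intelligence
    TO cross_source_analysis;

-- Tier 3: the analysis layer writes only to its own tables (assumption).
GRANT SELECT, INSERT, UPDATE
    ON vessel_identity_confirmations, intelligence_change_log
    TO cross_source_analysis;
```

With privileges like these, analysis code that connects as this role and attempts an UPDATE or DELETE on intelligence_reports gets a permission error rather than altering source data.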

Architecture

┌─────────────────────────────────────────────────┐
│                 Source Data Files               │
│  (NAFO.csv, NEAFC.csv, IOTC.csv, etc.)          │
└────────────────────┬────────────────────────────┘
                     │
┌────────────────────▼────────────────────────────┐
│            intelligence_reports                 │
│  - Separate records for each source             │
│  - Complete raw data preservation               │
│  - No deduplication across sources              │
└────────────────────┬────────────────────────────┘
                     │
┌────────────────────▼────────────────────────────┐
│           vessel_intelligence                   │
│  - Structured extraction per report             │
│  - 1:1 relationship with reports                │
│  - No cross-source merging                      │
└────────────────────┬────────────────────────────┘
                     │
┌────────────────────▼────────────────────────────┐
│         Cross-Source Analysis Layer             │
│  - vessel_identity_confirmations                │
│  - intelligence_change_log                      │
│  - Read-only analysis, no data modification     │
└─────────────────────────────────────────────────┘
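
The layering in the diagram could map onto table definitions roughly as follows. This is a minimal sketch rather than the project's actual schema: the columns rfmo_shortname, report_date, data_hash, is_current, vessel_name, vessel_imo, confirmation_count, and confirming_source_names appear in the queries in this ADR, while report_id, intelligence_id, raw_data, and all column types are assumptions made for illustration. intelligence_change_log is omitted because this ADR does not describe its columns.

```sql
-- Tier 1: one row per source report, raw payload stored as received.
CREATE TABLE intelligence_reports (
    report_id      BIGINT  GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- assumed key name
    rfmo_shortname TEXT    NOT NULL,               -- reporting source, e.g. 'NAFO'
    report_date    DATE    NOT NULL,
    raw_data       JSONB   NOT NULL,               -- assumed column for the raw payload
    data_hash      TEXT    NOT NULL,
    is_current     BOOLEAN NOT NULL DEFAULT TRUE
);

-- Tier 2: structured extraction, tied to exactly one report.
CREATE TABLE vessel_intelligence (
    intelligence_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- assumed key name
    report_id       BIGINT NOT NULL UNIQUE
                    REFERENCES intelligence_reports (report_id),
    vessel_name     TEXT,
    vessel_imo      TEXT
);

-- Tier 3: derived confirmation data, kept apart from the source tables.
CREATE TABLE vessel_identity_confirmations (
    vessel_name             TEXT    NOT NULL,
    vessel_imo              TEXT,
    confirmation_count      INTEGER NOT NULL,
    confirming_source_names TEXT[]  NOT NULL
);
```

In this sketch the NOT NULL UNIQUE foreign key on vessel_intelligence.report_id expresses the 1:1 relationship: every extracted record points at one report, and no report can have more than one extracted record.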

Implementation Details

1. Source Data Isolation

-- Each source maintains its own records:
-- row count and number of distinct payload hashes per source
SELECT
    rfmo_shortname,
    COUNT(*) AS vessel_count,
    COUNT(DISTINCT data_hash) AS unique_data_patterns
FROM intelligence_reports
GROUP BY rfmo_shortname
ORDER BY vessel_count DESC;

2. Confirmation Tracking

-- Cross-source analysis without data merging:
-- vessels confirmed by more than one source
SELECT
    vic.vessel_name,
    vic.vessel_imo,
    vic.confirmation_count,
    vic.confirming_source_names
FROM vessel_identity_confirmations vic
WHERE vic.confirmation_count > 1;

3. Temporal Tracking

-- Each source's temporal history is preserved;
-- this query counts only the current reports per source and report date
SELECT
    rfmo_shortname,
    report_date,
    COUNT(*) AS reports_on_date
FROM intelligence_reports
WHERE is_current = TRUE
GROUP BY rfmo_shortname, report_date
ORDER BY rfmo_shortname, report_date DESC;

Data Flow Example

1. NAFO reports vessel "ATLANTIC STAR" on 2024-01-15
   → Creates intelligence_report (ID: 123)
   → Creates vessel_intelligence (linked to 123)
   
2. NEAFC reports vessel "ATLANTIC STAR" on 2024-02-20
   → Creates intelligence_report (ID: 456) 
   → Creates vessel_intelligence (linked to 456)
   
3. Confirmation system detects match
   → Updates vessel_identity_confirmations
   → Records show 2 confirming sources
   → Original reports remain unchanged
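
The flow above can be reproduced as a single read-only query. This is a sketch under stated assumptions: the two CTEs stand in for the real tables and carry only the columns needed here, report_id is an assumed key name, and matching on vessel_name alone is a simplification, since this ADR does not specify the actual matching rule.

```sql
-- Stand-ins for the real tables, holding the two reports from the example.
WITH intelligence_reports (report_id, rfmo_shortname, report_date) AS (
    VALUES (123, 'NAFO',  DATE '2024-01-15'),
           (456, 'NEAFC', DATE '2024-02-20')
),
vessel_intelligence (report_id, vessel_name) AS (
    VALUES (123, 'ATLANTIC STAR'),
           (456, 'ATLANTIC STAR')
)
-- Derive confirmations by reading both tiers; nothing is updated or merged.
SELECT
    vi.vessel_name,
    COUNT(DISTINCT ir.rfmo_shortname) AS confirmation_count,
    ARRAY_AGG(DISTINCT ir.rfmo_shortname
              ORDER BY ir.rfmo_shortname) AS confirming_source_names
FROM vessel_intelligence vi
JOIN intelligence_reports ir USING (report_id)
GROUP BY vi.vessel_name
HAVING COUNT(DISTINCT ir.rfmo_shortname) > 1;
-- Returns one row: ATLANTIC STAR | 2 | {NAFO,NEAFC}
```

Because the confirmation is computed with a SELECT, reports 123 and 456 are left exactly as ingested, which is the property step 3 relies on.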

Consequences

Positive

Complete data lineage and provenance
No information loss
Can trace any data back to original source
Sources can be updated independently
Cross-source validation without contamination

Negative

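More storage required, since the same vessel is stored once per reporting source
Cross-source analysis requires more complex queries
Relationships between reports, extracted records, and confirmations must be managed carefully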

Neutral

Requires clear understanding of data model
Phase 2 will handle identity resolution

Migration Path to Phase 2

Phase 1's isolation architecture enables Phase 2:

All source data preserved for re-processing
Can build identity graph without losing originals
Can implement different resolution strategies
Can roll back if needed

References

ADR-0056: Staged Intelligence Import Architecture
ADR-0061: Intelligence Confirmations Are Not Duplicates
ADR-0059: PostgreSQL 17 Native Architecture
