
Source: ebisu/docs/adr/0062-phase1-data-isolation-architecture.md

ADR-0062: Phase 1 Data Isolation Architecture

Status: Accepted
Date: 2025-01-11
Stakeholders: Development team, Data architects

Context

Phase 1 (Raw Intelligence Collection) must maintain complete data isolation between sources while enabling cross-source analysis. Each source's data must remain independent and unmodified.

Decision

Implement a three-tier data isolation architecture:

Tier 1: Source Isolation

  • Each source's raw data is stored in intelligence_reports
  • No cross-contamination between sources
  • Complete preservation of original data

Tier 2: Structured Extraction

  • vessel_intelligence holds the structured data extracted from each report
  • Each extracted record stays linked to its source report
  • No merging or deduplication

Tier 3: Cross-Source Analysis

  • vessel_identity_confirmations for confirmation tracking
  • intelligence_change_log for temporal analysis
  • Analysis layers that never modify source data

Architecture

┌─────────────────────────────────────────────────┐
│                Source Data Files                │
│      (NAFO.csv, NEAFC.csv, IOTC.csv, etc.)      │
└────────────────────┬────────────────────────────┘
                     │
┌────────────────────▼────────────────────────────┐
│              intelligence_reports               │
│  - Separate records for each source             │
│  - Complete raw data preservation               │
│  - No deduplication across sources              │
└────────────────────┬────────────────────────────┘
                     │
┌────────────────────▼────────────────────────────┐
│               vessel_intelligence               │
│  - Structured extraction per report             │
│  - 1:1 relationship with reports                │
│  - No cross-source merging                      │
└────────────────────┬────────────────────────────┘
                     │
┌────────────────────▼────────────────────────────┐
│           Cross-Source Analysis Layer           │
│  - vessel_identity_confirmations                │
│  - intelligence_change_log                      │
│  - Read-only analysis, no data modification     │
└─────────────────────────────────────────────────┘
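
The first two tiers in the diagram could be sketched as a schema. The following is a minimal illustration, using Python's built-in sqlite3 as a stand-in for PostgreSQL. The table names and the columns rfmo_shortname, report_date, data_hash, is_current, vessel_name, and vessel_imo appear in this ADR; report_id, raw_data, intelligence_id, the trigger, and both helper functions are assumptions made for illustration only.

```python
import sqlite3

# Hypothetical schema: only the table names and some column names come from
# the ADR; the keys, raw_data, and the trigger are illustrative assumptions.
SCHEMA = """
CREATE TABLE intelligence_reports (
    report_id      INTEGER PRIMARY KEY,
    rfmo_shortname TEXT NOT NULL,
    report_date    TEXT NOT NULL,
    raw_data       TEXT NOT NULL,
    data_hash      TEXT NOT NULL,
    is_current     INTEGER NOT NULL DEFAULT 1
);

CREATE TABLE vessel_intelligence (
    intelligence_id INTEGER PRIMARY KEY,
    -- UNIQUE enforces the 1:1 relationship with reports
    report_id   INTEGER NOT NULL UNIQUE
                REFERENCES intelligence_reports(report_id),
    vessel_name TEXT,
    vessel_imo  TEXT
);

-- Reject any attempt to rewrite preserved source data
CREATE TRIGGER protect_raw_data
BEFORE UPDATE OF raw_data, data_hash ON intelligence_reports
BEGIN
    SELECT RAISE(ABORT, 'raw source data is immutable');
END;
"""


def build_db() -> sqlite3.Connection:
    """Create an in-memory database with the sketched Phase 1 tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def raw_update_is_blocked(conn: sqlite3.Connection, report_id: int) -> bool:
    """Return True if modifying raw_data on a stored report is rejected."""
    try:
        conn.execute(
            "UPDATE intelligence_reports SET raw_data = '{}' WHERE report_id = ?",
            (report_id,),
        )
    except sqlite3.DatabaseError:
        return True
    return False
```

In this sketch the isolation rules are enforced by the database itself rather than by convention: the UNIQUE constraint keeps one extracted record per report, and the trigger turns any write to the raw payload into an error. A PostgreSQL deployment might achieve the same effect with privileges or a trigger function instead.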

Implementation Details

1. Source Data Isolation

-- Each source maintains its own records
SELECT
    rfmo_shortname,
    COUNT(*) AS vessel_count,
    COUNT(DISTINCT data_hash) AS unique_data_patterns
FROM intelligence_reports
GROUP BY rfmo_shortname
ORDER BY vessel_count DESC;
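
This ADR does not say how data_hash is produced. One common approach, shown here purely as an assumption, is to hash a canonical serialization of the raw source row so that identical rows always yield the same value. The function name compute_data_hash and the sample field names are hypothetical.

```python
import hashlib
import json


def compute_data_hash(raw_row: dict) -> str:
    """Hash a raw source row deterministically (hypothetical helper).

    Keys are sorted so that column order in the source file does not
    change the hash; the values themselves are left untouched.
    """
    canonical = json.dumps(raw_row, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same content in a different column order gives the same hash.
a = compute_data_hash({"vessel_name": "ATLANTIC STAR", "flag": "XX"})
b = compute_data_hash({"flag": "XX", "vessel_name": "ATLANTIC STAR"})
print(a == b)  # True
```

Because the hash is computed per row and stored alongside rfmo_shortname, two sources that report identical content still keep separate records; the hash only describes the content, it does not merge it.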

2. Confirmation Tracking

-- Cross-source analysis without data merging
SELECT
    vic.vessel_name,
    vic.vessel_imo,
    vic.confirmation_count,
    vic.confirming_source_names
FROM vessel_identity_confirmations vic
WHERE vic.confirmation_count > 1;
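
How vessel_identity_confirmations gets populated is not described here. As a hedged sketch, confirmations could be derived by a read-only aggregation over the lower tiers. The example below uses sqlite3 as a stand-in for PostgreSQL and assumes that a shared vessel_imo is what counts as a match; the report_id join column, the IMO values, and the vessel "OCEAN DAWN" are invented sample data.

```python
import sqlite3


def demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory dataset (sample values are invented)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE intelligence_reports (
            report_id      INTEGER PRIMARY KEY,
            rfmo_shortname TEXT NOT NULL
        );
        CREATE TABLE vessel_intelligence (
            report_id   INTEGER NOT NULL UNIQUE
                        REFERENCES intelligence_reports(report_id),
            vessel_name TEXT,
            vessel_imo  TEXT
        );
        INSERT INTO intelligence_reports VALUES
            (123, 'NAFO'), (456, 'NEAFC'), (789, 'IOTC');
        INSERT INTO vessel_intelligence VALUES
            (123, 'ATLANTIC STAR', '9000001'),
            (456, 'ATLANTIC STAR', '9000001'),
            (789, 'OCEAN DAWN',    '9000002');
        """
    )
    return conn


def find_confirmations(conn: sqlite3.Connection) -> list:
    """Return (vessel_imo, vessel_name, confirmation_count, source_names)
    for vessels reported by more than one source.

    The query only reads: no rows are merged, updated, or deleted.
    """
    return conn.execute(
        """
        SELECT vi.vessel_imo,
               MIN(vi.vessel_name)               AS vessel_name,
               COUNT(DISTINCT ir.rfmo_shortname) AS confirmation_count,
               GROUP_CONCAT(DISTINCT ir.rfmo_shortname)
                                                 AS confirming_source_names
        FROM vessel_intelligence vi
        JOIN intelligence_reports ir ON ir.report_id = vi.report_id
        WHERE vi.vessel_imo IS NOT NULL
        GROUP BY vi.vessel_imo
        HAVING COUNT(DISTINCT ir.rfmo_shortname) > 1
        ORDER BY vi.vessel_imo
        """
    ).fetchall()
```

Calling find_confirmations(demo_db()) yields a single row for IMO 9000001 with a confirmation count of 2, while all three vessel_intelligence rows remain in place. The order of names inside the concatenated source list is not guaranteed.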

3. Temporal Tracking

-- Each source's temporal history preserved
SELECT
    rfmo_shortname,
    report_date,
    COUNT(*) AS reports_on_date
FROM intelligence_reports
WHERE is_current = TRUE
GROUP BY rfmo_shortname, report_date
ORDER BY rfmo_shortname, report_date DESC;
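
The ADR does not describe how is_current is maintained. One possible reading, sketched below as an assumption, is that a newer report from the same source flags the older one as no longer current instead of overwriting it. The columns source_vessel_key and raw_data and the function names are hypothetical, and sqlite3 stands in for PostgreSQL.

```python
import sqlite3


def make_reports_table() -> sqlite3.Connection:
    """In-memory stand-in for intelligence_reports (columns partly assumed)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE intelligence_reports (
            report_id         INTEGER PRIMARY KEY,
            rfmo_shortname    TEXT NOT NULL,
            source_vessel_key TEXT NOT NULL,  -- hypothetical per-source key
            report_date       TEXT NOT NULL,
            raw_data          TEXT NOT NULL,  -- hypothetical raw payload
            is_current        INTEGER NOT NULL DEFAULT 1
        )
        """
    )
    return conn


def record_report(conn, rfmo_shortname, source_vessel_key, report_date, raw_data):
    """Store a new report version; older versions are flagged, never deleted."""
    with conn:  # commit both statements together, or roll both back
        # Only rows from the same source are touched, so sources stay isolated.
        conn.execute(
            "UPDATE intelligence_reports SET is_current = 0 "
            "WHERE rfmo_shortname = ? AND source_vessel_key = ? "
            "AND is_current = 1",
            (rfmo_shortname, source_vessel_key),
        )
        cur = conn.execute(
            "INSERT INTO intelligence_reports "
            "(rfmo_shortname, source_vessel_key, report_date, raw_data) "
            "VALUES (?, ?, ?, ?)",
            (rfmo_shortname, source_vessel_key, report_date, raw_data),
        )
    return cur.lastrowid
```

Under this reading, filtering on is_current returns the latest view per source, while dropping the filter exposes the full history, since superseded rows keep their original raw_data.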

Data Flow Example

1. NAFO reports vessel "ATLANTIC STAR" on 2024-01-15
   → Creates an intelligence_reports record (ID: 123)
   → Creates a vessel_intelligence record (linked to 123)

2. NEAFC reports vessel "ATLANTIC STAR" on 2024-02-20
   → Creates an intelligence_reports record (ID: 456)
   → Creates a vessel_intelligence record (linked to 456)

3. Confirmation system detects match
   → Updates vessel_identity_confirmations
   → Records show 2 confirming sources
   → Original reports remain unchanged
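
The flow above can be traced in a few lines of plain Python. This is a minimal sketch, not the project's implementation: the class IntelligenceReport, the function confirm, and matching on an exact vessel name are all assumptions chosen to mirror the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a stored report cannot be modified
class IntelligenceReport:
    report_id: int
    rfmo_shortname: str
    report_date: str
    vessel_name: str


def confirm(reports):
    """Map each vessel name to its sorted confirming sources.

    Only vessels seen in more than one source are returned, and the
    input reports are read but never changed.
    """
    sources = defaultdict(set)
    for report in reports:
        sources[report.vessel_name].add(report.rfmo_shortname)
    return {
        name: sorted(found)
        for name, found in sources.items()
        if len(found) > 1
    }


reports = [
    IntelligenceReport(123, "NAFO", "2024-01-15", "ATLANTIC STAR"),
    IntelligenceReport(456, "NEAFC", "2024-02-20", "ATLANTIC STAR"),
]
print(confirm(reports))  # {'ATLANTIC STAR': ['NAFO', 'NEAFC']}
```

Both reports still exist with their own IDs and dates after the match. The confirmation is a separate, derived record that points at them.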

Consequences

Positive

  • Complete data lineage and provenance
  • No information loss
  • Can trace any data back to original source
  • Sources can be updated independently
  • Cross-source validation without contamination

Negative

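  • More storage required, because a vessel reported by several sources is stored once per source
  • Cross-source analysis requires more complex queries
  • Relationships between reports, extracted records, and confirmations must be managed carefully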

Neutral

  • Requires clear understanding of data model
  • Phase 2 will handle identity resolution

Migration Path to Phase 2

Phase 1's isolation architecture enables Phase 2:

  • All source data preserved for re-processing
  • Can build identity graph without losing originals
  • Can implement different resolution strategies
  • Can roll back if needed

References

  • ADR-0056: Staged Intelligence Import Architecture
  • ADR-0061: Intelligence Confirmations Are Not Duplicates
  • ADR-0059: PostgreSQL 17 Native Architecture