Source: ebisu/docs/adr/0062-phase1-data-isolation-architecture.md
ADR-0062: Phase 1 Data Isolation Architecture
Status: Accepted
Date: 2025-01-11
Stakeholders: Development team, Data architects
Context
Phase 1 (Raw Intelligence Collection) must maintain complete data isolation between sources while enabling cross-source analysis. Each source's data must remain independent and unmodified.
Decision
Implement a three-tier data isolation architecture:
Tier 1: Source Isolation
- Each source's raw data is stored in `intelligence_reports`
- No cross-contamination between sources
- Complete preservation of original data
Tier 2: Structured Extraction
- `vessel_intelligence` extracts structured data
- Still maintains source relationship
- No merging or deduplication
Tier 3: Cross-Source Analysis
- `vessel_identity_confirmations` for confirmation tracking
- `intelligence_change_log` for temporal analysis
- Analysis layers that never modify source data
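To make the three tiers concrete, here is a minimal PostgreSQL sketch of how the tables might relate. Only the table names and the columns `rfmo_shortname`, `report_date`, `data_hash`, `is_current`, `vessel_name`, `vessel_imo`, `confirmation_count`, and `confirming_source_names` appear in this ADR. Every other column, key, and type below is an assumption made for illustration, not the project's actual schema.

```sql
-- Hypothetical schema sketch; names not used elsewhere in this ADR are assumptions.

-- Tier 1: one row per source report, raw payload kept verbatim.
CREATE TABLE intelligence_reports (
    report_id      BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- assumed key name
    rfmo_shortname TEXT    NOT NULL,               -- reporting source, e.g. 'NAFO'
    report_date    DATE    NOT NULL,
    raw_data       JSONB   NOT NULL,               -- assumed: original record, unmodified
    data_hash      TEXT    NOT NULL,
    is_current     BOOLEAN NOT NULL DEFAULT TRUE
);

-- Tier 2: structured extraction. Using report_id as both primary key and
-- foreign key allows at most one extraction row per report.
CREATE TABLE vessel_intelligence (
    report_id   BIGINT PRIMARY KEY
                REFERENCES intelligence_reports (report_id),
    vessel_name TEXT,
    vessel_imo  TEXT
);

-- Tier 3: analysis output, stored apart from the source tables.
CREATE TABLE vessel_identity_confirmations (
    vessel_imo              TEXT PRIMARY KEY,      -- assumed match key
    vessel_name             TEXT,
    confirmation_count      INTEGER NOT NULL,
    confirming_source_names TEXT[]  NOT NULL
);

CREATE TABLE intelligence_change_log (
    change_id   BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- assumed
    report_id   BIGINT NOT NULL
                REFERENCES intelligence_reports (report_id),
    change_type TEXT        NOT NULL,               -- assumed
    detected_at TIMESTAMPTZ NOT NULL DEFAULT now()  -- assumed
);
```

In this sketch the Tier 3 tables reference Tier 1 only by key, so analysis results can be dropped and rebuilt without touching source rows.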
Architecture
┌─────────────────────────────────────────────────┐
│                Source Data Files                │
│      (NAFO.csv, NEAFC.csv, IOTC.csv, etc.)      │
└────────────────────┬────────────────────────────┘
                     │
┌────────────────────▼────────────────────────────┐
│              intelligence_reports               │
│  - Separate records for each source             │
│  - Complete raw data preservation               │
│  - No deduplication across sources              │
└────────────────────┬────────────────────────────┘
                     │
┌────────────────────▼────────────────────────────┐
│               vessel_intelligence               │
│  - Structured extraction per report             │
│  - 1:1 relationship with reports                │
│  - No cross-source merging                      │
└────────────────────┬────────────────────────────┘
                     │
┌────────────────────▼────────────────────────────┐
│           Cross-Source Analysis Layer           │
│  - vessel_identity_confirmations                │
│  - intelligence_change_log                      │
│  - Read-only analysis, no data modification     │
└─────────────────────────────────────────────────┘
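One way the "read-only analysis" boundary could be enforced at the database level is through privileges rather than convention. The sketch below assumes the four tables named in this ADR already exist. The role name `analysis_layer` is hypothetical, and the ADR does not say whether the project enforces the boundary this way.

```sql
-- Hypothetical role for the cross-source analysis jobs (name is an assumption).
CREATE ROLE analysis_layer NOLOGIN;

-- Tier 1 and Tier 2: the analysis layer may read but never write.
REVOKE ALL ON intelligence_reports, vessel_intelligence FROM analysis_layer;
GRANT SELECT ON intelligence_reports, vessel_intelligence TO analysis_layer;

-- Tier 3: the analysis layer maintains its own tables only.
GRANT SELECT, INSERT, UPDATE
    ON vessel_identity_confirmations, intelligence_change_log
    TO analysis_layer;
```

With grants like these, an accidental `UPDATE` against `intelligence_reports` from an analysis job would fail with a permission error instead of silently altering source data.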
Implementation Details
1. Source Data Isolation
-- Each source maintains its own records
SELECT
    rfmo_shortname,
    COUNT(*) AS vessel_count,
    COUNT(DISTINCT data_hash) AS unique_data_patterns
FROM intelligence_reports
GROUP BY rfmo_shortname
ORDER BY vessel_count DESC;
2. Confirmation Tracking
-- Cross-source analysis without data merging
SELECT
    vic.vessel_name,
    vic.vessel_imo,
    vic.confirmation_count,
    vic.confirming_source_names
FROM vessel_identity_confirmations vic
WHERE vic.confirmation_count > 1;
3. Temporal Tracking
-- Each source's temporal history preserved
SELECT
    rfmo_shortname,
    report_date,
    COUNT(*) AS reports_on_date
FROM intelligence_reports
WHERE is_current = TRUE
GROUP BY rfmo_shortname, report_date
ORDER BY rfmo_shortname, report_date DESC;
Data Flow Example
1. NAFO reports vessel "ATLANTIC STAR" on 2024-01-15
→ Creates intelligence_report (ID: 123)
→ Creates vessel_intelligence (linked to 123)
2. NEAFC reports vessel "ATLANTIC STAR" on 2024-02-20
→ Creates intelligence_report (ID: 456)
→ Creates vessel_intelligence (linked to 456)
3. Confirmation system detects match
→ Updates vessel_identity_confirmations
→ Records show 2 confirming sources
→ Original reports remain unchanged
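The flow above can be walked through in SQL. This is a minimal sketch, not the project's confirmation logic: the stand-in temporary tables carry only the columns the example needs, `report_id` is an assumed key name, and matching on an exact `vessel_name` is a simplifying assumption, since the ADR does not specify the matching rule.

```sql
-- Minimal stand-in tables (hypothetical columns) so the walk-through runs on its own.
CREATE TEMP TABLE intelligence_reports (
    report_id      BIGINT PRIMARY KEY,
    rfmo_shortname TEXT NOT NULL,
    report_date    DATE NOT NULL
);
CREATE TEMP TABLE vessel_intelligence (
    report_id   BIGINT PRIMARY KEY REFERENCES intelligence_reports (report_id),
    vessel_name TEXT NOT NULL
);

-- Steps 1 and 2: each source gets its own report row and its own extraction row.
INSERT INTO intelligence_reports (report_id, rfmo_shortname, report_date) VALUES
    (123, 'NAFO',  DATE '2024-01-15'),
    (456, 'NEAFC', DATE '2024-02-20');
INSERT INTO vessel_intelligence (report_id, vessel_name) VALUES
    (123, 'ATLANTIC STAR'),
    (456, 'ATLANTIC STAR');

-- Step 3: detect the match with a read-only query; no source row is touched.
SELECT
    vi.vessel_name,
    COUNT(DISTINCT ir.rfmo_shortname) AS confirmation_count,
    ARRAY_AGG(DISTINCT ir.rfmo_shortname ORDER BY ir.rfmo_shortname)
        AS confirming_source_names
FROM vessel_intelligence vi
JOIN intelligence_reports ir USING (report_id)
GROUP BY vi.vessel_name
HAVING COUNT(DISTINCT ir.rfmo_shortname) > 1;
```

The final query returns a single row for ATLANTIC STAR with a `confirmation_count` of 2 and the sources NAFO and NEAFC, while reports 123 and 456 remain exactly as inserted.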
Consequences
Positive
- Complete data lineage and provenance
- No information loss
- Can trace any data back to original source
- Sources can be updated independently
- Cross-source validation without contamination
Negative
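- More storage required, because a vessel reported by several sources is stored once per source
- Cross-source analysis requires more complex queries
- Relationships between reports, extractions, and confirmations must be managed carefully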
Neutral
- Requires clear understanding of data model
- Phase 2 will handle identity resolution
Migration Path to Phase 2
Phase 1's isolation architecture enables Phase 2:
- All source data preserved for re-processing
- Can build identity graph without losing originals
- Can implement different resolution strategies
- Can roll back if needed
References
- ADR-0056: Staged Intelligence Import Architecture
- ADR-0061: Intelligence Confirmations Are Not Duplicates
- ADR-0059: PostgreSQL 17 Native Architecture