Ebisu Data Storage Architecture

Problem Statement

Managing raw vessel registry data alongside the Ebisu codebase raises several challenges:

  • Large vessel registry files (10MB-100MB+ each)
  • Git repository bloat from binary files
  • Need versioning and audit trail
  • Multiple data formats (CSV, Excel, XML, JSON)

Architecture

┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   Data Source   │────▶│  Import Script   │────▶│    Database     │
└─────────────────┘     └──────────────────┘     └─────────────────┘
         │                       │                        ▲
         │                       ▼                        │
         │              ┌──────────────────┐              │
         └─────────────▶│  Object Storage  │              │
                        │    (S3/MinIO)    │              │
                        └──────────────────┘              │
                                 │                        │
                                 ▼                        │
                        ┌──────────────────┐              │
                        │  Metadata in DB  │──────────────┘
                        └──────────────────┘

Implementation Options

Option 1: AWS S3

# S3 bucket structure
s3://ebisu-vessel-data/
├── sources/
│   ├── usa-alaska/
│   │   ├── 2025-01-01/data.csv
│   │   └── 2025-01-01/metadata.json
│   └── eu-fleet/
│       ├── spain/2025-01-01/data.csv
│       └── france/2025-01-01/data.csv
└── archive/
    └── [historical versions]
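
With this layout, uploading a new registry file is a single object upload keyed by source and date. Below is a minimal sketch using boto3; the bucket name follows the structure above, but the helper function, key layout, and boto3 dependency are assumptions rather than existing Ebisu tooling.

# Sketch: upload a registry file into the bucket layout above
# (assumes boto3 is installed and AWS credentials are available in the environment)
import boto3

def upload_source_file(source: str, source_date: str, local_path: str, filename: str) -> str:
    s3 = boto3.client("s3")
    key = f"sources/{source}/{source_date}/{filename}"
    s3.upload_file(local_path, "ebisu-vessel-data", key)
    return f"s3://ebisu-vessel-data/{key}"

# Example: upload_source_file("usa-alaska", "2025-01-01", "data.csv", "data.csv")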

Option 2: MinIO (Self-Hosted)

# docker-compose addition
services:
  minio:
    image: minio/minio:latest
    ports:
      - "9000:9000"
      - "9001:9001"
    environment:
      MINIO_ROOT_USER: ebisu_admin
      MINIO_ROOT_PASSWORD: ${MINIO_PASSWORD}
    volumes:
      - ./data/minio:/data
    command: server /data --console-address ":9001"
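
Because MinIO speaks the S3 API, the same client code from Option 1 works against the self-hosted container by overriding the endpoint. A minimal sketch, assuming boto3 and the credentials from the compose file above:

# Sketch: connect to the self-hosted MinIO instance with a standard S3 client
import os
import boto3

minio = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",                 # MinIO API port from docker-compose
    aws_access_key_id="ebisu_admin",                      # MINIO_ROOT_USER
    aws_secret_access_key=os.environ["MINIO_PASSWORD"],   # MINIO_ROOT_PASSWORD
)
minio.create_bucket(Bucket="ebisu-vessel-data")           # one-time setup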

Option 3: PostgreSQL Large Objects (Simple but Limited)

-- Store files directly in database
CREATE TABLE data_files (
    file_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    source_id UUID REFERENCES original_sources_vessels(source_id),
    file_name TEXT NOT NULL,
    file_size BIGINT NOT NULL,
    file_hash TEXT NOT NULL,
    file_oid OID,                                   -- PostgreSQL large object ID
    uploaded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    metadata JSONB
);
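
With this option the file bytes live in the database itself, referenced by the large object OID in data_files. A minimal sketch of writing a file with psycopg2's large object support is below; the connection string and file name are placeholders, and the lobject API is assumed to be available (it needs a direct PostgreSQL connection, not a pooler that blocks large objects).

# Sketch: store a file as a PostgreSQL large object and register it in data_files
import hashlib
import psycopg2

conn = psycopg2.connect("dbname=ebisu")        # placeholder DSN

with open("Alaska_vessels_2025-01-01.csv", "rb") as f:
    data = f.read()

lobj = conn.lobject(0, "wb")                   # create a new large object
lobj.write(data)

with conn.cursor() as cur:
    cur.execute(
        "INSERT INTO data_files (file_name, file_size, file_hash, file_oid) "
        "VALUES (%s, %s, %s, %s)",
        ("Alaska_vessels_2025-01-01.csv", len(data),
         hashlib.sha256(data).hexdigest(), lobj.oid),
    )
conn.commit()

# Reading it back later: conn.lobject(oid, "rb").read()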

Metadata Tracking Schema

CREATE TABLE import_file_registry (
    registry_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    source_shortname TEXT NOT NULL,
    file_path TEXT NOT NULL,                    -- S3 path or local reference
    file_hash TEXT NOT NULL UNIQUE,
    file_size BIGINT NOT NULL,
    file_format TEXT NOT NULL CHECK (file_format IN ('csv', 'xlsx', 'xls', 'xml', 'json')),

    -- Storage location
    storage_type TEXT NOT NULL CHECK (storage_type IN ('s3', 'minio', 'local', 'pg_largeobject')),
    storage_uri TEXT NOT NULL,                  -- Full URI to access the file

    -- Import tracking
    import_status TEXT NOT NULL DEFAULT 'pending',
    import_batch_id UUID REFERENCES intelligence_import_batches(batch_id),
    imported_at TIMESTAMP,

    -- Data profile
    record_count INTEGER,
    column_count INTEGER,
    columns JSONB,                              -- Array of column names/types
    data_sample JSONB,                          -- First few rows for preview

    -- Metadata
    source_date DATE,
    collection_date DATE NOT NULL DEFAULT CURRENT_DATE,
    expires_at DATE,                            -- For temporary/licensed data
    data_license TEXT,
    notes TEXT,

    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_import_files_source ON import_file_registry(source_shortname);
CREATE INDEX idx_import_files_status ON import_file_registry(import_status);
CREATE INDEX idx_import_files_hash ON import_file_registry(file_hash);
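
The UNIQUE constraint on file_hash is what makes deduplication cheap: registering a file becomes an insert that silently no-ops when the same content has already been seen. A minimal sketch, assuming psycopg2 and a caller that has already uploaded the file to one of the storage options above; the function name and 's3' storage_type are illustrative.

# Sketch: register an uploaded file and skip duplicates via the file_hash constraint
import hashlib
import os
import psycopg2

def register_file(conn, source: str, local_path: str, storage_uri: str, file_format: str):
    with open(local_path, "rb") as f:
        file_hash = hashlib.sha256(f.read()).hexdigest()
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO import_file_registry
                (source_shortname, file_path, file_hash, file_size, file_format,
                 storage_type, storage_uri)
            VALUES (%s, %s, %s, %s, %s, 's3', %s)
            ON CONFLICT (file_hash) DO NOTHING
            RETURNING registry_id
            """,
            (source, local_path, file_hash, os.path.getsize(local_path),
             file_format, storage_uri),
        )
        row = cur.fetchone()
    conn.commit()
    return row[0] if row else None   # None means this exact file was already registered

# Example: register_file(psycopg2.connect("dbname=ebisu"), "usa-alaska",
#                        "data.csv", "s3://ebisu-vessel-data/sources/usa-alaska/2025-01-01/data.csv", "csv")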

Benefits

  1. Git Repository Stays Clean

    • Only code, scripts, and documentation in Git
    • No binary files bloating the repository
  2. Better Data Management

    • Versioning through object storage
    • Deduplication via hash checking
    • Expiration for temporary data
  3. Scalability

    • Can handle TB of vessel data
    • Parallel processing support
    • CDN distribution possible
  4. Security

    • Signed URLs for temporary access (see the sketch after this list)
    • Encryption at rest
    • Access control per source
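
For the signed-URL point, object storage can hand out short-lived download links without exposing credentials. A minimal sketch using boto3's presigned URLs; the bucket and key follow the layout above, and the expiry value is an arbitrary example:

# Sketch: generate a temporary download link for a stored registry file
import boto3

s3 = boto3.client("s3")
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "ebisu-vessel-data", "Key": "sources/usa-alaska/2025-01-01/data.csv"},
    ExpiresIn=3600,   # link valid for one hour
)
print(url)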

Migration Plan

  1. Set up object storage (S3 or MinIO)
  2. Create metadata tracking tables
  3. Write migration script to move existing files (see the sketch after this list)
  4. Update import scripts to use new storage
  5. Remove large files from Git history with BFG Repo-Cleaner
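
For step 3, the migration can be a one-off script that walks the current local files and pushes each one into the bucket. A minimal sketch, assuming boto3 and that the local files are kept in a data/sources/ tree mirroring the proposed bucket layout; the directory name and bucket are illustrative, and registering each file in import_file_registry would follow the sketch in the metadata section above.

# Sketch: one-off migration of existing local files into the S3 bucket
from pathlib import Path
import boto3

s3 = boto3.client("s3")
data_root = Path("data/sources")            # assumed location of the current files

for path in sorted(data_root.rglob("*")):
    if path.is_file():
        key = f"sources/{path.relative_to(data_root).as_posix()}"
        s3.upload_file(str(path), "ebisu-vessel-data", key)
        print(f"migrated {path} -> s3://ebisu-vessel-data/{key}")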

Example Usage

# Upload new data file
./scripts/data-manager.sh upload \
    --source "usa-alaska" \
    --file "Alaska_vessels_2025-02-01.csv" \
    --date "2025-02-01"

# List available files
./scripts/data-manager.sh list --source "usa-alaska"

# Import from storage
./scripts/import-from-storage.sh --source "usa-alaska" --date "latest"