Ebisu Data Storage Architecture

Problem Statement

Managing raw vessel registry data alongside the Ebisu codebase raises several challenges:

  • Large vessel registry files (10MB-100MB+ each)
  • Git repository bloat from binary files
  • Need versioning and audit trail
  • Multiple data formats (CSV, Excel, XML, JSON)

Architecture

┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   Data Source   │────▶│  Import Script   │────▶│    Database     │
└─────────────────┘     └──────────────────┘     └─────────────────┘
         │                       │                        ▲
         │                       ▼                        │
         │              ┌──────────────────┐              │
         └─────────────▶│  Object Storage  │              │
                        │    (S3/MinIO)    │              │
                        └──────────────────┘              │
                                 │                        │
                                 ▼                        │
                        ┌──────────────────┐              │
                        │  Metadata in DB  │──────────────┘
                        └──────────────────┘

Implementation Options

Option 1: AWS S3

# S3 bucket structure
s3://ebisu-vessel-data/
├── sources/
│   ├── usa-alaska/
│   │   ├── 2025-01-01/data.csv
│   │   └── 2025-01-01/metadata.json
│   └── eu-fleet/
│       ├── spain/2025-01-01/data.csv
│       └── france/2025-01-01/data.csv
└── archive/
    └── [historical versions]
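
With this layout, uploading a new registry file is a single object upload keyed by source and date. Below is a minimal sketch using boto3; the bucket name follows the structure above, but the helper function, key layout, and boto3 dependency are assumptions rather than existing Ebisu tooling.

# Sketch: upload a registry file into the bucket layout above
# (assumes boto3 is installed and AWS credentials are available in the environment)
import boto3

def upload_source_file(source: str, source_date: str, local_path: str, filename: str) -> str:
    s3 = boto3.client("s3")
    key = f"sources/{source}/{source_date}/{filename}"
    s3.upload_file(local_path, "ebisu-vessel-data", key)
    return f"s3://ebisu-vessel-data/{key}"

# Example: upload_source_file("usa-alaska", "2025-01-01", "data.csv", "data.csv")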

Option 2: MinIO (Self-Hosted)

# docker-compose addition
services:
  minio:
    image: minio/minio:latest
    ports:
      - "9000:9000"
      - "9001:9001"
    environment:
      MINIO_ROOT_USER: ebisu_admin
      MINIO_ROOT_PASSWORD: ${MINIO_PASSWORD}
    volumes:
      - ./data/minio:/data
    command: server /data --console-address ":9001"
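
Because MinIO speaks the S3 API, the same client code from Option 1 works against the self-hosted container by overriding the endpoint. A minimal sketch, assuming boto3 and the credentials from the compose file above:

# Sketch: connect to the self-hosted MinIO instance with a standard S3 client
import os
import boto3

minio = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",                 # MinIO API port from docker-compose
    aws_access_key_id="ebisu_admin",                      # MINIO_ROOT_USER
    aws_secret_access_key=os.environ["MINIO_PASSWORD"],   # MINIO_ROOT_PASSWORD
)
minio.create_bucket(Bucket="ebisu-vessel-data")           # one-time setup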

Option 3: PostgreSQL Large Objects (Simple but Limited)

-- Store files directly in database
CREATE TABLE data_files (
    file_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    source_id UUID REFERENCES original_sources_vessels(source_id),
    file_name TEXT NOT NULL,
    file_size BIGINT NOT NULL,
    file_hash TEXT NOT NULL,
    file_oid OID,                                   -- PostgreSQL large object ID
    uploaded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    metadata JSONB
);
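
With this option the file bytes live in the database itself, referenced by the large object OID in data_files. A minimal sketch of writing a file with psycopg2's large object support is below; the connection string and file name are placeholders, and the lobject API is assumed to be available (it needs a direct PostgreSQL connection, not a pooler that blocks large objects).

# Sketch: store a file as a PostgreSQL large object and register it in data_files
import hashlib
import psycopg2

conn = psycopg2.connect("dbname=ebisu")        # placeholder DSN

with open("Alaska_vessels_2025-01-01.csv", "rb") as f:
    data = f.read()

lobj = conn.lobject(0, "wb")                   # create a new large object
lobj.write(data)

with conn.cursor() as cur:
    cur.execute(
        "INSERT INTO data_files (file_name, file_size, file_hash, file_oid) "
        "VALUES (%s, %s, %s, %s)",
        ("Alaska_vessels_2025-01-01.csv", len(data),
         hashlib.sha256(data).hexdigest(), lobj.oid),
    )
conn.commit()

# Reading it back later: conn.lobject(oid, "rb").read()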

Metadata Tracking Schema

CREATE TABLE import_file_registry (
    registry_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    source_shortname TEXT NOT NULL,
    file_path TEXT NOT NULL,                    -- S3 path or local reference
    file_hash TEXT NOT NULL UNIQUE,
    file_size BIGINT NOT NULL,
    file_format TEXT NOT NULL CHECK (file_format IN ('csv', 'xlsx', 'xls', 'xml', 'json')),

    -- Storage location
    storage_type TEXT NOT NULL CHECK (storage_type IN ('s3', 'minio', 'local', 'pg_largeobject')),
    storage_uri TEXT NOT NULL,                  -- Full URI to access the file

    -- Import tracking
    import_status TEXT NOT NULL DEFAULT 'pending',
    import_batch_id UUID REFERENCES intelligence_import_batches(batch_id),
    imported_at TIMESTAMP,

    -- Data profile
    record_count INTEGER,
    column_count INTEGER,
    columns JSONB,                              -- Array of column names/types
    data_sample JSONB,                          -- First few rows for preview

    -- Metadata
    source_date DATE,
    collection_date DATE NOT NULL DEFAULT CURRENT_DATE,
    expires_at DATE,                            -- For temporary/licensed data
    data_license TEXT,
    notes TEXT,

    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_import_files_source ON import_file_registry(source_shortname);
CREATE INDEX idx_import_files_status ON import_file_registry(import_status);
CREATE INDEX idx_import_files_hash ON import_file_registry(file_hash);
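
The UNIQUE constraint on file_hash is what makes deduplication cheap: registering a file becomes an insert that silently no-ops when the same content has already been seen. A minimal sketch, assuming psycopg2 and a caller that has already uploaded the file to one of the storage options above; the function name and 's3' storage_type are illustrative.

# Sketch: register an uploaded file and skip duplicates via the file_hash constraint
import hashlib
import os
import psycopg2

def register_file(conn, source: str, local_path: str, storage_uri: str, file_format: str):
    with open(local_path, "rb") as f:
        file_hash = hashlib.sha256(f.read()).hexdigest()
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO import_file_registry
                (source_shortname, file_path, file_hash, file_size, file_format,
                 storage_type, storage_uri)
            VALUES (%s, %s, %s, %s, %s, 's3', %s)
            ON CONFLICT (file_hash) DO NOTHING
            RETURNING registry_id
            """,
            (source, local_path, file_hash, os.path.getsize(local_path),
             file_format, storage_uri),
        )
        row = cur.fetchone()
    conn.commit()
    return row[0] if row else None   # None means this exact file was already registered

# Example: register_file(psycopg2.connect("dbname=ebisu"), "usa-alaska",
#                        "data.csv", "s3://ebisu-vessel-data/sources/usa-alaska/2025-01-01/data.csv", "csv")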

Benefits

  1. Git Repository Stays Clean

    • Only code, scripts, and documentation in Git
    • No binary files bloating the repository
  2. Better Data Management

    • Versioning through object storage
    • Deduplication via hash checking
    • Expiration for temporary data
  3. Scalability

    • Can handle TB of vessel data
    • Parallel processing support
    • CDN distribution possible
  4. Security

    • Signed URLs for temporary access (see the sketch after this list)
    • Encryption at rest
    • Access control per source
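
For the signed-URL point, object storage can hand out short-lived download links without exposing credentials. A minimal sketch using boto3's presigned URLs; the bucket and key follow the layout above, and the expiry value is an arbitrary example:

# Sketch: generate a temporary download link for a stored registry file
import boto3

s3 = boto3.client("s3")
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "ebisu-vessel-data", "Key": "sources/usa-alaska/2025-01-01/data.csv"},
    ExpiresIn=3600,   # link valid for one hour
)
print(url)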

Migration Plan

  1. Set up object storage (S3 or MinIO)
  2. Create metadata tracking tables
  3. Write migration script to move existing files (see the sketch after this list)
  4. Update import scripts to use new storage
  5. Remove large files from Git history with BFG Repo-Cleaner
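
For step 3, the migration can be a one-off script that walks the current local files and pushes each one into the bucket. A minimal sketch, assuming boto3 and that the local files are kept in a data/sources/ tree mirroring the proposed bucket layout; the directory name and bucket are illustrative, and registering each file in import_file_registry would follow the sketch in the metadata section above.

# Sketch: one-off migration of existing local files into the S3 bucket
from pathlib import Path
import boto3

s3 = boto3.client("s3")
data_root = Path("data/sources")            # assumed location of the current files

for path in sorted(data_root.rglob("*")):
    if path.is_file():
        key = f"sources/{path.relative_to(data_root).as_posix()}"
        s3.upload_file(str(path), "ebisu-vessel-data", key)
        print(f"migrated {path} -> s3://ebisu-vessel-data/{key}")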

Example Usage

# Upload new data file
./scripts/data-manager.sh upload \
    --source "usa-alaska" \
    --file "Alaska_vessels_2025-02-01.csv" \
    --date "2025-02-01"

# List available files
./scripts/data-manager.sh list --source "usa-alaska"

# Import from storage
./scripts/import-from-storage.sh --source "usa-alaska" --date "latest"