# Ebisu Data Storage Architecture

## Problem Statement

Ebisu ingests vessel registry data from many national sources, which creates several storage problems:
- Large vessel registry files (10MB-100MB+ each)
- Git repository bloat from binary files
- Need versioning and audit trail
- Multiple data formats (CSV, Excel, XML, JSON)
## Recommended Solution: Object Storage with Metadata Tracking

### Architecture
```
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   Data Source   │────▶│  Import Script   │────▶│    Database     │
└─────────────────┘     └──────────────────┘     └─────────────────┘
         │                        │                        ▲
         │                        ▼                        │
         │               ┌──────────────────┐              │
         └──────────────▶│  Object Storage  │              │
                         │    (S3/MinIO)    │              │
                         └──────────────────┘              │
                                  │                        │
                                  ▼                        │
                         ┌──────────────────┐              │
                         │  Metadata in DB  │──────────────┘
                         └──────────────────┘
```
## Implementation Options

### Option 1: AWS S3 (Recommended for Production)
```
# S3 bucket structure
s3://ebisu-vessel-data/
├── sources/
│   ├── usa-alaska/
│   │   ├── 2025-01-01/data.csv
│   │   └── 2025-01-01/metadata.json
│   └── eu-fleet/
│       ├── spain/2025-01-01/data.csv
│       └── france/2025-01-01/data.csv
└── archive/
    └── [historical versions]
```
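As a sketch of how an import script might place files into this layout, the following uses boto3 (the AWS SDK for Python) and assumes credentials are already configured via the environment or `~/.aws`. The `upload_source_file` helper and the metadata fields are illustrative assumptions, not existing Ebisu code; the bucket name and key layout come from the tree above.

```python
import hashlib
import json
from pathlib import Path

import boto3  # AWS SDK for Python

s3 = boto3.client("s3")
BUCKET = "ebisu-vessel-data"  # bucket name from the layout above


def upload_source_file(source: str, local_path: Path, source_date: str) -> str:
    """Hypothetical helper: upload a registry file plus a metadata sidecar."""
    data = local_path.read_bytes()
    key = f"sources/{source}/{source_date}/data{local_path.suffix}"

    s3.put_object(Bucket=BUCKET, Key=key, Body=data)

    # Sidecar metadata next to the data file, mirroring the tree above
    metadata = {
        "original_name": local_path.name,
        "sha256": hashlib.sha256(data).hexdigest(),
        "size_bytes": len(data),
    }
    s3.put_object(
        Bucket=BUCKET,
        Key=f"sources/{source}/{source_date}/metadata.json",
        Body=json.dumps(metadata).encode(),
    )
    return key
```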
### Option 2: MinIO (Self-Hosted)
```yaml
# docker-compose addition
services:
  minio:
    image: minio/minio:latest
    ports:
      - "9000:9000"
      - "9001:9001"
    environment:
      MINIO_ROOT_USER: ebisu_admin
      MINIO_ROOT_PASSWORD: ${MINIO_PASSWORD}
    volumes:
      - ./data/minio:/data
    command: server /data --console-address ":9001"
```
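Because MinIO speaks the S3 API, the same boto3 client works against the container above by overriding the endpoint. A minimal sketch, assuming the credentials from the compose file and that the bucket has already been created (e.g. via the console on port 9001):

```python
import os

import boto3

# MinIO is S3-compatible: reuse the standard S3 client, just point it
# at the local container instead of AWS.
minio = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="ebisu_admin",
    aws_secret_access_key=os.environ["MINIO_PASSWORD"],
)
minio.upload_file(
    "Alaska_vessels_2025-02-01.csv",
    "ebisu-vessel-data",
    "sources/usa-alaska/2025-02-01/data.csv",
)
```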
### Option 3: PostgreSQL Large Objects (Simple but Limited)
```sql
-- Store files directly in the database
CREATE TABLE data_files (
    file_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    source_id UUID REFERENCES original_sources_vessels(source_id),
    file_name TEXT NOT NULL,
    file_size BIGINT NOT NULL,
    file_hash TEXT NOT NULL,
    file_oid OID, -- PostgreSQL large object ID
    uploaded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    metadata JSONB
);
```
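This option keeps everything in one system but ties file I/O to database connections and inflates backups, hence "limited". A sketch of writing a file into this table with psycopg2's large-object support; the table and column names come from the DDL above, while the function name and DSN are assumptions.

```python
import hashlib
from pathlib import Path

import psycopg2  # assumes a reachable Ebisu database


def store_file_as_large_object(dsn: str, source_id: str, path: Path) -> None:
    data = path.read_bytes()
    with psycopg2.connect(dsn) as conn:
        # Create a new large object and stream the file into it
        lobj = conn.lobject(0, "wb")
        lobj.write(data)
        lobj.close()

        # Record the large object's OID alongside the file metadata
        with conn.cursor() as cur:
            cur.execute(
                """
                INSERT INTO data_files
                    (source_id, file_name, file_size, file_hash, file_oid)
                VALUES (%s, %s, %s, %s, %s)
                """,
                (source_id, path.name, len(data),
                 hashlib.sha256(data).hexdigest(), lobj.oid),
            )
```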
## Metadata Tracking Schema
```sql
CREATE TABLE import_file_registry (
    registry_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    source_shortname TEXT NOT NULL,
    file_path TEXT NOT NULL,              -- S3 path or local reference
    file_hash TEXT NOT NULL UNIQUE,
    file_size BIGINT NOT NULL,
    file_format TEXT NOT NULL CHECK (file_format IN ('csv', 'xlsx', 'xls', 'xml', 'json')),

    -- Storage location
    storage_type TEXT NOT NULL CHECK (storage_type IN ('s3', 'minio', 'local', 'pg_largeobject')),
    storage_uri TEXT NOT NULL,            -- Full URI to access the file

    -- Import tracking
    import_status TEXT NOT NULL DEFAULT 'pending',
    import_batch_id UUID REFERENCES intelligence_import_batches(batch_id),
    imported_at TIMESTAMP,

    -- Data profile
    record_count INTEGER,
    column_count INTEGER,
    columns JSONB,                        -- Array of column names/types
    data_sample JSONB,                    -- First few rows for preview

    -- Metadata
    source_date DATE,
    collection_date DATE NOT NULL DEFAULT CURRENT_DATE,
    expires_at DATE,                      -- For temporary/licensed data
    data_license TEXT,
    notes TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_import_files_source ON import_file_registry(source_shortname);
CREATE INDEX idx_import_files_status ON import_file_registry(import_status);
CREATE INDEX idx_import_files_hash ON import_file_registry(file_hash);
```
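To show how this schema supports deduplication, here is a hedged sketch of registering a file before import: hash the contents, skip if the hash is already known (the `UNIQUE` constraint on `file_hash` backs this up at the database level), otherwise insert a pending row. The function name and DSN are hypothetical.

```python
import hashlib
from pathlib import Path

import psycopg2


def register_file(dsn: str, source: str, path: Path, storage_uri: str) -> bool:
    """Return True if the file is new and was registered, False if a
    file with the same content hash already exists (deduplication)."""
    data = path.read_bytes()
    file_hash = hashlib.sha256(data).hexdigest()

    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT 1 FROM import_file_registry WHERE file_hash = %s",
            (file_hash,),
        )
        if cur.fetchone():
            return False  # identical content already registered

        # import_status defaults to 'pending' per the DDL above
        cur.execute(
            """
            INSERT INTO import_file_registry
                (source_shortname, file_path, file_hash, file_size,
                 file_format, storage_type, storage_uri)
            VALUES (%s, %s, %s, %s, %s, %s, %s)
            """,
            (source, str(path), file_hash, len(data),
             path.suffix.lstrip(".").lower(), "s3", storage_uri),
        )
    return True
```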
## Benefits

- **Git Repository Stays Clean**
  - Only code, scripts, and documentation in Git
  - No binary files bloating the repository
- **Better Data Management**
  - Versioning through object storage
  - Deduplication via hash checking
  - Expiration dates for temporary or licensed data
- **Scalability**
  - Can handle terabytes of vessel data
  - Supports parallel processing
  - CDN distribution possible
- **Security**
  - Signed URLs for temporary access (see the sketch below)
  - Encryption at rest
  - Access control per source
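For the signed-URL point above, a minimal boto3 sketch; the bucket and key follow the S3 layout from Option 1:

```python
import boto3

s3 = boto3.client("s3")

# Presigned URL: grants read access to one object for 15 minutes
# without exposing credentials or making the bucket public.
url = s3.generate_presigned_url(
    "get_object",
    Params={
        "Bucket": "ebisu-vessel-data",
        "Key": "sources/usa-alaska/2025-01-01/data.csv",
    },
    ExpiresIn=900,
)
print(url)
```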
## Migration Plan

1. Set up object storage (S3 or MinIO)
2. Create the metadata tracking tables
3. Write a migration script to move existing files (sketched below)
4. Update the import scripts to use the new storage
5. Remove large files from Git history with BFG Repo-Cleaner
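Step 3 could look roughly like the following: walk the existing data directory, upload each file, and record it in the registry. The directory layout, the connection string, and the reuse of the `upload_source_file` and `register_file` sketches from earlier sections are all assumptions about this repository, not documented facts.

```python
from pathlib import Path

# Hypothetical one-shot migration: assumes existing files live under
# data/<source>/<YYYY-MM-DD>/<file> in the working tree, and reuses the
# upload_source_file() and register_file() sketches defined above.
DATA_ROOT = Path("data")
DSN = "dbname=ebisu"  # illustrative connection string

for file_path in DATA_ROOT.glob("*/*/*"):
    if not file_path.is_file():
        continue
    if file_path.suffix.lower() not in {".csv", ".xlsx", ".xls", ".xml", ".json"}:
        continue
    source, source_date = file_path.parts[1], file_path.parts[2]
    key = upload_source_file(source, file_path, source_date)
    register_file(DSN, source, file_path,
                  storage_uri=f"s3://ebisu-vessel-data/{key}")
```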
## Example Usage

```bash
# Upload new data file
./scripts/data-manager.sh upload \
  --source "usa-alaska" \
  --file "Alaska_vessels_2025-02-01.csv" \
  --date "2025-02-01"

# List available files
./scripts/data-manager.sh list --source "usa-alaska"

# Import from storage
./scripts/import-from-storage.sh --source "usa-alaska" --date "latest"
```
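If `data-manager.sh` is backed by the registry table, its `list` command could reduce to a query like this (a sketch; the script internals are not specified in this document):

```python
import psycopg2

# Rough equivalent of `data-manager.sh list --source usa-alaska`:
# show what the registry knows about one source, newest first.
with psycopg2.connect("dbname=ebisu") as conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT file_path, file_format, file_size, import_status, source_date
        FROM import_file_registry
        WHERE source_shortname = %s
        ORDER BY source_date DESC NULLS LAST
        """,
        ("usa-alaska",),
    )
    for row in cur.fetchall():
        print(row)
```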