
Import System Reorganization Plan

Current Issues

  1. Disorganized Scripts: 60+ import scripts scattered across multiple directories
  2. Data in Git: 152 data files (61MB+) bloating the repository
  3. No Standardization: Each importer has different patterns and error handling
  4. Poor Discoverability: Hard to find what sources are available or how to import them
  5. No Metadata: Missing documentation about data sources, update frequency, licensing

Proposed Structure

1. Self-Contained Source Modules

ebisu-v1/
├── import-system/
│   ├── core/                        # Core import framework
│   │   ├── base_importer.sh         # Base class all importers inherit
│   │   ├── intelligence_loader.py   # Standard intelligence conversion
│   │   ├── validators.py            # Common validation functions
│   │   └── storage_client.py        # Object storage interface
│   │
│   ├── registry.yaml                # Master registry of all sources
│   │
│   └── sources/                     # One directory per source
│       ├── usa-alaska/
│       │   ├── metadata.yaml        # Source configuration
│       │   ├── import.sh            # Import script (inherits base)
│       │   ├── transform.py         # Optional custom transformation
│       │   ├── validate.sql         # Optional custom validation
│       │   ├── schema.sql           # Expected data schema
│       │   ├── README.md            # Documentation
│       │   └── tests/               # Test files and scripts
│       │
│       ├── chile-ltp-pep/
│       │   └── ...
│       │
│       └── eu-fleet/
│           ├── metadata.yaml        # Parent metadata
│           ├── import.sh            # Orchestrates country imports
│           └── countries/           # Sub-sources
│               ├── spain/
│               └── france/
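
The registry's exact schema is not defined in this plan; assuming it is simply a list of source directory names, the core framework could use it together with each source's metadata.yaml to answer "what sources exist and how often do they update". A minimal Python sketch (the script name and field access are illustrative):

# Illustrative catalogue script; the registry schema (a plain list of source names) is assumed
from pathlib import Path

import yaml  # PyYAML

SOURCES_DIR = Path("import-system/sources")

def list_sources(registry_path: str = "import-system/registry.yaml") -> None:
    registry = yaml.safe_load(Path(registry_path).read_text())
    for name in registry.get("sources", []):
        meta = yaml.safe_load((SOURCES_DIR / name / "metadata.yaml").read_text())
        source, data = meta["source"], meta["data"]
        print(f"{source['shortname']:<10} {source['full_name']:<40} updates: {data['update_frequency']}")

if __name__ == "__main__":
    list_sources()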

2. Metadata Standard (metadata.yaml)

source:
  shortname: USA_AK
  full_name: "USA Alaska Vessel Registry"
  type: COUNTRY
  authority_level: AUTHORITATIVE
  country_code: USA
  region: Alaska

provider:
  name: "Alaska Department of Fish and Game"
  website: "https://www.adfg.alaska.gov"
  contact: "vessels@adfg.alaska.gov"

data:
  format: csv
  encoding: utf-8
  delimiter: ","
  update_frequency: monthly
  license: "Public Domain"

fields:
  vessel_name:
    source_column: "VESSEL NAME"
    type: string
    required: true
  adfg_number:
    source_column: "ADFG NUMBER"
    type: string
    required: true
  length:
    source_column: "LENGTH"
    type: numeric
    unit: feet

storage:
  type: s3                  # or 'local', 'minio', 'url'
  bucket: ebisu-vessel-data
  prefix: sources/usa-alaska/

import:
  schedule: "0 2 1 * *"     # Monthly on the 1st at 2am
  notifications:
    - email: data-team@example.com
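
The fields block doubles as a machine-readable validation spec. As a rough sketch (the function name and return shape are assumptions, not part of this plan), a shared validators.py could apply it to an incoming CSV like this:

# validators.py sketch -- checks required columns and numeric fields against metadata.yaml
import csv
from pathlib import Path

import yaml  # PyYAML

def validate_rows(metadata_path: str, csv_path: str) -> list:
    """Return human-readable problems found in the CSV; an empty list means the file is clean."""
    meta = yaml.safe_load(Path(metadata_path).read_text())
    fields = meta["fields"]
    problems = []
    with open(csv_path, newline="", encoding=meta["data"]["encoding"]) as fh:
        reader = csv.DictReader(fh, delimiter=meta["data"]["delimiter"])
        for line_no, row in enumerate(reader, start=2):  # row 1 is the header
            for name, spec in fields.items():
                value = (row.get(spec["source_column"]) or "").strip()
                if spec.get("required") and not value:
                    problems.append(f"row {line_no}: missing {name}")
                elif value and spec.get("type") == "numeric":
                    try:
                        float(value)
                    except ValueError:
                        problems.append(f"row {line_no}: {name} is not numeric")
    return problems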

3. Storage Options

Option A: AWS S3

# Configure AWS CLI
aws configure set aws_access_key_id $AWS_KEY
aws configure set aws_secret_access_key $AWS_SECRET

# Upload data
aws s3 cp Alaska_vessels_2025-02-01.csv \
s3://ebisu-vessel-data/sources/usa-alaska/2025-02-01/data.csv

# Import latest
./import-system/sources/usa-alaska/import.sh
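
Within the framework, storage_client.py would wrap these same operations. A minimal sketch of the S3 side using boto3, reusing the bucket and prefix declared in metadata.yaml (the function name is illustrative, not part of this plan):

# storage_client.py sketch -- S3 backend only
import boto3

def fetch_latest(bucket: str, prefix: str, dest: str) -> str:
    """Download the most recently modified object under prefix and return its key."""
    s3 = boto3.client("s3")
    listing = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    objects = listing.get("Contents", [])
    if not objects:
        raise FileNotFoundError(f"no objects under s3://{bucket}/{prefix}")
    latest = max(objects, key=lambda obj: obj["LastModified"])
    s3.download_file(bucket, latest["Key"], dest)
    return latest["Key"]

# e.g. fetch_latest("ebisu-vessel-data", "sources/usa-alaska/", "/tmp/data.csv")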

Option B: MinIO (Self-Hosted S3)

# docker-compose.yml addition
services:
  minio:
    image: minio/minio
    ports:
      - "9000:9000"
      - "9001:9001"
    environment:
      MINIO_ROOT_USER: admin
      MINIO_ROOT_PASSWORD: ${MINIO_PASSWORD}
    volumes:
      - ./data/minio:/data
    command: server /data --console-address ":9001"
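
Because MinIO speaks the S3 API, the same boto3-based client can be pointed at the container by overriding the endpoint, for example:

# Point the S3 client at the local MinIO container instead of AWS
import os
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",                 # MinIO API port from the compose file
    aws_access_key_id="admin",                            # MINIO_ROOT_USER
    aws_secret_access_key=os.environ["MINIO_PASSWORD"],   # MINIO_ROOT_PASSWORD
)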

Option C: PostgreSQL Large Objects

-- Simple, but each large object is capped at 4 TB and the path
-- is read from the database server's own file system
SELECT lo_import('/path/to/file.csv');

4. Import Workflow
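
Every source follows the same sequence: fetch the latest file from object storage, validate it, transform it if needed, and load it into the database. A rough sketch of how an orchestrator such as import-all.sh might drive the post-fetch part of that sequence for one source (the wiring is an assumption; only the --file flag comes from the usage examples below):

# Rough per-source orchestration; the stage wiring is an assumption, not part of this plan
import csv
import subprocess
from pathlib import Path

def run_source(source_dir: str, data_file: str) -> None:
    """Run one source after its data file has been fetched from object storage."""
    source = Path(source_dir)

    # Cheap sanity check standing in for core/validators.py and the source's validate.sql
    with open(data_file, newline="", encoding="utf-8") as fh:
        row_count = sum(1 for _ in csv.DictReader(fh))
    if row_count == 0:
        raise RuntimeError(f"{source.name}: fetched file contains no data rows")

    # Delegate transformation and the database load to the source's own import script
    subprocess.run(["bash", str(source / "import.sh"), "--file", data_file], check=True)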

5. Usage Examples

# Import specific source
./import-system/sources/usa-alaska/import.sh

# Import with specific file
./import-system/sources/usa-alaska/import.sh --file s3://bucket/custom.csv

# Import all sources
./import-system/import-all.sh

# Check source status
./import-system/status.sh usa-alaska

# Add new source
./import-system/new-source.sh --template country --name "canada-pacific"
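
new-source.sh only has to stamp out the per-source skeleton from section 1. A small sketch of that scaffolding step, written in Python purely for illustration (the real script and its templates are not specified here):

# Hypothetical scaffolding helper mirroring what new-source.sh might create
from pathlib import Path

STUB_FILES = ["metadata.yaml", "import.sh", "README.md"]

def new_source(name: str, root: str = "import-system/sources") -> Path:
    source = Path(root) / name
    (source / "tests").mkdir(parents=True, exist_ok=False)  # fails if the source already exists
    for filename in STUB_FILES:
        (source / filename).touch()  # real templates would be filled in from --template
    return source

# Example: new_source("canada-pacific")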

Implementation Steps

  1. Phase 1: Core Framework (Week 1)

    • Create base importer class
    • Set up storage interface
    • Create metadata schema
  2. Phase 2: Migration (Weeks 2-3)

    • Move existing importers to new structure
    • Extract data files to object storage
    • Create metadata files for each source
  3. Phase 3: Enhancement (Week 4)

    • Add scheduling support
    • Create monitoring dashboard
    • Add data quality metrics

Benefits

  1. Maintainability: Consistent structure and patterns
  2. Scalability: Easy to add new sources
  3. Discoverability: Clear registry of all sources
  4. Flexibility: Support multiple storage backends
  5. Quality: Built-in validation and testing
  6. Documentation: Self-documenting with metadata

Storage Cost Comparison

| Solution   | Setup Cost   | Monthly Cost (1 TB) | Pros              | Cons                 |
|------------|--------------|---------------------|-------------------|----------------------|
| Git LFS    | Free         | $5 / 50 GB          | Simple            | Expensive at scale   |
| AWS S3     | Free         | $23                 | Industry standard | External dependency  |
| MinIO      | Server costs | Server costs only   | Self-hosted       | Maintenance required |
| PostgreSQL | Free         | Included in DB      | No external deps  | Limited features     |

Next Steps

  1. Choose storage solution
  2. Create migration plan for existing data
  3. Build core framework
  4. Migrate one source as proof of concept
  5. Document and train team