
Import System Reorganization Plan

Current Issues

  1. Disorganized Scripts: 60+ import scripts scattered across multiple directories
  2. Data in Git: 152 data files (61MB+) bloating the repository
  3. No Standardization: Each importer has different patterns and error handling
  4. Poor Discoverability: Hard to find what sources are available or how to import them
  5. No Metadata: Missing documentation about data sources, update frequency, licensing

Proposed Structure

1. Self-Contained Source Modules

ebisu-v1/
├── import-system/
│   ├── core/                        # Core import framework
│   │   ├── base_importer.sh         # Base class all importers inherit
│   │   ├── intelligence_loader.py   # Standard intelligence conversion
│   │   ├── validators.py            # Common validation functions
│   │   └── storage_client.py        # Object storage interface
│   │
│   ├── registry.yaml                # Master registry of all sources
│   │
│   └── sources/                     # One directory per source
│       ├── usa-alaska/
│       │   ├── metadata.yaml        # Source configuration
│       │   ├── import.sh            # Import script (inherits base)
│       │   ├── transform.py         # Optional custom transformation
│       │   ├── validate.sql         # Optional custom validation
│       │   ├── schema.sql           # Expected data schema
│       │   ├── README.md            # Documentation
│       │   └── tests/               # Test files and scripts
│       │
│       ├── chile-ltp-pep/
│       │   └── ...
│       │
│       └── eu-fleet/
│           ├── metadata.yaml        # Parent metadata
│           ├── import.sh            # Orchestrates country imports
│           └── countries/           # Sub-sources
│               ├── spain/
│               └── france/
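
The registry's exact schema is not defined in this plan; assuming it is simply a list of source directory names, the core framework could use it together with each source's metadata.yaml to answer "what sources exist and how often do they update". A minimal Python sketch (the script name and field access are illustrative):

# Illustrative catalogue script; the registry schema (a plain list of source names) is assumed
from pathlib import Path

import yaml  # PyYAML

SOURCES_DIR = Path("import-system/sources")

def list_sources(registry_path: str = "import-system/registry.yaml") -> None:
    registry = yaml.safe_load(Path(registry_path).read_text())
    for name in registry.get("sources", []):
        meta = yaml.safe_load((SOURCES_DIR / name / "metadata.yaml").read_text())
        source, data = meta["source"], meta["data"]
        print(f"{source['shortname']:<10} {source['full_name']:<40} updates: {data['update_frequency']}")

if __name__ == "__main__":
    list_sources()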

2. Metadata Standard (metadata.yaml)

source:
  shortname: USA_AK
  full_name: "USA Alaska Vessel Registry"
  type: COUNTRY
  authority_level: AUTHORITATIVE
  country_code: USA
  region: Alaska

provider:
  name: "Alaska Department of Fish and Game"
  website: "https://www.adfg.alaska.gov"
  contact: "vessels@adfg.alaska.gov"

data:
  format: csv
  encoding: utf-8
  delimiter: ","
  update_frequency: monthly
  license: "Public Domain"

fields:
  vessel_name:
    source_column: "VESSEL NAME"
    type: string
    required: true
  adfg_number:
    source_column: "ADFG NUMBER"
    type: string
    required: true
  length:
    source_column: "LENGTH"
    type: numeric
    unit: feet

storage:
  type: s3                  # or 'local', 'minio', 'url'
  bucket: ebisu-vessel-data
  prefix: sources/usa-alaska/

import:
  schedule: "0 2 1 * *"     # Monthly on the 1st at 2am
  notifications:
    - email: data-team@example.com
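
The fields block doubles as a machine-readable validation spec. As a rough sketch (the function name and return shape are assumptions, not part of this plan), a shared validators.py could apply it to an incoming CSV like this:

# validators.py sketch -- checks required columns and numeric fields against metadata.yaml
import csv
from pathlib import Path

import yaml  # PyYAML

def validate_rows(metadata_path: str, csv_path: str) -> list:
    """Return human-readable problems found in the CSV; an empty list means the file is clean."""
    meta = yaml.safe_load(Path(metadata_path).read_text())
    fields = meta["fields"]
    problems = []
    with open(csv_path, newline="", encoding=meta["data"]["encoding"]) as fh:
        reader = csv.DictReader(fh, delimiter=meta["data"]["delimiter"])
        for line_no, row in enumerate(reader, start=2):  # row 1 is the header
            for name, spec in fields.items():
                value = (row.get(spec["source_column"]) or "").strip()
                if spec.get("required") and not value:
                    problems.append(f"row {line_no}: missing {name}")
                elif value and spec.get("type") == "numeric":
                    try:
                        float(value)
                    except ValueError:
                        problems.append(f"row {line_no}: {name} is not numeric")
    return problems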

3. Storage Options

Option A: AWS S3

# Configure AWS CLI
aws configure set aws_access_key_id $AWS_KEY
aws configure set aws_secret_access_key $AWS_SECRET

# Upload data
aws s3 cp Alaska_vessels_2025-02-01.csv \
s3://ebisu-vessel-data/sources/usa-alaska/2025-02-01/data.csv

# Import latest
./import-system/sources/usa-alaska/import.sh
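
Within the framework, storage_client.py would wrap these same operations. A minimal sketch of the S3 side using boto3, reusing the bucket and prefix declared in metadata.yaml (the function name is illustrative, not part of this plan):

# storage_client.py sketch -- S3 backend only
import boto3

def fetch_latest(bucket: str, prefix: str, dest: str) -> str:
    """Download the most recently modified object under prefix and return its key."""
    s3 = boto3.client("s3")
    listing = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    objects = listing.get("Contents", [])
    if not objects:
        raise FileNotFoundError(f"no objects under s3://{bucket}/{prefix}")
    latest = max(objects, key=lambda obj: obj["LastModified"])
    s3.download_file(bucket, latest["Key"], dest)
    return latest["Key"]

# e.g. fetch_latest("ebisu-vessel-data", "sources/usa-alaska/", "/tmp/data.csv")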

Option B: MinIO (Self-Hosted S3)

# docker-compose.yml addition
services:
  minio:
    image: minio/minio
    ports:
      - "9000:9000"
      - "9001:9001"
    environment:
      MINIO_ROOT_USER: admin
      MINIO_ROOT_PASSWORD: ${MINIO_PASSWORD}
    volumes:
      - ./data/minio:/data
    command: server /data --console-address ":9001"
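
Because MinIO speaks the S3 API, the same boto3-based client can be pointed at the container by overriding the endpoint, for example:

# Point the S3 client at the local MinIO container instead of AWS
import os
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",                 # MinIO API port from the compose file
    aws_access_key_id="admin",                            # MINIO_ROOT_USER
    aws_secret_access_key=os.environ["MINIO_PASSWORD"],   # MINIO_ROOT_PASSWORD
)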

Option C: PostgreSQL Large Objects

-- Simple, but each large object is capped at 4 TB and the path
-- is read from the database server's own file system
SELECT lo_import('/path/to/file.csv');

4. Import Workflow
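
Every source follows the same sequence: fetch the latest file from object storage, validate it, transform it if needed, and load it into the database. A rough sketch of how an orchestrator such as import-all.sh might drive the post-fetch part of that sequence for one source (the wiring is an assumption; only the --file flag comes from the usage examples below):

# Rough per-source orchestration; the stage wiring is an assumption, not part of this plan
import csv
import subprocess
from pathlib import Path

def run_source(source_dir: str, data_file: str) -> None:
    """Run one source after its data file has been fetched from object storage."""
    source = Path(source_dir)

    # Cheap sanity check standing in for core/validators.py and the source's validate.sql
    with open(data_file, newline="", encoding="utf-8") as fh:
        row_count = sum(1 for _ in csv.DictReader(fh))
    if row_count == 0:
        raise RuntimeError(f"{source.name}: fetched file contains no data rows")

    # Delegate transformation and the database load to the source's own import script
    subprocess.run(["bash", str(source / "import.sh"), "--file", data_file], check=True)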

5. Usage Examples

# Import specific source
./import-system/sources/usa-alaska/import.sh

# Import with specific file
./import-system/sources/usa-alaska/import.sh --file s3://bucket/custom.csv

# Import all sources
./import-system/import-all.sh

# Check source status
./import-system/status.sh usa-alaska

# Add new source
./import-system/new-source.sh --template country --name "canada-pacific"
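
new-source.sh only has to stamp out the per-source skeleton from section 1. A small sketch of that scaffolding step, written in Python purely for illustration (the real script and its templates are not specified here):

# Hypothetical scaffolding helper mirroring what new-source.sh might create
from pathlib import Path

STUB_FILES = ["metadata.yaml", "import.sh", "README.md"]

def new_source(name: str, root: str = "import-system/sources") -> Path:
    source = Path(root) / name
    (source / "tests").mkdir(parents=True, exist_ok=False)  # fails if the source already exists
    for filename in STUB_FILES:
        (source / filename).touch()  # real templates would be filled in from --template
    return source

# Example: new_source("canada-pacific")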

Implementation Steps

  1. Phase 1: Core Framework (Week 1)

    • Create base importer class
    • Set up storage interface
    • Create metadata schema
  2. Phase 2: Migration (Weeks 2-3)

    • Move existing importers to new structure
    • Extract data files to object storage
    • Create metadata files for each source
  3. Phase 3: Enhancement (Week 4)

    • Add scheduling support
    • Create monitoring dashboard
    • Add data quality metrics

Benefits

  1. Maintainability: Consistent structure and patterns
  2. Scalability: Easy to add new sources
  3. Discoverability: Clear registry of all sources
  4. Flexibility: Support multiple storage backends
  5. Quality: Built-in validation and testing
  6. Documentation: Self-documenting with metadata

Storage Cost Comparison

| Solution   | Setup Cost   | Monthly Cost (1 TB) | Pros              | Cons                 |
|------------|--------------|---------------------|-------------------|----------------------|
| Git LFS    | Free         | $5 / 50 GB          | Simple            | Expensive at scale   |
| AWS S3     | Free         | $23                 | Industry standard | External dependency  |
| MinIO      | Server costs | Server costs only   | Self-hosted       | Maintenance required |
| PostgreSQL | Free         | Included in DB      | No external deps  | Limited features     |

Next Steps

  1. Choose storage solution
  2. Create migration plan for existing data
  3. Build core framework
  4. Migrate one source as proof of concept
  5. Document and train team