# Import System Reorganization Plan
## Current Issues
- Disorganized Scripts: 60+ import scripts scattered across multiple directories
- Data in Git: 152 data files (61MB+) bloating the repository
- No Standardization: Each importer has different patterns and error handling
- Poor Discoverability: Hard to find what sources are available or how to import them
- No Metadata: Missing documentation about data sources, update frequency, licensing
## Proposed Structure
### 1. Self-Contained Source Modules
```text
ebisu-v1/
├── import-system/
│   ├── core/                       # Core import framework
│   │   ├── base_importer.sh        # Base class all importers inherit
│   │   ├── intelligence_loader.py  # Standard intelligence conversion
│   │   ├── validators.py           # Common validation functions
│   │   └── storage_client.py       # Object storage interface
│   │
│   ├── registry.yaml               # Master registry of all sources
│   │
│   └── sources/                    # One directory per source
│       ├── usa-alaska/
│       │   ├── metadata.yaml       # Source configuration
│       │   ├── import.sh           # Import script (inherits base)
│       │   ├── transform.py        # Optional custom transformation
│       │   ├── validate.sql        # Optional custom validation
│       │   ├── schema.sql          # Expected data schema
│       │   ├── README.md           # Documentation
│       │   └── tests/              # Test files and scripts
│       │
│       ├── chile-ltp-pep/
│       │   └── ...
│       │
│       └── eu-fleet/
│           ├── metadata.yaml       # Parent metadata
│           ├── import.sh           # Orchestrates country imports
│           └── countries/          # Sub-sources
│               ├── spain/
│               └── france/
```
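The registry is only named above, not yet specified. A minimal sketch of what one entry per source could look like, assuming the field names below (`path`, `type`, `enabled`, `sub_sources`), which are placeholders rather than a settled schema:

```yaml
# import-system/registry.yaml -- illustrative sketch; field names are assumptions
sources:
  usa-alaska:
    path: sources/usa-alaska
    type: COUNTRY
    enabled: true
  chile-ltp-pep:
    path: sources/chile-ltp-pep
    type: COUNTRY
    enabled: true
  eu-fleet:
    path: sources/eu-fleet
    type: FLEET
    enabled: true
    sub_sources: [spain, france]
```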
### 2. Metadata Standard (metadata.yaml)
```yaml
source:
  shortname: USA_AK
  full_name: "USA Alaska Vessel Registry"
  type: COUNTRY
  authority_level: AUTHORITATIVE
  country_code: USA
  region: Alaska

provider:
  name: "Alaska Department of Fish and Game"
  website: "https://www.adfg.alaska.gov"
  contact: "vessels@adfg.alaska.gov"

data:
  format: csv
  encoding: utf-8
  delimiter: ","
  update_frequency: monthly
  license: "Public Domain"

fields:
  vessel_name:
    source_column: "VESSEL NAME"
    type: string
    required: true
  adfg_number:
    source_column: "ADFG NUMBER"
    type: string
    required: true
  length:
    source_column: "LENGTH"
    type: numeric
    unit: feet

storage:
  type: s3                  # or 'local', 'minio', 'url'
  bucket: ebisu-vessel-data
  prefix: sources/usa-alaska/

import:
  schedule: "0 2 1 * *"     # Monthly on the 1st at 2am
  notifications:
    - email: data-team@example.com
```
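The `fields` block is what a shared validator (`core/validators.py`) would read when checking an incoming file. A minimal sketch of that metadata-driven check, assuming PyYAML and a `validate_csv` entry point whose name and return shape are placeholders, not the final interface:

```python
# validators.py sketch: check a CSV against the fields declared in metadata.yaml
# (function name and return shape are assumptions, not the final interface)
import csv
import yaml

def validate_csv(metadata_path: str, csv_path: str) -> list[str]:
    """Return a list of human-readable validation errors (empty list = OK)."""
    with open(metadata_path, encoding="utf-8") as f:
        meta = yaml.safe_load(f)

    fields = meta["fields"]
    errors: list[str] = []

    with open(csv_path, encoding=meta["data"].get("encoding", "utf-8"), newline="") as f:
        reader = csv.DictReader(f, delimiter=meta["data"].get("delimiter", ","))
        header = reader.fieldnames or []

        # Every declared source column must exist in the file
        for name, spec in fields.items():
            if spec["source_column"] not in header:
                errors.append(f"missing column {spec['source_column']!r} (field {name})")

        # Required fields must be non-empty; numeric fields must parse
        for line_no, row in enumerate(reader, start=2):
            for name, spec in fields.items():
                value = (row.get(spec["source_column"]) or "").strip()
                if spec.get("required") and not value:
                    errors.append(f"line {line_no}: {name} is required but empty")
                if value and spec.get("type") == "numeric":
                    try:
                        float(value)
                    except ValueError:
                        errors.append(f"line {line_no}: {name} is not numeric: {value!r}")
    return errors
```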
### 3. Storage Options
#### Option A: AWS S3 (Recommended)
```bash
# Configure AWS CLI
aws configure set aws_access_key_id $AWS_KEY
aws configure set aws_secret_access_key $AWS_SECRET

# Upload data
aws s3 cp Alaska_vessels_2025-02-01.csv \
  s3://ebisu-vessel-data/sources/usa-alaska/2025-02-01/data.csv

# Import latest
./import-system/sources/usa-alaska/import.sh
```
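In the reorganized layout, `core/storage_client.py` would wrap this pattern so individual importers never shell out to the AWS CLI. A minimal sketch using boto3, assuming the date-prefixed upload layout shown above; the class and method names are placeholders:

```python
# storage_client.py sketch: fetch the newest uploaded file for a source
# (class/method names are placeholders; only the boto3 calls are real APIs)
import boto3

class StorageClient:
    def __init__(self, bucket: str):
        self.bucket = bucket
        self.s3 = boto3.client("s3")

    def latest_key(self, prefix: str) -> str:
        """Return the most recently modified key under a source prefix."""
        paginator = self.s3.get_paginator("list_objects_v2")
        newest = None
        for page in paginator.paginate(Bucket=self.bucket, Prefix=prefix):
            for obj in page.get("Contents", []):
                if newest is None or obj["LastModified"] > newest["LastModified"]:
                    newest = obj
        if newest is None:
            raise FileNotFoundError(f"no objects under s3://{self.bucket}/{prefix}")
        return newest["Key"]

    def download(self, key: str, dest: str) -> str:
        self.s3.download_file(self.bucket, key, dest)
        return dest

# Example: pull the latest Alaska file before running the importer
# client = StorageClient("ebisu-vessel-data")
# key = client.latest_key("sources/usa-alaska/")
# client.download(key, "/tmp/usa-alaska.csv")
```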
#### Option B: MinIO (Self-Hosted S3)
```yaml
# docker-compose.yml addition
services:
  minio:
    image: minio/minio
    ports:
      - "9000:9000"
      - "9001:9001"
    environment:
      MINIO_ROOT_USER: admin
      MINIO_ROOT_PASSWORD: ${MINIO_PASSWORD}
    volumes:
      - ./data/minio:/data
    command: server /data --console-address ":9001"
```
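Because MinIO exposes the S3 API, the same client code works against it; the only change is pointing at the MinIO endpoint and credentials from the compose file. A minimal sketch (the password value is a placeholder):

```python
import boto3

# Same S3 client, pointed at local MinIO instead of AWS
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",   # MinIO API port from the compose file
    aws_access_key_id="admin",              # MINIO_ROOT_USER
    aws_secret_access_key="change-me",      # MINIO_ROOT_PASSWORD (placeholder)
)
s3.upload_file(
    "Alaska_vessels_2025-02-01.csv",
    "ebisu-vessel-data",
    "sources/usa-alaska/2025-02-01/data.csv",
)
```

The AWS CLI can target the same instance by adding `--endpoint-url http://localhost:9000` to the commands shown under Option A.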
#### Option C: PostgreSQL Large Objects
```sql
-- Simple but limited to 4TB per file
SELECT lo_import('/path/to/file.csv');
```
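One caveat with this option: `lo_import()` runs inside the database server and resolves the path on the server's filesystem (and is restricted to superusers by default). From a client machine, psql's `\lo_import` meta-command is the equivalent; the paths below are placeholders:

```sql
-- Server-side: path is resolved on the database server's filesystem
SELECT lo_import('/srv/imports/usa-alaska/data.csv');

-- Client-side equivalent from psql: path is resolved on the client machine
-- \lo_import '/local/path/data.csv'
```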
### 4. Import Workflow
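Whatever backend is chosen, each source's `import.sh` follows the same flow: fetch the newest file from storage, validate it against the fields in `metadata.yaml`, apply the optional transform, then load the result. A minimal sketch of that flow, assuming S3 storage and placeholder names for the staging table, connection string, and validator invocation:

```bash
#!/usr/bin/env bash
set -euo pipefail
# Sketch of the per-source flow; bucket/prefix would come from metadata.yaml
BUCKET="ebisu-vessel-data"
PREFIX="sources/usa-alaska/"
WORKDIR="$(mktemp -d)"

# 1. Fetch the newest object under the source prefix
KEY="$(aws s3api list-objects-v2 --bucket "$BUCKET" --prefix "$PREFIX" \
       --query 'sort_by(Contents, &LastModified)[-1].Key' --output text)"
aws s3 cp "s3://$BUCKET/$KEY" "$WORKDIR/data.csv"

# 2. Validate against the declared fields (invocation of the core validator is a placeholder)
python3 import-system/core/validators.py metadata.yaml "$WORKDIR/data.csv"

# 3. Optional source-specific transform
[ -f transform.py ] && python3 transform.py "$WORKDIR/data.csv"

# 4. Load into a staging table (table name and DATABASE_URL are placeholders)
psql "$DATABASE_URL" -c "\copy staging_vessels FROM '$WORKDIR/data.csv' CSV HEADER"
```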
### 5. Usage Examples
```bash
# Import specific source
./import-system/sources/usa-alaska/import.sh

# Import with specific file
./import-system/sources/usa-alaska/import.sh --file s3://bucket/custom.csv

# Import all sources
./import-system/import-all.sh

# Check source status
./import-system/status.sh usa-alaska

# Add new source
./import-system/new-source.sh --template country --name "canada-pacific"
```
## Implementation Steps
- Phase 1: Core Framework (Week 1)
  - Create base importer class
  - Set up storage interface
  - Create metadata schema
- Phase 2: Migration (Week 2-3)
  - Move existing importers to new structure
  - Extract data files to object storage
  - Create metadata files for each source
- Phase 3: Enhancement (Week 4)
  - Add scheduling support
  - Create monitoring dashboard
  - Add data quality metrics
## Benefits
- Maintainability: Consistent structure and patterns
- Scalability: Easy to add new sources
- Discoverability: Clear registry of all sources
- Flexibility: Support multiple storage backends
- Quality: Built-in validation and testing
- Documentation: Self-documenting with metadata
## Storage Cost Comparison
| Solution | Setup Cost | Monthly Cost (1TB) | Pros | Cons |
|---|---|---|---|---|
| Git LFS | Free | $5 per 50 GB (~$100 at 1 TB) | Simple | Expensive at scale |
| AWS S3 | Free | $23 | Industry standard | External dependency |
| MinIO | Server costs | Server costs only | Self-hosted | Maintenance required |
| PostgreSQL | Free | Included in DB | No external deps | Limited features |
## Next Steps
- Choose storage solution
- Create migration plan for existing data
- Build core framework
- Migrate one source as proof of concept
- Document and train team