Source: oceanid/docs/architecture/project-plan.md
Oceanid Intelligence Staging — Project Plan
This document defines the end goal, architecture, scope, and milestones for the intelligence staging layer that supports SME pre‑labeling, annotation, and curation.
End Goal
- Provide a robust staging pipeline where:
  - SMEs upload documents, receive pre‑labels, and annotate in Label Studio.
  - Annotations are versioned and stored in a Hugging Face dataset repo.
  - Cleaned extractions (structured data) are persisted in Postgres with strong lineage.
  - Multiple raw sources land into per‑source tables, tracked for freshness and provenance.
  - Curated aggregates align to Ebisu’s production schema and can be published downstream.
Architecture Overview
High‑Level Dataflow
K8s + Calypso Components
Data Model (Medallion)
- raw
  - raw.<source>_documents(source_doc_id, fetched_at, content, content_sha, url, metadata, ingestion_run_id)
  - raw.
- stage
  - stage.documents(id, source_id, source_doc_id, collected_at, text, content_sha, metadata)
  - stage.extractions(document_id, label, value, start, end, confidence, db_mapping, annotator, updated_at)
- label
  - label.annotation_refs(project_id, task_id, hf_repo, path, commit_sha, annotated_at, schema_version)
- curated (Ebisu‑aligned)
  - curated.vessels, curated.vessel_info, curated.vessel_associates, curated.entity_persons, curated.entity_organizations
- control
  - control.sources(name, sla_minutes, enabled), control.ingestion_runs, control.schema_versions
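To make the lineage between the raw and stage layers concrete, the following is a minimal sketch of promoting one raw row into a stage.documents row. The column names come from the data model above; everything else is an assumption for illustration: the `RawDocument` and `StageDocument` classes, the `promote` helper, and in particular the choice of SHA-256 over UTF-8 text for `content_sha`, which this plan does not specify.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


def content_sha(content: str) -> str:
    # Assumption: content_sha is a SHA-256 hex digest of the UTF-8 text.
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


@dataclass
class RawDocument:
    # Mirrors the columns of a per-source raw documents table.
    source_doc_id: str
    fetched_at: datetime
    content: str
    url: str
    ingestion_run_id: int
    metadata: dict = field(default_factory=dict)
    content_sha: str = ""

    def __post_init__(self) -> None:
        # Compute the hash when the caller did not supply one.
        if not self.content_sha:
            self.content_sha = content_sha(self.content)


@dataclass
class StageDocument:
    # Mirrors stage.documents.
    id: int
    source_id: int
    source_doc_id: str
    collected_at: datetime
    text: str
    content_sha: str
    metadata: dict


def promote(raw: RawDocument, stage_id: int, source_id: int) -> StageDocument:
    # Hypothetical helper: carry url and ingestion_run_id into metadata so the
    # stage row can be traced back to the raw row and the run that fetched it.
    meta = dict(raw.metadata, url=raw.url, ingestion_run_id=raw.ingestion_run_id)
    return StageDocument(
        id=stage_id,
        source_id=source_id,
        source_doc_id=raw.source_doc_id,
        collected_at=raw.fetched_at,
        text=raw.content,
        content_sha=raw.content_sha,
        metadata=meta,
    )
```

Keeping `source_doc_id` and `content_sha` identical across both layers is what lets a stage row be joined back to its raw origin, and lets unchanged content be detected without comparing full text.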
Versioning & Evolution
- NER labels: managed via ESC; adapter maps indices using NER_LABELS; stage stores label strings.
- Schema version: stored per JSONL record and referenced by curated transformations.
- Model updates: re‑export ONNX and adjust Triton config dims; no change needed in stage.extractions.
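The index-to-label mapping described above can be sketched as follows. This is an illustrative outline, not the adapter's actual code: the function names are hypothetical, and the example label list in the usage note is made up. The only facts taken from this plan are that NER_LABELS is a JSON list whose order defines the index mapping, and that stage stores label strings rather than indices.

```python
import json


def load_labels(ner_labels_json: str) -> list:
    # Parse the JSON label list; list position is the model's class index.
    labels = json.loads(ner_labels_json)
    if not isinstance(labels, list) or not all(isinstance(x, str) for x in labels):
        raise ValueError("NER_LABELS must be a JSON list of strings")
    if len(set(labels)) != len(labels):
        raise ValueError("NER_LABELS must not contain duplicate labels")
    return labels


def indices_to_labels(indices, labels):
    # Convert predicted class indices into the label strings stored in stage.
    out = []
    for i in indices:
        if not 0 <= i < len(labels):
            raise IndexError(f"class index {i} is outside the label list")
        out.append(labels[i])
    return out
```

For example, with a hypothetical list `'["O", "VESSEL", "IMO"]'`, the indices `[1, 0, 2]` map to `["VESSEL", "O", "IMO"]`. Because stage keeps the strings, reordering or extending the label list affects only the adapter's mapping, not rows already written.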
Milestones
- GPU + Pre‑labels (Done)
  - Triton on GPU, Cloudflare tunnel, Adapter, DistilBERT support
- Annotations Sink (Done)
  - Write annotations to HF; persist cleaned extractions to PG
- Postgres Baseline (Planned)
  - Add control.* and raw.* schemas; add views for freshness and duplicates
- Curated Layer (Planned)
  - Add SQL/dbt jobs to populate curated.* aligned with Ebisu
- CI & QA (Planned)
  - Smoke tests for adapter/sink; data quality checks on stage/curated
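The freshness and duplicate views planned for the Postgres baseline are not defined in this document. As a rough sketch of the logic they might encode, the Python below works on in-memory rows. It assumes that a source is stale when its latest fetch is older than `sla_minutes` (or when it has never been fetched), and that duplicates are documents sharing a `content_sha`; both rules and both function names are illustrative assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def stale_sources(sources, last_fetched, now):
    # sources: rows shaped like control.sources (name, sla_minutes, enabled).
    # last_fetched: maps source name to its most recent fetched_at.
    stale = []
    for s in sources:
        if not s["enabled"]:
            continue  # disabled sources are not held to an SLA
        last = last_fetched.get(s["name"])
        if last is None or now - last > timedelta(minutes=s["sla_minutes"]):
            stale.append(s["name"])
    return stale


def duplicate_groups(docs):
    # Group source_doc_ids by content_sha and keep hashes seen more than once.
    by_sha = defaultdict(list)
    for d in docs:
        by_sha[d["content_sha"]].append(d["source_doc_id"])
    return {sha: ids for sha, ids in by_sha.items() if len(ids) > 1}
```

The same two checks could serve as the data quality gates mentioned under CI & QA, for instance by failing a job when either function returns a non-empty result.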
Acceptance Criteria
- Pre‑labels served reliably; latency acceptable for SMEs.
- Annotations appear in HF JSONL with schema_version and provenance.
- Cleaned extractions are queryable in Postgres; freshness and lineage available.
- Curated aggregates map correctly to Ebisu schema, with reproducible lineage.
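To illustrate the second criterion, here is a minimal sketch of building one JSONL line that carries `schema_version` and provenance fields. The record layout is an assumption: this plan only states that each record stores a schema version and provenance, so the exact field names (borrowed from label.annotation_refs) and the `result` key are hypothetical.

```python
import json
from datetime import datetime, timezone


def to_jsonl_line(project_id, task_id, result, schema_version="1.0.0", annotated_at=None):
    # Assumed record shape: provenance keys plus the annotation payload.
    annotated_at = annotated_at or datetime.now(timezone.utc)
    record = {
        "schema_version": schema_version,
        "project_id": project_id,
        "task_id": task_id,
        "annotated_at": annotated_at.isoformat(),
        "result": result,
    }
    # One compact JSON object per line; sorted keys keep diffs between commits stable.
    return json.dumps(record, ensure_ascii=False, sort_keys=True)


def has_provenance(line: str) -> bool:
    # Acceptance-style check: every required key is present in the record.
    record = json.loads(line)
    required = {"schema_version", "project_id", "task_id", "annotated_at"}
    return required.issubset(record)
```

A check like `has_provenance` could run over each line of the dataset file as part of the planned smoke tests.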
Configuration Keys
- hfAccessToken (ESC): Hugging Face write token
- schemaVersion (ESC): Staging schema version; default 1.0.0
- nerLabels (ESC): JSON label list (order defines index mapping)
- postgres_password (Pulumi): enables in‑cluster Postgres when provided
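The keys above could be resolved into a typed settings object along these lines. This is a sketch under stated assumptions: the `StagingConfig` class and `load_config` function are hypothetical, the input is a plain mapping rather than a real ESC or Pulumi client, and treating `hfAccessToken` as required is a guess. The `1.0.0` default and the rule that Postgres is enabled only when a password is provided come from the list above.

```python
import json
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class StagingConfig:
    hf_access_token: str
    schema_version: str
    ner_labels: tuple
    postgres_password: Optional[str]

    @property
    def postgres_enabled(self) -> bool:
        # In-cluster Postgres is enabled only when a password is provided.
        return bool(self.postgres_password)


def load_config(values: Mapping[str, str]) -> StagingConfig:
    token = values.get("hfAccessToken")
    if not token:
        # Assumption: the sink cannot write to Hugging Face without a token.
        raise ValueError("hfAccessToken is required")
    labels = json.loads(values.get("nerLabels", "[]"))
    return StagingConfig(
        hf_access_token=token,
        schema_version=values.get("schemaVersion", "1.0.0"),
        ner_labels=tuple(labels),  # order is preserved: it defines the index mapping
        postgres_password=values.get("postgres_password") or None,
    )
```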
Operational Notes
- LS Webhooks: configure the ANNOTATION_CREATED and ANNOTATION_UPDATED events to post to the Annotations Sink.
- Model swaps: prefer per-request model override in Adapter; default remains DistilBERT.
- Cloudflare: ensure node tunnel credentials match the tunnel ID.
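As a sketch of the webhook note above, the sink could filter incoming events before persisting anything. The code assumes the payload is a JSON object with an `action` field naming the event and with `task` and `annotation` objects, which matches Label Studio's documented webhook format as far as we know; the `handle_webhook` function and the shape of its return value are hypothetical.

```python
HANDLED_ACTIONS = {"ANNOTATION_CREATED", "ANNOTATION_UPDATED"}


def handle_webhook(payload: dict):
    # Ignore any event the sink is not configured to persist.
    action = payload.get("action")
    if action not in HANDLED_ACTIONS:
        return None
    annotation = payload.get("annotation") or {}
    task = payload.get("task") or {}
    # Return only the fields a downstream writer would need.
    return {
        "action": action,
        "task_id": task.get("id"),
        "annotation_id": annotation.get("id"),
        "result": annotation.get("result", []),
    }
```

Returning `None` for unhandled actions keeps the sink tolerant of a webhook that was configured with more events than intended.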