Source: oceanid/docs/architecture/project-plan.md

Oceanid Intelligence Staging — Project Plan

This document defines the end goal, architecture, scope, and milestones for the intelligence staging layer that supports SME pre‑labeling, annotation, and curation.

End Goal

Provide a robust staging pipeline where:
- SMEs upload documents, receive pre‑labels, and annotate in Label Studio.
- Annotations are versioned and stored in a Hugging Face dataset repo.
- Cleaned extractions (structured data) are persisted in Postgres with strong lineage.
- Multiple raw sources land into per‑source tables, tracked for freshness and provenance.
- Curated aggregates align to Ebisu’s production schema and can be published downstream.

Architecture Overview

High‑Level Dataflow

K8s + Calypso Components

Data Model (Medallion)

raw
- raw.<source>_documents(source_doc_id, fetched_at, content, content_sha, url, metadata, ingestion_run_id)
stage
- stage.documents(id, source_id, source_doc_id, collected_at, text, content_sha, metadata)
- stage.extractions(document_id, label, value, start, end, confidence, db_mapping, annotator, updated_at)
label
- label.annotation_refs(project_id, task_id, hf_repo, path, commit_sha, annotated_at, schema_version)
curated (Ebisu‑aligned)
- curated.vessels, curated.vessel_info, curated.vessel_associates, curated.entity_persons, curated.entity_organizations
control
- control.sources(name, sla_minutes, enabled), control.ingestion_runs, control.schema_versions
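The content_sha column on the raw and stage tables lends itself to duplicate detection. A minimal sketch follows, assuming SHA-256 over the UTF-8 content (the plan does not name a hash algorithm) and using hypothetical helper names and sample rows:

```python
import hashlib
from collections import defaultdict


def content_sha(content: str) -> str:
    """Hex digest of the document content (SHA-256 is an assumption)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def find_duplicates(rows):
    """Group source_doc_id values by content_sha and keep groups with 2+ members."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["content_sha"]].append(row["source_doc_id"])
    return {sha: ids for sha, ids in groups.items() if len(ids) > 1}


# Hypothetical rows shaped like a subset of the raw table columns.
rows = [
    {"source_doc_id": "a-1", "content_sha": content_sha("vessel report")},
    {"source_doc_id": "a-2", "content_sha": content_sha("vessel report")},
    {"source_doc_id": "a-3", "content_sha": content_sha("other text")},
]
duplicates = find_duplicates(rows)
```

In Postgres the same grouping would more likely be a view over content_sha; the Python form only illustrates the logic.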

Versioning & Evolution

NER labels: managed via ESC; adapter maps indices using NER_LABELS; stage stores label strings.
Schema version: stored per JSONL record and referenced by curated transformations.
Model updates: re‑export ONNX and adjust Triton config dims; no change needed in stage.extractions.
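Because the order of the label list defines the index mapping, the adapter's index-to-label step can be sketched as below. This is an illustration, not the adapter's actual code: the function names and the sample label list are hypothetical.

```python
import json


def load_labels(ner_labels_json: str) -> list:
    """Parse the JSON label list; list position is the model's class index."""
    labels = json.loads(ner_labels_json)
    if not isinstance(labels, list) or not all(isinstance(x, str) for x in labels):
        raise ValueError("nerLabels must be a JSON array of strings")
    if len(set(labels)) != len(labels):
        raise ValueError("nerLabels must not contain duplicate labels")
    return labels


def indices_to_labels(indices, labels):
    """Map predicted class indices to label strings, rejecting unknown indices."""
    mapped = []
    for index in indices:
        if not 0 <= index < len(labels):
            raise IndexError(f"class index {index} has no label")
        mapped.append(labels[index])
    return mapped


# Hypothetical label list; the real one is managed via ESC.
NER_LABELS = '["O", "VESSEL", "IMO", "FLAG"]'
labels = load_labels(NER_LABELS)
mapped = indices_to_labels([1, 0, 2], labels)
```

Storing the resulting strings rather than the indices is what keeps stage.extractions stable when the label list is reordered or extended.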

Milestones

GPU + Pre‑labels (Done)
- Triton on GPU, Cloudflare tunnel, Adapter, DistilBERT support
Annotations Sink (Done)
- Write annotations to HF; persist cleaned extractions to PG
Postgres Baseline (Planned)
- Add control.* and raw.* schemas; add views for freshness and duplicates
Curated Layer (Planned)
- Add SQL/dbt jobs to populate curated.* aligned with Ebisu
CI & QA (Planned)
- Smoke tests for adapter/sink; data quality checks on stage/curated
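The freshness check planned under Postgres Baseline could follow the logic sketched here, which compares each enabled source's last fetch time against its sla_minutes. The function name and the shape of the inputs are assumptions; the planned implementation is a database view, not Python.

```python
from datetime import datetime, timedelta, timezone


def stale_sources(sources, last_fetched, now):
    """Return names of enabled sources whose latest fetch is older than their SLA."""
    stale = []
    for source in sources:
        if not source["enabled"]:
            continue  # disabled sources are not held to an SLA
        last = last_fetched.get(source["name"])
        if last is None or now - last > timedelta(minutes=source["sla_minutes"]):
            stale.append(source["name"])
    return stale


# Hypothetical rows shaped like control.sources, plus the latest fetched_at per source.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
sources = [
    {"name": "a", "sla_minutes": 60, "enabled": True},
    {"name": "b", "sla_minutes": 60, "enabled": True},
    {"name": "c", "sla_minutes": 60, "enabled": False},
]
last_fetched = {
    "a": now - timedelta(minutes=30),
    "b": now - timedelta(minutes=90),
}
stale = stale_sources(sources, last_fetched, now)
```

An enabled source with no recorded fetch is treated as stale, which is a design choice of this sketch.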

Acceptance Criteria

Pre‑labels served reliably; latency acceptable for SMEs.
Annotations appear in HF JSONL with schema_version and provenance.
Cleaned extractions are queryable in Postgres; freshness and lineage available.
Curated aggregates map correctly to Ebisu schema, with reproducible lineage.
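The second criterion can be smoke-tested by scanning a JSONL export for required keys. This is a minimal sketch: only schema_version comes from the plan, and the other required keys are hypothetical stand-ins for provenance fields.

```python
import json

# schema_version is named in the plan; the other keys are assumed provenance fields.
REQUIRED_KEYS = {"schema_version", "project_id", "task_id"}


def to_jsonl(records):
    """Serialize records as one JSON object per line."""
    return "".join(json.dumps(record, sort_keys=True) + "\n" for record in records)


def missing_keys(jsonl_text):
    """Map 1-based line numbers to the sorted list of required keys they lack."""
    problems = {}
    for number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue
        missing = REQUIRED_KEYS - set(json.loads(line))
        if missing:
            problems[number] = sorted(missing)
    return problems


records = [
    {"schema_version": "1.0.0", "project_id": 1, "task_id": 2},
    {"project_id": 1, "task_id": 3},  # deliberately missing schema_version
]
problems = missing_keys(to_jsonl(records))
```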

Configuration Keys

hfAccessToken (ESC): Hugging Face write token
schemaVersion (ESC): Staging schema version; default 1.0.0
nerLabels (ESC): JSON label list (order defines index mapping)
postgres_password (Pulumi): enables in‑cluster Postgres when provided
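The keys above could be resolved into one typed object, as in the sketch below. It assumes the values arrive as a plain dictionary and that hfAccessToken is mandatory; neither detail is stated in the plan, and the class and function names are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class StagingConfig:
    hf_access_token: str
    schema_version: str
    ner_labels: tuple
    postgres_enabled: bool


def load_config(values: dict) -> StagingConfig:
    """Build the config, applying the documented default for schemaVersion."""
    token = values.get("hfAccessToken")
    if not token:
        raise ValueError("hfAccessToken is required")  # assumption: token is mandatory
    return StagingConfig(
        hf_access_token=token,
        schema_version=values.get("schemaVersion", "1.0.0"),
        ner_labels=tuple(json.loads(values["nerLabels"])),
        # In-cluster Postgres is enabled only when a password is provided.
        postgres_enabled=bool(values.get("postgres_password")),
    )


config = load_config({"hfAccessToken": "hf_example", "nerLabels": '["O", "VESSEL"]'})
```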

Operational Notes

LS Webhooks: configure ANNOTATION_CREATED/UPDATED to Annotations Sink.
Model swaps: prefer per-request model override in Adapter; default remains DistilBERT.
Cloudflare: ensure node tunnel credentials match the tunnel ID.
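On the receiving side, the Annotations Sink needs to accept only the two configured events and derive a label.annotation_refs row from each. The sketch below assumes a webhook payload with top-level action, project, task, and annotation objects; verify that shape against the Label Studio webhook documentation before relying on it. The function names and sample values are hypothetical.

```python
HANDLED_ACTIONS = {"ANNOTATION_CREATED", "ANNOTATION_UPDATED"}


def should_handle(payload: dict) -> bool:
    """Accept only the annotation events the sink is configured for."""
    return payload.get("action") in HANDLED_ACTIONS


def to_annotation_ref(payload, hf_repo, path, commit_sha, schema_version):
    """Build a row shaped like label.annotation_refs from a webhook payload."""
    return {
        "project_id": payload["project"]["id"],
        "task_id": payload["task"]["id"],
        "hf_repo": hf_repo,
        "path": path,
        "commit_sha": commit_sha,
        "annotated_at": payload["annotation"].get("updated_at"),
        "schema_version": schema_version,
    }


# Hypothetical payload and Hugging Face coordinates, for illustration only.
payload = {
    "action": "ANNOTATION_UPDATED",
    "project": {"id": 7},
    "task": {"id": 42},
    "annotation": {"id": 1001, "updated_at": "2024-01-01T12:00:00Z"},
}
ref = None
if should_handle(payload):
    ref = to_annotation_ref(payload, "org/annotations", "data/7.jsonl", "abc123", "1.0.0")
```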
