Operations Guide

This guide covers the day‑to‑day operational flows for the Oceanid stack (2× VPS and 1× GPU workstation).

Topology

  • K8s on primary VPS (tethys). Label Studio runs here and is exposed via the Cloudflare cluster tunnel at https://label.boathou.se.
  • Calypso (GPU workstation) runs a host‑level cloudflared connector and a simple GPU HTTP service at https://gpu.boathou.se.
  • All secrets and tokens are stored in Pulumi ESC (default/oceanid-cluster).

Calypso Contract

  • DNS: gpu.<base> CNAME points to the Node Tunnel target <TUNNEL_ID>.cfargotunnel.com.
  • Host tunnel: systemd cloudflared-node.service, config under /etc/cloudflared/config.yaml routing gpu.<base> → http://localhost:8000 (see the config sketch below).
  • Triton: systemd tritonserver.service (Docker), ports 8000/8001/8002; models mounted from /opt/triton/models.
  • Adapter: calls Triton at TRITON_BASE_URL=https://gpu.<base> and presents a Cloudflare Access service token when Access is enabled.
  • Pulumi ownership: HostCloudflared and HostDockerService components render/update these units.
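
For reference, the rendered file on Calypso should look roughly like the sketch below (illustrative values; the HostCloudflared component owns the real file, which may use a token instead of a credentials file):

ssh calypso 'sudo cat /etc/cloudflared/config.yaml'
# Expected shape (illustrative values):
#   tunnel: <TUNNEL_ID>
#   credentials-file: /etc/cloudflared/credentials.json
#   ingress:
#     - hostname: gpu.<base>
#       service: http://localhost:8000
#     - service: http_status:404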

Deploy

Policy: Always deploy via Pulumi (no manual kubectl edits). This ensures changes are reproducible and audited.

  • Minimal, non‑disruptive deploy:
    • pulumi -C cluster preview
    • pulumi -C cluster up
  • Enable provisioning/LB once tunnels are stable:
    • pulumi -C cluster config set oceanid-cluster:enableNodeProvisioning true
    • pulumi -C cluster config set oceanid-cluster:enableControlPlaneLB true
    • pulumi -C cluster up

Validate

  • If kubectl is flaky, ensure a local API tunnel:
    • scripts/k3s-ssh-tunnel.sh tethys
    • export KUBECONFIG=cluster/kubeconfig.yaml
  • Basic smoke tests:
    • make smoke (uses label.boathou.se and gpu.boathou.se)
    • Triton HTTP V2 health and model listing:
      • curl -s https://gpu.boathou.se/v2/health/ready
      • curl -s -X POST https://gpu.boathou.se/v2/repository/index
    • Docling model present:
      • curl -s https://gpu.<base>/v2/models/docling_granite_python
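
The checks above can be strung together into one quick pass from any workstation (hostnames as used by make smoke; the model name matches the Docling check):

# Manual smoke pass over the public endpoints
for url in \
  https://label.boathou.se \
  https://gpu.boathou.se/v2/health/ready \
  https://gpu.boathou.se/v2/models/docling_granite_python; do
  curl -sfo /dev/null "$url" && echo "OK   $url" || echo "FAIL $url"
done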

Hands‑off Training and Deployment (ESC‑only)

  • GitHub Actions (.github/workflows/train-ner.yml) trains (nightly or on demand) from the HF dataset and publishes an updated ONNX to a private HF model repo.
  • Calypso runs a model-puller systemd timer that fetches the latest ONNX into /opt/triton/models/distilbert-base-uncased/<n>/.
  • Triton repository polling reloads new versions automatically.
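
For debugging on Calypso, a hedged manual equivalent of the puller step looks roughly like this (the real path is the systemd timer; the model.onnx artifact name and read-token handling are assumptions):

export HF_TOKEN=<hf-read-token>
huggingface-cli download goldfish-inc/oceanid-ner-distilbert model.onnx --local-dir /tmp/ner-onnx
sudo install -D /tmp/ner-onnx/model.onnx /opt/triton/models/distilbert-base-uncased/<n>/model.onnx
# Triton repository polling then picks up version <n> automatically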

Configure once in Pulumi ESC (no GitHub Secrets/Vars required):

  • pulumiConfig.oceanid-cluster:hfAccessToken → HF write token (used by sink + CI)
  • pulumiConfig.oceanid-cluster:hfDatasetRepo → e.g., goldfish-inc/oceanid-annotations
  • pulumiConfig.oceanid-cluster:hfModelRepo → e.g., goldfish-inc/oceanid-ner-distilbert
  • pulumiConfig.oceanid-cluster:postgres_url → CrunchyBridge PG 17 URL (for migrations)

ESC commands:

esc env set default/oceanid-cluster pulumiConfig.oceanid-cluster:hfAccessToken "<HF_WRITE_TOKEN>" --secret
esc env set default/oceanid-cluster pulumiConfig.oceanid-cluster:hfDatasetRepo "goldfish-inc/oceanid-annotations"
esc env set default/oceanid-cluster pulumiConfig.oceanid-cluster:hfModelRepo "goldfish-inc/oceanid-ner-distilbert"
esc env set default/oceanid-cluster pulumiConfig.oceanid-cluster:postgres_url "postgres://<user>:<pass>@p.<cluster-id>.db.postgresbridge.com:5432/postgres" --secret
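
To confirm what is now stored in the environment:

# Review the environment definition and the keys set above
esc env get default/oceanid-cluster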

Workflows:

  • train-ner.yml pulls HF token + repo names from ESC via OIDC.
  • database-migrations.yml pulls the DB URL from ESC and applies SQL migrations V3–V6; it ensures the pgcrypto, postgis, and btree_gist extensions and skips gracefully if the DB URL is not set.
  • Check connector health in Cloudflare Zero Trust → Tunnels.
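
Since train-ner.yml also supports on-demand runs, one way to kick one off from a workstation (assuming an authenticated gh CLI with access to the repo):

gh workflow run train-ner.yml
gh run list --workflow=train-ner.yml --limit 1   # confirm the run was queued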

Access Controls

  • GPU endpoint gpu.<base> is protected by a Cloudflare Zero Trust Access application that only allows a service token issued to the in-cluster adapter. The adapter is provisioned with CF_ACCESS_CLIENT_ID and CF_ACCESS_CLIENT_SECRET automatically.
  • Kubernetes API api.<base> is exposed via the node tunnel and protected by Cloudflare Access. Use cloudflared access tcp to connect.
cloudflared access tcp --hostname api.<base> --url 127.0.0.1:6443 &
export KUBECONFIG=~/.kube/k3s-config.yaml
kubectl cluster-info

Ensure your email domain or individual emails are included in the Access policy.
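
To spot-check the GPU Access policy from a workstation with a service token (the same client ID/secret values that are provisioned to the adapter):

curl -s https://gpu.<base>/v2/health/ready \
  -H "CF-Access-Client-Id: ${CF_ACCESS_CLIENT_ID}" \
  -H "CF-Access-Client-Secret: ${CF_ACCESS_CLIENT_SECRET}"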

Label Studio ML Backend (Auto‑provisioned)

  • Project NER_Data is auto‑provisioned in‑cluster by a one‑off Job:
    • Connects ML backend: http://ls-triton-adapter.apps.svc.cluster.local:9090
    • Applies a full NER labeling interface from ESC/labels.json
    • Registers task‑create webhooks to the sink /ingest
  • For additional projects, you can run the provisioner Job again or connect the ML backend manually.
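
A hedged sketch of re-running the provisioner (the Job name is illustrative — list the Jobs in the apps namespace first):

kubectl -n apps get jobs
kubectl -n apps delete job <provisioner-job>        # clear the completed one-off Job
pulumi -C cluster refresh && pulumi -C cluster up   # let Pulumi recreate it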

Manual steps (if needed):

  • Project → Settings → Model → Connect model
  • URL: http://ls-triton-adapter.apps.svc.cluster.local:9090
  • Test the connection (the adapter serves /setup via GET and POST); health check: /health
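
A quick adapter health probe from a workstation before wiring a project (same Service and port as the PDF example at the end of this guide):

kubectl -n apps port-forward svc/ls-triton-adapter 9090:9090 &
curl -s http://localhost:9090/health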

Notes:

  • The deprecated ls-ml-autoconnect service was removed to avoid coupling infra to a personal API token.
  • ls-triton-adapter stays running cluster‑wide as the shared inference endpoint.
  • PDF page images for box annotation are controlled by PDF_CONVERT_TO_IMAGES (set via Pulumi).
    • If you change this setting or other env vars, redeploy with Pulumi (pulumi -C cluster up).
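
For example, toggling page-image conversion would look roughly like this (the config key name is an assumption — check the cluster stack for the key that backs PDF_CONVERT_TO_IMAGES):

# Key name below is illustrative, not confirmed
pulumi -C cluster config set oceanid-cluster:pdfConvertToImages true
pulumi -C cluster up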

Secrets & Config

  • ESC keys to verify:
    • cloudflareNodeTunnelId, cloudflareNodeTunnelToken, cloudflareNodeTunnelHostname, cloudflareNodeTunnelTarget
    • cloudflareAccountId, cloudflareApiToken, cloudflareZoneId
    • labelStudioApiToken - API token for ML backend auto-configuration (from 1Password)
  • The node tunnel token can be either:
    • Base64‑encoded credentials.json, or
    • Raw TUNNEL_TOKEN string
  • The NodeTunnels + HostCloudflared components auto‑detect both formats.
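
A quick heuristic to tell which form a stored value is (a base64-encoded credentials.json decodes to JSON containing a TunnelID field; $TOKEN stands for whatever was put in ESC):

base64 -d <<<"$TOKEN" 2>/dev/null | jq -e .TunnelID >/dev/null \
  && echo "base64 credentials.json" || echo "raw TUNNEL_TOKEN"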

Troubleshooting

  • Cloudflare “record already exists” error: delete the existing DNS record (e.g., label.boathou.se) or remove Pulumi management for that hostname.
  • cloudflared “control stream failure”:
    • Ensure protocol: auto and dnsPolicy: ClusterFirstWithHostNet are active.
    • Verify Calypso has the label oceanid.cluster/tunnel-enabled=true if using the K8s DaemonSet.
  • SSH provisioning timeouts:
    • Keep enableNodeProvisioning=false while stabilizing tunnels.
  • Calypso sudo:
    • The oceanid user must have passwordless sudo for apt and systemd operations.
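
For the tunnel issues above, the connector logs and (when using the DaemonSet) the node label are the first things to check (node name assumed to be calypso):

ssh calypso 'journalctl -u cloudflared-node -n 100 --no-pager'
kubectl get node calypso --show-labels | grep oceanid.cluster/tunnel-enabled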

Calypso quick checks

ssh calypso 'systemctl status cloudflared-node --no-pager; systemctl status tritonserver --no-pager'
ssh calypso 'curl -sf http://localhost:8000/v2/health/ready && echo OK'
curl -sk https://gpu.<base>/v2/health/ready

CrunchyBridge Postgres

  • Configure sink to use external DB:
    • pulumi -C cluster config set --secret oceanid-cluster:postgres_url 'postgresql://<user>:<pass>@<host>:5432/<db>'
    • make up
  • Apply schema migrations locally:
    • export DATABASE_URL='postgresql://<user>:<pass>@<host>:5432/<db>'
    • make db:migrate
  • Quick checks:
    • make db:psql
    • psql "$DATABASE_URL" -c "select * from stage.v_documents_freshness;"

Label Studio database (labelfish)

  • Database: labelfish on the "ebisu" CrunchyBridge cluster (PG 17), isolated from staging/curated
  • Schema: labelfish (bootstrap in sql/labelstudio/labelfish_schema.sql); app tables created by Label Studio migrations
  • Roles: labelfish_owner (app), labelfish_rw (optional services), labelfish_ro (read‑only)
  • Extensions: pgcrypto, citext, pg_trgm, btree_gist; defaults: UTC, search_path=labelfish,public, timeouts

Provision (manual):

# Create DB (example owner shown via CrunchyBridge CLI)
cb psql 3x4xvkn3xza2zjwiklcuonpamy --role postgres -- -c \
"CREATE DATABASE labelfish OWNER u_ogfzdegyvvaj3g4iyuvlu5yxmi;"

# Apply bootstrap SQL (roles/schema/extensions/grants)
cb psql 3x4xvkn3xza2zjwiklcuonpamy --role postgres --database labelfish \
< sql/labelstudio/labelfish_schema.sql

# Store connection URL for the cluster stack
esc env set default/oceanid-cluster pulumiConfig.oceanid-cluster:labelStudioDbUrl \
"postgres://labelfish_owner:<password>@p.3x4xvkn3xza2zjwiklcuonpamy.db.postgresbridge.com:5432/labelfish" --secret

Wire Label Studio (cluster):

  • The cluster stack reads labelStudioDbUrl (ESC secret) and sets DATABASE_URL for the LS deployment.
  • Public host: LABEL_STUDIO_HOST=https://label.<base>
  • Tunnel ingress/DNS: label.<base> CNAME → <cluster_tunnel_id>.cfargotunnel.com (proxied) with ingress mapping to LS service.

Verify:

  • curl -I https://label.<base> → expect 302 Found redirecting to /user/login/
  • First start creates app tables: psql …/labelfish -c "\dt labelfish.*"

Add a new GPU host (host‑level)

  1. Provision SSH user + key; add to ESC.
  2. Add a HostCloudflared + optional HostGpuService for the host.
  3. Point a new gpuX.<base> route via Cloudflare DNS.

Using Triton with Docling/Granite

  • If you have a ready Docker image (e.g., a Docling‑Granite HTTP server), you can run it instead of Triton. Ask and we’ll switch the host service to that container and route gpu.<base> to its HTTP port.
  • To use a model with Triton, place it under /opt/triton/models/<model_name>/1/ on Calypso and add a config.pbtxt. Triton supports TensorRT, ONNX, PyTorch, TensorFlow and Python backends.
  • For Docling‑Granite via Python backend, this repo includes a skeleton at triton-models/docling_granite_python/. Copy it to Calypso and customize model.py as needed.

Example (on Calypso):

sudo mkdir -p /opt/triton/models
scp -r triton-models/docling_granite_python calypso:/tmp/
ssh calypso "sudo mv /tmp/docling_granite_python /opt/triton/models/ && sudo systemctl restart tritonserver"
curl -s https://gpu.<base>/v2/models/docling_granite_python

GPU pinning

  • distilbert-base-uncased is pinned to GPU0; docling_granite_python is pinned to GPU1 (see instance_group.gpus in each config.pbtxt). Adjust if the hardware layout changes.
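
A quick way to confirm the pinning on Calypso (paths from the model layout above; the stanza in the comment shows the expected shape with illustrative values):

ssh calypso 'grep -A 3 instance_group /opt/triton/models/docling_granite_python/config.pbtxt'
# Expected shape (illustrative):
#   instance_group [
#     { kind: KIND_GPU, gpus: [ 1 ] }
#   ]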

Adapter usage (PDF):

kubectl -n apps port-forward svc/ls-triton-adapter 9090:9090 &
PDF64=$(base64 -w0 sample.pdf)
curl -s -X POST http://localhost:9090/predict \
-H 'Content-Type: application/json' \
-d '{"model":"docling_granite_python","pdf_base64":"'"$PDF64"'","prompt":"Extract vessel summary"}' | jq .