# Operations Guide

This guide covers the day‑to‑day flows for the Oceanid stack with 2× VPS and 1× GPU workstation.
## Topology

- K8s on the primary VPS (tethys). Label Studio runs here and is exposed via the Cloudflare cluster tunnel at https://label.boathou.se.
- Calypso (GPU workstation) runs a host‑level cloudflared connector and a simple GPU HTTP service at https://gpu.boathou.se.
- All secrets and tokens are stored in Pulumi ESC (`default/oceanid-cluster`).
## Calypso Contract

- DNS: `gpu.<base>` CNAME points to the Node Tunnel target `<TUNNEL_ID>.cfargotunnel.com`.
- Host tunnel: systemd `cloudflared-node.service`, config under `/etc/cloudflared/config.yaml` routing `gpu.<base>` → `http://localhost:8000` (see the sketch after this list).
- Triton: systemd `tritonserver.service` (Docker), ports 8000/8001/8002; models mounted from `/opt/triton/models`.
- Adapter: calls `TRITON_BASE_URL=https://gpu.<base>` and presents a Cloudflare Access service token when enabled.
- Pulumi ownership: `HostCloudflared` and `HostDockerService` components render/update these units.
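The host tunnel config is rendered by the `HostCloudflared` component; for orientation, a minimal `/etc/cloudflared/config.yaml` along these lines (tunnel ID, credentials path and base domain are placeholders, not the real values) routes the GPU hostname to the local Triton port:

```bash
# Illustrative sketch only — the real file is rendered/updated by Pulumi.
sudo tee /etc/cloudflared/config.yaml <<'EOF' >/dev/null
tunnel: <TUNNEL_ID>
credentials-file: /etc/cloudflared/<TUNNEL_ID>.json
ingress:
  - hostname: gpu.<base>
    service: http://localhost:8000
  - service: http_status:404
EOF
sudo systemctl restart cloudflared-node
```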
## Deploy

Policy: Always deploy via Pulumi (no manual kubectl edits). This ensures changes are reproducible and audited.

- Minimal, non‑disruptive deploy:

  ```bash
  pulumi -C cluster preview
  pulumi -C cluster up
  ```

- Enable provisioning/LB once tunnels are stable:

  ```bash
  pulumi -C cluster config set oceanid-cluster:enableNodeProvisioning true
  pulumi -C cluster config set oceanid-cluster:enableControlPlaneLB true
  pulumi -C cluster up
  ```
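To inspect exactly which resources a deploy would touch before running `up`, the standard Pulumi diff preview is useful:

```bash
# Show a per-resource diff of the planned changes without applying anything
pulumi -C cluster preview --diff
```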
## Validate

- If kubectl is flaky, ensure a local API tunnel:

  ```bash
  scripts/k3s-ssh-tunnel.sh tethys
  export KUBECONFIG=cluster/kubeconfig.yaml
  ```

- Basic smoke tests: `make smoke` (uses label.boathou.se and gpu.boathou.se)
- Triton HTTP V2 live:

  ```bash
  curl -s https://gpu.boathou.se/v2/health/ready
  curl -s https://gpu.boathou.se/v2/models
  ```

- Docling model present:

  ```bash
  curl -s https://gpu.<base>/v2/models/docling_granite_python
  ```
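If the model shows up but predictions fail, Triton's per‑model readiness endpoint (part of its KServe V2 HTTP API) gives a quicker signal, for example:

```bash
# Returns non-2xx until Triton has loaded the model, so it can gate scripts
curl -sf https://gpu.<base>/v2/models/docling_granite_python/ready \
  && echo "docling_granite_python is ready" \
  || echo "model not ready yet"
```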
## Hands‑off Training and Deployment (ESC‑only)

- GitHub Actions (`.github/workflows/train-ner.yml`) trains (nightly or on demand) from the HF dataset and publishes an updated ONNX model to a private HF model repo.
- Calypso runs a `model-puller` systemd timer that fetches the latest ONNX into `/opt/triton/models/distilbert-base-uncased/<n>/` (see the sketch after this list).
- Triton repository polling reloads new versions automatically.
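The actual `model-puller` unit is rendered by Pulumi; conceptually it amounts to something like the following sketch (repo name, version‑numbering scheme and how the HF token reaches the host are assumptions for illustration):

```bash
#!/usr/bin/env bash
# Sketch of the model-puller logic: fetch the latest ONNX from the private HF
# model repo and drop it into a new numbered version directory for Triton.
set -euo pipefail

REPO="goldfish-inc/oceanid-ner-distilbert"            # hfModelRepo from ESC
MODEL_DIR="/opt/triton/models/distilbert-base-uncased"

# Next Triton version number = highest existing numeric directory + 1
LATEST=$(ls -1 "$MODEL_DIR" | grep -E '^[0-9]+$' | sort -n | tail -1 || true)
NEXT=$(( ${LATEST:-0} + 1 ))
mkdir -p "$MODEL_DIR/$NEXT"

# huggingface-cli reads HF_TOKEN from the environment
# (how the token is provisioned onto the host is an assumption here)
huggingface-cli download "$REPO" model.onnx --local-dir "$MODEL_DIR/$NEXT"

# Triton's repository polling picks up the new version automatically
```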
Configure once in Pulumi ESC (no GitHub Secrets/Vars required):

- `pulumiConfig.oceanid-cluster:hfAccessToken` → HF write token (used by sink + CI)
- `pulumiConfig.oceanid-cluster:hfDatasetRepo` → e.g., `goldfish-inc/oceanid-annotations`
- `pulumiConfig.oceanid-cluster:hfModelRepo` → e.g., `goldfish-inc/oceanid-ner-distilbert`
- `pulumiConfig.oceanid-cluster:postgres_url` → CrunchyBridge PG 17 URL (for migrations)
ESC commands:

```bash
esc env set default/oceanid-cluster pulumiConfig.oceanid-cluster:hfAccessToken "<HF_WRITE_TOKEN>" --secret
esc env set default/oceanid-cluster pulumiConfig.oceanid-cluster:hfDatasetRepo "goldfish-inc/oceanid-annotations"
esc env set default/oceanid-cluster pulumiConfig.oceanid-cluster:hfModelRepo "goldfish-inc/oceanid-ner-distilbert"
esc env set default/oceanid-cluster pulumiConfig.oceanid-cluster:postgres_url "postgres://<user>:<pass>@p.<cluster-id>.db.postgresbridge.com:5432/postgres" --secret
```
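To double‑check what the stack will see, you can read the environment back; this mirrors the path style used above (secret‑redaction behavior may vary by ESC CLI version):

```bash
# Read the environment back to confirm the keys are present
esc env get default/oceanid-cluster

# Read a single (non-secret) key back
esc env get default/oceanid-cluster pulumiConfig.oceanid-cluster:hfDatasetRepo
```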
Workflows:

- `train-ner.yml` pulls the HF token + repo names from ESC via OIDC.
- `database-migrations.yml` pulls the DB URL from ESC and applies SQL migrations V3–V6; ensures extensions `pgcrypto`, `postgis`, `btree_gist`. Skips gracefully if the DB URL is not set.
- Check connector health in Cloudflare Zero Trust → Tunnels.
## Access Controls

- GPU endpoint `gpu.<base>` is protected by a Cloudflare Zero Trust Access application that only allows a service token issued to the in‑cluster adapter. The adapter is provisioned with `CF_ACCESS_CLIENT_ID` and `CF_ACCESS_CLIENT_SECRET` automatically (see the example after this list).
- Kubernetes API `api.<base>` is exposed via the node tunnel and protected by Cloudflare Access. Use `cloudflared access tcp` to connect.
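When testing the GPU endpoint from outside the cluster, the same service token can be presented explicitly via Cloudflare Access's standard headers (the values here are read from environment variables holding the Access service token):

```bash
# Call the Access-protected GPU endpoint with a service token
curl -s https://gpu.<base>/v2/health/ready \
  -H "CF-Access-Client-Id: ${CF_ACCESS_CLIENT_ID}" \
  -H "CF-Access-Client-Secret: ${CF_ACCESS_CLIENT_SECRET}"
```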
### K8s API via Access (recommended)

```bash
cloudflared access tcp --hostname api.<base> --url 127.0.0.1:6443 &
export KUBECONFIG=~/.kube/k3s-config.yaml
kubectl cluster-info
```

Ensure your email domain or individual emails are included in the Access policy.
GPU path: in‑cluster adapter → Cloudflare Access (service token) → `gpu.<base>` via the Calypso node tunnel (`cloudflared-node`) → Triton on `localhost:8000`.
## Label Studio ML Backend (Auto‑provisioned)

- Project `NER_Data` is auto‑provisioned in‑cluster by a one‑off Job, which:
  - Connects the ML backend: `http://ls-triton-adapter.apps.svc.cluster.local:9090`
  - Applies a full NER labeling interface from ESC/labels.json
  - Registers task‑create webhooks to the sink `/ingest`
- For additional projects, you can run the provisioner Job again or connect the ML backend manually.
Manual steps (if needed):
- Project → Settings → Model → Connect model
- URL: `http://ls-triton-adapter.apps.svc.cluster.local:9090`
- Test (uses `/setup` GET/POST); Health: `/health`
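If the connect test fails, it is usually quicker to hit the adapter directly from your workstation; a port‑forward plus the two endpoints above is enough (responses shown are whatever the adapter returns, not asserted here):

```bash
# Reach the in-cluster adapter locally and probe the endpoints Label Studio uses
kubectl -n apps port-forward svc/ls-triton-adapter 9090:9090 &
curl -s http://localhost:9090/health
curl -s http://localhost:9090/setup
```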
Notes:

- The deprecated `ls-ml-autoconnect` service was removed to avoid coupling infra to a personal API token. `ls-triton-adapter` stays running cluster‑wide as the shared inference endpoint.
- PDF page images for box annotation are controlled by `PDF_CONVERT_TO_IMAGES` (set via Pulumi).
- If you change this setting or other env vars, redeploy with Pulumi (`pulumi -C cluster up`).
## Secrets & Config

- ESC keys to verify:
  - `cloudflareNodeTunnelId`, `cloudflareNodeTunnelToken`, `cloudflareNodeTunnelHostname`, `cloudflareNodeTunnelTarget`
  - `cloudflareAccountId`, `cloudflareApiToken`, `cloudflareZoneId`
  - `labelStudioApiToken` — API token for ML backend auto‑configuration (from 1Password)
- The node tunnel token can be either:
  - a Base64‑encoded credentials.json, or
  - a raw TUNNEL_TOKEN string.

  The NodeTunnels + HostCloudflared components auto‑detect both formats (see the sketch after this list).
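A quick local heuristic for checking which form a given value is in (this mirrors, not reproduces, the components' detection logic; the `TunnelID` field name comes from cloudflared's standard credentials file):

```bash
# A credentials.json (base64-encoded) carries a "TunnelID" field;
# a connector TUNNEL_TOKEN does not decode to that shape.
if printf '%s' "$TOKEN" | base64 -d 2>/dev/null | jq -e 'has("TunnelID")' >/dev/null 2>&1; then
  echo "looks like base64-encoded credentials.json"
else
  echo "treating as raw TUNNEL_TOKEN"
fi
```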
## Troubleshooting

- Cloudflare record exists: delete the existing DNS record (e.g., `label.boathou.se`) or remove Pulumi management for that hostname.
- cloudflared “control stream failure”:
  - Ensure `protocol: auto` and `dnsPolicy: ClusterFirstWithHostNet` are active.
  - Verify Calypso has the label `oceanid.cluster/tunnel-enabled=true` if using the K8s DaemonSet.
- SSH provisioning timeouts:
  - Keep `enableNodeProvisioning=false` while stabilizing tunnels.
- Calypso sudo: `oceanid` must have passwordless sudo for apt/systemd (see the sketch after this list).
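A sudoers drop‑in along these lines is one way to satisfy that (file name and command scope are illustrative; widen or narrow as needed):

```bash
# Allow the oceanid user to run apt and systemctl without a password
echo 'oceanid ALL=(ALL) NOPASSWD: /usr/bin/apt, /usr/bin/apt-get, /usr/bin/systemctl' \
  | sudo tee /etc/sudoers.d/oceanid-ops
sudo chmod 0440 /etc/sudoers.d/oceanid-ops
sudo visudo -c   # sanity-check sudoers syntax before logging out
```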
### Calypso quick checks

```bash
ssh calypso 'systemctl status cloudflared-node --no-pager; systemctl status tritonserver --no-pager'
ssh calypso 'curl -sf http://localhost:8000/v2/health/ready && echo OK'
curl -sk https://gpu.<base>/v2/health/ready
```
## CrunchyBridge Postgres

- Configure the sink to use the external DB:

  ```bash
  pulumi -C cluster config set --secret oceanid-cluster:postgres_url 'postgresql://<user>:<pass>@<host>:5432/<db>'
  make up
  ```

- Apply schema migrations locally:

  ```bash
  export DATABASE_URL='postgresql://<user>:<pass>@<host>:5432/<db>'
  make db:migrate
  ```

- Quick checks:

  ```bash
  make db:psql
  psql "$DATABASE_URL" -c "select * from stage.v_documents_freshness;"
  ```
## Label Studio database (labelfish)

- Database: `labelfish` on the "ebisu" CrunchyBridge cluster (PG 17), isolated from staging/curated
- Schema: `labelfish` (bootstrap in `sql/labelstudio/labelfish_schema.sql`); app tables created by Label Studio migrations
- Roles: `labelfish_owner` (app), `labelfish_rw` (optional services), `labelfish_ro` (read‑only)
- Extensions: `pgcrypto`, `citext`, `pg_trgm`, `btree_gist`; defaults: `UTC`, `search_path=labelfish,public`, timeouts
Provision (manual):

```bash
# Create DB (example owner shown via CrunchyBridge CLI)
cb psql 3x4xvkn3xza2zjwiklcuonpamy --role postgres -- -c \
  "CREATE DATABASE labelfish OWNER u_ogfzdegyvvaj3g4iyuvlu5yxmi;"

# Apply bootstrap SQL (roles/schema/extensions/grants)
cb psql 3x4xvkn3xza2zjwiklcuonpamy --role postgres --database labelfish \
  < sql/labelstudio/labelfish_schema.sql

# Store connection URL for the cluster stack
esc env set default/oceanid-cluster pulumiConfig.oceanid-cluster:labelStudioDbUrl \
  "postgres://labelfish_owner:<password>@p.3x4xvkn3xza2zjwiklcuonpamy.db.postgresbridge.com:5432/labelfish" --secret
```
Wire Label Studio (cluster):

- The cluster stack reads `labelStudioDbUrl` (ESC secret) and sets `DATABASE_URL` for the LS deployment.
- Public host: `LABEL_STUDIO_HOST=https://label.<base>`
- Tunnel ingress/DNS: `label.<base>` CNAME → `<cluster_tunnel_id>.cfargotunnel.com` (proxied) with ingress mapping to the LS service.
Verify:

- `curl -I https://label.<base>/` → `302 Found` to `/user/login/`
- First start creates app tables:

  ```bash
  psql …/labelfish -c "\dt labelfish.*"
  ```
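Beyond the table check, it can be worth confirming that the bootstrap roles and extensions from `labelfish_schema.sql` landed; a quick pass with psql (the `LABELFISH_URL` variable here is a placeholder for the `labelStudioDbUrl` connection string) looks like:

```bash
# List the labelfish roles and the extensions installed in the database
psql "$LABELFISH_URL" -c "\du labelfish*"
psql "$LABELFISH_URL" -c "\dx"
```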
## Add a new GPU host (host‑level)

- Provision SSH user + key; add to ESC.
- Add a `HostCloudflared` + optional `HostGpuService` for the host.
- Point a new `gpuX.<base>` route via Cloudflare DNS.
## Using Triton with Docling/Granite

- If you have a ready Docker image (e.g., a Docling‑Granite HTTP server), you can run it instead of Triton. Ask and we’ll switch the host service to that container and route `gpu.<base>` to its HTTP port.
- To use a model with Triton, place it under `/opt/triton/models/<model_name>/1/` on Calypso and add a `config.pbtxt`. Triton supports TensorRT, ONNX, PyTorch, TensorFlow and Python backends.
- For Docling‑Granite via the Python backend, this repo includes a skeleton at `triton-models/docling_granite_python/`. Copy it to Calypso and customize `model.py` as needed.
Example (on Calypso):

```bash
sudo mkdir -p /opt/triton/models
scp -r triton-models/docling_granite_python calypso:/tmp/
ssh calypso "sudo mv /tmp/docling_granite_python /opt/triton/models/ && sudo systemctl restart tritonserver"
curl -s https://gpu.<base>/v2/models
```
### GPU pinning

- `distilbert-base-uncased` is pinned to GPU 0; `docling_granite_python` is pinned to GPU 1 (see `instance_group.gpus` in each `config.pbtxt`; a sketch follows). Adjust if the hardware layout changes.
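For reference, a minimal Python‑backend `config.pbtxt` with that pinning looks roughly like this; the input/output names and dtypes are placeholders, and the real skeleton in `triton-models/docling_granite_python/` is authoritative:

```bash
# Sketch only — adapt names/dtypes to the actual model.py in the repo skeleton.
sudo tee /opt/triton/models/docling_granite_python/config.pbtxt <<'EOF' >/dev/null
name: "docling_granite_python"
backend: "python"
max_batch_size: 0
input [
  { name: "INPUT_PDF", data_type: TYPE_STRING, dims: [ 1 ] }
]
output [
  { name: "OUTPUT_JSON", data_type: TYPE_STRING, dims: [ 1 ] }
]
instance_group [
  { kind: KIND_GPU, count: 1, gpus: [ 1 ] }   # pin to GPU 1
]
EOF
```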
Adapter usage (PDF):

```bash
kubectl -n apps port-forward svc/ls-triton-adapter 9090:9090 &
PDF64=$(base64 -w0 sample.pdf)
curl -s -X POST http://localhost:9090/predict \
  -H 'Content-Type: application/json' \
  -d '{"model":"docling_granite_python","pdf_base64":"'"$PDF64"'","prompt":"Extract vessel summary"}' | jq .
```