# SME Guide: Uploading, Annotating, and Accessing Raw Data (Staging)
Audience: Subject Matter Experts. This guide focuses only on what you need to do: upload, annotate, and view the raw data in the staging database. Everything else is pre‑wired so you don’t need to worry about infrastructure.
## SME Quickstart

- Open Label Studio: `https://label.boathou.se` (Cloudflare Access)
- Open project “NER_Data” (or your team project)
- Verify the labeling interface has one block: `<Labels name="label" toName="text" ...>`
- Click a task → confirm pre‑labels appear inline
- Adjust spans and labels; Save/Submit
- Optional: open DBeaver to browse `stage.*` views (read‑only)
## Draw Boxes on PDFs

- Open a PDF task and use "PDF Page (Boxes)" to draw rectangles.
- Labels available: TABLE, FIGURE, SECTION, SIGNATURE, STAMP (ask if you need more).
- Your boxes are stored in the database and converted to PDF‑point coordinates for model training.
- Quick check:

  ```sql
  SELECT * FROM stage.v_pdf_boxes_latest ORDER BY created_at DESC LIMIT 50;
  ```
Need a new project or a custom label set? See Project Setup for SMEs: project-setup.mdx
## What You’ll Do
- Upload data to Label Studio (text, PDFs, CSV/XLSX, images with text).
- Review pre‑labels, correct, and save.
- Optionally, open DBeaver to view your raw data in the staging database (CrunchyBridge PostgreSQL).
## Uploading and Annotating

- Open Label Studio at your team URL (Cloudflare Access protected), e.g., `https://label.<your-domain>`.
- Create or open a project.
- Import your data:
  - Text: a paragraph or sentence per task.
  - PDFs: either upload the file or provide an HTTPS URL.
  - CSV/XLSX: upload directly. If possible, include a `text` column summarizing the most relevant info per row. Keeping a `pdf` or `url` column helps fetch source docs for context.
- Open a task to see pre‑labels. Adjust spans and labels; Save/Submit.
Tips for high‑quality annotations:

- Choose the most specific label (e.g., `IMO` vs. generic `VESSEL`).
- Keep spans tight (no extra whitespace).
- For CSV/XLSX, if your sheet has multiple useful fields, add a temporary `text` column in Excel/Sheets that concatenates them (no scripts needed).
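If you would rather script the concatenation than use a spreadsheet formula, a minimal sketch is below. The column names (`vessel_name`, `flag`, `notes`) and the `"; "` separator are hypothetical; adapt them to your sheet.

```python
def add_text_column(rows, fields):
    """Return copies of rows with a new 'text' column that concatenates
    the chosen fields (skipping blanks), for better pre-labels."""
    out = []
    for row in rows:
        parts = [str(row[f]).strip() for f in fields if row.get(f)]
        row = dict(row)  # copy so the original rows are untouched
        row["text"] = "; ".join(parts)
        out.append(row)
    return out

rows = [{"vessel_name": "MV Example", "flag": "Panama", "notes": "IMO 9319466"}]
print(add_text_column(rows, ["vessel_name", "flag", "notes"])[0]["text"])
# → MV Example; Panama; IMO 9319466
```

Write the result back out with `csv.DictWriter` (or paste it into a new sheet column) before importing to Label Studio.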
## SME Checklist (keep handy)

- You see pre‑labels when opening a task
- The Labels control name is `label`, mapped to `text`
- You’re not annotating HTML (HTML Labels are optional and can be read‑only)
- Saved work appears in read‑only staging views (`stage.*`)
## Where Your Work Goes (Tables you’ll use)

Your saved work is captured in the staging schema. The key tables and views you’ll browse are:

- `stage.documents` — the task text + metadata; one row per task.
- `stage.extractions` — span‑level NER results from text/PDF tasks.
- `stage.csv_extractions` — cell‑level raw→cleaned values for CSV/XLSX rows.
- `stage.training_corpus` — human corrections used to improve models.
- `stage.v_review_queue` — extractions flagged for human review.
- `stage.v_document_processing_stats` — per‑document processing overview.
## SME Cheat Sheet

- Mental model: Upload → pre‑labels → correct → rows land in `stage.*` views for QA.
- Don’t worry about: GPUs, networking, credentials, or non‑stage schemas.
- Quick checks: `stage.v_review_queue` for work needing eyes; `stage.v_document_processing_stats` for status.
- Label Studio speed: use hotkeys, keep spans tight, prefer specific labels (e.g., `IMO`).
- CSV/XLSX: add a simple `text` column in Excel/Sheets for better pre‑labels; keep `pdf`/`url` columns when available.
- DBeaver: use the pre‑made “Staging (RO)” connection and save filters for your frequently used views.
## Accessing Raw Data with DBeaver (CrunchyBridge staging)

We provide a read‑only connection for SMEs. If your workstation is pre‑configured, you’ll see “Oceanid Staging (RO)” in DBeaver. Note: some docs/chat may say “crunchybase”; it refers to the same CrunchyBridge‑hosted Postgres staging database.

If you already have the connection:

- Open DBeaver → Database Navigator → double‑click “Oceanid Staging (RO)”.
- Expand Schemas → `stage`.
- Right‑click a table → “Read Data in SQL Console”.

To add it manually (if not pre‑configured):

- Driver: PostgreSQL
- Host: provided CrunchyBridge hostname (format: `p.<cluster-id>.db.postgresbridge.com`)
- Port: `5432`
- Database: your staging DB name (ask ops if unsure)
- User/Password: provided read‑only credentials
- SSL: Required (default)
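If you use a plain Postgres client instead of DBeaver, the same fields can be assembled into a connection URI. A minimal sketch, assuming the database name and user shown are placeholders; `sslmode=require` mirrors the SSL setting above, and percent‑encoding keeps special characters in the password from breaking the URI:

```python
from urllib.parse import quote

def staging_uri(host: str, db: str, user: str, password: str, port: int = 5432) -> str:
    """Build a postgresql:// URI, percent-encoding credentials so characters
    like '@' or ':' in the password don't break URI parsing."""
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{db}?sslmode=require"
    )

print(staging_uri("p.<cluster-id>.db.postgresbridge.com", "oceanid_staging",
                  "readonly_user", "p@ss:word"))
# → postgresql://readonly_user:p%40ss%3Aword@p.<cluster-id>.db.postgresbridge.com:5432/oceanid_staging?sslmode=require
```

The interactive URI builder further down this page does the same thing in your browser.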
Helpful queries:

```sql
-- Most recent documents and status
SELECT *
FROM stage.v_document_processing_stats
ORDER BY last_processed_at DESC NULLS LAST
LIMIT 50;

-- NER spans for a document
SELECT e.label, e.value, e.confidence, e.updated_at
FROM stage.extractions e
WHERE e.document_id = $1
ORDER BY e.updated_at DESC;

-- CSV/XLSX cleaned cells needing review
SELECT document_id, row_index, column_name, raw_value, cleaned_value, confidence
FROM stage.csv_extractions
WHERE needs_review = true
ORDER BY confidence ASC, created_at ASC
LIMIT 200;
```
Self‑serve extras:

- Connection template: `./dbeaver/connection-template.txt`
- Saved queries: `./dbeaver/snippets/sme_saved_queries.sql`
- DBeaver guide: `./dbeaver/README.md`
## Live SQL (Playground)

If your docs site enables an in‑browser SQL Playground with a read‑only staging connection, you can run these directly. The examples assume a pre‑configured connection alias named `staging` and enforce read‑only mode.

See setup details: docs/operations/sqlplayground-connection.mdx.

```sql
-- Your corrections (training data for the model)
SELECT corrected_value, correction_type, annotator, annotated_at
FROM stage.training_corpus
ORDER BY annotated_at DESC
LIMIT 200;
```
## Network Picture (What’s already working for you)

### Storage (S3/MinIO)
- All uploads (files you import to Label Studio) are stored durably in our S3‑compatible object storage (MinIO/S3), not just in the database.
- The database holds metadata, spans, and table values; the original files live in object storage and can be re‑fetched for reprocessing.
- Practical takeaway: you don’t need to keep local copies after upload; if we need to re‑run extraction/cleaning, we will pull from storage.
## Managing NER Labels (Self‑Service)

Source of truth: `labels.json` (order matters; it defines the model output shape) and the ESC config `nerLabels` (injected into the adapter as `NER_LABELS`). Keep these in sync.

Steps to add or change labels:

1. Edit `labels.json`:
   - Append new labels at the end of the `labels` array (preserve indices).
   - Bump `version` (minor for additions; major for renames/removals) and update `modelOutputShape.numLabels`.
   - Update the `changeLog` entry.
2. Update runtime config (exact same order):
   - Generate the array from `labels.json`:

     ```bash
     python scripts/ner_labels_from_labels_json.py > ner_labels.json
     pulumi -C cluster config set oceanid-cluster:nerLabels "$(cat ner_labels.json)" --secret
     ```

   - Values must match exactly and remain in the same order as `labels.json`.
3. Refresh the adapter:
   - If you have cluster access: `kubectl -n apps rollout restart deploy/ls-triton-adapter` and wait for READY=1/1.
   - Otherwise, request a restart (no code changes needed).
4. Update your Label Studio project labeling interface to include the new label(s):
   - Project → Settings → Labeling Interface → add the label(s) in the `<Labels>` block so the UI matches the backend.
5. Validate:
   - In LS: Settings → Model → Test connection (200 OK on `/setup`).
   - In adapter logs: confirm “label validation passed” (counts match the model output shape).
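The generation step above can be sketched as follows. This assumes `labels.json` uses the keys mentioned in the steps (`labels`, `modelOutputShape.numLabels`); the real `scripts/ner_labels_from_labels_json.py` remains the source of truth:

```python
import json

def ner_labels_from_labels_json(labels_json: dict) -> str:
    """Emit the labels array as JSON, preserving order, and sanity-check
    that the label count matches modelOutputShape.numLabels."""
    labels = labels_json["labels"]
    num = labels_json["modelOutputShape"]["numLabels"]
    if len(labels) != num:
        raise ValueError(f"labels.json lists {len(labels)} labels but numLabels is {num}")
    return json.dumps(labels)

doc = {"labels": ["IMO", "VESSEL_NAME"], "modelOutputShape": {"numLabels": 2}}
print(ner_labels_from_labels_json(doc))
# → ["IMO", "VESSEL_NAME"]
```

Because order defines the model output shape, the check is deliberately strict: a count mismatch fails loudly rather than silently shifting indices.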
Retraining the NER model (recommended after label changes):

- Export the latest annotations to the HF dataset (the sink does this automatically).
- Train and export ONNX (see README “Model Training Loop”).
- Push to the model repo; the model puller updates Triton automatically.
Minimal Label Studio NER interface snippet (names must match `labels.json`):

```xml
<View>
  <Labels name="label" toName="text">
    <Label value="IMO" background="#ffd1a8"/>
    <Label value="VESSEL_NAME" background="#a8d1ff"/>
    <!-- add new labels here; names must match exactly -->
  </Labels>
  <Text name="text" value="$text"/>
</View>
```
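To catch drift between the UI and the backend, you can diff the two label lists locally. A minimal sketch; the XML and the `backend` list are hard‑coded here for illustration, whereas in practice you would paste your project’s interface config and load the array from `labels.json`:

```python
import xml.etree.ElementTree as ET

LS_CONFIG = """
<View>
  <Labels name="label" toName="text">
    <Label value="IMO" background="#ffd1a8"/>
    <Label value="VESSEL_NAME" background="#a8d1ff"/>
  </Labels>
  <Text name="text" value="$text"/>
</View>
"""

def ui_labels(xml_config: str) -> list:
    """Extract <Label value=...> names from a Label Studio interface config."""
    root = ET.fromstring(xml_config)
    return [el.get("value") for el in root.iter("Label")]

backend = ["IMO", "VESSEL_NAME"]  # e.g. json.load(open("labels.json"))["labels"]
assert ui_labels(LS_CONFIG) == backend, "UI labels out of sync with labels.json"
print("labels in sync")
```

Order matters too, so the comparison is on the full list, not just the set of names.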
## Updating Schemas (Self‑Service, Staging)

Small iterative changes are expected in staging. Typical SME changes:

- Add a view for QA (e.g., flag low‑confidence cells by source).
- Add a typed column in curated tables for a new high‑value field (after a label is stabilized).

Two ways to apply in staging:

- Manual psql (Option B in the migration guide):

  ```bash
  export DATABASE_URL='postgresql://<user>:<pass>@<host>:5432/<db>'
  psql $DATABASE_URL -f sql/migrations/V7__your_change.sql
  ```

- GitOps job (ops can include the file in the migration list):
  - Add `sql/migrations/V7__*.sql` and open a PR. CI will validate; then it is wired into the cluster job list and runs automatically.
### Examples

- New QA view in `stage` (save as `sql/migrations/V7__stage_low_confidence_by_source.sql`):

  ```sql
  CREATE OR REPLACE VIEW stage.v_low_confidence_by_source AS
  SELECT d.source_doc_id, ce.column_name, ce.raw_value, ce.cleaned_value, ce.confidence
  FROM stage.csv_extractions ce
  JOIN stage.documents d ON d.id = ce.document_id
  WHERE ce.confidence < 0.85
  ORDER BY ce.confidence ASC, ce.created_at ASC;
  ```

- Add a typed field to curated vessel info (extend the V6 pattern):

  ```sql
  ALTER TABLE curated.vessel_info ADD COLUMN IF NOT EXISTS hull_material text;
  COMMENT ON COLUMN curated.vessel_info.hull_material IS 'Hull material extracted from NER HULL_MATERIAL label.';
  ```
Validation after schema change:

- Re‑run your DBeaver saved queries and confirm the new view/column exists.
- For typed fields, check that post‑processing maps extractions to the new column (may require a small adapter mapping patch if `dbMapping` changes).
## TaBERT Pilot (Optional)
TaBERT is a research model for understanding tables. We run a small pilot to assist with CSV/XLSX normalization. This is optional and not GA; availability may vary.
What it does:
- Suggests standardized values for messy cells (e.g., vessel type aliases, country names, date formats).
- Flags low‑confidence transformations for human review.
How to try it (if enabled for your project):
- In Label Studio, open a CSV/XLSX task.
- Open the “Assist” panel and enable “Tabert (Pilot)”.
- Click “Suggest clean values”. Suggestions appear inline; accept or edit, then Save.
- Behind the scenes, accepted suggestions are recorded in `stage.csv_extractions` with `extraction_method = 'ml_model'` and a `model_version` like `tabert-<date>`.
Useful queries for pilot data:

```sql
-- Pilot suggestions that were auto‑filled by the model
SELECT document_id, row_index, column_name, raw_value, cleaned_value, confidence, model_version
FROM stage.csv_extractions
WHERE extraction_method = 'ml_model'
  AND model_version ILIKE 'tabert%'
ORDER BY created_at DESC
LIMIT 200;
```
Feedback:

- If you don’t see the Tabert toggle, or suggestions look off, please report it to ops with example rows. The pilot uses your corrections (via `stage.training_corpus`) to improve.
## FAQ

- Can I use Excel/Sheets only? Yes. For CSV/XLSX, adding a simple `text` column often yields better pre‑labels; no scripts are required.
- Can I query without DBeaver? Yes, any Postgres client works. You’ll need the same read‑only credentials and TLS.
- Are my edits versioned? Yes. An annotation sink appends versioned JSONL to our dataset and inserts rows into `stage.*` tables.
## Interactive Helpers

### Postgres URI Builder (Local Only)

Build a connection string for psql or other clients. Nothing is sent anywhere; this runs in your browser.
export function PgUriBuilder(){
const [host,setHost]=React.useState('p.<cluster-id>.db.postgresbridge.com');
const [db,setDb]=React.useState('oceanid_staging');
const [user,setUser]=React.useState('readonly_user');
const [pass,setPass]=React.useState('');
const [show,setShow]=React.useState(false);
const enc = s => encodeURIComponent(s || '');
const uri = host && db && user && pass ? `postgresql://${enc(user)}:${enc(pass)}@${host}:5432/${db}` : '';
return (
<div style={{border:'1px solid #ddd', padding:12, borderRadius:8}}>
<div style={{display:'grid', gridTemplateColumns:'120px 1fr', gap:8}}>
<label>Host</label><input value={host} onChange={e=>setHost(e.target.value)} />
<label>Database</label><input value={db} onChange={e=>setDb(e.target.value)} />
<label>User</label><input value={user} onChange={e=>setUser(e.target.value)} />
<label>Password</label>
<div>
<input type={show?'text':'password'} value={pass} onChange={e=>setPass(e.target.value)} style={{width:'100%'}}/>
<label style={{marginLeft:8}}><input type="checkbox" checked={show} onChange={()=>setShow(s=>!s)} /> show</label>
</div>
</div>
<div style={{marginTop:12}}><strong>Connection URI</strong></div>
<pre style={{overflowX:'auto'}}>{uri || '# fill all fields to generate'}</pre>
<div style={{fontSize:12,color:'#555'}}>Keep credentials safe. Prefer the pre-made DBeaver RO connection.</div>
</div>
);
}
<PgUriBuilder />
### SQL Snippet Generator

Generate a spans query for a specific `document_id`.
export function SqlGenerator(){
const [id,setId]=React.useState('');
// Quote the id and escape embedded quotes so the generated SQL is valid for
// text/UUID ids (Postgres also accepts a quoted literal for integer columns).
const safe = id.replace(/'/g, "''");
const sql = id ? `SELECT e.label, e.value, e.confidence, e.updated_at\nFROM stage.extractions e\nWHERE e.document_id = '${safe}'\nORDER BY e.updated_at DESC;` : '-- enter a document_id';
return (
<div style={{border:'1px solid #ddd', padding:12, borderRadius:8}}>
<div>document_id: <input value={id} onChange={e=>setId(e.target.value)} style={{width:160}} /></div>
<div style={{marginTop:8}}><strong>SQL</strong></div>
<pre style={{overflowX:'auto'}}>{sql}</pre>
</div>
);
}
<SqlGenerator />
If anything here isn’t visible or doesn’t match what you see, ping ops — your environment may not have the pilot features enabled yet, but the core upload → annotate → query flow is always on.