ADR-056: Using Docling for RFMO PDF Data Extraction

Status

Proposed

Context

Many Regional Fisheries Management Organizations (RFMOs) publish their vessel registries in PDF format rather than structured data formats like CSV or JSON. This presents a significant challenge for our vessel Master Data Management (MDM) system, which requires structured data for trust scoring and vessel matching.

Currently affected RFMOs:

  • SEAFO (South East Atlantic Fisheries Organisation) - PDF only
  • GFCM (General Fisheries Commission for the Mediterranean) - PDF reports
  • SIOFA (Southern Indian Ocean Fisheries Agreement) - PDF lists
  • CCAMLR (Commission for the Conservation of Antarctic Marine Living Resources) - PDF documents

Current Pain Points

  1. Manual data extraction is error-prone and time-consuming
  2. PDF structures vary significantly between RFMOs
  3. Tables span multiple pages with varying headers
  4. Mixed content (text, tables, images) requires intelligent parsing
  5. No consistent way to handle updates when new PDFs are published

Decision

We will integrate Docling (https://github.com/docling-project/docling) as our PDF extraction solution for RFMO vessel data.

Why Docling?

  1. Advanced Table Extraction: Docling excels at extracting complex tables from PDFs, which is critical for vessel registries
  2. Multi-format Support: Handles PDF, DOCX, PPTX, and other formats some RFMOs might use
  3. Structured Output: Provides JSON/Markdown output that can be easily parsed
  4. Layout Understanding: Maintains document structure and relationships between elements
  5. Open Source: Aligns with our open-source stack and allows customization
  6. Active Development: IBM Research project with regular updates

Implementation Plan

Phase 1: Infrastructure Setup

# requirements.txt addition
docling>=2.0.0

# /app/scripts/import/vessels/data/RFMO/pdf_extraction/docling_config.py
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
from docling.document_converter import DocumentConverter, PdfFormatOption


class RFMOPDFExtractor:
    def __init__(self):
        # Favour table-extraction accuracy over speed for registry PDFs
        # (TableFormerMode.FAST is the cheaper alternative); image extraction
        # stays at its default (off) unless vessel photos are needed later
        pipeline_options = PdfPipelineOptions()
        pipeline_options.do_table_structure = True
        pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE

        self.converter = DocumentConverter(
            format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
        )

    def extract_vessel_data(self, pdf_path: str) -> dict:
        """Extract structured vessel data from an RFMO PDF."""
        doc = self.converter.convert(pdf_path).document
        return self._parse_vessel_tables(doc)
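
The _parse_vessel_tables method called above is not spelled out in this ADR. A minimal sketch, assuming docling's document model exposes detected tables via DoclingDocument.tables and each table offers export_to_dataframe(); the {"tables": [{"rows": ...}]} output shape is an assumption chosen to match what the Phase 3 parsers expect:

def _parse_vessel_tables(self, doc) -> dict:
    """Collect every detected table as a {"rows": [...]} block of row dicts."""
    tables = []
    for table in doc.tables:
        # export_to_dataframe() gives a pandas DataFrame whose columns
        # come from the detected header row
        df = table.export_to_dataframe()
        tables.append({"rows": df.to_dict(orient="records")})
    return {"tables": tables}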

Phase 2: RFMO-Specific Parsers

Each RFMO will have its own parser to handle unique PDF structures:

/app/scripts/import/vessels/data/RFMO/pdf_extraction/
├── __init__.py
├── docling_config.py
├── base_parser.py
├── parsers/
│   ├── __init__.py
│   ├── seafo_parser.py
│   ├── gfcm_parser.py
│   ├── siofa_parser.py
│   └── ccamlr_parser.py
└── tests/
    └── test_pdf_extraction.py
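
The shared base_parser.py is not defined in this ADR. A minimal sketch of the common interface each parser would implement; only the BaseRFMOParser name and parse_vessel_table signature come from Phase 3 below, the rest is illustrative:

from abc import ABC, abstractmethod


class BaseRFMOParser(ABC):
    """Common interface implemented by every RFMO-specific parser."""

    rfmo_code: str = ""  # e.g. "SEAFO"

    @abstractmethod
    def parse_vessel_table(self, table_data: dict) -> list:
        """Turn one extracted table into a list of normalised vessel dicts."""

    def parse_document(self, extracted: dict) -> list:
        """Run parse_vessel_table over every table extracted from one PDF."""
        vessels = []
        for table in extracted.get("tables", []):
            vessels.extend(self.parse_vessel_table(table))
        return vessels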

Phase 3: Integration with Existing Pipeline

# Example: SEAFO PDF extraction and cleaning
class SEAFOParser(BaseRFMOParser):
    def parse_vessel_table(self, table_data: dict) -> list:
        """Parse SEAFO-specific vessel table structure"""
        vessels = []

        for row in table_data['rows']:
            vessel = {
                'vessel_name': self._clean_text(row.get('Vessel Name')),
                'imo': self._extract_imo(row.get('IMO Number')),
                'flag': self._map_country(row.get('Flag State')),
                'call_sign': self._clean_text(row.get('Call Sign')),
                'vessel_type': self._map_vessel_type(row.get('Type')),
                # ... other fields
            }
            vessels.append(vessel)

        return vessels
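
The _clean_text, _extract_imo, _map_country and _map_vessel_type helpers are left undefined here; they would naturally live on BaseRFMOParser so every parser shares them. A hedged sketch of the first two as plain functions (the IMO check-digit rule is the standard one: the seventh digit equals the sum of the first six digits weighted 7 down to 2, modulo 10):

import re


def clean_text(value) -> str | None:
    """Collapse whitespace; empty cells become None."""
    if value is None:
        return None
    text = re.sub(r"\s+", " ", str(value)).strip()
    return text or None


def extract_imo(value) -> str | None:
    """Pull a 7-digit IMO number out of a cell and verify its check digit."""
    if value is None:
        return None
    match = re.search(r"\b(\d{7})\b", str(value))
    if not match:
        return None
    digits = match.group(1)
    checksum = sum(int(d) * w for d, w in zip(digits[:6], range(7, 1, -1)))
    return digits if checksum % 10 == int(digits[6]) else None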

Phase 4: Automated Pipeline

#!/bin/bash
# /app/scripts/import/vessels/data/RFMO/extract_and_clean_seafo_vessels.sh

# Download latest PDF
SEAFO_PDF_URL="https://www.seafo.org/vessel-registry.pdf"
wget -O /tmp/seafo_vessels.pdf "$SEAFO_PDF_URL"

# Extract using Docling
python3 /app/scripts/import/vessels/data/RFMO/pdf_extraction/extract_seafo.py \
    --input /tmp/seafo_vessels.pdf \
    --output /import/vessels/vessel_data/RFMO/raw/seafo_vessels_extracted.json

# Convert to CSV for existing pipeline
python3 /app/scripts/import/vessels/data/RFMO/pdf_extraction/json_to_csv.py \
    --input /import/vessels/vessel_data/RFMO/raw/seafo_vessels_extracted.json \
    --output /import/vessels/vessel_data/RFMO/raw/seafo_vessels.csv

# Use existing cleaning pipeline
/app/scripts/import/vessels/data/RFMO/clean_seafo_vessels.sh
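
json_to_csv.py is referenced above but not shown. A minimal sketch, assuming the extraction step writes a JSON object with a top-level "vessels" list of flat dicts (the shape produced by the Phase 3 parser):

# /app/scripts/import/vessels/data/RFMO/pdf_extraction/json_to_csv.py
import argparse
import csv
import json


def main() -> None:
    parser = argparse.ArgumentParser(description="Flatten extracted vessel JSON to CSV")
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args()

    with open(args.input, encoding="utf-8") as fh:
        vessels = json.load(fh)["vessels"]

    # Take the union of keys so the CSV stays stable even when some cells are missing
    fieldnames = sorted({key for vessel in vessels for key in vessel})
    with open(args.output, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(vessels)


if __name__ == "__main__":
    main()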

Data Quality Assurance

Validation Framework

class PDFExtractionValidator:
    def validate_extraction(self, extracted_data: dict, rfmo: str) -> dict:
        """Validate extracted data quality"""
        vessels = extracted_data['vessels']
        metrics = {
            'total_records': len(vessels),
            'imo_coverage': self._calculate_field_coverage(vessels, 'imo'),
            'name_coverage': self._calculate_field_coverage(vessels, 'vessel_name'),
            'parsing_errors': self._identify_parsing_errors(vessels),
            'data_anomalies': self._detect_anomalies(vessels)
        }

        # Alert if quality drops below threshold
        if metrics['imo_coverage'] < 0.3:  # 30% IMO coverage minimum
            self._alert_low_quality_extraction(rfmo, metrics)

        return metrics
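
Of the helpers above, _calculate_field_coverage is the simplest; a minimal sketch (parsing-error detection and anomaly rules are RFMO-specific and left open):

    def _calculate_field_coverage(self, vessels: list, field: str) -> float:
        """Fraction of records where the field is present and non-empty."""
        if not vessels:
            return 0.0
        return sum(1 for vessel in vessels if vessel.get(field)) / len(vessels)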

Manual Review Process

  1. First Run: Manual review of 10% sample
  2. Validation: Compare with manually extracted data
  3. Tuning: Adjust Docling parameters and parsing rules
  4. Production: Automated extraction with quality monitoring

Monitoring and Alerts

# Integration with existing monitoring
class PDFExtractionMonitor:
    def check_rfmo_pdf_updates(self):
        """Check for new PDF versions"""
        for rfmo in PDF_BASED_RFMOS:
            current_hash = self._get_pdf_hash(rfmo['url'])
            if current_hash != rfmo['last_hash']:
                self._trigger_extraction_pipeline(rfmo)
                self._notify_team(f"New {rfmo['name']} PDF detected")
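
PDF_BASED_RFMOS would be a small config list of {name, url, last_hash} records, and _get_pdf_hash can be a plain content hash. A sketch of the latter as a standalone function, assuming requests is already available in the stack:

import hashlib

import requests


def get_pdf_hash(url: str) -> str:
    """SHA-256 of the published PDF, streamed in chunks so large files are not held in memory."""
    digest = hashlib.sha256()
    with requests.get(url, stream=True, timeout=60) as response:
        response.raise_for_status()
        for chunk in response.iter_content(chunk_size=1 << 20):
            digest.update(chunk)
    return digest.hexdigest()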

Security Considerations

  1. PDF Sanitization: Scan PDFs for malicious content before processing
  2. Resource Limits: Set memory/CPU limits for Docling processing
  3. Sandboxing: Run extraction in isolated environment
  4. Validation: Strict input validation on extracted data
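
Points 2 and 3 can be combined by running each extraction in a separate process with hard limits. A minimal standard-library sketch; the memory cap and timeout are placeholder values, and resource.setrlimit is POSIX-only:

import resource
import subprocess


def run_extraction_sandboxed(pdf_path: str, output_path: str) -> None:
    """Run the extraction script with an address-space cap and a wall-clock timeout."""

    def limit_memory() -> None:
        # Cap the child process at 4 GiB of address space (illustrative value)
        resource.setrlimit(resource.RLIMIT_AS, (4 * 1024**3, 4 * 1024**3))

    subprocess.run(
        [
            "python3",
            "/app/scripts/import/vessels/data/RFMO/pdf_extraction/extract_seafo.py",
            "--input", pdf_path,
            "--output", output_path,
        ],
        preexec_fn=limit_memory,
        timeout=30 * 60,  # fail the job rather than hang on a pathological PDF
        check=True,
    )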

Consequences

Positive

  • Automated extraction from PDF-only RFMOs
  • Consistent data pipeline for all sources
  • Reduced manual effort and errors
  • Faster updates when new PDFs are published
  • Enables inclusion of 4+ additional RFMOs

Negative

  • Additional dependency (Docling)
  • PDF parsing may miss some edge cases
  • Requires RFMO-specific tuning
  • Processing overhead for large PDFs
  • May need manual intervention for complex layouts

Mitigation Strategies

  1. Implement comprehensive testing for each RFMO parser (see the test sketch after this list)
  2. Maintain fallback manual extraction process
  3. Regular quality audits of extracted data
  4. Progressive rollout (one RFMO at a time)
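
A starting point for tests/test_pdf_extraction.py, assuming the Phase 3 parser and the helper sketches above are in place; the fixture row and import path are illustrative:

# tests/test_pdf_extraction.py
from parsers.seafo_parser import SEAFOParser


def test_seafo_parser_maps_core_fields():
    """A hand-built table row should map onto the normalised vessel schema."""
    table_data = {
        'rows': [{
            'Vessel Name': '  EXAMPLE STAR ',
            'IMO Number': 'IMO 9074729',   # valid check digit
            'Flag State': 'Namibia',
            'Call Sign': 'V5AB',
            'Type': 'Trawler',
        }]
    }

    vessels = SEAFOParser().parse_vessel_table(table_data)

    assert len(vessels) == 1
    assert vessels[0]['vessel_name'] == 'EXAMPLE STAR'
    assert vessels[0]['imo'] == '9074729'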

Implementation Timeline

  1. Week 1-2: Docling integration and base infrastructure
  2. Week 3-4: SEAFO parser (pilot RFMO)
  3. Week 5-6: GFCM and SIOFA parsers
  4. Week 7-8: CCAMLR parser and testing
  5. Week 9-10: Production deployment and monitoring

Success Metrics

  • Successfully extract 95%+ of vessel records from PDFs
  • Achieve 80%+ IMO number extraction accuracy
  • Reduce manual extraction time by 90%
  • Enable monthly automated updates (vs quarterly manual)
  • Add 1000+ vessels from PDF-only RFMOs to MDM

References