ADR-056: Using Docling for RFMO PDF Data Extraction

Status

Proposed

Context

Many Regional Fisheries Management Organizations (RFMOs) publish their vessel registries in PDF format rather than structured data formats like CSV or JSON. This presents a significant challenge for our vessel Master Data Management (MDM) system, which requires structured data for trust scoring and vessel matching.

Currently affected RFMOs:

  • SEAFO (South East Atlantic Fisheries Organisation) - PDF only
  • GFCM (General Fisheries Commission for the Mediterranean) - PDF reports
  • SIOFA (Southern Indian Ocean Fisheries Agreement) - PDF lists
  • CCAMLR (Commission for the Conservation of Antarctic Marine Living Resources) - PDF documents

Current Pain Points

  1. Manual data extraction is error-prone and time-consuming
  2. PDF structures vary significantly between RFMOs
  3. Tables span multiple pages with varying headers
  4. Mixed content (text, tables, images) requires intelligent parsing
  5. No consistent way to handle updates when new PDFs are published

Decision

We will integrate Docling (https://github.com/docling-project/docling) as our PDF extraction solution for RFMO vessel data.

Why Docling?

  1. Advanced Table Extraction: Docling excels at extracting complex tables from PDFs, which is critical for vessel registries
  2. Multi-format Support: Handles PDF, DOCX, PPTX, and other formats some RFMOs might use
  3. Structured Output: Provides JSON/Markdown output that can be easily parsed
  4. Layout Understanding: Maintains document structure and relationships between elements
  5. Open Source: Aligns with our open-source stack and allows customization
  6. Active Development: IBM Research project with regular updates

Implementation Plan

Phase 1: Infrastructure Setup

# requirements.txt addition
docling>=2.0.0

# /app/scripts/import/vessels/data/RFMO/pdf_extraction/docling_config.py
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
from docling.document_converter import DocumentConverter, PdfFormatOption


class RFMOPDFExtractor:
    def __init__(self):
        # Favour table-extraction accuracy over speed for registry PDFs
        # (TableFormerMode.FAST is the cheaper alternative); image extraction
        # stays at its default (off) unless vessel photos are needed later
        pipeline_options = PdfPipelineOptions()
        pipeline_options.do_table_structure = True
        pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE

        self.converter = DocumentConverter(
            format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
        )

    def extract_vessel_data(self, pdf_path: str) -> dict:
        """Extract structured vessel data from an RFMO PDF."""
        doc = self.converter.convert(pdf_path).document
        return self._parse_vessel_tables(doc)
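
The _parse_vessel_tables method called above is not spelled out in this ADR. A minimal sketch, assuming docling's document model exposes detected tables via DoclingDocument.tables and each table offers export_to_dataframe(); the {"tables": [{"rows": ...}]} output shape is an assumption chosen to match what the Phase 3 parsers expect:

def _parse_vessel_tables(self, doc) -> dict:
    """Collect every detected table as a {"rows": [...]} block of row dicts."""
    tables = []
    for table in doc.tables:
        # export_to_dataframe() gives a pandas DataFrame whose columns
        # come from the detected header row
        df = table.export_to_dataframe()
        tables.append({"rows": df.to_dict(orient="records")})
    return {"tables": tables}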

Phase 2: RFMO-Specific Parsers

Each RFMO will have its own parser to handle unique PDF structures:

/app/scripts/import/vessels/data/RFMO/pdf_extraction/
├── __init__.py
├── docling_config.py
├── base_parser.py
├── parsers/
│   ├── __init__.py
│   ├── seafo_parser.py
│   ├── gfcm_parser.py
│   ├── siofa_parser.py
│   └── ccamlr_parser.py
└── tests/
    └── test_pdf_extraction.py
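
The shared base_parser.py is not defined in this ADR. A minimal sketch of the common interface each parser would implement; only the BaseRFMOParser name and parse_vessel_table signature come from Phase 3 below, the rest is illustrative:

from abc import ABC, abstractmethod


class BaseRFMOParser(ABC):
    """Common interface implemented by every RFMO-specific parser."""

    rfmo_code: str = ""  # e.g. "SEAFO"

    @abstractmethod
    def parse_vessel_table(self, table_data: dict) -> list:
        """Turn one extracted table into a list of normalised vessel dicts."""

    def parse_document(self, extracted: dict) -> list:
        """Run parse_vessel_table over every table extracted from one PDF."""
        vessels = []
        for table in extracted.get("tables", []):
            vessels.extend(self.parse_vessel_table(table))
        return vessels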

Phase 3: Integration with Existing Pipeline

# Example: SEAFO PDF extraction and cleaning
class SEAFOParser(BaseRFMOParser):
    def parse_vessel_table(self, table_data: dict) -> list:
        """Parse SEAFO-specific vessel table structure"""
        vessels = []

        for row in table_data['rows']:
            vessel = {
                'vessel_name': self._clean_text(row.get('Vessel Name')),
                'imo': self._extract_imo(row.get('IMO Number')),
                'flag': self._map_country(row.get('Flag State')),
                'call_sign': self._clean_text(row.get('Call Sign')),
                'vessel_type': self._map_vessel_type(row.get('Type')),
                # ... other fields
            }
            vessels.append(vessel)

        return vessels
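
The _clean_text, _extract_imo, _map_country and _map_vessel_type helpers are left undefined here; they would naturally live on BaseRFMOParser so every parser shares them. A hedged sketch of the first two as plain functions (the IMO check-digit rule is the standard one: the seventh digit equals the sum of the first six digits weighted 7 down to 2, modulo 10):

import re


def clean_text(value) -> str | None:
    """Collapse whitespace; empty cells become None."""
    if value is None:
        return None
    text = re.sub(r"\s+", " ", str(value)).strip()
    return text or None


def extract_imo(value) -> str | None:
    """Pull a 7-digit IMO number out of a cell and verify its check digit."""
    if value is None:
        return None
    match = re.search(r"\b(\d{7})\b", str(value))
    if not match:
        return None
    digits = match.group(1)
    checksum = sum(int(d) * w for d, w in zip(digits[:6], range(7, 1, -1)))
    return digits if checksum % 10 == int(digits[6]) else None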

Phase 4: Automated Pipeline

#!/bin/bash
# /app/scripts/import/vessels/data/RFMO/extract_and_clean_seafo_vessels.sh

# Download latest PDF
SEAFO_PDF_URL="https://www.seafo.org/vessel-registry.pdf"
wget -O /tmp/seafo_vessels.pdf "$SEAFO_PDF_URL"

# Extract using Docling
python3 /app/scripts/import/vessels/data/RFMO/pdf_extraction/extract_seafo.py \
    --input /tmp/seafo_vessels.pdf \
    --output /import/vessels/vessel_data/RFMO/raw/seafo_vessels_extracted.json

# Convert to CSV for existing pipeline
python3 /app/scripts/import/vessels/data/RFMO/pdf_extraction/json_to_csv.py \
    --input /import/vessels/vessel_data/RFMO/raw/seafo_vessels_extracted.json \
    --output /import/vessels/vessel_data/RFMO/raw/seafo_vessels.csv

# Use existing cleaning pipeline
/app/scripts/import/vessels/data/RFMO/clean_seafo_vessels.sh
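
json_to_csv.py is referenced above but not shown. A minimal sketch, assuming the extraction step writes a JSON object with a top-level "vessels" list of flat dicts (the shape produced by the Phase 3 parser):

# /app/scripts/import/vessels/data/RFMO/pdf_extraction/json_to_csv.py
import argparse
import csv
import json


def main() -> None:
    parser = argparse.ArgumentParser(description="Flatten extracted vessel JSON to CSV")
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args()

    with open(args.input, encoding="utf-8") as fh:
        vessels = json.load(fh)["vessels"]

    # Take the union of keys so the CSV stays stable even when some cells are missing
    fieldnames = sorted({key for vessel in vessels for key in vessel})
    with open(args.output, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(vessels)


if __name__ == "__main__":
    main()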

Data Quality Assurance

Validation Framework

class PDFExtractionValidator:
    def validate_extraction(self, extracted_data: dict, rfmo: str) -> dict:
        """Validate extracted data quality"""
        vessels = extracted_data['vessels']
        metrics = {
            'total_records': len(vessels),
            'imo_coverage': self._calculate_field_coverage(vessels, 'imo'),
            'name_coverage': self._calculate_field_coverage(vessels, 'vessel_name'),
            'parsing_errors': self._identify_parsing_errors(vessels),
            'data_anomalies': self._detect_anomalies(vessels)
        }

        # Alert if quality drops below threshold
        if metrics['imo_coverage'] < 0.3:  # 30% IMO coverage minimum
            self._alert_low_quality_extraction(rfmo, metrics)

        return metrics
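
Of the helpers above, _calculate_field_coverage is the simplest; a minimal sketch (parsing-error detection and anomaly rules are RFMO-specific and left open):

    def _calculate_field_coverage(self, vessels: list, field: str) -> float:
        """Fraction of records where the field is present and non-empty."""
        if not vessels:
            return 0.0
        return sum(1 for vessel in vessels if vessel.get(field)) / len(vessels)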

Manual Review Process

  1. First Run: Manual review of 10% sample
  2. Validation: Compare with manually extracted data
  3. Tuning: Adjust Docling parameters and parsing rules
  4. Production: Automated extraction with quality monitoring

Monitoring and Alerts

# Integration with existing monitoring
class PDFExtractionMonitor:
    def check_rfmo_pdf_updates(self):
        """Check for new PDF versions"""
        for rfmo in PDF_BASED_RFMOS:
            current_hash = self._get_pdf_hash(rfmo['url'])
            if current_hash != rfmo['last_hash']:
                self._trigger_extraction_pipeline(rfmo)
                self._notify_team(f"New {rfmo['name']} PDF detected")
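
PDF_BASED_RFMOS would be a small config list of {name, url, last_hash} records, and _get_pdf_hash can be a plain content hash. A sketch of the latter as a standalone function, assuming requests is already available in the stack:

import hashlib

import requests


def get_pdf_hash(url: str) -> str:
    """SHA-256 of the published PDF, streamed in chunks so large files are not held in memory."""
    digest = hashlib.sha256()
    with requests.get(url, stream=True, timeout=60) as response:
        response.raise_for_status()
        for chunk in response.iter_content(chunk_size=1 << 20):
            digest.update(chunk)
    return digest.hexdigest()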

Security Considerations

  1. PDF Sanitization: Scan PDFs for malicious content before processing
  2. Resource Limits: Set memory/CPU limits for Docling processing
  3. Sandboxing: Run extraction in isolated environment
  4. Validation: Strict input validation on extracted data
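
Points 2 and 3 can be combined by running each extraction in a separate process with hard limits. A minimal standard-library sketch; the memory cap and timeout are placeholder values, and resource.setrlimit is POSIX-only:

import resource
import subprocess


def run_extraction_sandboxed(pdf_path: str, output_path: str) -> None:
    """Run the extraction script with an address-space cap and a wall-clock timeout."""

    def limit_memory() -> None:
        # Cap the child process at 4 GiB of address space (illustrative value)
        resource.setrlimit(resource.RLIMIT_AS, (4 * 1024**3, 4 * 1024**3))

    subprocess.run(
        [
            "python3",
            "/app/scripts/import/vessels/data/RFMO/pdf_extraction/extract_seafo.py",
            "--input", pdf_path,
            "--output", output_path,
        ],
        preexec_fn=limit_memory,
        timeout=30 * 60,  # fail the job rather than hang on a pathological PDF
        check=True,
    )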

Consequences

Positive

  • Automated extraction from PDF-only RFMOs
  • Consistent data pipeline for all sources
  • Reduced manual effort and errors
  • Faster updates when new PDFs are published
  • Enables inclusion of 4+ additional RFMOs

Negative

  • Additional dependency (Docling)
  • PDF parsing may miss some edge cases
  • Requires RFMO-specific tuning
  • Processing overhead for large PDFs
  • May need manual intervention for complex layouts

Mitigation Strategies

  1. Implement comprehensive testing for each RFMO parser (see the test sketch after this list)
  2. Maintain fallback manual extraction process
  3. Regular quality audits of extracted data
  4. Progressive rollout (one RFMO at a time)
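
A starting point for tests/test_pdf_extraction.py, assuming the Phase 3 parser and the helper sketches above are in place; the fixture row and import path are illustrative:

# tests/test_pdf_extraction.py
from parsers.seafo_parser import SEAFOParser


def test_seafo_parser_maps_core_fields():
    """A hand-built table row should map onto the normalised vessel schema."""
    table_data = {
        'rows': [{
            'Vessel Name': '  EXAMPLE STAR ',
            'IMO Number': 'IMO 9074729',   # valid check digit
            'Flag State': 'Namibia',
            'Call Sign': 'V5AB',
            'Type': 'Trawler',
        }]
    }

    vessels = SEAFOParser().parse_vessel_table(table_data)

    assert len(vessels) == 1
    assert vessels[0]['vessel_name'] == 'EXAMPLE STAR'
    assert vessels[0]['imo'] == '9074729'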

Implementation Timeline

  1. Week 1-2: Docling integration and base infrastructure
  2. Week 3-4: SEAFO parser (pilot RFMO)
  3. Week 5-6: GFCM and SIOFA parsers
  4. Week 7-8: CCAMLR parser and testing
  5. Week 9-10: Production deployment and monitoring

Success Metrics

  • Successfully extract 95%+ of vessel records from PDFs
  • Achieve 80%+ IMO number extraction accuracy
  • Reduce manual extraction time by 90%
  • Enable monthly automated updates (vs quarterly manual)
  • Add 1000+ vessels from PDF-only RFMOs to MDM

References