# ADR-056: Using Docling for RFMO PDF Data Extraction

## Status

Proposed

## Context

Many Regional Fisheries Management Organizations (RFMOs) publish their vessel registries in PDF format rather than structured data formats like CSV or JSON. This presents a significant challenge for our vessel Master Data Management (MDM) system, which requires structured data for trust scoring and vessel matching.
Currently affected RFMOs:
- SEAFO (South East Atlantic Fisheries Organisation) - PDF only
- GFCM (General Fisheries Commission for the Mediterranean) - PDF reports
- SIOFA (Southern Indian Ocean Fisheries Agreement) - PDF lists
- CCAMLR (Commission for the Conservation of Antarctic Marine Living Resources) - PDF documents
### Current Pain Points
- Manual data extraction is error-prone and time-consuming
- PDF structures vary significantly between RFMOs
- Tables span multiple pages with varying headers
- Mixed content (text, tables, images) requires intelligent parsing
- No consistent way to handle updates when new PDFs are published
## Decision
We will integrate Docling (https://github.com/docling-project/docling) as our PDF extraction solution for RFMO vessel data.
### Why Docling?

- **Advanced Table Extraction**: Docling excels at extracting complex tables from PDFs, which is critical for vessel registries
- **Multi-format Support**: Handles PDF, DOCX, PPTX, and other formats some RFMOs might use
- **Structured Output**: Provides JSON/Markdown output that can be easily parsed
- **Layout Understanding**: Maintains document structure and relationships between elements
- **Open Source**: Aligns with our open-source stack and allows customization
- **Active Development**: IBM Research project with regular updates
## Implementation Plan

### Phase 1: Infrastructure Setup

```text
# requirements.txt addition
docling>=1.0.0
```

```python
# /app/scripts/import/vessels/data/RFMO/pdf_extraction/docling_config.py
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
from docling.document_converter import DocumentConverter, PdfFormatOption


class RFMOPDFExtractor:
    def __init__(self):
        # Enable table structure recognition; skip image generation (vessel
        # photos can be added later). ACCURATE trades speed for table quality.
        pipeline_options = PdfPipelineOptions()
        pipeline_options.do_table_structure = True
        pipeline_options.generate_picture_images = False
        pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE  # vs FAST
        self.converter = DocumentConverter(
            format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
        )

    def extract_vessel_data(self, pdf_path: str) -> dict:
        """Extract structured vessel data from an RFMO PDF."""
        result = self.converter.convert(pdf_path)
        return self._parse_vessel_tables(result.document)
```
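
`_parse_vessel_tables` is left thin in the shared extractor so that RFMO-specific interpretation stays in the Phase 2 parsers. A minimal sketch of what it could do, shown standalone and assuming the converted document exposes its recognised tables via `document.tables` with pandas available for `export_to_dataframe()`:

```python
# Sketch of RFMOPDFExtractor._parse_vessel_tables, shown standalone for clarity.
def parse_vessel_tables(document) -> dict:
    """Flatten every recognised table into header/row dicts for the RFMO parsers."""
    tables = []
    for table in document.tables:
        df = table.export_to_dataframe()  # requires pandas
        tables.append({
            "headers": list(df.columns),
            "rows": df.to_dict(orient="records"),
        })
    return {"tables": tables}
```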
### Phase 2: RFMO-Specific Parsers
Each RFMO will have its own parser to handle unique PDF structures, with shared cleaning helpers in a common base class (sketched after the layout below):
```text
/app/scripts/import/vessels/data/RFMO/pdf_extraction/
├── __init__.py
├── docling_config.py
├── base_parser.py
├── parsers/
│   ├── __init__.py
│   ├── seafo_parser.py
│   ├── gfcm_parser.py
│   ├── siofa_parser.py
│   └── ccamlr_parser.py
└── tests/
    └── test_pdf_extraction.py
```
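
A minimal sketch of the shared helpers `base_parser.py` could provide, matching the names used in the SEAFO example below; the mapping helpers are only stubbed here, since they depend on country and vessel-type lookup tables:

```python
# /app/scripts/import/vessels/data/RFMO/pdf_extraction/base_parser.py (sketch)
import re
from typing import Optional


class BaseRFMOParser:
    def _clean_text(self, value) -> Optional[str]:
        """Collapse whitespace and normalise empty cells to None."""
        if value is None:
            return None
        text = " ".join(str(value).split())
        return text or None

    def _extract_imo(self, value) -> Optional[str]:
        """Return a seven-digit IMO number only if its check digit is valid."""
        match = re.search(r"\b(\d{7})\b", str(value or ""))
        if not match:
            return None
        imo = match.group(1)
        # IMO check digit: first six digits weighted 7..2; sum mod 10 = last digit
        checksum = sum(int(d) * w for d, w in zip(imo[:6], range(7, 1, -1)))
        return imo if checksum % 10 == int(imo[6]) else None

    def _map_country(self, value) -> Optional[str]:
        raise NotImplementedError  # backed by a shared country lookup table

    def _map_vessel_type(self, value) -> Optional[str]:
        raise NotImplementedError  # backed by an RFMO-specific vessel type mapping
```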
### Phase 3: Integration with Existing Pipeline
```python
# Example: SEAFO PDF extraction and cleaning
class SEAFOParser(BaseRFMOParser):
    def parse_vessel_table(self, table_data: dict) -> list:
        """Parse SEAFO-specific vessel table structure."""
        vessels = []
        for row in table_data['rows']:
            vessel = {
                'vessel_name': self._clean_text(row.get('Vessel Name')),
                'imo': self._extract_imo(row.get('IMO Number')),
                'flag': self._map_country(row.get('Flag State')),
                'call_sign': self._clean_text(row.get('Call Sign')),
                'vessel_type': self._map_vessel_type(row.get('Type')),
                # ... other fields
            }
            vessels.append(vessel)
        return vessels
```
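
Tying the pieces together, the `extract_seafo.py` entry point used in Phase 4 would be little more than glue. A sketch, assuming the table dictionaries produced by the extractor sketch above; the module paths are illustrative:

```python
# Sketch of the glue an extract_seafo.py entry point might contain.
import json

from docling_config import RFMOPDFExtractor   # import paths are illustrative
from parsers.seafo_parser import SEAFOParser

extractor = RFMOPDFExtractor()
parser = SEAFOParser()

extracted = extractor.extract_vessel_data("/tmp/seafo_vessels.pdf")
vessels = []
for table in extracted["tables"]:
    vessels.extend(parser.parse_vessel_table(table))

with open("/import/vessels/vessel_data/RFMO/raw/seafo_vessels_extracted.json", "w") as f:
    json.dump({"vessels": vessels}, f, indent=2)
```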
### Phase 4: Automated Pipeline
```bash
#!/bin/bash
# /app/scripts/import/vessels/data/RFMO/extract_and_clean_seafo_vessels.sh
set -euo pipefail

# Download latest PDF
SEAFO_PDF_URL="https://www.seafo.org/vessel-registry.pdf"
wget -O /tmp/seafo_vessels.pdf "$SEAFO_PDF_URL"

# Extract using Docling
python3 /app/scripts/import/vessels/data/RFMO/pdf_extraction/extract_seafo.py \
    --input /tmp/seafo_vessels.pdf \
    --output /import/vessels/vessel_data/RFMO/raw/seafo_vessels_extracted.json

# Convert to CSV for existing pipeline
python3 /app/scripts/import/vessels/data/RFMO/pdf_extraction/json_to_csv.py \
    --input /import/vessels/vessel_data/RFMO/raw/seafo_vessels_extracted.json \
    --output /import/vessels/vessel_data/RFMO/raw/seafo_vessels.csv

# Use existing cleaning pipeline
/app/scripts/import/vessels/data/RFMO/clean_seafo_vessels.sh
```
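
The `json_to_csv.py` step only needs to flatten the extracted JSON into the tabular layout the existing cleaning scripts expect. A minimal sketch, assuming the `{"vessels": [...]}` structure of flat records written above:

```python
# /app/scripts/import/vessels/data/RFMO/pdf_extraction/json_to_csv.py (sketch)
import argparse
import csv
import json


def main() -> None:
    ap = argparse.ArgumentParser()
    ap.add_argument("--input", required=True)
    ap.add_argument("--output", required=True)
    args = ap.parse_args()

    with open(args.input) as f:
        vessels = json.load(f)["vessels"]

    # Union of keys across records, sorted for a stable column order
    fieldnames = sorted({key for vessel in vessels for key in vessel})
    with open(args.output, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(vessels)


if __name__ == "__main__":
    main()
```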
## Data Quality Assurance

### Validation Framework
```python
class PDFExtractionValidator:
    def validate_extraction(self, extracted_data: dict, rfmo: str) -> dict:
        """Validate extracted data quality."""
        metrics = {
            'total_records': len(extracted_data['vessels']),
            'imo_coverage': self._calculate_field_coverage(extracted_data, 'imo'),
            'name_coverage': self._calculate_field_coverage(extracted_data, 'vessel_name'),
            'parsing_errors': self._identify_parsing_errors(extracted_data),
            'data_anomalies': self._detect_anomalies(extracted_data),
        }

        # Alert if quality drops below threshold
        if metrics['imo_coverage'] < 0.3:  # 30% IMO coverage minimum
            self._alert_low_quality_extraction(rfmo, metrics)

        return metrics
```
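
Field coverage here is simply the fraction of extracted records carrying a non-empty value for a field; a sketch of that helper, shown standalone:

```python
# Sketch of PDFExtractionValidator._calculate_field_coverage, shown standalone.
def calculate_field_coverage(extracted_data: dict, field: str) -> float:
    """Fraction of extracted vessel records with a non-empty value for `field`."""
    vessels = extracted_data.get('vessels', [])
    if not vessels:
        return 0.0
    return sum(1 for vessel in vessels if vessel.get(field)) / len(vessels)
```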
### Manual Review Process

- **First Run**: Manual review of a 10% sample
- **Validation**: Compare with manually extracted data
- **Tuning**: Adjust Docling parameters and parsing rules
- **Production**: Automated extraction with quality monitoring
### Monitoring and Alerts

```python
# Integration with existing monitoring
class PDFExtractionMonitor:
    def check_rfmo_pdf_updates(self):
        """Check for new PDF versions."""
        for rfmo in PDF_BASED_RFMOS:
            current_hash = self._get_pdf_hash(rfmo['url'])
            if current_hash != rfmo['last_hash']:
                self._trigger_extraction_pipeline(rfmo)
                self._notify_team(f"New {rfmo['name']} PDF detected")
```
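
Change detection only needs a stable hash of the published file. A sketch of what `_get_pdf_hash` could do, shown standalone and using requests with streaming so large PDFs are never held fully in memory:

```python
import hashlib

import requests


def get_pdf_hash(url: str) -> str:
    """SHA-256 of the PDF at `url`, computed over streamed chunks."""
    digest = hashlib.sha256()
    with requests.get(url, stream=True, timeout=60) as response:
        response.raise_for_status()
        for chunk in response.iter_content(chunk_size=65536):
            digest.update(chunk)
    return digest.hexdigest()
```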
## Security Considerations

- **PDF Sanitization**: Scan PDFs for malicious content before processing
- **Resource Limits**: Set memory/CPU limits for Docling processing (see the sketch after this list)
- **Sandboxing**: Run extraction in an isolated environment
- **Validation**: Strict input validation on extracted data
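
For the resource-limit and sandboxing points, one standard-library option is to run each extraction in a child process with an address-space cap and a wall-clock timeout. A sketch with illustrative limits (`preexec_fn` is POSIX-only):

```python
import resource
import subprocess


def run_extraction_sandboxed(pdf_path: str, output_path: str, max_rss_gb: int = 4) -> None:
    """Run one Docling extraction in a child process with memory and time limits."""
    def _limit_memory():
        limit_bytes = max_rss_gb * 1024 ** 3
        resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))

    subprocess.run(
        [
            "python3",
            "/app/scripts/import/vessels/data/RFMO/pdf_extraction/extract_seafo.py",
            "--input", pdf_path,
            "--output", output_path,
        ],
        preexec_fn=_limit_memory,  # applied in the child before exec; POSIX only
        timeout=1800,              # kill extractions running longer than 30 minutes
        check=True,
    )
```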
## Consequences

### Positive
- Automated extraction from PDF-only RFMOs
- Consistent data pipeline for all sources
- Reduced manual effort and errors
- Faster updates when new PDFs are published
- Enables inclusion of 4+ additional RFMOs
### Negative
- Additional dependency (Docling)
- PDF parsing may miss some edge cases
- Requires RFMO-specific tuning
- Processing overhead for large PDFs
- May need manual intervention for complex layouts
### Mitigation Strategies
- Implement comprehensive testing for each RFMO parser
- Maintain fallback manual extraction process
- Regular quality audits of extracted data
- Progressive rollout (one RFMO at a time)
## Implementation Timeline

- **Week 1-2**: Docling integration and base infrastructure
- **Week 3-4**: SEAFO parser (pilot RFMO)
- **Week 5-6**: GFCM and SIOFA parsers
- **Week 7-8**: CCAMLR parser and testing
- **Week 9-10**: Production deployment and monitoring
## Success Metrics
- Successfully extract 95%+ of vessel records from PDFs
- Achieve 80%+ IMO number extraction accuracy
- Reduce manual extraction time by 90%
- Enable monthly automated updates (vs quarterly manual)
- Add 1000+ vessels from PDF-only RFMOs to MDM
## References

- Docling Documentation: https://github.com/docling-project/docling
- SEAFO Vessel Registry: https://www.seafo.org/Management/Vessels
- PDF Table Extraction Benchmarks: https://github.com/docling-project/docling#benchmarks