Supplier changed invoice format. 847 documents failed. Built adaptive extraction system. Now handles 47 format variations automatically.
THE PROBLEM:
Extraction workflow tuned for specific format. Supplier redesigned invoice layout. Everything broke.
847 invoices failed overnight. Manual processing required. Client angry.
THE ADAPTIVE SYSTEM:
Built format-agnostic extraction using pattern recognition rather than fixed locations.
LAYER 1: DOCUMENT TYPE DETECTION
Classify document type before extraction. Invoice, receipt, contract, form?
Uses first-page content analysis. Keywords, structure patterns, common fields.
Accuracy: 96% of documents classified correctly across 47 document types.
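A minimal sketch of keyword-based type detection on first-page text. The keyword lists, the `classify_document` name, and the hit-ratio score are illustrative assumptions, not the production classifier, which also looks at structure patterns and common fields.

```python
import re

# Hypothetical keyword lists (assumption). A real system would cover every
# document type and also use structure patterns, which this sketch omits.
TYPE_KEYWORDS = {
    "invoice": ["invoice", "amount due", "bill to"],
    "receipt": ["receipt", "paid", "change due"],
    "contract": ["agreement", "party", "hereby"],
}


def classify_document(first_page_text):
    """Return (doc_type, score) for the best keyword match, or ("unknown", 0.0)."""
    text = first_page_text.lower()
    best_type, best_score = "unknown", 0.0
    for doc_type, keywords in TYPE_KEYWORDS.items():
        # Count how many of this type's keywords appear as whole words/phrases.
        hits = sum(
            1 for kw in keywords
            if re.search(r"\b" + re.escape(kw) + r"\b", text)
        )
        score = hits / len(keywords)
        if score > best_score:
            best_type, best_score = doc_type, score
    return best_type, best_score
```

The score is just the share of a type's keywords found on the page. A trained classifier would likely replace it once enough labeled documents exist.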
LAYER 2: FIELD LOCATION DISCOVERY
Find fields dynamically using pattern matching rather than fixed coordinates.
Amount field strategies:
1. Search for “$” or currency symbols
2. Find largest number on page
3. Look for “total”, “amount due” labels nearby
4. Verify number format (2 decimal places)
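The amount strategies above can be combined roughly like this. A sketch only: the regexes, the "label on the same line" rule, and the `find_amount` name are assumptions.

```python
import re

# Amount with exactly two decimals, optional currency symbol and thousands separators.
AMOUNT_RE = re.compile(r"[$€£]?\s*(\d{1,3}(?:,\d{3})+|\d+)\.(\d{2})\b")
# Whole-word labels, so "Subtotal" does not count as "total".
LABEL_RE = re.compile(r"\b(?:total|amount due)\b", re.IGNORECASE)


def find_amount(text):
    """Prefer an amount that follows a total label on the same line;
    otherwise fall back to the largest amount on the page."""
    candidates = []
    for m in AMOUNT_RE.finditer(text):
        value = float(m.group(1).replace(",", "") + "." + m.group(2))
        line_start = text.rfind("\n", 0, m.start()) + 1
        labelled = bool(LABEL_RE.search(text[line_start:m.start()]))
        candidates.append((labelled, value))
    if not candidates:
        return None
    # Tuples sort by label flag first, then by value.
    return max(candidates)[1]
```

Labeled amounts beat unlabeled ones, so a large line item does not beat a smaller labeled total.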
Invoice number strategies:
1. Search for “invoice #”, “inv #” keywords
2. Find alphanumeric after label
3. Check format patterns
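Invoice number lookup follows the same label-then-value idea. A sketch under assumptions: the label variants, the allowed ID characters, and the "must contain a digit" format check are illustrative.

```python
import re

# Label ("invoice #", "inv #", "invoice no.", "invoice number") followed by an ID.
INVOICE_NO_RE = re.compile(
    r"\b(?:invoice|inv)\.?\s*(?:#|no\.?|number)\s*:?\s*([A-Z0-9][A-Z0-9\-/]{2,})",
    re.IGNORECASE,
)


def find_invoice_number(text):
    """Return the alphanumeric ID that follows an invoice-number label, or None."""
    m = INVOICE_NO_RE.search(text)
    if m is None:
        return None
    candidate = m.group(1)
    # Format check (assumption): a plausible invoice ID contains at least one digit.
    return candidate if any(ch.isdigit() for ch in candidate) else None
```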
Date strategies:
1. Search for “date”, “invoice date” keywords
2. Parse date formats (MM/DD/YYYY, DD/MM/YYYY, YYYY-MM-DD)
3. Verify date is recent and reasonable (not in the future, not years old). This check also helps pick between MM/DD and DD/MM readings.
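Date parsing can be sketched with the standard library. The format order, the one-year window, and the `parse_invoice_date` name are assumptions. `today` is passed in so the check is testable.

```python
from datetime import date, datetime, timedelta

# Formats from the list above, tried in order. The order is an assumption:
# MM/DD/YYYY and DD/MM/YYYY are ambiguous whenever the day is 12 or less.
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d/%m/%Y"]


def parse_invoice_date(raw, today, max_age_days=365):
    """Return the first parse that is not in the future and not too old, else None."""
    for fmt in DATE_FORMATS:
        try:
            parsed = datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue  # Wrong format, try the next one.
        age = today - parsed
        if timedelta(0) <= age <= timedelta(days=max_age_days):
            return parsed
    return None
```

Example: with `today = date(2024, 3, 10)`, the string "05/03/2024" read as MM/DD is May 3 (in the future, rejected), so the DD/MM reading of March 5 wins.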
LAYER 3: MULTI-STRATEGY EXTRACTION
Try multiple extraction strategies. Use highest confidence result.
Confidence scoring:
– Pattern match: 0.95
– Keyword proximity: 0.85
– Format validation: 0.80
– Fallback: 0.60
Select the result from the highest-confidence strategy that returned a value. If the best score is below 0.75, flag for human review.
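The selection rule can be sketched as below. The confidence values and the 0.75 threshold come from the list above. The `Strategy` and `Result` types and the two example extractors are hypothetical.

```python
import re
from typing import Callable, NamedTuple, Optional

REVIEW_THRESHOLD = 0.75  # Below this, the result goes to human review.


class Strategy(NamedTuple):
    name: str
    confidence: float
    extract: Callable[[str], Optional[str]]


class Result(NamedTuple):
    value: Optional[str]
    strategy: Optional[str]
    confidence: float
    needs_review: bool


def run_strategies(text, strategies):
    """Run every strategy and keep the highest-confidence one that found a value."""
    best = Result(None, None, 0.0, True)
    for s in strategies:
        value = s.extract(text)
        if value is not None and s.confidence > best.confidence:
            best = Result(value, s.name, s.confidence, s.confidence < REVIEW_THRESHOLD)
    return best


# Hypothetical extractors for an amount field.
def _labelled_total(text):
    m = re.search(r"total:\s*\$?(\d+\.\d{2})", text, re.IGNORECASE)
    return m.group(1) if m else None


def _any_decimal(text):
    m = re.search(r"\d+\.\d{2}", text)
    return m.group(0) if m else None


STRATEGIES = [
    Strategy("pattern_match", 0.95, _labelled_total),
    Strategy("fallback", 0.60, _any_decimal),
]
```

A document where only the fallback fires still yields a value, but with `needs_review=True`. Nothing found at all also lands in review.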
LAYER 4: FORMAT VERSION DETECTION
Track format changes over time. Detect when supplier updates layout.
Store document structure fingerprint. Compare new documents to known fingerprints. If similarity <0.70, mark as new format version.
Human reviews first document of new format. System learns automatically.
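One way to sketch the fingerprint comparison. The post does not say how fingerprints are built, so the set-of-labels fingerprint and Jaccard similarity here are assumptions. Only the 0.70 threshold comes from the text above.

```python
NEW_FORMAT_THRESHOLD = 0.70  # Below this similarity, treat as a new format version.


def fingerprint(labels):
    """Assumed fingerprint: the set of normalized field labels found on the page."""
    return frozenset(label.strip().lower() for label in labels)


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def match_format(new_fp, known):
    """known maps version id -> fingerprint.
    Return (version_id, similarity), or (None, best_similarity) for a new format."""
    best_id, best_sim = None, 0.0
    for version_id, fp in known.items():
        sim = jaccard(new_fp, fp)
        if sim > best_sim:
            best_id, best_sim = version_id, sim
    if best_sim < NEW_FORMAT_THRESHOLD:
        return None, best_sim
    return best_id, best_sim
```

A `None` version id is the signal to queue the document for the human review step and store its fingerprint as a new version.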
THE IMPLEMENTATION:
PostgreSQL tables:
– document_formats (format_id, type, pattern_rules)
– format_versions (format_id, version, fingerprint)
– extraction_strategies (strategy_name, pattern, priority)
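A possible shape for these tables, as a sketch. Table and column names are from the list above. Column types, keys, and constraints are assumptions. The DDL is run against in-memory SQLite here only so the example is self-contained. The real system uses PostgreSQL, where `pattern_rules` might be a JSON column.

```python
import sqlite3

# Column types and keys are assumptions; only the names come from the post.
SCHEMA = """
CREATE TABLE document_formats (
    format_id     INTEGER PRIMARY KEY,
    type          TEXT NOT NULL,
    pattern_rules TEXT
);
CREATE TABLE format_versions (
    format_id   INTEGER NOT NULL REFERENCES document_formats(format_id),
    version     INTEGER NOT NULL,
    fingerprint TEXT NOT NULL,
    PRIMARY KEY (format_id, version)
);
CREATE TABLE extraction_strategies (
    strategy_name TEXT PRIMARY KEY,
    pattern       TEXT NOT NULL,
    priority      INTEGER NOT NULL
);
"""


def create_schema(conn):
    """Create the three tables on the given connection."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")  # SQLite stand-in for PostgreSQL.
create_schema(conn)
```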
THE NUMBERS:
Before adaptive system:
– Format change: Complete failure
– 847 documents failed
– 2 days manual processing
After adaptive system:
– 47 different format variations handled
– 94% automatic extraction accuracy
– 6% flagged for review
– 0 complete failures
– Format changes detected automatically (first document of a new format still gets human review)
REAL EXAMPLES:
Invoice layout redesign: Automatic detection, learned new pattern in 3 documents
New supplier added: Human reviewed 1 sample, system learned
Bilingual invoices: Used language-specific extraction rules
THE MONITORING:
Track: Format versions detected, confidence score distribution, review queue size.
Alert if: New format detected, confidence scores trending down.
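The "trending down" alert can be as simple as comparing two adjacent windows of scores. A sketch: the window size, the drop threshold, and the `confidence_trending_down` name are assumptions.

```python
from statistics import mean


def confidence_trending_down(scores, window=50, drop=0.05):
    """True if the mean of the latest `window` scores is more than `drop`
    below the mean of the `window` scores before them."""
    if len(scores) < 2 * window:
        return False  # Not enough history to compare two windows.
    previous = mean(scores[-2 * window:-window])
    recent = mean(scores[-window:])
    return previous - recent > drop
```

`scores` is the list of winning-strategy confidences in processing order, oldest first.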
THE LESSON:
Static extraction breaks easily. Adaptive extraction survives format changes. Multi-strategy approach handles real-world variation.
Investment: 12 hours building adaptive system
Return: Zero format-change failures in 6 months
