First client workflow deployed. Ran perfectly in testing. Week one production: silent failure. Client discovered 3 days later. 200 invoices unprocessed. Payroll delayed.
Built error handling system. 5 workflows running. 8,000+ documents processed. Zero silent failures in 4 months.
THE DISASTER:
Monday morning: “Where are my invoices?” Checked n8n. All green. “Successful.”
Investigated. The extraction API returned HTTP 200, but the extraction itself had failed. Confidence: 0%. The workflow wrote empty records anyway. Client didn’t know for 3 days.
THE THREE-TIER ERROR SYSTEM:
TIER 1 – AUTOMATIC RETRY:
Trigger: Confidence below 75%, timeout, rate limit
Action: Wait 30s, then retry with adjustments (e.g. a longer timeout)
Success rate: resolves 62% of all errors
TIER 2 – HUMAN REVIEW:
Trigger: Retry fails, required fields missing
Action: Review queue, Slack alert with document link
Success rate: resolves 34% of all errors
TIER 3 – CLIENT ESCALATION:
Trigger: Document in review queue over 24 hours
Action: Email client, attach document, provide ETA
Share: the remaining 4% of errors reach this tier
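The tier triggers above can be sketched as one routing function. This is a minimal sketch, not the actual n8n workflow. `DocumentState` and `next_tier` are hypothetical names. It assumes missing required fields go straight to human review, and that "retry fails" means all 3 attempts are used up.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.75   # Tier 1 trigger from the post
MAX_RETRIES = 3               # from the retry configuration
ESCALATION_HOURS = 24         # Tier 3 trigger from the post


@dataclass
class DocumentState:
    """Hypothetical per-document state; field names are assumptions."""
    confidence: float               # 0.0 to 1.0, reported by the extraction step
    missing_required: bool          # True if any required field is empty
    retry_count: int = 0            # automatic retries already attempted
    hours_in_queue: float = 0.0     # time spent in the human review queue
    transient_error: bool = False   # timeout or rate limit on the last attempt


def next_tier(doc: DocumentState) -> int:
    """Return 0 (no error), 1 (auto retry), 2 (human review) or 3 (client escalation)."""
    # Tier 3: anything stuck in the review queue for more than 24 hours.
    if doc.hours_in_queue > ESCALATION_HOURS:
        return 3
    failed = (
        doc.transient_error
        or doc.missing_required
        or doc.confidence < CONFIDENCE_THRESHOLD
    )
    if not failed:
        return 0
    # Tier 2: retries exhausted, or required fields missing (assumed to skip retry).
    if doc.missing_required or doc.retry_count >= MAX_RETRIES:
        return 2
    # Tier 1: low confidence, timeout or rate limit with retries still available.
    return 1
```

Checking the 24-hour rule first means a document cannot sit in the queue unnoticed, whatever put it there.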
THE MONITORING LAYER:
Dashboard tracks:
– Documents in queue
– Error rate last 24 hours (alert if over 5%)
– Processing time averages
– Confidence score distribution
Alerts:
– Slack: Tier 2 errors immediately
– Email: Error rate exceeds 5%
– SMS: System down over 10 minutes
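The alert rules above fit in a few lines. A rough sketch, assuming the counts come from the dashboard queries. `alerts_to_send` and its parameters are hypothetical, and actual delivery (Slack webhook, email, SMS gateway) is left out.

```python
from typing import List

ERROR_RATE_THRESHOLD = 0.05   # "alert if over 5%"
DOWNTIME_SMS_MINUTES = 10     # "System down over 10 minutes"


def alerts_to_send(processed_24h: int, errors_24h: int,
                   new_tier2_errors: int, minutes_down: float) -> List[str]:
    """Return the channels that should fire for the current dashboard numbers."""
    channels = []
    # Slack: every new Tier 2 error is announced immediately.
    if new_tier2_errors > 0:
        channels.append("slack")
    # Email: error rate over the last 24 hours exceeds 5%.
    if processed_24h > 0 and errors_24h / processed_24h > ERROR_RATE_THRESHOLD:
        channels.append("email")
    # SMS: system has been down for more than 10 minutes.
    if minutes_down > DOWNTIME_SMS_MINUTES:
        channels.append("sms")
    return channels
```

Example: `alerts_to_send(100, 6, 0, 0)` returns `["email"]`, because 6% is over the threshold and nothing else has tripped.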
ERROR DATABASE SCHEMA:
Table columns:
– error_id, document_id, timestamp, tier_triggered
– error_type, retry_count, resolution_status
– confidence_score, assigned_to, resolution_notes
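One way to lay out those columns, sketched with SQLite. The column names come from the list above. The table name `errors`, the types, defaults and the CHECK constraint are assumptions, not the production schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS errors (
    error_id          INTEGER PRIMARY KEY AUTOINCREMENT,
    document_id       TEXT    NOT NULL,
    timestamp         TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    tier_triggered    INTEGER NOT NULL CHECK (tier_triggered IN (1, 2, 3)),
    error_type        TEXT    NOT NULL,
    retry_count       INTEGER NOT NULL DEFAULT 0,
    resolution_status TEXT    NOT NULL DEFAULT 'open',
    confidence_score  REAL,
    assigned_to       TEXT,
    resolution_notes  TEXT
);
"""

# In-memory database for illustration; a real deployment would use a file or server.
conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

# Log a Tier 1 error for a hypothetical document with 0% confidence.
conn.execute(
    "INSERT INTO errors (document_id, tier_triggered, error_type, confidence_score) "
    "VALUES (?, ?, ?, ?)",
    ("invoice-001", 1, "low_confidence", 0.0),
)
row = conn.execute(
    "SELECT tier_triggered, retry_count, resolution_status FROM errors"
).fetchone()
```

The defaults mean a new error row starts with zero retries and an open status without the workflow having to set them.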
Retry configuration:
– Max attempts: 3
– Backoff: Exponential (30s, 90s, 270s)
– Timeout escalation: 30s → 60s → 90s
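The retry configuration can be sketched as a small loop. Assumptions: the three attempts are retries after the initial failure, `extract` is a hypothetical callable that takes a timeout in seconds and returns a dict with a `confidence` key, and a timeout surfaces as Python's built-in `TimeoutError`.

```python
import time
from typing import Callable, Optional

BACKOFF_SECONDS = (30, 90, 270)   # wait before each retry, from the post
TIMEOUT_SECONDS = (30, 60, 90)    # per-attempt timeout, from the post
CONFIDENCE_THRESHOLD = 0.75       # Tier 1 trigger from the post


def retry_extraction(extract: Callable[[float], dict],
                     sleep: Callable[[float], None] = time.sleep) -> Optional[dict]:
    """Retry up to three times; return a good result, or None to hand off to Tier 2."""
    for wait, timeout in zip(BACKOFF_SECONDS, TIMEOUT_SECONDS):
        sleep(wait)                      # back off before trying again
        try:
            result = extract(timeout)    # each attempt gets a longer timeout
        except TimeoutError:
            continue                     # transient failure, move to the next attempt
        if result.get("confidence", 0.0) >= CONFIDENCE_THRESHOLD:
            return result
    return None
```

Passing `sleep` as a parameter keeps the loop testable without actually waiting 390 seconds.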
THE RESULTS AFTER 4 MONTHS:
Documents processed: 8,247
Tier 1 auto-resolved: 412 (62% of errors)
Tier 2 human review: 226 (34% of errors)
Tier 3 escalations: 27 (4% of errors)
Total error rate: 8% (665 errors out of 8,247 documents; industry average: 15%)
Silent failures: 0
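A quick check of the arithmetic behind these numbers. The tier percentages are shares of all errors, not of all documents. The variable names are just for illustration.

```python
processed = 8247
tier_counts = {"tier1": 412, "tier2": 226, "tier3": 27}

# Every error lands in exactly one tier, so the tiers sum to the error total.
total_errors = sum(tier_counts.values())

# Error rate is measured against all processed documents.
error_rate = total_errors / processed

# Each tier's share is measured against errors only, rounded to whole percent.
shares = {tier: round(100 * n / total_errors) for tier, n in tier_counts.items()}
```

That gives 665 errors in total, an error rate of about 8%, and shares of 62%, 34% and 4%.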
Handling time per error (vs. before the system):
– Tier 1: 90 seconds (no human time)
– Tier 2: 8 minutes (vs 30 minutes)
– Tier 3: 45 minutes (vs 2 hours)
THE LESSON:
Production needs three safety nets. Auto-retry catches transient issues. Human review catches ambiguous cases. Escalation catches everything else.
Never trust “success” status alone. Validate output quality. Check confidence scores. Verify required fields.
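A minimal sketch of that validation step, the check that would have caught the original failure. The field names in `REQUIRED_FIELDS` and the payload shape (`confidence`, `fields`) are hypothetical, since the real extraction API response is not shown here.

```python
from typing import List

REQUIRED_FIELDS = ("invoice_number", "total", "vendor")  # hypothetical field names
CONFIDENCE_THRESHOLD = 0.75                              # Tier 1 trigger from the post


def validation_errors(status_code: int, payload: dict) -> List[str]:
    """Return a list of problems; an empty list means the output is usable."""
    problems = []
    # A 200 status is necessary but not sufficient.
    if status_code != 200:
        problems.append(f"http_status_{status_code}")
    # Check the confidence score reported by the extraction step.
    if payload.get("confidence", 0.0) < CONFIDENCE_THRESHOLD:
        problems.append("low_confidence")
    # Verify that every required field is present and non-empty.
    fields = payload.get("fields", {})
    for name in REQUIRED_FIELDS:
        if fields.get(name) in (None, ""):
            problems.append(f"missing_{name}")
    return problems
```

A response like the one in the disaster, status 200 with 0% confidence and no fields, comes back with four problems instead of "Successful."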
