Document Intelligence Refinery

An agentic document processing pipeline that automatically classifies documents, selects the optimal extraction strategy, and executes structured data extraction using a multi-stage LangGraph agent architecture.

Python, LangGraph, LangChain, FAISS, Docling, Pydantic, SQLite

Overview

Problem

Enterprise documents arrive in heterogeneous formats: native PDFs, scanned images, multi-column layouts, and domain-specific content. A one-size-fits-all extraction approach fails on this diversity. The challenge is to build a system that routes each document to the most cost-effective extraction strategy without manual intervention.

Dataset

A diverse corpus of PDF documents spanning financial reports, legal contracts, technical manuals, and medical records. Documents vary in structure (single/multi-column), origin (native digital vs. scanned), and domain, requiring different extraction strategies for each type.

Architecture

A four-stage agentic pipeline orchestrated with LangGraph:
(1) Triage Agent classifies each document as native digital or scanned and detects its layout type and domain.
(2) Extractor Router selects among three strategies: fast text extraction (pdfplumber), layout-aware extraction (Docling), or a vision-augmented multimodal fallback.
(3) Chunker and Indexer build semantic chunks and a FAISS vector index.
(4) Query Agent provides a LangGraph interface for structured fact extraction over the indexed content.
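The routing step can be illustrated with a minimal sketch. The `DocumentProfile` fields, the `Strategy` enum, and the specific rules in `select_strategy` are assumptions made for illustration; the project's actual triage output and routing rules may differ.

```python
from dataclasses import dataclass
from enum import Enum


class Strategy(Enum):
    FAST_TEXT = "fast_text"        # plain text extraction, e.g. pdfplumber
    LAYOUT_AWARE = "layout_aware"  # layout-aware extraction, e.g. Docling
    VISION = "vision"              # vision-augmented multimodal fallback


@dataclass(frozen=True)
class DocumentProfile:
    """Hypothetical triage result; field names are assumptions."""
    is_scanned: bool
    layout: str  # "single_column" or "multi_column"
    domain: str  # e.g. "financial", "legal", "technical", "medical"


def select_strategy(profile: DocumentProfile) -> Strategy:
    """Pick the cheapest strategy that is plausibly sufficient (assumed rules)."""
    if profile.is_scanned:
        # A scanned page has no reliable text layer, so use the vision path.
        return Strategy.VISION
    if profile.layout == "multi_column":
        # Native PDF with a complex layout: reading order needs layout analysis.
        return Strategy.LAYOUT_AWARE
    # Native, single-column PDF: plain text extraction is usually enough.
    return Strategy.FAST_TEXT


native_simple = DocumentProfile(is_scanned=False, layout="single_column", domain="legal")
print(select_strategy(native_simple).value)  # fast_text
```

Keeping the router a pure function of the triage profile makes the routing rules easy to unit test separately from the extractors themselves.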

Training

No model training is required. The system uses a pre-trained sentence-transformer model (all-MiniLM-L6-v2) for embeddings and Google Gemini for the query agent. Extraction rules are defined in a YAML rubric that governs triage logic and escalation thresholds.
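One way escalation thresholds could work is sketched below. The `RUBRIC` dictionary stands in for the parsed YAML file, and its keys, the threshold values, and the `extract_with_escalation` helper are all hypothetical; they are not taken from the project's configuration.

```python
from typing import Callable, Dict, Tuple

# Stand-in for the parsed YAML rubric. Keys and numbers are illustrative
# assumptions, not the project's actual configuration.
RUBRIC = {
    "escalation_order": ["fast_text", "layout_aware", "vision"],
    "min_confidence": {"fast_text": 0.85, "layout_aware": 0.75, "vision": 0.0},
}

# An extractor takes a file path and returns (text, confidence in [0, 1]).
Extractor = Callable[[str], Tuple[str, float]]


def extract_with_escalation(
    path: str,
    extractors: Dict[str, Extractor],
    rubric: dict = RUBRIC,
    start: str = "fast_text",
) -> Tuple[str, str, float]:
    """Try strategies from `start` onward until one meets its threshold.

    Returns (strategy_name, text, confidence). If no strategy meets its
    threshold, the result of the last strategy tried is returned.
    """
    order = rubric["escalation_order"]
    result = None
    for name in order[order.index(start):]:
        text, confidence = extractors[name](path)
        result = (name, text, confidence)
        if confidence >= rubric["min_confidence"][name]:
            break  # good enough, stop before paying for a costlier strategy
    return result


# Fake extractors with fixed confidences, used only to exercise the loop.
fake_extractors = {
    "fast_text": lambda path: ("garbled columns", 0.40),
    "layout_aware": lambda path: ("clean text", 0.90),
    "vision": lambda path: ("ocr text", 0.70),
}
strategy, text, confidence = extract_with_escalation("report.pdf", fake_extractors)
print(strategy, confidence)  # layout_aware 0.9
```

In this sketch the fast extractor falls below its threshold, so the document escalates once and stops at the layout-aware result without invoking the vision fallback.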

Results

Cost-aware routing reduces average processing time by selecting the fastest viable strategy for each document type. The ledger system records the strategy used, the confidence score, and the processing time for every extraction, which enables continuous monitoring and strategy optimization.
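A ledger of this kind could be kept in SQLite, roughly as follows. The table name, column names, and helper functions are assumptions for illustration, and the rows inserted at the end are made-up placeholder values, not measured results.

```python
import sqlite3


def open_ledger(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a ledger database with a hypothetical schema."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS extraction_ledger (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            document TEXT NOT NULL,
            strategy TEXT NOT NULL,
            confidence REAL NOT NULL,
            processing_seconds REAL NOT NULL
        )
        """
    )
    return conn


def record_extraction(conn, document, strategy, confidence, processing_seconds):
    """Append one ledger row; the `with` block commits the transaction."""
    with conn:
        conn.execute(
            "INSERT INTO extraction_ledger "
            "(document, strategy, confidence, processing_seconds) "
            "VALUES (?, ?, ?, ?)",
            (document, strategy, confidence, processing_seconds),
        )


def summarize_by_strategy(conn):
    """Return (strategy, count, avg_confidence, avg_seconds) per strategy."""
    return conn.execute(
        "SELECT strategy, COUNT(*), AVG(confidence), AVG(processing_seconds) "
        "FROM extraction_ledger GROUP BY strategy ORDER BY strategy"
    ).fetchall()


conn = open_ledger()
# Placeholder values for demonstration only.
record_extraction(conn, "a.pdf", "fast_text", 0.9, 0.5)
record_extraction(conn, "b.pdf", "fast_text", 0.8, 1.5)
record_extraction(conn, "c.pdf", "vision", 0.7, 4.0)
summary = summarize_by_strategy(conn)
for row in summary:
    print(row)
```

Aggregating per strategy is what makes the monitoring useful: a strategy whose average confidence drifts downward, or whose average time climbs, is a candidate for a threshold change in the rubric.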

GitHub Repository

View on GitHub