Brownfield Cartographer
Overview
A multi-agent codebase intelligence system that ingests any repository and produces a living, queryable knowledge graph of its architecture, data flows, and semantic structure — designed to accelerate engineer onboarding in production environments.
Problem
Engineers joining large production codebases spend weeks building mental models of architecture, data flows, and module dependencies. Existing documentation is often stale, incomplete, or nonexistent. The challenge is to extract a living, accurate knowledge graph automatically from the codebase itself.
Dataset
Any source code repository. Validated on dbt jaffle-shop (37 modules, 19 datasets) and Apache Airflow (7,538 modules, 2,263 tasks across 520 DAGs). Supports Python, SQL, YAML, JavaScript, and TypeScript files.
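As a small illustration of how a repository's files could be bucketed into the supported languages, the sketch below counts files by extension. The LANGUAGE_BY_SUFFIX table, the inventory function, and the sample paths are assumptions made for this example; the project's actual detection logic is not described here.

```python
from collections import Counter
from pathlib import PurePosixPath

# Assumed mapping from file suffix to language (illustrative, not the project's own table).
LANGUAGE_BY_SUFFIX = {
    ".py": "Python",
    ".sql": "SQL",
    ".yml": "YAML",
    ".yaml": "YAML",
    ".js": "JavaScript",
    ".ts": "TypeScript",
}


def inventory(paths):
    """Count files per supported language, skipping unsupported suffixes."""
    counts = Counter()
    for path in paths:
        language = LANGUAGE_BY_SUFFIX.get(PurePosixPath(path).suffix.lower())
        if language:
            counts[language] += 1
    return counts


# Hypothetical repository listing.
paths = [
    "models/customers.sql",
    "models/schema.yml",
    "dags/etl.py",
    "README.md",
    "web/app.ts",
]
counts = inventory(paths)
```

Files with other suffixes, such as README.md above, are simply left out of the count.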
Architecture
A multi-agent pipeline with four specialized agents: (1) the Surveyor Agent performs AST analysis via Tree-sitter, builds module dependency graphs, detects circular dependencies, and ranks architectural hubs with PageRank; (2) the Hydrologist Agent tracks data lineage across Python (pandas/PySpark), SQL (parsed with SQLGlot), and YAML (Airflow/dbt) boundaries, recording provenance for each relationship; (3) the Semanticist Agent (Phase 3) uses LLM-based module summarization to extract business intent; (4) the Archivist Agent (Phase 4) generates a living CODEBASE.md for injection into AI coding agents. All outputs are stored in a structured .cartography/ directory.
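As a rough illustration of the Surveyor Agent's dependency-graph and circular-dependency steps, the sketch below swaps Tree-sitter for Python's built-in ast module so that it runs without extra dependencies. The names build_import_graph and find_cycles, and the sample sources mapping, are hypothetical and not taken from the project.

```python
import ast


def build_import_graph(sources):
    """Map each module name to the set of in-repo modules it imports.

    `sources` is a hypothetical {module_name: source_text} mapping. Imports of
    modules outside that mapping (stdlib, third-party) are ignored, as are
    relative imports, to keep the sketch short.
    """
    graph = {name: set() for name in sources}
    for name, text in sources.items():
        for node in ast.walk(ast.parse(text)):
            if isinstance(node, ast.Import):
                targets = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                targets = [node.module]
            else:
                continue
            graph[name].update(t for t in targets if t in sources and t != name)
    return graph


def find_cycles(graph):
    """Return one cycle per back edge found by a depth-first search."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {node: WHITE for node in graph}
    stack = []
    cycles = []

    def visit(node):
        color[node] = GREY
        stack.append(node)
        for nxt in sorted(graph[node]):
            if color[nxt] == GREY:  # back edge: nxt is already on the current path
                cycles.append(stack[stack.index(nxt):] + [nxt])
            elif color[nxt] == WHITE:
                visit(nxt)
        stack.pop()
        color[node] = BLACK

    for node in sorted(graph):
        if color[node] == WHITE:
            visit(node)
    return cycles


# Hypothetical four-module repository with one import loop.
sources = {
    "orders": "import customers\nimport pandas as pd\n",
    "customers": "from payments import total\n",
    "payments": "import orders\n",
    "reports": "import orders\n",
}
graph = build_import_graph(sources)
cycles = find_cycles(graph)
```

In this example, cycles holds the single loop customers -> payments -> orders -> customers, while reports stays outside it. The recursive search is adequate for a small example; a repository with thousands of modules would likely need an iterative traversal to stay within Python's recursion limit.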
Training
No model is trained. Static analysis uses Tree-sitter for AST parsing and SQLGlot for multi-dialect SQL parsing. PageRank identifies architectural hubs. Phase 3 integrates LLMs for semantic purpose extraction.
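To make the hub-ranking step concrete, here is a minimal PageRank sketch using plain power iteration over an import graph. It assumes edges point from the importing module to the imported one, so heavily imported modules accumulate rank. The damping factor of 0.85 and the fixed iteration count are conventional choices rather than values stated by the project, and the module names are invented.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over {node: set_of_targets}.

    Every target must also appear as a key in `graph`.
    """
    nodes = sorted(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is redistributed evenly.
        dangling = sum(rank[node] for node in nodes if not graph[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        for node in nodes:
            for target in graph[node]:
                new_rank[target] += damping * rank[node] / len(graph[node])
        rank = new_rank
    return rank


# Hypothetical import graph: importer -> imported modules.
imports = {
    "api": {"models", "utils"},
    "cli": {"utils"},
    "models": {"utils"},
    "utils": set(),
}
rank = pagerank(imports)
hubs = sorted(rank, key=rank.get, reverse=True)
```

Here utils, which every other module imports directly, comes out as the top-ranked hub. A production pipeline would more likely call an existing graph library than hand-roll the iteration; the loop is spelled out only to show what the ranking computes.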
Results
Validated on Apache Airflow: 7,538 modules analyzed in ~7 minutes, yielding 6,174 lineage nodes, 772 lineage edges, and 2,984 detected SQLAlchemy operations. On dbt jaffle-shop: 100% agreement with the expected lineage in ~1 second. Every extracted relationship carries full provenance: evidence type, confidence score, and resolution status.
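One way to picture the provenance attached to each relationship is a small record type, sketched below. The class name LineageEdge, the field names, the example enum-like string values, and the 0.0 to 1.0 confidence range are all assumptions for illustration; the actual schema stored under .cartography/ is not specified here.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LineageEdge:
    """Hypothetical lineage relationship with its provenance fields."""

    source: str
    target: str
    evidence_type: str      # assumed values, e.g. "sql_parse", "python_ast", "yaml_config"
    confidence: float       # assumed to lie in [0.0, 1.0]
    resolution_status: str  # assumed values, e.g. "resolved", "unresolved"

    def __post_init__(self):
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")


# Illustrative edge between two dataset names; the values are made up.
edge = LineageEdge(
    source="raw_orders",
    target="stg_orders",
    evidence_type="sql_parse",
    confidence=0.95,
    resolution_status="resolved",
)
record = json.dumps(asdict(edge), sort_keys=True)
```

Serializing each edge to JSON in this way would keep the graph queryable with ordinary tooling, and the validation in __post_init__ rejects out-of-range confidence scores at construction time.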