Brownfield Cartographer

A multi-agent codebase intelligence system that ingests any repository and produces a living, queryable knowledge graph of its architecture, data flows, and semantic structure. It is designed to accelerate engineer onboarding in production environments.

Python, Tree-sitter, Sqlglot, LangGraph, PageRank, Docker, dbt, Apache Airflow

Overview

Problem

Engineers joining large production codebases spend weeks building mental models of architecture, data flows, and module dependencies. Existing documentation is often stale, incomplete, or non-existent. The challenge is to extract a living, accurate knowledge graph automatically from the codebase itself.

Dataset

Any source code repository can serve as input. Validated on dbt jaffle-shop (37 modules, 19 datasets) and Apache Airflow (7,538 modules, 2,263 tasks across 520 DAGs). Supports Python, SQL, YAML, JavaScript, and TypeScript files.

Architecture

A multi-agent pipeline with four specialized agents:

(1) Surveyor Agent performs deep AST analysis via Tree-sitter, builds module dependency graphs, detects circular dependencies, and ranks architectural hubs via PageRank.
(2) Hydrologist Agent tracks data lineage across Python (Pandas/PySpark), SQL (Sqlglot-parsed), and YAML (Airflow/dbt) boundaries with full provenance tracking.
(3) Semanticist Agent (Phase 3) uses LLM-powered module summarization for business intent extraction.
(4) Archivist Agent (Phase 4) generates a living CODEBASE.md for injection into AI coding agents.

All outputs are stored in a structured .cartography/ directory.
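
To make the Surveyor Agent's hub ranking and cycle detection concrete, here is a minimal, dependency-free sketch. It is an illustration under stated assumptions, not the project's implementation: the graph shape ({module: [imported modules]}), the function names, the damping factor of 0.85, and the sample module names are all hypothetical.

```python
from collections import defaultdict


def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over {module: [imported modules]}."""
    nodes = set(graph) | {dep for deps in graph.values() for dep in deps}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Modules that import nothing (dangling nodes) spread their rank evenly.
        dangling = sum(rank[node] for node in nodes if not graph.get(node))
        new_rank = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for src, deps in graph.items():
            for dep in deps:
                # An import edge passes a share of the importer's rank to the imported module.
                new_rank[dep] += damping * rank[src] / len(deps)
        rank = new_rank
    return rank


def find_cycles(graph):
    """Return import cycles found via back edges in a depth-first search."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = defaultdict(int)  # every node starts WHITE (unvisited)
    stack, cycles = [], []

    def visit(node):
        color[node] = GREY
        stack.append(node)
        for dep in graph.get(node, []):
            if color[dep] == GREY:
                # dep is still on the DFS stack, so the path back to it is a cycle.
                cycles.append(stack[stack.index(dep):] + [dep])
            elif color[dep] == WHITE:
                visit(dep)
        stack.pop()
        color[node] = BLACK

    for node in list(graph):
        if color[node] == WHITE:
            visit(node)
    return cycles


# Hypothetical module graph: each key imports the modules in its list.
deps = {
    "app.api": ["app.models", "app.utils"],
    "app.models": ["app.utils", "app.db"],
    "app.db": ["app.models"],  # circular: app.models <-> app.db
    "app.utils": [],
}

ranks = pagerank(deps)
hubs = sorted(ranks, key=ranks.get, reverse=True)
cycles = find_cycles(deps)
```

Because edges point from importer to imported module, rank accumulates on heavily imported modules, which is one reasonable reading of "architectural hub". The recursive search is fine for a sketch; a repository the size of Airflow would likely need an iterative traversal or a graph library.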

Training

No model training. Static analysis uses Tree-sitter for AST parsing and Sqlglot for multi-dialect SQL parsing. PageRank identifies architectural hubs. Phase 3 integrates LLMs for semantic purpose extraction.
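
As a rough picture of what the static-analysis step extracts, the sketch below collects import statements from Python source. Note the substitution: it uses Python's built-in ast module as a stand-in for Tree-sitter, so it covers Python only and says nothing about how the project drives Tree-sitter. The function name and sample source are hypothetical.

```python
import ast


def extract_imports(source: str) -> set[str]:
    """Collect the absolute module names imported by a Python source string."""
    imports = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import a.b as c" contributes "a.b"
            imports.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # "from a.b import c" contributes "a.b"; relative imports are skipped
            imports.add(node.module)
    return imports


# Hypothetical module source used only for illustration.
sample = """
import os
import pandas as pd
from airflow.models import DAG
from . import helpers
"""

found = extract_imports(sample)
```

Running this over every file and keying the results by module name would yield the kind of {module: [imported modules]} mapping that a dependency graph and PageRank can be built on.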

Results

Validated on Apache Airflow: 7,538 modules analyzed, with 6,174 lineage nodes, 772 lineage edges, and 2,984 SQLAlchemy operations detected, in ~7 minutes. On dbt jaffle-shop: the extracted lineage matched the expected lineage exactly (100% accuracy) in ~1 second. Every extracted relationship carries full provenance: evidence type, confidence score, and resolution status.
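
One way to picture the provenance attached to each relationship is a small record type. This is a sketch only: the class and field names, the two resolution states, the [0, 1] confidence range, and the example values are assumptions, not the project's actual schema.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class Resolution(str, Enum):
    # Hypothetical states; the real system may track more.
    RESOLVED = "resolved"
    UNRESOLVED = "unresolved"


@dataclass(frozen=True)
class LineageEdge:
    source: str
    target: str
    evidence_type: str      # e.g. which analyzer produced the edge (assumed values)
    confidence: float       # assumed to lie in [0, 1]
    resolution: Resolution

    def __post_init__(self):
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be within [0, 1]")


# Illustrative values only, not output from the tool.
edge = LineageEdge(
    source="raw_orders",
    target="stg_orders",
    evidence_type="sql_parse",
    confidence=0.95,
    resolution=Resolution.RESOLVED,
)

# Resolution subclasses str, so the record serializes to plain JSON.
payload = json.dumps(asdict(edge))
```

Keeping the record immutable and validated at construction means downstream consumers can filter on confidence or resolution status without re-checking each edge.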
