CodeGraph in Depth: How a Code Graph Cuts AI Coding Token Usage by 57%
Introduction
CodeGraph passed 30,000 stars within days of its launch. Its core idea is to parse a project with tree-sitter, store the resulting code graph in SQLite, and let the AI query that graph directly while coding. In a test on VS Code's large TypeScript repository, token consumption for the same question dropped from 1.4 million to 390,000 (about 72% fewer), and the number of tool calls fell by 62%. This article looks at how the code graph is built, how multiple languages are supported, and how CodeGraph integrates with mainstream AI coding tools.
What is CodeGraph?
CodeGraph is an open-source project that tackles token waste in LLM-assisted programming by building a code dependency graph. The traditional approach stuffs whole files from the codebase into the context, which consumes a huge number of tokens and lets irrelevant code interfere with the model's judgment. CodeGraph instead uses static analysis to build a structured graph of code relationships and queries only the relevant parts, which greatly reduces token consumption.
Core Architecture
1. Code Parsing Layer: Based on tree-sitter
CodeGraph uses tree-sitter for code parsing. tree-sitter is an incremental parsing library that produces accurate syntax trees and supports almost all mainstream programming languages:
- Advantages:
  - Incremental parsing with excellent performance
  - A unified interface across languages
  - Structured syntax trees as output
  - Error recovery, so partial or broken code can still be parsed
- Supported languages: Python, TypeScript/JavaScript, Rust, Go, Java, C/C++, C#, PHP, Ruby, Swift, Kotlin, Scala, HTML/CSS, JSON, YAML, and others, more than 50 languages in total.
2. Graph Construction
CodeGraph extracts the following definitions and relationships into the graph:
- Function/Method definitions: Name, parameters, return type, location, documentation comments
- Class definitions: Inheritance relationships, method lists, property lists
- Import dependencies: Module import relationships
- Call relationships: Function call references
- Type references: Type usage relationships
- Symbol references: Variable and constant references
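As a rough illustration of the extraction step, the sketch below pulls function definitions and direct call relationships out of a Python snippet. It deliberately uses Python's built-in `ast` module as a stand-in for tree-sitter, so it only handles Python; the dictionary keys mirror the kinds of fields listed above, and the sample source and the `extract` helper are hypothetical.

```python
import ast

# Hypothetical sample source to index
SOURCE = '''
def load(path):
    """Read a file."""
    return open(path).read()

def main():
    text = load("a.txt")
    print(text)
'''

def extract(source: str):
    """Return (symbols, calls) for one Python source string."""
    tree = ast.parse(source)
    symbols = []  # one dict per function definition
    calls = []    # (caller_name, callee_name) pairs for direct calls
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            params = ", ".join(a.arg for a in node.args.args)
            symbols.append({
                "kind": "function",
                "name": node.name,
                "signature": f"{node.name}({params})",
                "doc_comment": ast.get_docstring(node),
                "start_line": node.lineno,
                "end_line": node.end_lineno,
            })
            # Record calls whose target is a plain identifier, e.g. load(...)
            for inner in ast.walk(node):
                if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                    calls.append((node.name, inner.func.id))
    return symbols, calls

symbols, calls = extract(SOURCE)
for sym in symbols:
    print(sym["signature"], sym["start_line"], sym["end_line"])
print(calls)
```

Method calls such as `obj.method()` and calls inside nested functions would need extra handling; the sketch skips them to stay short.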
3. Storage Layer: SQLite + Full-text Search
All graph data is stored in an SQLite database with the following structure:
CREATE TABLE files (
    id INTEGER PRIMARY KEY,
    path TEXT UNIQUE NOT NULL,
    language TEXT NOT NULL,
    hash TEXT NOT NULL,
    size INTEGER NOT NULL
);
CREATE TABLE symbols (
    id INTEGER PRIMARY KEY,
    file_id INTEGER NOT NULL,
    kind TEXT NOT NULL, -- function, class, method, variable, constant, interface, type
    name TEXT NOT NULL,
    full_name TEXT NOT NULL,
    signature TEXT,
    doc_comment TEXT,
    start_line INTEGER NOT NULL,
    end_line INTEGER NOT NULL,
    FOREIGN KEY(file_id) REFERENCES files(id)
);
CREATE TABLE relationships (
    id INTEGER PRIMARY KEY,
    from_symbol_id INTEGER NOT NULL,
    to_symbol_id INTEGER NOT NULL,
    type TEXT NOT NULL, -- calls, imports, inherits, references
    FOREIGN KEY(from_symbol_id) REFERENCES symbols(id),
    FOREIGN KEY(to_symbol_id) REFERENCES symbols(id)
);
CREATE VIRTUAL TABLE symbols_fts USING fts5(
    name, full_name, doc_comment, content
);
Advantages of SQLite:
- No extra services required, local file storage
- Supports FTS5 full-text search for fast symbol search
- Transaction support, reliable incremental updates
- Built into most environments, zero deployment cost
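To make the storage layer more concrete, here is a minimal sketch of how such a database might be queried with Python's standard `sqlite3` module. It uses a trimmed version of the schema (fewer columns, no `files` table), the sample rows are invented, and the `search` and `get_callers` helpers are hypothetical, not CodeGraph's actual implementation. It also assumes the local SQLite build includes FTS5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE symbols (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    full_name TEXT NOT NULL,
    doc_comment TEXT
);
CREATE TABLE relationships (
    id INTEGER PRIMARY KEY,
    from_symbol_id INTEGER NOT NULL,
    to_symbol_id INTEGER NOT NULL,
    type TEXT NOT NULL
);
CREATE VIRTUAL TABLE symbols_fts USING fts5(name, full_name, doc_comment);
""")

# Invented sample data: (id, name, full_name, doc_comment)
rows = [
    (1, "parse_config", "app.config.parse_config", "Parse the YAML config file."),
    (2, "main", "app.cli.main", "Command line entry point."),
    (3, "render", "app.view.render", "Render a page."),
]
conn.executemany("INSERT INTO symbols VALUES (?, ?, ?, ?)", rows)
# Keep the FTS rowid equal to the symbol id so the two tables can be joined
conn.executemany(
    "INSERT INTO symbols_fts (rowid, name, full_name, doc_comment) VALUES (?, ?, ?, ?)",
    rows,
)
conn.execute(
    "INSERT INTO relationships (from_symbol_id, to_symbol_id, type) VALUES (2, 1, 'calls')"
)

def search(term):
    """Full-text search over symbol names and doc comments, best match first."""
    sql = """
        SELECT symbols.id, symbols.full_name
        FROM symbols_fts JOIN symbols ON symbols.id = symbols_fts.rowid
        WHERE symbols_fts MATCH ?
        ORDER BY rank
    """
    return conn.execute(sql, (term,)).fetchall()

def get_callers(symbol_id):
    """Full names of symbols that call the given symbol."""
    sql = """
        SELECT s.full_name
        FROM relationships r JOIN symbols s ON s.id = r.from_symbol_id
        WHERE r.to_symbol_id = ? AND r.type = 'calls'
    """
    return [row[0] for row in conn.execute(sql, (symbol_id,))]

print(search("config"))
print(get_callers(1))
```

Keeping the FTS `rowid` aligned with `symbols.id` is one simple way to connect a search hit back to the graph tables; other layouts are possible.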
How the Token Savings Work
Traditional Approach vs CodeGraph Approach
Traditional full-file context approach:
- Loads each file in full; at an average of 500 lines per file, that is roughly 1,500-3,000 tokens per file
- Token counts add up across files and quickly approach the model's context window limit
- Irrelevant code ends up in the context, which distracts the model and raises the error rate
CodeGraph approach:
1. Run a full-text search based on the user's question to find related symbols
2. Traverse the relationship graph breadth-first to collect further related symbols
3. Extract only the code snippets for those symbols and concatenate them into the context
4. Rank by relevance and keep only the top N most relevant symbols
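The traversal part of this flow can be sketched as a breadth-first walk with a depth limit and a token budget. Everything in the example is hypothetical: the in-memory `GRAPH`, the per-symbol `TOKENS` costs, and the `collect_context` helper. Distance from the seed symbol is used as a crude stand-in for the relevance ranking.

```python
from collections import deque

# Hypothetical relationship graph: symbol -> symbols it calls or references
GRAPH = {
    "login": ["check_password", "load_user"],
    "check_password": ["hash_password"],
    "load_user": ["db_query"],
    "hash_password": [],
    "db_query": [],
    "render_footer": [],
}
# Hypothetical token cost of each symbol's code snippet
TOKENS = {
    "login": 120, "check_password": 80, "load_user": 60,
    "hash_password": 40, "db_query": 200, "render_footer": 30,
}

def collect_context(seeds, max_depth, max_tokens):
    """Walk breadth-first from the seed symbols within a depth and token budget."""
    seen = set(seeds)
    queue = deque((s, 0) for s in seeds)
    picked, used = [], 0
    while queue:
        symbol, depth = queue.popleft()
        cost = TOKENS[symbol]
        if used + cost > max_tokens:
            continue  # does not fit; cheaper symbols later in the queue may
        picked.append(symbol)
        used += cost
        if depth < max_depth:
            for neighbour in GRAPH[symbol]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append((neighbour, depth + 1))
    return picked, used

picked, used = collect_context(["login"], max_depth=2, max_tokens=320)
print(picked, used)
```

With these numbers, `db_query` is dropped because it would exceed the budget, and `render_footer` is never visited because nothing reachable from `login` refers to it.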
Actual Test Data
Tests on three real projects:
| Project | Code size | Traditional tokens | CodeGraph tokens | Savings |
|---|---|---|---|---|
| 100K-line Python backend | 100,000 lines | 1,420,000 | 540,000 | 62% |
| 50K-line TypeScript frontend | 50,000 lines | 980,000 | 420,000 | 57% |
| 20K-line Rust library | 20,000 lines | 450,000 | 220,000 | 51% |
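The savings column follows directly from the two token columns. The short check below recomputes it; `savings_percent` is just an illustrative helper name.

```python
def savings_percent(baseline_tokens: int, codegraph_tokens: int) -> float:
    """Share of tokens saved relative to the full-file baseline, in percent."""
    return (1 - codegraph_tokens / baseline_tokens) * 100

# (baseline tokens, CodeGraph tokens) copied from the table
results = {
    "Python backend": (1_420_000, 540_000),
    "TypeScript frontend": (980_000, 420_000),
    "Rust library": (450_000, 220_000),
}

for project, (before, after) in results.items():
    print(f"{project}: {savings_percent(before, after):.1f}% saved")
```

Rounded to whole percentages, the three values come out as 62%, 57%, and 51%, matching the table.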
Why Does It Save So Many Tokens?
- Selective loading: only related symbols are loaded, and irrelevant code is skipped
- Structured extraction: only definitions are extracted, without implementation details unless they are needed
- Deduplication: a shared dependency is loaded only once
- Layered queries: symbols are found first, and implementations are loaded on demand
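A small sketch may help show how structured extraction, deduplication, and layered loading combine. It assumes symbol rows shaped like the `symbols` table (1-based `start_line` and `end_line`); the sample source, the `SYMBOLS` dictionary, and the `render` helper are hypothetical.

```python
# Hypothetical source file, split into lines
SOURCE_LINES = '''\
def area(w, h):
    """Return the area of a rectangle."""
    result = w * h
    return result

def perimeter(w, h):
    return 2 * (w + h)
'''.splitlines()

# Hypothetical symbol rows keyed by symbol id
SYMBOLS = {
    1: {"signature": "def area(w, h)",
        "doc_comment": "Return the area of a rectangle.",
        "start_line": 1, "end_line": 4},
    2: {"signature": "def perimeter(w, h)",
        "doc_comment": None,
        "start_line": 6, "end_line": 7},
}

def render(symbol_ids, expand=()):
    """Emit signature and doc for each symbol; full body only for ids in `expand`."""
    parts = []
    for sid in dict.fromkeys(symbol_ids):  # drop repeats, keep first-seen order
        sym = SYMBOLS[sid]
        if sid in expand:
            body = SOURCE_LINES[sym["start_line"] - 1:sym["end_line"]]
            parts.append("\n".join(body))
        else:
            doc = f"  # {sym['doc_comment']}" if sym["doc_comment"] else ""
            parts.append(sym["signature"] + doc)
    return "\n\n".join(parts)

overview = render([1, 2, 1])              # signatures only, symbol 1 listed once
detailed = render([1, 2, 1], expand={1})  # full body for symbol 1 on demand
print(overview)
print("---")
print(detailed)
```

The overview is the cheap first layer; a caller would expand individual symbols only when the model actually needs the implementation.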
Detailed Build Process
1. Initialize Project
codegraph init ./my-project
This step will:
- Create .codegraph/ directory
- Create SQLite database file codegraph.db
- Scan .gitignore for ignore rules
2. Incremental Indexing
codegraph index
Incremental indexing process:
- Traverse all project files
- Skip files ignored by .gitignore
- Skip binary files
- Compute each file's hash and reindex only the files that changed
- For each changed file:
  - Delete its old symbols and relationships
  - Parse it with tree-sitter to produce a syntax tree
  - Extract symbol definitions
  - Extract relationship references
  - Insert the results into the database
  - Update the FTS full-text index
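The change-detection part of this process can be sketched with the standard library alone. The example assumes SHA-256 as the hash and a NUL-byte heuristic for binary files, neither of which is confirmed by the source; the `known` dictionary stands in for the `hash` column of the `files` table, and `.gitignore` handling is left out for brevity.

```python
import hashlib
import tempfile
from pathlib import Path

def file_hash(path: Path) -> str:
    """Content hash used to detect changes (SHA-256 is an assumption)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def is_binary(path: Path) -> bool:
    """Heuristic: a NUL byte in the first 1 KiB marks the file as binary."""
    with path.open("rb") as fh:
        return b"\0" in fh.read(1024)

def changed_files(root: Path, known: dict) -> list:
    """Return text files whose hash differs from `known`, updating it in place."""
    changed = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or is_binary(path):
            continue
        digest = file_hash(path)
        key = str(path.relative_to(root))
        if known.get(key) != digest:
            known[key] = digest
            changed.append(key)
    return changed

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.py").write_text("x = 1\n")
    (root / "b.py").write_text("y = 2\n")
    (root / "logo.bin").write_bytes(b"\x00\x01\x02")

    known = {}
    first = changed_files(root, known)   # initial run: every text file is new
    (root / "b.py").write_text("y = 3\n")
    second = changed_files(root, known)  # later run: only the edited file

print(first, second)
```

Only the files returned by `changed_files` would then go through the delete, parse, extract, and insert steps.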
3. Query Interface
CodeGraph provides multiple query methods:
from codegraph import CodeGraph
cg = CodeGraph("./my-project")
# Full-text search symbols
results = cg.search("CodeGraph.search", limit=10)
# Query references by symbol
callers = cg.get_callers(symbol_id)
callees = cg.get_callees(symbol_id)
# Get related code context
context = cg.get_related_context("user query", max_tokens=100000)
Integration with AI Programming Tools
1. Integration into VS Code
CodeGraph can be integrated into VS Code as an extension:
- Indexes the project automatically when it is opened
- Updates the index incrementally as files change
- Injects relevant context automatically during AI chat
- Lets users manually mark key files
2. Integration into Cursor
Cursor integration through a custom command:
{
    "command": "codegraph-get-context",
    "description": "Get relevant code context from CodeGraph",
    "type": "prompt"
}
After the user asks a question, CodeGraph is called first to fetch the relevant context, and the question is then sent to the model together with that context.
3. Integration into OpenClaw/Agent Development Framework
Using CodeGraph in OpenClaw:
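def before_llm_callback(user_query, context):
    # Fetch the code context related to the user's question
    code_context = codegraph.get_related_context(user_query)
    # Put it in front of the existing context before the LLM call
    context.prepend(code_context)
    return context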
Performance Analysis
Indexing Performance
- First indexing: About 30 seconds for 100,000 lines of Python
- Incremental indexing: Only processes changed files, usually milliseconds
- Memory usage: About 100-200MB for 100,000 lines
- Database size: About 50-100MB disk space for 100,000 lines
Query Performance
- Full-text search: <10ms
- Relationship traversal: <5ms
- Context generation: 50-500ms, depending on the token count
Limitations and Improvement Directions
Current Limitations
- Static code only: dynamic typing and metaprogramming in dynamic languages cannot be handled
- No semantic understanding: the graph is based on syntax and references only and does not capture business semantics
- Large projects: even with these optimizations, the token count can still be large for very large projects (a million lines or more)
- Incomplete call relationships: some dynamic calls cannot be captured by static analysis
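The last limitation is easy to reproduce. The sketch below again uses Python's built-in `ast` module in place of tree-sitter, with a hypothetical `static_callees` helper, and shows a handler that is reachable at runtime but invisible to a purely syntactic call analysis.

```python
import ast

# Hypothetical code under analysis: dispatch() picks a handler by name at runtime
SOURCE = '''
def handle_create(): ...
def handle_delete(): ...

def dispatch(action):
    direct = handle_create()
    handler = globals()["handle_" + action]
    return handler()
'''

def static_callees(source, func_name):
    """Names that func_name calls directly as plain identifiers."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == func_name:
            return sorted({
                call.func.id
                for call in ast.walk(node)
                if isinstance(call, ast.Call) and isinstance(call.func, ast.Name)
            })
    return []

callees = static_callees(SOURCE, "dispatch")
print(callees)
```

The analysis sees `handle_create`, `globals`, and the local name `handler`, but it has no way to know that `dispatch("delete")` ends up calling `handle_delete`, so that edge would be missing from the graph.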
Future Improvement Directions
- Word vectors/embeddings: rank results by semantic similarity to improve relevance
- Layered context: a first-level overview plus second-level details that users can expand on demand
- Incremental update optimization: asynchronous background indexing that does not block the frontend
- Distributed storage: distributed storage and querying for very large repositories
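As one possible shape for the embedding idea, candidates found through the graph could be re-ranked by cosine similarity to the query. The vectors below are made up for illustration; a real system would obtain them from an embedding model, and `rerank` is a hypothetical helper rather than a planned CodeGraph API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Made-up 3-dimensional embeddings for three symbols
SYMBOL_VECTORS = {
    "auth.login": [0.9, 0.1, 0.0],
    "auth.hash_password": [0.7, 0.3, 0.1],
    "ui.render_footer": [0.0, 0.2, 0.9],
}

def rerank(query_vector, candidates, top_n):
    """Order candidate symbols by similarity to the query and keep the top N."""
    ranked = sorted(
        candidates,
        key=lambda name: cosine(query_vector, SYMBOL_VECTORS[name]),
        reverse=True,
    )
    return ranked[:top_n]

top = rerank([1.0, 0.0, 0.0], list(SYMBOL_VECTORS), top_n=2)
print(top)
```

Used this way, embeddings would complement the graph traversal instead of replacing it: the graph proposes candidates, and similarity decides which ones fit the token budget.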
Practical Application Scenarios
Suitable Scenarios
✅ Large codebase refactoring
✅ Code review assistance
✅ New developer onboarding
✅ Cross-file feature development
✅ Bug analysis and localization
Less Suitable Scenarios
❌ New projects with very little code
❌ Projects full of dynamic language metaprogramming
❌ One-off script projects
CodeGraph Practice in Kaihe AIBOX
For local AI programming on Kaihe AIBOX, CodeGraph is integrated into the default workflow:
- The local agent builds the CodeGraph index automatically when a project is opened
- Relevant context is queried and injected automatically when the user asks a question
- The token savings let AIBOX handle large projects even with small local models
- Everything runs locally, so code is never uploaded to the cloud and stays private
Conclusion
CodeGraph builds a code dependency graph through static analysis and cuts token consumption by more than 50% on average without losing information quality, which lets LLM-assisted programming handle larger projects. Combined with Kaihe AIBOX local hardware, developers can get a smooth AI coding experience on large projects while keeping their code fully private.
GitHub: https://github.com/FuzzyCodeGraph/CodeGraph
Official Documentation: https://codegraph.dev/docs