# Evaluation & Benchmarks

Attocode includes a comprehensive evaluation framework for measuring code intelligence quality across repositories and languages.
## Quick Start

### Run the benchmark suite

```shell
# Benchmark on default 3 repos (attocode, gh-cli, redis)
python scripts/benchmark_ci.py

# Benchmark on specific repos
python scripts/benchmark_ci.py --repos attocode fastapi pandas

# Single run (faster, no median)
python scripts/benchmark_ci.py --repos attocode --num-runs 1

# Update baseline after improvements
python scripts/benchmark_ci.py --update-baseline
```
### Run search quality evaluation

```shell
# Evaluate semantic search with ground-truth relevance judgments
python -m eval.search_quality

# Single repo
python -m eval.search_quality --repo attocode

# Generate markdown report
python -m eval.search_quality --report eval/search_quality_report.md
```
### Run needle-in-haystack tasks

```shell
# All 15 deep code understanding tasks
python -m eval.needle_tasks

# Filter by type
python -m eval.needle_tasks --type trace_call_chain
python -m eval.needle_tasks --type architecture_quiz
python -m eval.needle_tasks --type impact_assessment

# Single task
python -m eval.needle_tasks --id arch_highest_fanin
```
### Run competitive comparison

```shell
# Compare search quality and latency across repos
python -m eval.competitive

# Generate report
python -m eval.competitive --report eval/competitive_report.md
```
## Benchmark Tasks (10 per repo)

| Task | What It Measures | Service Method |
|---|---|---|
| `bootstrap` | Project orientation speed and quality | `svc.bootstrap()` |
| `symbol_discovery` | Symbol search + cross-reference quality | `svc.search_symbols()` + `svc.cross_references()` |
| `dependency_tracing` | Forward/reverse dependency graph quality | `svc.dependency_graph()` + `svc.impact_analysis()` |
| `architecture` | Community detection + hotspot quality | `svc.community_detection()` + `svc.hotspots()` |
| `code_navigation` | File symbol listing + reference quality | `svc.symbols()` + `svc.cross_references()` |
| `semantic_search` | Ranked search result quality | `svc.semantic_search()` |
| `dead_code` | Unreferenced symbol detection | `svc.dead_code_data()` |
| `distill` | Code compression/signature extraction | `svc.distill_data()` |
| `graph_dsl` | Cypher-like dependency query | `svc.graph_dsl()` |
| `code_evolution` | Git history for a file | `svc.code_evolution_data()` |
## Run 3-way comparison (grep vs ast-grep vs code-intel)

```shell
# Default 3 repos
python scripts/benchmark_3way.py

# All 49 repos
python scripts/benchmark_3way.py --repos all

# Canonical published 20-repo slice
python scripts/benchmark_3way.py --slice published_20

# Canonical 20 repos plus Linux (clone if missing)
python scripts/benchmark_3way.py --slice published_20_plus_linux --clone-missing

# Specific repos
python scripts/benchmark_3way.py --repos fastapi,redis,metabase

# Skip code-intel (quick grep vs ast-grep only)
python scripts/benchmark_3way.py --skip-code-intel

# Resume a long run from the structured sidecar results file
python scripts/benchmark_3way.py --slice published_20 --resume
```
### Latest Results (v0.2.11, 20 repos)
| Metric | grep | ast-grep | code-intel |
|---|---|---|---|
| Avg Quality | 4.0/5 | 2.8/5 | 4.7/5 |
| Avg Bootstrap | 91ms | 538ms | 1.7s* |
| Perfect Scores (5/5) | 48/120 | 36/120 | 101/120 |
| Zero Scores (0/5) | 0 | 24 | 0 |
\* Bootstrap time after progressive hydration. Pre-hydration, large-repo times were 7-25 s.
Key findings:

- code-intel delivers the highest quality (4.7/5) with structured, concise output
- grep is fast (91 ms) and surprisingly competitive (4.0/5) for simple lookups
- ast-grep adds limited value: slower than grep with lower quality (2.8/5)
- Progressive hydration brings all repos under 4 s bootstrap (cockroach: 24.5 s → 1.2 s)

Charts and per-repo analysis: `eval/3WAY_BENCHMARK_REPORT.md`
### Configured Repos (49)
The 3-way benchmark covers 49 repositories across 30+ languages:
| Language | Repos |
|---|---|
| Python | attocode, fastapi, pandas, requests |
| Go | gh-cli, cockroach |
| Rust | deno, ripgrep, starship, nickel |
| C/C++ | redis, spdlog, cosmopolitan, protobuf |
| Java/Kotlin/Scala | spring-boot, okhttp, spark, cats-effect |
| JavaScript/TypeScript | express, prisma |
| Ruby | faker, rails |
| PHP | laravel, WordPress |
| Swift | SwiftFormat, vapor |
| Elixir/Erlang | phoenix, elixir, emqx, otp |
| Clojure | metabase, ring |
| Other | zls (Zig), luarocks (Lua), postgrest (Haskell), acme-sh (Bash), terraform-eks (HCL), crystal, dart-sdk, fsharp, ggplot2 (R), iTerm2 (Obj-C), julia, kemal (Crystal), mojo (Perl), Nim, ocaml, perl5 |
## Search Quality Metrics
The ground-truth evaluation computes standard information retrieval metrics:
- MRR@10 (Mean Reciprocal Rank) — Position of the first relevant result
- NDCG@10 (Normalized Discounted Cumulative Gain) — Ranking quality
- Precision@10 — Fraction of top-10 results that are relevant
- Recall@20 — Fraction of relevant files found in top-20
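As a rough sketch, all four metrics can be computed over a ranked result list with binary relevance judgments. This is a minimal illustration of the definitions above, not the `eval.search_quality` implementation:

```python
import math

def mrr_at_k(ranked, relevant, k=10):
    # Reciprocal rank of the first relevant result within the top k
    for i, f in enumerate(ranked[:k], start=1):
        if f in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(ranked, relevant, k=10):
    # Binary-relevance NDCG: discounted gain over the ranking,
    # normalized by the ideal (all relevant results ranked first)
    dcg = sum(1.0 / math.log2(i + 1)
              for i, f in enumerate(ranked[:k], start=1) if f in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

def precision_at_k(ranked, relevant, k=10):
    # Fraction of the top-k results that are relevant
    return sum(1 for f in ranked[:k] if f in relevant) / k

def recall_at_k(ranked, relevant, k=20):
    # Fraction of all relevant files recovered in the top k
    return sum(1 for f in ranked[:k] if f in relevant) / len(relevant) if relevant else 0.0
```

For example, with `ranked = ["a", "b", "c"]` and `relevant = {"b", "c"}`, the first relevant hit at rank 2 gives an MRR@10 of 0.5, and both relevant files appear in the top 20, so Recall@20 is 1.0.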
### Ground-Truth Format

Ground-truth files live in `eval/ground_truth/` as YAML:

```yaml
repo: attocode
queries:
  - query: "token budget management and enforcement"
    relevant_files:
      - src/attocode/types/budget.py
      - src/attocode/integrations/budget/economics.py
      - src/attocode/core/context.py
```
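A file of this shape can be loaded and sanity-checked with PyYAML. This is an illustrative helper (assuming `pyyaml` is installed), not part of the eval package:

```python
import yaml  # third-party: pip install pyyaml

def load_ground_truth(path):
    """Load a ground-truth YAML file and check the minimal expected shape."""
    with open(path) as f:
        data = yaml.safe_load(f)
    if "repo" not in data or "queries" not in data:
        raise ValueError(f"{path}: missing 'repo' or 'queries'")
    for q in data["queries"]:
        if not q.get("query") or not q.get("relevant_files"):
            raise ValueError(f"{path}: each query needs 'query' and 'relevant_files'")
    return data
```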
### Adding a New Repo

1. Add the repo to `REPO_CONFIGS` in `scripts/benchmark_ci.py`
2. Create `eval/ground_truth/<repo>.yaml` with 5-10 queries and verified relevant files
3. Run `python -m eval.search_quality --repo <repo>`
## Needle-in-Haystack Tasks

Five task types test deep code understanding:

| Type | What It Tests | Pass Criteria |
|---|---|---|
| `trace_call_chain` | Dependency tracing accuracy | Found callers match ground truth |
| `find_dead_code` | Unreferenced symbol detection | Non-empty results returned |
| `impact_assessment` | Blast radius estimation | Correct files identified as affected |
| `architecture_quiz` | Structural understanding | Answers match ground truth |
| `cross_file_symbol_resolution` | Symbol search completeness | All definitions + minimum usages found |

Tasks are defined in `eval/needle_tasks/tasks.yaml`.
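For orientation, a task entry might look something like the sketch below. The field names here are hypothetical; consult `eval/needle_tasks/tasks.yaml` for the real schema:

```yaml
# Hypothetical field names, for illustration only
- id: example_trace_call_chain
  type: trace_call_chain
  repo: attocode
  target: "some.module.function"
  expected_callers:
    - "another.module.caller"
```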
## Advanced: Online Benchmarks

Adapters exist for external benchmark datasets (these require `pip install datasets`):
### SWE-Atlas QnA
124 deep codebase understanding tasks from Scale Labs. Top models score <31.5%.
### PyCG Call Graphs

Verified Python call-graph ground truth for evaluating dependency tracing precision/recall.

```shell
python -m eval.pycg setup   # Clone benchmark repo
python -m eval.pycg run     # Evaluate
python -m eval.pycg report  # Generate report
```
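Treating a call graph as a set of `(caller, callee)` edges, precision and recall against a ground-truth graph reduce to set overlap. A sketch of the metric, not the `eval.pycg` code:

```python
def call_graph_prf(predicted, truth):
    """Precision/recall/F1 over call-graph edges, each edge a (caller, callee) pair."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)  # edges found in both graphs
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```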
### SWE-bench

Repository-level issue resolution (300 instances in Lite, 500 in Verified).

```shell
pip install datasets
python -m eval.swebench run --limit 10
python -m eval.swebench grade --run-id <id>
python -m eval.swebench leaderboard
```
## Regression Detection
The benchmark CI pipeline detects regressions against a committed baseline:
- `bootstrap_time_ms`: 15% threshold (relaxed for timing jitter)
- `symbol_count`: 10% threshold
- `quality_score`: 10% threshold
Results are persisted in `eval/benchmarks.db` (SQLite) for time-series tracking.
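The threshold check can be sketched as a pure function over baseline and current metric dicts. This is illustrative only; the real logic lives in the benchmark CI scripts:

```python
# Relative regression thresholds per metric, matching the list above
THRESHOLDS = {
    "bootstrap_time_ms": 0.15,  # higher is worse; relaxed for timing jitter
    "symbol_count": 0.10,       # lower is worse
    "quality_score": 0.10,      # lower is worse
}

def detect_regressions(baseline, current):
    """Return (metric, relative_delta) pairs that regressed past their threshold."""
    regressions = []
    for metric, threshold in THRESHOLDS.items():
        base, cur = baseline[metric], current[metric]
        if base == 0:
            continue  # avoid division by zero on an empty baseline
        delta = (cur - base) / base
        # Bootstrap time regresses upward; counts and scores regress downward
        worse = delta if metric == "bootstrap_time_ms" else -delta
        if worse > threshold:
            regressions.append((metric, round(delta, 3)))
    return regressions
```

For instance, a 20% bootstrap slowdown exceeds the 15% threshold and is flagged, while a 4% drop in symbol count stays under its 10% threshold.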
## File Cap

The file indexing cap controls how many files are analyzed during bootstrap. The default is 2,000 files; for large repos, increase it. Higher caps improve coverage but increase bootstrap time.