# Evaluation & Benchmarks

Attocode includes a comprehensive evaluation framework for measuring code intelligence quality across repositories and languages.
## Quick Start

### Run the benchmark suite

```shell
# Benchmark on default 3 repos (attocode, gh-cli, redis)
python scripts/benchmark_ci.py

# Benchmark on specific repos
python scripts/benchmark_ci.py --repos attocode fastapi pandas

# Single run (faster, no median)
python scripts/benchmark_ci.py --repos attocode --num-runs 1

# Update baseline after improvements
python scripts/benchmark_ci.py --update-baseline
```
### Run search quality evaluation

```shell
# Evaluate semantic search with ground-truth relevance judgments
python -m eval.search_quality

# Single repo
python -m eval.search_quality --repo attocode

# Generate markdown report
python -m eval.search_quality --report eval/search_quality_report.md
```
### Run needle-in-haystack tasks

```shell
# All 15 deep code understanding tasks
python -m eval.needle_tasks

# Filter by type
python -m eval.needle_tasks --type trace_call_chain
python -m eval.needle_tasks --type architecture_quiz
python -m eval.needle_tasks --type impact_assessment

# Single task
python -m eval.needle_tasks --id arch_highest_fanin
```
### Run competitive comparison

```shell
# Compare search quality and latency across repos
python -m eval.competitive

# Generate report
python -m eval.competitive --report eval/competitive_report.md
```
## Benchmark Tasks (10 per repo)

| Task | What It Measures | Service Method |
|---|---|---|
| `bootstrap` | Project orientation speed and quality | `svc.bootstrap()` |
| `symbol_discovery` | Symbol search + cross-reference quality | `svc.search_symbols()` + `svc.cross_references()` |
| `dependency_tracing` | Forward/reverse dependency graph quality | `svc.dependency_graph()` + `svc.impact_analysis()` |
| `architecture` | Community detection + hotspot quality | `svc.community_detection()` + `svc.hotspots()` |
| `code_navigation` | File symbol listing + reference quality | `svc.symbols()` + `svc.cross_references()` |
| `semantic_search` | Ranked search result quality | `svc.semantic_search()` |
| `dead_code` | Unreferenced symbol detection | `svc.dead_code_data()` |
| `distill` | Code compression/signature extraction | `svc.distill_data()` |
| `graph_dsl` | Cypher-like dependency query | `svc.graph_dsl()` |
| `code_evolution` | Git history for a file | `svc.code_evolution_data()` |
## Run 3-way comparison (grep vs ast-grep vs code-intel)

```shell
# Default 3 repos
python scripts/benchmark_3way.py

# All 49 repos
python scripts/benchmark_3way.py --repos all

# Canonical published 20-repo slice
python scripts/benchmark_3way.py --slice published_20

# Canonical 20 repos plus Linux (clone if missing)
python scripts/benchmark_3way.py --slice published_20_plus_linux --clone-missing

# Specific repos
python scripts/benchmark_3way.py --repos fastapi,redis,metabase

# Skip code-intel (quick grep vs ast-grep only)
python scripts/benchmark_3way.py --skip-code-intel

# Resume a long run from the structured sidecar results file
python scripts/benchmark_3way.py --slice published_20 --resume
```
### Latest Results (v0.2.11, 20 repos)
| Metric | grep | ast-grep | code-intel |
|---|---|---|---|
| Avg Quality | 4.0/5 | 2.8/5 | 4.7/5 |
| Avg Bootstrap | 91ms | 538ms | 1.7s* |
| Perfect Scores (5/5) | 48/120 | 36/120 | 101/120 |
| Zero Scores (0/5) | 0 | 24 | 0 |
\* Bootstrap time after progressive hydration. Pre-hydration, large-repo times were 7-25 s.
Key findings:

- code-intel delivers the highest quality (4.7/5) with structured, concise output
- grep is fast (91 ms) and surprisingly competitive (4.0/5) for simple lookups
- ast-grep adds limited value: slower than grep with lower quality (2.8/5)
- Progressive hydration brings all repos under 4 s bootstrap (cockroach: 24.5 s → 1.2 s)

Charts and per-repo analysis: `eval/3WAY_BENCHMARK_REPORT.md`
### Configured Repos (49)
The 3-way benchmark covers 49 repositories across 30+ languages:
| Language | Repos |
|---|---|
| Python | attocode, fastapi, pandas, requests |
| Go | gh-cli, cockroach |
| Rust | deno, ripgrep, starship, nickel |
| C/C++ | redis, spdlog, cosmopolitan, protobuf |
| Java/Kotlin/Scala | spring-boot, okhttp, spark, cats-effect |
| JavaScript/TypeScript | express, prisma |
| Ruby | faker, rails |
| PHP | laravel, WordPress |
| Swift | SwiftFormat, vapor |
| Elixir/Erlang | phoenix, elixir, emqx, otp |
| Clojure | metabase, ring |
| Other | zls (Zig), luarocks (Lua), postgrest (Haskell), acme-sh (Bash), terraform-eks (HCL), crystal, dart-sdk, fsharp, ggplot2 (R), iTerm2 (Obj-C), julia, kemal (Crystal), mojo (Perl), Nim, ocaml, perl5 |
## Search Quality Metrics
The ground-truth evaluation computes standard information retrieval metrics:
- MRR@10 (Mean Reciprocal Rank) — Position of the first relevant result
- NDCG@10 (Normalized Discounted Cumulative Gain) — Ranking quality
- Precision@10 — Fraction of top-10 results that are relevant
- Recall@20 — Fraction of relevant files found in top-20
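As a rough sketch, all four metrics can be computed over a ranked result list with binary relevance judgments. This is a minimal illustration of the definitions above, not the `eval.search_quality` implementation:

```python
import math

def mrr_at_k(ranked, relevant, k=10):
    # Reciprocal rank of the first relevant result within the top k
    for i, f in enumerate(ranked[:k], start=1):
        if f in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(ranked, relevant, k=10):
    # Binary-relevance NDCG: discounted gain over the ranking,
    # normalized by the ideal (all relevant results ranked first)
    dcg = sum(1.0 / math.log2(i + 1)
              for i, f in enumerate(ranked[:k], start=1) if f in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

def precision_at_k(ranked, relevant, k=10):
    # Fraction of the top-k results that are relevant
    return sum(1 for f in ranked[:k] if f in relevant) / k

def recall_at_k(ranked, relevant, k=20):
    # Fraction of all relevant files recovered in the top k
    return sum(1 for f in ranked[:k] if f in relevant) / len(relevant) if relevant else 0.0
```

For example, with `ranked = ["a", "b", "c"]` and `relevant = {"b", "c"}`, the first relevant hit at rank 2 gives an MRR@10 of 0.5, and both relevant files appear in the top 20, so Recall@20 is 1.0.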
### Ground-Truth Format

Ground-truth files live in `eval/ground_truth/` as YAML:

```yaml
repo: attocode
queries:
  - query: "token budget management and enforcement"
    relevant_files:
      - src/attocode/types/budget.py
      - src/attocode/integrations/budget/economics.py
      - src/attocode/core/context.py
```
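A file of this shape can be loaded and sanity-checked with PyYAML. This is an illustrative helper (assuming `pyyaml` is installed), not part of the eval package:

```python
import yaml  # third-party: pip install pyyaml

def load_ground_truth(path):
    """Load a ground-truth YAML file and check the minimal expected shape."""
    with open(path) as f:
        data = yaml.safe_load(f)
    if "repo" not in data or "queries" not in data:
        raise ValueError(f"{path}: missing 'repo' or 'queries'")
    for q in data["queries"]:
        if not q.get("query") or not q.get("relevant_files"):
            raise ValueError(f"{path}: each query needs 'query' and 'relevant_files'")
    return data
```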
### Adding a New Repo

1. Add the repo to `REPO_CONFIGS` in `scripts/benchmark_ci.py`
2. Create `eval/ground_truth/<repo>.yaml` with 5-10 queries and verified relevant files
3. Run `python -m eval.search_quality --repo <repo>`
## Needle-in-Haystack Tasks

Five task types test deep code understanding:

| Type | What It Tests | Pass Criteria |
|---|---|---|
| `trace_call_chain` | Dependency tracing accuracy | Found callers match ground truth |
| `find_dead_code` | Unreferenced symbol detection | Non-empty results returned |
| `impact_assessment` | Blast radius estimation | Correct files identified as affected |
| `architecture_quiz` | Structural understanding | Answers match ground truth |
| `cross_file_symbol_resolution` | Symbol search completeness | All definitions + minimum usages found |

Tasks are defined in `eval/needle_tasks/tasks.yaml`.
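For orientation, a task entry might look something like the sketch below. The field names here are hypothetical; consult `eval/needle_tasks/tasks.yaml` for the real schema:

```yaml
# Hypothetical field names, for illustration only
- id: example_trace_call_chain
  type: trace_call_chain
  repo: attocode
  target: "some.module.function"
  expected_callers:
    - "another.module.caller"
```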
## Advanced: Online Benchmarks

Adapters exist for external benchmark datasets (these require `pip install datasets`):
### SWE-Atlas QnA
124 deep codebase understanding tasks from Scale Labs. Top models score <31.5%.
### PyCG Call Graphs

Verified Python call-graph ground truth for evaluating dependency tracing precision/recall.

```shell
python -m eval.pycg setup   # Clone benchmark repo
python -m eval.pycg run     # Evaluate
python -m eval.pycg report  # Generate report
```
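Treating a call graph as a set of `(caller, callee)` edges, precision and recall against a ground-truth graph reduce to set overlap. A sketch of the metric, not the `eval.pycg` code:

```python
def call_graph_prf(predicted, truth):
    """Precision/recall/F1 over call-graph edges, each edge a (caller, callee) pair."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)  # edges found in both graphs
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```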
### SWE-bench

Repository-level issue resolution (300 instances in Lite, 500 in Verified).

```shell
pip install datasets
python -m eval.swebench run --limit 10
python -m eval.swebench grade --run-id <id>
python -m eval.swebench leaderboard
```
## Regression Detection
The benchmark CI pipeline detects regressions against a committed baseline:
- `bootstrap_time_ms`: 15% threshold (relaxed for timing jitter)
- `symbol_count`: 10% threshold
- `quality_score`: 10% threshold
Results are persisted in `eval/benchmarks.db` (SQLite) for time-series tracking.
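The threshold check can be sketched as a pure function over baseline and current metric dicts. This is illustrative only; the real logic lives in the benchmark CI scripts:

```python
# Relative regression thresholds per metric, matching the list above
THRESHOLDS = {
    "bootstrap_time_ms": 0.15,  # higher is worse; relaxed for timing jitter
    "symbol_count": 0.10,       # lower is worse
    "quality_score": 0.10,      # lower is worse
}

def detect_regressions(baseline, current):
    """Return (metric, relative_delta) pairs that regressed past their threshold."""
    regressions = []
    for metric, threshold in THRESHOLDS.items():
        base, cur = baseline[metric], current[metric]
        if base == 0:
            continue  # avoid division by zero on an empty baseline
        delta = (cur - base) / base
        # Bootstrap time regresses upward; counts and scores regress downward
        worse = delta if metric == "bootstrap_time_ms" else -delta
        if worse > threshold:
            regressions.append((metric, round(delta, 3)))
    return regressions
```

For instance, a 20% bootstrap slowdown exceeds the 15% threshold and is flagged, while a 4% drop in symbol count stays under its 10% threshold.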
## File Cap

The file indexing cap controls how many files are analyzed during bootstrap. The default is 2,000 files; for large repos, increase it. Higher caps improve coverage but increase bootstrap time.