Research Campaigns Guide¶

Research campaigns automate iterative experimentation: the orchestrator generates hypotheses, spawns agent experiments in isolated git worktrees, evaluates each result against a metric, and promotes winners -- all in a loop you can steer while it runs.

Overview¶

A research campaign is a structured search over code changes. You define:

A goal (natural language description of what you want to improve).
An eval command (a shell command that outputs a numeric metric).
Budget limits (max experiments, max cost, wall-clock timeout).

The orchestrator then:

Measures a baseline by running the eval command on the current code.
Generates hypotheses using a mix of strategies (explore, exploit, ablate, compose, reproduce).
Spawns each experiment in an isolated git worktree so the main branch stays clean.
Evaluates each experiment and applies an accept policy to decide whether the result improves on the current best.
Tracks everything in a SQLite database for inspection, comparison, and resume.

Campaigns can run unattended for hours, or you can steer them mid-flight with injected notes, manual holds, and promotions.

Quick Start¶

# Start a campaign to maximize test pass rate
attoswarm research start \
  "Improve the parser to handle edge cases" \
  --eval-command "python -m pytest tests/ -q --tb=no | tail -1" \
  --metric-name pass_rate \
  --metric-direction maximize \
  --target-files src/parser.py \
  --max-experiments 20 \
  --max-cost 10.0

# Check the leaderboard while it runs (from another terminal)
attoswarm research leaderboard --run-id <run-id>

# Inject a steering note to guide future hypotheses
attoswarm research inject <run-id> "Focus on unicode edge cases, not performance"

# View the full feed (leaderboard + findings + steering notes)
attoswarm research feed --run-id <run-id>

The run ID is printed at startup and is also stored in .agent/research/research.db.

How It Works¶

Worktree Isolation¶

Every experiment runs in its own git worktree under .agent/research/experiments/<experiment-id>/worktree. The worktree is branched from the current best commit (or HEAD for the first experiment).

.agent/research/
  research.db                          # SQLite experiment database
  experiments/
    <run-id>-e001-abcd/
      worktree/                        # isolated git worktree
    <run-id>-e002-ef01/
      worktree/

This means:

The main branch is never modified during a campaign.
Multiple experiments can run in parallel without conflicts.
Each experiment's worktree can be inspected after the run.
Set preserve_worktrees: false in config to clean up worktrees for rejected experiments automatically.
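
As a rough illustration of the layout above, the sketch below builds the `git worktree add` command that would create one experiment's worktree. The helper name, its signature, and the `research/<experiment-id>` branch naming are assumptions made for illustration and are not taken from the orchestrator; only the `git worktree add -b <branch> <path> <commit-ish>` form is standard git.

```python
from pathlib import Path


def worktree_add_command(run_dir: str, experiment_id: str, base_ref: str = "HEAD") -> list:
    """Build a `git worktree add` command for one experiment (hypothetical helper)."""
    # Path layout follows the directory tree shown above.
    worktree_path = Path(run_dir) / "experiments" / experiment_id / "worktree"
    # The branch name is an assumption; the real naming scheme is not documented here.
    branch = f"research/{experiment_id}"
    return ["git", "worktree", "add", "-b", branch, worktree_path.as_posix(), base_ref]


cmd = worktree_add_command(".agent/research", "abc12345-e001-abcd")
print(" ".join(cmd))
```

Passing the current best commit as `base_ref` instead of `HEAD` would match the branching rule described at the top of this section.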

Evaluation¶

After an agent modifies code in a worktree, the orchestrator:

Commits all changes in the worktree.
Captures the diff.
Runs the eval command (with optional retries and repeats).
Parses the metric from stdout (last numeric line, or structured JSON).
Verifies any constraint checks the evaluator returns.
Applies the accept policy against the current best value.

Accept / Reject / Promote¶

The accept policy decides whether an experiment's metric is good enough:

If accepted and promotion_repeats > 1, the experiment enters candidate status and queues reproduction runs for validation.
If accepted and promotion_repeats == 1, the experiment is immediately accepted and becomes the new best.
If rejected, the experiment is stored with its reason but does not affect the best value.
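
The three rules above can be read as a small decision function. This is a minimal sketch, assuming the policy outcome is already known as a boolean; the function name is hypothetical and the handling of values below 1 is an assumption, since the guide only covers `promotion_repeats == 1` and `promotion_repeats > 1`.

```python
def status_after_policy(policy_accepts: bool, promotion_repeats: int) -> str:
    """Map an accept-policy outcome to the experiment's next status (illustrative only)."""
    if not policy_accepts:
        # Stored with its reason; the current best is unchanged.
        return "rejected"
    if promotion_repeats > 1:
        # Must be confirmed by reproduction runs before becoming the new best.
        return "candidate"
    # promotion_repeats == 1 (values below 1 are treated the same way here, by assumption).
    return "accepted"


print(status_after_policy(True, 3))
print(status_after_policy(True, 1))
print(status_after_policy(False, 3))
```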

You can manually promote, hold, kill, or resume experiments at any time using CLI commands.

Campaign Lifecycle¶

1. Measure baseline
2. While budget remains:
   a. Plan a batch of experiments (strategy mix + promotion queue)
   b. Execute experiments in parallel (up to max_parallel)
   c. Evaluate each result
   d. Accept/reject via policy
   e. Reconcile candidates awaiting validation
   f. Record findings
   g. Checkpoint state to SQLite
3. Print final scoreboard

The campaign stops when any of these conditions is met:

total_max_experiments reached.
total_max_cost_usd exceeded.
total_max_wall_seconds elapsed.
No more experiments can be planned.
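
The stop conditions above could be checked once per loop iteration along these lines. This is a sketch under stated assumptions: the `Budget` class and `stop_reason` function are hypothetical names, the field names and defaults mirror the configuration reference, and the exact comparison operators ("reached" as `>=`, "exceeded" as `>`) are a reading of the wording rather than confirmed behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Budget:
    # Defaults copied from the configuration reference in this guide.
    total_max_experiments: int = 100
    total_max_cost_usd: float = 50.0
    total_max_wall_seconds: float = 28800.0


def stop_reason(budget: Budget, experiments_run: int, cost_usd: float,
                elapsed_seconds: float, planned_batch_size: int) -> Optional[str]:
    """Return a human-readable stop reason, or None if the campaign may continue."""
    if experiments_run >= budget.total_max_experiments:
        return "total_max_experiments reached"
    if cost_usd > budget.total_max_cost_usd:
        return "total_max_cost_usd exceeded"
    if elapsed_seconds >= budget.total_max_wall_seconds:
        return "total_max_wall_seconds elapsed"
    if planned_batch_size == 0:
        return "no more experiments can be planned"
    return None


print(stop_reason(Budget(total_max_experiments=20), 20, 3.5, 600.0, 1))
```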

Experiment State Machine¶

Each experiment transitions through these states:

              +---> candidate --+--> validated --+--> accepted
              |                 |                |
running ------+---> accepted    +--> held -------+--> killed
              |                 |                |
              +---> rejected    +--> resumed ----+
              |                      (back to candidate)
              +---> invalid
              |
              +---> error

State	Meaning
`running`	Agent is executing in the worktree
`candidate`	Passed accept policy, awaiting validation repeats (`promotion_repeats > 1`)
`validated`	A reproduction run confirmed the candidate's result
`accepted`	Fully accepted as the new best (after validation or immediately)
`rejected`	Did not meet the accept policy threshold
`invalid`	Evaluation failed or constraint checks failed
`error`	Agent or infrastructure error before evaluation
`held`	Manually paused by operator via `research hold`
`killed`	Manually terminated by operator via `research kill`

Manual transitions:

promote: candidate or held -> accepted
hold: candidate or held -> held
kill: candidate, held, or killed -> killed
resume: held or killed -> candidate
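
The manual transitions listed above fit naturally into a lookup table. The sketch below is illustrative only: the table and function names are hypothetical, and raising `ValueError` on a disallowed transition is a simplification (the CLI may instead print a message and skip, as it does for already-accepted experiments on promote).

```python
# action -> (statuses the action may be applied to, resulting status)
MANUAL_TRANSITIONS = {
    "promote": ({"candidate", "held"}, "accepted"),
    "hold": ({"candidate", "held"}, "held"),
    "kill": ({"candidate", "held", "killed"}, "killed"),
    "resume": ({"held", "killed"}, "candidate"),
}


def apply_manual_transition(action: str, current_status: str) -> str:
    """Return the new status for an operator action, or raise if it is not allowed."""
    allowed_from, target = MANUAL_TRANSITIONS[action]
    if current_status not in allowed_from:
        raise ValueError(f"cannot {action} an experiment in status {current_status!r}")
    return target


print(apply_manual_transition("hold", "candidate"))
print(apply_manual_transition("resume", "killed"))
```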

Strategy Types¶

The orchestrator uses a configurable mix of strategies. The default strategy_mix is:

{
    "explore": 2,
    "exploit": 1,
    "ablate": 1,
    "compose": 1,
    "reproduce": 1,
}

The numbers are relative weights in a round-robin sequence. With the defaults above, the sequence is: explore, explore, exploit, ablate, compose, reproduce, explore, ...
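
To make the weighting concrete, the following sketch expands a `strategy_mix` into one round-robin cycle and repeats it. The function name is hypothetical and the expansion rule (each strategy repeated `weight` times, in dictionary order) is inferred from the sequence quoted above, not taken from the scheduler's source.

```python
from itertools import cycle, islice


def strategy_cycle(strategy_mix: dict) -> list:
    """Expand relative weights into one round-robin cycle, preserving dict order."""
    return [name for name, weight in strategy_mix.items() for _ in range(weight)]


default_mix = {"explore": 2, "exploit": 1, "ablate": 1, "compose": 1, "reproduce": 1}

# Take the first seven scheduled strategies; the seventh wraps back to "explore".
first_seven = list(islice(cycle(strategy_cycle(default_mix)), 7))
print(first_seven)
```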

explore¶

Generate a fresh hypothesis from scratch. The agent starts from the current best (or HEAD) and tries something new. Best for early diversification when the search space is wide.

exploit¶

Refine the current best experiment. The agent starts from the best branch and tries to improve it further. Use this to squeeze more performance out of a known-good approach.

ablate¶

Remove or simplify one mechanism from the current best and check whether the gain survives. This tests whether each component of a winning approach is actually contributing. The agent receives explicit instructions to remove rather than add.

compose¶

Combine two successful approaches. The orchestrator picks the best experiment and a "partner" (another accepted experiment with minimal file overlap and high metric value), applies the partner's diff to the worktree, and asks the agent to integrate both approaches. Falls back to exploit if no suitable partner exists.
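
One plausible way to pick a compose partner is sketched below. The guide does not specify how file overlap and metric value are traded off, so the ranking used here (fewest overlapping files first, then best metric) is an assumption, as are the `Experiment` fields and the `pick_partner` name. Returning `None` stands in for the fallback to exploit.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Experiment:
    experiment_id: str
    status: str
    metric_value: float
    files_modified: set = field(default_factory=set)


def pick_partner(best: Experiment, experiments: list, maximize: bool = True) -> Optional[Experiment]:
    """Choose another accepted experiment with minimal file overlap, then best metric."""
    candidates = [
        e for e in experiments
        if e.status == "accepted" and e.experiment_id != best.experiment_id
    ]
    if not candidates:
        return None  # caller would fall back to the exploit strategy

    def rank(e: Experiment):
        overlap = len(e.files_modified & best.files_modified)
        quality = -e.metric_value if maximize else e.metric_value
        return (overlap, quality)

    return min(candidates, key=rank)


best = Experiment("e001", "accepted", 0.92, {"src/parser.py"})
others = [
    Experiment("e002", "accepted", 0.90, {"src/parser.py"}),
    Experiment("e003", "accepted", 0.85, {"src/lexer.py"}),
    Experiment("e004", "rejected", 0.99, {"src/other.py"}),
]
partner = pick_partner(best, [best] + others)
print(partner.experiment_id if partner else "fall back to exploit")
```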

reproduce¶

Re-run the evaluation on an existing experiment's commit without agent changes. Used automatically for promotion validation (promotion_repeats > 1) and available manually via research reproduce.

Configuration Reference¶

All fields on ResearchConfig with their defaults:

Field	Default	Description
`metric_name`	`"score"`	Display name for the metric being optimized
`metric_direction`	`"maximize"`	`"maximize"` or `"minimize"`
`experiment_timeout_seconds`	`300.0`	Timeout per experiment (agent execution)
`experiment_max_tokens`	`500_000`	Max tokens per experiment agent call
`experiment_max_cost_usd`	`2.0`	Max cost per single experiment
`total_max_experiments`	`100`	Stop after this many experiments
`total_max_cost_usd`	`50.0`	Stop when total cost exceeds this
`total_max_wall_seconds`	`28800.0`	Stop after this many seconds (default: 8 hours)
`min_improvement_threshold`	`0.0`	Minimum improvement to accept (used by ThresholdPolicy)
`eval_command`	`""`	Shell command that outputs the metric
`eval_repeat`	`1`	How many times to run eval per experiment (results are averaged)
`baseline_repeats`	`1`	How many times to evaluate the initial baseline
`promotion_repeats`	`1`	Reproduction passes required before promoting a candidate to accepted
`target_files`	`[]`	Files the agent should focus on modifying
`use_git_stash`	`true`	Whether to stash uncommitted changes before starting
`model`	`""`	LLM model for agent experiments
`backend`	`"claude"`	Agent backend (`claude`, `codex`, `aider`, `attocode`)
`max_parallel_experiments`	`1`	Max experiments to run concurrently per batch
`search_policy`	`"round_robin"`	Strategy scheduling policy
`experiment_workspace_mode`	`"worktree"`	Isolation mode (currently only `worktree` is supported)
`strategy_mix`	`{"explore": 2, "exploit": 1, "ablate": 1, "compose": 1, "reproduce": 1}`	Relative weights for each strategy in the round-robin
`steering_enabled`	`true`	Whether to apply injected steering notes to hypothesis generation
`preserve_worktrees`	`true`	Keep worktrees for rejected/invalid experiments (useful for debugging)
`working_dir`	`"."`	Repository root
`run_dir`	`".agent/research"`	Directory for run artifacts and the SQLite database

CLI Commands Reference¶

All commands live under attoswarm research. The database is resolved from --db (explicit path) or --run-dir (directory containing research.db). When neither is given, the default is .agent/research/research.db.
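
The database lookup described above can be sketched as a small resolver. The function name is hypothetical, and giving `--db` precedence when both options are supplied is an assumption; the guide only states that either option may be used and what the default is.

```python
from pathlib import Path
from typing import Optional

DEFAULT_RUN_DIR = ".agent/research"


def resolve_db_path(db: Optional[str] = None, run_dir: Optional[str] = None) -> Path:
    """Resolve the experiment database path from --db, --run-dir, or the default."""
    if db:
        # Explicit path (assumed to win if both options are given).
        return Path(db)
    # Otherwise look for research.db inside the run directory.
    return Path(run_dir or DEFAULT_RUN_DIR) / "research.db"


print(resolve_db_path().as_posix())
print(resolve_db_path(run_dir="runs/latency").as_posix())
```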

`research start`¶

Start a new research campaign (or resume an existing one).

attoswarm research start <GOAL> \
  --eval-command <cmd> \
  [--target-files <file>]... \
  [--max-experiments <n>] \
  [--max-parallel <n>] \
  [--experiment-timeout <seconds>] \
  [--metric-direction maximize|minimize] \
  [--metric-name <name>] \
  [--max-cost <usd>] \
  [--baseline-repeats <n>] \
  [--promotion-repeats <n>] \
  [--resume <run-id>] \
  [--config <path>] \
  [--db <path>] \
  [--working-dir <path>]

Flag	Short	Default	Description
`GOAL`		(required)	Natural language goal for the campaign
`--eval-command`	`-e`	(required)	Shell command that outputs the metric
`--target-files`	`-t`	`[]`	Files the agent should modify (repeatable)
`--max-experiments`		`100`	Maximum total experiments
`--max-parallel`		`1`	Experiments per batch
`--experiment-timeout`		`300.0`	Seconds before an experiment is killed
`--metric-direction`		`maximize`	`maximize` or `minimize`
`--metric-name`		`score`	Name of the metric
`--max-cost`		`50.0`	Total cost budget in USD
`--baseline-repeats`		`1`	Eval repeats for the baseline measurement
`--promotion-repeats`		`1`	Reproduction passes before accepting a candidate
`--resume`		`""`	Resume a previous run by ID
`--config`			Path to a swarm YAML config (provides spawn function)
`--db`			Path to experiment database
`--working-dir`	`-w`	`.`	Repository root

Example:

attoswarm research start \
  "Reduce average response latency in the API handler" \
  -e "python bench/latency.py" \
  -t src/api/handler.py \
  --metric-direction minimize \
  --metric-name latency_ms \
  --max-experiments 30 \
  --max-parallel 2 \
  --promotion-repeats 3

`research leaderboard`¶

Show the scoreboard for a research run.

attoswarm research leaderboard --run-id <id> [--db <path>] [--run-dir <path>] [--limit <n>]

Flag	Default	Description
`--run-id`	(required)	Research run ID
`--limit`	`10`	Maximum leaderboard rows

`research inject`¶

Inject a steering note into a running campaign. Active notes are included in hypothesis generation for subsequent experiments.

attoswarm research inject <RUN_ID> <NOTE> \
  [--scope global|strategy|experiment] \
  [--target <target>] \
  [--db <path>] [--run-dir <path>]

Flag	Default	Description
`RUN_ID`	(required)	Research run ID
`NOTE`	(required)	Steering note text
`--scope`	`global`	`global`, `strategy`, or `experiment`
`--target`	`""`	Optional target (strategy name or experiment ID)

Example:

attoswarm research inject abc12345 "Avoid modifying the database schema" --scope global
attoswarm research inject abc12345 "Try caching" --scope strategy --target exploit

`research feed`¶

Unified view: leaderboard + findings + active steering notes.

attoswarm research feed --run-id <id> \
  [--db <path>] [--run-dir <path>] \
  [--leaderboard-limit <n>] \
  [--findings-limit <n>] \
  [--notes-limit <n>]

Flag	Default	Description
`--run-id`	(required)	Research run ID
`--leaderboard-limit`	`5`	Max leaderboard rows
`--findings-limit`	`10`	Max findings to show
`--notes-limit`	`10`	Max steering notes to show

`research monitor`¶

Detailed view: summary + pending candidates + findings + steering notes.

attoswarm research monitor --run-id <id> \
  [--db <path>] [--run-dir <path>] \
  [--candidate-limit <n>] \
  [--findings-limit <n>] \
  [--notes-limit <n>]

Flag	Default	Description
`--run-id`	(required)	Research run ID
`--candidate-limit`	`10`	Max pending candidates to show
`--findings-limit`	`10`	Max findings to show
`--notes-limit`	`10`	Max steering notes to show

`research promote`¶

Manually promote a candidate or held experiment to accepted.

attoswarm research promote <RUN_ID> <EXPERIMENT_ID> \
  [--db <path>] [--run-dir <path>]

Only experiments in candidate or held status can be promoted. Already-accepted experiments are skipped with a message. A finding record is created automatically.

`research hold`¶

Pause a candidate experiment (prevents it from being validated or promoted by the orchestrator).

attoswarm research hold <RUN_ID> <EXPERIMENT_ID> \
  [--reason <text>] \
  [--db <path>] [--run-dir <path>]

Flag	Default	Description
`--reason`	`"held by operator"`	Reason for holding

`research resume`¶

Resume a held or killed experiment back to candidate status.

attoswarm research resume <RUN_ID> <EXPERIMENT_ID> \
  [--db <path>] [--run-dir <path>]

Only experiments in held or killed status can be resumed.

`research kill`¶

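Reject a candidate, held, or killed experiment and take it out of validation and promotion. A killed experiment can still be returned to candidate status with research resume.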

attoswarm research kill <RUN_ID> <EXPERIMENT_ID> \
  [--reason <text>] \
  [--db <path>] [--run-dir <path>]

Flag	Default	Description
`--reason`	`"killed by operator"`	Reason for killing

`research compare`¶

Side-by-side comparison of two experiments.

attoswarm research compare <RUN_ID> <EXPERIMENT_A> <EXPERIMENT_B> \
  [--db <path>] [--run-dir <path>]

Prints each experiment's status, strategy, branch, hypothesis, metric value, and the delta between them (raw and quality-adjusted).

`research reproduce`¶

Manually reproduce an experiment or import a git ref into the campaign.

attoswarm research reproduce <RUN_ID> \
  [--experiment-id <id>] \
  [--ref <git-ref>] \
  [--eval-command <cmd>] \
  [--working-dir <path>] \
  [--db <path>] [--run-dir <path>]

Flag	Default	Description
`--experiment-id`	`""`	Existing experiment to reproduce
`--ref`	`""`	Git ref to import and score
`--eval-command`	`""`	Override eval command (falls back to run config)

Pass exactly one of --experiment-id or --ref. The command creates a worktree at the specified commit, runs the evaluator, and stores the result. If the result improves on the current best, it is accepted.

Example:

# Reproduce an existing experiment
attoswarm research reproduce abc12345 --experiment-id abc12345-e003-beef

# Import a branch from outside the campaign
attoswarm research reproduce abc12345 --ref feature/manual-fix

`research import-patch`¶

Apply a patch file to the campaign and evaluate it.

attoswarm research import-patch <RUN_ID> <PATCH_PATH> \
  [--base-experiment-id <id>] \
  [--base-ref <git-ref>] \
  [--eval-command <cmd>] \
  [--working-dir <path>] \
  [--db <path>] [--run-dir <path>]

Flag	Default	Description
`PATCH_PATH`	(required)	Path to a patch/diff file
`--base-experiment-id`	`""`	Experiment to apply patch on top of
`--base-ref`	`""`	Git ref to apply patch on top of
`--eval-command`	`""`	Override eval command

The patch is applied with git apply --3way. Pass at most one of --base-experiment-id or --base-ref; with neither, the patch is applied on top of the current best. If the patch applies cleanly, the result is evaluated and stored, and it is accepted if it improves on the current best.

Example:

# Import a colleague's patch on top of the current best
attoswarm research import-patch abc12345 fixes/parser-edge-case.patch

# Import on top of a specific experiment
attoswarm research import-patch abc12345 fixes/cleanup.patch \
  --base-experiment-id abc12345-e007-1234

Evaluation¶

The evaluator protocol is a single async method:

async def evaluate(self, working_dir: str) -> EvalResult

Four built-in evaluators are provided.

CommandEvaluator¶

Runs a shell command and parses the metric from stdout. This is the default evaluator used by research start --eval-command.

Metric parsing (in order of priority):

Structured JSON: if stdout is valid JSON (or the last non-empty line is JSON) with a primary_metric or metric key, the full structure is used:

{
  "primary_metric": 0.87,
  "secondary_metrics": {"precision": 0.91, "recall": 0.83},
  "constraint_checks": {"no_regressions": true},
  "artifacts": ["report.html"],
  "seed": 42
}

Last numeric line: the last line of stdout that contains a number. Supports integers, floats, and scientific notation.

Non-zero exit codes produce a failed EvalResult with the stderr captured.
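
The parsing order above can be approximated with the standard library. This is a minimal sketch, not CommandEvaluator's actual code: the function names are hypothetical, only the primary metric is extracted, and taking the last number on the matching line is an assumption (the guide does not say which number wins when a line contains several).

```python
import json
import re
from typing import Optional

# Integers, floats, and scientific notation such as 1.5e-3.
_NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")


def _metric_from_json(text: str) -> Optional[float]:
    """Return primary_metric (or metric) if text is a JSON object carrying one."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    if isinstance(data, dict):
        for key in ("primary_metric", "metric"):
            value = data.get(key)
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                return float(value)
    return None


def parse_metric(stdout: str) -> Optional[float]:
    lines = [line for line in stdout.splitlines() if line.strip()]
    # 1. Structured JSON: the whole output first, then the last non-empty line.
    for candidate in [stdout] + lines[-1:]:
        value = _metric_from_json(candidate)
        if value is not None:
            return value
    # 2. Last line of stdout that contains a number.
    for line in reversed(lines):
        matches = _NUMBER.findall(line)
        if matches:
            return float(matches[-1])
    return None


print(parse_metric('collecting...\n{"primary_metric": 0.87, "seed": 42}'))
print(parse_metric("warming up\nscore: 1.5e-3\ndone"))
```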

ScriptEvaluator¶

Runs a Python script and expects JSON output with a "metric" key. Falls back to structured JSON parsing (same as CommandEvaluator). Useful when the evaluation logic is complex enough to warrant a script.

TestPassRateEvaluator¶

Runs pytest --tb=no -q and returns the pass rate (passed / total) as the metric. Optionally scoped to a specific test path.

evaluator = TestPassRateEvaluator(test_path="tests/unit/")

The metadata includes passed, failed, and total counts.

CompositeEvaluator¶

Weighted average of multiple evaluators. Useful for multi-objective optimization where you want a single scalar metric.

evaluator = CompositeEvaluator([
    (TestPassRateEvaluator("tests/"), 0.6),
    (CommandEvaluator("python bench/speed.py"), 0.4),
])

If some sub-evaluators fail, the composite uses only the successful ones (re-normalizing weights). If all fail, the result is a failure.
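
The re-normalization described above amounts to dividing by the sum of the weights that actually contributed. The sketch below works on already-computed metrics rather than evaluator objects; the function name and the use of `None` to mark a failed sub-evaluator are assumptions for illustration.

```python
from typing import Optional


def composite_metric(results: list) -> Optional[float]:
    """Weighted average over (metric, weight) pairs; a metric of None marks a failure."""
    succeeded = [(metric, weight) for metric, weight in results if metric is not None]
    total_weight = sum(weight for _, weight in succeeded)
    if not succeeded or total_weight == 0:
        return None  # every sub-evaluator failed
    # Dividing by the surviving weight re-normalizes the remaining weights.
    return sum(metric * weight for metric, weight in succeeded) / total_weight


both_ok = composite_metric([(0.8, 0.6), (0.5, 0.4)])
one_failed = composite_metric([(0.8, 0.6), (None, 0.4)])
print(round(both_ok, 4), round(one_failed, 4))
```

With both sub-evaluators succeeding the result is 0.8 * 0.6 + 0.5 * 0.4 = 0.68; when the second one fails, the first evaluator's weight is scaled up to 1 and the composite equals its metric, 0.8.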

Constraint Checks¶

Evaluators can return constraint_checks in their result. These are hard gates: if any constraint fails, the experiment is marked invalid regardless of the metric value.

{
  "primary_metric": 0.95,
  "constraint_checks": {
    "no_new_warnings": true,
    "type_check": {"passed": true},
    "memory_limit": {"passed": false, "detail": "exceeded 512MB"}
  }
}

A constraint fails if its value is false, or if it is a dict with "passed": false.
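
The failure rule can be expressed in a few lines. This is an illustrative sketch with a hypothetical function name; it applies the rule exactly as stated above to the example payload.

```python
def failed_constraints(constraint_checks: dict) -> list:
    """Return the names of constraints that fail under the rule described above."""
    failed = []
    for name, value in constraint_checks.items():
        if value is False:
            failed.append(name)
        elif isinstance(value, dict) and value.get("passed") is False:
            failed.append(name)
    return failed


checks = {
    "no_new_warnings": True,
    "type_check": {"passed": True},
    "memory_limit": {"passed": False, "detail": "exceeded 512MB"},
}
failed = failed_constraints(checks)
# Any failed constraint makes the experiment invalid, whatever its metric.
print("invalid" if failed else "ok", failed)
```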

Accept Policies¶

Accept policies decide whether an experiment's metric is good enough to be accepted. Three built-in policies are available.

NeverRegressPolicy (default)¶

Accept any improvement over the current best; reject any regression or tie. This is the simplest policy: the best value can only move in the desired direction.

Direction maximize: accept if candidate > baseline.
Direction minimize: accept if candidate < baseline.
Ties are rejected.

ThresholdPolicy¶

Accept only if the improvement exceeds a minimum threshold.

policy = ThresholdPolicy(threshold=0.01)

Direction maximize: accept if candidate - baseline > threshold.
Direction minimize: accept if baseline - candidate > threshold.

Useful when small fluctuations in the metric are noise and you only want to accept meaningful improvements.
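
The comparison rules for NeverRegressPolicy and ThresholdPolicy differ only in the threshold, so one sketch covers both. The function names below are hypothetical and this is not the library's implementation; with `threshold=0.0` the check reduces to the strict-improvement rule, which also rejects ties.

```python
def improvement(candidate: float, baseline: float, direction: str) -> float:
    """Signed improvement of candidate over baseline; positive means better."""
    if direction == "maximize":
        return candidate - baseline
    if direction == "minimize":
        return baseline - candidate
    raise ValueError(f"unknown direction: {direction!r}")


def accepts(candidate: float, baseline: float,
            direction: str = "maximize", threshold: float = 0.0) -> bool:
    """Accept only if the improvement strictly exceeds the threshold."""
    return improvement(candidate, baseline, direction) > threshold


print(accepts(0.80, 0.80))                    # tie under the never-regress rule
print(accepts(100.0, 120.0, "minimize"))      # lower latency is an improvement
print(accepts(0.805, 0.80, threshold=0.01))   # gain of 0.005 is below the threshold
```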

StatisticalPolicy¶

Accept based on a z-test against the metric history. Requires at least min_samples (default: 5) historical values before applying the test. Falls back to simple comparison with fewer samples.

policy = StatisticalPolicy(confidence=0.95, min_samples=5)

The z-score is computed against the mean and standard deviation of all accepted metric values. Supported confidence levels: 0.90 (z=1.645), 0.95 (z=1.960), 0.99 (z=2.576).

Use this when your eval command has high variance and you want to avoid accepting lucky outliers.
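
A z-test of this kind might look like the sketch below. Several details are assumptions, since the guide does not spell them out: the test is one-directional in the configured metric direction, the sample standard deviation is used, a zero standard deviation falls back to a plain comparison with the mean, and the function name is hypothetical. The critical values are the ones listed above.

```python
import statistics

Z_CRITICAL = {0.90: 1.645, 0.95: 1.960, 0.99: 2.576}


def statistical_accept(candidate: float, history: list, current_best: float,
                       direction: str = "maximize", confidence: float = 0.95,
                       min_samples: int = 5) -> bool:
    """Accept if the candidate is a significant improvement over the metric history."""
    sign = 1.0 if direction == "maximize" else -1.0
    # Too little history: fall back to a simple comparison with the current best.
    if len(history) < max(min_samples, 2):
        return sign * (candidate - current_best) > 0
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return sign * (candidate - mean) > 0
    z = sign * (candidate - mean) / stdev
    return z > Z_CRITICAL[confidence]


history = [0.80, 0.81, 0.79, 0.80, 0.80]
print(statistical_accept(0.82, history, current_best=0.81))   # z is about 2.83
print(statistical_accept(0.805, history, current_best=0.81))  # z is about 0.71
```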

Tips & Troubleshooting¶

Writing Good Eval Commands¶

The eval command is the most important part of a research campaign. Keep these guidelines in mind:

Deterministic: minimize randomness. If your eval has variance, set eval_repeat to 3 in the config so each experiment's results are averaged, and pass --baseline-repeats 3 so the baseline is measured more than once.
Fast: the eval runs at least once per experiment (more with repeats). A 60-second eval across 50 experiments adds 50 minutes of eval time alone.
Informative: use structured JSON output to capture secondary metrics and constraint checks alongside the primary metric.
Exit code: a non-zero exit code means evaluation failure, not a low metric. Make sure your script exits 0 even when the metric is bad.

Structured JSON Eval Script Template¶

#!/usr/bin/env python
import json
import re
import subprocess
import sys

# Run the test suite quietly; pytest ends with a summary line
# such as "12 passed, 3 failed in 0.41s".
result = subprocess.run(
    ["python", "-m", "pytest", "tests/", "-q", "--tb=no"],
    capture_output=True, text=True
)

# Pull the passed/failed counts out of the summary line.
passed = failed = 0
for line in result.stdout.splitlines():
    m = re.search(r"(\d+) passed", line)
    if m:
        passed = int(m.group(1))
    m = re.search(r"(\d+) failed", line)
    if m:
        failed = int(m.group(1))

total = passed + failed
pass_rate = passed / total if total > 0 else 0.0

# Print the JSON object as the last thing on stdout so it can be parsed.
# pytest exit code 2 means the run was interrupted, e.g. by a collection error.
print(json.dumps({
    "primary_metric": pass_rate,
    "secondary_metrics": {"passed": passed, "failed": failed},
    "constraint_checks": {"no_syntax_errors": result.returncode != 2},
}))
# Always exit 0: a non-zero exit code means evaluation failure, not a low metric.
sys.exit(0)

Resuming a Campaign¶

Pass --resume <run-id> to research start to continue a previous run. The orchestrator loads the checkpoint from SQLite and continues from where it left off, preserving all experiment history and the current best.

Steering a Running Campaign¶

Steering notes are injected into hypothesis generation. Use them to redirect the search without restarting the campaign:

# Focus on a specific approach
attoswarm research inject <run-id> "Try memoization in the hot path"

# Avoid a dead end
attoswarm research inject <run-id> "Do not modify the database layer"

# Scope to a specific strategy
attoswarm research inject <run-id> "Use async IO" --scope strategy --target exploit

Promotion Repeats¶

Set --promotion-repeats 3 to require three successful reproduction runs before accepting a candidate. This guards against lucky eval results. The orchestrator automatically queues reproduce experiments for the top candidate until the required count is reached.

Parallel Experiments¶

Set --max-parallel 4 to run up to four experiments per batch. Each experiment gets its own worktree and agent process. The strategy mix determines which strategies are used in each batch slot.

Common Issues¶

Symptom	Cause	Fix
Baseline evaluation fails	Eval command exits non-zero on current code	Fix the eval script or the code first
All experiments are `invalid`	Eval command parsing fails	Ensure the command prints a number or valid JSON
All experiments are `rejected`	Metric does not improve	Lower the threshold, inject steering notes, or increase experiment count
Campaign stops immediately	Budget too low	Increase `--max-experiments` or `--max-cost`
Worktree creation fails	Dirty working tree	Commit or stash changes, or set `use_git_stash: true`
`No spawn function configured`	Missing `--config`	Pass a swarm YAML config with `--config` to provide the agent backend
Compose strategy falls back to exploit	No second accepted experiment	Normal early in a campaign -- compose needs two accepted experiments with different file sets

Inspecting the Database¶

The SQLite database at .agent/research/research.db contains all experiment data. You can query it directly:

sqlite3 .agent/research/research.db \
  "SELECT experiment_id, status, metric_value, strategy FROM experiments WHERE run_id='<id>' ORDER BY iteration"

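The same query can be run from Python with the standard sqlite3 module. The sketch below reuses only the table and column names that appear in the query above; the helper names are hypothetical, and the in-memory table is a stand-in so the example runs on its own (the real schema may contain more columns). To read a real campaign, connect to .agent/research/research.db instead.

```python
import sqlite3
from collections import Counter


def experiments_for_run(conn: sqlite3.Connection, run_id: str) -> list:
    """Fetch (experiment_id, status, metric_value, strategy) rows in iteration order."""
    cursor = conn.execute(
        "SELECT experiment_id, status, metric_value, strategy "
        "FROM experiments WHERE run_id = ? ORDER BY iteration",
        (run_id,),
    )
    return cursor.fetchall()


def status_counts(rows: list) -> dict:
    """Count experiments per status."""
    return dict(Counter(status for _, status, _, _ in rows))


# Stand-in database with only the columns used by the query above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE experiments (experiment_id TEXT, run_id TEXT, iteration INTEGER, "
    "status TEXT, metric_value REAL, strategy TEXT)"
)
conn.executemany(
    "INSERT INTO experiments VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("abc12345-e002-ef01", "abc12345", 2, "rejected", 0.78, "exploit"),
        ("abc12345-e001-abcd", "abc12345", 1, "accepted", 0.81, "explore"),
        ("zzz99999-e001-0000", "zzz99999", 1, "accepted", 0.50, "explore"),
    ],
)

rows = experiments_for_run(conn, "abc12345")
print(rows[0][0], status_counts(rows))
```
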
Cleaning Up Worktrees¶

If preserve_worktrees is true (the default), worktrees accumulate in .agent/research/experiments/. To clean them up:

# Remove all experiment worktrees
rm -rf .agent/research/experiments/
git worktree prune

Or set preserve_worktrees: false in your config to auto-clean worktrees for non-accepted experiments.
