Hybrid Swarm Operations Guide (attoswarm)¶
This is the practical runbook for attocodepy + attoswarm.
1. Prerequisites¶
- Python env installed (pip install -e ".[dev]")
- attoswarm and attocode commands available
- Worker CLIs available in PATH for the backends you use:
  - claude
  - codex
  - aider (optional)
  - attocode (optional)
If your CLI tools are already authenticated (claude login, codex login), you do not need to export API keys again for swarm runs.
The shipped .attocode/swarm.hybrid.yaml.example uses claude, codex, and
attocode only so doctor passes on a default install more often. If you
prefer aider for judge/review roles, swap that backend explicitly in your
project config.
2. Command Selection¶
Use attocode swarm as the user-facing wrapper. attoswarm is the engine
CLI underneath; both support the same run flows.
Initialize interactively (minimal or demo):
Preflight backend checks:
Scenario Matrix¶
| Scenario | Command | Use this when |
|---|---|---|
| New standalone swarm | attocode swarm start .attocode/swarm.hybrid.yaml "Implement a tiny feature and tests" | Fresh run, fresh goal, no parent swarm lineage |
| New standalone swarm from a goal file | attocode swarm start .attocode/swarm.hybrid.yaml "$(cat tasks/goal.md)" | Your high-level swarm goal lives in a Markdown file |
| New child swarm from previous output | attocode swarm continue .agent/hybrid-swarm/demo-1 --config .attocode/swarm.hybrid.yaml "$(cat tasks/goal-phase2.md)" | Phase 2 / follow-up work should build on a previous swarm branch/output |
| Resume the same run | attoswarm resume .agent/hybrid-swarm/demo-1 | Continue the exact same run directory after stop/interruption |
| Open dashboard only | attocode swarm monitor .agent/hybrid-swarm/demo-1 | Inspect or reattach to an existing run without starting a new one |
| Quick no-config run | attoswarm quick "Implement a tiny feature and tests" | Fast ad hoc swarm without a YAML config |
Start vs Continue vs Resume¶
- start: creates a new standalone run. Use it for a new goal.
- continue: creates a new child run from a previous swarm's preserved branch or result ref. Use it for follow-up or phase-2 work.
- resume: keeps the same run directory and persisted goal. Use it only when you want to continue the exact same swarm.
If you changed the goal text or wrote a new goal-*.md, that is a new swarm.
Use start or continue, not resume.
Goal Files vs --tasks-file¶
There are two different inputs:
- High-level swarm goal file: pass the file contents as the positional goal text.
- Pre-defined decomposition file: use --tasks-file with tasks.yaml, tasks.yml, or tasks.md.
attocode swarm start .attocode/swarm.hybrid.yaml \
--tasks-file tasks/tasks.yaml \
"Implement the planned work"
--tasks-file is not for goal.md or goal-phase2.md. It is for
structured task decomposition files only.
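When launching from a script, the two inputs map onto different parts of the command line. The sketch below only assembles the `attocode swarm start` arguments already shown above. `build_start_command` is a hypothetical helper, not part of attocode, and the file-name check assumes decomposition files are literally named `tasks.yaml`, `tasks.yml`, or `tasks.md`.

```python
from pathlib import Path
from typing import List, Optional

# Assumption: decomposition files use exactly these names.
TASK_FILE_NAMES = {"tasks.yaml", "tasks.yml", "tasks.md"}


def build_start_command(
    config: str,
    goal_text: str,
    tasks_file: Optional[str] = None,
) -> List[str]:
    """Assemble argv for `attocode swarm start` (hypothetical helper)."""
    cmd = ["attocode", "swarm", "start", config]
    if tasks_file is not None:
        # --tasks-file takes a structured decomposition file, never a goal doc.
        if Path(tasks_file).name not in TASK_FILE_NAMES:
            raise ValueError(f"not a task decomposition file: {tasks_file}")
        cmd += ["--tasks-file", tasks_file]
    # The goal is always the positional text, equivalent to "$(cat goal.md)".
    cmd.append(goal_text)
    return cmd
```

The returned list can be passed to `subprocess.run`, with `goal_text` read from the goal file via `Path("tasks/goal.md").read_text()`.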
Monitor, Detach, and Reattach¶
Single launcher (run + monitor in one command):
Start coordinator in background and return immediately:
Open the dashboard later:
Closing the dashboard detaches from the run. It does not stop the coordinator.
Terminal States and Finalization¶
You should interpret the final swarm phase based on what actually happened:
- completed: execution finished normally and there is no pending work left in the saved DAG.
- shutdown: the swarm was intentionally stopped and can be resumed with attoswarm resume <run-dir> if pending tasks remain.
- planning_failed: task decomposition or planning failed before runnable shared-workspace execution could start. This is not the same as a worker-task failure.
When a run ends in shutdown or planning_failed, inspect the run before launching a fresh swarm:
Git finalization from the completion screen or CLI now routes through the same git safety path. Runtime bookkeeping under the swarm run directory is excluded from finalization so merge/keep actions only preserve product-code changes.
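The terminal states above can be turned into a simple follow-up rule. This is a minimal sketch, not a CLI feature: `next_action` is a hypothetical helper, and the caller is assumed to have already read the phase and the number of pending tasks from the run directory, since the exact layout of `swarm.state.json` is not specified here.

```python
def next_action(phase: str, pending_tasks: int, run_dir: str) -> str:
    """Suggest a follow-up for a finished run (illustrative only)."""
    if phase == "completed":
        if pending_tasks > 0:
            # Phase and saved DAG disagree: inspect before trusting the phase.
            return f"inspect {run_dir}/swarm.state.json"
        return "done"
    if phase == "shutdown" and pending_tasks > 0:
        # Intentional stop with work left: same run directory, same goal.
        return f"attoswarm resume {run_dir}"
    # shutdown without pending work, planning_failed, or anything unexpected:
    # look at the run before launching a fresh swarm.
    return f"inspect {run_dir}"
```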
3. What Gets Written (Observability)¶
Run directory layout:
.agent/hybrid-swarm/<run>/
  swarm.manifest.json
  swarm.state.json
  git_safety.json
  index.snapshot.json
  control.jsonl
  agents/
    agent-<id>.inbox.json
    agent-<id>.outbox.json
  tasks/
    task-<id>.json
  logs/
  locks/
  worktrees/
High-value files:
- swarm.state.json: phase, active agents, DAG, budget, merge queue, cursors, attempts.
- swarm.manifest.json: task definitions for resume support (updated when tasks are added dynamically).
- git_safety.json: git branch/stash state for TUI completion screen merge/keep actions.
- control.jsonl: append-only control messages from TUI (approve, reject, skip, retry, add_task, edit_task).
- agents/agent-*.outbox.json: normalized events from worker subprocesses.
- tasks/task-*.json: per-task status, attempts, last error, assignment history.
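Because `control.jsonl` is append-only, a reader may catch it while the last line is still being written. The sketch below shows one tolerant way to parse JSON Lines text. `parse_jsonl` is a hypothetical helper, and it makes no assumption about which fields each control message carries.

```python
import json
from typing import Any, Dict, List


def parse_jsonl(text: str) -> List[Dict[str, Any]]:
    """Parse JSON Lines text, skipping blanks and a truncated final line."""
    records: List[Dict[str, Any]] = []
    lines = text.splitlines()
    for index, line in enumerate(lines):
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            if index == len(lines) - 1:
                # The writer may still be appending the final line.
                break
            # A malformed line in the middle is a real error.
            raise
        if isinstance(record, dict):
            records.append(record)
    return records
```

Typical use would be `parse_jsonl(Path(run_dir, "control.jsonl").read_text())`, re-run whenever the file grows.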
Useful watch commands:
4. Recommended Minimal Configs¶
A. Two Claude workers¶
version: 1
run:
  working_dir: .
  run_dir: .agent/hybrid-swarm/two-cc
  poll_interval_ms: 250
  max_runtime_seconds: 180
roles:
  - role_id: impl
    role_type: worker
    backend: claude
    model: claude-sonnet-4-20250514
    count: 2
    write_access: true
    workspace_mode: worktree
    task_kinds: [implement]
  - role_id: merger
    role_type: merger
    backend: claude
    model: claude-sonnet-4-20250514
    count: 1
    write_access: true
    workspace_mode: worktree
    task_kinds: [merge]
budget:
  max_tokens: 500000
  max_cost_usd: 10
merge:
  authority_role: merger
  judge_roles: []
  quality_threshold: 0.5
watchdog:
  heartbeat_timeout_seconds: 45
retries:
  max_task_attempts: 2
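A config like the one above has a few cross-references that are easy to break when editing by hand, such as `merge.authority_role` pointing at a `role_id`. The following sketch checks them on an already-parsed config dictionary. `check_roles` is a hypothetical helper, and the checks are assumptions about what a consistent config looks like, not a description of what `doctor` verifies.

```python
from typing import Any, Dict, List


def check_roles(config: Dict[str, Any]) -> List[str]:
    """Return human-readable problems found in a parsed swarm config."""
    problems: List[str] = []
    roles = config.get("roles", [])
    role_ids = [role["role_id"] for role in roles if "role_id" in role]

    # Each role_id should be defined once.
    for role_id in sorted(set(role_ids)):
        if role_ids.count(role_id) > 1:
            problems.append(f"duplicate role_id: {role_id}")

    # merge.* entries should refer to roles that exist.
    merge = config.get("merge", {})
    authority = merge.get("authority_role")
    if authority is not None and authority not in role_ids:
        problems.append(f"merge.authority_role not defined: {authority}")
    for judge in merge.get("judge_roles", []):
        if judge not in role_ids:
            problems.append(f"merge.judge_roles entry not defined: {judge}")
    return problems
```

An empty list means these particular cross-references are consistent; it says nothing about backend availability or budget values.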
B. One Claude + one Codex¶
version: 1
run:
  working_dir: .
  run_dir: .agent/hybrid-swarm/cc-codex
  poll_interval_ms: 250
  max_runtime_seconds: 180
roles:
  - role_id: impl
    role_type: worker
    backend: claude
    model: claude-sonnet-4-20250514
    count: 1
    write_access: true
    workspace_mode: worktree
    task_kinds: [implement]
  - role_id: merger
    role_type: merger
    backend: codex
    model: gpt-5.3-codex
    count: 1
    write_access: true
    workspace_mode: worktree
    task_kinds: [merge]
budget:
  max_tokens: 500000
  max_cost_usd: 10
merge:
  authority_role: merger
  judge_roles: []
  quality_threshold: 0.5
Tip: Replace backend: codex with backend: codex-mcp to use the MCP server mode for multi-turn worker support (requires Codex v0.115+).
C. Claude + Codex MCP (multi-turn)¶
version: 1
run:
  working_dir: .
  run_dir: .agent/hybrid-swarm/cc-codex-mcp
  poll_interval_ms: 250
  max_runtime_seconds: 300
roles:
  - role_id: impl
    role_type: worker
    backend: claude
    model: claude-sonnet-4-20250514
    count: 1
    write_access: true
    workspace_mode: worktree
    task_kinds: [implement]
  - role_id: merger
    role_type: merger
    backend: codex-mcp
    model: gpt-5.3-codex
    count: 1
    write_access: true
    workspace_mode: worktree
    task_kinds: [merge]
budget:
  max_tokens: 500000
  max_cost_usd: 10
merge:
  authority_role: merger
  judge_roles: []
  quality_threshold: 0.5
The codex-mcp backend spawns a codex mcp-server process and uses
JSON-RPC over stdio. The first task message creates a new Codex thread;
subsequent messages reuse the thread ID for multi-turn conversation.
This is useful for merger and reviewer roles that need iterative
dialogue with the model.
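The thread-reuse pattern described above can be illustrated without talking to a real server. The sketch below only builds JSON-RPC 2.0 request envelopes and remembers a thread ID from the first reply. The method name `send_message` and the `prompt` and `thread_id` fields are hypothetical placeholders, not the actual `codex mcp-server` schema, and newline-delimited framing is likewise an assumption.

```python
import itertools
import json
from typing import Any, Dict, Optional


class ThreadedSession:
    """Builds JSON-RPC 2.0 requests that reuse one thread ID (illustrative)."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)
        self.thread_id: Optional[str] = None

    def build_request(self, prompt: str) -> str:
        # "send_message", "prompt" and "thread_id" are hypothetical names.
        params: Dict[str, Any] = {"prompt": prompt}
        if self.thread_id is not None:
            # Follow-up turns carry the thread ID from the first reply.
            params["thread_id"] = self.thread_id
        request = {
            "jsonrpc": "2.0",
            "id": next(self._ids),
            "method": "send_message",
            "params": params,
        }
        # Assumed framing: one JSON document per line on stdin.
        return json.dumps(request)

    def record_response(self, raw: str) -> None:
        result = json.loads(raw).get("result", {})
        # Remember the thread ID only once (hypothetical field name).
        if self.thread_id is None and "thread_id" in result:
            self.thread_id = result["thread_id"]
```

The first `build_request` call omits the thread ID, so the server side would start a new thread; every later call on the same session includes it.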
5. Test Matrix You Can Run Today¶
Deterministic local smoke tests (no real model calls):
Opt-in live smoke tests (real CLIs):
Notes:
- Live tests only require ATTO_LIVE_SWARM=1 and backend binaries in PATH.
- They rely on existing CLI authentication state.
6. TUI Operations¶
From the dashboard:
- p: pause/resume
- s: stop swarm (confirmation required)
- r: manual refresh
- i: inject control message into first active agent inbox
- n: add a new task dynamically
- a: approve task plan (when in --preview mode)
- x: reject task plan (when in --preview mode)
- q: quit dashboard / detach
If a run was interrupted or stopped with pending work left, resume it with attoswarm resume <run-dir>.
On completion, a summary screen appears with options:
- [m] Merge: merge the swarm branch into the original branch
- [k] Keep: keep the swarm branch for manual review
- [q] Quit: exit without git changes
For deeper debugging, inspect inbox/outbox files in parallel while TUI is running.
Approval Mode (--preview)¶
Use --preview to review the decomposed task plan before execution starts:
attocode swarm start .attocode/swarm.hybrid.yaml --preview "Implement feature X"
attoswarm quick --preview "Refactor module Y"
The TUI shows the task plan and waits for approval (a) or rejection (x). On resume, a previously-approved run skips the approval gate automatically.
Note: --preview requires --monitor (the TUI). Using --preview --no-monitor automatically falls back to --dry-run since there is no TUI to approve.
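The flag interaction in the note above reduces to a small decision rule. This is a sketch of that rule only: `effective_mode` and the mode labels it returns are illustrative names, not values used by the CLI.

```python
def effective_mode(preview: bool, monitor: bool) -> str:
    """Describe how a run behaves for a given --preview / --monitor combination."""
    if preview and not monitor:
        # No TUI is available to approve the plan, so nothing is executed.
        return "dry-run"
    if preview:
        # The TUI shows the plan and waits for approval (a) or rejection (x).
        return "preview"
    return "run"
```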
Common Mistakes¶
- New goal file + old run dir: use start or continue, not resume.
- goal.md with --tasks-file: wrong input type. Pass goal docs as positional text with $(cat ...).
- tasks.yaml / tasks.md without --tasks-file: the orchestrator will only auto-detect those after they are copied into the run dir.
- Expecting q in the dashboard to stop the swarm: it only detaches. Use s for an explicit stop.
- Expecting --preview --no-monitor to wait for approval: it degrades to --dry-run.
Dynamic Task Addition¶
Press n in the TUI to add a new task during execution. Added tasks:
- Are validated for dependency correctness (no cycles)
- Get code-intel enrichment (if enabled)
- Are persisted to the manifest (survive resume)
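The "no cycles" check on added tasks can be pictured with a standard topological-sort test. The sketch below uses Kahn's algorithm on a plain dictionary of task IDs. It is one common way to do this, not necessarily how the orchestrator implements it, and `has_cycle` and `can_add_task` are hypothetical names.

```python
from collections import deque
from typing import Dict, Iterable, List


def has_cycle(deps: Dict[str, List[str]]) -> bool:
    """deps maps a task ID to the IDs it depends on. True if a cycle exists."""
    # Include tasks that only appear as dependencies.
    nodes = set(deps)
    for targets in deps.values():
        nodes.update(targets)

    # remaining[t] = number of distinct unmet dependencies of t.
    remaining = {node: len(set(deps.get(node, []))) for node in nodes}
    dependents: Dict[str, List[str]] = {node: [] for node in nodes}
    for task, targets in deps.items():
        for target in set(targets):
            dependents[target].append(task)

    ready = deque(node for node, count in remaining.items() if count == 0)
    resolved = 0
    while ready:
        node = ready.popleft()
        resolved += 1
        for dependent in dependents[node]:
            remaining[dependent] -= 1
            if remaining[dependent] == 0:
                ready.append(dependent)

    # Anything left unresolved sits on, or behind, a cycle.
    return resolved != len(nodes)


def can_add_task(
    deps: Dict[str, List[str]], task_id: str, depends_on: Iterable[str]
) -> bool:
    """Check a candidate task against a copy of the graph before accepting it."""
    candidate = dict(deps)
    candidate[task_id] = list(depends_on)
    return not has_cycle(candidate)
```

Validating against a copy keeps the live graph untouched when the new task is rejected.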
Git Safety¶
By default, swarm runs create a dedicated branch (attoswarm/<run-id>) and stash uncommitted changes. Disable with --no-git-safety:
attoswarm quick --no-git-safety "test task"
attocode swarm start config.yaml --no-git-safety "test task"
Git safety state is persisted to git_safety.json in the run directory for the TUI completion screen.
7. Common Failure Modes¶
- Unsupported backend: typo in roles[].backend.
- Immediate failed phase: budget cap too low or max_runtime_seconds too low.
- Task stuck as ready: no matching role (role_hint / task_kinds mismatch).
- Frequent restarts: watchdog timeout too aggressive for chosen backend.
- Run says completed but work is still conceptually unfinished: the swarm only knows the persisted run goal and task graph. Check swarm.state.json and swarm.manifest.json first to see what that run was actually executing.
- Run says completed even though tasks remain pending: inspect swarm.state.json for pending nodes before trusting the phase alone. Treat that as a status bug / edge case, not proof that the product goal is fully satisfied.
8. Recommended First Real Task¶
Use a small, deterministic repository task first:
Then scale to multi-file implementation tasks once telemetry and merge behavior look healthy.
9. Agent Quality Features (v0.2.3+)¶
Shutdown Reason Tracking¶
When a run ends in shutdown, the state file now records why:
Possible values: signal:SIGTERM, signal:SIGINT, control:shutdown, control:reject, approval_timeout, budget_exhausted, unknown.
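The reason strings listed above follow a `category:detail` pattern, with a few bare values. A minimal sketch for handling them in scripts follows. `split_reason` and `normalize_reason` are hypothetical helpers, and how the reason is read out of the state file is left to the caller.

```python
from typing import Tuple

# The values documented for shutdown runs.
KNOWN_REASONS = {
    "signal:SIGTERM",
    "signal:SIGINT",
    "control:shutdown",
    "control:reject",
    "approval_timeout",
    "budget_exhausted",
    "unknown",
}


def normalize_reason(reason: str) -> str:
    """Map anything outside the documented set to 'unknown'."""
    return reason if reason in KNOWN_REASONS else "unknown"


def split_reason(reason: str) -> Tuple[str, str]:
    """Split 'signal:SIGTERM' into ('signal', 'SIGTERM'); bare values get ''."""
    category, _, detail = normalize_reason(reason).partition(":")
    return category, detail
```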
The events file also contains a diagnostic event:
Context Injection¶
Worker agents now receive enriched prompts containing:
- Symbol scope: relevant AST symbols from code-intel impact analysis (up to 15)
- Test map: related test files discovered by naming convention (e.g., test_<module>.py)
- Test command: auto-detected test command (pytest, npm test, cargo test, go test)
- Learning context: patterns and antipatterns from previous runs
Inspect the full prompt an agent received:
Syntax Verification Gate¶
After a worker completes, modified Python and JSON files are parsed to catch syntax errors before the task is marked done. If a file doesn't parse:
- The task is marked failed
- A warning event is emitted with the specific syntax errors
- The task enters the retry pipeline
This runs concurrently with test verification in the result pipeline.
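A parse-only check of this kind needs nothing beyond the standard library. The sketch below shows the general idea for Python and JSON files. `syntax_errors` is a hypothetical helper, and the real gate may report errors in a different shape.

```python
import ast
import json
from pathlib import Path
from typing import Dict, Iterable


def syntax_errors(paths: Iterable[str]) -> Dict[str, str]:
    """Parse .py and .json files; return {path: error message} for failures."""
    errors: Dict[str, str] = {}
    for name in paths:
        path = Path(name)
        try:
            text = path.read_text(encoding="utf-8")
            if path.suffix == ".py":
                # Parses without executing the module.
                ast.parse(text, filename=name)
            elif path.suffix == ".json":
                json.loads(text)
        except (SyntaxError, ValueError) as exc:
            # json.JSONDecodeError and UnicodeDecodeError are ValueError subclasses.
            errors[name] = f"{type(exc).__name__}: {exc}"
    return errors
```

Files with other extensions are read but not parsed, so they never appear in the result unless they fail to decode as UTF-8.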
PID Lockfile¶
A .orchestrator.pid file prevents concurrent orchestrators from corrupting the same run directory. Stale lockfiles are cleaned automatically by ensure_clean_slate on the next run.
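Stale-lock detection usually comes down to asking whether the recorded process still exists. The sketch below shows one POSIX-only way to do that. It assumes the lockfile holds a decimal PID as text, which is not confirmed here, and `live_owner` is a hypothetical helper rather than the logic inside `ensure_clean_slate`.

```python
import os
from pathlib import Path
from typing import Optional


def live_owner(lockfile: str) -> Optional[int]:
    """Return the PID holding the lock, or None if it is absent or stale.

    POSIX only: on Windows, os.kill does not offer a harmless existence check.
    """
    try:
        # Assumption: the file contains just the PID as decimal text.
        pid = int(Path(lockfile).read_text().strip())
    except (FileNotFoundError, ValueError):
        return None
    if pid <= 0:
        # 0 and negative values address process groups, not one process.
        return None
    try:
        os.kill(pid, 0)  # signal 0 checks existence without sending anything
    except ProcessLookupError:
        return None
    except PermissionError:
        # The process exists but belongs to another user.
        return pid
    return pid
```

A `None` result means the lock can be treated as stale; any other value is the process to investigate before starting a second orchestrator on the same run directory.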
Diagnostic Events¶
Every loop exit now emits a diagnostic event explaining why execution stopped:
- No ready tasks (with pending task list)
- Preflight blocked all tasks
- Budget gate blocked all tasks
- Batch safety bound reached
- Shutdown requested (with reason)