
autoresearch

An infinite, multi-session improvement loop for any research artifact. One managing agent runs the whole loop: it makes a focused change, judges it against the current best, keeps the winner and reverts the rest. Its long-term memory is a per-loop wiki that it consults before every move and updates after every move. The loop never stops on its own.


Core rule: NEVER STOP

The loop runs until the human interrupts it. Period.

  • Never decide the artifact is "good enough" and exit
  • Never stop because the score plateaued or iterations look repetitive
  • Plateau is a signal to change direction (new angle, branch, or literature search), not to stop
  • Open questions to the human are non-blocking: park them, keep going on a best guess
  • The only valid exit is the human pressing Ctrl-C or typing a stop command

Architecture: one agent, one brain

Two principles define this loop. Hold both.

1. One managing agent, no subagent fan-out. A single agent runs the entire loop and holds the thread of all iterations. It plays worker (makes the change) and evaluator (judges it) itself. The only optional exception is the evaluation step, which can use one fresh subagent when evaluation: fresh-eval is set (see Evaluation). Everything else is one agent, because context continuity across iterations is what lets it reason about the whole search instead of one move at a time.

2. The wiki is the brain. A single agent running an infinite loop will exhaust its context window. The per-loop wiki is the externalized memory that survives that: context is working memory, the wiki is long-term memory. This is not optional decoration. The agent reads the wiki before deciding every move and writes to it after every move. Without the wiki the agent is amnesiac: it re-treads dead ends, forgets why something failed, and goes in circles forever instead of getting smarter. The wiki is what makes an infinite single-agent loop compound rather than wander.

Treat the wiki the way you treat your own memory: you would never re-run an experiment you already know failed. Neither should the loop. Query the wiki first, always.


Folder structure

The loop folder is named {name}_autoresearch/ and lives next to the artifact being improved, not inside .neuroflow/ by default. Only a small pointer registry lives in project memory.

{location}/{name}_autoresearch/        ← e.g. scripts/analysis/connectivity_autoresearch/
├── wiki/                  ← THE BRAIN: read before every move, written after every move
│   ├── index.md           ← catalog of all pages
│   ├── log.md             ← append-only: ## [iter NNN] {op} | {title}
│   ├── schema.md          ← this loop's domain, criteria, conventions
│   └── pages/
│       ├── attempts/      ← one page per meaningful direction: what, why, verdict, delta, reasoning (wins AND dead-ends)
│       ├── concepts/      ← domain knowledge about the artifact and each criterion
│       ├── sources/       ← distilled findings from literature search
│       └── synthesis/     ← patterns: "what consistently works / fails here", the current thesis
├── program.md             ← task + criteria + config block (read every iteration)
├── __thetask__.md         ← pointer manifest: which external files are tracked
├── results.md             ← iteration table (numbers) → dashboard source
├── report.md              ← human-readable report: open questions on top, refreshed each round
├── report.pdf             ← optional read-only snapshot (pandoc)
├── answers.md             ← human answer inbox (detached mode)
├── server.py              ← optional dashboard (only written if output_dashboard: on)
├── flow.md
└── history/
    ├── v000/              ← baseline snapshot of tracked files
    ├── v001/              ← snapshot saved on each KEPT iteration
    └── ...

.neuroflow/{phase}/autoresearch-loops.md   ← POINTER REGISTRY ONLY (in project memory)

Naming: {name} defaults to a slug derived from the primary tracked file (connectivity.py → connectivity), always overridable at setup. Multiple loops can coexist, e.g. intro_autoresearch/ and methods_autoresearch/ both under manuscript/.

Location: defaults to the directory of the primary tracked file. Always overridable (the user can put it in .neuroflow/, a sibling folder, anywhere). Everything (wiki, history, reports) travels with the artifact.
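
The naming and location defaults can be sketched in a few lines. This is a minimal sketch, not the skill's own code: the function name default_loop_folder and the exact slug rule (lowercase stem, non-alphanumerics replaced by underscores) are assumptions, since the text only says the name is "a slug derived from the primary tracked file".

```python
from pathlib import Path
from typing import Optional


def default_loop_folder(primary_file: str,
                        name: Optional[str] = None,
                        location: Optional[str] = None) -> Path:
    """Return {location}/{name}_autoresearch for a primary tracked file."""
    primary = Path(primary_file)
    # Assumed slug rule: lowercase file stem, anything non-alphanumeric -> "_".
    slug = name or "".join(c if c.isalnum() else "_" for c in primary.stem.lower())
    # Default location is the directory of the primary tracked file.
    base = Path(location) if location else primary.parent
    return base / f"{slug}_autoresearch"


print(default_loop_folder("scripts/analysis/connectivity.py").as_posix())
# prints scripts/analysis/connectivity_autoresearch
```

Both name and location stay overridable, matching the setup question that shows the proposed folder and asks for confirmation.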

Pointer registry (.neuroflow/{phase}/autoresearch-loops.md) keeps project memory aware of every loop without holding the loop itself:

# Autoresearch loops: {phase}

| Name | Location | Iterations | Best | Status |
|------|----------|-----------|------|--------|
| connectivity | scripts/analysis/connectivity_autoresearch/ | 47 | v031 | running |
| intro | manuscript/intro_autoresearch/ | 12 | v009 | paused |

The wiki: the agent's brain

The loop wiki follows the neuroflow:wiki page format (frontmatter, index.md, log.md, wikilinks) but is scoped to this one loop and lives inside the loop folder. It is the fourth wiki level: local and disposable, with durable findings promoted up to the project wiki at loop end.

Page format

Every page in pages/ uses this frontmatter:

---
title: Citation density in Discussion plateaus after 3 additions
type: attempt            # attempt | concept | source | synthesis
iter: 042                # iteration this page was created / last touched
criterion: claim-support # which program.md criterion it relates to (if any)
verdict: WORSE           # attempt pages only: BETTER | WORSE | NO CHANGE
delta: -1                # attempt pages only
status: current          # current | superseded
created: YYYY-MM-DD
updated: YYYY-MM-DD
related: []              # file paths; in-body refs use [[Page Title]]
---

Wikilinks are mandatory for all in-body cross-references: [[Page Title]], never plain Markdown links. This is what makes the brain navigable.
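
The wikilink rule is easy to check mechanically. The sketch below is an illustration only: the helper name check_links and both regular expressions are assumptions, not part of neuroflow:wiki. It collects [[Page Title]] targets and flags plain Markdown links that should have been wikilinks.

```python
import re

# [[Page Title]] -> captures "Page Title"
WIKILINK = re.compile(r"\[\[([^\[\]]+)\]\]")
# [text](target), excluding images such as ![alt](path)
MD_LINK = re.compile(r"(?<!!)\[[^\[\]]+\]\([^)]+\)")


def check_links(body: str):
    """Return (wikilink targets, plain Markdown links) found in a page body."""
    return WIKILINK.findall(body), MD_LINK.findall(body)


targets, violations = check_links(
    "See [[Citation density]] and [old](pages/old.md)."
)
print(targets)     # ['Citation density']
print(violations)  # ['[old](pages/old.md)']
```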

Page types

| Type | Folder | What it holds |
|------|--------|---------------|
| attempt | pages/attempts/ | One page per meaningful direction tried. Records what changed, why it was tried, verdict, delta, and the reasoning for the outcome. Failures are the most valuable pages because they prune the search space. |
| concept | pages/concepts/ | Knowledge about the artifact and each criterion: what "good" looks like here, constraints, domain facts learned along the way. |
| source | pages/sources/ | One page per paper found via literature search, with distilled claims and how they apply to this artifact. Keeps the bulk out of context. |
| synthesis | pages/synthesis/ | Patterns across attempts: "every citation-density change plateaus", "the weakest criterion is consistently X", the loop's evolving thesis on how to improve this artifact. |

How the agent works the wiki (every iteration)

RECALL (before deciding a move), mandatory:

  1. Read wiki/index.md (the full map)
  2. Read the synthesis/ pages (the current thesis on what works and fails here)
  3. Read attempts/ pages relevant to the criterion being targeted: "have I tried anything near this? did it fail? why?"
  4. Read relevant concepts/ and sources/ pages if the move touches them

RECORD (after the move is judged), mandatory:

  1. Write a new attempts/ page: what changed, why, verdict, delta, and the reasoning (especially for failures)
  2. If a pattern emerged (e.g. third plateau on the same axis), create or update a synthesis/ page
  3. If a human answer resolved an assumption, capture the decision in a concepts/ or synthesis/ page
  4. Update index.md (add/update the row) and append to log.md (## [iter NNN] {op} | {title})

This read-then-write discipline is the loop's intelligence. Skipping RECALL makes the agent re-propose failed moves. Skipping RECORD makes the next iteration blind. Neither is ever skipped.
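
The RECORD half of that discipline can be sketched as a small helper. This is a minimal sketch under stated assumptions: the function name record_attempt, the page file-name scheme, the index.md row layout, and the op label "attempt" in log.md are all hypothetical. Only the frontmatter fields and the log heading format come from the text above.

```python
from datetime import date
from pathlib import Path


def record_attempt(wiki: Path, iteration: int, title: str, criterion: str,
                   verdict: str, delta: int, reasoning: str) -> Path:
    """Write an attempts/ page, then update index.md and log.md."""
    today = date.today().isoformat()
    # Assumed file-name scheme: zero-padded iteration + dash-joined title words.
    slug = "-".join(title.lower().split())
    page = wiki / "pages" / "attempts" / f"{iteration:03d}-{slug}.md"
    page.parent.mkdir(parents=True, exist_ok=True)
    page.write_text(
        "---\n"
        f"title: {title}\n"
        "type: attempt\n"
        f"iter: {iteration:03d}\n"
        f"criterion: {criterion}\n"
        f"verdict: {verdict}\n"
        f"delta: {delta:+d}\n"
        "status: current\n"
        f"created: {today}\n"
        f"updated: {today}\n"
        "related: []\n"
        "---\n\n"
        f"{reasoning}\n",
        encoding="utf-8",
    )
    # Assumed index row layout: one wikilink bullet per page.
    with (wiki / "index.md").open("a", encoding="utf-8") as index:
        index.write(f"- [[{title}]] (attempt, {verdict})\n")
    # Log heading format from the folder structure: ## [iter NNN] {op} | {title}
    with (wiki / "log.md").open("a", encoding="utf-8") as log:
        log.write(f"## [iter {iteration:03d}] attempt | {title}\n")
    return page
```

Writing the page, the index row, and the log line in one call makes it harder to complete one and forget the others.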

Promotion to the project wiki

Per promote_to_project_wiki in the config:

  • ask (default): at loop end / interruption, surface durable findings and ask which to promote
  • on: promote durable findings automatically
  • off: keep everything local

A "durable finding" is a synthesis/ page or a confirmed concept that generalizes beyond this artifact (e.g. "averaging EEG reference before ICA consistently improves component separability"). Promote via neuroflow:wiki ingest into .neuroflow/wiki/. Micro-experiment attempts/ pages stay local because they would only clutter the project wiki.


program.md: task, criteria, and config

Read at the top of every iteration. Holds the task, the criteria (three layers; see Criteria initialization), and the machine-followable config block.

# Autoresearch Program: {name} ({phase})
Started: YYYY-MM-DD

## Task
{one sentence: what is being improved and why}

## Tracked files
{listed from __thetask__.md for reference}

## Default criteria (phase: {phase})
{phase-specific criteria; see references/phase-criteria.md}

## User criteria
<!-- user additions, e.g. "Target Nature Neuroscience", "keep under 500 words" -->

## Improvement direction
{what "better" looks like β€” the guiding instruction each iteration}

## Out of scope
{what must NOT change between iterations}

## Loop configuration
loop_name: connectivity
artifact_location: scripts/analysis/connectivity_autoresearch/
promote_to_project_wiki: ask          # on | off | ask
branching: agent-decided              # off | agent-decided
max_alive_branches: 3                 # cost cap when branching
parameter_sweep: true                 # when a move tunes a scannable parameter, scan several values in ONE iteration and pick the best
literature_search: when-stuck         # off | when-stuck | agent-decided
literature_sources: pubmed, biorxiv   # MCP sources to query
literature_budget: 1 per 5 iterations # rate cap
evaluation: self                      # self | fresh-eval
output_dashboard: off                 # on | off
output_report_md: on                  # on | off
output_report_pdf: off                # on | off
report_cadence: every-round           # every-round | every-N
answer_channel: both                  # session | inbox | both
notify_on_plateau: true

## Iteration checklist: DO ALL, EVERY TIME, NEVER SKIP
<!-- This block is the contract. It is re-read at the start of every iteration so it
     can never drift out of context. Skipping ANY item is a loop failure. -->
1. RECALL: read this program.md (incl. this checklist), __thetask__.md, the wiki (index → synthesis → relevant attempts), and check answers.md + session for new answers
2. DECIDE: pick the weakest criterion and ONE move, informed by the wiki (never re-try a move the wiki shows failed)
3. SWEEP: if the move tunes a scannable parameter and parameter_sweep is on, scan several values this iteration and pick the best
4. ACT: make the change
5. JUDGE: compare to history/vBEST/ against the criteria → BETTER | WORSE | NO CHANGE + delta
6. KEEP/REVERT: snapshot to history/vNNN/ on BETTER, else restore from vBEST/; append a row to results.md
7. WIKI: write an attempts/ page (what, why, verdict, delta, reasoning, especially failures); update synthesis/ on a pattern; update index.md + log.md
8. REPORT: rewrite report.md (open questions on top); update the pointer registry; regenerate PDF/dashboard per cadence
9. Items 7 and 8 are NOT optional and are NOT once-at-baseline: they run every single iteration. If you ever notice you skipped one, do it now before the next move.

The agent reads this config AND the iteration checklist at the start of every iteration and honors both exactly: check literature_budget before searching, respect branching / max_alive_branches / parameter_sweep, use the configured evaluation mode, refresh outputs per report_cadence, and complete every checklist item including the wiki write and report refresh.
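
Because the config block is plain `key: value  # comment` lines, reading it back is straightforward. The sketch below is illustrative only: parse_loop_config and literature_search_allowed are hypothetical helpers, and reading "1 per 5 iterations" as "at most N searches for every M iterations elapsed" is an assumed interpretation of the rate cap.

```python
from typing import Dict


def parse_loop_config(program_md: str) -> Dict[str, str]:
    """Collect key/value lines under the '## Loop configuration' heading."""
    config: Dict[str, str] = {}
    in_block = False
    for line in program_md.splitlines():
        if line.startswith("## "):
            # Any other level-2 heading ends the config block.
            in_block = line.strip() == "## Loop configuration"
            continue
        if in_block and ":" in line:
            key, _, value = line.partition(":")
            # Drop the trailing "# ..." comment that documents allowed values.
            config[key.strip()] = value.split("#", 1)[0].strip()
    return config


def literature_search_allowed(config: Dict[str, str], iteration: int,
                              searches_done: int) -> bool:
    """Check the budget, assuming the form 'N per M iterations'."""
    if config.get("literature_search", "off") == "off":
        return False
    n, _, rest = config["literature_budget"].partition(" per ")
    per_window, window = int(n), int(rest.split()[0])
    # Assumed reading: N searches allowed per started window of M iterations.
    return searches_done < per_window * (iteration // window + 1)
```

Values are kept as strings; a caller would convert max_alive_branches or booleans such as parameter_sweep as needed.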


__thetask__.md: tracked-file manifest

# Task Manifest

> EVERY ITERATION: follow the "## Iteration checklist" in program.md in full,
> including the wiki write (step 7) and report.md refresh (step 8). Never skip them.

## Tracked files
- `../connectivity.py`
- `../helpers/graph_metrics.py`

## Task description
Improve the connectivity analysis until it is reproducible and statistically sound.

## Current best snapshot
history/v031/

## Iterations run
47 (last: YYYY-MM-DD)

Paths are relative to the loop folder. The agent modifies the real files; the evaluator compares current state to history/vBEST/.
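
Resolving the manifest might look like the sketch below. It assumes the manifest layout shown above (a "## Tracked files" section with one backticked relative path per bullet); the helper name tracked_files is hypothetical.

```python
from pathlib import Path
from typing import List


def tracked_files(loop_folder: Path) -> List[Path]:
    """Resolve the '## Tracked files' bullets relative to the loop folder."""
    manifest = (loop_folder / "__thetask__.md").read_text(encoding="utf-8")
    paths: List[Path] = []
    in_section = False
    for line in manifest.splitlines():
        if line.startswith("## "):
            in_section = line.strip() == "## Tracked files"
        elif in_section and line.lstrip().startswith("- "):
            # Strip the bullet marker and the surrounding backticks.
            rel = line.lstrip()[2:].strip().strip("`")
            paths.append((loop_folder / rel).resolve())
    return paths
```

Resolving against the loop folder rather than the current working directory keeps the manifest valid when the loop folder travels with the artifact.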


INIT: setup interview (first run only)

HARD GATE: the loop must NOT begin until the user has explicitly signed off on the full config block. Never set silent defaults and jump into iterations. Every configuration option below is asked one at a time, not assumed: branching, parameter sweep, literature search (+ sources + budget), evaluation mode, outputs (dashboard / report.md / PDF) + cadence, answer channel, and wiki promotion. If the user gives a partial answer, ask the rest; if they say "use defaults", still show the resulting config block and get an explicit "yes" before iterating. Starting iterations with any unasked option is the failure mode this gate exists to prevent.

  1. Read project_config.md → determine active phase
  2. Which files should this loop improve? (or infer from --target)
  3. Name and location: derive a default name from the primary tracked file and a default location = that file's directory. Show both: "Loop folder: scripts/analysis/connectivity_autoresearch/. OK, or change name/location?"
  4. Build criteria: Layer 1 (phase defaults from references/phase-criteria.md) + Layer 2 (context-inferred) + Layer 3 (user input) → program.md
  5. Loop configuration interview: go slowly, ONE question at a time. Ask each option as a separate message (or a clearly numbered walk-through), state the default and the trade-off, wait for the answer, then move to the next. Do NOT batch all options into one wall of text and do NOT rush to the loop; a hurried interview is exactly the failure this step guards against. Record each answer into the config block:
     • Branching: "When you see two equally promising directions, may I try both and keep the winner? (agent-decided / single-track)" → if agent-decided, "max directions to keep open at once?"
     • Parameter sweep (default yes): "When a move tunes a parameter that makes sense to scan over a range (a threshold, filter cutoff, number of components, regularization strength), may I scan several values within a single iteration and pick the best, instead of one value per iteration? (yes / no)"
     • Literature search: "May I search papers when I run out of ideas or want grounding? (when-stuck / anytime / off)" → sources? → budget (e.g. 1 per 5 iterations)?
     • Evaluation: "Should I judge my own changes (faster, full context) or have a fresh independent check each time (slower, unbiased)? (self / fresh-eval)"
     • Outputs: "Live dashboard server? Human report.md (default on)? Also a PDF snapshot?" → cadence?
     • Answers: "Answer my questions in this session, via an answers.md inbox, or both?"
     • Wiki promotion: "At loop end, promote durable findings to the project wiki? (ask / auto / off)"
  6. Confirm the full config (the gate). Render the complete ## Loop configuration block back to the user with every value filled in, and ask for an explicit go-ahead: "This is the full configuration. Confirm to start the loop, or tell me what to change." Do not proceed to step 7 until the user confirms. No iteration runs before this sign-off.
  7. Create the loop folder at the chosen location; initialize wiki/ (index.md, log.md, schema.md, pages/ subfolders) and write a starter schema.md describing the artifact, the criteria, and the wikilink convention
  8. Snapshot tracked files → history/v000/; write baseline row to results.md
  9. Write program.md (with the confirmed config block and the "## Iteration checklist" block; both are mandatory), __thetask__.md (with the iteration reminder at top), flow.md
  10. Add a row to .neuroflow/{phase}/autoresearch-loops.md (create the registry if absent)
  11. If output_dashboard: on, write server.py from scripts/server.py in this skill and tell the user the URL
  12. Write the first report.md
  13. Start the loop

Loop protocol: NEVER STOP

REPEAT FOREVER until the human interrupts:

  RECALL
    a. Read program.md, INCLUDING its "## Iteration checklist", plus __thetask__.md (resolve tracked paths).
       The checklist is the contract for this iteration; follow every item, never skip the wiki write or report refresh.
    b. Read tracked files (current state) + history/vBEST/ (current best)
    c. Read the wiki: index.md → synthesis/ → attempts/ for the target criterion → relevant concepts/sources
    d. Check answers.md and the session for new human answers (match Q-ids; see Q&A channel)

  DECIDE
    e. Pick the single weakest criterion and ONE focused move to improve it,
       informed by the wiki. Do NOT re-propose a move the wiki shows already failed.
    f. If out of fresh ideas OR the wiki shows the obvious moves are exhausted:
         - If literature_search allows and budget permits → search papers (MCP tools),
           distill into wiki/pages/sources/, synthesize a new direction, record it.
    g. If branching is enabled and two directions look equally promising:
         - Try one this iteration; note the fork so the other is tried next from the SAME vBEST.
           Keep at most max_alive_branches forks open; prune losers once a winner emerges.

  ACT
    h. Make ONE surgical change to the tracked files. Not a rewrite β€” one move.
       PARAMETER SWEEP: if parameter_sweep is on AND the move is tuning a parameter with a
       sensible range of values (threshold, filter cutoff, n_components, regularization,
       k folds, window length, learning rate, …), scan several values WITHIN THIS ONE
       iteration: try each, measure each against the criteria, and pick the best value to
       apply. The scan is internal scratch β€” only the chosen value is written to the tracked
       files. Record the swept values and the choice in one wiki attempts/ page (the curve).
       A sweep is one axis × many values; branching (g) is many competing directions. Don't conflate them.

  JUDGE  (self, or one fresh subagent if evaluation: fresh-eval)
    i. Compare current tracked files to history/vBEST/ against the criteria.
       Return: VERDICT (BETTER | WORSE | NO CHANGE), Delta (-5..+5),
               per-criterion notes, numeric values if applicable,
               and the single weakest area to target next.
       If self-evaluating: judge it COLD and be skeptical of your own change.

  KEEP / REVERT
    j. If BETTER: snapshot tracked files → history/vNNN/; update __thetask__.md
                  (iterations, best snapshot); append KEPT row to results.md.
       If WORSE / NO CHANGE: restore tracked files from history/vBEST/; append REVERTED row.

  RECORD  (the brain: mandatory, EVERY round, no exceptions)
    k. Write an attempts/ page (what, why, verdict, delta, reasoning, especially for failures).
       Update synthesis/ if a pattern emerged. Update index.md + log.md.
    l. Refresh ALL THREE every round: the wiki (k above), results.md, AND report.md
       (open questions on top). report.md is not write-once-at-baseline: it is rewritten
       each iteration so the human's live view and open-questions list stay current.
       Update the pointer registry. Regenerate report.pdf / dashboard data per cadence.

  STEER
    m. Plateau (5 consecutive REVERTs): if notify_on_plateau, note it in report.md and the session,
       then CHANGE APPROACH: new angle from the wiki, a branch, or a literature search. DO NOT STOP.

  n. Go to RECALL. Never stop on your own.
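
The KEEP / REVERT and STEER steps can be sketched as follows. This is a sketch under assumptions, not the skill's implementation: the helper names snapshot, keep_or_revert, and is_plateau are hypothetical, and storing snapshots flat by file name is an assumed layout (the source only says snapshots live in history/vNNN/).

```python
import shutil
from pathlib import Path
from typing import List


def snapshot(tracked: List[Path], dest: Path) -> None:
    """Copy tracked files into a history/vNNN/ folder."""
    dest.mkdir(parents=True, exist_ok=True)
    for f in tracked:
        # Assumed layout: flat copy by file name (name collisions not handled).
        shutil.copy2(f, dest / f.name)


def keep_or_revert(history: Path, tracked: List[Path], verdict: str,
                   best: str, iteration: int) -> str:
    """Apply the verdict and return the name of the current best snapshot."""
    if verdict == "BETTER":
        new_best = f"v{iteration:03d}"
        snapshot(tracked, history / new_best)
        return new_best
    # WORSE or NO CHANGE: restore every tracked file from vBEST.
    for f in tracked:
        shutil.copy2(history / best / f.name, f)
    return best


def is_plateau(decisions: List[str], window: int = 5) -> bool:
    """True when the last `window` decisions are all REVERTED."""
    recent = decisions[-window:]
    return len(recent) == window and all(d == "REVERTED" for d in recent)
```

A plateau only changes the search direction; nothing in this sketch ends the loop.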

Evaluation

| Mode | Behaviour | Trade-off |
|------|-----------|-----------|
| self (default) | The managing agent judges its own change cold against vBEST + criteria | Keeps full context, faster; instruct it to be skeptical of its own work; the wiki catches "you rejected this before" |
| fresh-eval | One fresh general-purpose subagent judges the change with no loop context | Independent, unbiased; the only place a subagent is spawned; slower |

The bias risk of self is real: an agent grading its own work tends to like it. Mitigations: judge against the explicit vBEST snapshot and named criteria, and let the wiki hold it honest. Choose fresh-eval when evaluation rigor matters more than speed.


Outputs

Each surface has one job. All optional except report.md.

| File | Audience | Job |
|------|----------|-----|
| results.md | dashboard | numeric iteration table (verdict, delta, running) |
| report.md | human | narrative + open questions: the steering surface |
| report.pdf | human | optional read-only snapshot (pandoc report.md -o report.pdf) |
| server.py | human | optional live dashboard at localhost:8765; renders both the report (open questions pinned at top + narrative) and the numeric trend charts on one page; template in scripts/server.py |
| wiki/ | agent | the brain |

The dashboard is the one-stop web view: it reads report.md and results.md on every request, so a glance shows the quality curve and the open questions awaiting an answer. Use ?watch=1 for auto-refresh.

results.md format

# Autoresearch Results: {name}
Started: YYYY-MM-DD HH:MM

| # | Verdict | Δ | Running | Decision | Next focus |
|---|---------|---|---------|----------|------------|
| 000 | - | 0 | 0 | KEPT (baseline) | - |
| 001 | BETTER | +3 | 3 | KEPT | Intro–methods transition |
| 002 | WORSE | -1 | 3 | REVERTED | Overcomplicated methods |

Running: KEPT adds delta; REVERTED leaves it unchanged. Append numeric columns (power, R², word_count…) after Next focus for phases with numeric criteria.
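
The Running column rule can be written out directly. A minimal sketch: the function name running_totals and the (delta, decision) tuple input are illustrative choices, not part of the skill.

```python
from typing import List, Tuple


def running_totals(rows: List[Tuple[int, str]]) -> List[int]:
    """rows are (delta, decision) pairs in iteration order."""
    total = 0
    out: List[int] = []
    for delta, decision in rows:
        # Only KEPT rows (including "KEPT (baseline)") move the running total.
        if decision.startswith("KEPT"):
            total += delta
        out.append(total)
    return out


rows = [(0, "KEPT (baseline)"), (3, "KEPT"), (-1, "REVERTED")]
print(running_totals(rows))  # [0, 3, 3]
```

The output reproduces the Running column of the sample table: the reverted -1 leaves the total at 3.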

report.md format: human steering surface

Open questions lead the file. Answered questions are deleted from the report (their resolution goes to the wiki, not an archive section here).

# Autoresearch Report: {name}
Iteration 47 · Best: v031 · Running quality: +18 · Updated HH:MM

## Open questions for you
- **Q7**: about to delete the third control analysis (~200 lines, hard to reconstruct). Confirm? (iter 46)
- **Q3**: Target Nature Neuro or eLife? Affects how aggressively I trim. (iter 40)

## This round
Tried tightening the methods reproducibility statement. Verdict BETTER (+2), kept as v031.

## Current direction
Citation density in the Discussion is the weakest criterion; working that next.

Q&A channel

The loop asks the human questions without ever stopping.

  • All open questions sit in the top section of report.md. Short, live list.
  • Persistent, stable ids. A question keeps its id forever (Q3 stays Q3, never renumbered) so "answer 3) ..." always maps to the right one.
  • Non-blocking. The agent asks, makes its best-guess move, keeps going. It revisits when the answer arrives, and because every state is a history/ snapshot, it can re-branch from an earlier vBEST if the human steers it elsewhere.
  • Answering: depending on answer_channel:
     • session: human types A3: eLife or answer 3) eLife → agent reads it at the next RECALL
     • inbox: human writes the same into answers.md → agent reads it each RECALL
     • both: either works
  • On resolution: the agent acts on the answer, deletes the question from report.md, and records the decision + what it did in the wiki (a concepts/ or synthesis/ page). Knowledge survives; the report stays clean.

A costly/irreversible move (large deletion, expensive recompute) should be raised as a question before doing it, placed at the top of the open-questions list. The loop still doesn't block (it proceeds on its best guess and the snapshot makes it reversible), but the human sees it first.
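
Matching answers to stable question ids could be sketched like this. The two accepted forms (A3: eLife and answer 3) eLife) come from the text above; the regular expression, the helper name parse_answers, and the choice to ignore non-matching lines are assumptions.

```python
import re
from typing import Dict

# Accepts "A3: eLife" or "answer 3) eLife", one answer per line.
ANSWER = re.compile(
    r"^[ \t]*(?:A(\d+)[ \t]*:|answer[ \t]+(\d+)\))[ \t]*(.+?)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def parse_answers(text: str) -> Dict[str, str]:
    """Map question ids (e.g. 'Q3') to the human's answer text."""
    answers: Dict[str, str] = {}
    for match in ANSWER.finditer(text):
        qid = int(match.group(1) or match.group(2))
        answers[f"Q{qid}"] = match.group(3)
    return answers


inbox = "A3: eLife\nsome unrelated note\nanswer 7) yes, delete it\n"
print(parse_answers(inbox))  # {'Q3': 'eLife', 'Q7': 'yes, delete it'}
```

The same function works for both channels, since the session and the answers.md inbox use the same answer syntax.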


Session logging & registry

Append to .neuroflow/sessions/YYYY-MM-DD.md:

  • Loop start: ## HH:MM - [autoresearch/{name}] started - tracking {N} file(s) at {location}
  • Every 10 iterations: ## HH:MM - [autoresearch/{name}] iter {N} - running {R} - best {snapshot}
  • Plateau: ## HH:MM - [autoresearch/{name}] PLATEAU - changing approach
  • Interrupt: ## HH:MM - [autoresearch/{name}] interrupted at iter {N} - best {snapshot}

Keep the pointer registry (.neuroflow/{phase}/autoresearch-loops.md) current: iterations, best, status (running / paused / interrupted).


Criteria initialization

Build program.md criteria in three layers on first run:

  • Layer 1: Phase defaults. Always included. Full per-phase criteria tables are in references/phase-criteria.md; read it during INIT and copy the active phase's criteria into program.md.
  • Layer 2: Context-inferred. Read existing .neuroflow/ files and add relevant criteria:

    | If this exists | Add criterion |
    |----------------|---------------|
    | .neuroflow/ideation/research-question.md | Alignment with the stated research question |
    | .neuroflow/preregistration/ | Adherence to preregistered hypotheses / analysis plan |
    | project_config.md has target_journal: | Meets [journal] editorial standards |
    | .neuroflow/grant-proposal/ names a funder | Meets [funder] reviewer criteria |
    | .neuroflow/data-analyze/analysis-plan.md | Covers all hypotheses in the analysis plan |
    | .neuroflow/objectives.md | Addresses all project objectives |

  • Layer 3: User input. After printing layers 1+2, ask: "Add your own criteria? (Enter to skip)" → append under ## User criteria.
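
The existence-based Layer 2 rules lend themselves to a lookup table. The sketch below is illustrative: LAYER2_RULES and infer_layer2 are hypothetical names, and it covers only the four rules that depend on a path existing. The target_journal and funder rules need file contents to be read and are left out.

```python
from pathlib import Path
from typing import List

# (path relative to .neuroflow/, criterion to add), in table order.
LAYER2_RULES = [
    ("ideation/research-question.md", "Alignment with the stated research question"),
    ("preregistration", "Adherence to preregistered hypotheses / analysis plan"),
    ("data-analyze/analysis-plan.md", "Covers all hypotheses in the analysis plan"),
    ("objectives.md", "Addresses all project objectives"),
]


def infer_layer2(neuroflow: Path) -> List[str]:
    """Return the context-inferred criteria whose trigger path exists."""
    return [criterion for rel, criterion in LAYER2_RULES
            if (neuroflow / rel).exists()]
```

A call such as infer_layer2(Path(".neuroflow")) would return the criteria to print alongside the Layer 1 defaults before asking for Layer 3 input.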

Resume

If .neuroflow/{phase}/autoresearch-loops.md lists one or more loops:

  • One loop → confirm: "Resume autoresearch '{name}' at {location}? {N} iterations logged, best {snapshot}."
  • Multiple → list them and ask which to resume
  • On resume: read that loop's program.md, __thetask__.md, results.md, and the wiki (index + synthesis), then go straight to the loop (skip INIT)


Slash command

/autoresearch or any phase command invoked with the keyword autoresearch in the prompt.


Bundled resources

  • references/phase-criteria.md: per-phase Layer 1 default criteria (read during INIT)
  • scripts/server.py: optional dashboard template (write to the loop folder only if output_dashboard: on)