# autoresearch
An infinite, multi-session improvement loop for any research artifact. A worker agent makes one focused change per iteration; an evaluator compares the result to the previous best and returns BETTER / WORSE / NO CHANGE. The best version is kept; worse versions are reverted. The loop never stops on its own.
## Core rule – NEVER STOP
The loop runs until the human interrupts it. Period.
- Never decide the artifact is "good enough" and exit
- Never stop because the score plateaued
- Never stop because iterations look repetitive
- Plateau detection is a notification, not a termination condition
- The only valid exit is the user pressing Ctrl-C or typing a stop command
## File structure (created in the user's project)
.neuroflow/{phase}/autoresearch/
├── flow.md
├── program.md       # task + criteria (phase defaults + context-inferred + user-added)
├── __thetask__.md   # pointer manifest → lists which external files are tracked
├── results.md       # iteration log (verdict, delta, running, decision, next focus)
├── server.py        # local dashboard → serves http://localhost:8765
└── history/
    ├── v000/        # baseline snapshot of tracked files
    ├── v001/        # snapshot saved on each KEPT iteration
    └── ...
__thetask__.md is a pointer, not the artifact itself. It lists paths to the external files being improved (e.g. manuscript/introduction.md, .neuroflow/ideation/hypothesis.md, scripts/analysis/pipeline.py). Workers modify those files directly. The evaluator compares current file state to the last history/vBEST/ snapshot.
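For orientation, here is a minimal sketch of resolving the manifest's tracked paths. The helper name and the phase folder are hypothetical, and paths are assumed to be relative to the autoresearch directory, as in the example below:

```python
from pathlib import Path

AUTORESEARCH_DIR = Path(".neuroflow/paper/autoresearch")  # hypothetical phase

def resolve_tracked(manifest_text: str) -> list[Path]:
    """Collect paths under '## Tracked files' and resolve them
    relative to the autoresearch folder."""
    files, in_section = [], False
    for line in manifest_text.splitlines():
        if line.startswith("## "):
            in_section = line.strip() == "## Tracked files"
        elif in_section and line.strip().startswith("- "):
            files.append((AUTORESEARCH_DIR / line.strip()[2:].strip("`")).resolve())
    return files
```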
## __thetask__.md format
# Task Manifest
## Tracked files
- `../../../manuscript/introduction.md`
- `../../../manuscript/methods.md`
## Task description
Continuously improve the introduction and methods until they pass peer review.
## Current best snapshot
history/v004/
## Iterations run
12 (last: YYYY-MM-DD)
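On a KEPT iteration, the last two fields are updated in place. A sketch of that bookkeeping, assuming the format above is followed exactly (the helper name is illustrative):

```python
import re
from pathlib import Path

def bump_manifest(path: Path, new_best: str, date: str) -> None:
    """Update 'Current best snapshot' and increment 'Iterations run'."""
    text = path.read_text(encoding="utf-8")
    text = re.sub(r"(## Current best snapshot\n).+",
                  rf"\g<1>{new_best}", text)
    text = re.sub(r"(## Iterations run\n)(\d+).*",
                  lambda m: f"{m.group(1)}{int(m.group(2)) + 1} (last: {date})",
                  text)
    path.write_text(text, encoding="utf-8")
```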
## program.md template
# Autoresearch Program – {phase}
Started: YYYY-MM-DD
## Task
{one sentence: what is being improved and why}
## Tracked files
{listed from __thetask__.md for reference}
## Default criteria (phase: {phase})
{phase-specific criteria – see the per-phase table below}
## User criteria
<!-- Add your own criteria here, e.g.:
- Must cite at least 3 papers from 2022–2025
- Keep under 500 words
- Target: Nature Neuroscience -->
## Improvement direction
{what "better" looks like โ guiding instruction for the worker each iteration}
## Out of scope
{what must NOT change between iterations}
## Criteria initialization – three layers
On first run, build program.md criteria in three layers:
Layer 1 – Phase defaults (always included; see the per-phase table in this skill)
Layer 2 – Context-inferred (read existing .neuroflow/ files and infer relevant additions):
| If this exists | Add criterion |
|---|---|
| `.neuroflow/ideation/research-question.md` | "Alignment with stated research question" |
| `.neuroflow/preregistration/` | "Adherence to preregistered hypotheses / analysis plan" |
| `project_config.md` has `target_journal:` | "Meets [journal] editorial standards" |
| `.neuroflow/grant-proposal/` has a named funder | "Meets [funder] reviewer criteria (Significance / Innovation / Approach)" |
| `.neuroflow/data-analyze/analysis-plan.md` | "Covers all hypotheses from the analysis plan" |
| `.neuroflow/objectives.md` | "Addresses all project objectives" |
Layer 3 – User input
After printing layers 1+2, ask the user for any additional criteria.
Append any user additions to program.md under ## User criteria.
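A compact sketch of the three-layer build, assuming the context checks from the table above (the function name, signature, and defaults are illustrative, not part of the skill):

```python
from pathlib import Path

def build_criteria(phase_defaults: list[str], user_criteria: list[str],
                   root: Path = Path(".")) -> list[str]:
    criteria = list(phase_defaults)                          # Layer 1
    context_rules = [                                        # Layer 2 (subset)
        (".neuroflow/ideation/research-question.md",
         "Alignment with stated research question"),
        (".neuroflow/preregistration",
         "Adherence to preregistered hypotheses / analysis plan"),
        (".neuroflow/data-analyze/analysis-plan.md",
         "Covers all hypotheses from the analysis plan"),
        (".neuroflow/objectives.md",
         "Addresses all project objectives"),
    ]
    criteria += [c for rel, c in context_rules if (root / rel).exists()]
    return criteria + list(user_criteria)                    # Layer 3
```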
## Per-phase default criteria (Layer 1)
### paper
Drawn from agents/paper-critic.md – six evaluation areas:
1. Language, style, terminology – spelling, grammar, undefined abbreviations, causality language errors (e.g. "X activates Y" from correlational data)
2. Internal consistency – all figures referenced exist and are numbered correctly; numerical values match across sections; subject counts consistent
3. Claim support – every claim has evidence; no causality creep; no functional-connectivity overclaims; no over-generalization beyond the sample
4. Statistics – power justification, correct test choice, effect sizes reported, multiple-comparison correction stated and justified
5. Methods reproducibility – COBIDAS compliance for fMRI; ARRIVE 2.0 for animals; electrode montage and reference stated for EEG; data and code availability statement
6. Contribution and novelty – novelty grounded relative to specific prior papers; alternative interpretations addressed; journal fit justified
### grant-proposal
Drawn from skills/phase-grant-proposal/SKILL.md:
1. Scope – all aims achievable within the stated timeline; no aim requires success of another unless stated
2. Power analysis – formal power analysis per aim with effect size cited from published literature
3. Hypothesis in Approach – every aim has a testable prediction, not just a description of methods
4. Funder alignment – Significance framed for funder priority (NIH: disease burden / mechanism; ERC: frontier science; Wellcome: scientific opportunity)
5. Preliminary data – at least one result figure per aim with statistics visible
6. Budget justification – every budget line has a rationale; FTE fractions stated; equipment identified by model
### ideation
- Novelty – question not already answered in the cited literature; state the closest prior paper
- Testability – can be empirically tested with standard neuroscience methods in a reasonable timeframe
- Specificity – stated as one sentence with named independent variable, dependent variable, and population
- Feasibility – achievable by a neuroscience lab given realistic equipment, sample, and timeline constraints
- Mechanistic grounding – proposes a biological or computational mechanism, not just a correlational observation
### data-analyze
- Plan precedes code – analysis-plan.md written and accepted before any analysis script
- Assumption audit – normality, sphericity, and independence checked explicitly before test selection
- Multiple comparison correction – method named (FWE, FDR, Bonferroni) and justified for the design
- Reproducibility – script is self-contained and re-runnable from raw inputs alone
- Coverage – all hypotheses listed in project_config.md are addressed
- Numeric – statistical power (target ≥ 0.8), effect size (Cohen's d or η²), N per condition reported (see the sketch below)
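For the power criterion, a quick numeric check is possible with statsmodels, assuming it is available in the environment (the effect size and group size below are illustrative):

```python
from statsmodels.stats.power import TTestIndPower

# Illustrative numbers: d = 0.5 with n = 64 per group gives power ≈ 0.80
# for a two-sided independent-samples t-test at alpha = 0.05.
achieved = TTestIndPower().power(effect_size=0.5, nobs1=64, alpha=0.05, ratio=1.0)
print(f"achieved power: {achieved:.2f}  (target ≥ 0.8)")
```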
### experiment
- Ecological validity – experimental conditions reflect the real-world scenario being studied
- Control conditions – every independent variable has a matched control condition
- Counterbalancing – order effects addressed; counterbalancing scheme stated
- Confound identification – known confounds listed; design choices explain how each is controlled
- Numeric – formal power analysis with target power ≥ 0.8; trial count per condition stated
### preregistration
- Specificity – hypothesis statement has no wiggle room; can be unambiguously confirmed or disconfirmed
- Prior grounding – at least one prior result cited per directional prediction
- Falsifiability – defined rejection criterion (threshold, direction) for each hypothesis
- Analysis plan completeness – exact statistical tests, thresholds, exclusion rules, and dependent-variable operationalization stated
- Deviation protocol – explicitly states what will be done if a planned analysis cannot run as specified
### brain-build
- Biological plausibility – all parameters fall within physiologically reported ranges (cite sources)
- Formal completeness – every equation and free parameter defined; no undefined symbols
- Testability – model makes at least two specific, falsifiable empirical predictions
- Parameter justifiability – each free parameter sourced from data, prior fit, or a justified literature estimate
- Data relationship – relationship between model output and empirical recordings explicitly stated
### brain-optimize
- Convergence evidence – optimization converged (loss curve shown or stability criterion met)
- Objective alignment – cost function reflects the scientific question being asked
- Sensitivity justification – parameters the optimizer was most sensitive to are identified and discussed
- Generalisability – fit not only to training data; held-out or cross-validated performance reported
- Numeric – final loss / R² / correlation with empirical data reported per iteration
### brain-run
- Output clarity – outputs are labelled, units stated, axes named
- Parameter documentation – full parameter set used for the run is saved alongside outputs
- Reproducibility – run is reproducible from the saved parameter set alone
- Interpretation soundness – results interpreted within the bounds of model assumptions
- Limitation acknowledgment – at least one key model limitation noted in context of the outputs
### data-preprocess
- Pipeline completeness – all steps from raw to analysis-ready documented in order
- Artifact handling – ocular, muscle, and line-noise artifacts addressed; strategy stated
- BIDS compliance – output folder structure matches BIDS specification
- Reproducibility – pipeline re-runnable from the script alone with no manual steps
- Numeric – channel rejection rate (flag if > 20%), epoch rejection rate, and SNR estimate reported (see the sketch below)
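The 20% channel-rejection flag is a one-line check; a sketch with illustrative numbers:

```python
def check_channel_rejection(n_rejected: int, n_total: int) -> float:
    """Return the rejection rate and warn when it exceeds the 20% flag."""
    rate = n_rejected / n_total
    if rate > 0.20:
        print(f"WARNING: channel rejection rate {rate:.0%} exceeds 20%")
    return rate

check_channel_rejection(7, 64)   # e.g. 7 of 64 EEG channels rejected (~11%)
```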
### poster / slideshow / write-report
- Visual / structural hierarchy – most important claim is the most prominent element
- Core claim clarity – the main message is readable or identifiable within 5 seconds
- Evidence density – every claim has at least one supporting data point or citation visible
- Audience targeting – vocabulary and technical depth match the stated audience
- Narrative flow – logical order; each panel or section leads naturally to the next
### all other phases
Clarity, Completeness, Scientific rigour, Feasibility, Audience alignment
## Loop protocol
### INIT (first run only)
- Read `project_config.md` → determine the active phase
- Create `.neuroflow/{phase}/autoresearch/`
- Ask: "Which files should autoresearch improve?" (or infer from the `--target` flag in the invocation)
- Build criteria: Layer 1 + Layer 2 (from context) + Layer 3 (user input) → write to `program.md`
- Copy the current state of tracked files into `history/v000/` (baseline snapshot; sketched after this list)
- Write a baseline row to `results.md`
- Write `server.py` into `.neuroflow/{phase}/autoresearch/server.py` using the template in the Dashboard server template section of this skill
- Tell the user: "Dashboard: run `python .neuroflow/{phase}/autoresearch/server.py` → http://localhost:8765"
- Write `flow.md` for the autoresearch folder
- Start the loop
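The baseline-snapshot step, sketched under the assumption that the tracked paths are already resolved (helper and argument names are illustrative):

```python
import shutil
from pathlib import Path

def init_baseline(autoresearch_dir: Path, tracked: list[Path]) -> Path:
    """Copy the current state of every tracked file into history/v000/."""
    baseline = autoresearch_dir / "history" / "v000"
    baseline.mkdir(parents=True, exist_ok=True)
    for f in tracked:
        shutil.copy2(f, baseline / f.name)  # flat copy; name collisions not handled
    return baseline
```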
### LOOP – NEVER STOP
REPEAT FOREVER until the human interrupts:
a. Read program.md + __thetask__.md → resolve tracked file paths
b. Read tracked files (current state)
c. Read results.md tail (last 5 rows) – what was tried recently
d. Read history/vBEST/ snapshot (the current best version)
e. WORKER – spawn general-purpose agent:
Prompt contains:
- Phase skill content (neuroflow:phase-{phase})
- program.md (task, criteria, improvement direction, out of scope)
- Current content of tracked files
- results.md tail for context
- Instruction: "Make ONE focused improvement targeting the weakest criterion.
Do NOT rewrite everything. Make one surgical change.
Return only the modified file(s) with the change applied."
f. EVALUATOR – spawn general-purpose agent:
Prompt contains:
- Criteria from program.md
- Current tracked files (post-worker)
- history/vBEST/ snapshot (previous best)
- Instruction: "Compare these two versions of the tracked files.
Is the new version BETTER, WORSE, or NO CHANGE relative to the previous best?
Return exactly:
VERDICT: BETTER | WORSE | NO CHANGE
     Delta: integer −5 (much worse) to +5 (much better)
Criteria notes: per-criterion one-line assessment
Numeric values: extract any numeric criteria values if applicable
(power, Rยฒ, rejection rate, loss, word count, citation count, etc.)
     Next focus: one sentence – the single weakest area to target next"
g. If BETTER:
   - Save the current state of tracked files → history/vNNN/ (NNN = zero-padded iteration number; snapshot/restore is sketched after step j)
- Update __thetask__.md: increment "Iterations run", update "Current best snapshot"
- Append KEPT row to results.md
- Update flow.md
h. If WORSE or NO CHANGE:
- Restore tracked files from history/vBEST/ (overwrite tracked files with snapshot content)
- Append REVERTED row to results.md
i. Plateau detection – if 5 consecutive REVERTs:
- Append "--- PLATEAU DETECTED (5 consecutive REVERTs) ---" to results.md
- Print: "5 consecutive reversions with no improvement.
Consider adding new directions to program.md under '## User criteria'
or '## Improvement direction'. Continuing loop."
   - DO NOT STOP – continue the loop
j. Go to step a. NEVER stop on your own.
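Steps g–i condensed into one sketch, assuming resolved paths and the flat snapshot layout used above (all names are illustrative):

```python
import shutil
from pathlib import Path

def apply_verdict(verdict: str, iteration: int, tracked: list[Path],
                  history: Path, best: Path, reverts: int) -> tuple[Path, int]:
    if verdict == "BETTER":
        snap = history / f"v{iteration:03d}"   # zero-padded snapshot dir
        snap.mkdir(parents=True, exist_ok=True)
        for f in tracked:
            shutil.copy2(f, snap / f.name)     # save new best
        return snap, 0                         # revert streak resets
    for f in tracked:
        shutil.copy2(best / f.name, f)         # restore previous best
    reverts += 1
    if reverts >= 5:
        print("PLATEAU DETECTED – 5 consecutive REVERTs; continuing loop")
    return best, reverts
```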
## Evaluator output format
VERDICT: BETTER
Delta: +3
Criteria notes:
- Language/style: no change – prose quality unchanged
- Claim support: improved – mechanism sentence added, previously missing
- Statistics: improved – power value now cited (0.74)
- Methods reproducibility: no change
- Contribution/novelty: no change
Numeric values:
- power: 0.74
- word_count: 487
Next focus: The intro-to-methods transition is abrupt – add a single bridging sentence.
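Because the reply format is fixed, the controlling loop can extract the verdict with two anchored regexes. A sketch that treats a malformed reply as NO CHANGE (that fallback is an assumption, not specified above):

```python
import re

def parse_verdict(text: str) -> tuple[str, int]:
    v = re.search(r"^VERDICT:\s*(BETTER|WORSE|NO CHANGE)", text, re.M)
    d = re.search(r"^Delta:\s*([+-]?\d+)", text, re.M)
    return (v.group(1) if v else "NO CHANGE",
            int(d.group(1)) if d else 0)

parse_verdict("VERDICT: BETTER\nDelta: +3")   # -> ('BETTER', 3)
```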
## results.md format
# Autoresearch Results – {phase}
Started: YYYY-MM-DD HH:MM
| # | Verdict | Δ | Running | Decision | Next focus |
|---|---------|---|---------|----------|------------|
| 000 | – | 0 | 0 | KEPT (baseline) | – |
| 001 | BETTER | +3 | 3 | KEPT | Intro→methods transition |
| 002 | WORSE | -1 | 3 | REVERTED | Overcomplicated methods |
| 003 | BETTER | +2 | 5 | KEPT | Citation density in Discussion |
For phases with numeric criteria, append columns after Next focus (e.g. power, R2, word_count).
Running column rules:
- KEPT: running = previous running + delta
- REVERTED: running unchanged (the file was restored, so quality is the same as before)
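The same rule in code form, matching rows 002 and 003 of the example table:

```python
def next_running(previous: int, delta: int, kept: bool) -> int:
    """KEPT adds the delta; REVERTED leaves the running score unchanged."""
    return previous + delta if kept else previous

next_running(3, +2, kept=True)    # 5 – row 003, KEPT
next_running(3, -1, kept=False)   # 3 – row 002, REVERTED
```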
## Session logging
Append to .neuroflow/sessions/YYYY-MM-DD.md at:
- Loop start: ## HH:MM – [autoresearch/{phase}] loop started – tracking {N} file(s)
- Every 10 iterations: ## HH:MM – [autoresearch/{phase}] iteration {N} – running quality: {R} – best: {snapshot}
- Plateau detection: ## HH:MM – [autoresearch/{phase}] PLATEAU – 5 consecutive REVERTs
- Loop interrupted: ## HH:MM – [autoresearch/{phase}] loop interrupted at iteration {N} – best: history/{snapshot}/
## Slash command
/autoresearch or any phase command invoked with the keyword autoresearch in the prompt.
## Dashboard server template
Write the following Python script verbatim to .neuroflow/{phase}/autoresearch/server.py during INIT. It uses only the Python stdlib plus Chart.js from a CDN – no pip installs required.
#!/usr/bin/env python3
"""
Autoresearch dashboard – serves http://localhost:8765
Reads results.md on every request; auto-refreshes with ?watch=1
Usage: python server.py [--port 8765]
"""
import argparse
import json
import os
import re
from http.server import BaseHTTPRequestHandler, HTTPServer
RESULTS_FILE = os.path.join(os.path.dirname(__file__), "results.md")
THETASK_FILE = os.path.join(os.path.dirname(__file__), "__thetask__.md")
def parse_results():
"""Parse results.md table into list of dicts."""
rows = []
if not os.path.exists(RESULTS_FILE):
return rows
with open(RESULTS_FILE, encoding="utf-8") as f:
content = f.read()
in_table = False
headers = []
for line in content.splitlines():
line = line.strip()
if line.startswith("| #") or line.startswith("|#"):
headers = [h.strip() for h in line.strip("|").split("|")]
in_table = True
continue
if in_table and line.startswith("|---"):
continue
if in_table and line.startswith("|"):
cells = [c.strip() for c in line.strip("|").split("|")]
if len(cells) >= len(headers):
rows.append(dict(zip(headers, cells)))
        elif in_table and line.startswith("---"):
            continue  # plateau marker / section divider between rows
return rows
def parse_thetask():
"""Return task description and tracked files from __thetask__.md."""
if not os.path.exists(THETASK_FILE):
return "", [], "history/v000", 0
with open(THETASK_FILE, encoding="utf-8") as f:
content = f.read()
desc = re.search(r"## Task description\n(.+?)(?:\n##|\Z)", content, re.S)
desc = desc.group(1).strip() if desc else ""
files_section = re.search(r"## Tracked files\n(.+?)(?:\n##|\Z)", content, re.S)
files = []
if files_section:
for line in files_section.group(1).splitlines():
line = line.strip().strip("-").strip().strip("`")
if line:
files.append(line)
best = re.search(r"## Current best snapshot\n(.+)", content)
best = best.group(1).strip() if best else "history/v000"
iters = re.search(r"## Iterations run\n(\d+)", content)
iters = int(iters.group(1)) if iters else 0
return desc, files, best, iters
def build_html(rows, desc, files, best, iters, watch):
labels = [r.get("#", "") for r in rows]
running = []
for r in rows:
try:
running.append(float(r.get("Running", 0)))
except ValueError:
running.append(0)
# collect numeric columns (anything after "Next focus")
all_keys = []
if rows:
all_keys = list(rows[0].keys())
    std_keys = {"#", "Verdict", "Δ", "Running", "Decision", "Next focus"}
num_keys = [k for k in all_keys if k not in std_keys and k]
num_datasets = []
for key in num_keys:
vals = []
        for r in rows:
            raw = r.get(key, "").replace("–", "").strip()
            try:
                vals.append(float(raw))
            except ValueError:
                vals.append(None)  # serialized as null → a gap in the chart
num_datasets.append({"label": key, "data": vals})
    last_focus = rows[-1].get("Next focus", "–") if rows else "–"
plateau = any("PLATEAU" in r.get("Decision", "") for r in rows)
refresh = '<meta http-equiv="refresh" content="30">' if watch else ""
num_charts_html = ""
for ds in num_datasets:
num_charts_html += f"""
<div class="chart-wrap">
<canvas id="chart_{ds['label']}"></canvas>
</div>
<script>
new Chart(document.getElementById('chart_{ds["label"]}'), {{
type: 'line',
data: {{
labels: {json.dumps(labels)},
datasets: [{{
label: '{ds["label"]}',
        data: {json.dumps(ds["data"])},
borderColor: '#a78bfa',
backgroundColor: 'rgba(167,139,250,0.15)',
tension: 0.3,
spanGaps: true,
}}]
}},
options: {{ responsive: true, plugins: {{ legend: {{ display: true }} }} }}
}});
</script>
"""
return f"""<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
{refresh}
<title>Autoresearch Dashboard</title>
<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
<style>
body {{ font-family: system-ui, sans-serif; background: #0f0f13; color: #e2e8f0; margin: 0; padding: 24px; }}
h1 {{ font-size: 1.4rem; margin-bottom: 4px; color: #c4b5fd; }}
.meta {{ font-size: 0.82rem; color: #94a3b8; margin-bottom: 20px; }}
.cards {{ display: flex; gap: 16px; flex-wrap: wrap; margin-bottom: 24px; }}
.card {{ background: #1e1e2e; border-radius: 10px; padding: 16px 20px; min-width: 160px; }}
.card-label {{ font-size: 0.75rem; color: #94a3b8; text-transform: uppercase; letter-spacing: .05em; }}
.card-value {{ font-size: 1.6rem; font-weight: 700; color: #c4b5fd; }}
.plateau {{ color: #f59e0b; font-weight: bold; }}
.chart-wrap {{ background: #1e1e2e; border-radius: 10px; padding: 16px; margin-bottom: 20px; }}
.focus-box {{ background: #1e1e2e; border-left: 3px solid #c4b5fd; padding: 12px 16px;
border-radius: 0 8px 8px 0; margin-bottom: 20px; font-size: 0.9rem; }}
.files {{ font-size: 0.8rem; color: #64748b; margin-top: 4px; }}
</style>
</head>
<body>
<h1>Autoresearch Dashboard</h1>
<div class="meta">{desc}</div>
<div class="files">Tracked: {" ยท ".join(files)}</div>
<div class="meta">Best snapshot: {best} ยท Iterations: {iters}</div>
<div class="cards">
<div class="card"><div class="card-label">Iterations</div><div class="card-value">{iters}</div></div>
<div class="card"><div class="card-label">Running quality</div>
<div class="card-value">{running[-1] if running else 0:+.0f}</div></div>
<div class="card"><div class="card-label">Last verdict</div>
<div class="card-value" style="font-size:1.1rem">{rows[-1].get("Verdict","โ") if rows else "โ"}</div></div>
{"<div class='card'><div class='card-label plateau'>โ Plateau</div><div class='card-value plateau'>5 REVERTs</div></div>" if plateau else ""}
</div>
<div class="focus-box"><strong>Next focus:</strong> {last_focus}</div>
<div class="chart-wrap">
<canvas id="qualityChart"></canvas>
</div>
<script>
new Chart(document.getElementById('qualityChart'), {{
type: 'line',
data: {{
labels: {json.dumps(labels)},
datasets: [
{{
label: 'Running quality',
data: {json.dumps(running)},
borderColor: '#818cf8',
backgroundColor: 'rgba(129,140,248,0.1)',
tension: 0.2,
fill: true,
}},
{{
label: 'KEPT',
data: {json.dumps([r.get("Running") if "KEPT" in r.get("Decision","") else None for r in rows])},
borderColor: 'rgba(0,0,0,0)',
backgroundColor: '#34d399',
pointRadius: 7,
pointHoverRadius: 9,
showLine: false,
spanGaps: false,
}},
{{
label: 'REVERTED',
data: {json.dumps([r.get("Running") if "REVERTED" in r.get("Decision","") else None for r in rows])},
borderColor: 'rgba(0,0,0,0)',
backgroundColor: '#f87171',
pointRadius: 6,
pointHoverRadius: 8,
showLine: false,
spanGaps: false,
}},
]
}},
options: {{
responsive: true,
plugins: {{ legend: {{ display: true }} }},
scales: {{ y: {{ grid: {{ color: '#2d2d3d' }}, ticks: {{ color: '#94a3b8' }} }},
x: {{ grid: {{ color: '#2d2d3d' }}, ticks: {{ color: '#94a3b8' }} }} }}
}}
}});
</script>
{num_charts_html}
</body>
</html>"""
class Handler(BaseHTTPRequestHandler):
def log_message(self, format, *args):
pass # suppress request logs
def do_GET(self):
watch = "watch=1" in self.path
rows = parse_results()
desc, files, best, iters = parse_thetask()
html = build_html(rows, desc, files, best, iters, watch)
self.send_response(200)
self.send_header("Content-Type", "text/html; charset=utf-8")
self.end_headers()
self.wfile.write(html.encode("utf-8"))
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--port", type=int, default=8765)
args = parser.parse_args()
print(f"Autoresearch dashboard โ http://localhost:{args.port}")
print(f"Auto-refresh: http://localhost:{args.port}?watch=1")
print("Ctrl-C to stop")
HTTPServer(("", args.port), Handler).serve_forever()
if __name__ == "__main__":
main()