Nelli is a plan-driven AI-scientist harness. An agent frames a problem, declares
that run's own concrete deliverables, and is held to recorded evidence for each
one. Completion is gated on the deliverables the planner declared for that run,
not a fixed checklist, so the same harness applies across scientific domains. The
primary application is discovery and characterization of new lineages of life
from omics data: recovering and placing novel microbial and viral genomes, then
holding novelty claims to evidence. The bio domain (--domain bio) layers a
built-in bioinformatics requirement set on top of the planner's declared
deliverables.
Nelli also runs as a benchmark harness for scientific agent prompts (see Benchmark Snapshot).
Gating is plan-driven and domain-agnostic. The planner calls the
declare_deliverables tool, which writes plan/deliverables.json. Deliverable
keys are free-form and planner-defined. Readiness then requires that every
declared deliverable is backed by a recorded evidence artifact that is tagged
with the deliverable's exact key and carries a non-empty evidence_paths file. A
prose mention of the key, or a narrative summary with no artifact file, does not
satisfy a deliverable. A run that declares nothing is never trivially "ready".
--domain bio is opt-in. It is not forced by keywords. It accepts
bio, biology, omics, genomics, and metagenomics, and adds the built-in
omics requirement set (functional annotation, comparative genomics, viral/lineage
quality, phylogenomic placement, novelty) on top of the declared deliverables.
For the bio domain, this repo drives the vendored bioinformatics skillpack in
vendor/omics-skills/. It turns those workflow contracts (read QC, assembly,
MAG/viral QC, gene calling, annotation/taxonomy, pangenomes, phylogenomics,
viromics) into gated, reproducible agent runs.
The agent backend is selectable via nelli.toml [backend].provider or
--provider. The default backend is the OpenAI Codex SDK (--provider codex,
model gpt-5.5, ChatGPT-login auth). The alternatives are claude-code (Claude
Code headless via claude -p), cborg, openrouter, and local-gemma. The
codex and claude-code backends both run agentically inside the project
workspace.
Codex runs with maximum freedom by default. codex_sandbox defaults to
danger-full-access; set it to workspace-write or read-only via
nelli.toml [sandbox].codex_sandbox or $NELLI_CODEX_SANDBOX. The codex
backend keeps its own execution inside the Codex sandbox. claude-code with
bypassPermissions runs as a host process under Claude Code's own permission
model and is not confined by Nelli's bwrap sandbox.
| Benchmark | Metric | Sonnet 4.6 | GLM-5 | GPT-5.5 |
|---|---|---|---|---|
| SGI Task 1: Deep Research | Step-Level Accuracy | 66.4% | 64.6% | 74.0% |
| SGI Task 1: Deep Research | Exact Match | 15.9% | 11.7% | 56.7% |
| SGI Task 2: Idea Generation | Overall Quality (0-10) | 6.2 | 5.5 | 8.0 |
| SGI Task 3.1: Dry Experiment | Code Correctness (est.) | 55.0% | 42.8% | 79.0% |
| SGI Task 3.2: Wet Experiment | Protocol Quality (0-10) | 10.0 | 9.8 | 8.7 † |
| SGI Task 4: Experimental Reasoning | MC Accuracy | 43.7% | 34.5% | n/a ‡ |
| ScienceAgentBench | Valid Code Generation | 100% | 100% | 100% |
| ScienceAgentBench | Tasks Completed | 10/27 | 8/27 | 10/10 § |
The GPT-5.5 column is the current default backbone (Codex SDK, gpt-5.5,
reasoning_effort=xhigh, ChatGPT-login auth), run on 2026-06-04. The Sonnet 4.6
and GLM-5 columns are the original 2026-03-10 CBORG runs, kept unchanged. GPT-5.5
was scored with the same sampled, local-Claude-judge methodology as the baseline,
on the first N valid baseline rows per task (Task 1 n=30, Task 2 n=24, Task 3.1
n=20, Task 3.2 n=24; ScienceAgentBench 10 bio tasks). It used the same benchmark
source files but not the baseline's exact scored sample, and the per-task sizes
differ from the baseline's. SGI scores are sampled LLM-judge results; read them as
directional, not as full-set leaderboard placements.
Caveats:
- † Task 3.2 was re-graded with a stricter rubric. The report flags the baseline's near-perfect 10.0/9.8 as "likely generous", so this is not a like-for-like comparison. GPT-5.5's protocols score 9.0/10 on completeness, graded harder.
- ‡ Task 4 interprets experimental figures (vision). The Codex SDK path is text-only and the SGI image data is gated, so GPT-5.5 was not evaluated on it.
- § The baseline "Tasks Completed X/27" reflects a CBORG budget cap ($50). The Codex SDK has no API budget cap, so GPT-5.5 produced valid programs for all 10 sampled bio tasks (10/10, 133–389 LOC each).
Headline numbers come from
docs/reports/science-gym-benchmark-report.md.
The in-repo dev suite in benchmarks/dev_suite.json runs fully from this repo,
but it is a smoke harness, not a research benchmark.
src/nelli_ai_scientist/
codex_sdk.py # OpenAI Codex SDK runtime and direct chat client
model_clients.py # Provider selection; Codex SDK default
cborg.py # CBORG/OpenAI-compatible HTTP client
agent_validation.py # Pydantic validation for chat payloads and tool calls
pydantic_ai_validation.py # Optional Pydantic AI structured-output helper
agent.py # Multi-turn agent loop with tool dispatch
external_benchmarks.py # External suite config, validation, and execution
research/ # Plan-driven deliverables, evidence schemas, convergence
tools/ # Workspace-scoped file, shell, and literature tools
tool_call_parser.py # Native + text fallback tool parsing
harness.py # Benchmark loading, prompting, scoring, run artifacts
cli.py # validate / run-benchmark / run-agent / run-council / external suites
src/nelli_ccco/
cli.py # Claude/Codex orchestrator plus Codex worker harness
domains/ # Generic and bio domain packs for stage gates
contracts.py # Required output, worker manifest, and claim gates
loop.py # Capped convergence loop over durable research runs
benchmarks/ # Local dev suite, fixtures, and external-suite config template
tests/ # Unit tests for client, harness, tools, and agent loop
vendor/ # Vendored prompt and skill context
docs/reports/ # Benchmark writeups and methodology notes
scripts/ # Helper scripts for local execution
The same core loop powers interactive agent runs and benchmark runs. Benchmark
mode adds case loading, expectation-based scoring, and a summary.json artifact.
flowchart LR
A[roles.json + vendor] --> C[build_messages]
B[dev_suite.json case] --> C
C --> D[Model client]
D --> E[LLM]
E --> F{tool call?}
F -->|yes| G[ToolRegistry]
G --> H[workspace tools]
H --> D
F -->|no| I[final answer]
B --> J[score_response]
I --> J
J --> K[runs summary.json]
-
Python 3.11–3.12 (provided by the project's pixi / conda-forge environment).
-
For the default
codexprovider: the Codex CLI onPATHand a logged-in Codex/ChatGPT account. The Codex CLI is a Node.js tool published as@openai/codex:npm install -g @openai/codex # needs a recent Node.js codex login # complete the ChatGPT sign-in once codex --version # confirm it is on PATH
Nelli locates the binary via
PATH(override withNELLI_CODEX_BINARY) and defaults to modelgpt-5.5atreasoning_effort=xhigh. -
For the
cborg/ OpenAI-compatible HTTP path only: a CBORG or OpenAI-compatible API key (see Configure a model provider).
# 1. install the environment (editable package + the openai-codex SDK)
pixi install
# 2. log the Codex CLI in (one time; see Requirements)
codex login
# 3. sanity-check the install
pixi run validate
# 4. run the agent on a task: question in --prompt, data in --workspace
pixi run python -m nelli_ai_scientist run-agent \
--provider codex --model gpt-5.5 \
--role omics_scientist \
--prompt "Analyze the FASTA files in the workspace and compare gene content across bins" \
--workspace /path/to/your/data--promptis the task as inline text.run-councilalso accepts--prompt-file <path>for long prompts;run-agentis inline--promptonly.--workspaceis the directory the agent may read and write (default./workspace).- Results land under
runs/<timestamp>-<model>-agent/(transcriptsession.jsonl,summary.json). - Under pixi the package is installed editable, so
PYTHONPATH=srcis not needed. ThePYTHONPATH=src python -m nelli_ai_scientist …form used in the rest of this README is the equivalent for a plain venv.
Codex SDK is the default provider. It reuses an existing Codex/ChatGPT login (set
up once with codex login; see Requirements) and needs no CBORG
key:
PYTHONPATH=src python -m nelli_ai_scientist run-agent \
--provider codex \
--model gpt-5.5 \
--role omics_scientist \
--prompt "Analyze the FASTA files in the workspace" \
--workspace /path/to/dataFor the CBORG path, keep CBORG credentials in ~/.secrets/apis.txt and use the
helper:
source scripts/use-cborg.shOr set the OpenAI-compatible variables directly:
export OPENAI_API_KEY="your-api-key"
export OPENAI_BASE_URL="https://your-api-endpoint.com"CBORG_API_KEY and CBORG_BASE_URL are also supported for --provider cborg.
If no base URL is set, the CBORG client defaults to https://api.cborg.lbl.gov.
Provider modes:
codex: default. OpenAI Codex Python SDK with the local Codex authentication/session. Defaults togpt-5.5unless--model,NELLI_MODEL, orNELLI_CODEX_MODELis set.claude-code: Claude Code headless viaclaude -p, run agentically in the workspace. Tune[backend].claude_permission_mode/claude_binary(envNELLI_CLAUDE_PERMISSION_MODE/NELLI_CLAUDE_BINARY).cborg: usesCBORG_API_KEYorOPENAI_API_KEY. With no model given (no--model,NELLI_MODEL, orCBORG_MODEL), it defaults to the free, LBL-hostedlbl/cborg-coder, so runs do not spend the paid budget by accident. Pass--model(for exampleclaude-sonnet,claude-opus) to opt into a paid model.openrouter: usesOPENROUTER_API_KEY, plus optionalOPENROUTER_BASE_URL.local-gemma: local vLLM Gemma service athttp://127.0.0.1:8010, default modelgemma4-31b.
The local Gemma defaults match the ~/bester-hosting/services/gemma4
deployment: the OpenAI API base is http://127.0.0.1:8010/v1, /health returns
200, and /v1/models lists gemma4-31b. That deployment runs without an
OpenAI tool-call parser, so Nelli disables native tool calls for
--provider local-gemma and relies on text tool-call fallback parsing.
Run against local Gemma without an API key:
PYTHONPATH=src python -m nelli_ai_scientist run-agent \
--provider local-gemma \
--role omics_scientist \
--pydantic-ai-final-validation \
--prompt "Inspect the workspace and summarize the available files." \
--workspace /path/to/dataProvider options are available on run-benchmark, run-agent, and
run-research-agent:
PYTHONPATH=src python -m nelli_ai_scientist run-research-agent \
--provider local-gemma \
--run research_runs/<run_id> \
--stage analysis \
--prompt "Continue the next convergence step."For the loop script, set NELLI_PROVIDER=local-gemma; the script lets the
provider default the model to gemma4-31b when NELLI_MODEL and run metadata do
not specify one:
NELLI_PROVIDER=local-gemma scripts/nelli-once.sh research_runs/<run_id>Some model routes reject sampling fields that older routes accept. Nelli omits
temperature for claude-opus-4-7; use --omit-temperature with run-agent or
run-research-agent for any other route that rejects the temperature field.
The repo drives external benchmark checkouts through a local TOML config. A
versioned template lives at benchmarks/external_suites.toml.example; the default
local config path is benchmarks/external_suites.toml.
List configured suites:
PYTHONPATH=src python -m nelli_ai_scientist list-external-benchmarksPreview a configured command without running it:
PYTHONPATH=src python -m nelli_ai_scientist run-external-benchmark \
scienceagentbench \
--dry-run \
--set model=claude-sonnet-4-6Run a configured external suite command:
PYTHONPATH=src python -m nelli_ai_scientist run-external-benchmark \
sgi_bench \
--set model=claude-sonnet-4-6PYTHONPATH=src python -m nelli_ai_scientist validate
PYTHONPATH=src python -m unittest discover -s tests -vvalidate also checks benchmarks/external_suites.toml when that local file is
present.
These code-aligned checks confirm the harness still matches its documented runtime contract (providers, validation boundaries, reflection/literature checkpoints, run artifacts):
pixi run test
pixi run validate
pixi run validate-catalog
pixi run python -m compileall -q src tests
git diff --checkThe memory channels backing these runs (cross-session memd, per-run CCCO
memory, the MEMORY.md failure log, and tasks/lessons.md) are mapped in
docs/memory.md.
PYTHONPATH=src python -m nelli_ai_scientist run-benchmark \
--model claude-opus-4-6 \
--output-dir runs/opus46-smokeFor a single smoke case:
PYTHONPATH=src python -m nelli_ai_scientist run-benchmark \
--model claude-opus-4-6 \
--case omics_mag_plan \
--output-dir runs/opus46-omics-smokePYTHONPATH=src python -m nelli_ai_scientist run-agent \
--model claude-sonnet-4-6 \
--role omics_scientist \
--prompt "Analyze the FASTA files in the workspace" \
--workspace /path/to/datarun-agent writes a durable session.jsonl transcript and summary.json under
runs/<timestamp>-<model>-agent/ by default. Use --output-dir to choose the run
artifact directory.
Shell tool execution is confined by bubblewrap (tools/sandbox.py). Three
properties matter for analysis runs:
- Network is on by default for agent runs. The agent and council reach the
internet (download, install, and run tools in the workspace) without a flag.
Pass
--no-allow-networkto isolate a run. The secureSandboxPolicydefault is network-off, so isolation is the floor and the CLI opens network for interactive agent runs at the registration sites. - The project/workspace dir is writable. The host root is read-only and
credential dirs (
~/.ssh,~/.secrets,~/.aws, and similar) are masked. - The shared reference-database dir is mounted read-only inside the sandbox and
exported to tools as
$DB_PATH(also$BIO_DB_ROOT,$NELLI_SHARED_DB_ROOT). Its location resolves in order: envDB_PATH>nelli.toml[sandbox].db_path/media/shared-expansion/db.
DB_PATH lives in ~/.secrets/nelli-ai-scientist.env (loaded at CLI startup; the
real environment always wins over the file). Copy nelli.toml.example to
nelli.toml (gitignored) to override db_path, allow_network, and the Codex
codex_sandbox / codex_network settings without touching the environment.
run-council runs a sequential review council (omics scientist, literature
expert, skeptic, methods/reproducibility reviewer), then hands off to a single
AgentLoop executor that does the workspace-mutating analysis. The council shares
the executor engine with run-agent.
PYTHONPATH=src python -m nelli_ai_scientist run-council \
--provider codex --model gpt-5.5 \
--prompt "Characterize the FASTA assemblies in the workspace" \
--workspace /path/to/dataKey flags:
--live-councildrives each council role with a live model call (real deliberation) instead of the offline deterministic review.--enforce-scaffold/--no-enforce-scaffold(council default: on) make the scientific scaffold a hard requirement (see below).--role-set omics,--mode {sequential,debug,...},--max-rounds,--allow-network, and the shared--provider/ model / turn-budget flags.
When enabled, a run cannot finalize until the full scientific scaffold is present. Hard gates in the executor enforce this, not prompt text alone:
- Literature first: the first tool call must be
search_literature_dovmed(the polars-dovmed PMC/bioRxiv backend). Analysis tools and the PubMedsearch_literaturetool are rejected until it runs. -
=5 hypotheses:
results/hypotheses.jsonlmust hold at least five distinct hypotheses, each with anext_discriminating_analysis. - Research plan:
results/research_plan.jsonmust give adiscriminating_testfor every hypothesis. - Reflection: a
results/reflection_<n>.jsonmust assess every hypothesis against the evidence. - Hypothesis adjustment: every hypothesis must leave the initial
activestatus (supported / weakened / rejected / superseded / unresolved).
run-agent --enforce-scaffold applies the same gates (opt-in; off by default so
the benchmark suite is unaffected). The validators live in council/ledger.py;
the finalization gate is in agent.py and gates.py:gate_scaffold_deliverable.
run-council is the discovery-grade default; run-agent and run-research-agent
are generic unless --enforce-scaffold is passed. For real lineage-discovery
work, run run-council (or pass --enforce-scaffold) so the literature-first,
hypothesis, plan, reflection, and finalization gates are enforced rather than left
to prompt text. This scaffold layer is separate from the plan-driven deliverables
gate above; a run can use either, or both.
Nelli validates model I/O at several generic boundaries:
CodexSDKChatClientis the default runtime for--provider codex.CborgClientvalidates outgoing chat payload shape and incoming chat-completion response envelopes with Pydantic.AgentLoopvalidates outgoing message history, parsed native/text tool calls, and final text before accepting completion.run-agentandrun-research-agentcan add a conservative Pydantic AI final-answer validator with--pydantic-ai-final-validation. This validator may confirm the final text but cannot rewrite it.- Malformed tool arguments are never executed as
{}. They consume the configured retry budget and either recover through a corrected model response or stop withstopped_reason: "error". - Blank final responses retry or fail instead of counting as clean completion.
For model-authored structured output beyond the core agent loop, use the optional Pydantic AI helper. Local Gemma should use prompted output unless the vLLM structured-output behavior is revalidated after a serving change.
from nelli_ai_scientist.pydantic_ai_validation import (
ValidatedTextOutput,
build_pydantic_ai_output_agent,
)
agent = build_pydantic_ai_output_agent(
provider="local-gemma",
output_type=ValidatedTextOutput,
output_mode="prompted",
)
result = agent.run_sync("Return a short validated status message.")
validated = result.outputResearch runs keep plans, events, artifacts, metric contracts, jobs, and
summaries in one workspace under research_runs/.
PYTHONPATH=src python -m nelli_ai_scientist init-research \
--title "Estuary sulfur MAG analysis" \
--goal "Recover MAGs and compare sulfur metabolism across salinity zones" \
--primary-metric f1Record a baseline with a metric contract:
PYTHONPATH=src python -m nelli_ai_scientist record-baseline \
--run research_runs/<run_id> \
--title "Initial baseline" \
--metrics '{"f1": 0.70}' \
--metric-contract research_runs/<run_id>/metric_contract.jsonRecord a comparable experiment:
PYTHONPATH=src python -m nelli_ai_scientist record-experiment \
--run research_runs/<run_id> \
--baseline <baseline_artifact_id> \
--hypothesis "Improve the primary metric with one controlled change" \
--metrics '{"f1": 0.73}'Run and inspect a durable research job:
PYTHONPATH=src python -m nelli_ai_scientist research-exec start \
--run research_runs/<run_id> -- python script.py
PYTHONPATH=src python -m nelli_ai_scientist research-exec list \
--run research_runs/<run_id>
PYTHONPATH=src python -m nelli_ai_scientist research-status \
--run research_runs/<run_id>Stage-aware research runs reuse durable memory, compact prior session context, reduce tools by stage, and expose research-run tools for memory, artifact inspection, and diagnostics.
Research agents carry an iterative discovery protocol. When
--reflection-checkpoints is enabled and search_literature is available, the
agent performs an initial literature/current-methods search while reviewing the
initial plan, before any substantive analysis tool or final answer is accepted. It
then records a concise initial literature summary and adjusted plan with
write_file; if it bundles other calls with that record, only the record call
runs until the checkpoint is satisfied. After each later material result, the
agent reconciles the finding against literature/current methods, refines the plan,
runs the strongest missing follow-up analysis, and repeats until manuscript
readiness is no longer blocked.
Two literature tools exist. search_literature queries NCBI PubMed E-utilities
(no API key, citation metadata only). search_literature_dovmed queries
open-access full text (PMC / bioRxiv) through the polars-dovmed backend and is the
tool required under --enforce-scaffold; it shells out to the polars-dovmed helper
(NELLI_DOVMED_SCRIPT / NELLI_DOVMED_PYTHON override its location). Run separate
--corpus pmc and --corpus biorxiv queries rather than combining them. The
writing stage refuses to run by default until convergence evidence is recorded;
use --allow-unconverged-writing only for draft scaffolds that should list
missing evidence rather than claim completion.
For bio/omics signals, the readiness checklist expects gene/protein functional annotation plus comparative genomics or gene-content comparison before final writing. For viral or giant-virus signals, it also expects evidence for quality/completeness/contamination, taxonomy beyond best-hit labels, reference genome or protein retrieval, phylogenomic placement, and genome-size novelty checks. Current methods such as eggNOG-mapper, InterProScan/Pfam, MMseqs2, ProteinOrtho, geNomad, CheckV, GVClass, vConTACT3, and IQ-TREE are expected when installed in the workspace environment, with explicit blockers recorded when a tool or database is unavailable.
The functional and comparative gates are schema-backed. Text that says an
annotation or orthogroup analysis was done is not enough. Record evidence with
record_research_evidence and include an evidence_paths entry pointing to a JSON
manifest:
nelli.functional_annotation.v1: requiresmethod,database.name,database.version,outputs.annotations,metrics.protein_count, andmetrics.annotated_protein_count. The annotation output path must exist inside the run directory and be non-empty.nelli.comparative_genomics.v1: requiresmethod,outputs.orthogroups,outputs.presence_absence,metrics.genome_count, andmetrics.orthogroup_count. Both output paths must exist inside the run directory and be non-empty.
Every other readiness requirement is schema-backed too, so keyword text is only a
non-deciding hint. The remaining gates use a generic record-table manifest that
requires method, outputs.table, and metrics.record_count, where the output
table must exist inside the run directory and hold exactly record_count data
rows (a one-byte placeholder is rejected):
nelli.literature_reconciled.v1: result-triggered literature reconciliation.nelli.plan_refined.v1: plan refined after new evidence.nelli.viral_quality.v1: viral quality/completeness/contamination.nelli.viral_taxonomy.v1: viral taxonomy beyond best-hit labels.nelli.reference_genomes.v1: reference genomes/proteins recorded.nelli.phylogenomics.v1: phylogenomic placement.nelli.genome_size_novelty.v1: genome-size novelty checked against known viruses.nelli.phylogenetic_placement.v1: phylogenetic placement for any lineage (the domain-neutral, non-NCLDV counterpart ofnelli.phylogenomics.v1).nelli.novelty_assessment.v1: novelty quantified against references for any lineage by genome size, ANI/AAI, or marker/16S identity (the domain-neutral, non-NCLDV counterpart ofnelli.genome_size_novelty.v1).
Schema documents live in schemas/. A blocked tool/database is acceptable only
when recorded as a manifest with status: "blocked", a blocker, and
attempted_methods; if the manifest declares a failure_note, that file must
exist inside the run directory.
Write and search memory cards:
PYTHONPATH=src python -m nelli_ai_scientist memory write \
--run research_runs/<run_id> \
--kind knowledge \
--stage baseline \
--title "Metric contract is mandatory" \
--summary "Every comparable run must use the recorded metric contract."
PYTHONPATH=src python -m nelli_ai_scientist memory search \
--run research_runs/<run_id> \
--stage baseline \
--query "metric contract baseline"Preview the stage-specific context before calling a model:
PYTHONPATH=src python -m nelli_ai_scientist research-context \
--run research_runs/<run_id> \
--stage baseline \
--prompt "Plan the next comparable baseline check."Run a stage-aware research agent:
PYTHONPATH=src python -m nelli_ai_scientist run-research-agent \
--run research_runs/<run_id> \
--stage literature \
--model claude-sonnet-4-6 \
--prompt "Reconcile the newest results against the literature and refine the plan."Run a capped Ralph-style convergence loop:
NELLI_MODEL=google/glm-5 scripts/afk-nelli.sh research_runs/<run_id> 12The loop runs one stage-aware iteration at a time. It chooses literature,
refinement, implementation, analysis, verifier, or writing from the
current manuscript-readiness gaps, then stops only when convergence is ready,
paper/verification_acceptance.md contains ACCEPTED, and
paper/final_manuscript.md or .pdf exists. Planning-stage research contexts
include search_literature when that tool is available, so initial plan review
can be grounded and recorded before execution. Use
NELLI_DRY_RUN=1 scripts/nelli-once.sh research_runs/<run_id> to inspect the next
selected stage without calling a model.
For manuscript benchmarks, prefer init-research plus this capped loop over a
bare run-agent call. run-agent remains useful for ad hoc workspace tasks, but
it does not enforce convergence criteria. Agent run artifacts include the full
session.jsonl, final_response.txt, summary.json, and structured
tool_error_count / tool_errors fields. Runs with reflection checkpoints also
record initial_literature_review: true. Malformed final tool calls, malformed
tool arguments, blank final outputs, and partial failed analyses are visible
instead of being treated as clean completion.
The omics scientist roles and bio-relevant research stages include a vendored
bioinformatics skillpack guide and the full bio-* source skill directories under
vendor/omics-skills/skills/. This gives Nelli concrete workflow contracts for
foundation setup, read QC/mapping, assembly QC, MAG QC, gene calling,
annotation/taxonomy, pangenomes, phylogenomics, viromics, structure annotation,
stats/reporting, and methods documentation. The guide is prompt context, not a
Codex slash-skill runtime; Nelli executes the corresponding tools through its
file/shell/artifact interface and records blockers when a tool or database is
unavailable.
Resume from a prior research session:
PYTHONPATH=src python -m nelli_ai_scientist run-research-agent \
--run research_runs/<run_id> \
--stage analysis \
--model claude-sonnet-4-6 \
--resume analysis-20260502T120000Z \
--prompt "Continue the metric comparison."Diagnose failed research jobs:
PYTHONPATH=src python -m nelli_ai_scientist diagnose-job \
--run research_runs/<run_id> \
--job <job_id>Use the task catalog:
PYTHONPATH=src python -m nelli_ai_scientist validate-catalog
PYTHONPATH=src python -m nelli_ai_scientist list-catalog
PYTHONPATH=src python -m nelli_ai_scientist init-research \
--from-catalog dev-omics-baseline \
--run-id dev-omics-baseline-runThe benchmark harness also supports deterministic weighted rubric scoring through
case expectations.scorers and expectations.rubric. llm_judge is opt-in and
skips cleanly unless a caller wires a judge model/client.
nelli_ccco is an optional orchestrator that ships in this repo but is not part of
the core agent path: the nelli_ai_scientist package never imports nelli_ccco
(verified by tests/test_docs_accuracy.py). It is reachable only through its own
nelli-ccco console entrypoint (python -m nelli_ccco) and its own test suite.
Running run-agent / run-council / run-research-agent does not touch it.
It is a problem-agnostic harness where Claude Code or Codex plans/verifies stages
and the Codex CLI performs implementation work. It preserves durable
research_runs/<run_id>/ state, schema-backed evidence gates, claim-context
validation before final reports, event logs, cost logs, and optional per-run
memd memory.
python -m nelli_ccco init \
--title "Estuary sulfur MAGs" \
--goal "Recover MAGs and compare sulfur metabolism across salinity zones" \
--domain bio
python -m nelli_ccco run-once \
--run research_runs/<run_id> \
--orchestrator-cli claude
python -m nelli_ccco run \
--run research_runs/<run_id> \
--orchestrator-cli codex \
--max-iterations 12--domain generic is the default for non-bio work. --domain bio adds
gene-calling, functional-annotation, and comparative-genomics evidence gates. The
full memd agent skill is vendored under vendor/memd-skill/ for self-contained
local memory setup.
MkDocs Material configuration lives in mkdocs.yml and builds the docs under
docs/.
pixi run docs-buildReproducible from this repository today:
- The local dev benchmark suite in
benchmarks/dev_suite.json, backed by concrete fixture files underbenchmarks/fixtures/. - Interactive agent runs with workspace-scoped file and shell tools.
- Durable research-run workspaces under
research_runs/, including event logs, baseline/experiment artifacts, metric comparisons, job logs, and run status checks. - Unit tests under
tests/. - Run artifacts written under
runs/<run>/, including per-case responses andsummary.json. - Repo-driven execution of external benchmark checkouts once their paths are
configured in
benchmarks/external_suites.toml.
Documented here but not yet fully encapsulated in this repository:
- The SGI-Bench, ScienceAgentBench, and SciCode result reports in docs/reports/science-gym-benchmark-report.md and docs/reports/scicode-benchmark-report.md.
- Full external benchmark data, judge traces, and every generated output artifact for the published report numbers.
- The benchmark-specific adapter scripts that live in those external checkouts rather than in this package.
The local harness is reproducible from this tree, and the repo can launch configured external suites from one place. The published external benchmark claims still depend on those external checkouts and their data.
MIT