Nelli AI Scientist

Nelli is a plan-driven AI-scientist harness. An agent frames a problem, declares that run's own concrete deliverables, and is held to recorded evidence for each one. Completion is gated on the deliverables the planner declared for that run, not a fixed checklist, so the same harness applies across scientific domains. The primary application is discovery and characterization of new lineages of life from omics data: recovering and placing novel microbial and viral genomes, then holding novelty claims to evidence. The bio domain (--domain bio) layers a built-in bioinformatics requirement set on top of the planner's declared deliverables.

Nelli also runs as a benchmark harness for scientific agent prompts (see Benchmark Snapshot).

How gating works

Gating is plan-driven and domain-agnostic. The planner calls the declare_deliverables tool, which writes plan/deliverables.json. Deliverable keys are free-form and planner-defined. Readiness then requires that every declared deliverable is backed by a recorded evidence artifact that is tagged with the deliverable's exact key and carries a non-empty evidence_paths file. A prose mention of the key, or a narrative summary with no artifact file, does not satisfy a deliverable. A run that declares nothing is never trivially "ready".
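The readiness rule above can be sketched roughly as follows. This is an illustration, not Nelli's implementation: the data shapes (a plain list of declared keys, and artifact records with `deliverable_key` and `evidence_paths` fields) and both function names are assumptions made for the example. Only the exact-key match, the non-empty evidence file, and the "declares nothing is never ready" rule come from the text.

```python
from pathlib import Path


def unmet_deliverables(declared_keys, artifacts, run_dir):
    """Return declared keys that lack a recorded, file-backed evidence artifact.

    declared_keys: list of str (hypothetical shape of plan/deliverables.json).
    artifacts: list of dicts with hypothetical fields
        {"deliverable_key": str, "evidence_paths": [str, ...]}.
    """
    run_dir = Path(run_dir)
    satisfied = set()
    for artifact in artifacts:
        key = artifact.get("deliverable_key")
        for rel_path in artifact.get("evidence_paths", []):
            path = run_dir / rel_path
            # Only a real, non-empty file counts as evidence.
            if path.is_file() and path.stat().st_size > 0:
                satisfied.add(key)
                break
    # Keys are compared exactly; a near match does not satisfy a deliverable.
    return [key for key in declared_keys if key not in satisfied]


def is_ready(declared_keys, artifacts, run_dir):
    # A run that declares nothing is never trivially ready.
    if not declared_keys:
        return False
    return not unmet_deliverables(declared_keys, artifacts, run_dir)
```

Returning the list of unmet keys, rather than a bare boolean, makes it easy to report which deliverables still block completion.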

--domain bio is opt-in. It is not forced by keywords. It accepts bio, biology, omics, genomics, and metagenomics, and adds the built-in omics requirement set (functional annotation, comparative genomics, viral/lineage quality, phylogenomic placement, novelty) on top of the declared deliverables.

For the bio domain, this repo drives the vendored bioinformatics skillpack in vendor/omics-skills/. It turns those workflow contracts (read QC, assembly, MAG/viral QC, gene calling, annotation/taxonomy, pangenomes, phylogenomics, viromics) into gated, reproducible agent runs.

Backends

The agent backend is selectable via nelli.toml [backend].provider or --provider. The default backend is the OpenAI Codex SDK (--provider codex, model gpt-5.5, ChatGPT-login auth). The alternatives are claude-code (Claude Code headless via claude -p), cborg, openrouter, and local-gemma. The codex and claude-code backends both run agentically inside the project workspace.

Codex runs with maximum freedom by default: codex_sandbox defaults to danger-full-access. To restrict it, set codex_sandbox to workspace-write or read-only via nelli.toml [sandbox].codex_sandbox or $NELLI_CODEX_SANDBOX. The codex backend keeps its own execution inside the Codex sandbox. By contrast, claude-code with bypassPermissions runs as a host process under Claude Code's own permission model and is not confined by Nelli's bwrap sandbox.

Benchmark Snapshot

| Benchmark | Metric | Sonnet 4.6 | GLM-5 | GPT-5.5 |
|---|---|---|---|---|
| SGI Task 1: Deep Research | Step-Level Accuracy | 66.4% | 64.6% | 74.0% |
| SGI Task 1: Deep Research | Exact Match | 15.9% | 11.7% | 56.7% |
| SGI Task 2: Idea Generation | Overall Quality (0-10) | 6.2 | 5.5 | 8.0 |
| SGI Task 3.1: Dry Experiment | Code Correctness (est.) | 55.0% | 42.8% | 79.0% |
| SGI Task 3.2: Wet Experiment | Protocol Quality (0-10) | 10.0 | 9.8 | 8.7 † |
| SGI Task 4: Experimental Reasoning | MC Accuracy | 43.7% | 34.5% | n/a ‡ |
| ScienceAgentBench | Valid Code Generation | 100% | 100% | 100% |
| ScienceAgentBench | Tasks Completed | 10/27 | 8/27 | 10/10 § |

The GPT-5.5 column is the current default backbone (Codex SDK, gpt-5.5, reasoning_effort=xhigh, ChatGPT-login auth), run on 2026-06-04. The Sonnet 4.6 and GLM-5 columns are the original 2026-03-10 CBORG runs, kept unchanged. GPT-5.5 was scored with the same sampled, local-Claude-judge methodology as the baseline, on the first N valid baseline rows per task (Task 1 n=30, Task 2 n=24, Task 3.1 n=20, Task 3.2 n=24; ScienceAgentBench 10 bio tasks). It used the same benchmark source files but not the baseline's exact scored sample, and the per-task sizes differ from the baseline's. SGI scores are sampled LLM-judge results; read them as directional, not as full-set leaderboard placements.

Caveats:

† Task 3.2 was re-graded with a stricter rubric. The report flags the baseline's near-perfect 10.0/9.8 as "likely generous", so this is not a like-for-like comparison. GPT-5.5's protocols score 9.0/10 on completeness, graded harder.
‡ Task 4 interprets experimental figures (vision). The Codex SDK path is text-only and the SGI image data is gated, so GPT-5.5 was not evaluated on it.
§ The baseline "Tasks Completed X/27" reflects a CBORG budget cap ($50). The Codex SDK has no API budget cap, so GPT-5.5 produced valid programs for all 10 sampled bio tasks (10/10, 133–389 LOC each).

Headline numbers come from docs/reports/science-gym-benchmark-report.md. The in-repo dev suite in benchmarks/dev_suite.json runs fully from this repo, but it is a smoke harness, not a research benchmark.

Project Layout

src/nelli_ai_scientist/
  codex_sdk.py          # OpenAI Codex SDK runtime and direct chat client
  model_clients.py      # Provider selection; Codex SDK default
  cborg.py              # CBORG/OpenAI-compatible HTTP client
  agent_validation.py   # Pydantic validation for chat payloads and tool calls
  pydantic_ai_validation.py  # Optional Pydantic AI structured-output helper
  agent.py              # Multi-turn agent loop with tool dispatch
  external_benchmarks.py  # External suite config, validation, and execution
  research/             # Plan-driven deliverables, evidence schemas, convergence
  tools/                # Workspace-scoped file, shell, and literature tools
  tool_call_parser.py   # Native + text fallback tool parsing
  harness.py            # Benchmark loading, prompting, scoring, run artifacts
  cli.py                # validate / run-benchmark / run-agent / run-council / external suites
src/nelli_ccco/
  cli.py                # Claude/Codex orchestrator plus Codex worker harness
  domains/              # Generic and bio domain packs for stage gates
  contracts.py          # Required output, worker manifest, and claim gates
  loop.py               # Capped convergence loop over durable research runs
benchmarks/             # Local dev suite, fixtures, and external-suite config template
tests/                  # Unit tests for client, harness, tools, and agent loop
vendor/                 # Vendored prompt and skill context
docs/reports/           # Benchmark writeups and methodology notes
scripts/                # Helper scripts for local execution

Workflow

The same core loop powers interactive agent runs and benchmark runs. Benchmark mode adds case loading, expectation-based scoring, and a summary.json artifact.

flowchart LR
  A[roles.json + vendor] --> C[build_messages]
  B[dev_suite.json case] --> C
  C --> D[Model client]
  D --> E[LLM]
  E --> F{tool call?}
  F -->|yes| G[ToolRegistry]
  G --> H[workspace tools]
  H --> D
  F -->|no| I[final answer]
  B --> J[score_response]
  I --> J
  J --> K[runs summary.json]

Requirements

Python 3.11–3.12 (provided by the project's pixi / conda-forge environment).
For the default codex provider: the Codex CLI on PATH and a logged-in Codex/ChatGPT account. The Codex CLI is a Node.js tool published as @openai/codex:
```
npm install -g @openai/codex   # needs a recent Node.js
codex login                    # complete the ChatGPT sign-in once
codex --version                # confirm it is on PATH
```
Nelli locates the binary via PATH (override with NELLI_CODEX_BINARY) and defaults to model gpt-5.5 at reasoning_effort=xhigh.
For the cborg / OpenAI-compatible HTTP path only: a CBORG or OpenAI-compatible API key (see Configure a model provider).

Quickstart

# 1. install the environment (editable package + the openai-codex SDK)
pixi install

# 2. log the Codex CLI in (one time; see Requirements)
codex login

# 3. sanity-check the install
pixi run validate

# 4. run the agent on a task: question in --prompt, data in --workspace
pixi run python -m nelli_ai_scientist run-agent \
  --provider codex --model gpt-5.5 \
  --role omics_scientist \
  --prompt "Analyze the FASTA files in the workspace and compare gene content across bins" \
  --workspace /path/to/your/data

--prompt is the task as inline text. run-council also accepts --prompt-file <path> for long prompts; run-agent is inline --prompt only.
--workspace is the directory the agent may read and write (default ./workspace).
Results land under runs/<timestamp>-<model>-agent/ (transcript session.jsonl, summary.json).
Under pixi the package is installed editable, so PYTHONPATH=src is not needed. The PYTHONPATH=src python -m nelli_ai_scientist … form used in the rest of this README is the equivalent for a plain venv.

Setup

1. Configure a model provider

Codex SDK is the default provider. It reuses an existing Codex/ChatGPT login (set up once with codex login; see Requirements) and needs no CBORG key:

PYTHONPATH=src python -m nelli_ai_scientist run-agent \
  --provider codex \
  --model gpt-5.5 \
  --role omics_scientist \
  --prompt "Analyze the FASTA files in the workspace" \
  --workspace /path/to/data

For the CBORG path, keep CBORG credentials in ~/.secrets/apis.txt and use the helper:

source scripts/use-cborg.sh

Or set the OpenAI-compatible variables directly:

export OPENAI_API_KEY="your-api-key"
export OPENAI_BASE_URL="https://your-api-endpoint.com"

CBORG_API_KEY and CBORG_BASE_URL are also supported for --provider cborg. If no base URL is set, the CBORG client defaults to https://api.cborg.lbl.gov.

Provider modes:

codex: default. OpenAI Codex Python SDK with the local Codex authentication/session. Defaults to gpt-5.5 unless --model, NELLI_MODEL, or NELLI_CODEX_MODEL is set.
claude-code: Claude Code headless via claude -p, run agentically in the workspace. Tune [backend].claude_permission_mode / claude_binary (env NELLI_CLAUDE_PERMISSION_MODE / NELLI_CLAUDE_BINARY).
cborg: uses CBORG_API_KEY or OPENAI_API_KEY. With no model given (no --model, NELLI_MODEL, or CBORG_MODEL), it defaults to the free, LBL-hosted lbl/cborg-coder, so runs do not spend the paid budget by accident. Pass --model (for example claude-sonnet, claude-opus) to opt into a paid model.
openrouter: uses OPENROUTER_API_KEY, plus optional OPENROUTER_BASE_URL.
local-gemma: local vLLM Gemma service at http://127.0.0.1:8010, default model gemma4-31b.

The local Gemma defaults match the ~/bester-hosting/services/gemma4 deployment: the OpenAI API base is http://127.0.0.1:8010/v1, /health returns 200, and /v1/models lists gemma4-31b. That deployment runs without an OpenAI tool-call parser, so Nelli disables native tool calls for --provider local-gemma and relies on text tool-call fallback parsing.
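A quick pre-flight check of those two endpoints might look like the sketch below. The function names and the injectable `fetch` parameter are assumptions made for the example, and the response parsing assumes the usual OpenAI-compatible `/v1/models` shape (a `data` list of objects with an `id`). Nelli's own checks, if any, may differ.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8010"  # local Gemma service address from the README


def http_get(url, timeout=5.0):
    """Return (status_code, body_text) for a GET request."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8")


def gemma_is_serving(model="gemma4-31b", base_url=BASE_URL, fetch=http_get):
    """True if /health returns 200 and /v1/models lists the model."""
    try:
        status, _ = fetch(f"{base_url}/health")
        if status != 200:
            return False
        _, body = fetch(f"{base_url}/v1/models")
        # Assumed OpenAI-compatible shape: {"data": [{"id": "..."}, ...]}
        listed = {entry.get("id") for entry in json.loads(body).get("data", [])}
    except (OSError, ValueError):
        # Connection failures (URLError is an OSError) and bad JSON both mean "not ready".
        return False
    return model in listed
```

Passing `fetch` as a parameter keeps the check testable without a running service.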

Run against local Gemma without an API key:

PYTHONPATH=src python -m nelli_ai_scientist run-agent \
  --provider local-gemma \
  --role omics_scientist \
  --pydantic-ai-final-validation \
  --prompt "Inspect the workspace and summarize the available files." \
  --workspace /path/to/data

Provider options are available on run-benchmark, run-agent, and run-research-agent:

PYTHONPATH=src python -m nelli_ai_scientist run-research-agent \
  --provider local-gemma \
  --run research_runs/<run_id> \
  --stage analysis \
  --prompt "Continue the next convergence step."

For the loop script, set NELLI_PROVIDER=local-gemma; the script lets the provider default the model to gemma4-31b when NELLI_MODEL and run metadata do not specify one:

NELLI_PROVIDER=local-gemma scripts/nelli-once.sh research_runs/<run_id>

Some model routes reject sampling fields that older routes accept. Nelli omits temperature for claude-opus-4-7; use --omit-temperature with run-agent or run-research-agent for any other route that rejects the temperature field.
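The idea of omitting the field can be illustrated with a small payload builder. This is a sketch under assumptions: the function name, the default temperature value, and the set-based lookup are hypothetical. Only the claude-opus-4-7 special case and the opt-in omit flag come from the text above.

```python
# Routes that the README says reject the temperature field.
OMIT_TEMPERATURE_MODELS = {"claude-opus-4-7"}


def build_chat_payload(model, messages, temperature=0.2, omit_temperature=False):
    """Build a chat-completions style payload, leaving out temperature when required.

    The 0.2 default is a placeholder for illustration, not Nelli's setting.
    """
    payload = {"model": model, "messages": list(messages)}
    if not (omit_temperature or model in OMIT_TEMPERATURE_MODELS):
        payload["temperature"] = temperature
    return payload
```

Leaving the key out entirely, rather than sending `null` or `0`, matters because some routes reject the field's mere presence.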

2. Configure external benchmark suite locations

The repo drives external benchmark checkouts through a local TOML config. A versioned template lives at benchmarks/external_suites.toml.example; the default local config path is benchmarks/external_suites.toml.

List configured suites:

PYTHONPATH=src python -m nelli_ai_scientist list-external-benchmarks

Preview a configured command without running it:

PYTHONPATH=src python -m nelli_ai_scientist run-external-benchmark \
  scienceagentbench \
  --dry-run \
  --set model=claude-sonnet-4-6

Run a configured external suite command:

PYTHONPATH=src python -m nelli_ai_scientist run-external-benchmark \
  sgi_bench \
  --set model=claude-sonnet-4-6

3. Validate the repo and run tests

PYTHONPATH=src python -m nelli_ai_scientist validate
PYTHONPATH=src python -m unittest discover -s tests -v

validate also checks benchmarks/external_suites.toml when that local file is present.

Runtime contract verification

These code-aligned checks confirm the harness still matches its documented runtime contract (providers, validation boundaries, reflection/literature checkpoints, run artifacts):

pixi run test
pixi run validate
pixi run validate-catalog
pixi run python -m compileall -q src tests
git diff --check

The memory channels backing these runs (cross-session memd, per-run CCCO memory, the MEMORY.md failure log, and tasks/lessons.md) are mapped in docs/memory.md.

4. Run the in-repo benchmark harness

PYTHONPATH=src python -m nelli_ai_scientist run-benchmark \
  --model claude-opus-4-6 \
  --output-dir runs/opus46-smoke

For a single smoke case:

PYTHONPATH=src python -m nelli_ai_scientist run-benchmark \
  --model claude-opus-4-6 \
  --case omics_mag_plan \
  --output-dir runs/opus46-omics-smoke

5. Run the agent interactively

PYTHONPATH=src python -m nelli_ai_scientist run-agent \
  --model claude-sonnet-4-6 \
  --role omics_scientist \
  --prompt "Analyze the FASTA files in the workspace" \
  --workspace /path/to/data

run-agent writes a durable session.jsonl transcript and summary.json under runs/<timestamp>-<model>-agent/ by default. Use --output-dir to choose the run artifact directory.

Sandbox, network, and databases

Shell tool execution is confined by bubblewrap (tools/sandbox.py). Three properties matter for analysis runs:

Network is on by default for agent runs. The agent and council can reach the internet (to download, install, and run tools in the workspace) without a flag. Pass --no-allow-network to isolate a run. The underlying SandboxPolicy defaults to network-off, so isolation is the floor; the CLI opens the network for interactive agent runs at the registration sites.
The project/workspace dir is writable. The host root is read-only and credential dirs (~/.ssh, ~/.secrets, ~/.aws, and similar) are masked.
The shared reference-database dir is mounted read-only inside the sandbox and exported to tools as $DB_PATH (also $BIO_DB_ROOT, $NELLI_SHARED_DB_ROOT). Its location resolves in order: env DB_PATH > nelli.toml [sandbox].db_path > /media/shared-expansion/db.

DB_PATH lives in ~/.secrets/nelli-ai-scientist.env (loaded at CLI startup; the real environment always wins over the file). Copy nelli.toml.example to nelli.toml (gitignored) to override db_path, allow_network, and the Codex codex_sandbox / codex_network settings without touching the environment.

5a. Run the multi-agent research council

run-council runs a sequential review council (omics scientist, literature expert, skeptic, methods/reproducibility reviewer), then hands off to a single AgentLoop executor that does the workspace-mutating analysis. The council shares the executor engine with run-agent.

PYTHONPATH=src python -m nelli_ai_scientist run-council \
  --provider codex --model gpt-5.5 \
  --prompt "Characterize the FASTA assemblies in the workspace" \
  --workspace /path/to/data

Key flags:

--live-council drives each council role with a live model call (real deliberation) instead of the offline deterministic review.
--enforce-scaffold / --no-enforce-scaffold (council default: on) make the scientific scaffold a hard requirement (see below).
Other flags: --role-set omics, --mode {sequential,debug,...}, --max-rounds, --allow-network, and the shared --provider / model / turn-budget flags.

Scaffold enforcement (`--enforce-scaffold`)

When enabled, a run cannot finalize until the full scientific scaffold is present. Hard gates in the executor enforce this, not prompt text alone:

Literature first: the first tool call must be search_literature_dovmed (the polars-dovmed PMC/bioRxiv backend). Analysis tools and the PubMed search_literature tool are rejected until it runs.
≥5 hypotheses: results/hypotheses.jsonl must hold at least five distinct hypotheses, each with a next_discriminating_analysis.
Research plan: results/research_plan.json must give a discriminating_test for every hypothesis.
Reflection: a results/reflection_<n>.json must assess every hypothesis against the evidence.
Hypothesis adjustment: every hypothesis must leave the initial active status (supported / weakened / rejected / superseded / unresolved).
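The hypothesis-related gates in this list could be checked along the following lines. This is a sketch only; the real validators live in council/ledger.py and are not reproduced here. The `statement` and `status` field names, and the way distinctness is judged (case-insensitive text match), are assumptions. `next_discriminating_analysis` and the status values come from the list above.

```python
import json

# Statuses that count as having left the initial "active" state.
NON_INITIAL_STATUSES = {"supported", "weakened", "rejected", "superseded", "unresolved"}


def load_hypotheses(jsonl_text):
    """Parse one JSON object per non-blank line, as in results/hypotheses.jsonl."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def scaffold_problems(hypotheses, minimum=5):
    """Return human-readable reasons the hypothesis gates would not pass."""
    problems = []
    # "statement" is a hypothetical field name for the hypothesis text.
    distinct = {h.get("statement", "").strip().lower() for h in hypotheses}
    distinct.discard("")
    if len(distinct) < minimum:
        problems.append(
            f"need at least {minimum} distinct hypotheses, found {len(distinct)}"
        )
    for index, hypothesis in enumerate(hypotheses):
        if not hypothesis.get("next_discriminating_analysis"):
            problems.append(f"hypothesis {index} lacks next_discriminating_analysis")
        if hypothesis.get("status") not in NON_INITIAL_STATUSES:
            problems.append(f"hypothesis {index} is still in its initial status")
    return problems
```

An empty list means these particular gates pass; the research-plan and reflection files would need their own checks.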

run-agent --enforce-scaffold applies the same gates (opt-in; off by default so the benchmark suite is unaffected). The validators live in council/ledger.py; the finalization gate is in agent.py and gates.py:gate_scaffold_deliverable.

run-council is the discovery-grade default; run-agent and run-research-agent are generic unless --enforce-scaffold is passed. For real lineage-discovery work, run run-council (or pass --enforce-scaffold) so the literature-first, hypothesis, plan, reflection, and finalization gates are enforced rather than left to prompt text. This scaffold layer is separate from the plan-driven deliverables gate above; a run can use either, or both.

Pydantic and Pydantic AI validation

Nelli validates model I/O at several generic boundaries:

CodexSDKChatClient is the default runtime for --provider codex.
CborgClient validates outgoing chat payload shape and incoming chat-completion response envelopes with Pydantic.
AgentLoop validates outgoing message history, parsed native/text tool calls, and final text before accepting completion.
run-agent and run-research-agent can add a conservative Pydantic AI final-answer validator with --pydantic-ai-final-validation. This validator may confirm the final text but cannot rewrite it.
Malformed tool arguments are never executed as {}. They consume the configured retry budget and either recover through a corrected model response or stop with stopped_reason: "error".
Blank final responses retry or fail instead of counting as clean completion.
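The rule that malformed tool arguments are never executed as {} can be illustrated with a strict parser and a retry budget. The names below (`MalformedToolArguments`, `parse_tool_arguments`, `arguments_with_retries`) are hypothetical and the control flow is simplified; only the behavior (no silent empty dict, retries consumed, `stopped_reason: "error"` on exhaustion) follows the list above.

```python
import json


class MalformedToolArguments(ValueError):
    """Raised instead of silently substituting an empty argument dict."""


def parse_tool_arguments(raw):
    """Parse a raw JSON argument string; refuse anything that is not a JSON object."""
    try:
        parsed = json.loads(raw)
    except (TypeError, json.JSONDecodeError) as exc:
        raise MalformedToolArguments(str(exc)) from exc
    if not isinstance(parsed, dict):
        raise MalformedToolArguments("tool arguments must be a JSON object")
    return parsed


def arguments_with_retries(attempts, max_retries=2):
    """Try successive model responses until one parses or the budget runs out.

    attempts: iterable of raw argument strings, one per model response.
    The first attempt is free; each later attempt uses one retry.
    """
    for used, raw in enumerate(attempts):
        if used > max_retries:
            break
        try:
            return {"arguments": parse_tool_arguments(raw), "stopped_reason": None}
        except MalformedToolArguments:
            continue  # a corrected response may follow on the next attempt
    return {"arguments": None, "stopped_reason": "error"}
```

Raising on bad input, rather than defaulting to `{}`, keeps a truncated argument string from turning into a tool call with no arguments.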

For model-authored structured output beyond the core agent loop, use the optional Pydantic AI helper. Local Gemma should use prompted output unless the vLLM structured-output behavior is revalidated after a serving change.

from nelli_ai_scientist.pydantic_ai_validation import (
    ValidatedTextOutput,
    build_pydantic_ai_output_agent,
)

agent = build_pydantic_ai_output_agent(
    provider="local-gemma",
    output_type=ValidatedTextOutput,
    output_mode="prompted",
)
result = agent.run_sync("Return a short validated status message.")
validated = result.output

6. Create a durable research run

Research runs keep plans, events, artifacts, metric contracts, jobs, and summaries in one workspace under research_runs/.

PYTHONPATH=src python -m nelli_ai_scientist init-research \
  --title "Estuary sulfur MAG analysis" \
  --goal "Recover MAGs and compare sulfur metabolism across salinity zones" \
  --primary-metric f1

Record a baseline with a metric contract:

PYTHONPATH=src python -m nelli_ai_scientist record-baseline \
  --run research_runs/<run_id> \
  --title "Initial baseline" \
  --metrics '{"f1": 0.70}' \
  --metric-contract research_runs/<run_id>/metric_contract.json

Record a comparable experiment:

PYTHONPATH=src python -m nelli_ai_scientist record-experiment \
  --run research_runs/<run_id> \
  --baseline <baseline_artifact_id> \
  --hypothesis "Improve the primary metric with one controlled change" \
  --metrics '{"f1": 0.73}'

Run and inspect a durable research job:

PYTHONPATH=src python -m nelli_ai_scientist research-exec start \
  --run research_runs/<run_id> -- python script.py

PYTHONPATH=src python -m nelli_ai_scientist research-exec list \
  --run research_runs/<run_id>

PYTHONPATH=src python -m nelli_ai_scientist research-status \
  --run research_runs/<run_id>

Stage-aware research runs reuse durable memory, compact prior session context, reduce tools by stage, and expose research-run tools for memory, artifact inspection, and diagnostics.

Research agents carry an iterative discovery protocol. When --reflection-checkpoints is enabled and search_literature is available, the agent performs an initial literature/current-methods search while reviewing the initial plan, before any substantive analysis tool or final answer is accepted. It then records a concise initial literature summary and adjusted plan with write_file; if it bundles other calls with that record, only the record call runs until the checkpoint is satisfied. After each later material result, the agent reconciles the finding against literature/current methods, refines the plan, runs the strongest missing follow-up analysis, and repeats until manuscript readiness is no longer blocked.

Two literature tools exist. search_literature queries NCBI PubMed E-utilities (no API key, citation metadata only). search_literature_dovmed queries open-access full text (PMC / bioRxiv) through the polars-dovmed backend and is the tool required under --enforce-scaffold; it shells out to the polars-dovmed helper (NELLI_DOVMED_SCRIPT / NELLI_DOVMED_PYTHON override its location). Run separate --corpus pmc and --corpus biorxiv queries rather than combining them. The writing stage refuses to run by default until convergence evidence is recorded; use --allow-unconverged-writing only for draft scaffolds that should list missing evidence rather than claim completion.
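For orientation, a keyless PubMed query through the public NCBI E-utilities `esearch` endpoint can be built as shown below. This only illustrates the kind of request involved; how search_literature actually constructs its requests is not described here, and the function name and default result count are assumptions.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_esearch_url(query, max_results=20):
    """Build an esearch URL that returns matching PubMed IDs as JSON."""
    params = {
        "db": "pubmed",       # search the PubMed database
        "term": query,        # free-text query
        "retmode": "json",    # ask for a JSON response
        "retmax": max_results,
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"
```

The response to such a query carries identifiers only; citation metadata requires a follow-up request, which is consistent with the "citation metadata only" scope noted above.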

For bio/omics signals, the readiness checklist expects gene/protein functional annotation plus comparative genomics or gene-content comparison before final writing. For viral or giant-virus signals, it also expects evidence for quality/completeness/contamination, taxonomy beyond best-hit labels, reference genome or protein retrieval, phylogenomic placement, and genome-size novelty checks. Current methods such as eggNOG-mapper, InterProScan/Pfam, MMseqs2, ProteinOrtho, geNomad, CheckV, GVClass, vConTACT3, and IQ-TREE are expected when installed in the workspace environment, with explicit blockers recorded when a tool or database is unavailable.

The functional and comparative gates are schema-backed. Text that says an annotation or orthogroup analysis was done is not enough. Record evidence with record_research_evidence and include an evidence_paths entry pointing to a JSON manifest:

nelli.functional_annotation.v1: requires method, database.name, database.version, outputs.annotations, metrics.protein_count, and metrics.annotated_protein_count. The annotation output path must exist inside the run directory and be non-empty.
nelli.comparative_genomics.v1: requires method, outputs.orthogroups, outputs.presence_absence, metrics.genome_count, and metrics.orthogroup_count. Both output paths must exist inside the run directory and be non-empty.
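As a rough illustration of what "schema-backed" means here, the sketch below checks a nelli.functional_annotation.v1 manifest for the required fields and for a non-empty annotation file inside the run directory. It is not the validator shipped in the repo: the function names are hypothetical and the manifest is assumed to be nested JSON matching the dotted field names. The comparative-genomics manifest would follow the same pattern with its own field list.

```python
from pathlib import Path

REQUIRED_FIELDS = [
    "method",
    "database.name",
    "database.version",
    "outputs.annotations",
    "metrics.protein_count",
    "metrics.annotated_protein_count",
]


def lookup(manifest, dotted):
    """Follow a dotted key such as 'database.name' through nested dicts."""
    node = manifest
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def functional_annotation_errors(manifest, run_dir):
    """Return a list of problems; an empty list means the manifest passes."""
    errors = [f"missing {name}" for name in REQUIRED_FIELDS if lookup(manifest, name) is None]
    annotations = lookup(manifest, "outputs.annotations")
    if annotations is not None:
        run_root = Path(run_dir).resolve()
        target = (run_root / annotations).resolve()
        # Reject paths that escape the run directory (for example via "..").
        if not target.is_relative_to(run_root):
            errors.append("annotation output is outside the run directory")
        elif not target.is_file() or target.stat().st_size == 0:
            errors.append("annotation output is missing or empty")
    return errors
```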

Every other readiness requirement is schema-backed too, so keyword text is only a non-deciding hint. The remaining gates use a generic record-table manifest that requires method, outputs.table, and metrics.record_count, where the output table must exist inside the run directory and hold exactly record_count data rows (a one-byte placeholder is rejected):

nelli.literature_reconciled.v1: result-triggered literature reconciliation.
nelli.plan_refined.v1: plan refined after new evidence.
nelli.viral_quality.v1: viral quality/completeness/contamination.
nelli.viral_taxonomy.v1: viral taxonomy beyond best-hit labels.
nelli.reference_genomes.v1: reference genomes/proteins recorded.
nelli.phylogenomics.v1: phylogenomic placement.
nelli.genome_size_novelty.v1: genome-size novelty checked against known viruses.
nelli.phylogenetic_placement.v1: phylogenetic placement for any lineage (the domain-neutral, non-NCLDV counterpart of nelli.phylogenomics.v1).
nelli.novelty_assessment.v1: novelty quantified against references for any lineage by genome size, ANI/AAI, or marker/16S identity (the domain-neutral, non-NCLDV counterpart of nelli.genome_size_novelty.v1).
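The generic record-table rule (the table must hold exactly record_count data rows) can be sketched as follows. The table format is an assumption: a tab-separated file with a single header line, which the source does not specify. The function names are hypothetical, and the run-directory containment check is omitted for brevity.

```python
import csv
from pathlib import Path


def count_data_rows(table_path, delimiter="\t"):
    """Count non-blank rows after the header line (one header line is assumed)."""
    with Path(table_path).open(newline="") as handle:
        rows = [
            row
            for row in csv.reader(handle, delimiter=delimiter)
            if any(cell.strip() for cell in row)
        ]
    return max(len(rows) - 1, 0)


def record_table_errors(manifest, run_dir):
    """Check method, outputs.table, and metrics.record_count against the table file."""
    errors = []
    table = manifest.get("outputs", {}).get("table")
    expected = manifest.get("metrics", {}).get("record_count")
    if not manifest.get("method"):
        errors.append("missing method")
    if table is None:
        errors.append("missing outputs.table")
    if expected is None:
        errors.append("missing metrics.record_count")
    if errors:
        return errors
    path = Path(run_dir) / table
    if not path.is_file():
        return ["output table does not exist"]
    found = count_data_rows(path)
    if found != expected:
        errors.append(f"table holds {found} data rows, manifest says {expected}")
    return errors
```

Under this reading, a one-byte placeholder file yields zero data rows and fails any manifest that claims a positive record_count.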

Schema documents live in schemas/. A blocked tool/database is acceptable only when recorded as a manifest with status: "blocked", a blocker, and attempted_methods; if the manifest declares a failure_note, that file must exist inside the run directory.

Write and search memory cards:

PYTHONPATH=src python -m nelli_ai_scientist memory write \
  --run research_runs/<run_id> \
  --kind knowledge \
  --stage baseline \
  --title "Metric contract is mandatory" \
  --summary "Every comparable run must use the recorded metric contract."

PYTHONPATH=src python -m nelli_ai_scientist memory search \
  --run research_runs/<run_id> \
  --stage baseline \
  --query "metric contract baseline"

Preview the stage-specific context before calling a model:

PYTHONPATH=src python -m nelli_ai_scientist research-context \
  --run research_runs/<run_id> \
  --stage baseline \
  --prompt "Plan the next comparable baseline check."

Run a stage-aware research agent:

PYTHONPATH=src python -m nelli_ai_scientist run-research-agent \
  --run research_runs/<run_id> \
  --stage literature \
  --model claude-sonnet-4-6 \
  --prompt "Reconcile the newest results against the literature and refine the plan."

Run a capped Ralph-style convergence loop:

NELLI_MODEL=google/glm-5 scripts/afk-nelli.sh research_runs/<run_id> 12

The loop runs one stage-aware iteration at a time. It chooses literature, refinement, implementation, analysis, verifier, or writing from the current manuscript-readiness gaps, then stops only when convergence is ready, paper/verification_acceptance.md contains ACCEPTED, and paper/final_manuscript.md or .pdf exists. Planning-stage research contexts include search_literature when that tool is available, so initial plan review can be grounded and recorded before execution. Use NELLI_DRY_RUN=1 scripts/nelli-once.sh research_runs/<run_id> to inspect the next selected stage without calling a model.
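The loop's stop condition combines three checks, which can be sketched as below. This is a minimal illustration: the function name is hypothetical, convergence readiness is passed in as a boolean rather than computed, ".pdf" is read as paper/final_manuscript.pdf, and the ACCEPTED test is a plain substring match, which may be looser than the real check.

```python
from pathlib import Path


def loop_should_stop(run_dir, convergence_ready):
    """True only when all three documented stop conditions hold."""
    paper = Path(run_dir) / "paper"
    acceptance = paper / "verification_acceptance.md"
    # Assumption: a simple substring test for the ACCEPTED marker.
    accepted = acceptance.is_file() and "ACCEPTED" in acceptance.read_text(encoding="utf-8")
    manuscript_exists = (
        (paper / "final_manuscript.md").is_file()
        or (paper / "final_manuscript.pdf").is_file()
    )
    return bool(convergence_ready and accepted and manuscript_exists)
```

Requiring all three together means an accepted verification note alone, or a manuscript alone, never ends the loop.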

For manuscript benchmarks, prefer init-research plus this capped loop over a bare run-agent call. run-agent remains useful for ad hoc workspace tasks, but it does not enforce convergence criteria. Agent run artifacts include the full session.jsonl, final_response.txt, summary.json, and structured tool_error_count / tool_errors fields. Runs with reflection checkpoints also record initial_literature_review: true. Malformed final tool calls, malformed tool arguments, blank final outputs, and partial failed analyses are visible instead of being treated as clean completion.

The omics scientist roles and bio-relevant research stages include a vendored bioinformatics skillpack guide and the full bio-* source skill directories under vendor/omics-skills/skills/. This gives Nelli concrete workflow contracts for foundation setup, read QC/mapping, assembly QC, MAG QC, gene calling, annotation/taxonomy, pangenomes, phylogenomics, viromics, structure annotation, stats/reporting, and methods documentation. The guide is prompt context, not a Codex slash-skill runtime; Nelli executes the corresponding tools through its file/shell/artifact interface and records blockers when a tool or database is unavailable.

Resume from a prior research session:

PYTHONPATH=src python -m nelli_ai_scientist run-research-agent \
  --run research_runs/<run_id> \
  --stage analysis \
  --model claude-sonnet-4-6 \
  --resume analysis-20260502T120000Z \
  --prompt "Continue the metric comparison."

Diagnose failed research jobs:

PYTHONPATH=src python -m nelli_ai_scientist diagnose-job \
  --run research_runs/<run_id> \
  --job <job_id>

Use the task catalog:

PYTHONPATH=src python -m nelli_ai_scientist validate-catalog
PYTHONPATH=src python -m nelli_ai_scientist list-catalog
PYTHONPATH=src python -m nelli_ai_scientist init-research \
  --from-catalog dev-omics-baseline \
  --run-id dev-omics-baseline-run

The benchmark harness also supports deterministic weighted rubric scoring through case expectations.scorers and expectations.rubric. llm_judge is opt-in and skips cleanly unless a caller wires a judge model/client.

CCCO Orchestration Harness

nelli_ccco is an optional orchestrator that ships in this repo but is not part of the core agent path: the nelli_ai_scientist package never imports nelli_ccco (verified by tests/test_docs_accuracy.py). It is reachable only through its own nelli-ccco console entrypoint (python -m nelli_ccco) and its own test suite. Running run-agent / run-council / run-research-agent does not touch it.

It is a problem-agnostic harness where Claude Code or Codex plans/verifies stages and the Codex CLI performs implementation work. It preserves durable research_runs/<run_id>/ state, schema-backed evidence gates, claim-context validation before final reports, event logs, cost logs, and optional per-run memd memory.

python -m nelli_ccco init \
  --title "Estuary sulfur MAGs" \
  --goal "Recover MAGs and compare sulfur metabolism across salinity zones" \
  --domain bio

python -m nelli_ccco run-once \
  --run research_runs/<run_id> \
  --orchestrator-cli claude

python -m nelli_ccco run \
  --run research_runs/<run_id> \
  --orchestrator-cli codex \
  --max-iterations 12

--domain generic is the default for non-bio work. --domain bio adds gene-calling, functional-annotation, and comparative-genomics evidence gates. The full memd agent skill is vendored under vendor/memd-skill/ for self-contained local memory setup.

Documentation Site

MkDocs Material configuration lives in mkdocs.yml and builds the docs under docs/.

pixi run docs-build

Reproducibility Notes

Reproducible from this repository today:

The local dev benchmark suite in benchmarks/dev_suite.json, backed by concrete fixture files under benchmarks/fixtures/.
Interactive agent runs with workspace-scoped file and shell tools.
Durable research-run workspaces under research_runs/, including event logs, baseline/experiment artifacts, metric comparisons, job logs, and run status checks.
Unit tests under tests/.
Run artifacts written under runs/<run>/, including per-case responses and summary.json.
Repo-driven execution of external benchmark checkouts once their paths are configured in benchmarks/external_suites.toml.

Documented here but not yet fully encapsulated in this repository:

The SGI-Bench, ScienceAgentBench, and SciCode result reports in docs/reports/science-gym-benchmark-report.md and docs/reports/scicode-benchmark-report.md.
Full external benchmark data, judge traces, and every generated output artifact for the published report numbers.
The benchmark-specific adapter scripts that live in those external checkouts rather than in this package.

The local harness is reproducible from this tree, and the repo can launch configured external suites from one place. The published external benchmark claims still depend on those external checkouts and their data.

License

MIT
