docs(paper): preserve archived JSS §7 agent-eval material under Paper-AgentBench

brycewang-stanford · brycewang-stanford · commit 12860efd509d · 2026-05-05T20:28:36.000-07:00
Companion-paper workspace for CausalAgentBench, the LLM-agent behavioural
RCT spun out of the JSS submission during the P0-3 reviewer-response trim.
Splitting into a separate manuscript at the JSS revision stage preserves
the OSF pre-registration's scientific value (deposited before trials run)
and lets the JSS paper close under a single, falsifiable scope.

  archive-from-jss/
    jss-section7-agent-eval-original.tex   (130 lines, pre-trim §7)
    jss-context-around-causalagentbench.tex (§1.3 item 4 + §9 mentions)
  manuscript/
    notes/osf-preregistration.md            (working OSF pre-reg copy)

Manuscript drafting itself is gated on (i) OSF deposit, (ii) production-API
budget approval, and (iii) JSS submission of the parent paper; only the
archive and protocol notes land here today.
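
Reviewer note: the 900-trial figure quoted in the archived §7 can be
re-derived from the design that section describes (fifty prompts, a 3 x 2
toolset-by-model factorial, three repetitions per prompt and cell, five
Bonferroni-corrected hypotheses at alpha = 0.05). The sketch below is only a
sanity check on that arithmetic; the cell labels and variable names are
illustrative placeholders, not identifiers from `tests/agent_bench/`.

```python
from itertools import product

# Counts are taken from the archived JSS §7 text.
# All labels below are hypothetical placeholders, not names used by the harness.
PROMPTS_PER_LEVEL = {"L1": 20, "L2": 20, "L3": 10}
TOOLSETS = ["statspai-mcp", "python-stack", "r-via-mcp"]
MODELS = ["claude-frozen", "gpt-frozen"]
REPETITIONS = 3
N_HYPOTHESES = 5
ALPHA = 0.05

# 3 toolsets crossed with 2 frozen models gives the six experimental cells.
cells = list(product(TOOLSETS, MODELS))
n_prompts = sum(PROMPTS_PER_LEVEL.values())
n_trials = n_prompts * len(cells) * REPETITIONS

# Bonferroni correction: split the family-wise alpha across the hypotheses.
bonferroni_alpha = ALPHA / N_HYPOTHESES

print(f"{len(cells)} cells, {n_prompts} prompts, {n_trials} trials")
print(f"per-hypothesis alpha = {bonferroni_alpha:.3f}")
```

With the counts above this gives 6 cells, 50 prompts, and 900 trials, with
each of the five hypotheses tested at 0.010.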
diff --git a/Paper-AgentBench/README.md b/Paper-AgentBench/README.md
@@ -0,0 +1,54 @@
+# Paper-AgentBench — Companion paper workspace
+
+This directory is the working home for the **CausalAgentBench** companion
+benchmark paper, a planned follow-up to the StatsPAI JSS submission. It
+exists so that material we deliberately removed from the JSS draft
+(the production RCT for LLM-agent behavioural evaluation, the 900-trial
+$3 \times 2$ factorial design, the OSF pre-registration protocol, the
+mock-LLM dry-run results) is **preserved verbatim**, not deleted.
+
+The split follows the original two-paper plan recorded in
+`Paper-JSS/JSS-research-plan.md`: JSS publishes the unified-package and
+parity story; the companion paper publishes the agent-behavioural RCT.
+Splitting them into separate manuscripts at the JSS revision stage
+preserves the pre-registration's scientific value (CausalAgentBench is
+deposited on OSF before its trials run) and lets the JSS paper close
+under a single, falsifiable scope.
+
+## Layout
+
+```
+Paper-AgentBench/
+├── README.md                                       (this file)
+├── archive-from-jss/                               (preserved verbatim)
+│   ├── jss-section7-agent-eval-original.tex        (JSS §7, 130 lines, pre-trim)
+│   └── jss-context-around-causalagentbench.tex     (§1.3 item 4 + §9 mentions)
+└── manuscript/                                     (companion-paper draft home)
+    ├── sections/                                   (empty; populated when drafting begins)
+    └── notes/
+        └── osf-preregistration.md                  (copy of Paper-JSS/notes/...)
+```
+
+## Status (2026-05-05)
+
+- **Archive**: complete. Every passage cut from the JSS draft during the
+  P0-3 reviewer-response trim is preserved under `archive-from-jss/`.
+- **Manuscript**: not yet drafted. Drafting is gated on
+  (i) OSF pre-registration deposit, (ii) production-API budget approval,
+  and (iii) JSS submission of the parent paper.
+- **Pre-registration protocol**: see
+  `manuscript/notes/osf-preregistration.md` (also kept in
+  `Paper-JSS/notes/` for legacy paths).
+
+## Working principles
+
+1. **Never delete archive content.** If the companion paper's draft
+   diverges from the archive, leave the archive intact and reword in
+   `manuscript/sections/` instead.
+2. **JSS draft references the companion paper as a forward pointer.**
+   The JSS §7 / §1.3 / §9 passages now redirect readers here rather
+   than promising results that have not yet been collected.
+3. **OSF pre-registration is the canonical protocol.** Any future
+   methodological change to CausalAgentBench is tracked through OSF
+   amendments, with `manuscript/notes/osf-preregistration.md` as the
+   working copy.
diff --git a/Paper-AgentBench/archive-from-jss/jss-context-around-causalagentbench.tex b/Paper-AgentBench/archive-from-jss/jss-context-around-causalagentbench.tex
@@ -0,0 +1,71 @@
+%% =====================================================================
+%% Archive of JSS-manuscript passages mentioning CausalAgentBench / Track D
+%% Original location: Paper-JSS/manuscript/sections/{01,07,09}-*.tex
+%% Captured before P0-3 (CausalAgentBench split) trimmed the JSS draft.
+%%
+%% Purpose: preserve hard-won wording so the companion benchmark paper
+%% can recover any phrasing it needs without going through git history.
+%%
+%% This file is NOT compiled. It is a reference archive only.
+%% =====================================================================
+
+
+%% ---------------------------------------------------------------------
+%% From sections/01-introduction.tex -- §1.3 "Contribution and roadmap"
+%% Item 4 of the four-fold contributions list. Track D production RCT
+%% language in particular belongs to the companion paper.
+%% ---------------------------------------------------------------------
+
+\item We propose and partially implement the \emph{Causal Inference
+Decathlon} benchmark
+(Section~\ref{sec:parity}), a representative-not-exhaustive panel of
+twelve estimators evaluated on four tracks: Track~A combines
+analytical recovery, numerical parity against canonical
+\proglang{R} references, and selected \proglang{Stata} bridge checks
+for migration-critical commands; Track~B measures Monte Carlo
+coverage against known data-generating processes; Track~C measures
+computational performance; and Track~D measures behavioural
+performance of LLM agents using the platform versus alternative tool
+stacks. We report results from the
+package's existing reference-parity, \proglang{R}-parity,
+selected \proglang{Stata}-bridge, external-parity, and Monte Carlo
+coverage suites; we also report measured Track~C performance results
+and a mock-LLM dry run of the Track~D harness while leaving the
+production agent RCT for pre-registration.
+
+
+%% ---------------------------------------------------------------------
+%% From sections/09-discussion.tex -- "What StatsPAI is and is not"
+%% paragraph. The CausalAgentBench falsifiability framing belongs to
+%% the companion paper.
+%% ---------------------------------------------------------------------
+
+% Original sentence (ending of paragraph):
+% "...Nor do we claim that the agent-native API replaces the human
+%  researcher; we claim only that it lowers the cost of obtaining a
+%  correct first-pass empirical analysis from an LLM agent, which is
+%  the falsifiable hypothesis \textsc{CausalAgentBench} is designed
+%  to test."
+
+
+%% ---------------------------------------------------------------------
+%% From sections/09-discussion.tex -- "Roadmap (1.14--1.20)" T2 theme.
+%% ---------------------------------------------------------------------
+
+% Original phrasing (within T2):
+% "...extending Tracks A and B to all twelve estimators at $B = 1{,}000$
+%  in CI, and running \textsc{CausalAgentBench} once the OSF
+%  pre-registration is deposited and API budget is approved."
+
+
+%% ---------------------------------------------------------------------
+%% From sections/09-discussion.tex -- concluding remarks paragraph.
+%% ---------------------------------------------------------------------
+
+% Original sentence:
+% "Whether the platform actually delivers in agent-mediated empirical
+%  work is the empirical question \textsc{CausalAgentBench} will
+%  answer in a forthcoming companion benchmark paper
+%  (Section~\ref{sec:agentbench-behavioural}); the Bayesian and
+%  causal-discovery companion papers will close the corresponding
+%  evaluation gaps for those sub-packages."
diff --git a/Paper-AgentBench/archive-from-jss/jss-section7-agent-eval-original.tex b/Paper-AgentBench/archive-from-jss/jss-section7-agent-eval-original.tex
@@ -0,0 +1,130 @@
+\section{Behavioural evaluation of LLM agents}\label{sec:agentbench}
+
+The agent-native design described in Section~\ref{sec:agent} is the
+most contestable claim in this paper: a reviewer entitled to
+scepticism may reasonably ask whether a machine-readable schema
+\emph{actually} improves the behaviour of an LLM agent on a real
+empirical task, or whether the \emph{statsmodels-plus-prompting}
+baseline performs equivalently well. We treat that question as
+falsifiable. This section reports the
+\emph{mechanical} and \emph{procedural} evidence we can ship with
+the present draft, sketches the \emph{behavioural} evidence we have
+prepared for production, and defers the production RCT itself to
+a forthcoming companion benchmark paper to preserve its
+pre-registration value. We refer to \citet{patil2023gorilla} and
+\citet{liang2023helm} as methodological precedents for tool-use and
+holistic-LLM benchmarks respectively.
+
+\subsection{Mechanical evidence}\label{sec:agentbench-mechanical}
+
+The function-schema layer is statically inspectable. We report in
+Table~\ref{tab:schema-coverage} that 980 of 980 \statspai{} public
+functions return a non-empty OpenAI-style JSON Schema; that 214
+functions have hand-written registry specifications; and that 78
+functions currently expose richer agent metadata cards with
+assumptions, pre-conditions, failure modes, and recovery hints. The
+closest comparable \proglang{Python} statistics packages
+(\pkg{statsmodels}, \pkg{linearmodels}, \pkg{scikit-learn}) expose
+zero package-wide typed tool schemas.
+
+\begin{table}[t]
+\centering\small
+\caption{Mechanical schema coverage at \statspai{} 1.14.0 against
+representative comparators. Counts refer to public functions or
+estimator classes.}
+\label{tab:schema-coverage}
+\begin{tabular}{lcccc}
+\toprule
+& \statspai{} & \pkg{statsmodels} & \pkg{linearmodels} & \pkg{scikit-learn} \\
+\midrule
+Public surface (functions/classes)              & 980 & $\sim$2{,}000 & $\sim$150 & $\sim$170 estimators \\
+Emits OpenAI-style tool schema                  & 980 & 0   & 0   & 0 \\
+Hand-written registry specification             & 214 & 0   & 0   & 0 \\
+Curated agent metadata card                     & 78  & 0   & 0   & 0 \\
+Known limitations surfaced before execution     & 10  & 0   & 0   & 0 \\
+\bottomrule
+\end{tabular}
+\end{table}
+
+\subsection{Procedural evidence}\label{sec:agentbench-procedural}
+
+The agent-native surface is exposed through an MCP server
+(Section~\ref{sec:mcp}) that is conformant to the open Model Context
+Protocol specification. Conformance is asserted by the local
+\code{tests/test\_mcp\_protocol.py}, \code{tests/test\_agent\_schema.py},
+and \code{tests/test\_registry.py} suites: they check that registered
+functions emit tool schemas, that agent-card filters behave as
+documented, and that MCP tool discovery sees the same registry-backed
+surface as \code{sp.function\_schema()}. This guarantees that the
+schema layer is, at a minimum, correctly wired up before any
+behavioural claim is made.
+
+For this draft we also execute one deterministic end-to-end trace
+through the live MCP server. The script
+\code{replication/scripts/ex07\_agent\_trace.py} writes a temporary
+\texttt{mpdta.csv}, calls \code{tools/list}, fits
+\code{callaway\_santanna} with \code{as\_handle=true}, sends the
+returned handle to \code{audit\_result}, and then sends the same
+handle to \code{honest\_did\_from\_result}. Table~\ref{tab:agent-trace}
+summarises the trace; Appendix~\ref{app:transcript} prints the
+normalised transcript. This is not a substitute for the production
+RCT, but it verifies the agent-specific claim that the estimate
+\(\to\) audit \(\to\) sensitivity chain is executable without
+ferrying arrays or hand-copying citations through the model context.
+
+\begin{table}[t]
+\centering\small
+\caption{Deterministic MCP trace generated by
+\code{replication/scripts/ex07\_agent\_trace.py}. Ephemeral result
+handles are normalised in the appendix transcript.}
+\label{tab:agent-trace}
+\input{../replication/tables/ex07_agent_trace.tex}
+\end{table}
+
+\subsection{Behavioural evidence (deferred to companion paper)}\label{sec:agentbench-behavioural}
+
+Mechanical and procedural evidence do not by themselves answer
+whether an LLM agent invoked through this interface produces
+\emph{better empirical analyses} than one invoked against
+\pkg{statsmodels} or against \proglang{R} packages exposed through a
+Jupyter MCP shim. The benchmark that closes this gap --
+\textsc{CausalAgentBench} -- is an RCT-style study that holds the
+agent's language model fixed and varies only the toolset.
+
+The protocol consists of fifty causal-inference research prompts
+distributed across three difficulty levels (twenty L1 method-named
+prompts, twenty L2 indirect prompts, ten L3 workflow prompts), six
+experimental cells in a $3 \times 2$ factorial design (\statspai{}
+with MCP, a Pythonic statsmodels/linearmodels/DoubleML/grf-python
+stack, and \proglang{R}-via-MCP, each crossed with a frozen
+Anthropic Claude release and a frozen OpenAI GPT release), and a
+total of 900 trials at three repetitions per (prompt, cell). Eight
+metrics span task success, method correctness, code-execution
+success, token efficiency, hallucination rate, diagnostic
+completeness, reproducibility, and time-to-result. Five hypotheses
+will be pre-registered on OSF; failed trials are classified into
+five qualitative failure modes and reported with redacted
+transcripts. The cluster bootstrap with prompt as the clustering
+unit, $\alpha = 0.05$ two-sided with Bonferroni correction across
+the five hypotheses, is the planned statistical test.
+
+The complete protocol, gold answers, grading rubric, and a
+deterministic mock-LLM harness are shipped at
+\code{tests/agent\_bench/} in the source tree; an OSF
+pre-registration draft frozen at
+\code{tests/agent\_bench/prompts/\_protocol.md} lists the file
+hashes to deposit before the first production trial. A 900-trial
+mock-LLM dry run completes in $<1$~s on the benchmarking machine,
+validates the harness end-to-end, and produces the same per-cell
+table format that the production run will report. Flipping
+\code{runner.py} from \code{--mock} to \code{--api both} is the
+single change that swaps the mock LLM for the real ones; the
+scoring, aggregation, and statistical-test pipeline downstream of
+that flip is identical. The production run is gated on two
+external steps that we deliberately do not auto-execute: the OSF
+pre-registration deposit and an API budget approval. The full RCT
+and its statistical analysis are the subject of a forthcoming
+companion benchmark paper; we defer reporting them here both to
+preserve the pre-registration's scientific value and to keep the
+present manuscript focused on the unification, parity, and
+schema-layer claims that constitute the core JSS contribution.
diff --git a/Paper-AgentBench/manuscript/notes/osf-preregistration.md b/Paper-AgentBench/manuscript/notes/osf-preregistration.md