Commit 12860ef

docs(paper): preserve archived JSS §7 agent-eval material under Paper-AgentBench
Companion-paper workspace for CausalAgentBench, the LLM-agent behavioural RCT spun out of the JSS submission during the P0-3 reviewer-response trim. Splitting into a separate manuscript at the JSS revision stage preserves the OSF pre-registration's scientific value (deposited before trials run) and lets the JSS paper close under a single, falsifiable scope.

  archive-from-jss/
    jss-section7-agent-eval-original.tex     (130 lines, pre-trim §7)
    jss-context-around-causalagentbench.tex  (§1.3 item 4 + §9 mentions)
  manuscript/
    notes/osf-preregistration.md             (working OSF pre-reg copy)

Manuscript drafting itself is gated on (i) OSF deposit, (ii) production-API budget approval, (iii) JSS submission of the parent paper; only the archive + protocol notes land here today.
1 parent c339063 commit 12860ef

4 files changed

Lines changed: 481 additions & 0 deletions

Paper-AgentBench/README.md

Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
# Paper-AgentBench: Companion paper workspace

This directory is the working home for the **CausalAgentBench** companion
benchmark paper, a planned follow-up to the StatsPAI JSS submission. It
exists so that material we deliberately removed from the JSS draft
(the production RCT for LLM-agent behavioural evaluation, the 900-trial
$3 \times 2$ factorial design, the OSF pre-registration protocol, the
mock-LLM dry-run results) is **preserved verbatim**, not deleted.

The split is the original two-paper plan recorded in
`Paper-JSS/JSS-research-plan.md`: JSS publishes the unified-package +
parity story; the companion paper publishes the agent-behavioural RCT.
Splitting them into separate manuscripts at the JSS revision stage
preserves the pre-registration's scientific value (CausalAgentBench is
deposited on OSF before its trials run) and lets the JSS paper close
under a single, falsifiable scope.

## Layout

```
Paper-AgentBench/
├── README.md                                    (this file)
├── archive-from-jss/                            (preserved verbatim)
│   ├── jss-section7-agent-eval-original.tex     (JSS §7, 130 lines, pre-trim)
│   └── jss-context-around-causalagentbench.tex  (§1.3 item 4 + §9 mentions)
└── manuscript/                                  (companion-paper draft home)
    ├── sections/                                (empty; populated when drafting begins)
    └── notes/
        └── osf-preregistration.md               (copy of Paper-JSS/notes/...)
```

## Status (2026-05-05)

- **Archive**: complete. Every passage cut from the JSS draft during the
  P0-3 reviewer-response trim is preserved under `archive-from-jss/`.
- **Manuscript**: not yet drafted. The companion paper is gated on
  (i) OSF pre-registration deposit, (ii) production-API budget approval,
  (iii) JSS submission of the parent paper.
- **Pre-registration protocol**: see
  `manuscript/notes/osf-preregistration.md` (also kept in
  `Paper-JSS/notes/` for legacy paths).

## Working principles

1. **Never delete archive content.** If the companion paper's draft
   diverges from the archive, leave the archive intact and reword in
   `manuscript/sections/` instead.
2. **JSS draft references the companion paper as a forward pointer.**
   The JSS §7 / §1.3 / §9 passages now redirect readers here rather
   than promising results that have not yet been collected.
3. **OSF pre-registration is the canonical protocol.** Any future
   methodological change to CausalAgentBench is tracked through OSF
   amendments, with `manuscript/notes/osf-preregistration.md` as the
   working copy.

Paper-AgentBench/archive-from-jss/jss-context-around-causalagentbench.tex

Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
%% =====================================================================
%% Archive of JSS-manuscript passages mentioning CausalAgentBench / Track D
%% Original location: Paper-JSS/manuscript/sections/{01,07,09}-*.tex
%% Captured before P0-3 (CausalAgentBench split) trimmed the JSS draft.
%%
%% Purpose: preserve hard-won wording so the companion benchmark paper
%% can recover any phrasing it needs without going through git history.
%%
%% This file is NOT compiled. It is a reference archive only.
%% =====================================================================


%% ---------------------------------------------------------------------
%% From sections/01-introduction.tex -- §1.3 "Contribution and roadmap"
%% Item 4 of the four-fold contributions list. Track D production RCT
%% language in particular belongs to the companion paper.
%% ---------------------------------------------------------------------

\item We propose and partially implement the \emph{Causal Inference
Decathlon} benchmark
(Section~\ref{sec:parity}), a representative-not-exhaustive panel of
twelve estimators evaluated on four tracks: Track~A combines
analytical recovery, numerical parity against canonical
\proglang{R} references, and selected \proglang{Stata} bridge checks
for migration-critical commands; Track~B measures Monte Carlo
coverage against known data-generating processes; Track~C measures
computational performance; and Track~D measures behavioural
performance of LLM agents using the platform versus alternative tool
stacks. We report results from the
package's existing reference-parity, \proglang{R}-parity,
selected \proglang{Stata}-bridge, external-parity, and Monte Carlo
coverage suites; we also report measured Track~C performance results
and a mock-LLM dry run of the Track~D harness while leaving the
production agent RCT for pre-registration.


%% ---------------------------------------------------------------------
%% From sections/09-discussion.tex -- "What StatsPAI is and is not"
%% paragraph. The CausalAgentBench falsifiability framing belongs to
%% the companion paper.
%% ---------------------------------------------------------------------

% Original sentence (ending of paragraph):
% "...Nor do we claim that the agent-native API replaces the human
% researcher; we claim only that it lowers the cost of obtaining a
% correct first-pass empirical analysis from an LLM agent, which is
% the falsifiable hypothesis \textsc{CausalAgentBench} is designed
% to test."


%% ---------------------------------------------------------------------
%% From sections/09-discussion.tex -- "Roadmap (1.14--1.20)" T2 theme.
%% ---------------------------------------------------------------------

% Original phrasing (within T2):
% "...extending Tracks A and B to all twelve estimators at $B = 1{,}000$
% in CI, and running \textsc{CausalAgentBench} once the OSF
% pre-registration is deposited and API budget is approved."


%% ---------------------------------------------------------------------
%% From sections/09-discussion.tex -- concluding remarks paragraph.
%% ---------------------------------------------------------------------

% Original sentence:
% "Whether the platform actually delivers in agent-mediated empirical
% work is the empirical question \textsc{CausalAgentBench} will
% answer in a forthcoming companion benchmark paper
% (Section~\ref{sec:agentbench-behavioural}); the Bayesian and
% causal-discovery companion papers will close the corresponding
% evaluation gaps for those sub-packages."

Paper-AgentBench/archive-from-jss/jss-section7-agent-eval-original.tex

Lines changed: 130 additions & 0 deletions
@@ -0,0 +1,130 @@
\section{Behavioural evaluation of LLM agents}\label{sec:agentbench}

The agent-native design described in Section~\ref{sec:agent} is the
most contestable claim in this paper: a reviewer entitled to
scepticism may reasonably ask whether a machine-readable schema
\emph{actually} improves the behaviour of an LLM agent on a real
empirical task, or whether the \emph{statsmodels-plus-prompting}
baseline performs equivalently well. We treat that question as
falsifiable. This section reports the
\emph{mechanical} and \emph{procedural} evidence we can ship with
the present draft, sketches the \emph{behavioural} evidence we have
prepared for production, and defers the production RCT itself to
a forthcoming companion benchmark paper to preserve its
pre-registration value. We refer to \citet{patil2023gorilla} and
\citet{liang2023helm} as methodological precedents for tool-use and
holistic-LLM benchmarks respectively.

\subsection{Mechanical evidence}\label{sec:agentbench-mechanical}

The function-schema layer is statically inspectable. We report in
Table~\ref{tab:schema-coverage} that 980 of 980 \statspai{} public
functions return a non-empty OpenAI-style JSON Schema; that 214
functions have hand-written registry specifications; and that 78
functions currently expose richer agent metadata cards with
assumptions, pre-conditions, failure modes, and recovery hints. The
closest comparable \proglang{Python} statistics packages
(\pkg{statsmodels}, \pkg{linearmodels}, \pkg{scikit-learn}) expose
zero package-wide typed tool schemas.
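
To make the phrase ``OpenAI-style JSON Schema'' concrete, the following is a minimal sketch of how a tool schema could be derived from a Python function signature. It is an illustration under stated assumptions only: \code{build\_tool\_schema} and the toy \code{did\_estimate} function are hypothetical names introduced here, and this is not the package's actual schema generator.

```python
import inspect
import json

# Map Python annotations to JSON Schema type names (illustrative subset).
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def build_tool_schema(func):
    """Derive an OpenAI-style function-tool schema from a signature."""
    properties, required = {}, []
    for name, param in inspect.signature(func).parameters.items():
        # Unannotated or unknown types fall back to "string" in this sketch.
        json_type = _JSON_TYPES.get(param.annotation, "string")
        properties[name] = {"type": json_type}
        # Parameters without a default are treated as required.
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "type": "function",
        "function": {
            "name": func.__name__,
            "description": (inspect.getdoc(func) or "").split("\n")[0],
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


def did_estimate(data_path: str, outcome: str, n_boot: int = 999):
    """Toy difference-in-differences entry point (hypothetical)."""


schema = build_tool_schema(did_estimate)
print(json.dumps(schema, indent=2))
```

A schema of this shape is what lets an agent discover argument names and types before calling a function, which is the property the coverage counts above are measuring.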

\begin{table}[t]
\centering\small
\caption{Mechanical schema coverage at \statspai{} 1.14.0 against
representative comparators. Counts refer to public functions or
estimator classes.}
\label{tab:schema-coverage}
\begin{tabular}{lcccc}
\toprule
& \statspai{} & \pkg{statsmodels} & \pkg{linearmodels} & \pkg{scikit-learn} \\
\midrule
Public surface (functions/classes) & 980 & $\sim$2{,}000 & $\sim$150 & $\sim$170 estimators \\
Emits OpenAI-style tool schema & 980 & 0 & 0 & 0 \\
Hand-written registry specification & 214 & 0 & 0 & 0 \\
Curated agent metadata card & 78 & 0 & 0 & 0 \\
Known limitations surfaced before execution & 10 & 0 & 0 & 0 \\
\bottomrule
\end{tabular}
\end{table}

\subsection{Procedural evidence}\label{sec:agentbench-procedural}

The agent-native surface is exposed through an MCP server
(Section~\ref{sec:mcp}) that is conformant to the open Model Context
Protocol specification. Conformance is asserted by the local
\code{tests/test\_mcp\_protocol.py}, \code{tests/test\_agent\_schema.py},
and \code{tests/test\_registry.py} suites: they check that registered
functions emit tool schemas, that agent-card filters behave as
documented, and that MCP tool discovery sees the same registry-backed
surface as \code{sp.function\_schema()}. This guarantees that the
schema layer is, at a minimum, correctly wired up before any
behavioural claim is made.

For this draft we also execute one deterministic end-to-end trace
through the live MCP server. The script
\code{replication/scripts/ex07\_agent\_trace.py} writes a temporary
\texttt{mpdta.csv}, calls \code{tools/list}, fits
\code{callaway\_santanna} with \code{as\_handle=true}, sends the
returned handle to \code{audit\_result}, and then sends the same
handle to \code{honest\_did\_from\_result}. Table~\ref{tab:agent-trace}
summarises the trace; Appendix~\ref{app:transcript} prints the
normalised transcript. This is not a substitute for the production
RCT, but it verifies the agent-specific claim that the estimate
\(\to\) audit \(\to\) sensitivity chain is executable without
ferrying arrays or hand-copying citations through the model context.
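
The request sequence in that trace can be sketched as a list of JSON-RPC 2.0 messages, the wire format MCP uses for \code{tools/list} and \code{tools/call}. This is a hedged illustration rather than the contents of \code{ex07\_agent\_trace.py}: the tool names and \code{as\_handle} come from the text above, while the argument names \code{data} and \code{result}, the placeholder handle string, and the helper functions are assumptions made for the example.

```python
import itertools
import json

_ids = itertools.count(1)


def rpc(method, params=None):
    """Build one JSON-RPC 2.0 request of the kind MCP clients send."""
    msg = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        msg["params"] = params
    return msg


def tool_call(name, arguments):
    """Wrap a tool invocation in an MCP tools/call request."""
    return rpc("tools/call", {"name": name, "arguments": arguments})


# Placeholder: in a live session the handle comes from the fit response.
handle = "<handle-from-fit-response>"

trace = [
    rpc("tools/list"),
    # "data" is a hypothetical argument name; as_handle is from the text.
    tool_call("callaway_santanna", {"data": "mpdta.csv", "as_handle": True}),
    # "result" is a hypothetical argument name for passing the handle.
    tool_call("audit_result", {"result": handle}),
    tool_call("honest_did_from_result", {"result": handle}),
]

for msg in trace:
    print(json.dumps(msg))
```

Only the short handle string crosses the model context between steps, which is the point of the handle-based design: the fitted result stays server-side.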

\begin{table}[t]
\centering\small
\caption{Deterministic MCP trace generated by
\code{replication/scripts/ex07\_agent\_trace.py}. Ephemeral result
handles are normalised in the appendix transcript.}
\label{tab:agent-trace}
\input{../replication/tables/ex07_agent_trace.tex}
\end{table}

\subsection{Behavioural evidence (deferred to companion paper)}\label{sec:agentbench-behavioural}

Mechanical and procedural evidence do not by themselves answer
whether an LLM agent invoked through this interface produces
\emph{better empirical analyses} than one invoked against
\pkg{statsmodels} or against \proglang{R} packages exposed through a
Jupyter MCP shim. The benchmark that closes this gap --
\textsc{CausalAgentBench} -- is an RCT-style study that holds the
agent's language model fixed and varies only the toolset.

The protocol consists of fifty causal-inference research prompts
distributed across three difficulty levels (twenty L1 method-named
prompts, twenty L2 indirect prompts, ten L3 workflow prompts), six
experimental cells in a $3 \times 2$ factorial design (\statspai{}
with MCP, a Pythonic statsmodels/linearmodels/DoubleML/grf-python
stack, and \proglang{R}-via-MCP, each crossed with a frozen
Anthropic Claude release and a frozen OpenAI GPT release), and a
total of 900 trials at three repetitions per (prompt, cell). Eight
metrics span task success, method correctness, code-execution
success, token efficiency, hallucination rate, diagnostic
completeness, reproducibility, and time-to-result. Five hypotheses
will be pre-registered on OSF; failed trials are classified into
five qualitative failure modes and reported with redacted
transcripts. The planned statistical test is a cluster bootstrap
with prompt as the clustering unit, two-sided at $\alpha = 0.05$
with Bonferroni correction across the five hypotheses.
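
As a consistency check on the design just described (a worked calculation, not additional protocol detail), the trial count and the per-hypothesis significance level work out as
\[
  (20 + 20 + 10)\ \mathrm{prompts} \times (3 \times 2)\ \mathrm{cells}
  \times 3\ \mathrm{repetitions} = 50 \times 6 \times 3 = 900 ,
  \qquad
  \frac{\alpha}{5} = \frac{0.05}{5} = 0.01 .
\]
Each of the six cells therefore contributes $50 \times 3 = 150$ trials, and each hypothesis is tested at the $0.01$ level, assuming the standard Bonferroni division of $\alpha$ by the number of hypotheses.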

The complete protocol, gold answers, grading rubric, and a
deterministic mock-LLM harness are shipped at
\code{tests/agent\_bench/} in the source tree; an OSF
pre-registration draft frozen at
\code{tests/agent\_bench/prompts/\_protocol.md} lists the file
hashes to deposit before the first production trial. A 900-trial
mock-LLM dry run completes in $<1$~s on the benchmarking machine,
validates the harness end-to-end, and produces the same per-cell
table format that the production run will report. Flipping
\code{runner.py} from \code{--mock} to \code{--api both} is the
single change that swaps the mock LLM for the real ones; the
scoring, aggregation, and statistical-test pipeline downstream of
that flip is identical. The production run is gated on two
external steps that we deliberately do not auto-execute: the OSF
pre-registration deposit and an API budget approval. The full RCT
and its statistical analysis are the subject of a forthcoming
companion benchmark paper; we defer reporting them here both to
preserve the pre-registration's scientific value and to keep the
present manuscript focused on the unification, parity, and
schema-layer claims that constitute the core JSS contribution.
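
The planned cluster bootstrap with prompt as the clustering unit can be sketched as follows. This is one plausible reading of that design, not the analysis code shipped under \code{tests/agent\_bench/}: the function name, the percentile interval, the number of bootstrap draws, and the synthetic toy data are all assumptions made for illustration.

```python
import random
import statistics


def cluster_bootstrap_ci(diffs_by_prompt, n_boot=2000, alpha=0.05,
                         n_hypotheses=5, seed=0):
    """Percentile CI for a mean difference, resampling whole prompts.

    diffs_by_prompt maps prompt id -> list of per-repetition outcome
    differences between two cells. alpha is divided by n_hypotheses
    (Bonferroni) before the interval is formed.
    """
    rng = random.Random(seed)
    prompts = list(diffs_by_prompt)
    adj_alpha = alpha / n_hypotheses
    stats = []
    for _ in range(n_boot):
        # Resample prompts (clusters) with replacement, keeping each
        # prompt's repetitions together.
        sample = rng.choices(prompts, k=len(prompts))
        pooled = [d for p in sample for d in diffs_by_prompt[p]]
        stats.append(statistics.fmean(pooled))
    stats.sort()
    lo = stats[int((adj_alpha / 2) * n_boot)]
    hi = stats[int((1 - adj_alpha / 2) * n_boot) - 1]
    return lo, hi


# Synthetic toy data, not benchmark results: 50 prompts x 3 repetitions.
toy_rng = random.Random(1)
toy = {f"p{i:02d}": [toy_rng.choice([-1, 0, 0, 1, 1]) for _ in range(3)]
       for i in range(50)}
lo, hi = cluster_bootstrap_ci(toy)
print(f"Bonferroni-adjusted CI for mean difference: [{lo:.3f}, {hi:.3f}]")
```

Resampling prompts rather than individual trials respects the fact that the three repetitions of a prompt are not independent; a hypothesis would be rejected in this sketch when the adjusted interval excludes zero.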
