|
| 1 | +\section{Behavioural evaluation of LLM agents}\label{sec:agentbench} |
| 2 | + |
| 3 | +The agent-native design described in Section~\ref{sec:agent} is the |
| 4 | +most contestable claim in this paper: a sceptical reviewer may |
| 5 | +reasonably ask whether a machine-readable schema |
| 6 | +\emph{actually} improves the behaviour of an LLM agent on a real |
| 7 | +empirical task, or whether the \emph{statsmodels-plus-prompting} |
| 8 | +baseline performs equivalently well. We treat that question as |
| 9 | +falsifiable. This section reports the |
| 10 | +\emph{mechanical} and \emph{procedural} evidence available for |
| 11 | +the present draft, outlines the \emph{behavioural} benchmark we have |
| 12 | +prepared for production, and defers the production randomised |
| 13 | +controlled trial (RCT) to a forthcoming companion benchmark paper to |
| 14 | +preserve its pre-registration value. We refer to \citet{patil2023gorilla} and |
| 15 | +\citet{liang2023helm} as methodological precedents for tool-use and |
| 16 | +holistic-LLM benchmarks respectively. |
| 17 | + |
| 18 | +\subsection{Mechanical evidence}\label{sec:agentbench-mechanical} |
| 19 | + |
| 20 | +The function-schema layer is statically inspectable. We report in |
| 21 | +Table~\ref{tab:schema-coverage} that 980 of 980 \statspai{} public |
| 22 | +functions return a non-empty OpenAI-style JSON Schema; that 214 |
| 23 | +functions have hand-written registry specifications; and that 78 |
| 24 | +functions currently expose richer agent metadata cards with |
| 25 | +assumptions, pre-conditions, failure modes, and recovery hints. The |
| 26 | +closest comparable \proglang{Python} statistics packages |
| 27 | +(\pkg{statsmodels}, \pkg{linearmodels}, \pkg{scikit-learn}) expose |
| 28 | +zero package-wide typed tool schemas. |
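For concreteness, the general shape of an OpenAI-style tool schema can be sketched as follows. This is an illustration only: the function name, description, and parameter names are hypothetical placeholders rather than entries from the \statspai{} registry, and the wrapped \code{type}/\code{function} form shown here is an assumption about which variant of the format is emitted. The helper \code{is\_nonempty\_tool\_schema} is likewise a hypothetical stand-in for the ``non-empty'' criterion, not the package's own check.

```python
import json

# Hypothetical tool schema in the OpenAI "tools" format. None of these
# names or fields are taken from the statspai registry.
EXAMPLE_SCHEMA = {
    "type": "function",
    "function": {
        "name": "example_estimator",  # hypothetical function name
        "description": "Fit an illustrative estimator on a CSV file.",
        "parameters": {  # a plain JSON Schema object
            "type": "object",
            "properties": {
                "data_path": {"type": "string", "description": "Path to a CSV file."},
                "outcome": {"type": "string", "description": "Outcome column name."},
                "as_handle": {"type": "boolean", "default": False},
            },
            "required": ["data_path", "outcome"],
        },
    },
}


def is_nonempty_tool_schema(schema: dict) -> bool:
    """Return True if the schema names a function and declares at least one parameter."""
    fn = schema.get("function", {})
    params = fn.get("parameters", {})
    return (
        schema.get("type") == "function"
        and bool(fn.get("name"))
        and params.get("type") == "object"
        and bool(params.get("properties"))
    )


if __name__ == "__main__":
    print(json.dumps(EXAMPLE_SCHEMA, indent=2))
    print(is_nonempty_tool_schema(EXAMPLE_SCHEMA))
```

A coverage count such as the one in Table~\ref{tab:schema-coverage} would then amount to applying a check of this kind to every public function's schema, under whatever definition of ``non-empty'' the package adopts.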
| 29 | + |
| 30 | +\begin{table}[t] |
| 31 | +\centering\small |
| 32 | +\caption{Mechanical schema coverage at \statspai{} 1.14.0 against |
| 33 | +representative comparators. Counts refer to public functions or |
| 34 | +estimator classes.} |
| 35 | +\label{tab:schema-coverage} |
| 36 | +\begin{tabular}{lcccc} |
| 37 | +\toprule |
| 38 | +& \statspai{} & \pkg{statsmodels} & \pkg{linearmodels} & \pkg{scikit-learn} \\ |
| 39 | +\midrule |
| 40 | +Public surface (functions/classes) & 980 & $\sim$2{,}000 & $\sim$150 & $\sim$170 estimators \\ |
| 41 | +Emits OpenAI-style tool schema & 980 & 0 & 0 & 0 \\ |
| 42 | +Hand-written registry specification & 214 & 0 & 0 & 0 \\ |
| 43 | +Curated agent metadata card & 78 & 0 & 0 & 0 \\ |
| 44 | +Known limitations surfaced before execution & 10 & 0 & 0 & 0 \\ |
| 45 | +\bottomrule |
| 46 | +\end{tabular} |
| 47 | +\end{table} |
| 48 | + |
| 49 | +\subsection{Procedural evidence}\label{sec:agentbench-procedural} |
| 50 | + |
| 51 | +The agent-native surface is exposed through an MCP server |
| 52 | +(Section~\ref{sec:mcp}) that is conformant to the open Model Context |
| 53 | +Protocol specification. Conformance is asserted by the local |
| 54 | +\code{tests/test\_mcp\_protocol.py}, \code{tests/test\_agent\_schema.py}, |
| 55 | +and \code{tests/test\_registry.py} suites: they check that registered |
| 56 | +functions emit tool schemas, that agent-card filters behave as |
| 57 | +documented, and that MCP tool discovery sees the same registry-backed |
| 58 | +surface as \code{sp.function\_schema()}. These checks establish that the |
| 59 | +schema layer is, at a minimum, correctly wired before any |
| 60 | +behavioural claim is made. |
| 61 | + |
| 62 | +For this draft we also execute one deterministic end-to-end trace |
| 63 | +through the live MCP server. The script |
| 64 | +\code{replication/scripts/ex07\_agent\_trace.py} writes a temporary |
| 65 | +\texttt{mpdta.csv}, calls \code{tools/list}, fits |
| 66 | +\code{callaway\_santanna} with \code{as\_handle=true}, sends the |
| 67 | +returned handle to \code{audit\_result}, and then sends the same |
| 68 | +handle to \code{honest\_did\_from\_result}. Table~\ref{tab:agent-trace} |
| 69 | +summarises the trace; Appendix~\ref{app:transcript} prints the |
| 70 | +normalised transcript. This is not a substitute for the production |
| 71 | +RCT, but it verifies the agent-specific claim that the estimate |
| 72 | +\(\to\) audit \(\to\) sensitivity chain is executable without |
| 73 | +ferrying arrays or hand-copying citations through the model context. |
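The request sequence of such a trace can be sketched as JSON-RPC 2.0 messages, the envelope that MCP uses for \code{tools/list} and \code{tools/call}. This is a minimal sketch and not the contents of \code{ex07\_agent\_trace.py}: the tool names and the \code{as\_handle} flag come from the description above, whereas the argument keys \code{data\_path} and \code{result}, the handle placeholder, and the helper names are assumptions made for illustration. The MCP initialisation handshake is omitted.

```python
import json


def rpc(msg_id, method, params=None):
    """Build a JSON-RPC 2.0 request envelope."""
    msg = {"jsonrpc": "2.0", "id": msg_id, "method": method}
    if params is not None:
        msg["params"] = params
    return msg


def tool_call(msg_id, name, arguments):
    """Build an MCP tools/call request for the named tool."""
    return rpc(msg_id, "tools/call", {"name": name, "arguments": arguments})


def build_trace(csv_path, handle="<handle-from-step-2>"):
    """Return the four requests of the estimate -> audit -> sensitivity chain.

    In a live session the handle is read from the response to step 2;
    a placeholder string stands in for it here. The argument keys
    "data_path" and "result" are hypothetical.
    """
    return [
        rpc(1, "tools/list"),
        tool_call(2, "callaway_santanna", {"data_path": csv_path, "as_handle": True}),
        tool_call(3, "audit_result", {"result": handle}),
        tool_call(4, "honest_did_from_result", {"result": handle}),
    ]


if __name__ == "__main__":
    for request in build_trace("mpdta.csv"):
        print(json.dumps(request))
```

Only the short handle string travels between steps 2, 3, and 4; the fitted arrays stay on the server side, which is the property the trace is meant to demonstrate.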
| 74 | + |
| 75 | +\begin{table}[t] |
| 76 | +\centering\small |
| 77 | +\caption{Deterministic MCP trace generated by |
| 78 | +\code{replication/scripts/ex07\_agent\_trace.py}. Ephemeral result |
| 79 | +handles are normalised in the appendix transcript.} |
| 80 | +\label{tab:agent-trace} |
| 81 | +\input{../replication/tables/ex07_agent_trace.tex} |
| 82 | +\end{table} |
| 83 | + |
| 84 | +\subsection{Behavioural evidence (deferred to companion paper)}\label{sec:agentbench-behavioural} |
| 85 | + |
| 86 | +Mechanical and procedural evidence do not by themselves answer |
| 87 | +whether an LLM agent invoked through this interface produces |
| 88 | +\emph{better empirical analyses} than one invoked against |
| 89 | +\pkg{statsmodels} or against \proglang{R} packages exposed through a |
| 90 | +Jupyter MCP shim. The benchmark that closes this gap, |
| 91 | +\textsc{CausalAgentBench}, is an RCT-style study that, within each comparison, holds the |
| 92 | +agent's language model fixed and varies only the toolset. |
| 93 | + |
| 94 | +The protocol consists of fifty causal-inference research prompts |
| 95 | +distributed across three difficulty levels (twenty L1 method-named |
| 96 | +prompts, twenty L2 indirect prompts, ten L3 workflow prompts), six |
| 97 | +experimental cells in a $3 \times 2$ factorial design (\statspai{} |
| 98 | +with MCP, a \proglang{Python} \pkg{statsmodels}/\pkg{linearmodels}/\pkg{DoubleML}/\pkg{grf-python} |
| 99 | +stack, and \proglang{R}-via-MCP, each crossed with a frozen |
| 100 | +Anthropic Claude release and a frozen OpenAI GPT release), and a |
| 101 | +total of $50 \times 6 \times 3 = 900$ trials at three repetitions per (prompt, cell). Eight |
| 102 | +metrics span task success, method correctness, code-execution |
| 103 | +success, token efficiency, hallucination rate, diagnostic |
| 104 | +completeness, reproducibility, and time-to-result. Five hypotheses |
| 105 | +will be pre-registered on OSF; failed trials are classified into |
| 106 | +five qualitative failure modes and reported with redacted |
| 107 | +transcripts. The planned statistical test is a cluster bootstrap with |
| 108 | +prompt as the clustering unit, two-sided at a family-wise $\alpha = 0.05$ with Bonferroni |
| 109 | +correction across the five hypotheses (per-hypothesis threshold $0.05/5 = 0.01$). |
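The arithmetic of the design and the planned resampling scheme can be sketched as follows. The sketch enumerates the $50 \times 3 \times 2 \times 3$ trial grid, computes the Bonferroni-adjusted threshold, and resamples whole prompts with replacement. It is an illustration under stated assumptions, not the benchmark's own pipeline: the cell labels, the per-prompt difference as the test statistic, the percentile interval, and all function names are hypothetical, since the protocol summary above specifies only the clustering unit and the correction.

```python
import itertools
import random
import statistics

# Design factors; the labels are hypothetical stand-ins for the cells.
PROMPTS = [f"p{i:02d}" for i in range(50)]  # 20 L1 + 20 L2 + 10 L3 prompts
TOOLSETS = ["statspai_mcp", "python_stack", "r_via_mcp"]
MODELS = ["claude_frozen", "gpt_frozen"]
REPS = range(3)


def trial_grid():
    """Enumerate every (prompt, toolset, model, repetition) trial."""
    return list(itertools.product(PROMPTS, TOOLSETS, MODELS, REPS))


def bonferroni_alpha(alpha=0.05, n_hypotheses=5):
    """Per-hypothesis threshold under Bonferroni correction."""
    return alpha / n_hypotheses


def cluster_bootstrap_ci(diff_by_prompt, alpha, n_boot=2000, seed=0):
    """Percentile interval for the mean per-prompt difference.

    diff_by_prompt maps each prompt to one number, e.g. the difference in
    a metric between two toolsets averaged over repetitions (an assumed
    choice of statistic). Whole prompts are resampled with replacement,
    so all trials belonging to a prompt enter or leave a draw together.
    """
    rng = random.Random(seed)
    prompts = list(diff_by_prompt)
    means = []
    for _ in range(n_boot):
        draw = rng.choices(prompts, k=len(prompts))
        means.append(statistics.fmean([diff_by_prompt[p] for p in draw]))
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


if __name__ == "__main__":
    print(len(trial_grid()))  # number of trials in the full design
    # Synthetic per-prompt differences, for illustration only.
    demo_rng = random.Random(1)
    demo = {p: demo_rng.gauss(0.1, 0.2) for p in PROMPTS}
    print(cluster_bootstrap_ci(demo, alpha=bonferroni_alpha()))
```

Under this sketch a hypothesis would be rejected when the interval computed at the adjusted threshold excludes zero; that decision rule is an assumption of the example and may differ from the pre-registered one.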
| 110 | + |
| 111 | +The complete protocol, gold answers, grading rubric, and a |
| 112 | +deterministic mock-LLM harness are shipped at |
| 113 | +\code{tests/agent\_bench/} in the source tree; an OSF |
| 114 | +pre-registration draft frozen at |
| 115 | +\code{tests/agent\_bench/prompts/\_protocol.md} lists the file |
| 116 | +hashes to deposit before the first production trial. A 900-trial |
| 117 | +mock-LLM dry run completes in $<1$~s on the benchmarking machine, |
| 118 | +validates the harness end-to-end, and produces the same per-cell |
| 119 | +table format that the production run will report. Switching |
| 120 | +\code{runner.py} from \code{--mock} to \code{--api both} is the |
| 121 | +single change that swaps the mock LLM for the production models; the |
| 122 | +scoring, aggregation, and statistical-test pipeline downstream of |
| 123 | +that switch is unchanged. The production run is gated on two |
| 124 | +external steps that we deliberately do not auto-execute: the OSF |
| 125 | +pre-registration deposit and an API budget approval. The full RCT |
| 126 | +and its statistical analysis are the subject of a forthcoming |
| 127 | +companion benchmark paper; we defer reporting them here both to |
| 128 | +preserve the pre-registration's scientific value and to keep the |
| 129 | +present manuscript focused on the unification, parity, and |
| 130 | +schema-layer claims that constitute the core JSS contribution. |