\documentclass[11pt,a4paper]{article}

\usepackage[margin=2.5cm]{geometry}
\usepackage{booktabs}
\usepackage{amsmath}
\usepackage{graphicx}
\usepackage{hyperref}
\usepackage{xcolor}
\usepackage{microtype}
\usepackage{parskip}
\usepackage{titlesec}
\usepackage{caption}
\usepackage{listings}
\usepackage{multirow}
\usepackage{array}

\hypersetup{
    colorlinks=true,
    linkcolor=blue!60!black,
    citecolor=blue!60!black,
    urlcolor=blue!60!black
}

\titleformat{\section}{\large\bfseries}{\thesection}{1em}{}
\titleformat{\subsection}{\normalsize\bfseries}{\thesubsection}{1em}{}

\title{\textbf{LLM-Based Event Template Filling:\\
A Baseline Study on MUC-4 and MUC-6}}

\author{
    Ross Feng\\
    \small University of Sheffield\\
    \small \texttt{acp25ck@sheffield.ac.uk}
}

\date{\today}

\begin{document}

\maketitle

\begin{abstract}
We evaluate whether instruction-tuned large language models (LLMs) can fill structured event templates from news text without any task-specific fine-tuning. Using \textbf{Qwen2.5-7B-Instruct} and \textbf{Llama-3.1-8B-Instruct}, we run inference on two classic MUC shared task datasets: MUC-4 (terrorism template filling, 23 slots; zero-shot and few-shot) and MUC-6 (corporate succession template filling, 9 slots; zero-shot only). We evaluate using a four-layer scoring framework (JSON validity, schema validity, exact match, and fuzzy Levenshtein match), reporting micro and macro F1 against empty and majority-class baselines. On MUC-4, the best model condition (Llama few-shot) achieves Micro F1 of 0.164 (strict) and 0.186 (fuzzy), below the majority baseline of 0.237, which benefits from categorical slots that are trivially predictable. On MUC-6, Llama zero-shot achieves Micro F1 of 0.238, well above the zero-shot MUC-4 results. Few-shot prompting reduces schema errors dramatically (from 16\% to 1\% for Qwen, and from 8\% to under 1\% for Llama) but provides only modest gains in slot-filling accuracy. The dominant failure mode across all conditions is extracting an incorrect entity mention in open-text slots. All code is available at \url{https://github.com/rosscyking1115/event-extraction-llm-baseline}.
\end{abstract}

% ─────────────────────────────────────────────────────────────────────────────
\section{Introduction}

Information extraction (IE) is one of the foundational tasks in natural language processing. At its most structured form, IE requires filling predefined templates from free text — identifying not just \textit{that} an event occurred, but \textit{who} did it, \textit{where}, \textit{when}, and with what consequences. The MUC (Message Understanding Conference) shared tasks, run between 1987 and 1997, defined this template-filling paradigm and produced the benchmark datasets that shaped IE research for decades~\cite{sundheim1992muc,grishman1996muc6}.

Modern large language models (LLMs), trained on vast corpora and fine-tuned for instruction following, represent a qualitatively different approach to this problem. Rather than learning extraction patterns from annotated examples, these models apply broad world knowledge and language understanding to fill structured templates from scratch. However, MUC-style template filling poses challenges that differ from the kind of generation tasks these models excel at: the output must exactly match a predefined schema, values must be grounded in the document rather than inferred, and multi-slot templates require coherent extraction across many fields simultaneously.

This report investigates three research questions:
\begin{enumerate}
    \item Can instruction-tuned LLMs fill MUC-style event templates without any fine-tuning, producing valid and schema-compliant JSON output?
    \item How does prompting strategy — zero-shot versus few-shot — affect slot-filling accuracy and schema adherence?
    \item How do Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct compare across two structurally different MUC tasks?
\end{enumerate}

We situate this work within a broader preliminary study that also explored trigger-detection baselines on MAVEN and WikiEvents; those results serve as background context (Section~\ref{sec:preliminary}) and motivated the shift to slot-filling evaluation on the more structured MUC benchmarks.

% ─────────────────────────────────────────────────────────────────────────────
\section{Background and Related Work}
\label{sec:background}

\subsection{MUC Shared Tasks}

The Message Understanding Conferences (MUC) were organised by DARPA to benchmark information extraction systems on real news text. MUC-3 and MUC-4~\cite{sundheim1992muc} focused on terrorism events from Latin American newswires, requiring systems to fill a 23-slot template for each detected event. MUC-6~\cite{grishman1996muc6} shifted to Wall Street Journal articles and a corporate succession template covering who moved into or out of which executive role at which organisation. These datasets established the core evaluation methodology of the field: template-level precision, recall, and F1 computed over individual slot values.

Classic MUC-4 systems achieved Micro F1 in the range of 0.40--0.55 using largely rule-based, hand-engineered approaches~\cite{sundheim1992muc}. These systems were engineered specifically for the domain and had access to the training corpus. Our zero-shot LLM baselines should therefore be interpreted as lower bounds relative to supervised systems, not competitive benchmarks.

\subsection{LLMs for Information Extraction}

Recent work has demonstrated that instruction-tuned LLMs can perform zero-shot information extraction across a range of tasks. Wei et al.~\cite{wei2023zero} showed that ChatGPT can extract entities and relations from text with no labelled examples, approaching supervised baselines on some benchmarks. Ma et al.~\cite{ma2023large} found that LLMs perform better as rerankers than as direct extractors on hard samples. For event extraction specifically, Gao et al.~\cite{gao2023exploring} evaluated ChatGPT and found that it still trails task-specific supervised models, particularly in complex and long-tail scenarios.

MUC-style slot filling presents additional challenges compared to trigger detection or named entity recognition: the model must simultaneously handle schema compliance, multi-slot coherence, null-value decisions (when a slot is genuinely inapplicable), and multi-template scenarios where a single document describes multiple events.

% ─────────────────────────────────────────────────────────────────────────────
\section{Preliminary Experiments}
\label{sec:preliminary}

Before adopting the MUC evaluation framework, we conducted preliminary experiments on event trigger detection using MAVEN~\cite{wang2020maven} and WikiEvents~\cite{li2021document}. These are summarised briefly here to provide context for the evaluation design choices made in the main experiments.

\subsection{MAVEN}

MAVEN is a large-scale event detection dataset covering 168 event types from Wikipedia. On 50 sampled instances, Qwen2.5-7B-Instruct achieved trigger exact match of 0.160 (unconstrained) and 0.220 (constrained-label prompting), with type accuracy of 0.000 and 0.280 respectively. The 0.000 unconstrained type accuracy reflects a fundamental ontology mismatch: the model generates semantically plausible type strings that do not match MAVEN's exact label vocabulary. Constraining the label space to the 168 valid types resolves this.

\subsection{WikiEvents}

WikiEvents covers document-level event argument extraction across a three-level type hierarchy. On the development split (345 event mentions), trigger exact match reached 0.272--0.330 (Qwen vs Llama) with constrained prompting. Few-shot prompting improved type label adherence from 80.6\% to 97.4\%, confirming that worked examples help maintain ontology alignment. The dominant failure mode was multi-event sentence ambiguity: when a single sentence describes both an attack and a resulting death, the model frequently attaches to the more salient event rather than the gold-annotated one.

\subsection{Motivation for MUC Evaluation}

These preliminary results led us to adopt the MUC framework for the main study. MUC-style slot filling offers two advantages over trigger detection: (1) the evaluation is over \textit{structured templates} rather than individual trigger spans, making it more representative of real IE use cases; (2) the four-layer evaluation framework specified for this project (JSON validity, schema validity, exact match, and fuzzy match) provides richer diagnostic signal than binary exact match alone.

% ─────────────────────────────────────────────────────────────────────────────
\section{Datasets}
\label{sec:datasets}

\subsection{MUC-4 (Terrorism Template Filling)}

MUC-4~\cite{sundheim1992muc} contains terrorism news articles from Latin American newswires, with gold templates annotating each terrorism event described in the article. Each template covers 23 slots (Table~\ref{tab:muc4_slots}).

\begin{table}[h]
\centering
\caption{MUC-4 slot schema (23 slots). Categorical slots have a fixed value set; string slots are open-text entity mentions.}
\label{tab:muc4_slots}
\small
\begin{tabular}{lp{0.58\textwidth}}
\toprule
\textbf{Slot} & \textbf{Type} \\
\midrule
INCIDENT\_TYPE & Categorical (ATTACK, BOMBING, KIDNAPPING, ARSON, ASSASSINATION, ROBBERY, FORCED WORK STOPPAGE) \\
INCIDENT\_STAGE & Categorical (ATTEMPTED, ACCOMPLISHED) \\
INCIDENT\_DATE & String (date expression) \\
INCIDENT\_LOCATION & String (place name) \\
INCIDENT\_INSTRUMENT\_TYPE & String (weapon/device type) \\
INCIDENT\_INSTRUMENT\_ID & String (weapon/device name) \\
PERP\_INCIDENT\_CATEGORY & Categorical (TERRORIST ACT, STATE-SPONSORED VIOLENCE) \\
PERP\_INDIVIDUAL\_ID & String (individual perpetrator name) \\
PERP\_ORGANIZATION\_ID & String (organisation name) \\
PERP\_ORGANIZATION\_CONFIDENCE & Categorical \\
PHYS\_TGT\_TYPE & String (type of physical target) \\
PHYS\_TGT\_ID & String (name of physical target) \\
PHYS\_TGT\_NUM & String (number of targets) \\
PHYS\_TGT\_EFFECT & Categorical (DESTROYED, SOME DAMAGE, NO DAMAGE, UNKNOWN) \\
PHYS\_TGT\_FOREIGNNATION & Categorical \\
HUM\_TGT\_NAME & String (victim name) \\
HUM\_TGT\_DESCRIPTION & String (victim description) \\
HUM\_TGT\_NUM & String (number of victims) \\
HUM\_TGT\_TYPE & Categorical (CIVILIAN, GOVERNMENT OFFICIAL, MILITARY, etc.) \\
HUM\_TGT\_NATIONALITY & Categorical \\
HUM\_TGT\_EFFECT & Categorical (DEATH, INJURY, NO INJURY) \\
HUM\_TGT\_FOREIGN\_NATION & Categorical \\
COMMENT & String (free text note) \\
\bottomrule
\end{tabular}
\end{table}

We use two official test splits: \textbf{TST3} (100 documents, 69 with events, 123 gold templates) and \textbf{TST4} (100 documents, 57 with events, 86 gold templates). For few-shot experiments, two example documents with events are drawn from TST4 when evaluating TST3, to avoid data leakage.

\subsection{MUC-6 (Corporate Succession Template Filling)}

MUC-6~\cite{grishman1996muc6} uses Wall Street Journal articles and focuses on corporate succession events — instances of named individuals entering or leaving named executive positions at named organisations. The template schema has 9 flat slots (Table~\ref{tab:muc6_slots}).

\begin{table}[h]
\centering
\caption{MUC-6 slot schema (9 slots).}
\label{tab:muc6_slots}
\small
\begin{tabular}{lp{0.7\textwidth}}
\toprule
\textbf{Slot} & \textbf{Description} \\
\midrule
succession\_org & Organisation where succession occurred \\
post & Executive position/title \\
person\_in & Person taking the position \\
person\_out & Person leaving the position \\
vacancy\_reason & Reason for vacancy (REASSIGNMENT, NEW\_POST\_CREATED, DEPART\_WORKFORCE, OTH\_UNK) \\
on\_the\_job\_in & Whether the incoming person is already in post \\
on\_the\_job\_out & Whether the outgoing person is still in post \\
other\_org\_in & Organisation the incoming person came from \\
rel\_other\_org\_in & Relationship type to other\_org\_in \\
\bottomrule
\end{tabular}
\end{table}

The test set contains 100 documents, 53 with at least one succession event, and 240 total succession events. Vacancy reasons are distributed as: REASSIGNMENT (123), OTH\_UNK (73), NEW\_POST\_CREATED (29), DEPART\_WORKFORCE (15).
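As a rough indication of how many templates a model must produce per document, the counts reported in this section give the following averages over event-bearing documents. This is a derived illustration, not a separately reported statistic:
\[
\text{TST3: } \frac{123}{69} \approx 1.78, \qquad
\text{TST4: } \frac{86}{57} \approx 1.51, \qquad
\text{MUC-6: } \frac{240}{53} \approx 4.53 .
\]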

% ─────────────────────────────────────────────────────────────────────────────
\section{Methodology}
\label{sec:methodology}

\subsection{Models}

We evaluate two open-weight instruction-tuned models:

\textbf{Qwen2.5-7B-Instruct}~\cite{qwen2025}: 7 billion parameters, trained by Alibaba Cloud with strong multilingual instruction following.

\textbf{Llama-3.1-8B-Instruct}~\cite{llama3}: 8 billion parameters, trained by Meta AI with a different pretraining corpus and instruction tuning pipeline.

Both models run on a single NVIDIA A100 GPU (Sheffield Stanage HPC) in \texttt{float16} precision with greedy decoding. No fine-tuning is performed.

\subsection{Prompt Design}

For each document, we construct a prompt that includes: (1) a task description explaining the slot-filling objective; (2) the complete slot schema with descriptions and valid values for categorical slots; (3) the document text; and (4) an instruction to output a single JSON object (MUC-4) or a JSON array of objects (MUC-6, for multi-event documents).

For \textbf{zero-shot prompting}, the prompt contains no extraction examples. For \textbf{few-shot prompting}, we prepend two complete worked examples — document text, prompt, and gold template output — before the target document. Examples are drawn from a held-out split to prevent data leakage.

A key design decision is to include all slot names and valid categorical values in the prompt. This provides the model with the full schema in context, allowing it to produce correctly-keyed JSON without prior knowledge of the MUC ontology.
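As an illustration of the target format, a MUC-4 prediction might look like the sketch below. The values are hypothetical and the listing is abridged to a few of the 23 keys; it is not the exact prompt or output used in the experiments.

\begin{lstlisting}[basicstyle=\small\ttfamily,columns=fullflexible]
{
  "INCIDENT_TYPE": "BOMBING",
  "INCIDENT_STAGE": "ACCOMPLISHED",
  "INCIDENT_LOCATION": "<place name from the document>",
  "PERP_ORGANIZATION_ID": null,
  "HUM_TGT_EFFECT": "INJURY"
}
\end{lstlisting}

Emitting \texttt{null} for an empty slot, rather than omitting the key, keeps every required key present in the output.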

\subsection{Four-Layer Evaluation Framework}

We evaluate each prediction through four sequential layers, following the framework specified for this project:

\begin{enumerate}
    \item \textbf{JSON validity.} Does the model output parse as valid JSON? Invalid JSON is counted as a complete failure for all slots.
    \item \textbf{Schema validity.} Does the parsed JSON contain all required slot keys? Missing keys are counted as \texttt{schema\_error} for those slots.
    \item \textbf{Exact match.} After normalisation, do the gold and predicted values match exactly?
    \item \textbf{Fuzzy match.} Does the normalised Levenshtein similarity meet the threshold $\geq 0.8$?
\end{enumerate}

\textbf{Normalisation} applies the following transformations before comparison: lowercase, strip surrounding whitespace and quotes, collapse multiple spaces, and standardise date separator characters.

\textbf{Levenshtein similarity} between strings $s_1$ and $s_2$ is computed as:
\[
\text{sim}(s_1, s_2) = 1 - \frac{\text{lev}(s_1, s_2)}{\max(|s_1|, |s_2|)}
\]
where $\text{lev}$ is the character-level edit distance. A fuzzy match is declared at similarity $\geq 0.8$, capturing near-misses such as name abbreviations or minor spelling variants.
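For intuition, consider two worked cases; the strings are hypothetical and chosen only for illustration. A one-character spelling variant passes the threshold:
\[
\text{sim}(\texttt{medellin}, \texttt{medelin}) = 1 - \frac{1}{\max(8, 7)} = 0.875 \geq 0.8,
\]
whereas an initialled name does not:
\[
\text{sim}(\texttt{john smith}, \texttt{j.~smith}) = 1 - \frac{3}{\max(10, 8)} = 0.7 < 0.8 .
\]
The second case needs three edits (one substitution and two deletions), so on short strings only light abbreviation survives the threshold.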

\subsection{TP/FP/FN Rules}

We define true positives, false positives, and false negatives using six cases (Table~\ref{tab:scoring_rules}). The null-null case (both gold and predicted are null/absent) is excluded from all counts, since predicting that a slot has no value when it genuinely has none provides no extraction signal.

\begin{table}[h]
\centering
\caption{TP/FP/FN scoring rules for strict (exact match) and fuzzy scoring.}
\label{tab:scoring_rules}
\begin{tabular}{cllll}
\toprule
\textbf{Case} & \textbf{Gold} & \textbf{Predicted} & \textbf{Strict} & \textbf{Fuzzy} \\
\midrule
1 & null  & null  & not counted & not counted \\
2 & value & value (exact match)  & TP & TP \\
3 & value & value (fuzzy match only) & FP + FN & TP \\
4 & value & value (no match) & FP + FN & FP + FN \\
5 & value & null  & FN & FN \\
6 & null  & value & FP & FP \\
\bottomrule
\end{tabular}
\end{table}
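To make the rules concrete, consider a hypothetical template with exactly one slot in each of cases 2--6. Strict scoring gives $\text{TP}=1$, $\text{FP}=3$ (cases 3, 4, 6) and $\text{FN}=3$ (cases 3, 4, 5). Fuzzy scoring promotes case 3 to a true positive, giving $\text{TP}=2$, $\text{FP}=2$, $\text{FN}=2$. Using $F_1 = 2\,\text{TP}/(2\,\text{TP}+\text{FP}+\text{FN})$:
\[
F_1^{\text{strict}} = \frac{2}{2+3+3} = 0.25, \qquad
F_1^{\text{fuzzy}} = \frac{4}{4+2+2} = 0.50 .
\]
A wrong value (case 4) costs both a false positive and a false negative, whereas an abstention (case 5) costs only a false negative.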

\subsection{Metrics}

\textbf{Micro F1} aggregates all TP, FP, and FN counts across all slots and documents before computing precision, recall, and F1. This weights each slot prediction equally and tends to be driven by frequent slots.

\textbf{Macro F1} computes F1 per slot type separately, then averages across all slot types. This treats rare and frequent slots equally and is more informative when the slot distribution is skewed.

Both metrics are computed for strict (exact match) and fuzzy (Levenshtein $\geq 0.8$) matching.
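Written out, with $\text{TP}_s$, $\text{FP}_s$, and $\text{FN}_s$ denoting the counts for slot type $s$ in the slot set $S$ (notation introduced here for illustration), the two metrics are:
\[
F_1^{\text{micro}} = \frac{2\sum_{s \in S} \text{TP}_s}{2\sum_{s \in S}\text{TP}_s + \sum_{s \in S}\text{FP}_s + \sum_{s \in S}\text{FN}_s},
\qquad
F_1^{\text{macro}} = \frac{1}{|S|}\sum_{s \in S} \frac{2\,\text{TP}_s}{2\,\text{TP}_s + \text{FP}_s + \text{FN}_s}.
\]
This sketch assumes that a slot type with $\text{TP}_s = 0$ contributes $F_1 = 0$ to the macro average.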

\subsection{Baselines}

We compute two baselines that require no model inference:

\textbf{Empty baseline:} Predicts \texttt{null} for every slot. This achieves Micro F1 = 0.000 by definition, serving as the absolute lower bound.

\textbf{Majority baseline:} For categorical slots, predicts the most frequent gold value seen in the test set (computed in a single pass, so it is technically a gold-informed baseline). For string slots, predicts \texttt{null}. This isolates how much of the F1 score is achievable purely from class frequency on categorical slots.

\subsection{Error Taxonomy}

Every slot prediction is assigned one of 13 outcome categories, namely \texttt{correct} plus 12 error types: \texttt{missing\_slot}, \texttt{hallucinated\_slot}, \texttt{wrong\_argument}, \texttt{wrong\_event\_type}, \texttt{partial\_entity}, \texttt{over\_specific}, \texttt{under\_specific}, \texttt{date\_format}, \texttt{invalid\_json}, \texttt{schema\_error}, \texttt{multiple\_values\_error}, \texttt{event\_boundary\_error}.

% ─────────────────────────────────────────────────────────────────────────────
\section{Results}
\label{sec:results}

\subsection{MUC-4 TST3 — Main Results}

Table~\ref{tab:muc4_results} presents results across all model conditions on MUC-4 TST3.

\begin{table}[h]
\centering
\caption{MUC-4 TST3 results ($n=100$ documents, 123 gold templates). Strict = exact match after normalisation. Fuzzy = Levenshtein similarity $\geq 0.8$.}
\label{tab:muc4_results}
\begin{tabular}{llccc}
\toprule
\textbf{Model} & \textbf{Prompt} & \textbf{Micro F1 (strict)} & \textbf{Micro F1 (fuzzy)} & \textbf{Macro F1 (strict)} \\
\midrule
Empty baseline    & —         & 0.000 & 0.000 & 0.000 \\
Majority baseline & —         & 0.237 & 0.237 & 0.073 \\
\midrule
Qwen2.5-7B        & Zero-shot & 0.130 & 0.155 & 0.101 \\
Llama-3.1-8B      & Zero-shot & 0.140 & 0.158 & 0.098 \\
Qwen2.5-7B        & Few-shot  & 0.126 & 0.159 & 0.093 \\
Llama-3.1-8B      & Few-shot  & \textbf{0.164} & \textbf{0.186} & \textbf{0.113} \\
\bottomrule
\end{tabular}
\end{table}

All model conditions fall below the majority baseline on Micro F1. This is expected: the majority baseline trivially achieves high recall on the three categorical slots that dominate slot counts (INCIDENT\_STAGE, INCIDENT\_TYPE, PERP\_INCIDENT\_CATEGORY), while scoring zero on the remaining 20 slots (Table~\ref{tab:majority_breakdown}). LLM conditions score more broadly across slot types, as reflected in higher Macro F1 than the majority baseline (e.g.\ Llama few-shot: 0.113 vs 0.073).

The fuzzy-to-strict gap is small and positive in every model condition, ranging from 0.018 (Llama zero-shot) to 0.033 (Qwen few-shot). This indicates that a proportion of model predictions are correct entity mentions with minor surface variation (name abbreviations, partial titles, or alternative transliterations) that fuzzy matching recovers.

\subsection{MUC-6 — Main Results}

Table~\ref{tab:muc6_results} presents results on MUC-6. Only zero-shot conditions were run on MUC-6.

\begin{table}[h]
\centering
\caption{MUC-6 test set results ($n=100$ documents, 240 gold succession events).}
\label{tab:muc6_results}
\begin{tabular}{llcc}
\toprule
\textbf{Model} & \textbf{Prompt} & \textbf{Micro F1 (strict)} & \textbf{Macro F1 (strict)} \\
\midrule
Empty baseline    & —         & 0.000 & 0.000 \\
Qwen2.5-7B        & Zero-shot & 0.197 & 0.189 \\
Llama-3.1-8B      & Zero-shot & \textbf{0.238} & \textbf{0.222} \\
\bottomrule
\end{tabular}
\end{table}

MUC-6 results are substantially higher than MUC-4. This reflects the structural simplicity of the succession task: with 9 flat slots (compared to 23) and a clearer extraction target (a named person moving into or out of a named role), the model has fewer opportunities to hallucinate or conflate. Llama again outperforms Qwen, consistent with the MUC-4 pattern.

\subsection{Effect of Prompting Strategy}

Few-shot prompting produces mixed results on slot-filling accuracy. On MUC-4 TST3, in-context examples help Llama ($+0.024$ strict Micro F1 over zero-shot) but leave Qwen essentially unchanged ($-0.004$). The most consistent effect of few-shot prompting is on schema adherence: the schema error rate drops from approximately 16\% to approximately 1\% for Qwen, and from 8\% to under 1\% for Llama. In-context examples teach the model the exact slot key names far more reliably than the schema description alone.

\subsection{Model Comparison: Qwen vs.\ Llama}

Llama-3.1-8B outperforms Qwen2.5-7B on Micro F1 in every reported condition; the only reported metric on which Qwen leads is zero-shot Macro F1 on MUC-4 (0.101 vs 0.098). On MUC-4, the gap in strict Micro F1 is small in the zero-shot condition (0.140 vs 0.130) and widens in the few-shot condition (0.164 vs 0.126), suggesting that Llama is a better in-context learner for this structured extraction task. The largest gap overall is on MUC-6 zero-shot (0.238 vs 0.197). Across all conditions, Llama produces fewer schema errors and higher fuzzy match rates, suggesting stronger instruction-following.

% ─────────────────────────────────────────────────────────────────────────────
\section{Analysis}
\label{sec:analysis}

\subsection{Majority Baseline Decomposition}

The majority baseline's strength (Micro F1 = 0.237 on MUC-4 TST3) is concentrated entirely in three categorical slots. Table~\ref{tab:majority_breakdown} shows per-slot F1 for the majority baseline.

\begin{table}[h]
\centering
\caption{MUC-4 TST3 — majority baseline per-slot F1 for slots with non-zero score.}
\label{tab:majority_breakdown}
\begin{tabular}{lcc}
\toprule
\textbf{Slot} & \textbf{F1 (majority)} & \textbf{Predicted value} \\
\midrule
INCIDENT\_STAGE              & 0.769 & ACCOMPLISHED \\
INCIDENT\_TYPE               & 0.544 & ATTACK \\
PERP\_INCIDENT\_CATEGORY     & 0.372 & TERRORIST ACT \\
All other slots (20 slots)   & 0.000 & null \\
\bottomrule
\end{tabular}
\end{table}

To surpass the majority baseline on Micro F1, LLMs must score on the remaining 20 slots, where the baseline scores zero. The higher Macro F1 of the LLM conditions (0.093--0.113 vs 0.073) indicates that they do extract signal there, from slots like INCIDENT\_LOCATION, PERP\_INDIVIDUAL\_ID, and HUM\_TGT\_NAME, although not yet enough to close the Micro F1 gap.
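The reported baseline Macro F1 can be recovered from Table~\ref{tab:majority_breakdown} by averaging over all 23 slots, with the 20 zero-scoring slots included in the denominator:
\[
\frac{0.769 + 0.544 + 0.372 + 20 \times 0.000}{23} = \frac{1.685}{23} \approx 0.073 .
\]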

\subsection{Slot Difficulty}

Categorical slots (INCIDENT\_TYPE, INCIDENT\_STAGE, HUM\_TGT\_EFFECT) are the easiest: the model simply needs to choose from a small fixed set, and including the valid values in the prompt makes this reliable. String slots involving named entities (PERP\_INDIVIDUAL\_ID, HUM\_TGT\_NAME, PHYS\_TGT\_ID) are harder because the model must locate the correct mention in the document text. DATE and LOCATION slots present intermediate difficulty, as the model generally identifies the right type of value but may extract a slightly different surface form or normalise differently from the gold.

\subsection{Why MUC-4 is Harder than MUC-6}

MUC-6 produces higher F1 for three reasons. First, the succession task has a clearer extraction pattern: there is typically one person-role-organisation triple per event, and the surface form is consistent in financial news. Second, MUC-6 has 9 slots compared to MUC-4's 23, reducing opportunities for errors to compound. Third, MUC-4 documents often describe multiple distinct terrorism events that require separate templates; the model must recognise event boundaries and produce separate JSON objects for each, which is a significantly harder task than extracting a single template.

\subsection{Effect of Few-Shot Examples on Schema Compliance}

The most consistent and practically significant finding is the effect of few-shot prompting on schema compliance. Without examples, models frequently make slot key naming errors — using \texttt{PERP\_ORG} instead of \texttt{PERP\_ORGANIZATION\_ID}, or omitting less salient slots entirely. With two worked examples in the prompt, the model observes the exact expected output format and replicates it reliably. This suggests that for structured extraction tasks, the primary value of few-shot examples is as a \textit{format demonstration} rather than as semantic extraction guidance.

\subsection{Failure Mode Analysis}

The dominant failure mode across all conditions is \textbf{wrong entity value} in open-text slots — the model extracts a plausible entity from the document but not the gold-annotated one. In MUC-4, this often manifests as extracting the wrong perpetrator organisation (e.g.\ naming the ideological movement instead of the specific cell), or extracting an approximate victim count that differs from the gold. In MUC-6, the most common error is extracting the wrong person when multiple personnel changes are described in the same article.

\textbf{Hallucination of values} (predicting a non-null value when gold is null) is the second most frequent failure. These predictions count as false positives under both strict and fuzzy scoring (case 6 in Table~\ref{tab:scoring_rules}), so fuzzy matching does not recover them. The model occasionally infers entities that are consistent with the event type but not explicitly mentioned in the document.

\textbf{Event boundary errors} affect MUC-4 more than MUC-6: when a document describes both a bombing and a kidnapping, the model sometimes merges both events into a single template rather than producing two separate predictions.

% ─────────────────────────────────────────────────────────────────────────────
\section{Conclusion}
\label{sec:conclusion}

We presented an LLM-based template-filling baseline on MUC-4 and MUC-6 using Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct with zero-shot and few-shot prompting. Key findings are as follows.

All model conditions score below the majority baseline on MUC-4 Micro F1 because the majority baseline benefits from high recall on three dominant categorical slots; however, LLM conditions achieve substantially higher Macro F1 (up to 0.113 vs 0.073), reflecting their ability to extract open-text slot values that majority prediction cannot handle.

Few-shot prompting provides its most consistent benefit through schema compliance rather than slot-filling accuracy. With two in-context examples, schema error rates drop from approximately 16\% to 1\% for Qwen and from 8\% to under 1\% for Llama, suggesting that format demonstration is the primary function of few-shot examples in structured extraction tasks.

Llama-3.1-8B outperforms Qwen2.5-7B across most conditions, with the largest gap on MUC-6 zero-shot (Micro F1 0.238 vs 0.197). MUC-6 results are substantially higher than MUC-4, attributable to the simpler schema, more consistent surface forms in financial news, and fewer event-boundary errors.

The primary failure mode across all conditions is incorrect entity extraction in open-text slots, which few-shot prompting alone did not resolve in our experiments. Fine-tuning on in-domain data, retrieval-augmented generation, and chain-of-thought decomposition are promising directions for future work. Extension to MUC-7 (rocket launch events) and full multi-template alignment would further characterise LLM capabilities on the complete MUC benchmark suite.

% ─────────────────────────────────────────────────────────────────────────────
\bibliographystyle{plain}
\begin{thebibliography}{99}

\bibitem{sundheim1992muc}
B.\ Sundheim.
\newblock Overview of the Fourth Message Understanding Evaluation and Conference.
\newblock In \textit{Proceedings of the Fourth Message Understanding Conference (MUC-4)}, 1992.

\bibitem{grishman1996muc6}
R.\ Grishman and B.\ Sundheim.
\newblock Message Understanding Conference — 6: A Brief History.
\newblock In \textit{Proceedings of COLING}, 1996.

\bibitem{gao2023exploring}
J.\ Gao et al.
\newblock Exploring the feasibility of ChatGPT for event extraction.
\newblock \textit{arXiv preprint arXiv:2303.03836}, 2023.

\bibitem{li2021document}
S.\ Li, H.\ Ji, and J.\ Han.
\newblock Document-level event argument extraction by conditional generation.
\newblock In \textit{Proceedings of NAACL}, 2021.

\bibitem{llama3}
A.\ Dubey et al.
\newblock The Llama 3 herd of models.
\newblock \textit{arXiv preprint arXiv:2407.21783}, 2024.

\bibitem{ma2023large}
Y.\ Ma et al.
\newblock Large language model is not a good few-shot information extractor, but a good reranker for hard samples!
\newblock In \textit{Findings of EMNLP}, 2023.

\bibitem{qwen2025}
Qwen Team.
\newblock Qwen2.5 technical report.
\newblock \textit{arXiv preprint arXiv:2412.15115}, 2025.

\bibitem{wang2020maven}
X.\ Wang et al.
\newblock MAVEN: A massive general domain event detection dataset.
\newblock In \textit{Proceedings of EMNLP}, 2020.

\bibitem{wei2023zero}
X.\ Wei et al.
\newblock Zero-shot information extraction via chatting with ChatGPT.
\newblock \textit{arXiv preprint arXiv:2302.10205}, 2023.

\end{thebibliography}

\end{document}