PrefixGuard: From Large Language Model Agent Traces to Online Failure-Warning Monitors

Xinmiao Huang, Jinwei Hu, Rajarshi Roy, Changshun Wu, Yi Dong, Xiaowei Huang

University of Liverpool · Université Grenoble Alpes


0.900 WebArena AUPRC

0.710 τ²-Bench AUPRC

+0.137 Average gain vs. raw-text controls

Figure: PrefixGuard motivation overview. PrefixGuard targets online warning before terminal failure, while avoiding hand-authored schemas and deployment-time LLM judging for every prefix.

Abstract

Large language model agents now execute long, tool-using tasks where final outcome checks can arrive too late for intervention. PrefixGuard is a trace-to-monitor framework with an offline StepView induction step followed by supervised monitor training. StepView induces deterministic typed-step adapters from raw trace samples, and the monitor learns an event abstraction and prefix-risk scorer from terminal outcomes. Across WebArena, τ²-Bench, SkillsBench, and TerminalBench, the strongest PrefixGuard monitors reach 0.900, 0.710, 0.533, and 0.557 AUPRC, respectively. They improve over raw-text controls by an average of +0.137 AUPRC when comparing the strongest backend in each representation, while LLM judges remain substantially weaker under the same prefix-warning protocol.

PrefixGuard Pipeline

PrefixGuard converts heterogeneous raw traces into typed StepView records, encodes each step with a frozen TF-IDF encoder, learns a discrete event abstraction with Gumbel-softmax, and scores risk online with a GRU, Transformer, or finite-state backend. Hard learned symbols can also be compiled into PrefixGuard-DFA for finite-state audit.
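To make the "frozen TF-IDF encoder" stage concrete, the following is a minimal sketch in plain Python. It assumes whitespace tokenisation, smoothed IDF, and L2 normalisation; the function names `fit_idf` and `encode_step` are illustrative and are not taken from the PrefixGuard code. "Frozen" is read here as: the IDF table is fitted once on training steps and never updated while the monitor trains or runs.

```python
import math
from collections import Counter

def fit_idf(step_texts):
    """Fit smoothed IDF weights once on training steps; they stay fixed afterwards."""
    n_docs = len(step_texts)
    doc_freq = Counter()
    for text in step_texts:
        doc_freq.update(set(text.lower().split()))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}

def encode_step(text, idf):
    """Encode one step as an L2-normalised sparse TF-IDF vector (token -> weight)."""
    # Tokens that were not seen when the IDF table was fitted are ignored.
    counts = Counter(tok for tok in text.lower().split() if tok in idf)
    vec = {tok: tf * idf[tok] for tok, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {tok: w / norm for tok, w in vec.items()} if norm > 0 else {}

# Hypothetical StepView step strings, used only to exercise the encoder.
idf = fit_idf(["click button", "type username", "click link"])
vector = encode_step("click unknown", idf)
```

Because the vocabulary is fixed at fit time, a step made only of unseen tokens encodes to an empty vector rather than raising an error.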

StepView

LLM-assisted offline adapter induction maps raw browser, dialogue, coding, and CLI traces into fixed typed fields.
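As a rough illustration of what a deterministic typed-step adapter could look like, the sketch below maps one made-up browser trace line to a record with fixed typed fields. The field names (`index`, `actor`, `action`, `argument`), the raw line format, and the `browser_adapter` function are all hypothetical; the actual StepView schema is induced offline per benchmark and is not specified on this page.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class StepView:
    """Hypothetical fixed typed fields for one trace step (not the real schema)."""
    index: int
    actor: str               # e.g. "agent", "tool", "user"
    action: str              # e.g. "click", "type", "goto"
    argument: Optional[str]  # free-text payload, if any

def browser_adapter(index, raw_line):
    """Deterministically parse a made-up raw line such as 'type [username]'."""
    head, _, rest = raw_line.strip().partition(" ")
    argument = rest.strip("[] ") or None
    return StepView(index=index, actor="agent", action=head.lower(), argument=argument)

step = browser_adapter(0, "type [username]")
```

The point of the sketch is only that the adapter is ordinary deterministic code once induced, so no LLM call is needed to apply it to new traces.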

Learned Events

A differentiable event abstraction layer learns a failure-aligned symbol alphabet from the prefix warning objective.
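The Gumbel-softmax relaxation behind such a layer can be sketched without a deep learning framework. The function below is a plain-Python illustration with assumed names and a default temperature of 1.0; it has no automatic differentiation, so it shows only the sampling step, not training. In a trained model the logits would come from a learned projection of the step encoding.

```python
import math
import random

def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Return (soft, hard): relaxed symbol probabilities and the argmax symbol index."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise by inverse transform; clamp u away from 0 to avoid log(0).
        u = max(rng.random(), 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]
    hard = max(range(len(soft)), key=soft.__getitem__)
    return soft, hard

soft, hard = gumbel_softmax([2.0, 0.5, -1.0], temperature=0.5, rng=random.Random(0))
```

Lower temperatures push `soft` toward a one-hot vector, which is what makes a discrete symbol alphabet recoverable after training; the `hard` index is the kind of symbol a finite-state monitor would consume.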

Online Risk

The monitor produces a causal risk score after each new step, with no LLM inference required at deployment.
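The causal property can be illustrated with a toy recurrent scorer. The sketch below is a scalar GRU-style cell whose weights are arbitrary placeholders, not trained PrefixGuard parameters, and whose input is a single hypothetical per-step feature. It shows only the interface: one state update and one risk value per incoming step, using nothing from later steps.

```python
import math

def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class OnlineRiskMonitor:
    """Toy scalar GRU-style scorer; all weights are hypothetical, untrained placeholders."""

    def __init__(self):
        self.h = 0.0  # hidden state summarising the prefix seen so far

    def update(self, x):
        """Consume one step feature and return the risk score for the current prefix."""
        z = _sigmoid(1.0 * x + 0.5 * self.h)           # update gate
        r = _sigmoid(0.8 * x - 0.3 * self.h)           # reset gate
        cand = math.tanh(1.5 * x + 0.7 * r * self.h)   # candidate state
        self.h = (1.0 - z) * self.h + z * cand
        return _sigmoid(2.0 * self.h - 0.5)

def score_prefixes(features):
    """Score every prefix of a trajectory in a single left-to-right pass."""
    monitor = OnlineRiskMonitor()
    return [monitor.update(x) for x in features]

full = score_prefixes([0.1, 0.9, -0.4, 1.2])
```

Scoring a shorter prefix reproduces the first entries of `full` exactly, which is the sense in which the monitor is causal.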

Main Results

Input view	Head / scorer	WebArena	τ²-Bench	SkillsBench	TerminalBench
Prompt	GPT-5.4-mini	0.407	0.302	0.101	0.127
Prompt	DeepSeek-V4-Pro	0.450	0.396	0.080	0.107
StepView activity	PPM LSTM	0.382 ± 0.004	0.231 ± 0.003	0.089 ± 0.001	0.093 ± 0.000
Raw text	GRU	0.871 ± 0.004	0.554 ± 0.006	0.315 ± 0.006	0.370 ± 0.001
StepView	PrefixGuard-DFA	0.792 ± 0.015	0.316 ± 0.055	0.190 ± 0.021	0.184 ± 0.029
StepView	PrefixGuard-Transformer	0.892 ± 0.006	0.710 ± 0.014	0.478 ± 0.028	0.555 ± 0.006
StepView	PrefixGuard-GRU	0.900 ± 0.015	0.696 ± 0.004	0.533 ± 0.020	0.557 ± 0.005
PG-GRU gain vs. Raw-text GRU		+0.029	+0.142	+0.218	+0.187

Values are AUPRC. Multi-seed rows report mean ± standard deviation over 3 seeds. Prompt baselines use zero-shot full-prefix prompts with N=200 samples.
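For readers who want to reproduce the metric on their own prefix scores, AUPRC is commonly estimated as average precision. The sketch below assumes that estimator and binary prefix labels (1 for a prefix of an eventually failed trajectory); the page does not state which estimator or tie-handling rule the paper uses, so treat this as one reasonable choice. Tied scores are processed in input order here.

```python
def average_precision(labels, scores):
    """Estimate AUPRC as average precision; labels are 1 for positive prefixes, else 0."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positive labels")
    # Rank prefixes from highest to lowest risk score (stable for ties).
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at the rank of each positive
    return total / n_pos

ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])  # (1/1 + 2/3) / 2
```

A scorer that ranks at random has expected AUPRC close to the positive rate, so the values in the table are best read relative to each benchmark's own base rate.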

Key Findings

Typed traces matter

Changing only the representation from raw text to StepView improves the GRU monitor by +0.029 (WebArena) to +0.218 (SkillsBench) AUPRC across the four benchmark families.
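The per-benchmark gains can be rechecked directly from the Raw text GRU and PrefixGuard-GRU rows of the results table. The short calculation below uses only those published means; the dictionary names are illustrative.

```python
# Mean AUPRC values copied from the Raw text GRU and PrefixGuard-GRU rows.
raw_text_gru = {"WebArena": 0.871, "tau2-Bench": 0.554, "SkillsBench": 0.315, "TerminalBench": 0.370}
prefixguard_gru = {"WebArena": 0.900, "tau2-Bench": 0.696, "SkillsBench": 0.533, "TerminalBench": 0.557}

# Per-benchmark gain, rounded to the table's three decimals.
gains = {name: round(prefixguard_gru[name] - raw_text_gru[name], 3) for name in raw_text_gru}

# Mean of the four GRU-to-GRU gains.
mean_gain = round(sum(gains.values()) / len(gains), 3)
```

This reproduces the gains of 0.029, 0.142, 0.218, and 0.187, with a GRU-to-GRU mean of 0.144. The headline +0.137 average is defined differently, as a comparison of the strongest backend in each representation, so the two averages need not coincide.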

Neural monitors are strongest

GRU and Transformer backends give the best ranking scores. DFA extraction remains useful as an audit artifact, especially when the learned symbolic state space stays compact.

Ranking and alert utility differ

WebArena reaches high AUPRC (0.900) but yields only 0.007 early failed-trajectory recall at the 10% FAR operating point. τ²-Bench and TerminalBench retain more actionable early alerts under the same diagnostic, with early recall of 0.192 and 0.178.

LLM judges are not enough

Zero-shot LLM prompt baselines are substantially weaker under the matched prefix-warning protocol, motivating lightweight learned monitors instead of repeated deployment-time LLM judging.

Deployment Diagnostics

A high AUPRC is useful only when the monitor can raise alerts early enough and under a realistic false-alarm budget. The diagnostics below read the same PrefixGuard results through this deployment lens.

Figure: False-alarm-rate constrained warning utility. First-alert diagnostics separate ranked AUPRC from deployable low-FAR warning utility.

Figure: Conditional AUPRC observability ceiling. The observability ceiling diagnoses when visible prefix evidence limits recoverable AUPRC.

Benchmark	FAR	Failed recall	Early recall	Lead fraction
WebArena	0.079	0.287	0.007	0.026
τ²-Bench	0.089	0.979	0.192	0.106
SkillsBench	0.105	0.954	0.039	0.017
TerminalBench	0.101	0.965	0.178	0.215

The operating-point table shows deployable alert behavior; the observability ceiling explains when the trace itself limits recoverable warning signal; and the DFA audit shows how a finite-state monitor can be inspected after training.
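One way first-alert diagnostics of this kind could be computed is sketched below. The definitions are assumptions, since this page does not spell them out: FAR is taken as the fraction of successful trajectories that ever trigger an alert, failed recall as the fraction of failed trajectories that trigger one, early recall as the fraction of failed trajectories whose first alert arrives within a hypothetical `early_cutoff` share of the trajectory, and lead fraction as the mean remaining share of the trajectory at first alert, counting missed failures as zero. The paper's exact definitions may differ.

```python
def first_alert(scores, threshold):
    """Index of the first step whose risk reaches the threshold, or None if none does."""
    for step, score in enumerate(scores):
        if score >= threshold:
            return step
    return None

def alert_diagnostics(trajectories, threshold, early_cutoff=0.5):
    """trajectories: list of (per_step_scores, failed) pairs. Definitions are assumptions."""
    failed = [s for s, is_failed in trajectories if is_failed]
    passed = [s for s, is_failed in trajectories if not is_failed]
    if not failed or not passed:
        raise ValueError("need at least one failed and one successful trajectory")
    false_alarms = sum(first_alert(s, threshold) is not None for s in passed)
    caught, early, lead_total = 0, 0, 0.0
    for scores in failed:
        step = first_alert(scores, threshold)
        if step is None:
            continue  # missed failure: contributes zero lead
        caught += 1
        position = (step + 1) / len(scores)  # share of the trajectory consumed at alert time
        early += int(position <= early_cutoff)
        lead_total += 1.0 - position
    return {
        "far": false_alarms / len(passed),
        "failed_recall": caught / len(failed),
        "early_recall": early / len(failed),
        "lead_fraction": lead_total / len(failed),
    }

# Made-up risk scores, used only to exercise the function.
demo = alert_diagnostics(
    [([0.1, 0.2, 0.9, 0.95], True), ([0.8, 0.9], True), ([0.1, 0.1, 0.1], True),
     ([0.1, 0.2], False), ([0.1, 0.7, 0.2], False)],
    threshold=0.6,
)
```

In practice the threshold would be calibrated on held-out successful trajectories so that the realised FAR stays near the chosen budget, which is why the FAR column reports values close to, but not exactly, 0.10.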

Observability Ceiling

A monitor can only warn from evidence that has already appeared in the prefix. The conditional AUPRC ceiling estimates this upper bound by grouping prefixes with comparable visible evidence and asking how separable future failures remain within those groups. When the ceiling is high, missed warnings usually point to modeling or representation limitations. When the ceiling is low, the trace itself is not yet exposing enough information for any causal monitor to separate success from failure reliably.

\[ P(x \mid p{=}1)=\pi P_{\mathrm{obs}}+(1-\pi)P_{\mathrm{neg}}, \qquad P_{\mathrm{neg}}:=P(x\mid p{=}0) \]

\[ \mathrm{AUPRC}(f) \leq \mathcal{A}(\pi,r) := \pi + \frac{r(1-\pi)^2}{1-\pi r} + \frac{r\pi(1-\pi)(1-r)}{(1-\pi r)^2}\ln\frac{1}{\pi r} \]

Here \(r\) is the positive-prefix rate and \(\pi\) is the fraction of positive warning prefixes whose failure evidence is already observable in the current representation. The endpoints are \(\mathcal{A}(0,r)=r\) and \(\mathcal{A}(1,r)=1\).
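The bound is easy to evaluate numerically. The sketch below implements \(\mathcal{A}(\pi,r)\) exactly as written above; the function name `auprc_ceiling` is illustrative. The case \(\pi=0\) is handled separately because \(\pi\ln\frac{1}{\pi r}\to 0\) as \(\pi\to 0\), which gives the stated endpoint \(\mathcal{A}(0,r)=r\).

```python
import math

def auprc_ceiling(pi, r):
    """Evaluate A(pi, r) for 0 <= pi <= 1 and 0 < r <= 1."""
    if not (0.0 <= pi <= 1.0 and 0.0 < r <= 1.0):
        raise ValueError("expected 0 <= pi <= 1 and 0 < r <= 1")
    if pi == 0.0:
        return r  # limit value: the logarithmic term vanishes as pi -> 0
    denom = 1.0 - pi * r
    if denom == 0.0:
        return 1.0  # only reachable when pi = r = 1
    mixed = r * (1.0 - pi) ** 2 / denom
    log_term = r * pi * (1.0 - pi) * (1.0 - r) / denom ** 2 * math.log(1.0 / (pi * r))
    return pi + mixed + log_term

# Illustrative sweep at a hypothetical positive-prefix rate of r = 0.2.
ceilings = [auprc_ceiling(pi, 0.2) for pi in (0.0, 0.25, 0.5, 0.75, 1.0)]
```

On this sweep the ceiling rises from 0.2 at \(\pi=0\) to 1 at \(\pi=1\), passing through roughly 0.669 at \(\pi=0.5\): the more failure evidence is already visible in the prefix, the higher the AUPRC any causal monitor could attain.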

This explains why ranked AUPRC and deployable alert utility can diverge. WebArena has strong ranking performance, but many failures become distinguishable only late, so a strict 10% false-alarm budget leaves little room for early intervention. In contrast, τ²-Bench and TerminalBench expose more actionable prefix cues, which is why the first-alert diagnostics retain higher failed-trajectory recall under the same low-FAR setting.

DFA Audit

PrefixGuard-DFA is not intended to be the strongest scorer. Its role is to make the learned event abstraction inspectable. Hard event symbols are replayed through a finite-state monitor, then each state can be audited by its empirical risk, representative prefixes, outgoing transitions, and whether it corresponds to a trusted phase or a warning phase.
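The replay-and-audit step can be sketched as follows. The transition table, the symbol names, and the choice to stay in the current state on an unseen transition are assumptions made for illustration; empirical risk is taken here as the share of visits to a state that belong to eventually failed trajectories, which is one plausible reading of the Risk column rather than a confirmed definition.

```python
from collections import defaultdict

def replay(transitions, start, symbols):
    """Return the state reached after each symbol; unseen transitions keep the current state."""
    state, visited = start, []
    for symbol in symbols:
        state = transitions.get((state, symbol), state)
        visited.append(state)
    return visited

def state_risk(transitions, start, traces):
    """traces: list of (symbols, failed). Returns {state: (empirical_risk, visit_count)}."""
    visits, failures = defaultdict(int), defaultdict(int)
    for symbols, failed in traces:
        for state in replay(transitions, start, symbols):
            visits[state] += 1
            failures[state] += int(failed)
    return {q: (failures[q] / visits[q], visits[q]) for q in visits}

# Hypothetical three-state monitor over a two-symbol alphabet.
transitions = {("q0", "a"): "q1", ("q1", "b"): "q2", ("q1", "a"): "q1"}
traces = [(["a", "b"], False), (["a", "a"], True), (["a", "a", "a"], True)]
audit = state_risk(transitions, "q0", traces)
```

Keeping the visit count next to each risk value matters for auditing, since a high risk estimated from a handful of visits is weak evidence; this mirrors the exclusion of low-support states from manual coding.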

The audit surfaces concrete behavioral phases rather than opaque per-prefix scores: browser reset and click-loop states in WebArena, lookup fan-out and policy handoff in τ²-Bench, dependency repair and output verification in SkillsBench, and tool-call failure or late-stage repair in TerminalBench.

WebArena DFA State Alignment

The WebArena DFA has 29 states; 27 trusted states were coded from exemplar StepView prefixes and 2 low-support states were excluded. Warning states are listed first, followed by representative normal states.

State	Behavioral phase	Risk	Eval	\(\bar{t}/T\)	Representative step
Warning states (risk ≥ 0.34)
q0	Early navigation reset	0.857	544	0.25	`click; goto homepage`
q28	Explicit error message	0.548	40	0.81	`type [out of stock...]`
q22	Repetitive click loop	0.518	595	0.40	`click×6 (no type)`
q12	Misaligned search query	0.510	643	0.25	`type [CMU / restaurants near CMU]`
q24	External-search redirect	0.434	276	0.40	`new_tab; goto google.com; type`
q1	Early scroll-and-click	0.342	379	0.25	`click; scroll [down]`
Representative normal states (risk < 0.25)
q17	Productive backtracking	0.085	56	0.83	`go_back×5`
q4	Credential entry	0.099	139	0.67	`type [username]; click`
q26	Task-specific search	0.038	90	0.50	`type [color utility]; click`
q8	Long-form text entry	0.122	80	0.74	`type [multi-sentence message]`
q7	Short-label selection	0.119	111	0.75	`type [feature]; click`

The alignment is a single-coder diagnostic for one WebArena DFA seed. It is evidence that high-risk states can be behaviorally named, not a multi-coder interpretability study.

BibTeX

@article{huang2026prefixguard,
  title={PrefixGuard: From LLM-Agent Traces to Online Failure-Warning Monitors},
  author={Huang, Xinmiao and Hu, Jinwei and Roy, Rajarshi and Wu, Changshun and Dong, Yi and Huang, Xiaowei},
  journal={arXiv preprint arXiv:2605.06455},
  year={2026},
  doi={10.48550/arXiv.2605.06455},
  url={https://arxiv.org/abs/2605.06455}
}