arXiv 2026 · Preprint

Retrospective Harness Optimization

Improving LLM Agents via Self-Preference over Trajectory Rollouts

Wenbo Pan¹, Shujie Liu², Chin-Yew Lin², Jingying Zeng², Xianfeng Tang², Xiangyang Zhou², Yan Lu², Xiaohua Jia¹

¹City University of Hong Kong  ·  ²Microsoft Research Asia

Figure: The RHO pipeline (coreset selection, group rollout, harness proposal). A diverse coreset of past tasks is re-solved in parallel; within- and cross-trajectory signals drive candidate harness edits; the agent's own pairwise self-preference selects the winner. No ground-truth labels are used.

Abstract

AI agents rely on a harness of skills, tools, and workflows to solve complex problems, and continually improving this harness is essential for adapting to new tasks. Existing optimization methods typically require ground-truth validation sets, yet such labeled data is difficult to acquire in practical deployment settings.

We introduce Retrospective Harness Optimization (RHO), a self-supervised method that optimizes the agent harness using only past trajectories. RHO selects a diverse coreset of challenging tasks from past trajectories and re-solves them in parallel. The agent analyzes these rollouts using self-validation and self-consistency, then generates candidate harness updates and selects the most effective one by its own pairwise self-preference.

We evaluate RHO across three diverse domains spanning software engineering, technical work, and knowledge work. Notably, a single optimization round improves the pass rate on SWE-Bench Pro from 59% to 78% without any external grading. Our analysis shows that RHO effectively targets prior failure modes: the optimized harness alters the agent's behavior patterns and sustains higher accuracy during long-horizon sessions.

TL;DR: RHO improves an agent's harness purely from its own unlabeled past trajectories, with no validation set and no external grading. One retrospective pass raises the SWE-Bench Pro pass rate from 59% to 78%.

How it works

Most harness optimizers iterate against a labeled validation set — but real deployments rarely have one. A deployed agent does, however, produce a continuous stream of unlabeled trajectories. RHO turns those into harness improvements in three label-free stages:

1. Coreset Selection. A determinantal point process (DPP) picks a small, difficulty-diverse subset of past tasks to re-solve.
2. Group Rollout. Each coreset task is re-solved G times in parallel, yielding two diagnostic signals: self-validation within a trajectory and self-consistency across trajectories.
3. Harness Proposal. The agent samples N candidate harness edits and keeps the one its own pairwise self-preference ranks highest over the baseline.
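
The coreset step can be illustrated with a minimal sketch. The source states only that a DPP selects the subset; the kernel below (a hypothetical per-task difficulty score multiplied by the cosine similarity of hypothetical task embeddings), the greedy MAP approximation, and every function name are assumptions made for illustration, not the authors' implementation.

```python
import math


def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0  # numerically singular
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def build_kernel(embeddings, difficulty):
    """Assumed L-ensemble kernel: L[i][j] = d_i * cos(e_i, e_j) * d_j."""
    n = len(embeddings)
    return [
        [difficulty[i] * cosine(embeddings[i], embeddings[j]) * difficulty[j]
         for j in range(n)]
        for i in range(n)
    ]


def greedy_dpp(kernel, k):
    """Greedy MAP approximation: repeatedly add the item that maximizes
    the determinant of the selected sub-kernel."""
    selected = []
    remaining = set(range(len(kernel)))
    while len(selected) < k and remaining:
        best, best_det = None, 0.0
        for i in sorted(remaining):
            idx = selected + [i]
            sub = [[kernel[r][c] for c in idx] for r in idx]
            d = det(sub)
            if d > best_det:
                best, best_det = i, d
        if best is None:  # every remaining item is redundant
            break
        selected.append(best)
        remaining.remove(best)
    return selected
```

Under this kernel, a task that duplicates an already selected one contributes a zero determinant and is skipped, while a harder task raises the determinant and is favored. That is one way to read "difficulty-diverse" selection.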

Validation-based vs retrospective optimization

No validation set, no feedback loop

Validation-feedback methods repeatedly score harness edits against labeled data. RHO instead reflects on past trajectories in a single retrospective pass — replacing the external grader with the agent's own self-validation, self-consistency, and self-preference.
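
As a rough sketch of how two of these label-free signals could be computed, the code below scores self-consistency as the share of rollouts that agree with the majority outcome, and selects a harness by its pairwise win rate against the baseline. Both definitions are assumptions: the source does not specify them. The `prefers` callable is a hypothetical stand-in for the agent's own pairwise judgment, and falling back to the baseline when no candidate wins a majority of comparisons is likewise an illustrative choice.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def self_consistency(outcomes: Sequence[Hashable]) -> float:
    """Fraction of rollouts whose outcome matches the most common one."""
    if not outcomes:
        return 0.0
    _, count = Counter(outcomes).most_common(1)[0]
    return count / len(outcomes)


def select_harness(
    baseline: str,
    candidates: Sequence[str],
    tasks: Sequence[str],
    prefers: Callable[[str, str, str], bool],
) -> str:
    """Return the candidate with the highest win rate over the baseline.

    prefers(candidate, baseline, task) is a hypothetical judge that
    returns True when the agent prefers the candidate's rollout on task.
    The baseline is kept unless some candidate wins more than half of
    the comparisons (an assumption of this sketch).
    """
    if not tasks:
        return baseline
    best, best_rate = baseline, 0.5
    for candidate in candidates:
        wins = sum(1 for task in tasks if prefers(candidate, baseline, task))
        rate = wins / len(tasks)
        if rate > best_rate:
            best, best_rate = candidate, rate
    return best
```

In a real system `prefers` would wrap a model call that compares two trajectories; here it is left abstract so the selection logic can be read on its own.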

Results

Held-out pass rate after a single optimization round (Codex + GPT-5.5), against feedback-free baselines under a matched agent-call budget:

| Method | Harness surface | SWE-Bench Pro | Terminal-Bench 2 | GAIA-2 |
| Vanilla Codex | - | 0.59 | 0.71 | 0.29 |
| Dynamic Cheatsheet | Skills | 0.62 (+0.03) | 0.73 (+0.02) | 0.30 (+0.01) |
| ReasoningBank | Memory | 0.61 (+0.02) | 0.73 (+0.02) | 0.28 (−0.01) |
| Sleep-time Compute | Memory | 0.64 (+0.05) | 0.73 (+0.02) | 0.32 (+0.03) |
| RHO (ours) | Skills + Tools | 0.78 (+0.19) | 0.76 (+0.05) | 0.37 (+0.08) |

RHO also beats Meta-Harness, a validation-feedback optimizer, at a matched single-round budget (0.78 vs 0.62 on SWE-Bench Pro) — without ever touching ground-truth labels.
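
The deltas reported in the results table are absolute differences in pass rate from the Vanilla Codex row. The short check below recomputes them from the reported pass rates and adds a relative gain for comparison; the function names are illustrative, and the relative figure is derived here rather than reported by the source.

```python
BENCHMARKS = ("SWE-Bench Pro", "Terminal-Bench 2", "GAIA-2")

# Pass rates copied from the results table.
RESULTS = {
    "Vanilla Codex": (0.59, 0.71, 0.29),
    "Dynamic Cheatsheet": (0.62, 0.73, 0.30),
    "ReasoningBank": (0.61, 0.73, 0.28),
    "Sleep-time Compute": (0.64, 0.73, 0.32),
    "RHO (ours)": (0.78, 0.76, 0.37),
}


def deltas(method, baseline="Vanilla Codex"):
    """Absolute pass-rate change per benchmark, rounded to two decimals."""
    return tuple(
        round(m - b, 2) for m, b in zip(RESULTS[method], RESULTS[baseline])
    )


def relative_gain(method, benchmark, baseline="Vanilla Codex"):
    """Pass-rate change as a fraction of the baseline pass rate."""
    i = BENCHMARKS.index(benchmark)
    base = RESULTS[baseline][i]
    return (RESULTS[method][i] - base) / base
```

For RHO on SWE-Bench Pro this gives an absolute gain of 0.19, which is about 32% relative to the 0.59 baseline.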

Figure: What the optimized harness contains. RHO writes three kinds of content into the harness: task-agnostic instructions, skills that record the environment idiosyncrasies behind past failures, and executable tools. Each targets a failure mode observed in the original trajectories.

Citation

If you find RHO useful, please cite:
@article{pan2026rho,
  title   = {Retrospective Harness Optimization: Improving LLM Agents
             via Self-Preference over Trajectory Rollouts},
  author  = {Pan, Wenbo and Liu, Shujie and Lin, Chin-Yew and Zeng, Jingying
             and Tang, Xianfeng and Zhou, Xiangyang and Lu, Yan and Jia, Xiaohua},
  journal = {arXiv preprint arXiv:2606.05922},
  year    = {2026}
}