<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>The Harness</title><description>Notes on harness engineering. Harness engineering, agentic AI evaluation, and the practical work of making AI agents reliable in architecture, engineering, and construction (AEC).</description><link>https://theharness.blog/</link><item><title>The Harness Is All You Need</title><link>https://theharness.blog/blog/the-harness-is-all-you-need/</link><guid isPermaLink="true">https://theharness.blog/blog/the-harness-is-all-you-need/</guid><description>Why domain-specific agent harnesses, not bigger models, are what close the AI performance gap on real engineering tasks — and why the AEC industry needs proper benchmarks to prove it.</description><pubDate>Fri, 03 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Estimated reading time: 5 minutes&lt;/p&gt;
&lt;p&gt;There is a genre of LinkedIn post that goes something like this: &quot;I gave GPT my floor plan and it produced a schedule in 30 seconds. The industry will never be the same.&quot;&lt;/p&gt;
&lt;p&gt;It&apos;s a compelling demo. It is also not much more than an anecdote.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The gap between a model producing a plausible-looking answer on a single prompt and actually performing reliably across a wide range of real engineering tasks is enormous. But we don&apos;t have the means to measure it.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I have written about this before. In &lt;a href=&quot;/blog/where-capability-actually-lives-in-agentic-engineering/&quot;&gt;Where Capability Actually Lives&lt;/a&gt;, the argument was that in domain-specific work, performance is not a property of the model alone — it is distributed across the model, the tools, the verification, the orchestration, and the output contracts around it. In &lt;a href=&quot;/blog/benchmarking-agents-on-real-engineering-work/&quot;&gt;Benchmarking Agents on Real Engineering Work&lt;/a&gt;, early empirical results supported that claim: when harness support was removed, capability didn&apos;t degrade gracefully. It collapsed. And in &lt;a href=&quot;/blog/what-if-the-harness-could-improve-itself/&quot;&gt;What If the Harness Could Improve Itself?&lt;/a&gt;, we showed that automated improvement of the harness environment compounds in ways that model upgrades alone don&apos;t.&lt;/p&gt;
&lt;p&gt;Those were arguments and early experiments. What we didn&apos;t have was a proper benchmark — one broad enough, multimodal enough, and rigorous enough to make the case at scale.&lt;/p&gt;
&lt;p&gt;Now we do. Or at least a solid start at it.&lt;/p&gt;
&lt;h2&gt;AEC-Bench&lt;/h2&gt;
&lt;p&gt;Together with &lt;a href=&quot;https://www.nomic.ai/news/aec-bench-a-multimodal-benchmark-for-agentic-systems-in-architecture-engineering-and-construction&quot;&gt;Nomic AI&lt;/a&gt;, we have released &lt;a href=&quot;https://github.com/nomic-ai/aec-bench&quot;&gt;&lt;strong&gt;AEC-Bench&lt;/strong&gt;&lt;/a&gt;: the first multimodal benchmark for evaluating AI agents on real-world architecture, engineering, and construction tasks. It is open source under Apache 2.0, and it covers 196 task instances across three complexity levels, from single-sheet understanding to cross-document coordination across drawings, specifications, and RFIs.&lt;/p&gt;
&lt;p&gt;Real documents, real decisions. The kind AEC professionals deal with every day: reading construction drawings, cross-referencing detail callouts, navigating sheet indices, reconciling specs with RFIs. Work where getting the gist doesn&apos;t cut it.&lt;/p&gt;
&lt;p&gt;We evaluated multiple frontier agent configurations — including Claude Code (Opus 4.6, Sonnet 4.6), OpenAI Codex (GPT-5.2, GPT-5.4), and Nomic&apos;s domain-specific agent — across all three complexity tiers. The full results are in the paper, and they are worth reading. But the headline finding has nothing to do with which model scored highest.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The headline finding is that the harness matters more than the model.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;Retrieval Is the Bottleneck, Not Reasoning&lt;/h2&gt;
&lt;p&gt;Agents frequently fail before they get to the hard part — before any engineering judgment is required — because they cannot reliably locate the right sheet or the right cross-reference within a complex multimodal document set. Retrieval has to be embedded in the agent&apos;s logic and reasoning approach. Without the right harness, agents defaulted to treating rich construction drawings as flat text files via pdftotext, a fundamental mismatch with the structure and visual density of real AEC documents.&lt;/p&gt;
&lt;p&gt;The model was fine. The harness was wrong.&lt;/p&gt;
&lt;p&gt;But when the agents were equipped with domain-specific document parsing and retrieval, performance jumped dramatically: gains of twenty to thirty points on the hardest task families, far exceeding what any model upgrade alone could deliver.&lt;/p&gt;
&lt;p&gt;That&apos;s the empirical version of the claim I have been making for months: &lt;strong&gt;in domain-specific work, the operating environment is part of the capability.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;Benchmarks Are Not Leaderboards&lt;/h2&gt;
&lt;p&gt;This matters because it changes what benchmarks are for.&lt;/p&gt;
&lt;p&gt;If you think a benchmark is a leaderboard — a place to crown a winner — then AEC-Bench tells you which agent configuration scored highest. Fine. But that is the least interesting thing it does.&lt;/p&gt;
&lt;p&gt;The more important function of a benchmark like this is diagnostic. It tells you &lt;em&gt;where&lt;/em&gt; the system breaks, &lt;em&gt;why&lt;/em&gt; it breaks, and &lt;em&gt;what kind of intervention&lt;/em&gt; would fix it. Retrieval matters more than reasoning on these tasks. Document understanding is a harder unsolved problem than calculation or code generation. And the difference between a model that scores 40% and a model that scores 70% might not be the model at all — it might be the tools it was given.&lt;/p&gt;
&lt;p&gt;That diagnostic function is why the &quot;it works on my example&quot; style of evaluation is so dangerous. An anecdote has no control group. It has no complexity tiers. It does not distinguish between a model that got lucky on a simple case and a system that performs reliably across difficulty levels and task types. The anecdote tells you what happened once. The benchmark tells you what to expect, and more importantly, what to fix.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If you want AI to work in engineering, stop collecting demos and start building benchmarks.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Benchmarks can ossify. They can reward gaming over genuine capability. But they&apos;re still the only tool that separates signal from anecdote.&lt;/p&gt;
&lt;p&gt;For anyone building AI systems for AEC, or for any domain where the work is artifact-bound and intolerant of plausible but generic answers: rigorous evaluation on real documents at real complexity is the minimum. And the full system has to be in scope — that&apos;s where the capability actually lives.&lt;/p&gt;
&lt;h2&gt;What Comes Next&lt;/h2&gt;
&lt;p&gt;AEC-Bench is a start, but it is only a start. The industry needs benchmarks that evolve — that grow with more task families, more disciplines, and more document types than the current set covers. It needs evaluation infrastructure that supports reproducibility, trajectory analysis, and systematic comparison of harness designs. Tooling that lets you understand &lt;em&gt;how&lt;/em&gt; an agent succeeded or failed, and what that means for the next iteration.&lt;/p&gt;
&lt;p&gt;This is what &lt;strong&gt;aec-bench&lt;/strong&gt; is supposed to be — an open platform for AEC agent evaluation — and we will have more to share soon.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;../../assets/blog/aec-bench/aec-bench-tui.png&quot; alt=&quot;aec-bench: from task templates and instance generation to experiment orchestration, evaluation, and comparison across harness designs — all from the terminal.&quot; /&gt;&lt;/p&gt;
&lt;p&gt;In the meantime, the &lt;a href=&quot;https://arxiv.org/abs/2603.29199&quot;&gt;paper&lt;/a&gt; and benchmark &lt;a href=&quot;https://huggingface.co/datasets/nomic-ai/aec-bench&quot;&gt;dataset&lt;/a&gt; are available now. If you&apos;re interested in making AI work in the built environment, take a look and share your thoughts! And if you&apos;re still relying on anecdotes to judge what these systems can do, consider that you might be optimising for demos when you should be optimising for deployment.&lt;/p&gt;
</content:encoded><author>Theodoros Galanos</author></item><item><title>What If the Harness Could Improve Itself?</title><link>https://theharness.blog/blog/what-if-the-harness-could-improve-itself/</link><guid isPermaLink="true">https://theharness.blog/blog/what-if-the-harness-could-improve-itself/</guid><description>Applying the autoresearch pattern to self-improve an engineering agent harness. Automated prompt optimisation across HVAC audit tasks on Claude and GPT-4.1-mini, showing how harness engineering compounds when the improvement loop runs itself.</description><pubDate>Sat, 14 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Estimated reading time: 25 minutes&lt;/p&gt;
&lt;p&gt;About a week ago, Andrej Karpathy released &lt;a href=&quot;https://github.com/karpathy/autoresearch&quot;&gt;autoresearch&lt;/a&gt;: a system where an LLM agent improves a model training pipeline overnight. The agent edits code, trains for five minutes, checks whether the result improved, and keeps or discards the change. Over roughly 700 experiments, it found around 20 real improvements and cut GPT-2 training time by 11%.&lt;/p&gt;
&lt;p&gt;The design is compact: one markdown file, one editable target (&lt;code&gt;train.py&lt;/code&gt;), one metric (&lt;code&gt;val_bpb&lt;/code&gt;), and a greedy keep-or-revert loop.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;But training code is not the only thing that determines how well an AI system performs.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;In &lt;a href=&quot;/blog/where-capability-actually-lives-in-agentic-engineering/&quot;&gt;Where Capability Actually Lives in Agentic Engineering&lt;/a&gt;, I argued that in domain-specific work, performance is distributed across the whole system: model, tools, prompts, verification, and output contracts. In &lt;a href=&quot;/blog/benchmarking-agents-on-real-engineering-work/&quot;&gt;Benchmarking Agents on Real Engineering Work&lt;/a&gt;, initial results supported that claim. When workflow support was removed, capability did not degrade gracefully. It collapsed.&lt;/p&gt;
&lt;p&gt;That finding raises a question this article tries to answer: &lt;strong&gt;if the environment around the model carries a meaningful share of the capability, can part of that environment begin improving itself?&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If the operating environment carries capability, then improving the environment is improving the capability. And if the improvement loop can be automated, it compounds.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This article applies that pattern to only one part of an engineering evaluation harness: the system prompt that guides audit behaviour. It lays out the design, reports four early experiment blocks on HVAC audit tasks, and examines what this kind of recursive improvement actually surfaces.&lt;/p&gt;
&lt;h2&gt;TL;DR&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;We applied the autoresearch pattern to one surface of our harness: the system prompt that guides audit behaviour.&lt;/li&gt;
&lt;li&gt;The system uses an information barrier so the autoresearcher agent sees patterns in outcomes, not task answers.&lt;/li&gt;
&lt;li&gt;Behavioural feedback from an agentic-bonds classifier gives a richer signal than reward alone by showing how the agent distributed effort across execution, deliberation, exploration, and verification.&lt;/li&gt;
&lt;li&gt;Across the Claude full-reference runs, one prompt change improved mean reward from 0.94 to 0.98 across five task instances. The entire change was two sentences: an explicit per-room verification checklist.&lt;/li&gt;
&lt;li&gt;On Claude at &lt;code&gt;L0&lt;/code&gt;, that full-reference winner backfired. Confidence thresholds and cross-room consistency instead improved reward from 0.73 to 1.0 on one instance.&lt;/li&gt;
&lt;li&gt;On GPT-4.1-mini at &lt;code&gt;L0&lt;/code&gt;, the same high-level strategy transferred, but the wording had to become a hard verification gate, and the model still topped out at 0.83.&lt;/li&gt;
&lt;li&gt;For Claude, perfect runs were &lt;em&gt;more&lt;/em&gt; execution-heavy and &lt;em&gt;less&lt;/em&gt; verification-heavy than partial ones. The difference was timing, not quantity: the strongest runs executed first and verified later in concentrated blocks.&lt;/li&gt;
&lt;li&gt;The main lesson is conditional. Prompt strategy depends both on the information available and on the model receiving the prompt.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Why Autoresearch Works&lt;/h2&gt;
&lt;p&gt;Karpathy&apos;s design works because it fixes the evaluation loop into a form that can run unattended.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Fixed time budget.&lt;/strong&gt; Every experiment runs for exactly five minutes. Scores are comparable because improvements cannot hide behind longer training.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Single artefact.&lt;/strong&gt; The agent edits one file. That constrains the search space and keeps diffs reviewable.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Binary keep/discard.&lt;/strong&gt; The metric improved or it did not. No subjective judgement required.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Git as memory.&lt;/strong&gt; Every experiment is a commit, so the branch tip stays at the best known configuration.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As &lt;a href=&quot;https://delip.github.io/mini-apps/annotated-autoresearch/&quot;&gt;Delip Rao&apos;s prompt anatomy&lt;/a&gt; notes, &lt;code&gt;program.md&lt;/code&gt; is not a loose prompt but an operating procedure: context, constraints, target, audit trail, loop, and safety valves. That is why the system works.&lt;/p&gt;
&lt;p&gt;As &lt;a href=&quot;https://www.philschmid.de/autoresearch&quot;&gt;Philipp Schmid&lt;/a&gt; notes, when experiments run far faster than a human can manage, the bottleneck becomes the evaluation system. If the harness is a bottleneck, improving the harness becomes a high-leverage move.&lt;/p&gt;
&lt;h2&gt;Why Engineering Harnesses Are Different from Software Agent Harnesses&lt;/h2&gt;
&lt;p&gt;Translating the autoresearch pattern to engineering harnesses requires three design changes.&lt;/p&gt;
&lt;h3&gt;The information barrier&lt;/h3&gt;
&lt;p&gt;In autoresearch, the autoresearcher agent sees everything: full training logs, exact loss curves, model weights. In our case, the artefact is a system prompt. If the autoresearcher could see the task content or planted errors, it could bake that knowledge into the prompt and overfit the benchmark.&lt;/p&gt;
&lt;p&gt;So we need an information barrier between raw trial outputs and the autoresearcher. The autoresearcher sees &lt;em&gt;patterns&lt;/em&gt; — &quot;3 of the last 5 runs had incomplete coverage&quot; — but never &lt;em&gt;answers&lt;/em&gt;. That is the key structural difference from autoresearch.&lt;/p&gt;
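&lt;p&gt;The barrier can be made concrete as a small sanitisation step. The sketch below is illustrative, not the real &lt;code&gt;feedback.py&lt;/code&gt;: the trial fields and the function name are assumptions. The point is only that raw fields such as transcripts and task content never appear in the output.&lt;/p&gt;

```python
from collections import Counter

def sanitise_feedback(trials):
    """Reduce raw trial outputs to patterns the autoresearcher may see.

    Each trial dict is assumed to hold a numeric reward, a list of abstract
    failure categories, and raw fields (transcripts, task content) that
    must never cross the barrier.
    """
    rewards = [t["reward"] for t in trials]
    categories = Counter(c for t in trials for c in t["failure_categories"])
    # Only aggregates and abstract labels cross the barrier.
    return {
        "mean_reward": sum(rewards) / len(rewards),
        "n_trials": len(trials),
        "failure_patterns": dict(categories),
    }
```

&lt;p&gt;A pattern like &lt;code&gt;{&quot;incomplete_coverage&quot;: 3}&lt;/code&gt; tells the autoresearcher where to push without telling it what the answers were.&lt;/p&gt;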
&lt;h3&gt;More Than One Score&lt;/h3&gt;
&lt;p&gt;Autoresearch has one scalar: &lt;code&gt;val_bpb&lt;/code&gt;. Engineering harnesses need more. A score of 0.7 could mean missed findings, formatting errors, or a timeout. The autoresearcher needs to know not just how well the agent scored, but &lt;em&gt;how it behaved&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;We address this with three feedback channels:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reward&lt;/strong&gt; — a float between 0.0 and 1.0 produced by a deterministic verifier that scores how completely and correctly the agent completed the audit task.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Failure categories&lt;/strong&gt; — abstracted failure modes without task-specific details: &lt;code&gt;incomplete_coverage&lt;/code&gt;, &lt;code&gt;incorrect_values&lt;/code&gt;, &lt;code&gt;format_error&lt;/code&gt;, &lt;code&gt;timeout&lt;/code&gt;, &lt;code&gt;false_positives&lt;/code&gt;, &lt;code&gt;clean_miss&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Behavioural profile&lt;/strong&gt; — a classification of each turn in the trace using our &lt;a href=&quot;https://arxiv.org/abs/2601.06002&quot;&gt;agentic-bonds classifier&lt;/a&gt;. It labels turns as execution, deliberation, exploration, or verification, and returns the distribution, temporal sequence, and a short narrative summary.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That gives the autoresearcher more than a bare score. It can tell whether the agent stopped verifying, front-loaded execution, or spent too long deliberating.&lt;/p&gt;
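&lt;p&gt;To make the middle channel concrete, here is a hedged sketch of how abstract categories might be derived from a verifier result. The field names are assumptions and the real verifier covers more cases (it also flags &lt;code&gt;incorrect_values&lt;/code&gt;); the point is that only category labels, never task specifics, are emitted.&lt;/p&gt;

```python
def failure_categories(result):
    """Map a verifier result onto abstract failure categories.

    `result` is an assumed shape: counts of planted and found errors,
    a count of spurious findings, and format/timeout flags.
    """
    cats = []
    if result["timed_out"]:
        cats.append("timeout")
    if not result["format_ok"]:
        cats.append("format_error")
    missed = result["planted"] - result["found"]
    if missed and result["found"]:
        cats.append("incomplete_coverage")   # caught some, missed some
    elif missed:
        cats.append("clean_miss")            # caught nothing it should have
    if result["spurious"]:
        cats.append("false_positives")
    return cats
```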
&lt;h3&gt;The Target Is the Workflow Prompt&lt;/h3&gt;
&lt;p&gt;Autoresearch optimises model architecture and hyperparameters. The artefact is code that defines how a neural network trains. In our case, the artefact is a system prompt: workflow instructions for audit work. It changes how the agent works, not what it knows. Prompts were the easiest surface to start with: easy to isolate, easy to diff, and easy to evaluate. As the harness becomes more structured, other surfaces like task generation, scoring rubrics, and tool workflows become plausible optimisation targets too.&lt;/p&gt;
&lt;p&gt;This is closer to optimising an organisation&apos;s operating procedure than to optimising code. The question is not &quot;what parameters produce the lowest loss?&quot; but &quot;what workflow instructions produce the most reliable engineering review?&quot;&lt;/p&gt;
&lt;h2&gt;The Design: Automated Prompt Optimisation for HVAC Audit Tasks&lt;/h2&gt;
&lt;p&gt;The system has three layers with a strict information flow.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;../../assets/blog/autoresearch/architecture_information_flow.svg&quot; alt=&quot;Architecture of the autoresearch setup. A local autoresearcher agent edits the workflow prompt and reads sanitised feedback, support scripts enforce the information barrier and produce summaries, and the experiment sandbox runs the unchanged evaluation harness behind that boundary.&quot; /&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The autoresearcher agent&lt;/strong&gt; runs locally, driven by Claude Code following a &lt;code&gt;program.md&lt;/code&gt; — the same pattern as autoresearch. It edits system prompts, makes git commits, reads sanitised feedback, and decides whether to keep or revert each change. It never sees task content, verifier code, or raw conversation transcripts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Support scripts&lt;/strong&gt; handle the structured mechanics. &lt;code&gt;run_experiment.py&lt;/code&gt; triggers a Harbor job in a Docker sandbox with the current system prompt. &lt;code&gt;feedback.py&lt;/code&gt; enforces the information barrier: it reads the raw trial outputs, strips task-specific content, runs the bonds classifier, and returns a sanitised JSON summary. &lt;code&gt;results.py&lt;/code&gt; manages the TSV audit trail.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The experiment sandbox&lt;/strong&gt; is unchanged from the existing evaluation harness — the same Docker containers, agents, verifiers, and output contracts used in the benchmark work from the previous article. The only thing that changes between iterations is the system prompt file mounted into the container.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The loop mirrors autoresearch exactly:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Edit the system prompt&lt;/li&gt;
&lt;li&gt;Git commit&lt;/li&gt;
&lt;li&gt;Run experiment in sandbox&lt;/li&gt;
&lt;li&gt;Read sanitised feedback&lt;/li&gt;
&lt;li&gt;Keep if reward improved, revert if not&lt;/li&gt;
&lt;li&gt;Repeat&lt;/li&gt;
&lt;/ol&gt;
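&lt;p&gt;The loop is simple enough to sketch end to end. The three callables below stand in for the real pieces (prompt editing, the sandboxed Harbor run, and the sanitised feedback channel), so their names and signatures are assumptions rather than the actual support scripts.&lt;/p&gt;

```python
import subprocess

def git(args):
    # Git is the experiment memory: every attempt is a commit.
    subprocess.run(["git"] + args, check=True)

def optimise(propose_edit, run_experiment, read_feedback, n_iterations=10):
    """Greedy keep-or-revert loop over the system prompt (a sketch)."""
    best = read_feedback()["mean_reward"]           # score the baseline
    for i in range(n_iterations):
        propose_edit()                              # 1. edit the prompt
        git(["commit", "-am", f"experiment {i}"])   # 2. commit the attempt
        run_experiment()                            # 3. run in the sandbox
        reward = read_feedback()["mean_reward"]     # 4. read sanitised feedback
        if reward > best:                           # 5. keep if improved...
            best = reward
        else:
            git(["revert", "--no-edit", "HEAD"])    #    ...revert if not
    return best
```

&lt;p&gt;Because losing changes are reverted immediately, the branch tip always sits at the best known prompt, exactly as in autoresearch.&lt;/p&gt;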
&lt;p&gt;The &lt;code&gt;program.md&lt;/code&gt; is organised into eleven sections: purpose, setup, scope constraints, artefact definition, experiment execution, feedback reading, optimisation targets, the loop itself, logging, autonomy mandate, and information discipline. The information discipline section is the one that has no analogue in Karpathy&apos;s version: it explicitly instructs the autoresearcher not to attempt reading task files or verifier code, and frames any desire to know specific task details as a signal to focus on process guidance instead.&lt;/p&gt;
&lt;h2&gt;What Behavioural Feedback Adds to Agent Evaluation&lt;/h2&gt;
&lt;p&gt;The agentic-bonds classifier gives the autoresearcher something autoresearch does not have: a temporal view of how the agent distributed its effort across the trace.&lt;/p&gt;
&lt;p&gt;Each assistant turn in the conversation trace is classified as one of four types:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Execution&lt;/strong&gt; — doing the obvious next step: calling tools, formatting output, writing results&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deliberation&lt;/strong&gt; — committed reasoning about &lt;em&gt;how&lt;/em&gt; to solve a problem, step by step&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Exploration&lt;/strong&gt; — comparing alternatives, deciding &lt;em&gt;what&lt;/em&gt; to do&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Verification&lt;/strong&gt; — checking backward at its own work, comparing results against expectations&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The autoresearcher sees this at three levels: the &lt;strong&gt;bond profile&lt;/strong&gt; (distribution), the &lt;strong&gt;bond sequence&lt;/strong&gt; (temporal pattern), and a short &lt;strong&gt;bond narrative&lt;/strong&gt;.&lt;/p&gt;
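&lt;p&gt;The first two levels are cheap to compute once turns are labelled. A minimal sketch, assuming the classifier has already produced one label per assistant turn (the function names here are mine, not the classifier&apos;s API):&lt;/p&gt;

```python
from collections import Counter

def bond_profile(turn_labels):
    """Distribution of turn types over a trace."""
    counts = Counter(turn_labels)
    total = len(turn_labels)
    return {label: round(count / total, 3) for label, count in counts.items()}

def bond_sequence(turn_labels):
    """Compact temporal view, e.g. 'exec | exec | verify'."""
    short = {"execution": "exec", "deliberation": "delib",
             "exploration": "explore", "verification": "verify"}
    return " | ".join(short[label] for label in turn_labels)
```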
&lt;h3&gt;What the classifier actually found&lt;/h3&gt;
&lt;p&gt;We classified all 30 Claude Sonnet 4.6 traces from these runs — 12 perfect (reward 1.0) and 18 partial (reward 0.9). The classifier should be read as a behavioural lens rather than ground truth about cognition, but even on that basis the headline finding was counterintuitive.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;../../assets/blog/autoresearch/bonds_comparison.svg&quot; alt=&quot;Behavioural comparison across 30 traces. Perfect traces were more execution-heavy and less verification-heavy than partial traces, with deliberation remaining low in both groups.&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Perfect traces are more execution-heavy, not more verification-heavy.&lt;/strong&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Execution&lt;/th&gt;
&lt;th&gt;Verification&lt;/th&gt;
&lt;th&gt;Deliberation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Perfect (1.0)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;55.5%&lt;/td&gt;
&lt;td&gt;41.9%&lt;/td&gt;
&lt;td&gt;2.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Partial (0.9)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;45.3%&lt;/td&gt;
&lt;td&gt;49.0%&lt;/td&gt;
&lt;td&gt;5.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Exploration was effectively absent in this set of traces, which is why it does not appear as a meaningful part of the comparison.&lt;/p&gt;
&lt;p&gt;The agents that scored perfectly spent &lt;em&gt;less&lt;/em&gt; time verifying than the ones that missed an error. The execution-to-verification ratio shows the same pattern: 1.71 for perfect traces versus 1.06 for partial ones. This does not mean verification is unhelpful. It means badly timed verification correlates with worse performance.&lt;/p&gt;
&lt;p&gt;Other patterns we see in the data:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Perfect traces execute first, verify later.&lt;/strong&gt; In the first half of the trace, perfect agents spent only 28% of turns on verification, compared to 41% for partial agents. By the second half, both groups converged around 55%. Perfect agents front-loaded decisive execution, then verified in concentrated blocks. Partial agents hedged earlier — which appears to reflect uncertainty rather than thoroughness.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Perfect traces build momentum.&lt;/strong&gt; The average longest uninterrupted execution streak was 3.8 turns for perfect traces, compared to 2.4 for partial. Perfect agents commit to a direction and sustain it. Partial agents interrupt themselves to verify before they have built enough context for verification to be useful.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deliberation correlates with imperfection.&lt;/strong&gt; Only 25% of perfect traces contained any deliberation turns, compared to 56% of partial traces. When the agent pauses to reason about &lt;em&gt;how&lt;/em&gt; to proceed, it is more likely to miss something. This suggests that deliberation in these traces signals uncertainty or confusion, not carefulness.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Start patterns diverge.&lt;/strong&gt; 58% of perfect traces opened with three consecutive execution turns (&lt;code&gt;exec | exec | exec&lt;/code&gt;). Only 28% of partial traces did the same — most interrupted with verification or deliberation by the third turn.&lt;/li&gt;
&lt;/ul&gt;
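&lt;p&gt;The statistics above are straightforward to derive from a bond sequence. A sketch (the metric names are mine, not the classifier&apos;s output schema):&lt;/p&gt;

```python
def trace_metrics(seq):
    """Summary statistics over a list of turn labels such as
    'exec' / 'verify' / 'delib'."""
    half = len(seq) // 2
    first_half = seq[:half]
    longest = streak = 0
    for label in seq:
        # Track the longest uninterrupted run of execution turns.
        streak = streak + 1 if label == "exec" else 0
        longest = max(longest, streak)
    return {
        "exec_verify_ratio": seq.count("exec") / max(seq.count("verify"), 1),
        "first_half_verify_share": first_half.count("verify") / max(len(first_half), 1),
        "longest_exec_streak": longest,
    }
```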
&lt;h3&gt;What this means for prompt design&lt;/h3&gt;
&lt;p&gt;The behavioural data sharpens the prompt improvement story. The successful change — adding a systematic checklist &lt;code&gt;(a) inputs, (b) components, (c) ventilation, (d) totals&lt;/code&gt; — did not add verification. It structured the execution phase so the agent could work systematically before verifying.&lt;/p&gt;
&lt;p&gt;In other words, the checklist made the agent behave more like the perfect traces already did: decisive execution, then concentrated verification.&lt;/p&gt;
&lt;h2&gt;Early Results: How Small Harness Changes Moved HVAC Audit Performance&lt;/h2&gt;
&lt;p&gt;We ran four experiment blocks on the system prompt that guides HVAC schedule audit tasks. One established the single-instance trap, one found a prompt improvement across five full-reference instances, one showed that the best strategy changes at &lt;code&gt;L0&lt;/code&gt;, and one repeated the &lt;code&gt;L0&lt;/code&gt; setup on a weaker model.&lt;/p&gt;
&lt;h3&gt;Claude on One Full-Reference Task&lt;/h3&gt;
&lt;p&gt;The first Claude Sonnet 4.6 run tested five prompt modifications against a single task instance (adelaide-15rm). The baseline scored 0.9 — four of five planted errors detected.&lt;/p&gt;
&lt;p&gt;Five strategies were tried: a planning step, cascade-error reporting, field-by-field verification, larger batches, and a re-verification pass. None moved the reward.&lt;/p&gt;
&lt;p&gt;The autoresearcher reasonably concluded that the fifth error looked like a model capability boundary rather than a prompt issue. The one useful finding was about efficiency: the larger-batch approach achieved the same accuracy with fewer turns and tokens — 9 turns and 98k tokens versus the baseline&apos;s 10 turns and 105k.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;That conclusion turned out to be wrong.&lt;/strong&gt; Not because the reasoning was poor, but because the evidence was insufficient. Single-instance evaluation made a task-specific ceiling look like a universal one.&lt;/p&gt;
&lt;h3&gt;Claude on Five Full-Reference Tasks&lt;/h3&gt;
&lt;p&gt;The second Claude Sonnet 4.6 run tested four modifications against five instances simultaneously (Adelaide, Brisbane, Darwin, Melbourne, Perth). The baseline mean reward was 0.94: three instances at 0.9 and two at 1.0.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;../../assets/blog/autoresearch/session2_trajectory.svg&quot; alt=&quot;Claude full-reference trajectory across five task instances. Most prompt changes only added cost, while the systematic checklist improved mean reward to 0.98 with a modest token increase.&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Three approaches failed:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Iteration 1&lt;/strong&gt; added a completeness rule: &quot;verify every room, do not skip rooms, a missed error costs more than an extra turn.&quot; No reward change. The instruction was redundant.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Iteration 2&lt;/strong&gt; added zero tolerance for numerical mismatches: &quot;report ANY difference, do not dismiss as rounding.&quot; No reward change, but efficiency degraded sharply — 64 turns and 685k tokens versus the baseline&apos;s 52 turns and 553k. The agent re-checked more and found nothing new.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Iteration 3&lt;/strong&gt; restructured the workflow to a two-pass approach: sweep all totals first, then deep-dive flagged rooms. No reward improvement, and token cost nearly doubled — 79 turns and 1.08 million tokens. It mostly shuffled which errors were caught without improving overall coverage.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Iteration 4 worked.&lt;/strong&gt; The change was small: two sentences were extended. The orient step added design conditions. The verify step added an explicit per-room checklist: (a) input parameters, (b) each heat gain component, (c) ventilation terms, (d) totals. Mean reward rose from 0.94 to 0.98. Adelaide and Melbourne both jumped from 0.9 to 1.0. Only Darwin remained at 0.9. The efficiency cost was modest: 55 turns and 613k tokens, an 11% increase over baseline.&lt;/p&gt;
&lt;p&gt;Here is the actual diff — the entire improvement:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;-1. **Orient (1 turn):** Read the schedule and identify the number of rooms
-   and available tools. Count your rooms — this determines your turn budget.
-2. **Verify (1 turn per 2-3 rooms):** For each batch of rooms, call the
-   calculation tool, then compare results against the schedule values.
-   Note discrepancies immediately.
+1. **Orient (1 turn):** Read the schedule and identify the number of rooms
+   and available tools. Count your rooms — this determines your turn budget.
+   Before proceeding, note the design conditions: outdoor temperature,
+   humidity, and any building-level parameters.
+2. **Verify (1 turn per 2-3 rooms):** For each batch of rooms, call the
+   calculation tool, then compare results against the schedule values.
+   For each room, systematically check: (a) input parameters (occupancy,
+   area, volume), (b) each heat gain component, (c) ventilation terms,
+   (d) totals. Note discrepancies immediately.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The improvement came from making an implicit checking procedure explicit.&lt;/p&gt;
&lt;h3&gt;What the failed approaches have in common&lt;/h3&gt;
&lt;p&gt;The three failed approaches all tried to &lt;strong&gt;add more work&lt;/strong&gt;. The successful one &lt;strong&gt;structured the existing work differently&lt;/strong&gt;.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Iteration&lt;/th&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Reward&lt;/th&gt;
&lt;th&gt;Token cost vs baseline&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Baseline&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;0.94&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Add completeness rule&lt;/td&gt;
&lt;td&gt;0.94&lt;/td&gt;
&lt;td&gt;-1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Zero tolerance&lt;/td&gt;
&lt;td&gt;0.94&lt;/td&gt;
&lt;td&gt;+24%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Two-pass restructure&lt;/td&gt;
&lt;td&gt;0.94&lt;/td&gt;
&lt;td&gt;+95%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Systematic checklist&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.98&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+11%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;That pattern — structure matters more than volume — echoes the previous article&apos;s result that the strongest models did not simply verify more; they verified as part of a structured workflow.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;../../assets/blog/autoresearch/strategy_contrast.svg&quot; alt=&quot;Strategy contrast between the full-reference and L0 winners. The full-reference prompt benefits from systematic granular verification, while the L0 prompt benefits from uncertainty management and cross-room consistency.&quot; /&gt;&lt;/p&gt;
&lt;h3&gt;Claude at L0&lt;/h3&gt;
&lt;p&gt;The Claude &lt;code&gt;L0&lt;/code&gt; run targeted the hardest reference level: &lt;code&gt;audit-office-building-L0&lt;/code&gt;. Here the agent receives only room type, floor area, ceiling height, and location. It does not receive design conditions, formulas, or lookup tables. The prompt is operating in a much thinner information environment.&lt;/p&gt;
&lt;p&gt;The baseline reward on the Adelaide &lt;code&gt;L0&lt;/code&gt; instance was 0.73. The agent found 2 of 3 issues, but it also produced false positives.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;../../assets/blog/autoresearch/session3_l0_trajectory.svg&quot; alt=&quot;Claude L0 trajectory. Confidence thresholds removed false positives, the five-instance full-reference checklist backfired, and cross-room consistency delivered a perfect score with sharply lower token usage.&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The first useful change was not a better checking procedure. It was a confidence threshold: &lt;strong&gt;only report an error if the model can identify the specific assumption, parameter, or formula that appears to be wrong&lt;/strong&gt;. That lifted reward from 0.73 to 0.83 by eliminating false positives.&lt;/p&gt;
&lt;p&gt;The next result was the important one. The full-reference winner — the systematic per-room checklist — did not generalise. It backfired. Reward dropped from 0.83 to 0.53, and false positives returned. At full-reference levels, granular checks help because the agent can anchor them against formulas and tables. At &lt;code&gt;L0&lt;/code&gt;, the same granularity forces the agent to compare schedule values against assumptions it is partly reconstructing from memory.&lt;/p&gt;
&lt;p&gt;The winning change was different in kind. Adding a cross-room consistency rule — compare similar rooms against one another and investigate large relative differences — lifted reward from 0.83 to 1.0 on this instance, while also reducing token usage sharply.&lt;/p&gt;
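&lt;p&gt;A minimal sketch of what such a consistency rule could look like in code. The field names, the median-based comparison, and the 25% threshold are assumptions for illustration, not the harness&apos;s actual implementation:&lt;/p&gt;

```python
from collections import defaultdict

def cross_room_flags(schedule, threshold=0.25):
    """Flag rooms whose load per unit area deviates sharply from the median
    of other rooms of the same type. Field names are illustrative."""
    by_type = defaultdict(list)
    for room in schedule:
        by_type[room["type"]].append(room)
    flags = []
    for rtype, rooms in by_type.items():
        if len(rooms) == 1:
            continue  # nothing to compare against
        intensities = sorted(r["total_w"] / r["area_m2"] for r in rooms)
        median = intensities[len(intensities) // 2]
        for r in rooms:
            rel = abs(r["total_w"] / r["area_m2"] - median) / median
            if rel > threshold:
                flags.append((r["name"], round(rel, 2)))
    return flags
```

&lt;p&gt;Notice that nothing here requires a formula, a lookup table, or a design condition. The rule only needs the schedule itself, which is exactly why it survives at &lt;code&gt;L0&lt;/code&gt; when granular checks do not.&lt;/p&gt;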
&lt;p&gt;That is a strong result, but it is still single-instance. It shows that the strategy can change sharply when reference material is removed. It does not yet prove that this is the best &lt;code&gt;L0&lt;/code&gt; prompt in general.&lt;/p&gt;
&lt;p&gt;What it does show is that prompt optimisation is reference-level-specific. At high reference levels, the prompt benefited from explicit granular verification. At &lt;code&gt;L0&lt;/code&gt;, the better strategy was uncertainty management plus relative comparison.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;L0 Iteration&lt;/th&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Reward&lt;/th&gt;
&lt;th&gt;Tokens In&lt;/th&gt;
&lt;th&gt;Outcome&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Baseline&lt;/td&gt;
&lt;td&gt;Original prompt&lt;/td&gt;
&lt;td&gt;0.73&lt;/td&gt;
&lt;td&gt;304k&lt;/td&gt;
&lt;td&gt;2/3 findings + false positives&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Confidence threshold&lt;/td&gt;
&lt;td&gt;0.83&lt;/td&gt;
&lt;td&gt;248k&lt;/td&gt;
&lt;td&gt;False positives removed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Systematic checklist&lt;/td&gt;
&lt;td&gt;0.53&lt;/td&gt;
&lt;td&gt;407k&lt;/td&gt;
&lt;td&gt;Backfired&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Cross-room consistency&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;td&gt;104k&lt;/td&gt;
&lt;td&gt;3/3 findings, no false positives&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Add design-conditions note&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;td&gt;258k&lt;/td&gt;
&lt;td&gt;Same reward, worse efficiency&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;GPT-4.1-mini at L0&lt;/h3&gt;
&lt;p&gt;The GPT-4.1-mini &lt;code&gt;L0&lt;/code&gt; run repeated the same task on the weakest model from the earlier benchmark. The baseline was much worse than Claude&apos;s: reward 0.37, only 1 of 3 planted errors found, and 18 false positives.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;../../assets/blog/autoresearch/session4_gpt41mini_trajectory.svg&quot; alt=&quot;GPT-4.1-mini L0 trajectory. The winning verification-gate iteration eliminates false positives and reaches 0.83, while several later prompt additions degrade or crash performance entirely.&quot; /&gt;&lt;/p&gt;
&lt;p&gt;What transferred was the strategy, not the exact wording. The soft confidence-threshold language that helped Claude did not work here. But once &lt;strong&gt;the instruction was rewritten as a hard verification gate&lt;/strong&gt; — do not report a discrepancy unless the tool confirms it — false positives disappeared and reward jumped to 0.83.&lt;/p&gt;
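&lt;p&gt;The difference between the advisory threshold and the hard gate is easiest to see as code. A hedged sketch, with &lt;code&gt;calc_tool&lt;/code&gt; and the finding fields as hypothetical stand-ins:&lt;/p&gt;

```python
def gated_findings(candidates, calc_tool, tolerance=0.02):
    """Keep only candidate discrepancies that the calculation tool confirms.
    `calc_tool` and the finding fields are illustrative stand-ins."""
    confirmed = []
    for finding in candidates:
        recomputed = calc_tool(finding["room_inputs"])[finding["field"]]
        reported = finding["schedule_value"]
        # Hard gate: report only if the tool reproduces the discrepancy.
        if abs(reported - recomputed) > tolerance * max(abs(recomputed), 1.0):
            confirmed.append({**finding, "expected": recomputed})
    return confirmed
```

&lt;p&gt;An advisory threshold asks the model to judge its own confidence; the gate removes the judgement call entirely and makes tool confirmation a precondition for reporting anything.&lt;/p&gt;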
&lt;p&gt;That shows two things at once. First, the underlying prompt idea generalises across model families: reduce unsupported claims, rely on internal consistency, and force the model to ground discrepancies. Second, the wording still has to match the model&apos;s instruction-following style.&lt;/p&gt;
&lt;p&gt;The ceiling also remained model-specific. GPT-4.1-mini improved dramatically, but it did not reach Claude&apos;s &lt;code&gt;L0&lt;/code&gt; result. The best run found 2 of 3 issues with no false positives.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;../../assets/blog/autoresearch/crossmodel_l0_contrast.svg&quot; alt=&quot;Cross-model L0 contrast. The same broad L0 strategy transfers across Claude Sonnet 4.6 and GPT-4.1-mini, but the prompt wording and final ceiling differ substantially.&quot; /&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Baseline&lt;/th&gt;
&lt;th&gt;Best&lt;/th&gt;
&lt;th&gt;Winning strategy&lt;/th&gt;
&lt;th&gt;Constraint style&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
&lt;td&gt;0.73&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;td&gt;Confidence threshold + cross-room consistency&lt;/td&gt;
&lt;td&gt;Advisory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4.1-mini&lt;/td&gt;
&lt;td&gt;0.37&lt;/td&gt;
&lt;td&gt;0.83&lt;/td&gt;
&lt;td&gt;Verification gate + cross-room consistency&lt;/td&gt;
&lt;td&gt;Prohibitive&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The early behavioural picture suggests the two wins differ not only in wording, but also in how the models stabilise. Claude&apos;s successful &lt;code&gt;L0&lt;/code&gt; run remained relatively execution-heavy. GPT-4.1-mini&apos;s best run was much more verification-heavy.&lt;/p&gt;
&lt;h2&gt;What the Results Suggest for Harness Engineering in AEC&lt;/h2&gt;
&lt;h3&gt;Self-improving harnesses are feasible&lt;/h3&gt;
&lt;p&gt;The system worked. The autoresearcher agent formed hypotheses from sanitised feedback, made targeted prompt changes, measured their effect, and advanced the branch when something improved. The information barrier appeared to hold, and some improvements generalised across instances.&lt;/p&gt;
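&lt;p&gt;Stripped of detail, the loop is a simple hill climb. In this sketch, &lt;code&gt;propose_edit&lt;/code&gt; and &lt;code&gt;evaluate&lt;/code&gt; are stand-ins for the autoresearcher&apos;s hypothesis step and the sanitised multi-instance evaluation; the names and structure are illustrative, and the information barrier lives in the fact that only aggregate rewards flow back into the loop:&lt;/p&gt;

```python
import statistics

def autoresearch(base_prompt, propose_edit, evaluate, iterations=4):
    """Minimal hill-climbing sketch of the improvement loop. `evaluate`
    returns per-instance rewards; only their aggregate crosses the
    information barrier back to the proposer."""
    best_prompt = base_prompt
    best_reward = statistics.mean(evaluate(base_prompt))
    history = [("baseline", best_reward)]
    for i in range(iterations):
        candidate = propose_edit(best_prompt, history)  # hypothesis -> edit
        reward = statistics.mean(evaluate(candidate))   # measure its effect
        history.append((f"iteration {i + 1}", reward))
        if reward > best_reward:                        # advance the branch
            best_prompt, best_reward = candidate, reward
    return best_prompt, best_reward, history
```

&lt;p&gt;The real system is richer than this, but the mechanism being validated is exactly this cycle: hypothesise, edit, measure, advance only on improvement.&lt;/p&gt;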
&lt;p&gt;This is obviously a small result, but it validates the basic mechanism.&lt;/p&gt;
&lt;h3&gt;The environment is improvable, not just measurable&lt;/h3&gt;
&lt;p&gt;The previous two articles established that the operating environment carries capability and that this dependence is measurable. This article adds a narrower claim: at least one important part of that environment, the workflow prompt, is improvable through automated search. &lt;strong&gt;If harness improvement can be partially automated, investment in evaluation infrastructure compounds&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;Single-instance evaluation misleads&lt;/h3&gt;
&lt;p&gt;The single-instance Claude run concluded that the reward ceiling was a model capability boundary. The five-instance Claude run proved that wrong. The ceiling was instance-specific. Multi-instance evaluation was necessary to discover both that the prompt could improve and which prompt changes actually generalised.&lt;/p&gt;
&lt;p&gt;This point travels beyond this setup. Evaluating on a single test case can make genuine improvement opportunities look like hard limits.&lt;/p&gt;
&lt;h3&gt;Structure beats volume&lt;/h3&gt;
&lt;p&gt;The most expensive failed approach cost nearly twice the baseline in tokens and produced no improvement. The successful approach cost 11% more and raised reward by 4 percentage points. It is possible that, in engineering review work, telling an agent &lt;em&gt;how&lt;/em&gt; to check is more effective than telling it to check &lt;em&gt;more&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;That said, the Claude &lt;code&gt;L0&lt;/code&gt; run adds an important qualifier. The right structure depends on the reference level. On full-reference tasks, structure meant explicit component-by-component verification. On &lt;code&gt;L0&lt;/code&gt;, structure meant constraining when the agent should trust its own judgement and shifting toward relative comparisons inside the schedule itself.&lt;/p&gt;
&lt;h3&gt;Prompt optimisation is reference-level-specific&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;L0&lt;/code&gt; run is the strongest evidence in the whole set that there may be no single best workflow prompt. The same checklist that helped at full reference actively hurt at &lt;code&gt;L0&lt;/code&gt;. Removing reference material changed what kind of prompt guidance was useful.&lt;/p&gt;
&lt;p&gt;That suggests a more general rule: prompt quality is conditional on information availability. A prompt that works well when the environment supplies formulas, tables, and design conditions may fail when the model has to reconstruct too much of that context from memory.&lt;/p&gt;
&lt;h3&gt;Strategy transfers across models, wording does not&lt;/h3&gt;
&lt;p&gt;The GPT-4.1-mini run adds a second qualifier. The high-level idea that worked at &lt;code&gt;L0&lt;/code&gt; transferred across model families: suppress unsupported discrepancies and use relative comparisons when absolute references are weak. But the wording had to change. GPT-4.1-mini did not respond reliably to soft advisory language. It improved only when the same idea was expressed as a hard gate.&lt;/p&gt;
&lt;p&gt;That suggests prompt portability has two layers. Strategy may travel. Surface phrasing may not. It also suggests that prompt optimisation has a ceiling. GPT-4.1-mini improved much more in relative terms than Claude, but it still stopped well short of Claude&apos;s absolute result.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;../../assets/blog/autoresearch/bonds_comparison_crossmodel_l0.svg&quot; alt=&quot;Cross-model behavioural comparison at L0. Claude&apos;s better traces remain more execution-heavy, while GPT-4.1-mini&apos;s failure cases carry more deliberation and its best run leans heavily on verification.&quot; /&gt;&lt;/p&gt;
&lt;h3&gt;Behavioural analysis inverted our assumptions&lt;/h3&gt;
&lt;p&gt;Before classifying the traces, we assumed that more verification would correlate with better performance. The data showed the opposite: perfect traces were execution-dominant, while partial traces were roughly balanced. The goal is not to maximise verification. It is to make execution decisive enough that verification can happen in concentrated blocks rather than as reactive interruptions.&lt;/p&gt;
&lt;h2&gt;Honest Limitations&lt;/h2&gt;
&lt;p&gt;This is early work. The scope is narrow and the results are preliminary.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Single task type, single domain.&lt;/strong&gt; All experiments targeted HVAC audit tasks. We do not know whether the checklist approach generalises to other engineering task types.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Small iteration count.&lt;/strong&gt; Four experiment blocks are enough to validate the mechanism and expose two important conditionalities, but not enough to map the full improvement frontier.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bonds data is still thin outside the Claude run and has no human-annotated validation set here.&lt;/strong&gt; The 30-trace behavioural analysis is for Claude Sonnet 4.6 on one task family. The GPT-4.1-mini read is directionally useful, but much smaller.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The information barrier is untested against adversarial pressure.&lt;/strong&gt; The autoresearcher followed the information discipline in these runs, but we have not stress-tested whether a sufficiently capable autoresearcher agent might infer task-specific content from the feedback patterns.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The &lt;code&gt;L0&lt;/code&gt; result is still single-instance.&lt;/strong&gt; The 1.0 score is encouraging, and it was reproduced once with a slightly more expensive prompt variant, but we have not yet tested the &lt;code&gt;L0&lt;/code&gt;-optimised prompt across multiple cities. The five-instance Claude run already showed how misleading single-instance conclusions can be.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The GPT-4.1-mini comparison is also single-instance.&lt;/strong&gt; We do not yet know whether the 0.83 ceiling or the prompt fragility pattern will hold across other cities or adjacent task families.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Darwin remains at 0.9.&lt;/strong&gt; One instance did not improve across any prompt change. This may be a genuine model capability limit for that specific error type, but we would need to break the information barrier to investigate.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stochastic variance is bounded but real.&lt;/strong&gt; Iteration 3 showed Brisbane and Melbourne swapping scores, suggesting approximately 5-10% variance per instance. Larger sample sizes would help distinguish signal from noise.&lt;/li&gt;
&lt;/ul&gt;
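&lt;p&gt;On the variance point in particular, even a crude screen helps. A sketch, assuming the roughly 5-10% per-instance variance noted above and nothing more rigorous than comparing the mean gain against the observed spread:&lt;/p&gt;

```python
import statistics

def mean_and_spread(rewards):
    """Summarise repeated trials on one instance."""
    m = statistics.mean(rewards)
    s = statistics.stdev(rewards) if len(rewards) > 1 else 0.0
    return m, s

def looks_like_signal(baseline, candidate):
    """Treat a change as signal only when the mean gain clears the combined
    per-instance spread. A crude screen, not a significance test."""
    mb, sb = mean_and_spread(baseline)
    mc, sc = mean_and_spread(candidate)
    return (mc - mb) > (sb + sc)
```

&lt;p&gt;With per-instance swaps of the size seen between Brisbane and Melbourne, a single-trial gain of a few points would fail this screen, which is the point: more samples are the only honest way to promote small improvements.&lt;/p&gt;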
&lt;h2&gt;What Comes Next&lt;/h2&gt;
&lt;p&gt;The system currently optimises one surface: system prompts. Two obvious new surfaces remain:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Task generation&lt;/strong&gt; — using the loop to generate new task instances that produce useful signal.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scoring rubrics&lt;/strong&gt; — iterating on how agent output is evaluated, especially for tasks with qualitative judgement.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The prompt surface itself can also become more adaptive:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Level-adaptive prompts&lt;/strong&gt; — switching workflow strategy based on how much reference material is available. The contrast between the five-instance full-reference Claude run and the Claude &lt;code&gt;L0&lt;/code&gt; run suggests that prompt selection may need to be conditional rather than global.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Model-adaptive prompts&lt;/strong&gt; — varying the instruction style as well as the workflow strategy. The GPT-4.1-mini run suggests that the same conceptual rule may need different wording for different models.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The longer-term question is whether prompts, tasks, and rubrics can be improved jointly. That is where recursive harness improvement starts to look more like a research programme.&lt;/p&gt;
&lt;p&gt;For now, the result is small and specific: an autonomous loop, an information barrier, behavioural feedback that inverted our assumptions about verification, one prompt change that helped at full reference, another that worked for a very different reason at &lt;code&gt;L0&lt;/code&gt;, and a cross-model comparison showing that prompt ideas can transfer even when prompt wording does not. The environment was improvable in each case, but the successful strategy depended on what information the environment already supplied and which model was inside it. The agents that scored best were not simply the ones that checked the most. They were the ones whose workflow matched the structure of the task, the information available, and the model&apos;s own behavioural constraints.&lt;/p&gt;
&lt;p&gt;That, at least, is consistent with everything we have been learning about where capability actually lives.&lt;/p&gt;
</content:encoded><author>Theodoros Galanos</author></item><item><title>Benchmarking Agents on Real Engineering Work Is Already Teaching Us Something Important</title><link>https://theharness.blog/blog/benchmarking-agents-on-real-engineering-work/</link><guid isPermaLink="true">https://theharness.blog/blog/benchmarking-agents-on-real-engineering-work/</guid><description>Benchmarking AI agents on real HVAC engineering tasks across Claude and GPT models. Results on harness-dependent capability, agent evaluation design, and why AEC-domain benchmarks reveal what general benchmarks miss.</description><pubDate>Thu, 12 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Estimated reading time: 15 minutes&lt;/p&gt;
&lt;p&gt;Current frontier model performance is still heavily concentrated in a narrow band of well-covered domains, especially code, math, and adjacent text-heavy tasks.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Engineering is not one of them.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;That is the starting point for this work.&lt;/p&gt;
&lt;p&gt;In &lt;a href=&quot;/blog/where-capability-actually-lives-in-agentic-engineering/&quot;&gt;Where Capability Actually Lives in Agentic Engineering&lt;/a&gt;, I argued that progress in this domain will not come from better models alone. It will come from better operating conditions: better tools, better harnesses, and better environments for reliable work. This article is a first empirical step in that direction.&lt;/p&gt;
&lt;p&gt;That earlier piece made a conceptual claim about where capability lives. &lt;strong&gt;This one asks whether that claim survives contact with measured performance.&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In domain-specific work, the environments agents operate in are part of the capability.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The aim here is to start measuring that claim on real engineering tasks. The longer-term goal is a benchmark, but this piece reports an early run toward one: a single task, from a single discipline, using one agent harness setup, tested across a small set of models.&lt;/p&gt;
&lt;p&gt;Concretely, the work focuses on &lt;strong&gt;HVAC heat load calculations and schedule audits&lt;/strong&gt;, grounded in the kinds of structured documents that show up in actual AEC workflows, and includes nearly 480 trials with the same harness, the same tools, and the same output contract. The harness was a tool-using agent, and the tool itself was deliberately strong: the calculation procedure was reverse engineered into code, the relevant lookup values were exposed, and the models were given something close to an oracle calculator rather than being asked to derive the whole method from scratch.&lt;/p&gt;
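&lt;p&gt;To make &quot;near-oracle calculator&quot; concrete: the tool encodes the calculation procedure directly, so the model never has to derive it. A heavily simplified sketch of that kind of arithmetic follows. Real heat-load methods carry many more terms, and the constants here are generic textbook values for occupant gains and air properties, not the benchmark&apos;s actual figures:&lt;/p&gt;

```python
# Simplified sketch of the heat-load arithmetic a near-oracle tool encodes.
AIR_VOLUMETRIC_HEAT = 1.21  # kJ/(m^3.K), approx. rho * cp for air near 20 C

def sensible_load_w(occupants, lighting_w_per_m2, area_m2,
                    outdoor_air_l_s, delta_t_k, w_per_person=75.0):
    """Room sensible load in watts from three illustrative components."""
    people = occupants * w_per_person           # occupant sensible gain
    lighting = lighting_w_per_m2 * area_m2      # lighting gain
    # Ventilation sensible load: rho*cp * V_dot * dT, with V_dot in m^3/s
    ventilation = AIR_VOLUMETRIC_HEAT * 1000 * (outdoor_air_l_s / 1000) * delta_t_k
    return people + lighting + ventilation
```

&lt;p&gt;Handing the agent this kind of procedure as a tool is what moves the benchmark&apos;s difficulty away from arithmetic and into checking, orchestration, and completion.&lt;/p&gt;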
&lt;p&gt;That is narrow by design, and still early. But it is already enough to surface useful patterns. The point of sharing it now is not to pretend the benchmark is finished. It is to put the initial results and the early methodology in the open, get feedback, and sharpen the next iterations while the benchmark is still taking shape.&lt;/p&gt;
&lt;p&gt;What follows is a first look at what current agents are good at, where they still break, and why evaluation design matters almost as much as model quality. It is also an early step in turning a vague domain gap into a concrete improvement programme.&lt;/p&gt;
&lt;h2&gt;TL;DR&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;This benchmark is best read as evidence about conditional capability.&lt;/strong&gt; Inside a strong harness, current frontier agents can do meaningful engineering review work. Once guidance and tool support are stripped away, performance often collapses rather than degrading gracefully.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The model ranking is clear, but it is not the deepest result.&lt;/strong&gt; Sonnet 4.6 is strongest overall, and Haiku 4.5 is the most attractive quality-per-dollar option in this setup.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The most informative tasks were audit tasks, not calculation tasks.&lt;/strong&gt; Once the harness supplied something close to an oracle calculator, the real separation moved into checking, discrepancy detection, and reliable completion.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Verification behaviour appears to matter a lot.&lt;/strong&gt; The strongest model did not just verify when it was in trouble. It verified as part of its default workflow, which looks like part of the mechanism behind its recall advantage.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;This is an empirical counterpart to the argument in the previous article, and the beginning of a broader benchmark effort.&lt;/strong&gt; It supports the claim that in engineering, capability is not just a model property. It is distributed across the model, the harness, the tools, the verifiers, and the output contract. The longer-term goal is to build high-quality benchmarks with enough coverage to support meaningful progress in agentic engineering.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&quot;../../assets/blog/agentic-eval/hero-headline-reward.png&quot; alt=&quot;Left: overall reward across the four models. Right: Sonnet 4.6 under budget and guidance ablations. The benchmark ranks models, but it also shows how much capability depends on the environment around the model.&quot; /&gt;&lt;/p&gt;
&lt;p&gt;That is the core result in compressed form: &lt;strong&gt;the benchmark ranks models, but more importantly, it makes the system dependence visible.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;The Benchmark Was Simple on Purpose: HVAC Heat Load Tasks as a Starting Point&lt;/h2&gt;
&lt;p&gt;That narrow setup matters because it tells us something specific.&lt;/p&gt;
&lt;p&gt;Within this first experiment, the focus was one meaningful slice of engineering work: mechanical heat-load tasks. Even inside that small scope, two task families behaved very differently.&lt;/p&gt;
&lt;p&gt;Calculation tasks ask the agent to compute loads correctly from structured room inputs.&lt;/p&gt;
&lt;p&gt;Audit tasks ask the agent to inspect schedules, identify discrepancies, and propose the correct fixes.&lt;/p&gt;
&lt;p&gt;That distinction ended up being crucial.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The calculation tasks were close to saturated for the Anthropic models.&lt;/strong&gt; They are useful as a baseline, but they do not separate strong agents from stronger ones.&lt;/p&gt;
&lt;p&gt;That result is even more telling once you know the setup. The method was not hidden inside the task and left for the model to rediscover. The calculation approach was encoded directly into the harness as a near-oracle tool: a coded procedure plus the relevant lookup table values and calculator logic. Even with that help, not every model was perfect, and the harder audit tasks still separated the field clearly.&lt;/p&gt;
&lt;p&gt;That is where the real spread appears: &lt;strong&gt;systematic checking, consistency across many rows, and enough discipline to finish with the correct structured output.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;../../assets/blog/agentic-eval/audit-mixed-use-frontier.png&quot; alt=&quot;Performance on the hardest audit task. This is where the benchmark starts to separate strong agents from stronger ones.&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The audit tasks also give us a useful improvement gradient. When an agent fails, the failure is usually legible. It missed a discrepancy, checked too shallowly, used the tools badly, ran out of turns, or never made it cleanly into the required output format. That is exactly what you want from a benchmark aimed at building better agent environments around real domain tasks.&lt;/p&gt;
&lt;p&gt;For AEC practitioners, that should feel familiar. The painful mistakes in practice are often not a single wrong formula. They are missed discrepancies, skipped checks, and brittle review processes.&lt;/p&gt;
&lt;p&gt;For agent evaluation people, it is a reminder that task design determines what you learn. If the task is too easy, you are mostly benchmarking formatting and latency. If it is too synthetic, you may learn very little about deployed usefulness.&lt;/p&gt;
&lt;h2&gt;The Model Ranking Is Clear, but It Is Not the Whole Story&lt;/h2&gt;
&lt;p&gt;At the headline level, the ranking is straightforward.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;../../assets/blog/agentic-eval/outcome-distribution.png&quot; alt=&quot;Outcome distributions by model. The important difference is not just average score, but failure shape, especially GPT-4.1-mini&apos;s heavy zero-score tail.&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Sonnet 4.6 was best overall. It was the only model to clear both near-perfect calculation performance and clearly best-in-class audit recall. It also had zero zero-score trials.&lt;/p&gt;
&lt;p&gt;Haiku 4.5 came second on accuracy and first on value. Its overall reward was 96.3%, and while it trailed Sonnet 4.6 on audit recall, it stayed strong enough that its much lower per-trial cost changes the deployment conversation.&lt;/p&gt;
&lt;p&gt;Sonnet 4 remains solid. It is cheaper than 4.6 and more accurate than GPT-4.1-mini by a wide margin, but the newer generation models have moved the frontier.&lt;/p&gt;
&lt;p&gt;GPT-4.1-mini was not competitive for this workload. Its overall reward was 34.1%, with a 64% zero-rate. The issue was broader than engineering reasoning. A large share of failures were format failures, truncated outputs, or prose that never turned into the required JSON result.&lt;/p&gt;
&lt;p&gt;That last detail matters. In an eval setting, people sometimes treat format failures as a nuisance variable. In deployed agent systems, they are part of the failure surface. &lt;strong&gt;If an agent can reason but cannot reliably finish the job in the required structure, it still failed.&lt;/strong&gt;&lt;/p&gt;
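&lt;p&gt;Concretely, the output contract is binary in a way prose quality is not. A sketch of the kind of check a harness might apply at the end of a trial, with hypothetical key names:&lt;/p&gt;

```python
import json

def parse_contracted_output(raw_text, required_keys=("findings", "summary")):
    """Illustrative output-contract check: a trial scores zero if the agent's
    final message never yields the required JSON object, regardless of how
    good the reasoning in the prose was. Key names are hypothetical."""
    try:
        payload = json.loads(raw_text)
    except json.JSONDecodeError:
        return None  # prose or truncated output: a format failure
    if not all(k in payload for k in required_keys):
        return None  # well-formed JSON but wrong shape: still a failure
    return payload
```

&lt;p&gt;Everything on the wrong side of that check counts as failure, which is exactly how deployed downstream systems would treat it.&lt;/p&gt;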
&lt;blockquote&gt;
&lt;p&gt;If your eval is shallow, your conclusions will be shallow too.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;The Biggest Lesson Was About the System, Not the Model&lt;/h2&gt;
&lt;p&gt;One of the clearest findings in this benchmark is that harness choices materially change measured capability.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;This is the most direct continuity with the previous article.&lt;/strong&gt; There, the claim was conceptual: the harness is part of the capability story. Here, the same point shows up empirically in the numbers.&lt;/p&gt;
&lt;p&gt;Even simple harness changes moved results. In the early setup, a 10-turn cap made one audit workload look almost impossible for an otherwise capable model, but turn budget was only part of the problem. With weaker guidance, the default strategy was to decompose the audit room by room and spend roughly two turns per room in an execute-then-verify rhythm. That produced real work, but it was the wrong workflow for the budget. The fix combined a higher cap with guidance that pushed the agent toward batching rather than treating every room as its own mini-loop.&lt;/p&gt;
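&lt;p&gt;The budget arithmetic makes the failure mode obvious. A toy model, with illustrative numbers rather than the benchmark&apos;s exact task sizes:&lt;/p&gt;

```python
import math

def turns_needed(n_rooms, rooms_per_batch, turns_per_batch=2, overhead=2):
    """Turn-budget arithmetic for a batched execute-then-verify workflow.
    All numbers are illustrative: a fixed overhead for setup and the final
    structured output, plus one execute/verify pass per batch of rooms."""
    return overhead + math.ceil(n_rooms / rooms_per_batch) * turns_per_batch
```

&lt;p&gt;With a dozen rooms and a two-turn-per-room rhythm, a 10-turn cap is unreachable from the start; batching a few rooms per pass brings the same work comfortably inside it.&lt;/p&gt;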
&lt;p&gt;The same pattern showed up elsewhere. Verifier fixes mattered. Prompt refinements mattered. Prompt caching mattered for cost. Better output instructions reduced avoidable formatting zeros for the Anthropic models. And the tool design mattered too: even when the harness provides something close to an oracle for the core calculation, model differences do not disappear. They move into disciplined checking, orchestration, and reliable completion.&lt;/p&gt;
&lt;p&gt;The ablation results make the point more sharply. Reducing the turn budget from 20 to 10 is one kind of degradation. Removing guidance and tool support is another. The first makes the task harder. The second starts to expose the capability boundary. On this task family, the gap between strong-harness performance and low-guidance performance is larger than many of the model-to-model differences people usually focus on.&lt;/p&gt;
&lt;p&gt;Those are not side details. They are part of the measured system. &lt;strong&gt;The harness is not neutral background. It is an active ingredient in whether a model can express the capability it already has.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For practitioners, this means you should be skeptical of any claim that a model simply can or cannot do a workflow based on a weak first-pass eval.&lt;/p&gt;
&lt;p&gt;For evaluation researchers, it means benchmark design has to be treated with the same rigor as model comparison itself.&lt;/p&gt;
&lt;h2&gt;What Happened When We Removed Harness Guidance&lt;/h2&gt;
&lt;p&gt;This is also where the setup gets more interesting.&lt;/p&gt;
&lt;p&gt;From the beginning, one of the key questions was how much of the measured capability depended on the operating conditions around the model, not just which model performed best inside the strongest harness.&lt;/p&gt;
&lt;p&gt;So after the main tool-enabled runs, a small guidance-ablation ladder was set up around the same office-building audit family.&lt;/p&gt;
&lt;p&gt;The idea was simple. Keep the underlying task family fixed, then remove support in stages.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Condition&lt;/th&gt;
&lt;th&gt;Support removed or added&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;L0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;bare problem statement, no embedded formulas or lookup table&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;L1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;effectively the same as &lt;code&gt;L0&lt;/code&gt; in the current task set, which turned out to be informative in its own right&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;L2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;adds the psychrometric constants and explicit outside-air rules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;L3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;replaces that with a compact AS 1668.2 reference table and general calculation guidance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;no-tool&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;removes the calculation tool entirely and asks the model to do the audit directly from the prompt context&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This was not meant to be a polished benchmark surface. It was a probe. The point was to see how quickly the task collapses once the environment starts losing structure.&lt;/p&gt;
&lt;p&gt;The results were blunt.&lt;/p&gt;
&lt;p&gt;On the tool-enabled baseline, Sonnet 4.6 averaged 0.966 reward on the office-building audit set. Dropping the turn budget from 20 to 10 lowered that to 0.943. That is a real degradation, but it still leaves the task clearly inside the model&apos;s workable envelope.&lt;/p&gt;
&lt;p&gt;The guidance ladder was a different story.&lt;/p&gt;
&lt;p&gt;In the direct no-tool reference run, the current partial results look like this:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Condition&lt;/th&gt;
&lt;th&gt;Mean reward&lt;/th&gt;
&lt;th&gt;What happened&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tool-enabled baseline&lt;/td&gt;
&lt;td&gt;0.966&lt;/td&gt;
&lt;td&gt;Strong performance, many perfect trials&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Budget-10 counterfactual&lt;/td&gt;
&lt;td&gt;0.943&lt;/td&gt;
&lt;td&gt;Measurable quality drop, but still viable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;L0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.000&lt;/td&gt;
&lt;td&gt;All zero-score trials&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;L1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.000&lt;/td&gt;
&lt;td&gt;All zero-score trials&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;L2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.000&lt;/td&gt;
&lt;td&gt;All zero-score trials&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;L3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.000&lt;/td&gt;
&lt;td&gt;All zero-score trials&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;no-tool&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.153&lt;/td&gt;
&lt;td&gt;One perfect trial, one partial trial, eight zeroes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
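&lt;p&gt;The mean-reward figures above are straightforward per-condition averages over trial scores. A minimal sketch, using an illustrative trial vector for the &lt;code&gt;no-tool&lt;/code&gt; condition (the 0.53 partial value is inferred from the reported 0.153 mean over ten trials, not read from the raw data):&lt;/p&gt;

```python
from statistics import mean

def mean_reward(trials):
    """Average per-trial reward for one condition (rewards lie in [0, 1])."""
    return round(mean(trials), 3)

# Shape of the no-tool condition reported above: one perfect trial,
# one partial trial, eight zeroes. The partial value is an assumption.
no_tool = [1.0, 0.53, 0, 0, 0, 0, 0, 0, 0, 0]
mean_reward(no_tool)  # → 0.153
```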
&lt;p&gt;That tells us a few things. The cleanest way to read it is as three different regimes: a workable strong-harness regime, a mildly degraded budget-constrained regime, and a collapse regime once the environment stops carrying enough of the method.&lt;/p&gt;
&lt;p&gt;First, the budget ablation and the guidance ablation are not the same phenomenon. Reducing turns hurts, but the agent still basically knows what kind of work it is doing. Removing guidance and tool support is much harsher. &lt;strong&gt;Most of those conditions do not degrade gracefully. They collapse.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Second, this is exactly the kind of out-of-distribution behaviour we were worried about.&lt;/p&gt;
&lt;p&gt;In-distribution domains are the places where models have already seen enough adjacent structure that they can interpolate their way through the task even when the scaffold is weak. Engineering audit work did not behave like that here. Once the environment stopped carrying key pieces of the method, performance did not taper off a little. It mostly went to zero.&lt;/p&gt;
&lt;p&gt;Third, the surviving &lt;code&gt;no-tool&lt;/code&gt; signal matters precisely because it is weak. There was a small amount of non-zero performance there. That suggests the capability is not entirely absent. But it is nowhere near robust enough to treat the task as natively solved. In other words, the environment is still doing real cognitive work for the model.&lt;/p&gt;
&lt;p&gt;That is the larger point.&lt;/p&gt;
&lt;p&gt;When people say a model can do engineering, they often leave unspoken how much hidden structure is being provided by the harness, the tools, the prompt, or the reference data. Our ablation run makes that visible. In this task family, capability is highly conditional on the operating environment. &lt;strong&gt;Remove the support and the system does not simply get a bit worse. It often stops functioning in a useful way.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;That is not a failure of evaluation. It is exactly what good evaluation is supposed to reveal.&lt;/p&gt;
&lt;p&gt;It is also why the out-of-distribution framing matters so much. If the domain were already well-covered by the model&apos;s native priors, these ablations would look like inconvenience tests. Instead they look like capability boundary tests. &lt;strong&gt;That is a strong sign that for real engineering work, at least today, the harness is not a wrapper around the capability. It is part of the capability.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;What the Strongest Model Did Differently&lt;/h2&gt;
&lt;p&gt;The cleanest behavioural finding in this work is about verification.&lt;/p&gt;
&lt;p&gt;All models increased checking when they were struggling. But Sonnet 4.6 did something more interesting: it verified even when it was succeeding.&lt;/p&gt;
&lt;p&gt;In our behavioural analysis, Sonnet 4.6 spent 23% of its turns in successful traces on verification behaviour. The other strong models were much lower. Haiku showed the steepest verification gradient between success and failure, which makes it interesting for runtime monitoring, but Sonnet 4.6 made verification part of its default operating mode.&lt;/p&gt;
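&lt;p&gt;The verification share is a simple proportion over classified turns. A minimal sketch, assuming each turn has already been labelled with one of the four behavioural categories used in the analysis:&lt;/p&gt;

```python
from collections import Counter

def verification_share(trace):
    """Fraction of turns in one trace labelled as verification behaviour."""
    counts = Counter(trace)
    return counts["verification"] / len(trace)

# Hypothetical success trace where checking is routine, not a panic move:
trace = ["execution", "execution", "verification", "execution",
         "deliberation", "execution", "verification", "execution"]
verification_share(trace)  # → 0.25
```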
&lt;p&gt;&lt;img src=&quot;../../assets/blog/agentic-eval/verification-gradient.png&quot; alt=&quot;Verification behaviour in success and failure. Sonnet 4.6 verifies as part of its normal workflow, while Haiku shows the steepest verification jump when traces start to struggle.&quot; /&gt;&lt;/p&gt;
&lt;p&gt;That appears to be the mechanism behind its recall lead.&lt;/p&gt;
&lt;p&gt;In plain engineering terms, it behaved less like a model that checked at the end and more like one that treated review as part of execution. It did not only check when it sensed danger. It checked because checking was built into the workflow.&lt;/p&gt;
&lt;p&gt;That matters in both practical engineering terms and evaluation terms.&lt;/p&gt;
&lt;p&gt;For SMEs, it matches a familiar truth: good review practice is not a panic move. It is routine.&lt;/p&gt;
&lt;p&gt;For agentic-eval people, it suggests that model quality may show up less in raw chain-of-thought style reasoning and more in when and how an agent decides to revisit earlier work. Reliability here depends on whether verification is part of the default workflow before failure starts to accumulate.&lt;/p&gt;
&lt;h2&gt;A Behavioural Lens From Reasoning Research&lt;/h2&gt;
&lt;p&gt;To get beyond simple success rates, we adapted the Agentic Bonds framework from &lt;a href=&quot;https://arxiv.org/abs/2601.06002&quot;&gt;Du et al.&apos;s work on the molecular structure of thought&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;../../assets/blog/agentic-eval/four-bond-types.png&quot; alt=&quot;The four bond types used in the behavioural analysis: execution, verification, deliberation, and exploration.&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The basic idea is that quality does not come only from individual steps. It comes from the pattern of transitions between types of steps.&lt;/p&gt;
&lt;p&gt;We classified agent turns into four categories: execution, verification, deliberation, and exploration.&lt;/p&gt;
&lt;p&gt;That gave us a behavioural fingerprint for each model.&lt;/p&gt;
&lt;p&gt;Sonnet 4 looked like a rigid workhorse: high execution share, low exploration, highly predictable structure.&lt;/p&gt;
&lt;p&gt;Haiku looked more adaptive, but also more verbose. It often spent more turns and generated a stronger distress signal when things were going badly.&lt;/p&gt;
&lt;p&gt;GPT-4.1-mini produced the strangest result: success and failure were behaviourally almost indistinguishable. It did not seem to have a readable internal signal that it was in trouble.&lt;/p&gt;
&lt;p&gt;That is a serious limitation if you want runtime monitoring or intervention. You cannot reliably rescue a model that does not behaviourally reveal when it is failing.&lt;/p&gt;
&lt;p&gt;This kind of analysis complements task-level scoring. Accuracy tells you what happened. Behavioural structure starts to tell you why.&lt;/p&gt;
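&lt;p&gt;A behavioural fingerprint of this kind can be approximated as a bigram tally over consecutive turn labels. This is a simplification of the Agentic Bonds analysis, not the paper&apos;s exact method:&lt;/p&gt;

```python
from collections import Counter

def bond_fingerprint(trace):
    """Tally transitions ("bonds") between consecutive turn types."""
    return Counter(zip(trace, trace[1:]))

# Hypothetical trace using the four turn categories described above:
trace = ["execution", "verification", "execution", "execution",
         "verification", "deliberation", "execution"]
fp = bond_fingerprint(trace)
fp[("execution", "verification")]  # → 2
```

&lt;p&gt;Comparing these transition counts between success and failure traces is what makes a model&apos;s distress signal, or the absence of one, visible.&lt;/p&gt;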
&lt;p&gt;That framing came from the tool-loop traces. The no-tool runs exposed a different but complementary failure surface.&lt;/p&gt;
&lt;h2&gt;What The No-Tool Traces Revealed&lt;/h2&gt;
&lt;p&gt;The tool-loop traces gave us one kind of behavioural visibility: turn-by-turn structure. The no-tool and low-guidance runs gave us a different one. There we often only had the final written artefact, so the analysis had to be more forensic. We were no longer asking which turn type came next. We were asking a simpler and harsher question: did the model stay attached to the instance at all?&lt;/p&gt;
&lt;p&gt;That ended up being one of the clearest behavioural signals in the whole project.&lt;/p&gt;
&lt;p&gt;Once we read a broader sample of the direct no-tool and guidance-ladder traces, the main split was not simply success versus failure. It was anchored audit behaviour versus free-running domain narration.&lt;/p&gt;
&lt;p&gt;The successful no-tool traces stayed tightly locked to the assigned schedule. They rebuilt the formulas from the prompt, carried the instance-specific constants through the arithmetic, and converged toward compact findings. Even without the calculator tool, they still behaved like audits. The single perfect Sydney no-tool trace is the clearest example of that pattern: it stayed on the given schedule, reconstructed the formulas locally, and still landed a verifier-clean result.&lt;/p&gt;
&lt;p&gt;The failed traces were more interesting than simple arithmetic misses. They often looked superficially impressive. They used the right vocabulary. They wrote long engineering-sounding explanations. They sometimes did coherent local arithmetic. But many of them had already slipped off the actual task. They started substituting room programmes, changing climate conditions, inventing alternate schedules, or confidently asserting standard lookups that were not stably grounded in the prompt.&lt;/p&gt;
&lt;p&gt;One Adelaide &lt;code&gt;L3&lt;/code&gt; trace, for example, stopped auditing the office-building schedule and began analysing hotel rooms and hotel suites instead. A Brisbane no-tool failure turned into a different classroom-and-library schedule with its own invented occupancy logic. Both traces remained fluent. Neither stayed on the job.&lt;/p&gt;
&lt;p&gt;That is the important distinction. &lt;strong&gt;The failure mode was often not &quot;cannot calculate.&quot; It was &quot;cannot stay on the instance.&quot;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;That gave us a simple rubric for reading these traces. The key dimensions were instance fidelity, standards grounding, formula grounding, causal compression, and output discipline. The strongest traces stayed close to the presented schedule and compressed toward verifier-relevant findings. The weakest traces did the opposite: they drifted into generic HVAC explanation, expanded in length, and lost the contract.&lt;/p&gt;
&lt;p&gt;From that manual read, a few recurring failure labels stood out.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Instance substitution:&lt;/strong&gt; the model silently stopped auditing the presented office-building schedule and solved a different problem.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Generic-domain takeover:&lt;/strong&gt; the trace remained fluent and domain-aware, but it had turned into an HVAC essay rather than an audit.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Standard hallucination:&lt;/strong&gt; the model introduced confident but weakly grounded claims about AS 1668.2 lookups, occupant densities, or OA rates to justify a path it had invented.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Verbosity runaway:&lt;/strong&gt; the trace expanded toward the output-token ceiling without improving task fidelity or output quality.&lt;/li&gt;
&lt;/ul&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Failure shape&lt;/th&gt;
&lt;th&gt;What it looks like in practice&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Instance substitution&lt;/td&gt;
&lt;td&gt;The trace silently swaps in a different schedule, room mix, or city conditions&lt;/td&gt;
&lt;td&gt;The model is no longer auditing the assigned artefact&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Generic-domain takeover&lt;/td&gt;
&lt;td&gt;The writing stays fluent and technical but turns into generic HVAC explanation&lt;/td&gt;
&lt;td&gt;Domain fluency masks loss of task control&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Standard hallucination&lt;/td&gt;
&lt;td&gt;The trace confidently asserts unsupported lookup values or code interpretations&lt;/td&gt;
&lt;td&gt;It manufactures justification for the wrong path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Verbosity runaway&lt;/td&gt;
&lt;td&gt;The trace expands toward the token ceiling without converging toward findings&lt;/td&gt;
&lt;td&gt;Length substitutes for control&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
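&lt;p&gt;Some of these failure shapes are amenable to rough automatic flagging. The heuristic below is an illustrative sketch with placeholder thresholds and checks; in this project the labels came from a manual read of the traces:&lt;/p&gt;

```python
def label_failure(output_text, expected_rooms, token_ceiling, token_count):
    """Rough heuristic labels for the failure shapes in the table above.

    All thresholds and string checks are illustrative placeholders.
    """
    labels = []
    text = output_text.lower()
    mentioned = sum(room.lower() in text for room in expected_rooms)
    if mentioned == 0:
        # No assigned room appears at all: the trace swapped in another problem.
        labels.append("instance-substitution")
    elif mentioned != len(expected_rooms):
        # Partial attachment: fluent writing drifting off the assigned schedule.
        labels.append("generic-domain-takeover")
    if token_count / token_ceiling > 0.95:
        # Expanding toward the output ceiling is a separate warning sign.
        labels.append("verbosity-runaway")
    return labels

label_failure("### Room 1: Hotel Room A", ["Open Office", "Meeting Room"],
              token_ceiling=8000, token_count=7900)
# → ["instance-substitution", "verbosity-runaway"]
```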
&lt;p&gt;A few lines from the traces make the pattern obvious.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&quot;### Room 1 — Hotel Room A (Hotel Bedrooms, 30 m²)&quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That line came from an Adelaide &lt;code&gt;L3&lt;/code&gt; run that was supposed to be auditing an office-building schedule. By that point the trace was no longer on the assigned task at all.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&quot;### Room 1 — Classroom A&quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That came from a Brisbane no-tool failure. The model remained fluent and organised, but it had drifted into a classroom-and-library problem that was never in the prompt.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&quot;Room 3 Errors: - Conduction W: given 4320, correct = 1600&quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That came from the successful Sydney no-tool trace. It is much less ornate, but it stays attached to the actual schedule and compresses toward the discrepancies that matter.&lt;/p&gt;
&lt;p&gt;That pattern matters because it changes how to read the weak positive signal in the no-tool condition. The surviving no-tool traces suggest something specific: the model can sometimes reconstruct enough of the method to succeed, but only when it keeps a very tight lock on the actual instance. Once that lock breaks, domain fluency is not enough to rescue the audit.&lt;/p&gt;
&lt;p&gt;This is exactly the kind of behaviour you would expect in an out-of-distribution domain. The model does not fail by becoming incoherent. It fails by becoming plausibly generic.&lt;/p&gt;
&lt;p&gt;That matters for cost too, because these are not always short failures. Some of them are long, fluent, and expensive failures.&lt;/p&gt;
&lt;h2&gt;Time and Cost Need To Be Measured Together&lt;/h2&gt;
&lt;p&gt;The cheapest model per trial is not automatically the best value.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;../../assets/blog/agentic-eval/quality-vs-cost.png&quot; alt=&quot;Quality versus cost. Sonnet 4.6 buys the highest quality, Haiku offers the strongest quality-per-dollar tradeoff, and GPT-4.1-mini shows why low price alone is misleading.&quot; /&gt;&lt;/p&gt;
&lt;p&gt;GPT-4.1-mini was cheapest in raw dollar terms, but too much of that spend was wasted because the outputs were unusable or incomplete.&lt;/p&gt;
&lt;p&gt;Sonnet 4.6 was the most expensive, and part of that cost came from output verbosity. It generated much more output than the other Anthropic models, which limits how much prompt caching can save. The no-tool traces make the broader point clearly: a long failing trace is also a cost event.&lt;/p&gt;
&lt;p&gt;Haiku 4.5 hit the most interesting middle ground. It was fast, much cheaper than Sonnet 4.6, and accurate enough that it dominated on reward-squared-per-dollar.&lt;/p&gt;
&lt;p&gt;That metric matters because it punishes low accuracy sharply. &lt;strong&gt;A cheap wrong answer is not a bargain in review-heavy engineering workflows.&lt;/strong&gt;&lt;/p&gt;
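&lt;p&gt;The metric itself is simple. A minimal sketch with illustrative numbers (not the study&apos;s actual rewards or costs) shows why the squaring matters:&lt;/p&gt;

```python
def reward_sq_per_dollar(mean_reward, cost_usd):
    """Quality-per-dollar with a sharp penalty for low accuracy:
    squaring the reward drives cheap-but-wrong runs toward zero value."""
    return mean_reward ** 2 / cost_usd

# Illustrative numbers only: an accurate mid-cost model can beat a
# much cheaper model that is wrong too often.
accurate = reward_sq_per_dollar(0.90, 0.10)  # high reward, moderate cost
cheap = reward_sq_per_dollar(0.40, 0.03)     # low reward, very low cost
```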
&lt;p&gt;That matters especially in AEC, where the near-term deployment shape is often low request volume and very high task value. These are usually not million-QPS workloads. They are bounded but complex tasks that take skilled people real time to complete, and where the cost of a bad result can easily dominate the cost of the model run itself. In that setting, quality comes first.&lt;/p&gt;
&lt;p&gt;The practical sequence is usually two-stage. First, you pay for quality in order to discover which tasks agents can actually do well enough to be useful. Only later, once those workflows are stable and you start scaling them across teams or organisations, does cost become the main optimisation target. At that point the question changes from &quot;can this task be done well?&quot; to &quot;how broadly can we deploy it without losing quality or blowing up spend?&quot;&lt;/p&gt;
&lt;p&gt;In this small experiment, if you want the highest ceiling, Sonnet 4.6 is the answer.&lt;/p&gt;
&lt;p&gt;In the same narrow setting, if you want the strongest quality-per-dollar tradeoff, Haiku 4.5 is hard to ignore.&lt;/p&gt;
&lt;p&gt;If you want a deployable system, the right answer probably depends on where in the workflow the agent sits and how much review coverage a human still provides.&lt;/p&gt;
&lt;h2&gt;What This Means for AI Agents in AEC&lt;/h2&gt;
&lt;p&gt;The practical takeaway is not that AI can now replace engineering judgement. That would be the wrong lesson.&lt;/p&gt;
&lt;p&gt;The stronger result is narrower and more useful: &lt;strong&gt;on bounded, well-instrumented tasks, evaluation quality already matters as much as model selection.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Calculation ability was not what most differentiated the models. What separated them was disciplined checking, completeness, and reliable finish behaviour. Those are exactly the traits that matter in real QA workflows.&lt;/p&gt;
&lt;p&gt;So if you are trying to bring agents into AEC practice, one sensible near-term path is not full autonomy. It is scoped, auditable assistance on tasks where you can define the inputs, the expected outputs, and the failure modes clearly.&lt;/p&gt;
&lt;p&gt;That is also where benchmarks can be genuinely useful: as a way to test whether an agent is ready for a specific class of work.&lt;/p&gt;
&lt;h2&gt;What This Means for Agentic Evaluation and Benchmarks&lt;/h2&gt;
&lt;p&gt;This benchmark suggests that three parts of eval design matter especially strongly.&lt;/p&gt;
&lt;p&gt;First, real task grounding. The benchmark should represent work that people actually care about getting right.&lt;/p&gt;
&lt;p&gt;Second, harness transparency. &lt;strong&gt;Turn limits, verifier design, tool affordances, and output contracts are not implementation trivia. They are part of the measured system.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Third, behavioural instrumentation. If two agents get similar scores but fail in different ways, that difference matters. If one model exposes a strong distress signal and another does not, that matters too.&lt;/p&gt;
&lt;p&gt;This is why benchmarks are most useful when those layers are visible together: domain realism, outcome quality, and agent behaviour. In that sense, AEC is a good stress case. Important slices of its workflows are already highly digitised, the industry is economically significant, and the work is awkward enough to expose real capability gaps.&lt;/p&gt;
&lt;h2&gt;What Comes Next&lt;/h2&gt;
&lt;p&gt;The obvious next step is scale. One task in one discipline is enough to surface a potential pattern, but not enough to define a field. If this work is going to mature into a useful benchmark, it needs to grow into thousands of task instances across multiple disciplines, with enough breadth to distinguish narrow task skill from more general domain competence.&lt;/p&gt;
&lt;p&gt;It also needs multimodality much earlier than many evals do. Design and engineering work are not purely text workflows. Drawings, schedules, details, markups, diagrams, and spatial context are central to the job. A serious benchmark for this domain will need multimodal inputs and multimodal tool use as part of the core design, not as an optional extension added later.&lt;/p&gt;
&lt;p&gt;Then there is the harder evaluation problem: &lt;strong&gt;tasks where there is no single clean quantitative answer&lt;/strong&gt;. A lot of real engineering work is about adequacy, judgement, prioritisation, and review quality rather than one exact number. That is where expert-authored rubrics, and eventually rubric-driven reward systems, become crucial. Recent work such as &lt;a href=&quot;https://arxiv.org/abs/2507.17746&quot;&gt;Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains&lt;/a&gt; points in that direction. If we want to benchmark useful domain work rather than only easily scored work, we will need much better machinery for structured qualitative evaluation.&lt;/p&gt;
&lt;p&gt;And beyond single tasks, there is process. Many real workflows are long-horizon and compositional: they are made of many smaller tasks chained together across time, artefacts, and decisions. That is part of why starting with sharply scoped task instances still makes sense. They are the building blocks. Over time, the harder benchmark will be the composition problem: whether agents can string those capabilities together reliably across longer processes without losing quality, context, or control.&lt;/p&gt;
&lt;h2&gt;The Current Bottom Line&lt;/h2&gt;
&lt;p&gt;If you force a one-line conclusion, it is this:&lt;/p&gt;
&lt;p&gt;Current agent capability on real engineering tasks is still highly conditional on the operating environment: the best systems verify better, finish more reliably, and look much weaker once the scaffold is removed.&lt;/p&gt;
&lt;p&gt;That is encouraging, but it is also a warning.&lt;/p&gt;
&lt;p&gt;You can learn the wrong lesson from a bad eval.&lt;/p&gt;
&lt;p&gt;And you can misunderstand both strength and weakness if you are only looking at model names instead of the full system around them.&lt;/p&gt;
&lt;p&gt;This work is still in its early phases and still narrow. But it is already telling us something useful: &lt;strong&gt;the next layer of progress will not come from bigger scoreboards alone. It will come from better tasks, better harnesses, and a clearer view of how much of domain capability is native to the model and how much is being supplied by the environment around it.&lt;/strong&gt;&lt;/p&gt;
</content:encoded><author>Theodoros Galanos</author></item><item><title>Where Capability Actually Lives in Agentic Engineering</title><link>https://theharness.blog/blog/where-capability-actually-lives-in-agentic-engineering/</link><guid isPermaLink="true">https://theharness.blog/blog/where-capability-actually-lives-in-agentic-engineering/</guid><description>In AEC and domain-specific engineering, AI agent capability lives not in the model alone but in harness engineering — the tools, verifiers, orchestration, and process design that make agentic work reliable.</description><pubDate>Tue, 10 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Estimated reading time: 16 minutes&lt;/p&gt;
&lt;p&gt;Engineering is a useful stress test for agentic systems because it exposes a kind of weakness that general demos often hide. The problem is not that the domain is technical in the abstract. &lt;strong&gt;It is that the work is instance-bound, constraint-heavy, and intolerant of plausibly generic answers.&lt;/strong&gt; To be useful here, an agent has to do more than sound competent. It has to stay attached to the artifact, preserve the method, survive verification, and finish in a form that another system or person can trust.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;To be useful here, an agent has to do more than sound competent. It has to stay attached to the artifact, preserve the method, survive verification, and finish in a form that another system or person can trust.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That changes the way we should think about progress. &lt;strong&gt;In domains like engineering, capability cannot be understood as a property of the base model alone.&lt;/strong&gt; What matters is whether the full system can remain faithful to the specific document, drawing, assumptions, standards, and constraints that define the task in front of it. The question is not just whether a model can reason in the neighborhood of the work. It is whether the operating environment lets reliable work happen at all.&lt;/p&gt;
&lt;p&gt;This broader shift is already visible across AI development. State-of-the-art behaviour increasingly comes from &lt;a href=&quot;https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/&quot;&gt;compound systems (BAIR, 2024)&lt;/a&gt; rather than single model calls, and actual work-related use still occupies a much narrower space than theoretical capability would suggest, which is part of why deployment questions matter so much (&lt;a href=&quot;https://www.anthropic.com/research/labor-market-impacts&quot;&gt;Anthropic, 2026&lt;/a&gt;). Engineering sharpens that pattern. It forces a harder question than many benchmark settings do: where does reliable capability actually live? How much is native to the model, how much is supplied by tools and verifiers, and how much emerges only when the environment is designed correctly, especially in a landscape where &lt;a href=&quot;https://arxiv.org/html/2603.01203&quot;&gt;benchmark coverage is still skewed toward more convenient task domains&lt;/a&gt;?&lt;/p&gt;
&lt;p&gt;That is the question this essay is really about.&lt;/p&gt;
&lt;h2&gt;TL;DR&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;In engineering, useful agent performance depends less on generic fluency than on staying attached to the specific artifact, assumptions, constraints, and output format of the task at hand.&lt;/li&gt;
&lt;li&gt;That means capability does not live in the base model alone. It is distributed across the full system: tools, verifiers, control flow, output contracts, interfaces, and human review structure.&lt;/li&gt;
&lt;li&gt;The real design problem is therefore not just model selection. It is harness design: deciding what the agent should infer, what should be externalised into tools, what should be checked, and how the work should remain controllable.&lt;/li&gt;
&lt;li&gt;Many important engineering workflows are better understood as processes rather than isolated tasks, which makes orchestration, UX, visibility, and intervention part of the capability story too.&lt;/li&gt;
&lt;li&gt;If we want meaningful progress in engineering and AEC, we need better experimental environments and benchmarks that reflect real artifacts, real failure modes, and the actual conditions required for reliable work.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Why AEC Engineering Breaks Most AI Agent Approaches&lt;/h2&gt;
&lt;p&gt;It is tempting to treat engineering as one more expert domain that large models will gradually absorb as they become smarter and better trained. There is some truth in that. Engineering work is full of technical language, quantitative reasoning, standards, and procedural knowledge, all of which are at least partially representable in text, math, and code. But that description misses the thing that makes the domain difficult in practice.&lt;/p&gt;
&lt;p&gt;Engineering is not hard only because it contains specialized knowledge. It is hard because the work is tied to particular artifacts and particular consequences. A calculation is not just a calculation. It belongs to a drawing set, a schedule, a climate zone, a standard, a collection of assumptions, and a downstream chain of decisions that may depend on it being right for this case and not some nearby case. In that environment, generic competence is not enough. A system can sound perfectly fluent and still fail the job by drifting away from the assigned instance.&lt;/p&gt;
&lt;p&gt;That is why instance fidelity matters so much. In many tasks, being approximately in the right conceptual neighborhood is enough to be useful. In engineering, that can be the beginning of failure rather than the end of it. A model that substitutes a different design scenario, silently shifts a governing assumption, applies the wrong interpretation of a standard, or answers in a way that cannot be checked by another party has not produced a near miss. It has broken attachment to the artifact. That kind of failure is especially dangerous because it can still look competent on a casual read.&lt;/p&gt;
&lt;p&gt;This is also why engineering is a more interesting challenge than a simple test of mathematical or coding ability. The hard part is often not deriving a formula. It is staying inside the bounds of the real task while carrying the correct assumptions all the way through to a legible output. That starts to look less like raw problem solving and more like a problem of task control: can the system monitor what object it is operating on, track which assumptions still govern the case, notice when more checking is needed, and converge on a form of completion that another party can actually use? General-purpose capability does not automatically guarantee that kind of discipline, and one of the open questions is how much of it is native to the model versus supplied by the surrounding harness.&lt;/p&gt;
&lt;p&gt;Seen that way, the question is not whether models can do engineering in some broad, promotional sense. The question is what conditions make engineering work stay controllable. That is a narrower question, but it is also the one that matters if the goal is not spectacle but dependable use.&lt;/p&gt;
&lt;h2&gt;The Harness Is Where Engineering Capability Gets Made&lt;/h2&gt;
&lt;p&gt;Once engineering is framed as artifact-bound and verification-sensitive work, the role of the harness looks different. It is no longer reasonable to think of the harness as a thin wrapper around model capability. The harness determines what the system can see, what method it can invoke, what constraints are explicit, what gets checked, what counts as a recoverable error, and what form the final answer must take. In other words, it helps determine not just how the system runs, but what kind of cognition the system is able to express.&lt;/p&gt;
&lt;p&gt;That matters because many of the hardest parts of engineering workflows live in precisely those layers. Tools can carry method that would otherwise need to be reconstructed unreliably from prompt context. &lt;strong&gt;More than that, many tools are artifacts of accumulated domain expertise.&lt;/strong&gt; They are places where a field has already embedded procedures, assumptions, checks, tolerances, and accepted ways of doing work. In some cases that expertise is explicit and executable, as in calculations, lookups, and verification routines. In other cases it is more qualitative, showing up as workflows, review habits, and best-practice sequences. Verifiers can enforce habits of checking that a model may not apply consistently on its own. Output contracts can force the system to conclude in a format that is inspectable, comparable, and operationally usable. Turn budgets and control flow can decide whether an agent has enough room to complete a careful review or whether it will collapse into partial work and malformed output.&lt;/p&gt;
&lt;p&gt;In that sense, the harness is not just infrastructure. It is part of the cognitive system. It allocates where reasoning happens, where discipline comes from, and how failure is surfaced. Recent work from OpenAI on &lt;a href=&quot;https://openai.com/index/harness-engineering/&quot;&gt;harness engineering&lt;/a&gt; makes the same point from another angle: once agents are doing real work, progress depends heavily on the legibility of the environment, the structure of the feedback loops, and the extent to which knowledge has been made accessible and enforceable inside the system. A calculator tool is not merely a convenience. It is a decision about which parts of the method should be made stable and externalised. A verifier is not just a quality filter. It is a decision to make certain forms of checking structurally available to the agent. Even the shape of the prompt matters less as isolated wording than as one part of a larger control architecture.&lt;/p&gt;
&lt;p&gt;This is why questions about orchestration in engineering cannot be reduced to prompt engineering. The meaningful design problem is architectural. What should the agent infer versus look up? What should be encoded in tools versus policy? Which checks should happen during reasoning and which should happen after? When should the system stop and ask for clarification instead of filling gaps with plausible narration? These are not implementation details to be cleaned up after capability arrives. They are part of how capability is built.&lt;/p&gt;
&lt;p&gt;Once you see that clearly, a lot of standard debates start to look underspecified. Asking which model is best without asking what environment it is operating in is often the wrong question. In engineering, the more useful question is where the capability actually lives. Some of it lives in the model. Some of it lives in the tools, verifiers, and control policies around it. And some of it may only emerge when those pieces are composed in the right way. If that is true, then progress in this domain will depend not only on better models, but on better harness engineering for engineering.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If that is true, then progress in this domain will depend not only on better models, but on better harness engineering for engineering.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Open Questions for AI Agents in Engineering and AEC&lt;/h2&gt;
&lt;p&gt;This is where the real research agenda begins. If we are serious about agentic engineering, there are several questions that still look underexplored and experimentally open. They are not small implementation questions. They are structural questions about where competence comes from, how reliability is made, and what exactly we are trying to optimise for.&lt;/p&gt;
&lt;p&gt;The first question is where domain knowledge should live. Some engineering method can be carried in the prompt, some in retrieved standards, some in dedicated tools, and some in verifier logic or decomposition policy. Those choices are not equivalent. A method embedded in prose instructions is available in a very different way from a method embedded in a tool or a structured reference. One of the central design problems in this domain is deciding which knowledge should remain internal to the model&apos;s reasoning and which should be stabilised outside it.&lt;/p&gt;
&lt;p&gt;The second question is the unit of work. We still do not know the right granularity for engineering agents. Some tasks may be small enough to support reliable execution but too narrow to create useful leverage. Others may be large enough to matter operationally but so broad that the agent starts drifting, skipping checks, or losing attachment to the governing artifact. The choice is not between tiny tasks and ambitious autonomy as abstract ideals. It is about finding the span of work within which an agent can still remain controllable. &lt;strong&gt;And once the relevant unit turns out not to be a task but a process, questions of decomposition, review structure, and interaction design move from the margins to the centre.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The third question is what actually predicts reliability. Final scores matter, but they are not enough. A system that arrives at a good answer through a fragile process may be much less useful than a system that exposes when it is uncertain, revisits questionable steps, compresses toward clear findings, and behaves differently when it is in trouble. We do not yet know which trace-level signals are robust enough to support runtime monitoring or intervention, but that is exactly the kind of question that becomes important once you care about real workflows instead of static evaluation alone. Recent work on reasoning structure and agent behaviour is starting to point in this direction, treating intermediate behaviour as something analysable rather than just an opaque path to the final score, as in &lt;a href=&quot;https://arxiv.org/abs/2601.06002&quot;&gt;Du et al.&apos;s Agentic Bonds framework&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The fourth question is how failure should be described. It is too coarse to talk about engineering failures as if they were all reasoning failures. In practice, the failure may be loss of task attachment, substitution of the instance, unsupported use of standards, malformed completion, or verbose but weakly grounded narration. Those are different breakdowns, and they imply different remedies. If we do not distinguish them, we will end up measuring systems in ways that hide the mechanisms we most need to improve.&lt;/p&gt;
&lt;p&gt;The fifth question is how much capability a well-designed environment can supply. This is one of the most important and most delicate unknowns. A good harness can clearly unlock behaviour that would not reliably appear in a weak setup. But that raises a harder interpretive problem. When does scaffolding help a model express real competence, and when does it effectively perform too much of the task on the model&apos;s behalf? We need to know that boundary, because it affects how we read both benchmark results and deployment claims.&lt;/p&gt;
&lt;p&gt;Each question changes what we should build, what we should measure, and how we should interpret success. Taken together, they suggest that the real frontier in agentic engineering is not just model improvement. It is experimental clarity about how capability is distributed across the whole system.&lt;/p&gt;
&lt;h2&gt;From Tasks to Processes: Why Engineering Workflows Need More Than Single-Step Agents&lt;/h2&gt;
&lt;p&gt;One reason this design space is easy to misunderstand is that we often talk about work as if it arrives in neatly bounded tasks. Sometimes it does. But many of the workflows that matter most in engineering are not really tasks in that sense. They are processes: longer-horizon structures with multiple artifacts, multiple experts, natural gates of review and control, and repeated moments where uncertainty has to be managed rather than ignored.&lt;/p&gt;
&lt;p&gt;That distinction matters because a process creates a different design problem from a task. A task invites a question like: can the system complete this unit of work correctly? A process invites harder questions. How should work be decomposed across time? Where should evidence accumulate? When should one expert intervene directly, and when should the system continue on its own? What kind of intermediate state has to be visible for redirection, challenge, or sign-off to be meaningful? The moment you move from tasks to processes, orchestration stops being a thin implementation layer and becomes part of the substance of the work.&lt;/p&gt;
&lt;p&gt;This is especially visible in workflows that look simple when described at a distance. A due diligence engagement can sound like one task: review the available material and produce a judgment. In practice it is nothing like a single bounded action. It is a process of evidence gathering, interpretation, cross-checking, escalation, synthesis, and review, often with several experts working in parallel and stepping in at different moments for different reasons. The important problem is not just whether an agent can perform one inference. It is whether the overall system can support the sequence of inferences, checks, and interventions that make the result trustworthy.&lt;/p&gt;
&lt;p&gt;That is also why LLM UX becomes more important, not less, once agents enter the picture. If the real unit of work is a process, then the interaction pattern between humans and the system becomes part of the capability. We need to know what it means for an expert to drive the loop rather than merely appear in the loop at pre-specified checkpoints. We need to know how a system should request judgment, how it should expose uncertainty, and how it should allow redirection without forcing the human to reconstruct the entire state of the process from scratch.&lt;/p&gt;
&lt;p&gt;This is not just a matter of convenience. It is a cognitive issue. These systems can already accumulate more intermediate state, more branching hypotheses, and more raw evidence than any single human can comfortably keep in working memory. So the question becomes: what is the control surface for a process like that? What is the communication layer that abstracts complexity without hiding the very evidence a reviewer may need to inspect? What kinds of summaries, provenance views, exception queues, and escalation mechanisms let a human meaningfully steer a process they cannot fully replay in their head?&lt;/p&gt;
&lt;p&gt;Seen this way, agentic engineering is not only about building systems that can do tasks. It is about building systems that can participate in processes without dissolving the human capacity to understand, direct, and verify what is happening. That may be one of the deepest reasons harness design matters so much in engineering. The harness is not just coordinating tools around a model. It is helping define the structure through which work, judgment, and control move over time.&lt;/p&gt;
&lt;h2&gt;AEC Needs Better Benchmarks, Not Better Demos&lt;/h2&gt;
&lt;p&gt;If capability is environmentally expressed, then environment design has to become an empirical science rather than a collection of intuitions. In engineering, and especially in AEC, that point still has not been fully absorbed. Too much of the current conversation remains stuck at the level of demos, generic prompting claims, or isolated examples of model fluency. What the field needs instead is controlled experimentation on engineering task environments. Instead of asking whether a model can solve a cherry-picked problem, we need task families that let us vary conditions deliberately, observe what changes, and learn which combinations of tools, constraints, interfaces, and verification loops actually produce dependable work.&lt;/p&gt;
&lt;p&gt;That starts with grounding. The tasks have to come from artifacts people actually use, and they have to reflect the problem spaces and value spaces that matter in practice. Every benchmark task is not just an isolated prompt. It is an instance of a broader problem class, and part of the job is identifying those classes deliberately rather than sampling whatever happens to be easy to score. From there, the environment itself has to become legible to experimentation: harness variants, tool access, verifier behaviour, output constraints, turn budgets, and other control variables need to be adjustable so we can see what is actually driving performance. And the measurements cannot stop at end scores. We need outcome metrics, but also trace-level analysis and explicit failure taxonomies that tell us what kind of process the system used and what kind of breakdown occurred when it failed.&lt;/p&gt;
&lt;p&gt;The evaluation target also has to expand beyond isolated tasks. In many real settings, the important object is a process: a longer-horizon workflow with multiple artifacts, multiple actors, parallel workstreams, and repeated gates of review and redirection. The benchmark surface therefore cannot stop at correctness on a single step. It has to ask how control is handed off, how intermediate state is exposed, how uncertainty is surfaced, and how the system behaves when a human expert needs to redirect the work without redoing it from scratch.&lt;/p&gt;
&lt;p&gt;That pushes evaluation toward a more realistic style. Engineering work is not purely textual, rarely single-step, and often only partly verifiable by one exact answer. Drawings, schedules, specifications, details, markups, diagrams, and spatial context are part of the task, not decorative extras. A serious research program in this space will need multimodal inputs much earlier than many current evals do. It will also need better machinery for tasks where adequacy, prioritisation, and review quality matter more than one final number.&lt;/p&gt;
&lt;p&gt;Benchmarking at process scale raises a cognitive problem as well as a technical one. Once these systems are processing more state, more evidence, and more branching intermediate work than any human can comfortably hold in mind, the design challenge shifts again: evaluation has to cover the communication layer between the system and the expert. How should complexity be abstracted without hiding the very evidence a reviewer may need to see? What summaries, state representations, provenance trails, and escalation patterns let a human steer a process they cannot fully replay in their head? Those questions sit at the boundary between interface design, cognition, and control, and they are likely to matter just as much as the underlying model or toolchain.&lt;/p&gt;
&lt;p&gt;The point is not to make benchmarks larger for their own sake. The point is to make them shaped enough like the work that they can teach us something real. If the benchmark is too convenient, it will mainly reward convenience. If the environment is too unlike practice, it will tell us very little about what systems can actually be trusted to do.&lt;/p&gt;
&lt;h2&gt;Reliability Before Autonomy: What AEC Firms Actually Need from AI Agents&lt;/h2&gt;
&lt;p&gt;The most useful near-term goal is probably not full engineering autonomy. It is reliable leverage on bounded tasks. Systems that can assist with scoped review, structured checking, discrepancy detection, and well-instrumented analysis may already create value long before end-to-end automation becomes plausible.&lt;/p&gt;
&lt;p&gt;That is also a more useful way to think about autonomy itself. &lt;strong&gt;In practice, autonomy is not just a property of the model. It is an achievement of the system.&lt;/strong&gt; A workflow becomes more autonomous when the surrounding environment makes the task legible, encodes the relevant constraints, exposes the right tools, and provides enough feedback for the agent to act with bounded independence rather than uncontrolled freedom.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In practice, autonomy is not just a property of the model. It is an achievement of the system.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That framing matters because it keeps the work tied to real workflow value instead of speculative theatre. Engineering organisations do not need a general claim that an agent is intelligent. They need to know whether a system can reduce time on a defined task while preserving reviewability and limiting failure cost. In many cases, the right near-term design is not autonomy but disciplined assistance inside a human-controlled process.&lt;/p&gt;
&lt;p&gt;That is also why reliability has to come before ambition. A narrower system that stays attached to the artifact, exposes its work clearly, and fails in legible ways is often more valuable than a broader one that produces impressive but weakly controllable outputs. &lt;strong&gt;In engineering, trust is not a cosmetic feature added after the fact. It is part of the product.&lt;/strong&gt; If a system cannot show why it acted, preserve the assumptions it relied on, and support meaningful intervention when things go wrong, then the path to autonomy runs through fragility rather than leverage.&lt;/p&gt;
&lt;p&gt;This is especially important in domains like AEC, where the work is high consequence, low volume, and deeply review-shaped. The near-term question is not whether a system can replace expertise wholesale. It is whether it can make expert time more powerful without dissolving accountability. The prize is not a machine that replaces engineering judgment in the abstract. It is a system that can participate in engineering workflows in ways that are inspectable, recoverable, and worth trusting on defined slices of the job.&lt;/p&gt;
&lt;h2&gt;Why Harness Engineering for AEC Matters Now&lt;/h2&gt;
&lt;p&gt;This is the kind of work that matters if we want progress in engineering AI to be meaningful rather than theatrical. The glamorous parts of this field are easy to spot. The &lt;a href=&quot;https://www.linkedin.com/pulse/when-expert-systems-need-coordination-building-workflow-galanos-zrudc/&quot;&gt;boring middle&lt;/a&gt; is not. &lt;strong&gt;But the boring middle is where much of the real leverage lives: task selection, workflow design, harness construction, interface design, verification, and evaluation.&lt;/strong&gt; If we skip that layer, we do not get reliable systems. We get better demos.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If we skip that layer, we do not get reliable systems. We get better demos.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The interesting frontier is not another round of vague claims that models are getting smarter. It is the slower and more consequential work of identifying valuable problem spaces, building task environments around them, designing the right control surfaces, and learning how capability is distributed across models, tools, interfaces, and human judgment.&lt;/p&gt;
&lt;p&gt;That is also what makes this an unusually exciting moment to work on these problems. The design space is still open. The operational patterns are not settled. The important domains are still underexplored. That means there is real room to shape the field: not just by building better models, but by building better harnesses, better evaluations, better interfaces, and better ways for experts and autonomous systems to work together on real tasks and real processes.&lt;/p&gt;
&lt;p&gt;If engineering and AEC are going to benefit meaningfully from these systems, that work cannot be left to benchmark convenience or generic product abstractions. It will require people who understand the domains, the artifacts, the review practices, and the economics of real workflows to help define what good looks like. Before this becomes a field of grand claims, it should become a field of good experiments, good environments, and good judgment about where these systems can actually be trusted.&lt;/p&gt;
</content:encoded><author>Theodoros Galanos</author></item></channel></rss>