Engineering is a useful stress test for agentic systems because it exposes a kind of weakness that general demos often hide. The problem is not that the domain is technical in the abstract. It is that the work is instance-bound, constraint-heavy, and intolerant of plausibly generic answers. To be useful here, an agent has to do more than sound competent. It has to stay attached to the artifact, preserve the method, survive verification, and finish in a form that another system or person can trust.
That changes the way we should think about progress. In domains like engineering, capability cannot be understood as a property of the base model alone. What matters is whether the full system can remain faithful to the specific document, drawing, assumptions, standards, and constraints that define the task in front of it. The question is not just whether a model can reason in the neighborhood of the work. It is whether the operating environment lets reliable work happen at all.
This broader shift is already visible across AI development. State-of-the-art behaviour increasingly comes from compound systems (BAIR, 2024) rather than single model calls, and actual work-related use still occupies a much narrower space than theoretical capability would suggest, which is part of why deployment questions matter so much (Anthropic, 2026). Engineering sharpens that pattern. It forces a harder question than many benchmark settings do: where does reliable capability actually live? How much is native to the model, how much is supplied by tools and verifiers, and how much emerges only when the environment is designed correctly? These questions are especially pressing in a landscape where benchmark coverage is still skewed toward more convenient task domains.
That is the question this essay is really about.
TL;DR
- In engineering, useful agent performance depends less on generic fluency than on staying attached to the specific artifact, assumptions, constraints, and output format of the task at hand.
- That means capability does not live in the base model alone. It is distributed across the full system: tools, verifiers, control flow, output contracts, interfaces, and human review structure.
- The real design problem is therefore not just model selection. It is harness design: deciding what the agent should infer, what should be externalised into tools, what should be checked, and how the work should remain controllable.
- Many important engineering workflows are better understood as processes rather than isolated tasks, which makes orchestration, UX, visibility, and intervention part of the capability story too.
- If we want meaningful progress in engineering and AEC, we need better experimental environments and benchmarks that reflect real artifacts, real failure modes, and the actual conditions required for reliable work.
Why AEC Engineering Breaks Most AI Agent Approaches
It is tempting to treat engineering as one more expert domain that large models will gradually absorb as they become smarter and better trained. There is some truth in that. Engineering work is full of technical language, quantitative reasoning, standards, and procedural knowledge, all of which are at least partially representable in text, math, and code. But that description misses the thing that makes the domain difficult in practice.
Engineering is not hard only because it contains specialized knowledge. It is hard because the work is tied to particular artifacts and particular consequences. A calculation is not just a calculation. It belongs to a drawing set, a schedule, a climate zone, a standard, a collection of assumptions, and a downstream chain of decisions that may depend on it being right for this case and not some nearby case. In that environment, generic competence is not enough. A system can sound perfectly fluent and still fail the job by drifting away from the assigned instance.
That is why instance fidelity matters so much. In many tasks, being approximately in the right conceptual neighborhood is enough to be useful. In engineering, that can be the beginning of failure rather than the end of it. A model that substitutes a different design scenario, silently shifts a governing assumption, applies the wrong interpretation of a standard, or answers in a way that cannot be checked by another party has not produced a near miss. It has broken attachment to the artifact. That kind of failure is especially dangerous because it can still look competent on a casual read.
This is also why engineering is a more interesting challenge than a simple test of mathematical or coding ability. The hard part is often not deriving a formula. It is staying inside the bounds of the real task while carrying the correct assumptions all the way through to a legible output. That starts to look less like raw problem solving and more like a problem of task control: can the system monitor what object it is operating on, track which assumptions still govern the case, notice when more checking is needed, and converge on a form of completion that another party can actually use? General-purpose capability does not automatically guarantee that kind of discipline, and one of the open questions is how much of it is native to the model versus supplied by the surrounding harness.
Seen that way, the question is not whether models can do engineering in some broad, promotional sense. The question is what conditions make engineering work stay controllable. That is a narrower question, but it is also the one that matters if the goal is not spectacle but dependable use.
The Harness Is Where Engineering Capability Gets Made
Once engineering is framed as artifact-bound and verification-sensitive work, the role of the harness looks different. It is no longer reasonable to think of the harness as a thin wrapper around model capability. The harness determines what the system can see, what method it can invoke, what constraints are explicit, what gets checked, what counts as a recoverable error, and what form the final answer must take. In other words, it helps determine not just how the system runs, but what kind of cognition the system is able to express.
That matters because many of the hardest parts of engineering workflows live in precisely those layers. Tools can carry method that would otherwise need to be reconstructed unreliably from prompt context. More than that, many tools are artifacts of accumulated domain expertise. They are places where a field has already embedded procedures, assumptions, checks, tolerances, and accepted ways of doing work. In some cases that expertise is explicit and executable, as in calculations, lookups, and verification routines. In other cases it is more qualitative, showing up as workflows, review habits, and best-practice sequences.

Verifiers can enforce habits of checking that a model may not apply consistently on its own. Output contracts can force the system to conclude in a format that is inspectable, comparable, and operationally usable. Turn budgets and control flow can decide whether an agent has enough room to complete a careful review or whether it will collapse into partial work and malformed output.
In that sense, the harness is not just infrastructure. It is part of the cognitive system. It allocates where reasoning happens, where discipline comes from, and how failure is surfaced. Recent work from OpenAI on harness engineering makes the same point from another angle: once agents are doing real work, progress depends heavily on the legibility of the environment, the structure of the feedback loops, and the extent to which knowledge has been made accessible and enforceable inside the system. A calculator tool is not merely a convenience. It is a decision about which parts of the method should be made stable and externalised. A verifier is not just a quality filter. It is a decision to make certain forms of checking structurally available to the agent. Even the shape of the prompt matters less as isolated wording than as one part of a larger control architecture.
This is why questions about orchestration in engineering cannot be reduced to prompt engineering. The meaningful design problem is architectural. What should the agent infer versus look up? What should be encoded in tools versus policy? Which checks should happen during reasoning and which should happen after? When should the system stop and ask for clarification instead of filling gaps with plausible narration? These are not implementation details to be cleaned up after capability arrives. They are part of how capability is built.
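To make those architectural choices concrete, here is a deliberately toy harness loop. Everything in it is a hypothetical sketch (the `Harness` class, the action format, the `area` tool, the scripted agent), not a real agent API: the point is only that method, checking, output format, and turn budget live in the environment rather than in the prompt.

```python
"""A minimal harness sketch: tools carry method, a verifier enforces
checking, an output contract decides what counts as done, and a turn
budget bounds the loop. All names here are illustrative."""
from dataclasses import dataclass
from typing import Callable

@dataclass
class Harness:
    tools: dict[str, Callable]           # externalised method
    verifier: Callable[[dict], bool]     # structurally available checking
    contract: set[str]                   # fields the final answer must carry
    max_turns: int = 8                   # turn budget / control flow

    def run(self, agent_step: Callable[[dict], dict], state: dict) -> dict:
        for _ in range(self.max_turns):
            action = agent_step(state)
            if action["type"] == "tool":
                # Method lives in the tool, not in free-form narration.
                state[action["name"]] = self.tools[action["name"]](*action["args"])
            elif action["type"] == "finish":
                answer = action["answer"]
                if not self.contract <= answer.keys():
                    state["error"] = "malformed completion"   # output contract
                    continue
                if self.verifier(answer):                     # enforced check
                    return answer
                state["error"] = "verification failed"
        return {"status": "turn budget exhausted", "state": state}

# A scripted 'agent' that uses a tool, then finishes inside the contract.
steps = iter([
    {"type": "tool", "name": "area", "args": (3.0, 4.0)},
    {"type": "finish", "answer": {"area": 12.0, "assumptions": "b=3 m, h=4 m"}},
])
harness = Harness(tools={"area": lambda b, h: b * h},
                  verifier=lambda a: a["area"] == 12.0,
                  contract={"area", "assumptions"})
result = harness.run(lambda s: next(steps), {})
```

Even at this toy scale, the design questions from the paragraph above show up as code: the contract check and the verifier are decisions about what the agent must satisfy, not suggestions about what it should remember.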
Once you see that clearly, a lot of standard debates start to look underspecified. Asking which model is best without asking what environment it is operating in is often the wrong question. In engineering, the more useful question is where the capability actually lives. Some of it lives in the model. Some of it lives in the tools, verifiers, and control policies around it. And some of it may only emerge when those pieces are composed in the right way. If that is true, then progress in this domain will depend not only on better models, but on better harness engineering for engineering.
Open Questions for AI Agents in Engineering and AEC
This is where the real research agenda begins. If we are serious about agentic engineering, there are several questions that still look underexplored and experimentally open. They are not small implementation questions. They are structural questions about where competence comes from, how reliability is made, and what exactly we are trying to optimise for.
The first question is where domain knowledge should live. Some engineering method can be carried in the prompt, some in retrieved standards, some in dedicated tools, and some in verifier logic or decomposition policy. Those choices are not equivalent. A method embedded in prose instructions is available in a very different way from a method embedded in a tool or a structured reference. One of the central design problems in this domain is deciding which knowledge should remain internal to the model’s reasoning and which should be stabilised outside it.
The second question is the unit of work. We still do not know the right granularity for engineering agents. Some tasks may be small enough to support reliable execution but too narrow to create useful leverage. Others may be large enough to matter operationally but so broad that the agent starts drifting, skipping checks, or losing attachment to the governing artifact. The choice is not between tiny tasks and ambitious autonomy as abstract ideals. It is about finding the span of work within which an agent can still remain controllable. And once the relevant unit turns out not to be a task but a process, questions of decomposition, review structure, and interaction design move from the margins to the centre.
The third question is what actually predicts reliability. Final scores matter, but they are not enough. A system that arrives at a good answer through a fragile process may be much less useful than a system that exposes when it is uncertain, revisits questionable steps, compresses toward clear findings, and behaves differently when it is in trouble. We do not yet know which trace-level signals are robust enough to support runtime monitoring or intervention, but that is exactly the kind of question that becomes important once you care about real workflows instead of static evaluation alone. Recent work on reasoning structure and agent behaviour is starting to point in this direction, treating intermediate behaviour as something analysable rather than just an opaque path to the final score, as in Du et al.’s Agentic Bonds framework.
The fourth question is how failure should be described. It is too coarse to talk about engineering failures as if they were all reasoning failures. In practice, the failure may be loss of task attachment, substitution of the instance, unsupported use of standards, malformed completion, or verbose but weakly grounded narration. Those are different breakdowns, and they imply different remedies. If we do not distinguish them, we will end up measuring systems in ways that hide the mechanisms we most need to improve.
The fifth question is how much capability a well-designed environment can supply. This is one of the most important and most delicate unknowns. A good harness can clearly unlock behaviour that would not reliably appear in a weak setup. But that raises a harder interpretive problem. When does scaffolding help a model express real competence, and when does it effectively perform too much of the task on the model’s behalf? We need to know that boundary, because it affects how we read both benchmark results and deployment claims.
Each question changes what we should build, what we should measure, and how we should interpret success. Taken together, they suggest that the real frontier in agentic engineering is not just model improvement. It is experimental clarity about how capability is distributed across the whole system.
From Tasks to Processes: Why Engineering Workflows Need More Than Single-Step Agents
One reason this design space is easy to misunderstand is that we often talk about work as if it arrives in neatly bounded tasks. Sometimes it does. But many of the workflows that matter most in engineering are not really tasks in that sense. They are processes: longer-horizon structures with multiple artifacts, multiple experts, natural gates of review and control, and repeated moments where uncertainty has to be managed rather than ignored.
That distinction matters because a process creates a different design problem from a task. A task invites a question like: can the system complete this unit of work correctly? A process invites harder questions. How should work be decomposed across time? Where should evidence accumulate? When should one expert intervene directly, and when should the system continue on its own? What kind of intermediate state has to be visible for redirection, challenge, or sign-off to be meaningful? The moment you move from tasks to processes, orchestration stops being a thin implementation layer and becomes part of the substance of the work.
This is especially visible in workflows that look simple when described at a distance. A due diligence engagement can sound like one task: review the available material and produce a judgment. In practice it is nothing like a single bounded action. It is a process of evidence gathering, interpretation, cross-checking, escalation, synthesis, and review, often with several experts working in parallel and stepping in at different moments for different reasons. The important problem is not just whether an agent can perform one inference. It is whether the overall system can support the sequence of inferences, checks, and interventions that make the result trustworthy.
That is also why LLM UX becomes more important, not less, once agents enter the picture. If the real unit of work is a process, then the interaction pattern between humans and the system becomes part of the capability. We need to know what it means for an expert to drive the loop rather than merely appear in the loop at pre-specified checkpoints. We need to know how a system should request judgment, how it should expose uncertainty, and how it should allow redirection without forcing the human to reconstruct the entire state of the process from scratch.
This is not just a matter of convenience. It is a cognitive issue. These systems can already accumulate more intermediate state, more branching hypotheses, and more raw evidence than any single human can comfortably keep in working memory. So the question becomes: what is the control surface for a process like that? What is the communication layer that abstracts complexity without hiding the very evidence a reviewer may need to inspect? What kinds of summaries, provenance views, exception queues, and escalation mechanisms let a human meaningfully steer a process they cannot fully replay in their head?
Seen this way, agentic engineering is not only about building systems that can do tasks. It is about building systems that can participate in processes without dissolving the human capacity to understand, direct, and verify what is happening. That may be one of the deepest reasons harness design matters so much in engineering. The harness is not just coordinating tools around a model. It is helping define the structure through which work, judgment, and control move over time.
AEC Needs Better Benchmarks, Not Better Demos
If capability is environmentally expressed, then environment design has to become an empirical science rather than a collection of intuitions. In engineering, and especially in AEC, that point still has not been fully absorbed. Too much of the current conversation remains stuck at the level of demos, generic prompting claims, or isolated examples of model fluency. What the field needs instead is controlled experimentation on engineering task environments. Instead of asking whether a model can solve a cherry-picked problem, we need task families that let us vary conditions deliberately, observe what changes, and learn which combinations of tools, constraints, interfaces, and verification loops actually produce dependable work.
That starts with grounding. The tasks have to come from artifacts people actually use, and they have to reflect the problem spaces and value spaces that matter in practice. A benchmark task is not just an isolated prompt; it is an instance of a broader problem class, and part of the job is identifying those classes deliberately rather than sampling whatever happens to be easy to score. From there, the environment itself has to become legible to experimentation: harness variants, tool access, verifier behaviour, output constraints, turn budgets, and other control variables need to be adjustable so we can see what is actually driving performance. And the measurements cannot stop at end scores. We need outcome metrics, but also trace-level analysis and explicit failure taxonomies that tell us what kind of process the system used and what kind of breakdown occurred when it failed.
The evaluation target also has to expand beyond isolated tasks. In many real settings, the important object is a process: a longer-horizon workflow with multiple artifacts, multiple actors, parallel workstreams, and repeated gates of review and redirection. The benchmark surface therefore cannot stop at correctness on a single step. It has to ask how control is handed off, how intermediate state is exposed, how uncertainty is surfaced, and how the system behaves when a human expert needs to redirect the work without redoing it from scratch.
That pushes evaluation toward a more realistic style. Engineering work is not purely textual, rarely single-step, and often only partly verifiable by one exact answer. Drawings, schedules, specifications, details, markups, diagrams, and spatial context are part of the task, not decorative extras. A serious research program in this space will need multimodal inputs much earlier than many current evals do. It will also need better machinery for tasks where adequacy, prioritisation, and review quality matter more than one final number.
It also raises a cognitive problem, not only a technical one. Once these systems are processing more state, more evidence, and more branching intermediate work than any human can comfortably hold in mind, the design challenge shifts again: evaluation has to ask how complexity is abstracted without hiding the evidence a reviewer needs, and which summaries, state representations, provenance trails, and escalation patterns let a human steer a process they cannot fully replay in their head. Those questions sit at the boundary between interface design, cognition, and control, and they are likely to matter just as much as the underlying model or toolchain.
The point is not to make benchmarks larger for their own sake. The point is to make them shaped enough like the work that they can teach us something real. If the benchmark is too convenient, it will mainly reward convenience. If the environment is too unlike practice, it will tell us very little about what systems can actually be trusted to do.
Reliability Before Autonomy: What AEC Firms Actually Need from AI Agents
The most useful near-term goal is probably not full engineering autonomy. It is reliable leverage on bounded tasks. Systems that can assist with scoped review, structured checking, discrepancy detection, and well-instrumented analysis may already create value long before end-to-end automation becomes plausible.
That is also a more useful way to think about autonomy itself. In practice, autonomy is not just a property of the model. It is an achievement of the system. A workflow becomes more autonomous when the surrounding environment makes the task legible, encodes the relevant constraints, exposes the right tools, and provides enough feedback for the agent to act with bounded independence rather than uncontrolled freedom.
That framing matters because it keeps the work tied to real workflow value instead of speculative theatre. Engineering organisations do not need a general claim that an agent is intelligent. They need to know whether a system can reduce time on a defined task while preserving reviewability and limiting failure cost. In many cases, the right near-term design is not autonomy but disciplined assistance inside a human-controlled process.
That is also why reliability has to come before ambition. A narrower system that stays attached to the artifact, exposes its work clearly, and fails in legible ways is often more valuable than a broader one that produces impressive but weakly controllable outputs. In engineering, trust is not a cosmetic feature added after the fact. It is part of the product. If a system cannot show why it acted, preserve the assumptions it relied on, and support meaningful intervention when things go wrong, then the path to autonomy runs through fragility rather than leverage.
This is especially important in domains like AEC, where the work is high consequence, low volume, and deeply review-shaped. The near-term question is not whether a system can replace expertise wholesale. It is whether it can make expert time more powerful without dissolving accountability. The prize is not a machine that replaces engineering judgment in the abstract. It is a system that can participate in engineering workflows in ways that are inspectable, recoverable, and worth trusting on defined slices of the job.
Why Harness Engineering for AEC Matters Now
This is the kind of work that matters if we want progress in engineering AI to be meaningful rather than theatrical. The glamorous parts of this field are easy to spot. The boring middle is not. But the boring middle is where much of the real leverage lives: task selection, workflow design, harness construction, interface design, verification, and evaluation. If we skip that layer, we do not get reliable systems. We get better demos.
The interesting frontier is not another round of vague claims that models are getting smarter. It is the slower and more consequential work of identifying valuable problem spaces, building task environments around them, designing the right control surfaces, and learning how capability is distributed across models, tools, interfaces, and human judgment.
That is also what makes this an unusually exciting moment to work on these problems. The design space is still open. The operational patterns are not settled. The important domains are still underexplored. That means there is real room to shape the field: not just by building better models, but by building better harnesses, better evaluations, better interfaces, and better ways for experts and autonomous systems to work together on real tasks and real processes.
If engineering and AEC are going to benefit meaningfully from these systems, that work cannot be left to benchmark convenience or generic product abstractions. It will require people who understand the domains, the artifacts, the review practices, and the economics of real workflows to help define what good looks like. Before this becomes a field of grand claims, it should become a field of good experiments, good environments, and good judgment about where these systems can actually be trusted.