Archive

Writing

Notes on harness engineering, agent evaluation, and building reliable AI systems for real engineering work.

2026 · 11 pieces

2026-07-08

Mediation, Not Intermediation

Why the 'fix your foundations before AI' message has it backwards: agentic workflows are the way out of legacy data, and governance worth having is co-designed from practice, not by committees.

agentic-ai · harness-engineering · data-engineering · governance

2026-06-24

Task Worlds and Meta-Harnesses

How task worlds, Badiou, Plasticity, and the AEC-Bench meta-harness turn task prose, evidence, review, governance, and repair into runnable machinery.

agentic-ai · harness-engineering · aec-bench · task-worlds

2026-06-02

Plausible Answers, Failed Workflows

An AEC-Bench release evaluation read as a test of workflow reliability, not prose quality. Chapter by chapter: why a model can produce a plausible answer and still fail to produce the durable record a project has to audit.

harness-engineering · agentic-ai · ai-in-aec · ai-benchmarks

2026-05-16

Making aec-bench Trainable with Prime Lab

How aec-bench and Prime Intellect's Lab turn engineering benchmarks into verifier-backed RL environments, adapter training runs, and inspectable traces.

aec-bench · prime-lab · reinforcement-learning · agent-evaluation

2026-05-03

Executable Standards

Better tools and verifiers are not enough. The next harness boundary is the clause itself — turning standards, briefs, and codes into versioned predicates and replayable certificates.

harness-engineering · autoformalisation · ai-in-aec · formal-methods

2026-04-18

The Third Axis

What happens when you let the harness improve itself — two experiments in feedback-driven harness evolution, and an honest look at how rough the trajectory actually is.

harness-engineering · agentic-ai · self-improvement · autoresearch

2026-04-04 Featured

Recursive by Design

Building Recursive Language Model agents for real engineering tasks — from 1.5M tokens to 53K with Lambda-RLM, and what we learned about agent harness design along the way.

recursive-language-models · rlm · lambda-rlm · harness-design

2026-04-03 Featured

The Harness Is All You Need

Why domain-specific agent harnesses, not bigger models, are what close the AI performance gap on real engineering tasks — and why the AEC industry needs proper benchmarks to prove it.

benchmarks · aec · ai-agents · harness-design

2026-03-14

What If the Harness Could Improve Itself?

Applying the autoresearch pattern to self-improve an engineering agent harness. Automated prompt optimisation across HVAC audit tasks on Claude and GPT-4.1-mini, showing how harness engineering compounds when the improvement loop runs itself.

harness-engineering · autoresearch · agentic-ai · ai-in-aec

2026-03-12

Benchmarking Agents on Real Engineering Work Is Already Teaching Us Something Important

Benchmarking AI agents on real HVAC engineering tasks across Claude and GPT models. Results on harness-dependent capability, agent evaluation design, and why AEC-domain benchmarks reveal what general benchmarks miss.

harness-engineering · agentic-ai · ai-in-aec · ai-benchmarks

2026-03-10

Where Capability Actually Lives in Agentic Engineering

In AEC and domain-specific engineering, AI agent capability lives not in the model alone but in harness engineering — the tools, verifiers, orchestration, and process design that make agentic work reliable.

harness-engineering · agentic-ai · ai-in-aec · engineering-ai