Writing
Notes on harness engineering, agent evaluation, and building reliable AI systems for real engineering work.
-
Making aec-bench Trainable with Prime Lab
How aec-bench and Prime Intellect's Lab turn engineering benchmarks into verifier-backed RL environments, adapter training runs, and inspectable traces.
-
Executable Standards
Better tools and verifiers are not enough. The next harness boundary is the clause itself — turning standards, briefs, and codes into versioned predicates and replayable certificates.
-
The Third Axis
What happens when you let the harness improve itself — two experiments in feedback-driven harness evolution, and an honest look at how rough the trajectory actually is.
-
Recursive by Design
Building Recursive Language Model agents for real engineering tasks — from 1.5M tokens to 53K with Lambda-RLM, and what we learned about agent harness design along the way.
-
The Harness Is All You Need
Why domain-specific agent harnesses, not bigger models, are what close the AI performance gap on real engineering tasks — and why the AEC industry needs proper benchmarks to prove it.
-
What If the Harness Could Improve Itself?
Applying the autoresearch pattern to make an engineering agent harness self-improving. Automated prompt optimisation across HVAC audit tasks on Claude and GPT-4.1-mini, showing how harness engineering compounds when the improvement loop runs itself.
-
Benchmarking Agents on Real Engineering Work Is Already Teaching Us Something Important
Benchmarking AI agents on real HVAC engineering tasks across Claude and GPT models. Results on harness-dependent capability, agent evaluation design, and why AEC-domain benchmarks reveal what general benchmarks miss.
-
Where Capability Actually Lives in Agentic Engineering
In AEC and domain-specific engineering, AI agent capability lives not in the model alone but in harness engineering — the tools, verifiers, orchestration, and process design that make agentic work reliable.