Topic

Agent Evaluation

4 pieces in this thread.

2026-06-02

Plausible Answers, Failed Workflows

An AEC-Bench release evaluation, read as a test of workflow reliability rather than prose quality. Chapter by chapter: why a model can produce a plausible answer and still fail to leave the durable record a project has to audit.

harness-engineering, agentic-ai, ai-in-aec, ai-benchmarks
2026-05-16

Making aec-bench Trainable with Prime Lab

How aec-bench and Prime Intellect's Lab turn engineering benchmarks into verifier-backed RL environments, adapter training runs, and inspectable traces.

aec-bench, prime-lab, reinforcement-learning, agent-evaluation
2026-03-14

What If the Harness Could Improve Itself?

Applying the autoresearch pattern to self-improve an engineering agent harness. Automated prompt optimisation across HVAC audit tasks on Claude and GPT-4.1-mini, showing how harness engineering compounds when the improvement loop runs itself.

harness-engineering, autoresearch, agentic-ai, ai-in-aec
2026-03-12

Benchmarking Agents on Real Engineering Work Is Already Teaching Us Something Important

Benchmarking AI agents on real HVAC engineering tasks across Claude and GPT models. Results on harness-dependent capability, agent evaluation design, and why AEC-domain benchmarks reveal what general benchmarks miss.

harness-engineering, agentic-ai, ai-in-aec, ai-benchmarks