#agent-evaluation
Posts in this topic thread.
What If the Harness Could Improve Itself?
harness-engineering autoresearch agentic-ai ai-in-aec ai-benchmarks agent-evaluation prompt-optimisation design-review
2026-03-12
Applying the autoresearch pattern to self-improve an engineering agent harness. Automated prompt optimisation across HVAC audit tasks on Claude and GPT-4.1-mini, showing how harness engineering compounds when the improvement loop runs itself.
Benchmarking Agents on Real Engineering Work Is Already Teaching Us Something Important
harness-engineering agentic-ai ai-in-aec ai-benchmarks agent-evaluation hvac-ai design-review mep-automation
Benchmarking AI agents on real HVAC engineering tasks across Claude and GPT models. Results on harness-dependent capability, agent evaluation design, and why AEC-domain benchmarks reveal what general benchmarks miss.