Topic

AI Benchmarks

2 pieces in this thread.

  1. What If the Harness Could Improve Itself?

    Applying the autoresearch pattern to self-improve an engineering agent harness. Automated prompt optimisation across HVAC audit tasks on Claude and GPT-4.1-mini, showing how harness engineering compounds when the improvement loop runs itself.

    harness-engineering, autoresearch, agentic-ai, ai-in-aec
  2. Benchmarking Agents on Real Engineering Work Is Already Teaching Us Something Important

    Benchmarking AI agents on real HVAC engineering tasks across Claude and GPT models. Results on harness-dependent capability, agent evaluation design, and why AEC-domain benchmarks reveal what general benchmarks miss.

    harness-engineering, agentic-ai, ai-in-aec, ai-benchmarks