Benchmarking Agents on Real Engineering Work Is Already Teaching Us Something Important
Benchmarking AI agents on real HVAC engineering tasks across Claude and GPT models. Results on harness-dependent capability, agent evaluation design, and why AEC-domain benchmarks reveal what general benchmarks miss.
harness-engineering, agentic-ai, ai-in-aec, ai-benchmarks