Plausible Answers, Failed Workflows
An AEC-Bench release evaluation, read as a test of workflow reliability rather than prose quality. Chapter by chapter: why a model can produce a plausible answer and still fail the durable record a project has to audit.
harness-engineering, agentic-ai, ai-in-aec, ai-benchmarks