Foreman AI Benchmark 001
Empirical Comparison of General-Purpose AI vs Domain-Tuned Plan Intelligence
Abstract
This document presents Benchmark 001 in an ongoing evaluation series conducted by Foreman AI. The benchmark compares a state-of-the-art general-purpose large language model (LLM) against Foreman AI when applied to real-world construction plan sets.
The objective is not to measure abstract reasoning or conversational ability, but to evaluate practical construction plan intelligence: the ability to identify, extract, classify, and defend plan-stated information under real estimating and preconstruction constraints.
The results reveal a consistent and material performance gap favoring Foreman AI, particularly in recall depth, sheet coverage, and technical completeness.
Benchmark Scope
Benchmark 001 evaluates both systems against identical plan sets containing:
- Architectural floor plans
- Structural framing plans
- Elevations and sections
- MEP drawings and equipment schedules
- Dense annotations and small-font technical notes
Both systems were bound by the same rules (a sketch of what these imply for each extracted item follows the list):
- Plan-stated text only
- No geometric scaling
- No inferred quantities
- Mandatory refusal where proof was missing
- Verbatim source attribution required for all extracted data
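To make these constraints concrete, the sketch below shows one way each extracted item could be represented so that verbatim attribution and refusal are enforced. The field names and structure are illustrative assumptions, not Foreman AI's actual schema or the benchmark's implementation.

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class PlanExtraction:
    """One plan-stated fact carried with its verbatim proof."""
    sheet_id: str                # sheet the text appears on, e.g. "S2.1"
    verbatim_text: str           # exact plan wording; no paraphrase, no inference
    classification: str          # e.g. "structural note", "schedule entry"
    value: Optional[str] = None  # plan-stated value only; never scaled or derived

def extract_or_refuse(candidate: Optional[PlanExtraction]) -> Union[PlanExtraction, str]:
    """Return the extraction only when verbatim proof exists; otherwise refuse."""
    if candidate is None or not candidate.verbatim_text.strip():
        return "REFUSED: no plan-stated proof for this item"
    return candidate
```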
Key Observations
1. Recall Depth Is the Primary Differentiator
The general-purpose LLM reliably captured high-visibility elements such as room labels, prominent dimensions, and major annotations. However, it consistently under-recalled:
- Small-font structural notes
- Connector and hardware callouts
- Repetitive framing specifications
- Scope-modifying notes governed by "U.N.O."
- Dense annotations embedded within drawing fields
Foreman AI demonstrated materially higher recall across these same categories, without sacrificing refusal discipline.
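Recall depth can be read as a straightforward set comparison against a hand-built ground truth of plan-stated items. The scoring sketch below is illustrative only and assumes such a ground truth exists per category; it is not the benchmark's published scoring code.

```python
def recall_by_category(ground_truth: dict[str, set[str]],
                       captured: dict[str, set[str]]) -> dict[str, float]:
    """Fraction of ground-truth plan-stated items each system captured, per category.

    ground_truth: category -> verbatim items that actually appear on the plans
    captured:     category -> items the system extracted for that category
    """
    scores = {}
    for category, truth in ground_truth.items():
        found = captured.get(category, set())
        scores[category] = len(truth & found) / len(truth) if truth else 1.0
    return scores

# Hypothetical counts: dense structural notes are where recall tends to drop
truth = {"structural_notes": {"N1", "N2", "N3", "N4"}, "room_labels": {"101", "102"}}
seen  = {"structural_notes": {"N1"},                   "room_labels": {"101", "102"}}
print(recall_by_category(truth, seen))  # {'structural_notes': 0.25, 'room_labels': 1.0}
```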
2. Page Classification Coverage Diverged Significantly
The LLM classified some sheets correctly but struggled to:
- Identify all pages in large plan sets
- Maintain consistent sheet ID attribution
- Recognize alternates and option sheets
- Detect schedules embedded mid-set
Foreman AI showed near-complete sheet coverage and stable classification across architectural, structural, and MEP disciplines.
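Sheet coverage can be checked the same way: compare the sheet IDs listed on the drawing index against the sheets the system actually classified. The sketch below assumes an index sheet is available and is an illustration of the metric, not the benchmark's methodology.

```python
def sheet_coverage(index_sheet_ids: set[str],
                   classified: dict[str, str]) -> tuple[float, set[str]]:
    """Share of sheets from the drawing index that received any classification,
    plus the sheet IDs the system missed entirely.

    index_sheet_ids: sheet IDs listed on the cover/index sheet
    classified:      sheet ID -> discipline label assigned by the system
    """
    missed = index_sheet_ids - classified.keys()
    coverage = 1 - len(missed) / len(index_sheet_ids)
    return coverage, missed

# Hypothetical four-sheet set where only two sheets were classified
cov, missed = sheet_coverage({"A1.0", "A2.0", "S2.1", "M1.1"},
                             {"A1.0": "architectural", "S2.1": "structural"})
print(cov, missed)  # 0.5 coverage; A2.0 and M1.1 were never classified
```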
3. Vision-Based Models Favor Prominence Over Importance
General-purpose vision models exhibited a bias toward visually prominent content. Large text, clear labels, and central drawing elements were favored, while technically critical but visually dense regions were frequently under-processed.
This resulted in outputs that appeared confident but omitted information that would be considered mandatory in a professional plan review.
4. Conservative Refusal Was Correct — But Often Misapplied
The LLM demonstrated strong safety behavior, refusing to fabricate quantities or infer missing data. However, refusals were sometimes triggered by missed information, not true absence of plan-stated data.
Foreman AI showed equivalent refusal behavior while maintaining higher recall of verifiable content.
5. Schedule Detection and Use Were a Major Gap
Schedules proved to be a decisive differentiator.
The general-purpose LLM frequently:
- Missed schedules unless clearly isolated
- Failed to reconcile schedule entries with plan callouts
- Treated schedules as optional context
Foreman AI consistently identified schedules when present and extracted their contents verbatim, while refusing reconciliation when schedules were genuinely absent.
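The reconciliation step the general-purpose model tended to skip can be pictured as tag matching: every plan callout tag should resolve to a schedule row, and the correct behavior when no schedule exists is refusal, not inference. The tag formats and matching rule below are illustrative assumptions, not either system's actual logic.

```python
from typing import Optional

def reconcile_callouts(callout_tags: set[str],
                       schedule_rows: Optional[dict[str, dict]]) -> dict:
    """Match plan callout tags (e.g. unit tag "RTU-1") against schedule rows
    keyed by the same tag; refuse when no schedule was detected at all."""
    if schedule_rows is None:
        return {"status": "REFUSED", "reason": "no schedule present in the plan set"}
    matched = {tag: schedule_rows[tag] for tag in callout_tags if tag in schedule_rows}
    unmatched = sorted(callout_tags - schedule_rows.keys())
    return {"status": "OK", "matched": matched, "unmatched_callouts": unmatched}

# Hypothetical mechanical schedule with two rooftop units
schedule = {"RTU-1": {"type": "rooftop unit", "cfm": "2000"},
            "RTU-2": {"type": "rooftop unit", "cfm": "1200"}}
print(reconcile_callouts({"RTU-1", "RTU-3"}, schedule))
# RTU-1 is matched; RTU-3 is reported as an unmatched callout rather than guessed at
```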
Why This Matters
Construction drawings are not narrative documents. They are high-density technical artifacts where:
- Critical information is distributed, not centralized
- Visual prominence does not correlate with importance
- Shorthand and repetition are intentional
- Missing a single note can materially impact cost, scope, or constructability
Systems optimized for conversational or semantic understanding struggle under these conditions.
Benchmark 001 demonstrates that domain alignment and document intelligence outweigh raw model generality in construction applications.
Industry Implications
The findings suggest that:
- Larger or newer general-purpose models alone will not close this gap
- Vision-first approaches plateau quickly on dense technical drawings
- The core limitation is recall fidelity, not reasoning ability
- Construction AI must be evaluated on what it captures, not just what it refuses
This helps explain why many AI estimating tools perform well in demos but fail under real-world project conditions.
Conclusion
Benchmark 001 revealed a clear distinction in system behavior.
- General-purpose LLM: functioned as a cautious assistant
- Foreman AI: functioned as a plan reviewer
That distinction — between reading a plan and understanding a plan — defines the performance gap observed in this benchmark.
Additional benchmarks in this series will expand on these findings across different drawing types, project scales, and technical disciplines.
Experience the difference
Upload your own plan set and see how Foreman AI performs on your drawings.