A Large-Scale, Vector-Native Dataset for Construction Document Intelligence
Abstract
Modern large language models and multimodal systems have demonstrated impressive capabilities across natural images, text, and general documents. However, one domain has remained largely inaccessible to meaningful model training and evaluation: professionally produced construction documents.
This publication introduces and characterizes a large-scale dataset composed of nearly one million pages of real construction plans, the vast majority of which are vector-native technical drawings generated in active professional practice. To our knowledge, no publicly available dataset exists today that enables training or benchmarking of AI systems on construction documents at this scale, fidelity, or realism.
We believe this dataset represents a necessary foundation for advancing construction document intelligence as a distinct and underexplored genre within applied AI.
1. Why Construction Documents Are a Missing Modality
Most contemporary foundation models are trained on combinations of natural images, web text, scanned documents, and synthetic diagrams. While effective for general-purpose reasoning, these sources fail to capture the structure, density, and conventions of construction plans.
Construction documents are:
- Vector-native, not photographic
- Multi-page and multi-discipline, often exceeding 100 sheets per project
- Annotation-dense, with hundreds of interdependent dimensions, symbols, and notes
- Contextual across sheets, requiring cross-referencing and reconciliation
As a result, even state-of-the-art multimodal models frequently struggle not because of reasoning limitations, but because they have never seen data like this at scale.
2. Dataset Overview and Scale
The dataset consists of 7,550 complete construction plan sets, totaling an estimated 954,892 pages and 276.62 GB of vector-heavy technical content.
Core Statistics
- Total PDF files: 7,550
- Estimated total pages: 954,892
- Average pages per plan set: 133.6
- Median pages per plan set: 95
- Number of categories: 16
Nearly half of all plan sets exceed 100 pages, reflecting the reality of modern construction documentation rather than curated academic examples.
3. Composition by Project Type
The dataset spans residential, commercial, multi-family, mixed-use, infrastructure, and specialty construction.
| Category | PDFs | Avg Pages | Est. Pages | Size (GB) |
|---|---|---|---|---|
| Residential | 4,924 | 114.8 | 565,275 | 110.66 |
| Commercial | 1,359 | 132.9 | 180,679 | 95.06 |
| Multi-Family Residential | 78 | 982.5 | 76,639 | 18.83 |
| Mixed-Use | 82 | 889.3 | 72,923 | 28.46 |
| New Residence | 369 | 94.3 | 34,815 | 13.53 |
| Remaining Categories | 738 | — | 24,561 | 9.90 |
Residential and commercial projects account for about 78% of all pages. Multi-family and mixed-use plan sets make up only 160 of the 7,550 files but contribute roughly 16% of pages, averaging close to 900 pages or more per set.
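The page shares discussed here follow directly from the table. As a minimal sketch (the dictionary and function names are illustrative and not part of any dataset tooling), the per-category shares can be recomputed as follows:

```python
# Estimated page counts per category, copied from the table above.
EST_PAGES = {
    "Residential": 565_275,
    "Commercial": 180_679,
    "Multi-Family Residential": 76_639,
    "Mixed-Use": 72_923,
    "New Residence": 34_815,
    "Remaining Categories": 24_561,
}


def page_shares(pages):
    """Return each category's share of total pages as a percentage."""
    total = sum(pages.values())
    return {name: 100.0 * count / total for name, count in pages.items()}


shares = page_shares(EST_PAGES)
# Combined share of the two largest categories.
top_two = shares["Residential"] + shares["Commercial"]

if __name__ == "__main__":
    for name, pct in shares.items():
        print(f"{name}: {pct:.1f}%")
    print(f"Residential + Commercial: {top_two:.1f}%")
```

The category page counts sum to the reported total of 954,892, so the shares are computed against the same denominator used elsewhere in this document.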
4. Vector-Native Characteristics
A defining attribute of this dataset is its overwhelmingly vector-based format.
- Vector-native PDFs: 7,399 (98.0%)
- Raster/scanned PDFs: 151 (2.0%)
This enables access to geometry, topology, and precise text placement that is fundamentally unavailable in raster-only datasets.
Estimated Vector Elements
- Total vector elements: 12.59 billion
- Lines: 12.27 billion
- Curves: 284 million
- Rectangles: 39 million
- Text characters: 1.14 billion
These figures are orders of magnitude beyond what is typically available in document AI benchmarks.
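To put the totals in per-page terms, the reported counts can be divided by the estimated page count. The sketch below uses only figures stated in this document. The constant and function names are illustrative, and because the inputs are estimates, the results are rough averages rather than measured values.

```python
# Figures reported in this document (all are estimates).
TOTAL_PAGES = 954_892
LINES = 12.27e9
CURVES = 284e6
RECTANGLES = 39e6
TEXT_CHARS = 1.14e9


def per_page(count, pages=TOTAL_PAGES):
    """Average number of items per page."""
    return count / pages


# Lines, curves, and rectangles together make up the vector element total.
vector_elements = LINES + CURVES + RECTANGLES
line_share = LINES / vector_elements

if __name__ == "__main__":
    print(f"Vector elements per page: {per_page(vector_elements):,.0f}")
    print(f"Text characters per page: {per_page(TEXT_CHARS):,.0f}")
    print(f"Lines as share of vector elements: {line_share:.1%}")
```

On these numbers, an average page carries roughly 13,000 vector elements and about 1,200 text characters, and straight lines account for about 97% of all vector elements.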
5. Page-Type Diversity
Page-type distribution, estimated from a representative sample of pages:
| Page Type | Percentage |
|---|---|
| Plumbing | 18.3% |
| Floor Plans | 9.0% |
| Elevations | 8.3% |
| Electrical | 6.9% |
| Foundation | 6.3% |
| HVAC | 5.7% |
| Sections | 5.2% |
| Structural | 2.1% |
| Other | 28.5% |
This diversity forces models to reason across disciplines, conventions, and scales, a capability rarely tested in existing benchmarks.
6. Annotation Density and Real-World Complexity
Construction plans are not clean data. They are iterative, inconsistent, and context-heavy, which are exactly the conditions under which AI systems must operate in practice.
- Average dimensions per page: 42.3
- Pages with ≥50 dimensions: 18.0%
- High-complexity pages: 71.1%
Annotation quality varies widely, mirroring real production environments rather than idealized academic samples.
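This document does not specify how dimension annotations were counted. As one hypothetical approach, a page's extracted text runs could be scanned for imperial feet-inches strings and compared against the 50-dimension cutoff cited above. The pattern, function names, and sample strings below are assumptions made for illustration only.

```python
import re

# Assumed format: imperial feet-inches strings such as 12'-6" or 3'-0 1/2".
# Real drawings use many other styles (metric, decimal feet, bare inches,
# typographic prime marks), which this pattern deliberately ignores.
FEET_INCHES = re.compile(r"\d+'\s*-\s*\d+(?:\s+\d+/\d+)?\"")


def count_dimensions(text_runs):
    """Count feet-inches dimension strings across a page's text runs."""
    return sum(len(FEET_INCHES.findall(run)) for run in text_runs)


def is_dimension_dense(text_runs, threshold=50):
    """True if the page has at least `threshold` matched dimensions."""
    return count_dimensions(text_runs) >= threshold


# Hypothetical text runs from a single page.
sample_page = ["12'-6\"", "3'-0 1/2\"", "SEE DETAIL 4/A5.1", "TYP.", "8' - 0\""]

if __name__ == "__main__":
    print(count_dimensions(sample_page))
    print(is_dimension_dense(sample_page))
```

A production pipeline would likely need to use text position and nearby geometry (dimension lines, arrowheads) as well, since a string match alone cannot tell a dimension from a note that happens to quote one.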
7. Architectural and Structural Signal
Even a small subset of the dataset reveals dense architectural signal:
- Average per floor plan:
  - 47 walls
  - 16,020 vector lines
  - Multiple door, window, and room labels
This makes the dataset particularly valuable for tasks such as spatial reasoning, structural inference, and plan interpretation.
8. Why This Dataset Is Different
To our knowledge:
- There is no publicly available dataset at this scale composed of real, vector-native construction plans
- There is no benchmark corpus that reflects the size, density, and multi-discipline nature of professional construction documentation
- There is no existing training set that allows models to meaningfully learn construction-specific document structure rather than approximating it from unrelated data
This dataset fills that gap.
9. Implications for Model Development
Early benchmarking suggests that general-purpose multimodal models often struggle with construction plans not due to reasoning deficiencies, but due to domain unfamiliarity and structural mismatch.
Conversely, systems exposed to construction-native data demonstrate materially different behavior in:
- Page classification
- Fine-grained text recall
- Context preservation across large plan sets
- Refusal accuracy when information is absent
These observations reinforce the idea that construction document intelligence is its own modality, not a subset of generic document understanding.
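The text does not define refusal accuracy. One plausible definition, assumed here purely for illustration, is the share of questions whose answer is absent from the plan set on which the model declines to answer. The `EvalItem` record and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EvalItem:
    answer_present: bool  # is the requested information in the plan set?
    model_refused: bool   # did the model decline to answer?


def refusal_accuracy(items):
    """Share of unanswerable items on which the model refused.

    Returns None when no item is unanswerable.
    """
    unanswerable = [it for it in items if not it.answer_present]
    if not unanswerable:
        return None
    return sum(it.model_refused for it in unanswerable) / len(unanswerable)


def over_refusal_rate(items):
    """Share of answerable items on which the model refused anyway."""
    answerable = [it for it in items if it.answer_present]
    if not answerable:
        return None
    return sum(it.model_refused for it in answerable) / len(answerable)


# Hypothetical evaluation results.
results = [
    EvalItem(answer_present=False, model_refused=True),
    EvalItem(answer_present=False, model_refused=False),
    EvalItem(answer_present=True, model_refused=False),
    EvalItem(answer_present=True, model_refused=True),
]

if __name__ == "__main__":
    print(refusal_accuracy(results), over_refusal_rate(results))
```

Reporting the over-refusal rate alongside refusal accuracy guards against a model that scores well simply by declining everything.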
10. Research Access and Collaboration
Foreman AI is open to research partnerships and dataset licensing for qualified organizations interested in advancing construction-focused AI, subject to appropriate agreements.
Conclusion
Construction documents represent one of the largest untapped data modalities in applied AI. This dataset establishes a foundation for rigorous benchmarking, meaningful training, and credible evaluation in a domain that directly underpins the physical world.
We believe access to data of this nature is a prerequisite for advancing the next generation of AI systems capable of reasoning about the built environment.