A Large-Scale, Vector-Native Dataset for Construction Document Intelligence

Abstract

Modern large language models and multimodal systems have demonstrated impressive capabilities across natural images, text, and general documents. However, one domain has remained largely inaccessible to meaningful model training and evaluation: professionally produced construction documents.

This publication introduces and characterizes a large-scale dataset composed of nearly one million pages of real construction plans, the vast majority of which are vector-native technical drawings generated in active professional practice. To our knowledge, no publicly available dataset exists today that enables training or benchmarking of AI systems on construction documents at this scale, fidelity, or realism.

We believe this dataset represents a necessary foundation for advancing construction document intelligence as a distinct and underexplored genre within applied AI.

1. Why Construction Documents Are a Missing Modality

Most contemporary foundation models are trained on combinations of natural images, web text, scanned documents, and synthetic diagrams. While effective for general-purpose reasoning, these sources fail to capture the structure, density, and conventions of construction plans.

Construction documents are:

- Vector-native, not photographic
- Multi-page and multi-discipline, often exceeding 100 sheets per project
- Annotation-dense, with hundreds of interdependent dimensions, symbols, and notes
- Contextual across sheets, requiring cross-referencing and reconciliation

As a result, even state-of-the-art multimodal models frequently struggle with construction plans, likely less because of reasoning limitations than because they have rarely, if ever, seen data like this at scale.

2. Dataset Overview and Scale

The dataset consists of 7,550 complete construction plan sets, totaling an estimated 954,892 pages and 276.62 GB of vector-heavy technical content.

Core Statistics

Total PDF files: 7,550
Estimated total pages: 954,892
Average pages per plan set: 133.6
Median pages per plan set: 95
Number of categories: 16

Nearly half of all plan sets exceed 100 pages, reflecting the reality of modern construction documentation rather than curated academic examples.
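Summary figures like those under Core Statistics can be recomputed from a list of per-file page counts. The following is a minimal sketch, assuming a hypothetical list with one integer per PDF; the function name and the sample values are invented for illustration and are not drawn from the dataset.

```python
from statistics import mean, median

def summarize_page_counts(page_counts, threshold=100):
    """Return basic page-count statistics for a collection of plan sets."""
    if not page_counts:
        raise ValueError("page_counts must not be empty")
    # Count plan sets strictly above the threshold (100 pages by default).
    over = sum(1 for n in page_counts if n > threshold)
    return {
        "files": len(page_counts),
        "total_pages": sum(page_counts),
        "mean_pages": mean(page_counts),
        "median_pages": median(page_counts),
        "share_over_threshold": over / len(page_counts),
    }

# Invented sample values, not real dataset figures.
sample = [12, 48, 95, 101, 240, 980]
stats = summarize_page_counts(sample)
print(stats)
```

A mean well above the median, as in the reported 133.6 versus 95, is the usual sign of a long tail of very large plan sets.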

3. Composition by Project Type

The dataset spans residential, commercial, multi-family, mixed-use, infrastructure, and specialty construction.

Category	PDFs	Avg Pages	Est. Pages	Size (GB)
Residential	4,924	114.8	565,275	110.66
Commercial	1,359	132.9	180,679	95.06
Multi-Family Residential	78	982.5	76,639	18.83
Mixed-Use	82	889.3	72,923	28.46
New Residence	369	94.3	34,815	13.53
Remaining Categories	738	—	24,561	9.90

Residential and commercial projects account for about 78% of all pages, while multi-family and mixed-use plan sets, averaging roughly 980 and 890 pages each, contribute exceptional depth and scale.
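The totals and page shares implied by the table above can be checked with a few lines of arithmetic. This is a sketch only: the rows are copied from the table, while the variable and function names are illustrative.

```python
# (category, pdfs, est_pages, size_gb) copied from the table above.
ROWS = [
    ("Residential", 4924, 565275, 110.66),
    ("Commercial", 1359, 180679, 95.06),
    ("Multi-Family Residential", 78, 76639, 18.83),
    ("Mixed-Use", 82, 72923, 28.46),
    ("New Residence", 369, 34815, 13.53),
    ("Remaining Categories", 738, 24561, 9.90),
]

def page_shares(rows):
    """Return each category's fraction of the total estimated pages."""
    total = sum(pages for _, _, pages, _ in rows)
    return {name: pages / total for name, _, pages, _ in rows}

total_pdfs = sum(pdfs for _, pdfs, _, _ in ROWS)
total_pages = sum(pages for _, _, pages, _ in ROWS)
shares = page_shares(ROWS)
top_two = shares["Residential"] + shares["Commercial"]

print(total_pdfs, total_pages, f"{top_two:.1%}")
```

The row sums reproduce the 7,550 files and 954,892 estimated pages reported earlier. Dividing those two totals gives roughly 126.5 pages per file, somewhat below the 133.6 average listed under Core Statistics; the source does not say how that average was estimated.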

4. Vector-Native Characteristics

A defining attribute of this dataset is its overwhelmingly vector-based format.

Vector-native PDFs: 7,399 (98.0%)
Raster/scanned PDFs: 151 (2.0%)

This enables access to geometry, topology, and precise text placement that is fundamentally unavailable in raster-only datasets.

Estimated Vector Elements

Total vector elements: 12.59 billion
Lines: 12.27 billion
Curves: 284 million
Rectangles: 39 million
Text characters: 1.14 billion

These figures are orders of magnitude beyond what is typically available in document AI benchmarks.
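The source does not describe how the element counts above were produced. As one illustration, assume each page's vector paths are available in the shape that PyMuPDF's `Page.get_drawings()` returns: a list of dicts whose `items` entries begin with a code such as `l` (line), `c` (curve), `re` (rectangle), or `qu` (quad). A tally might then look like the sketch below. The sample paths are hand-written stand-ins, not dataset content, and the function name is hypothetical.

```python
from collections import Counter

# Map PyMuPDF-style item codes to the element names used in this section.
KIND_NAMES = {"l": "lines", "c": "curves", "re": "rectangles", "qu": "quads"}

def count_vector_elements(drawings):
    """Tally drawing items by kind across all paths on a page."""
    counts = Counter()
    for path in drawings:
        for item in path.get("items", []):
            # The first tuple element is the item-type code.
            counts[KIND_NAMES.get(item[0], "other")] += 1
    return counts

# Stand-in for one page's drawings; with PyMuPDF installed, this list
# would come from something like page.get_drawings() (assumption).
sample_drawings = [
    {"items": [("l", (0, 0), (10, 0)), ("l", (10, 0), (10, 8))]},
    {"items": [("re", (2, 2, 4, 4))]},
    {"items": [("c", (0, 0), (1, 2), (3, 2), (4, 0))]},
]
counts = count_vector_elements(sample_drawings)
print(dict(counts))
```

Summing such per-page tallies over every page in a corpus would give dataset-level totals comparable in kind to those listed above.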

5. Page-Type Diversity

From a representative sample:

Page Type	Percentage
Plumbing	18.3%
Floor Plans	9.0%
Elevations	8.3%
Electrical	6.9%
Foundation	6.3%
HVAC	5.7%
Sections	5.2%
Structural	2.1%
Other	28.5%

This diversity forces models to reason across disciplines, conventions, and scales, a capability rarely tested in existing benchmarks.

6. Annotation Density and Real-World Complexity

Construction plans are not clean datasets. They are iterative, inconsistent, and context-heavy, which are exactly the conditions under which AI systems must operate in practice.

Average dimensions per page: 42.3
Pages with ≥50 dimensions: 18.0%
High-complexity pages: 71.1%

Annotation quality varies widely, mirroring real production environments rather than idealized academic samples.
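Density figures of this kind can be approximated by counting dimension strings in each page's extracted text. The sketch below is one possible approach, not the method used for the numbers above: the regular expression, which targets feet-and-inches strings such as 12'-6", and the function name are assumptions, and the sample text is invented.

```python
import re

# Matches feet-and-inches strings such as 12'-6" or 3'-0 1/2".
# The pattern is an illustrative assumption, not the dataset's own rule.
DIM_RE = re.compile(r"\d+'\s?-\s?\d+(?:\s\d+/\d+)?" + '"')

def dimension_stats(page_texts, dense_threshold=50):
    """Count dimension strings per page and summarize their density."""
    if not page_texts:
        raise ValueError("page_texts must not be empty")
    counts = [len(DIM_RE.findall(text)) for text in page_texts]
    dense = sum(1 for c in counts if c >= dense_threshold)
    return {
        "counts": counts,
        "avg_per_page": sum(counts) / len(counts),
        "share_dense": dense / len(counts),
    }

# Invented page text for illustration only.
sample_pages = [
    '''12'-6" 3'-0 1/2" TYP. SEE 4/A-501''',
    "GENERAL NOTES: VERIFY ALL DIMENSIONS IN FIELD",
    '''8'-0" CLG''',
]
stats = dimension_stats(sample_pages, dense_threshold=2)
print(stats)
```

The default threshold of 50 mirrors the "≥50 dimensions" cutoff above; the example lowers it to 2 only because the sample pages are tiny. Metric dimensions and decimal feet would need additional patterns.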

7. Architectural and Structural Signal

Even a small subset of the dataset reveals dense architectural signal:

Average per floor plan:
- 47 walls
- 16,020 vector lines
- Multiple door, window, and room labels

This makes the dataset particularly valuable for tasks such as spatial reasoning, structural inference, and plan interpretation.

8. Why This Dataset Is Different

To our knowledge:

- There is no publicly available dataset at this scale composed of real, vector-native construction plans.
- There is no benchmark corpus that reflects the size, density, and multi-discipline nature of professional construction documentation.
- There is no existing training set that allows models to meaningfully learn construction-specific document structure rather than approximating it from unrelated data.

This dataset fills that gap.

9. Implications for Model Development

Early benchmarking suggests that general-purpose multimodal models often struggle with construction plans not due to reasoning deficiencies, but due to domain unfamiliarity and structural mismatch.

Conversely, systems exposed to construction-native data demonstrate materially different behavior in:

- Page classification
- Fine-grained text recall
- Context preservation across large plan sets
- Refusal accuracy when information is absent

These observations reinforce the idea that construction document intelligence is its own modality, not a subset of generic document understanding.
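The source does not define how refusal accuracy is measured. One plausible reading, offered here as an assumption, is the share of questions with no answer in the plan set that the system correctly declines, reported alongside the rate of unnecessary refusals on answerable questions. The class and function names below are hypothetical, and the sample results are invented.

```python
from dataclasses import dataclass

@dataclass
class Result:
    answerable: bool  # ground truth: is the information present in the plan set?
    refused: bool     # did the system decline to answer?

def refusal_metrics(results):
    """Compute refusal accuracy and false-refusal rate (assumed definitions)."""
    absent = [r for r in results if not r.answerable]
    present = [r for r in results if r.answerable]
    return {
        # Share of unanswerable questions that were correctly refused.
        "refusal_accuracy": (
            sum(r.refused for r in absent) / len(absent) if absent else None
        ),
        # Share of answerable questions that were wrongly refused.
        "false_refusal_rate": (
            sum(r.refused for r in present) / len(present) if present else None
        ),
    }

# Invented evaluation results for illustration only.
results = [
    Result(answerable=False, refused=True),
    Result(answerable=False, refused=True),
    Result(answerable=False, refused=False),
    Result(answerable=True, refused=False),
    Result(answerable=True, refused=False),
]
metrics = refusal_metrics(results)
print(metrics)
```

Reporting both numbers together guards against a system that scores well on refusal accuracy simply by refusing everything.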

10. Research Access and Collaboration

Foreman AI is open to research partnerships and dataset licensing for qualified organizations interested in advancing construction-focused AI, subject to appropriate agreements.

Conclusion

Construction documents represent one of the largest untapped data modalities in applied AI. This dataset establishes a foundation for rigorous benchmarking, meaningful training, and credible evaluation in a domain that directly underpins the physical world.

We believe access to data of this nature is a prerequisite for advancing the next generation of AI systems capable of reasoning about the built environment.