Foreman AI

Publication December 2025

A Large-Scale, Vector-Native Dataset for Construction Document Intelligence

Abstract

Modern large language models and multimodal systems have demonstrated impressive capabilities across natural images, text, and general documents. However, one domain has remained largely inaccessible to meaningful model training and evaluation: professionally produced construction documents.

This publication introduces and characterizes a large-scale dataset composed of nearly one million pages of real construction plans, the vast majority of which are vector-native technical drawings generated in active professional practice. To our knowledge, no publicly available dataset exists today that enables training or benchmarking of AI systems on construction documents at this scale, fidelity, or realism.

We believe this dataset represents a necessary foundation for advancing construction document intelligence as a distinct and underexplored genre within applied AI.

1. Why Construction Documents Are a Missing Modality

Most contemporary foundation models are trained on combinations of natural images, web text, scanned documents, and synthetic diagrams. While effective for general-purpose reasoning, these sources fail to capture the structure, density, and conventions of construction plans.

Construction documents are:

  • Vector-native, not photographic
  • Multi-page and multi-discipline, often exceeding 100 sheets per project
  • Annotation-dense, with hundreds of interdependent dimensions, symbols, and notes
  • Contextual across sheets, requiring cross-referencing and reconciliation

As a result, even state-of-the-art multimodal models frequently struggle with construction plans, not because of reasoning limitations but because they have never seen data like this at scale.
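To make the cross-sheet dependency concrete, here is a minimal sketch of how references between sheets might be collected and checked. It assumes callouts written as "detail/sheet" (for example 3/A-501) and sheet IDs of one or two letters, a hyphen, and three digits. Both conventions are common in practice, but they are assumptions here, not something the dataset description specifies. The names `CALLOUT_RE`, `build_reference_graph`, and `unresolved_references` are hypothetical.

```python
import re

# Assumed callout form: <detail number>/<sheet id>, e.g. "3/A-501".
CALLOUT_RE = re.compile(r"\b(\d+)/([A-Z]{1,2}-\d{3})\b")


def build_reference_graph(sheet_text):
    """Map each sheet ID to the set of sheet IDs its text refers to."""
    graph = {}
    for sheet_id, text in sheet_text.items():
        graph[sheet_id] = {target for _detail, target in CALLOUT_RE.findall(text)}
    return graph


def unresolved_references(graph):
    """Return, per sheet, the referenced sheets that are missing from the set."""
    known = set(graph)
    return {
        sheet: sorted(targets - known)
        for sheet, targets in graph.items()
        if targets - known
    }


# Toy input: extracted text keyed by sheet ID (illustrative only).
sheets = {
    "A-101": "SEE 3/A-501 FOR WALL SECTION AND 1/S-201 FOR FOOTING",
    "A-501": "TYPICAL WALL SECTION",
}
graph = build_reference_graph(sheets)
missing = unresolved_references(graph)
```

In this toy plan set, A-101 points at a structural sheet that is not included, which is the kind of reconciliation problem a single-page model never encounters.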

2. Dataset Overview and Scale

The dataset consists of 7,550 complete construction plan sets, totaling an estimated 954,892 pages and 276.62 GB of vector-heavy technical content.

Core Statistics

  • Total PDF files: 7,550
  • Estimated total pages: 954,892
  • Average pages per plan set: 133.6
  • Median pages per plan set: 95
  • Number of categories: 16

Nearly half of all plan sets exceed 100 pages, reflecting the reality of modern construction documentation rather than curated academic examples.

3. Composition by Project Type

The dataset spans residential, commercial, multi-family, mixed-use, infrastructure, and specialty construction.

Category                    PDFs    Avg Pages    Est. Pages    Size (GB)
Residential                 4,924       114.8       565,275       110.66
Commercial                  1,359       132.9       180,679        95.06
Multi-Family Residential       78       982.5        76,639        18.83
Mixed-Use                      82       889.3        72,923        28.46
New Residence                 369        94.3        34,815        13.53
Remaining Categories          738           -        24,561         9.90

Residential and commercial projects account for about 78% of all pages. Multi-family and mixed-use projects are few in number but contribute the longest plan sets, averaging roughly 900 to 1,000 pages each.
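The totals and page shares can be re-derived from the table rows with a few lines of arithmetic. The sketch below transcribes the per-category counts from the table above. The names `ROWS`, `totals`, and `page_share` are illustrative, not part of any published tooling.

```python
# (category, pdf_count, estimated_pages, size_gb), transcribed from the table.
ROWS = [
    ("Residential", 4924, 565_275, 110.66),
    ("Commercial", 1359, 180_679, 95.06),
    ("Multi-Family Residential", 78, 76_639, 18.83),
    ("Mixed-Use", 82, 72_923, 28.46),
    ("New Residence", 369, 34_815, 13.53),
    ("Remaining Categories", 738, 24_561, 9.90),
]


def totals(rows):
    """Return (total PDFs, total estimated pages) across all rows."""
    return sum(r[1] for r in rows), sum(r[2] for r in rows)


def page_share(rows, names):
    """Fraction of all estimated pages contributed by the named categories."""
    _, total_pages = totals(rows)
    picked = sum(r[2] for r in rows if r[0] in names)
    return picked / total_pages


total_pdfs, total_pages = totals(ROWS)
share = page_share(ROWS, {"Residential", "Commercial"})
print(total_pdfs, total_pages, f"{share:.1%}")  # 7550 954892 78.1%
```

The row sums reproduce the 7,550 files and 954,892 pages reported in Section 2.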

4. Vector-Native Characteristics

A defining attribute of this dataset is its overwhelmingly vector-based format.

  • Vector-native PDFs: 7,399 (98.0%)
  • Raster/scanned PDFs: 151 (2.0%)

This enables access to geometry, topology, and precise text placement that is fundamentally unavailable in raster-only datasets.

Estimated Vector Elements

  • Total vector elements: 12.59 billion
  • Lines: 12.27 billion
  • Curves: 284 million
  • Rectangles: 39 million
  • Text characters: 1.14 billion

These figures are orders of magnitude beyond what is typically available in document AI benchmarks.
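For a sense of per-page density, the element counts above can be combined with the page estimate from Section 2. This is a rough sketch: it spreads the elements over all pages, including the roughly 2% that are raster, so the per-page figure for vector-native pages would be slightly higher. The variable names are illustrative.

```python
# Estimated element counts as reported above.
ELEMENTS = {
    "lines": 12_270_000_000,
    "curves": 284_000_000,
    "rectangles": 39_000_000,
}
TOTAL_PAGES = 954_892  # estimated page count from Section 2

total_elements = sum(ELEMENTS.values())

# Share of each element type among all vector elements.
shares = {kind: count / total_elements for kind, count in ELEMENTS.items()}

# Average number of vector elements per page, over all pages.
per_page = total_elements / TOTAL_PAGES

for kind, frac in shares.items():
    print(f"{kind:<11}{frac:6.1%}")
print(f"elements per page: {per_page:,.0f}")
```

Lines make up about 97% of all elements, and the average works out to roughly 13,000 vector elements per page.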

5. Page-Type Diversity

From a representative sample:

Page Type      Percentage
Plumbing       18.3%
Floor Plans     9.0%
Elevations      8.3%
Electrical      6.9%
Foundation      6.3%
HVAC            5.7%
Sections        5.2%
Structural      2.1%
Other          28.5%

This diversity forces models to reason across disciplines, conventions, and scales — a capability rarely tested in existing benchmarks.

6. Annotation Density and Real-World Complexity

Construction plans are not clean data. They are iterative, inconsistent, and context-heavy, which are exactly the conditions under which AI systems must operate in practice.

  • Average dimensions per page: 42.3
  • Pages with ≥50 dimensions: 18.0%
  • High-complexity pages: 71.1%

Annotation quality varies widely, mirroring real production environments rather than idealized academic samples.
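The source does not say how dimensions were counted. One plausible approach, sketched here as an assumption, is to match imperial feet-and-inches strings in each page's extracted text and then report the share of pages at or above a threshold. The pattern, the helper names, and the sample text are all hypothetical, and a production counter would also need to handle metric units and bare-inch values.

```python
import re

# Assumed dimension form: feet'-inches" with an optional fraction,
# e.g. 12'-6" or 9'-0 1/2". Metric and inch-only values are not covered.
DIMENSION_RE = re.compile(r"\d+'\s*-\s*\d+(?:\s+\d+/\d+)?" + '"')


def count_dimensions(page_text):
    """Count feet-and-inches dimension strings in one page's text."""
    return len(DIMENSION_RE.findall(page_text))


def dense_page_share(pages, threshold=50):
    """Fraction of pages with at least `threshold` dimension strings."""
    if not pages:
        raise ValueError("pages must not be empty")
    dense = sum(1 for text in pages if count_dimensions(text) >= threshold)
    return dense / len(pages)


# Toy example (illustrative text, not taken from the dataset).
sample = "BEDROOM 12'-6\" x 10'-0\" CLG HT 9'-0 1/2\" SEE 3/A-501"
n = count_dimensions(sample)
share = dense_page_share([sample, "GENERAL NOTES"], threshold=3)
```

With the default threshold of 50, `dense_page_share` corresponds to the "pages with ≥50 dimensions" statistic above, under this sketch's definition of a dimension.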

7. Architectural and Structural Signal

Even a small subset of the dataset reveals dense architectural signal:

  • Average per floor plan:
    • 47 walls
    • 16,020 vector lines
    • Multiple door, window, and room labels

This makes the dataset particularly valuable for tasks such as spatial reasoning, structural inference, and plan interpretation.

8. Why This Dataset Is Different

To our knowledge:

  • There is no publicly available dataset at this scale composed of real, vector-native construction plans
  • There is no benchmark corpus that reflects the size, density, and multi-discipline nature of professional construction documentation
  • There is no existing training set that allows models to meaningfully learn construction-specific document structure rather than approximating it from unrelated data

This dataset fills that gap.

9. Implications for Model Development

Early benchmarking suggests that general-purpose multimodal models often struggle with construction plans not due to reasoning deficiencies, but due to domain unfamiliarity and structural mismatch.

Conversely, systems exposed to construction-native data demonstrate materially different behavior in:

  • Page classification
  • Fine-grained text recall
  • Context preservation across large plan sets
  • Refusal accuracy when information is absent

These observations reinforce the idea that construction document intelligence is its own modality, not a subset of generic document understanding.
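Of the behaviors listed above, refusal accuracy is the least standardized, and the source does not define it. One reasonable reading, sketched below as an assumption, scores a response as correct when the model declines exactly in those cases where the requested information is absent from the plan set. `EvalItem` and `refusal_accuracy` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalItem:
    answer_in_plans: bool  # ground truth: the requested fact appears on some sheet
    model_refused: bool    # the model declined to answer


def refusal_accuracy(items):
    """Fraction of items where the refuse/answer decision matches ground truth.

    Correct means refusing when the fact is absent and answering when it is
    present. Whether a given answer is itself right is scored separately.
    """
    if not items:
        raise ValueError("items must not be empty")
    correct = sum(
        1 for it in items if it.model_refused == (not it.answer_in_plans)
    )
    return correct / len(items)


# Toy evaluation set (illustrative, not real benchmark results).
items = [
    EvalItem(answer_in_plans=True, model_refused=False),   # answered, correct
    EvalItem(answer_in_plans=False, model_refused=True),   # refused, correct
    EvalItem(answer_in_plans=False, model_refused=False),  # hallucinated
    EvalItem(answer_in_plans=True, model_refused=True),    # over-refused
]
score = refusal_accuracy(items)
```

Scoring both directions penalizes hallucinated answers and over-refusal alike, which matters in a setting where a missing dimension and an invented one are both costly.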

10. Research Access and Collaboration

Foreman AI is open to research partnerships and dataset licensing for qualified organizations interested in advancing construction-focused AI, subject to appropriate agreements.

Conclusion

Construction documents represent one of the largest untapped data modalities in applied AI. This dataset establishes a foundation for rigorous benchmarking, meaningful training, and credible evaluation in a domain that directly underpins the physical world.

We believe access to data of this nature is a prerequisite for advancing the next generation of AI systems capable of reasoning about the built environment.
