a16z Research · December 2025

Every Company Is Drowning
In Its Own Documents.

Jennifer Li’s 2026 infrastructure call names the bottleneck precisely: not the model, but the document layer the model is asked to read. Here is what that means for any organization with a PDF library in a document database and teams that still cannot query it in plain language, and how LazyFox addresses it.

Alexander Braun · December 2025 · 11 min read
“The limiting factor for AI companies is now data entropy: the steady decay of freshness, structure, and truth inside the unstructured universe where 80% of corporate knowledge now lives.”
Jennifer Li, General Partner, Andreessen Horowitz · a16z Big Ideas 2026
80%
of corporate knowledge now lives in unstructured form: PDFs, logs, emails, screenshots. Li’s diagnosis is that the bottleneck is not model capability. It is the entropy accumulating inside the document layer that every enterprise has been building for years without any governance to make it readable at inference time.
Jennifer Li, a16z · Dec 2025
9+
enterprise use cases Li names as blocked by unstructured data: contract analysis, onboarding, claims handling, compliance, support, procurement, engineering search, sales enablement, and analytics pipelines. Every one requires reading documents. Every one fails when the document layer carries no governance.
Jennifer Li, a16z · Dec 2025
167%
Net Revenue Retention from a mid-market finance organization that made its MongoDB-hosted document library queryable through governed Natural Language reporting. When PDFs stop being retrieval problems and become governed semantic assets, the business case for the infrastructure layer compounds.
LazyFox customer · 2026
1×
The number of times LazyFox processes each document. Business context is indexed once at ingestion. Every Natural Language query runs from the governed representation, not from re-sending raw PDFs to a model on each call. No per-query token cost on documents already in the semantic layer.
LazyFox Architecture
Executive Summary
Key Finding
RAG systems hallucinate and agents break in subtle, expensive ways because model inputs arrive at inference time in a state that was never designed for machine reading. The models improved. The document layer did not.
Root Cause
Data entropy: the continuous decay of freshness, structure, and truth inside unstructured data stores. 80% of corporate knowledge lives in documents that accumulate, drift, and conflict without any governance layer to keep their meaning current and queryable.
Market Recommendation
Enterprises need a continuous, persistent governance layer, not a one-time ETL job, that cleans, structures, validates, and reconciles multimodal data as it arrives. Li argues the companies that build this infrastructure will own enterprise knowledge and process.
How LazyFox Delivers on This
Structural Ingestion from Document Databases
LazyFox’s structural layer maps schemas across all connected systems, including MongoDB and other document stores. When PDFs arrive at ingestion, LazyFox extracts governed representations of their content and maps them to the organization’s semantic layer. No data migration required. No raw text extraction handed to a model at query time.
Semantic Drift Detection Across Document Sources
Li’s data entropy manifests in enterprise systems as definitions diverging across documents and data sources over time. LazyFox’s logical layer flags automatically when a concept defined in one document or system conflicts with how it appears in another, preventing stale or conflicting context from reaching Natural Language query results.
Natural Language Reporting Without Per-Query Re-processing
The gap Li identifies is between model capability and data readiness. LazyFox closes it at the ingestion layer: documents are processed once, and every Natural Language query runs against the governed semantic representation. Reporting accuracy comes from the governance layer, not from model inference on raw documents each time a question is asked.
Token Efficiency & Vendor Independence
Business context is indexed once. Every subsequent agent query runs from governed code: no re-tokenization per call, no organizational knowledge migrating into a model provider’s weights. The semantic governance layer is a company asset. Swapping model providers does not reset institutional memory. The model is interchangeable; the governed document layer is not.
The Article

Data Entropy Is the Real AI Bottleneck

Li’s argument is not about model quality. It is about what the model is asked to read, and why that layer is harder to fix than any model upgrade.

Jennifer Li’s contribution to the a16z Big Ideas 2026 newsletter is brief. But the argument is precise, and it names something that most enterprise AI buyers already recognize but cannot quite articulate: the problem is not the model. The problem is what the model is asked to read.

Li describes the situation in terms of “data entropy”: the continuous decay of freshness, structure, and verifiable truth inside the unstructured data universe that now holds 80% of corporate knowledge. PDFs, emails, screenshots, logs, semi-structured exports. Every enterprise has built up a library of this material over years, and none of it was designed to be machine-readable at inference time. Models improve. The document layer stays exactly as disordered as it was before.

The symptoms Li names are specific: RAG systems that hallucinate, agents that break in subtle and expensive ways, and critical workflows that still depend on human QA to catch errors before they reach a report or a decision. These are not model failures. They are data governance failures, visible as model failures.

The implication Li draws is a market-scale infrastructure opportunity. Enterprises need a continuous way to clean, structure, validate, and govern their multimodal data: not a one-time migration, but a persistent layer that keeps the document library fresh and queryable. The use cases span every major enterprise vertical: contract analysis, claims handling, compliance, procurement, engineering search, analytics pipelines. The companies that build the governance infrastructure for this layer, Li argues, “hold the key to the kingdom of enterprise knowledge and process.”

What follows is an analysis of the three arguments Li makes: the failure mode, the root cause, and the market call, and what each one demands from any organization serious about making its document library available to natural language queries.

Finding 1

Models Got Smarter. The Document Layer Did Not.

The performance ceiling for enterprise AI is not the model. It is the document layer the model reads, and that layer was never built for machine inference.

Li’s argument opens with a paradox. AI models have improved faster than almost any technology in recent memory. Context windows expanded. Reasoning improved. Retrieval got faster. And yet RAG systems still hallucinate, agents still break in ways that are hard to detect, and finance teams still manually verify numbers before they go into a board report. If the models improved, why did the outputs not follow?

The answer Li gives is that the inputs did not improve. Every company has spent years accumulating documents in formats built for human readers: PDFs formatted for printing, email threads referencing context from three months prior, screenshots of reports that no longer exist in any structured form. The models are now capable enough to process this material in theory. In practice, the material arrives at inference time in a state that makes accurate processing nearly impossible without some prior layer of governance to establish what any of it means.

This gap carries a measurable operational cost. When a finance team routes a procurement query through an AI agent and the agent reads a PDF from a document database, what it reads depends entirely on how that PDF was ingested. If the ingestion was naive, involving raw text extraction with no structural awareness and no definition of what a given term means in this organization, the model receives context that is accurate in its raw form but misleading in its implied meaning. The agent produces a number. The number looks reasonable. The number is wrong.

“Models keep getting smarter but the inputs keep getting messier, which causes RAG systems to hallucinate, agents to break in subtle, expensive ways, and critical workflows to still heavily rely on human QA.”

Jennifer Li, General Partner, Andreessen Horowitz · a16z Big Ideas 2026

Li calls this “subtle, expensive” failure. Subtle because the output is not obviously broken. Expensive because by the time the error surfaces, it has already informed a decision. The detection cost is high. The correction cost is higher. And the organizational response, almost universally, is to add a human QA step, which means the AI agent neither reduced headcount nor improved speed. It changed only where in the pipeline a person intervenes.

For any organization with a large PDF library stored in a document database, this means model quality is not the governing variable. The governing variable is whether the document ingestion layer knows what it is reading. A PDF ingested without semantic context is noise with formatting. A PDF ingested through a governed structural layer, where the system knows how this document type maps to business concepts, is a queryable asset. The document is the same. Its usefulness is not.

What changes between a PDF that hallucinates into a report and one that powers accurate Natural Language queries is not the model. It is the ingestion layer the PDF passed through.

Without Document Governance
PDF in MongoDB
Invoice, contract, compliance report, analytics export
Raw text extraction
No structural awareness. No definition of what any term means in this organization. Formatting stripped, context lost.
No semantic layer
Model reads raw context at query time
Per-query re-processing. “Revenue” may mean gross or net depending on which PDF surfaces first.
Hallucinated or conflicting answer
Looks reasonable. May be wrong. Impossible to verify without reading the source documents manually.
Human QA required before the number can be trusted in a report or decision.
With LazyFox Semantic Governance
PDF in MongoDB
Invoice, contract, compliance report, analytics export
LazyFox structural layer at ingestion
Document type recognized. Business concepts mapped to governed definitions. Context indexed once, stored as governed code.
Indexed once
Natural Language query runs from governed representation
No re-processing. No raw PDF sent to model. Every query hits the same governed semantic layer.
Accurate, auditable answer
Governed definition applied. Conflicting sources flagged before they reach the query layer.
No human QA dependency. The governance layer carries what manual review was doing.
The failure mode is not dramatic. The model produces an answer. The answer looks right. It is built on unverified, ungoverned context from whichever documents happened to surface first in retrieval.
The governed path processes each document once. From that point forward, every Natural Language query runs against an already-understood representation of the document's content and what its terms mean in this organization.
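The "index once, query many" pattern in the governed path can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not LazyFox's implementation: the `GovernedIndex` class, its field names, and the content-hash key are all hypothetical, and the "representation" is a stand-in for whatever the real ingestion step would produce.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class GovernedIndex:
    """Hypothetical store holding one governed representation per unique document."""
    representations: Dict[str, dict] = field(default_factory=dict)
    ingest_calls: int = 0  # counts how often the expensive step actually ran

    def ingest(self, raw_bytes: bytes, doc_type: str) -> str:
        # Key by content hash so re-submitting identical bytes is a no-op.
        key = hashlib.sha256(raw_bytes).hexdigest()
        if key not in self.representations:
            self.ingest_calls += 1
            # Placeholder for the expensive extraction and mapping step.
            self.representations[key] = {"doc_type": doc_type, "size": len(raw_bytes)}
        return key

    def query(self, doc_type: str) -> List[dict]:
        # Queries read stored representations only; raw bytes are never touched here.
        return [r for r in self.representations.values() if r["doc_type"] == doc_type]


index = GovernedIndex()
pdf = b"%PDF-1.7 hypothetical invoice bytes"
index.ingest(pdf, "invoice")
index.ingest(pdf, "invoice")   # same bytes: skipped
for _ in range(100):
    index.query("invoice")     # 100 queries, no re-ingestion
print(index.ingest_calls)      # prints 1
```

The point of the sketch is the cost shape: ingestion work scales with the number of distinct documents, while query work never revisits the raw file.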
Root Cause

80% of What Enterprises Know Lives Where AI Goes Wrong

Data entropy compounds over time. Every document added without governance is another point of failure at inference time.

Li’s diagnosis identifies data entropy as the structural root cause. The term is precise. Entropy, in the thermodynamic sense, describes the tendency of systems toward disorder without continuous energy input to maintain structure. Li applies this to enterprise data: the steady decay of freshness, structure, and truth inside unstructured data stores as documents accumulate, definitions shift, and pipelines age without active governance to hold them together.

The 80% figure is significant. It means that most of what an enterprise knows, covering contracts, customers, compliance obligations, and operational history, lives in a format that current AI infrastructure was not built to govern. The 20% that lives in structured databases benefits from decades of tooling: schemas, constraints, defined relationships, query languages, semantic layers. The 80% that lives in documents has none of this. It sits in object storage or document databases, ingested at some point in time, with whatever structure the original author chose to apply or not apply.

Entropy compounds over time. A contract PDF ingested in 2022 carries assumptions about product definitions, pricing structures, and jurisdiction-specific terms that may no longer apply in 2026. An email thread ingested six months ago references a policy that was updated in January. A compliance document ingested last quarter uses a term that the legal team has since redefined. None of these changes propagate automatically to the ingested representation. The document layer drifts away from organizational reality at a rate that accelerates as the organization changes.

“Enterprises need a continuous way to clean, structure, validate and govern their multimodal data so downstream AI workloads actually work.”

Jennifer Li, General Partner, Andreessen Horowitz · a16z Big Ideas 2026

For organizations whose document library sits in a database like MongoDB, this entropic drift has a specific and familiar shape. A PDF report generated by one system uses “revenue” to mean gross revenue. A PDF generated by another system uses “revenue” to mean net revenue after returns. Both documents are ingested. Both appear in retrieval. A Natural Language query about revenue pulls from both. The model has no way to know which definition applies. The result is a number that is internally consistent but analytically meaningless.

This is not a retrieval problem. Retrieval can find both documents. It is a semantic governance problem: the ingestion layer did not capture what these documents mean in the context of this organization, and so the model cannot adjudicate between conflicting representations. The absence of a governance layer at ingestion is what makes data entropy structurally irreversible without active intervention. The longer an organization waits, the more documents accumulate with unresolved conflicts, and the more expensive the cleanup becomes relative to building governance in from the start.
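To make the distinction concrete, here is a rough Python sketch of conflict detection at ingestion. It assumes an upstream step has already labeled what each term means in each document, which is the hard part and is not shown. The `TermUsage` type, the file names, and the meaning labels are illustrative assumptions, not part of any real product API.

```python
from collections import defaultdict
from typing import NamedTuple


class TermUsage(NamedTuple):
    source: str    # document the usage came from
    term: str      # business term as written in the document
    meaning: str   # normalized label for what the term means there (assumed given)


def find_conflicts(usages):
    """Return {term: {meaning: [sources]}} for terms that carry more than one meaning."""
    by_term = defaultdict(lambda: defaultdict(list))
    for u in usages:
        by_term[u.term.lower()][u.meaning].append(u.source)
    return {t: dict(m) for t, m in by_term.items() if len(m) > 1}


usages = [
    TermUsage("report_a.pdf", "Revenue", "gross"),
    TermUsage("report_b.pdf", "revenue", "net_after_returns"),
    TermUsage("report_a.pdf", "Customer", "billing_account"),
]
conflicts = find_conflicts(usages)
print(conflicts)
# {'revenue': {'gross': ['report_a.pdf'], 'net_after_returns': ['report_b.pdf']}}
```

Retrieval alone would return both reports. The check above is what lets a system say, before any query runs, that the two reports disagree about what "revenue" means.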

The same concept defined differently across four document sources. LazyFox’s logical layer detects the conflict at ingestion and serves a single governed definition to every Natural Language query.

Invoice PDF (2022)
“Revenue”
Gross revenue, pre-returns, inclusive of deferred billing from multi-year contracts
Conflict detected
ERP Export PDF (2024)
“Revenue”
Net revenue after customer returns and chargebacks, fiscal year basis ending December
Conflict detected
Compliance Report PDF (Q1 2025)
“Revenue”
Recognized revenue per IFRS 15, excluding pipeline and conditional order commitments
Conflict detected
Analytics Export PDF (Jun 2025)
“Revenue”
ARR-normalized monthly revenue, source: fct_revenue post-migration, fiscal Q2 close
LazyFox Logical Layer
Semantic Drift Detection & Reconciliation
Conflicts flagged at ingestion
Authoritative source identified
Governed definition versioned in code
Governed Output · Every Natural Language Query
Revenue = ARR, fiscal Q ends Mar 31, source: fct_revenue post-migration
Definition locked in the logical layer. Applied consistently across every agent query and Natural Language report regardless of which documents were retrieved. Conflicting PDFs from earlier periods flagged as non-authoritative.
Indexed once at ingestion · Zero re-processing per query · Conflict history retained for audit
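The reconciliation step in the figure (flag conflicts, identify an authoritative source, version the governed definition) might look roughly like the sketch below. The `authority` rank, the tie-break on recency, the short definition labels, and the exact dates are assumptions made for illustration. The figure gives only the periods of the four documents and does not say how the authoritative source is chosen.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Candidate:
    source: str
    definition: str
    as_of: date
    authority: int  # hypothetical rank set by the organization; higher wins


@dataclass
class GovernedTerm:
    term: str
    versions: List[Candidate]  # append-only history, newest last

    @property
    def current(self) -> Candidate:
        return self.versions[-1]

    def supersede(self, new: Candidate) -> None:
        # Earlier versions stay in the list so the conflict history can be audited.
        self.versions.append(new)


def reconcile(term: str, candidates: List[Candidate]) -> tuple:
    """Pick one authoritative candidate; return (governed term, flagged sources)."""
    # Highest authority wins; the more recent document breaks ties.
    winner = max(candidates, key=lambda c: (c.authority, c.as_of))
    flagged = [c.source for c in candidates if c is not winner]
    return GovernedTerm(term, [winner]), flagged


# Placeholder dates and ranks; only the document periods come from the figure.
candidates = [
    Candidate("Invoice PDF", "gross", date(2022, 1, 1), authority=1),
    Candidate("ERP Export PDF", "net", date(2024, 1, 1), authority=1),
    Candidate("Compliance Report PDF", "ifrs15_recognized", date(2025, 3, 31), authority=2),
    Candidate("Analytics Export PDF", "arr_normalized", date(2025, 6, 30), authority=2),
]
governed, flagged = reconcile("revenue", candidates)
print(governed.current.source)  # Analytics Export PDF
print(flagged)                  # ['Invoice PDF', 'ERP Export PDF', 'Compliance Report PDF']
```

Under these assumed ranks the analytics export wins, which matches the governed output in the figure, and the three earlier documents are flagged as non-authoritative rather than deleted.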
Market Call

The Governance Layer Owns the Enterprise

Li does not say AI will resolve the document problem organically. She says the companies that build continuous document governance infrastructure will hold the position that matters.

Li’s market conclusion is unambiguous. The opportunity belongs to companies that build the continuous infrastructure layer for multimodal data governance: not companies that build better retrieval, or larger context windows, or faster vector search. Those problems are largely solved at the model layer. The remaining problem is the one Li names: keeping the document layer clean, structured, governed, and retrievable over time.

The use cases she lists are not niche verticals. Contract analysis, onboarding flows, claims handling, compliance, support, procurement, engineering search, sales enablement, analytics pipelines. These are the operational backbone of every mid-market and enterprise organization. Every one of them requires reading documents. Every one of them breaks when the document layer is unstructured or stale. Li’s list maps where enterprise AI agents still depend on a human QA step to catch errors that a governed document layer should have prevented.

The architectural implication is that the governance layer cannot be a one-time ETL job. ETL jobs run once. Entropy runs continuously. The governance layer Li describes must clean as new documents arrive, validate as definitions change, reconcile as multiple sources produce conflicting versions of the same concept, and repair pipelines as upstream systems drift. It must treat data freshness not as a project with a completion date but as a property of the infrastructure itself.
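The difference between a one-time job and a continuous layer can be sketched as an event-driven structure. The following Python is a simplified illustration under assumptions: the `GovernanceLayer` class, its method names, and the integer version scheme are hypothetical. It covers only two of the duties described above, binding documents to definitions as they arrive and surfacing staleness when a definition changes.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class GovernanceLayer:
    """Hypothetical continuous layer: reacts to events instead of running once."""
    definitions: Dict[str, int] = field(default_factory=dict)               # term -> current version
    doc_versions: Dict[str, Dict[str, int]] = field(default_factory=dict)   # doc -> {term: version seen}

    def on_document_arrived(self, doc_id: str, terms: Set[str]) -> None:
        # Bind each term in the new document to the definition version current at ingestion.
        self.doc_versions[doc_id] = {t: self.definitions.setdefault(t, 1) for t in terms}

    def on_definition_changed(self, term: str) -> None:
        # A redefinition bumps the version; documents bound to older versions become stale.
        self.definitions[term] = self.definitions.get(term, 0) + 1

    def stale_documents(self) -> Dict[str, Set[str]]:
        # Report, per document, the terms whose definition has moved on since ingestion.
        stale = {}
        for doc_id, seen in self.doc_versions.items():
            drifted = {t for t, v in seen.items() if v < self.definitions[t]}
            if drifted:
                stale[doc_id] = drifted
        return stale


layer = GovernanceLayer()
layer.on_document_arrived("contract_2022.pdf", {"revenue", "jurisdiction"})
layer.on_definition_changed("revenue")   # the term is redefined after the first ingestion
layer.on_document_arrived("contract_2026.pdf", {"revenue"})
print(layer.stale_documents())           # {'contract_2022.pdf': {'revenue'}}
```

A batch ETL run would have processed the 2022 contract once and never noticed the later redefinition. Here the staleness is a queryable property of the layer at any moment.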

“Startups that build the platform that extracts structure from documents, images, and videos; reconciles conflicts; repairs pipelines; or keeps data fresh and retrievable hold the key to the kingdom of enterprise knowledge and process.”

Jennifer Li, General Partner, Andreessen Horowitz · a16z Big Ideas 2026

For organizations with large document repositories in a database like MongoDB, this represents a specific architectural requirement: the system that ingests PDFs must do more than extract text. It must understand what type of document this is, what business concepts it contains, how those concepts map to the organization’s governed definitions, and what to do when a newer document changes the definition that an older document established. Text extraction is a solved problem. Semantic ingestion with drift detection is not.
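As a rough illustration of what "more than extract text" could mean, the sketch below attaches governed concept identifiers to fields that a text extractor has already produced, keyed by document type. Everything named here is an assumption for the example: the `CONCEPT_MAP` table, the concept identifiers, and the `GovernedField` type. Document classification and extraction themselves are taken as given.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical mapping from (document type, local label) to a governed concept id.
CONCEPT_MAP: Dict[Tuple[str, str], str] = {
    ("invoice", "revenue"): "revenue.gross",
    ("erp_export", "revenue"): "revenue.net",
    ("invoice", "total due"): "receivable.amount",
}


@dataclass
class GovernedField:
    label: str
    value: str
    concept: Optional[str] = None  # set when the label maps to a governed concept


def ingest(doc_type: str, extracted: Dict[str, str]) -> List[GovernedField]:
    """Attach governed concept ids to already-extracted label/value pairs."""
    fields = []
    for label, value in extracted.items():
        concept = CONCEPT_MAP.get((doc_type, label.strip().lower()))
        fields.append(GovernedField(label, value, concept))
    return fields


fields = ingest("invoice", {"Revenue": "1,200", "PO Number": "A-17"})
unmapped = [f.label for f in fields if f.concept is None]
print(fields[0].concept)  # revenue.gross
print(unmapped)           # ['PO Number']
```

The same label resolves to a different concept depending on document type, and labels with no governed mapping are surfaced instead of being passed to a model silently.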

Li frames this as a generational opportunity. The companies that build this infrastructure will hold the governance layer sitting above every enterprise document store. That position, between the document library and the Natural Language query layer, is where enterprise knowledge becomes queryable, and it compounds in value as more documents are ingested, more definitions are governed, and more agent workflows build on top of it. The model is replaceable. The governed document layer is not.

The organizations that treat document governance as a continuous infrastructure investment rather than a one-time data cleanup project will be the ones whose AI agents produce answers their teams trust without a human QA step. That is the operational difference Li’s analysis points to, and it is the difference that is hardest to close retroactively once entropy has accumulated at scale.


Your PDFs Are an Asset. Make Them Queryable.

LazyFox ingests documents from MongoDB, governs the semantic layer, and makes every PDF available for natural language reporting without re-processing on each query.