Jennifer Li’s 2026 infrastructure call names the bottleneck precisely: not the model, but the document layer the model is asked to read. Here is what that means for any organization with a PDF library in a document database and teams that still cannot query it in plain language, and how LazyFox addresses it.
“The limiting factor for AI companies is now data entropy: the steady decay of freshness, structure, and truth inside the unstructured universe where 80% of corporate knowledge now lives.”
Jennifer Li, General Partner, Andreessen Horowitz · a16z Big Ideas 2026
Li’s argument is not about model quality. It is about what the model is asked to read, and why that layer is harder to fix than any model upgrade.
Jennifer Li’s contribution to the a16z Big Ideas 2026 newsletter is brief. But the argument is precise, and it names something that most enterprise AI buyers already recognize but cannot quite articulate: the problem is not the model. The problem is what the model is asked to read.
Li describes the situation in terms of “data entropy”: the continuous decay of freshness, structure, and verifiable truth inside the unstructured data universe that now holds 80% of corporate knowledge. PDFs, emails, screenshots, logs, semi-structured exports. Every enterprise has built up a library of this material over years, and none of it was designed to be machine-readable at inference time. Models improve. The document layer stays exactly as disordered as it was before.
The symptoms Li names are specific: RAG systems that hallucinate, agents that break in subtle and expensive ways, and critical workflows that still depend on human QA to catch errors before they reach a report or a decision. These are not model failures. They are data governance failures, visible as model failures.
The implication Li draws is a market-scale infrastructure opportunity. Enterprises need a continuous way to clean, structure, validate, and govern their multimodal data: not a one-time migration, but a persistent layer that keeps the document library fresh and queryable. The use cases span every major enterprise vertical: contract analysis, claims handling, compliance, procurement, engineering search, analytics pipelines. The companies that build the governance infrastructure for this layer, Li argues, “hold the key to the kingdom of enterprise knowledge and process.”
What follows is an analysis of the three arguments Li makes: the failure mode, the root cause, and the market call, and what each one demands from any organization serious about making its document library available to natural language queries.
The performance ceiling for enterprise AI is not the model. It is the document layer the model reads, and that layer was never built for machine inference.
Li’s argument opens with a paradox. AI models have improved faster than almost any technology in recent memory. Context windows expanded. Reasoning improved. Retrieval got faster. And yet RAG systems still hallucinate, agents still break in ways that are hard to detect, and finance teams still manually verify numbers before they go into a board report. If the models improved, why did the outputs not follow?
The answer Li gives is that the inputs did not improve. Every company has spent years accumulating documents in formats built for human readers: PDFs formatted for printing, email threads referencing context from three months prior, screenshots of reports that no longer exist in any structured form. The models are now capable enough to process this material in theory. In practice, the material arrives at inference time in a state that makes accurate processing nearly impossible without some prior layer of governance to establish what any of it means.
This gap carries a measurable operational cost. When a finance team routes a procurement query through an AI agent and the agent reads a PDF from a document database, what it reads depends entirely on how that PDF was ingested. If the ingestion was naive, involving raw text extraction with no structural awareness and no definition of what a given term means in this organization, the model receives context that is accurate in its raw form but misleading in its implied meaning. The agent produces a number. The number looks reasonable. The number is wrong.
“Models keep getting smarter but the inputs keep getting messier, which causes RAG systems to hallucinate, agents to break in subtle, expensive ways, and critical workflows to still heavily rely on human QA.”
Jennifer Li, General Partner, Andreessen Horowitz · a16z Big Ideas 2026
Li calls this kind of failure “subtle, expensive.” Subtle because the output is not obviously broken. Expensive because by the time the error surfaces, it has already informed a decision. The detection cost is high. The correction cost is higher. And the organizational response, almost universally, is to add a human QA step, which means the AI agent neither reduced headcount nor increased speed. It changed only where in the pipeline a person intervenes.
For any organization with a large PDF library stored in a document database, this means model quality is not the governing variable. The governing variable is whether the document ingestion layer knows what it is reading. A PDF ingested without semantic context is noise with formatting. A PDF ingested through a governed structural layer, where the system knows how this document type maps to business concepts, is a queryable asset. It is the same document, with completely different utility.
What separates a PDF that feeds a hallucinated figure into a report from one that powers accurate natural language queries is not the model. It is the ingestion layer the PDF passed through.
Data entropy compounds over time. Every document added without governance is another point of failure at inference time.
Li’s diagnosis identifies data entropy as the structural root cause. The term is precise. Entropy, in the thermodynamic sense, describes the tendency of systems toward disorder without continuous energy input to maintain structure. Li applies this to enterprise data: the steady decay of freshness, structure, and truth inside unstructured data stores as documents accumulate, definitions shift, and pipelines age without active governance to hold them together.
The 80% figure is significant. It means that most of what an enterprise knows, covering contracts, customers, compliance obligations, and operational history, lives in a format that current AI infrastructure was not built to govern. The 20% that lives in structured databases benefits from decades of tooling: schemas, constraints, defined relationships, query languages, semantic layers. The 80% that lives in documents has none of this. It sits in object storage or document databases, ingested at some point in time, with whatever structure the original author chose to apply or not apply.
Entropy compounds over time. A contract PDF ingested in 2022 carries assumptions about product definitions, pricing structures, and jurisdiction-specific terms that may no longer apply in 2026. An email thread ingested six months ago references a policy that was updated in January. A compliance document ingested last quarter uses a term that the legal team has since redefined. None of these changes propagate automatically to the ingested representation. The document layer drifts away from organizational reality at a rate that accelerates as the organization changes.
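The drift described above can be made concrete with a small staleness check. This is a minimal sketch, not a description of any product's implementation: it assumes each ingested document records its ingestion date and the governed terms it relies on, and that the organization keeps a record of when each term was last redefined. The names `IngestedDoc`, `find_stale_docs`, and `term_redefined_on`, along with the sample data, are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet


@dataclass(frozen=True)
class IngestedDoc:
    doc_id: str
    ingested_on: date
    terms: FrozenSet[str]  # governed terms the document relies on (assumed known)


def find_stale_docs(docs, term_redefined_on):
    """Return {doc_id: [terms]} for documents ingested before a term they use was last redefined."""
    stale = {}
    for doc in docs:
        outdated = sorted(
            term
            for term in doc.terms
            if term in term_redefined_on and doc.ingested_on < term_redefined_on[term]
        )
        if outdated:
            stale[doc.doc_id] = outdated
    return stale


# Hypothetical sample data: one old contract, one recent memo.
docs = [
    IngestedDoc("contract-2022", date(2022, 3, 1), frozenset({"list_price", "jurisdiction"})),
    IngestedDoc("policy-memo", date(2026, 2, 10), frozenset({"travel_policy"})),
]
term_redefined_on = {
    "list_price": date(2025, 7, 1),
    "travel_policy": date(2026, 1, 15),
}

stale = find_stale_docs(docs, term_redefined_on)
print(stale)  # {'contract-2022': ['list_price']}
```

The 2022 contract is flagged because "list_price" was redefined after it was ingested. The memo is not flagged, since it was ingested after the policy change. A check like this only surfaces candidates for review. It does not decide whether the older document is actually wrong.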
“Enterprises need a continuous way to clean, structure, validate and govern their multimodal data so downstream AI workloads actually work.”
Jennifer Li, General Partner, Andreessen Horowitz · a16z Big Ideas 2026
For organizations whose document library sits in a database like MongoDB, this entropic drift has a specific and familiar shape. A PDF report generated by one system uses “revenue” to mean gross revenue. A PDF generated by another system uses “revenue” to mean net revenue after returns. Both documents are ingested. Both appear in retrieval. A natural language query about revenue pulls from both. The model has no way to know which definition applies. The result is a number that is internally consistent but analytically meaningless.
This is not a retrieval problem. Retrieval can find both documents. It is a semantic governance problem: the ingestion layer did not capture what these documents mean in the context of this organization, and so the model cannot adjudicate between conflicting representations. The absence of a governance layer at ingestion is what makes data entropy structurally irreversible without active intervention. The longer an organization waits, the more documents accumulate with unresolved conflicts, and the more expensive the cleanup becomes relative to building governance in from the start.
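The revenue example can be sketched as a conflict check run at ingestion. This is an illustrative sketch under stated assumptions, not LazyFox's actual method: it assumes an earlier step has already extracted, for each document, the terms it uses and the definition it attaches to each one. The function name `find_definition_conflicts` and the sample records are hypothetical.

```python
from collections import defaultdict


def find_definition_conflicts(declared):
    """Group (doc_id, term, definition) records by term.

    Returns only the terms that carry more than one distinct definition,
    mapped to {definition: [doc_ids that use it]}.
    """
    by_term = defaultdict(lambda: defaultdict(list))
    for doc_id, term, definition in declared:
        by_term[term][definition].append(doc_id)
    return {
        term: dict(definitions)
        for term, definitions in by_term.items()
        if len(definitions) > 1
    }


# Hypothetical extraction output for two PDFs.
declared = [
    ("sales-report.pdf", "revenue", "gross revenue"),
    ("finance-summary.pdf", "revenue", "net revenue after returns"),
    ("finance-summary.pdf", "headcount", "full-time employees"),
]

conflicts = find_definition_conflicts(declared)
for term, definitions in conflicts.items():
    print(term, "->", definitions)
```

Here only "revenue" is reported, because two documents attach different meanings to it. Retrieval alone would return both documents without complaint. The point of the check is that the conflict is visible before any query runs.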
The same concept defined differently across four document sources. LazyFox’s logical layer detects the conflict at ingestion and serves a single governed definition to every natural language query.
Li does not say AI will resolve the document problem organically. She says the companies that build continuous document governance infrastructure will hold the position that matters.
Li’s market conclusion is unambiguous. The opportunity belongs to companies that build the continuous infrastructure layer for multimodal data governance: not companies that build better retrieval, or larger context windows, or faster vector search. Those problems are largely solved at the model layer. The remaining problem is the one Li names: keeping the document layer clean, structured, governed, and retrievable over time.
The use cases she lists are not niche verticals. Contract analysis, onboarding flows, claims handling, compliance, support, procurement, engineering search, sales enablement, analytics pipelines. These are the operational backbone of every mid-market and enterprise organization. Every one of them requires reading documents. Every one of them breaks when the document layer is unstructured or stale. Li’s list describes the places where enterprise AI agents currently depend on human QA that should not be necessary.
The architectural implication is that the governance layer cannot be a one-time ETL job. A one-time job runs once. Entropy runs continuously. The governance layer Li describes must clean as new documents arrive, validate as definitions change, reconcile as multiple sources produce conflicting versions of the same concept, and repair pipelines as upstream systems drift. It must treat data freshness not as a project with a completion date but as a property of the infrastructure itself.
“Startups that build the platform that extracts structure from documents, images, and videos; reconciles conflicts; repairs pipelines; or keeps data fresh and retrievable hold the key to the kingdom of enterprise knowledge and process.”
Jennifer Li, General Partner, Andreessen Horowitz · a16z Big Ideas 2026
For organizations with large document repositories in a database like MongoDB, this represents a specific architectural requirement: the system that ingests PDFs must do more than extract text. It must understand what type of document this is, what business concepts it contains, how those concepts map to the organization’s governed definitions, and what to do when a newer document changes the definition that an older document established. Text extraction is a solved problem. Semantic ingestion with drift detection is not.
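One way to picture the last requirement, a newer document changing a definition that an older one established, is a registry that records drift as documents are ingested. This is a minimal sketch, assuming concept extraction has already happened upstream and that "newer supersedes older" is the chosen policy. A real system might route every change to human review instead. `GovernedDefinition`, `Registry`, and the sample documents are hypothetical names for illustration.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class GovernedDefinition:
    meaning: str
    source_doc: str
    effective: date


@dataclass
class Registry:
    definitions: dict = field(default_factory=dict)  # term -> GovernedDefinition
    drift_log: list = field(default_factory=list)    # (term, old_doc, new_doc, superseded)

    def ingest(self, doc_id, doc_date, concepts):
        """Register a document's concepts: {term: meaning}, assumed already extracted."""
        for term, meaning in concepts.items():
            current = self.definitions.get(term)
            if current is None:
                # First time the term is seen: it becomes the governed definition.
                self.definitions[term] = GovernedDefinition(meaning, doc_id, doc_date)
            elif meaning != current.meaning:
                # Same term, different meaning: log the drift either way.
                superseded = doc_date > current.effective
                self.drift_log.append((term, current.source_doc, doc_id, superseded))
                if superseded:
                    # Assumed policy: only a newer document replaces the definition.
                    self.definitions[term] = GovernedDefinition(meaning, doc_id, doc_date)


registry = Registry()
registry.ingest("pricing-2022.pdf", date(2022, 5, 1),
                {"active_customer": "purchased in the last 12 months"})
registry.ingest("pricing-2026.pdf", date(2026, 1, 20),
                {"active_customer": "purchased in the last 6 months"})

print(registry.definitions["active_customer"].meaning)
print(registry.drift_log)
```

After both documents are ingested, the governed meaning of "active_customer" comes from the 2026 document, and the drift log records that the 2022 definition was superseded. The log is the useful part: it gives reviewers a list of every place where the document library disagreed with itself.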
Li frames this as a generational opportunity. The companies that build this infrastructure will hold the governance layer sitting above every enterprise document store. That position, between the document library and the natural language query layer, is where enterprise knowledge becomes queryable, and it compounds in value as more documents are ingested, more definitions are governed, and more agent workflows build on top of it. The model is replaceable. The governed document layer is not.
The organizations that treat document governance as a continuous infrastructure investment rather than a one-time data cleanup project will be the ones whose AI agents produce answers their teams trust without a human QA step. That is the operational difference Li’s analysis points to, and it is the difference that is hardest to close retroactively once entropy has accumulated at scale.
“Startups that build the platform that extracts structure from documents, images, and videos; reconciles conflicts; repairs pipelines; or keeps data fresh and retrievable hold the key to the kingdom of enterprise knowledge and process.”
Jennifer Li, General Partner, Andreessen Horowitz · a16z Big Ideas 2026, December 2025
LazyFox ingests documents from MongoDB, governs the semantic layer, and makes every PDF available for natural language reporting without re-processing on each query.