The Karpathy Series · Part 2 of 3

Three Layers of AI Token Waste in the Enterprise

A taxonomy for CFOs who can’t tell which fix to buy, and why buying the wrong one means paying twice without solving the original problem.

By Alexander Braun · April 2026 · 6 min read
  • $300M: Salesforce’s entire 2026 Anthropic spend
  • 5–7% of every knowledge worker’s salary (Rory O’Driscoll, 20VC)
  • 71% of companies exceeded their AI budget in 2025
  • 60–80% of enterprise queries never need to hit the model at all
  • 36% annual growth in AI token costs (CloudZero 2026)
  • 10 margin points: what AI governance is worth by 2029 (Gartner)

In Part 1, we established that Karpathy’s wiki architecture solves a content-layer problem, not the semantic-layer problem most enterprises are actually paying for. Here we lay out all three layers and what each requires.

Every CFO has noticed the AI bill. Not every CFO has noticed that the AI bill is actually three different bills, with three different drivers, that look identical on the invoice. The reference architectures going around online conflate all three. Buying the wrong fix for your actual bottleneck is how companies end up paying twice and still having the original problem.

This is the taxonomy. Three layers, three waste modes, three different shapes of solution.

Layer 1: Content, re-reading prose

When an AI agent answers a question about prose-heavy material (a research paper, a legal contract, a 40-page policy doc), most of the tokens go into re-reading that material on every query. The waste mode is "I have already read this PDF nine times today, and the tenth person is going to ask the same thing about it tomorrow."

This is the layer Andrej Karpathy is solving with his LLM-compiled wiki. The architecture is to read each source once, compile it into a dense, deduplicated, cross-linked Markdown corpus, and let subsequent queries operate on the compiled artifact instead of the raw files. For workloads that genuinely run on prose (research labs, law firms, policy shops, due-diligence teams), the savings are real and large.
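The read-once, query-many idea can be sketched as a cache keyed by content hash. This is a minimal illustration, not Karpathy’s implementation: `CompiledCorpus`, `toy_compile`, and the sample policy text are hypothetical names and data, and `compile_fn` stands in for whatever expensive LLM pass produces the compiled page.

```python
import hashlib
from typing import Callable, Dict


class CompiledCorpus:
    """Compile each source once; serve later queries from the compiled page."""

    def __init__(self, compile_fn: Callable[[str], str]) -> None:
        # compile_fn is a hypothetical stand-in for the expensive LLM pass
        # that turns raw prose into a dense Markdown page.
        self._compile_fn = compile_fn
        self._pages: Dict[str, str] = {}  # content hash -> compiled page
        self.compile_calls = 0

    def page_for(self, raw_text: str) -> str:
        key = hashlib.sha256(raw_text.encode("utf-8")).hexdigest()
        if key not in self._pages:
            # Only unseen or changed content pays the compile cost.
            self._pages[key] = self._compile_fn(raw_text)
            self.compile_calls += 1
        return self._pages[key]


def toy_compile(raw_text: str) -> str:
    # Placeholder "compilation": keep only the first sentence.
    return raw_text.split(".")[0].strip() + "."


corpus = CompiledCorpus(toy_compile)
policy = "Refunds are allowed within 30 days. Exceptions need approval."
for _ in range(10):
    page = corpus.page_for(policy)  # ten queries, one compile
```

Because the key is a hash of the content, an edited source gets a new key and is recompiled, while unchanged sources are never re-read.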

Two warnings even when you have a content-layer problem: the wiki has to be maintained (at enterprise scale, recompile and lint costs scale linearly with corpus size), and it does nothing for queries that aren’t really about prose.

Layer 2: Semantic, re-resolving meaning

Most enterprise AI queries are not "summarize this document." They’re "compute this number under these conditions."

  • "Revenue for Q1 in DACH, on the new accounting standard."
  • "Open invoices over 30 days, excluding inter-company."
  • "Pipeline coverage at 2x quota, weighted by stage probability."

Every one of these depends on a definition. Revenue. Open. Pipeline. Coverage. In most enterprises these terms aren’t ambiguous because nobody thought about them; they’re ambiguous because each business unit got them right for its own context. Finance recognizes revenue. Sales books contracts. Both are correct. The conflict is real and load-bearing.

When an AI agent runs into this, it does not say "definitions conflict." It guesses. It retries. It hedges. It quietly chooses whichever interpretation the prompt nudged it toward. The token bill is the visible cost. The wrong number in the next board deck is the invisible one.

This is the semantic layer. It is where most enterprise AI token spend actually lives, and the layer most teams don’t recognize as a layer at all. They call it "data quality," or "context," or "AI accuracy," and try to solve it with prompting tricks, longer context windows, or another retrieval system. None of that works for long, because none of it carries the version history of how Finance defined ARR last quarter.
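One way to picture a semantic layer is a registry of versioned, owned definitions that refuses to guess when owners conflict. The sketch below is an assumption-laden illustration, not a product design: `Definition`, `SemanticLayer`, `AmbiguousTerm`, and every rule string and date are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Definition:
    term: str         # e.g. "revenue"
    owner: str        # business unit that owns this meaning
    valid_from: date  # first day this version applies
    rule: str         # human-readable statement of the definition


class AmbiguousTerm(Exception):
    """Raised instead of letting the model guess between owners."""


class SemanticLayer:
    def __init__(self, definitions: List[Definition]) -> None:
        self._definitions = definitions

    def resolve(self, term: str, as_of: date,
                owner: Optional[str] = None) -> Definition:
        candidates = [
            d for d in self._definitions
            if d.term == term and d.valid_from <= as_of
            and (owner is None or d.owner == owner)
        ]
        if not candidates:
            raise KeyError(f"no definition of {term!r} as of {as_of}")
        owners = {d.owner for d in candidates}
        if len(owners) > 1:
            raise AmbiguousTerm(f"{term!r} is defined by {sorted(owners)}")
        # The latest version in force on the as_of date wins.
        return max(candidates, key=lambda d: d.valid_from)


# Hypothetical definitions, for illustration only.
layer = SemanticLayer([
    Definition("revenue", "finance", date(2025, 1, 1), "recognized, old standard"),
    Definition("revenue", "finance", date(2026, 1, 1), "recognized, new standard"),
    Definition("revenue", "sales", date(2025, 1, 1), "booked contract value"),
])
current = layer.resolve("revenue", date(2026, 3, 31), owner="finance")
last_year = layer.resolve("revenue", date(2025, 6, 30), owner="finance")
```

The `valid_from` field is what a longer context window cannot supply: it lets a query about last year resolve against last year’s definition, and the explicit exception replaces the silent guess.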

Layer 3: Procedural, re-discovering how decisions get made

The third layer is the one nobody talks about until an AI agent has to actually do something: resolve a customer ticket, approve a refund, escalate an incident, route an exception. Now it needs to know how the work actually gets done in this company. Not "what does revenue mean," but "what does a senior support rep do when a tier-2 customer churns mid-contract for the third time this year."

That knowledge usually lives in nobody’s head and everybody’s hands. It’s procedural. It’s tacit. It rarely makes it into documentation because the people who know it don’t have time to write it down, and what they do write becomes outdated within a quarter.

This is the layer companies like Interloom are solving, by ingesting operational records and mapping how decisions actually get made by the experienced humans in the room. It’s a real category, adjacent to ours, and a different fix again: you can’t compile prose to solve it, and you can’t compile a definition to solve it.
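As a rough illustration of what "mapping how decisions actually get made" could mean in its simplest form, the sketch below counts which action experienced staff took in each recorded situation. It is a toy under stated assumptions and makes no claim about how Interloom or any vendor works; the record format, situation labels, and `observed_playbook` are all hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical operational records: (situation, action a human actually took).
records = [
    ("tier2_churn_repeat", "escalate_to_account_director"),
    ("tier2_churn_repeat", "escalate_to_account_director"),
    ("tier2_churn_repeat", "offer_discount"),
    ("refund_under_limit", "approve"),
]


def observed_playbook(
    rows: Iterable[Tuple[str, str]],
) -> Dict[str, Tuple[str, float]]:
    """Map each situation to its most common action and that action's share."""
    by_situation: Dict[str, Counter] = defaultdict(Counter)
    for situation, action in rows:
        by_situation[situation][action] += 1
    playbook = {}
    for situation, counts in by_situation.items():
        action, n = counts.most_common(1)[0]
        playbook[situation] = (action, n / sum(counts.values()))
    return playbook


playbook = observed_playbook(records)
```

The share next to each action matters as much as the action itself: a low share signals that practice is split and the case should go to a human rather than be automated.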

Three waste modes. Three different fixes.

Most reference architectures pick one layer (usually content, because Karpathy’s setup is the most visible one to copy) and apply it to all three. The resulting wiki incurs setup and recompile costs but doesn’t fix the bill, because the bill is being burnt on the semantic layer underneath.

01

Content Layer

10–20%
of the typical enterprise AI bill

Re-reading prose on every query. Research papers, contracts, policy docs. The waste Karpathy’s wiki architecture is designed to eliminate, and does, at the right scale.

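Fix: build a wiki

02

Semantic Layer

60–80%
of the typical enterprise AI bill

Re-resolving meaning on every query. Revenue, open, pipeline, coverage: definitions that conflict across business units, so the model guesses, retries, and hedges. Where most enterprise AI token spend actually lives.

Fix: build a semantic layer

03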

Procedural Layer

5–15%
of the typical enterprise AI bill

Re-discovering how decisions get made. Tacit operational knowledge that lives in nobody’s documentation but everybody’s institutional memory. Can’t be fixed by prose or definitions alone.

Fix: procedural infrastructure

How the layers compose

Imagine an AI agent reasoning about a tier-2 customer churning mid-contract for the third time. To answer well it needs:

  • Procedural context. How do senior reps usually handle this?
  • Semantic context. What counts as a "tier-2" customer this quarter? What is "mid-contract" given the renewal clause? Does "third time" mean within this fiscal year or over the customer’s lifetime?
  • Content context. What does the customer’s most recent support email actually say?

Three layers, three fixes. Compress each in the right shape and you take 70–80% of the cost out of the workflow. Compress them in the wrong shape and you reinvent expensive architecture for the wrong layer.
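The composition described above can be sketched as a context builder that takes one compact input per layer. This is an illustrative shape only: `build_context`, the section labels, and every example value (including the definitions marked hypothetical) are assumptions, not data from a real system.

```python
from typing import Dict


def build_context(procedural: str, semantic: Dict[str, str], content: str) -> str:
    """Assemble one prompt context with a separate block per layer."""
    definition_lines = [f"- {term}: {meaning}" for term, meaning in semantic.items()]
    sections = [
        "PROCEDURE (how senior reps usually handle this):",
        procedural,
        "DEFINITIONS (resolved before the model is called):",
        *definition_lines,
        "SOURCE TEXT (only the passage that matters):",
        content,
    ]
    return "\n".join(sections)


context = build_context(
    procedural="Escalate to the account director before offering credit.",
    semantic={
        "tier-2": "hypothetical: revenue band set by Finance for this quarter",
        "third time": "hypothetical: counted within the current fiscal year",
    },
    content="Customer email: 'We are cancelling effective end of month.'",
)
```

Each argument is meant to arrive already compressed by its own layer’s fix, so the model spends tokens on the decision rather than on rediscovering procedure, definitions, or prose.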

A practical test for your stack

Take a thousand of your real AI queries from last week. Look at where the tokens went. Bucket them by waste mode.

  • If most of the spend was on re-reading prose, you have a Layer 1 problem. Build a wiki.
  • If most of the spend was on the model retrying, hedging, asking for clarification, or re-resolving the same definitions, you have a Layer 2 problem. Build a semantic layer.
  • If most of the spend was on the model trying to figure out how to do something it’s never been shown how to do, you have a Layer 3 problem. Build or buy procedural infrastructure.
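The bucketing step can be sketched in a few lines once each sampled query carries a waste-mode label. How the label gets assigned (manual review, a classifier) is left open here; `token_share_by_layer`, the label names, and the sample token counts are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

LAYERS = ("content", "semantic", "procedural")


def token_share_by_layer(queries: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """queries: (layer label, tokens spent). Returns each layer's token share."""
    totals: Dict[str, int] = defaultdict(int)
    for layer, tokens in queries:
        if layer not in LAYERS:
            raise ValueError(f"unknown layer label: {layer!r}")
        totals[layer] += tokens
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("no tokens recorded")
    return {layer: totals[layer] / grand_total for layer in LAYERS}


# Hypothetical sample; a real run would use last week's labeled queries.
sample = [
    ("semantic", 4000),
    ("content", 2000),
    ("semantic", 3000),
    ("procedural", 1000),
]
shares = token_share_by_layer(sample)
dominant = max(shares, key=shares.get)  # the layer to fix first
```

Weighting by tokens rather than by query count is deliberate: a handful of retry-heavy queries can dominate the bill even when they are a small fraction of traffic.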

What we see in mid-market and enterprise customers: Layer 2 is 60–80% of the bill, Layer 1 is 10–20%, and Layer 3 is the rest. Your numbers will differ. But knowing which layer your spend sits in is the difference between fixing your AI cost and generating consulting work for someone.
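These shares put a ceiling on what any single-layer fix can save: the reduction in the total bill is at most the layer’s share times the fraction of that layer’s waste removed. The sketch below uses the share ranges quoted in this article; the removal rates (100% and 50%) are illustrative assumptions, and `max_bill_reduction` is a hypothetical helper.

```python
def max_bill_reduction(layer_share: float, waste_removed: float) -> float:
    """Upper bound on total-bill savings from fixing a single layer."""
    if not (0.0 <= layer_share <= 1.0 and 0.0 <= waste_removed <= 1.0):
        raise ValueError("both arguments must be fractions between 0 and 1")
    return layer_share * waste_removed


# Content layer at the top of its range, with ALL of its waste removed
# (a deliberately generous assumption for the wiki).
wiki_best_case = max_bill_reduction(0.20, 1.0)

# Semantic layer at the bottom of its range, with only HALF removed
# (a deliberately conservative assumption).
semantic_half_fixed = max_bill_reduction(0.60, 0.5)
```

Under these assumptions a perfect wiki caps out at a fifth of the bill, while a half-finished semantic layer already clears it, which is the arithmetic behind picking the layer before picking the fix.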

Architecture wins beat brand wins in enterprise software. The team that identifies the right layer for its bottleneck spends a fraction of what a team copying the trendiest reference architecture spends. In Part 3, we run the cost math to show exactly how large that gap is.

Find out which layer your bill is being burnt against.

Share your email and we’ll run a free token cost audit, showing where your knowledge worker queries are burning budget and what a 90-day reduction path looks like.