Most "Data Moats" Are Fake: Three Questions That Kill Nine Out of Ten Stories

Most "data moats" are fake. "Data is the new oil" is the most misleading metaphor of this cycle — data is the water, position is the riverbed. Three questions kill nine out of ten flywheel stories; the sharpest asks whether the scenario produces free ground truth. Plus the quiet repricing from training fuel to inference context — and why that's lock-in.

If you’ve ever heard “we have a data moat” in a pitch — this piece gives you three questions to test it on the spot. If you’re building, it answers something far more important than which model to use: choose your referee before you choose your arena.

First, kill the metaphor that’s run for fifteen years: “data is the new oil.” It’s wrong in three fundamental ways — oil is consumed when burned, data copies for free; oil is identical to every buyer, data loses most of its value outside the context that produced it; oil grows scarcer as it’s extracted, while data’s marginal value to a model decays as it accumulates.
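The third point, decaying marginal value, can be made concrete with a small sketch. It assumes, purely for illustration, that model error follows a power law in the number of examples; the curve shape and the constants `a` and `b` are hypothetical, not measurements.

```python
def loss(n_examples: float, a: float = 1.0, b: float = 0.1) -> float:
    """Hypothetical power-law error curve: error = a * n^(-b). Constants are made up."""
    return a * n_examples ** (-b)


def gain_from_next_batch(n_have: float, batch: float = 1_000_000) -> float:
    """Error reduction from adding one more batch on top of n_have examples."""
    return loss(n_have) - loss(n_have + batch)


# The same million examples, added early versus added late.
early = gain_from_next_batch(1_000_000)
late = gain_from_next_batch(1_000_000_000)

print(f"gain at 1M examples: {early:.6f}")
print(f"gain at 1B examples: {late:.6f}")
```

Under these assumed constants, the same batch is worth far less once a billion examples are already in hand. A barrel of oil does not behave that way.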

Swap out the metaphor and the real question appears. What’s valuable is never the data itself; it’s the position that produces data nobody else can get. Data is the water; position is the riverbed.

Three Kinds of Data, Three Fates

Public training data: the high-grade veins have been mined nearly bare by frontier labs and open source alike — and more fundamentally, a mine anyone can enter was never a moat. What remains of the story is cost-side: rights holders waking up, free fuel becoming priced fuel. Stop paying for “look how much data we scraped.”

Proprietary static data (industry databases, archives): genuinely valuable, but structured like a mine, not a river — sellable, not defensible. Each license transfers the value once, and it doesn’t grow with usage. That’s an asset business, not a flywheel business — it deserves asset multiples, not growth multiples.

Closed-loop data: usage generates data → data improves the product → a better product drives more usage. The only real flywheel — but every founder can draw this loop, and drawing it and spinning it are two different things.

The Fake-Flywheel Test: Three Questions

One: does the data actually improve the product? Most conversation logs and clickstreams contribute almost nothing at the margin — capability gains come from methods and compute, not another batch of chat history. Data accumulation without an improvement mechanism is just a storage bill.

Two: is the improvement perceptible to users? A product that’s 3% better in ways nobody feels never completes the loop: without a felt improvement there is no extra usage, and the wheel doesn’t turn.

Three (the sharpest): does the loop contain a free signal of right and wrong?

The scarce resource of a real flywheel is ground truth that the scenario produces naturally, for free.

Why was the coding agent one of the first businesses of the LLM era to scale? Because the coding scenario ships with its own referee: did it run, did the tests pass — a free, instant, unambiguous verdict — so every use automatically generates a labeled training signal. Compare customer support (satisfaction is fuzzy), writing (quality is subjective), legal (feedback arrives in months): the clearer, faster, and cheaper the referee, the realer the flywheel.
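A minimal sketch of what "ships with its own referee" means in practice. Every name here (`LabeledExample`, `referee`, `log_use`) is hypothetical and invented for illustration; it is not how any particular coding agent works. The point is only that the label comes from running tests, with no human in the loop.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LabeledExample:
    task: str
    candidate_source: str
    passed: bool  # the verdict, produced by running tests rather than by a human


def referee(candidate_source: str, func_name: str,
            cases: List[Tuple[tuple, object]]) -> bool:
    """Run the candidate against the cases; any exception or mismatch is a fail."""
    namespace: dict = {}
    try:
        # Illustration only: real systems must sandbox untrusted code.
        exec(candidate_source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        return False


def log_use(task: str, candidate_source: str, func_name: str,
            cases: List[Tuple[tuple, object]]) -> LabeledExample:
    """Each use yields a labeled example as a by-product of running the tests."""
    return LabeledExample(task, candidate_source,
                          referee(candidate_source, func_name, cases))


cases = [((2, 3), 5), ((-1, 1), 0)]
good = log_use("add two numbers", "def add(a, b):\n    return a + b", "add", cases)
bad = log_use("add two numbers", "def add(a, b):\n    return a - b", "add", cases)
print(good.passed, bad.passed)
```

In customer support or writing there is no equivalent of `referee` to call: the `passed` field has to be filled in by a person, or by an outcome that arrives much later.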

Pre-empting one rebuttal: fuzzy scenarios don’t lack flywheels — their flywheels are slower and more expensive, because the referee must be hired (expert labeling, lagging outcome data). That changes the unit economics, not the possibility. But “the referee costs money” is itself a filter: only scenarios whose contract values can carry the referee’s cost can spin the wheel — which is exactly why vertical AI produces real moats more often than general-purpose AI.
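The filter can be put as back-of-the-envelope arithmetic. The sketch below is an assumption-laden illustration: the contract values, label counts, cost per label, and the 10% ceiling are all hypothetical numbers chosen to show the shape of the argument, not data.

```python
def referee_cost_share(annual_contract_value: float,
                       labeled_uses_per_year: int,
                       cost_per_label: float) -> float:
    """Share of one customer's annual contract value spent on hired verdicts."""
    return labeled_uses_per_year * cost_per_label / annual_contract_value


def can_carry_referee(share: float, max_share: float = 0.10) -> bool:
    """Assumed rule of thumb: the wheel spins only if labeling stays under max_share."""
    return share <= max_share


# Hypothetical vertical product: $200k contract, 500 expert-labeled uses at $20 each.
vertical = referee_cost_share(200_000, 500, 20.0)

# Hypothetical general-purpose subscription: $240 a year, 50 labeled uses at $20 each.
general = referee_cost_share(240, 50, 20.0)

print(f"vertical: {vertical:.0%} of contract value, spins: {can_carry_referee(vertical)}")
print(f"general:  {general:.0%} of contract value, spins: {can_carry_referee(general)}")
```

With these made-up inputs the same expert costs 5% of the vertical contract but several times the entire general-purpose subscription. The referee is affordable only where the contract is large enough to pay for it.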

A Quiet Repricing: From Training Fuel to Inference Context

The center of gravity of data’s value is shifting from training time to inference time. The real use of enterprise data is less and less “fine-tune a model” and more “feed it as context at inference” — retrieval, memory, personalization, workflow state.

That moves the moat: in the training era, data’s value was realized once (smelted into weights); in the context era, it’s realized continuously — every call needs it present, and presence is lock-in. Three years of accumulated workflow context, permissions, and organizational memory can’t be migrated by exporting data; you’d have to rebuild every connection between the data and the processes. Whoever holds the accumulated context holds the lock-in.

Three takeaways to close. Investors: run the three questions on every “data moat,” especially the third; a flywheel without ground truth is a slide-deck flywheel. Builders: pick your referee before your arena. Is your scenario’s verdict free or purchased, instant or lagging? That choice matters more than your model choice. Everyone: make context accumulation a design goal. Every use should make you harder to uninstall, or you’re just doing distribution for a model company.

In your scenario — is the verdict free? And how fast does it come back? Comments open.

Data sources (verified, June 2026): this piece is a framework argument. Its key factual anchors are the scale of coding-agent commercialization (Claude Code reported at over $2.5B annualized as of Feb 2026) and the rising share of synthetic data alongside copyright litigation and licensing running in parallel (public record).

— From Chapter 7 of a book in progress, working title The Deflation Sandwich

#AI #DataMoat #AIStartups
