Hey hey,

I’m sure you’ve been through this too.

You open a long chat with Claude or ChatGPT, paste in a big document, ask a few follow-ups, and somewhere around message fifteen, the answers start drifting.

It forgets an instruction you gave at the top. It contradicts itself. It misses a number that is right there in the document you pasted.

  • The first instinct is "the model got lazy."

  • The second is "maybe I need a bigger context window."

But both are wrong. What you are seeing has a name now. It is called context rot, and it is one of the most important things an AI PM can understand in 2026.

Because the fix isn't a bigger window. It's managing what goes in.

In this edition, you will learn what context rot is, the three reasons it happens, the research that proves every frontier model does it, and a framework to design around it.

Let's go.

What Is Context Rot?

First, the context window is everything the model can see: the prompt, the chat history, pasted documents, tool results, and the system instructions. (Detailed explanation here)

Most people assume the context window works like RAM: as long as you stay under the limit, everything inside is equally available. That's the trap.

By that logic, a 1-million-token window means you can drop a whole codebase or a hundred-page contract in there, and the model treats it all with the same care.

It does not. Context rot is the measurable drop in answer quality as the amount of context grows. Not when the window is full. Well before that.

To put it simply, the same question, with more stuff in the context, gets a worse answer. You gave the model more to hold, and holding more makes it worse at using any of it.

Why Bigger Context Windows Don’t (Always) Help

In 2025 and 2026, the labs raced to massive context windows.

A million tokens became normal across GPT, Claude, and Gemini. The marketing said, "Now you never have to worry about what to include. Just put everything in."

But the reality is the opposite. A bigger window is a bigger room, not a better memory. You can fit more furniture in, but the model still can’t pay equal attention to all of it.

Chroma ran the cleanest test of this in July 2025.

They evaluated 18 frontier models, including GPT-4.1, Claude 4, and Gemini 2.5. Each one degraded as context grew, at every length they tested.

Models with million-token windows started slipping at a small fraction of that, long before the window was anywhere near full.

They also found something that should bother every PM building retrieval.

The models performed worse on neatly organised documents than on the same documents shuffled into random order. One plausible reading is that they get pulled into tracking the narrative flow instead of finding the answer. Here is the headline every PM needs.

The window size on the spec sheet is not the amount of context the model uses well.

How Context Rot Actually Happens

You need the three mechanisms because each one points to a different fix.

1. Lost in the middle.
Models pay the most attention to the start and the end of the context, and the least to the middle. Researchers found that accuracy on information buried in the middle can drop by 30 percent or more compared to the same fact placed at the edges. It is a U-shape. Strong at the top, strong at the bottom, soft in the belly.

So if the one clause that matters is on page 40 of a 90-page contract you pasted, the model is structurally likely to skim past it.

2. Attention dilution.
Every token in the context competes for a fixed pool of attention. Add more tokens, and each one gets a thinner slice. It is also expensive in a way that compounds: attention scales with the square of the length, so 100,000 tokens means roughly 10 billion pairwise relationships to weigh. More noise, less signal per token.

3. Distractor interference.
This is the sneaky one. Content that looks relevant but is not actively pulls the model off course. Ten paragraphs that are loosely about your topic do not help the model find the one that answers the question. They bury it. Similar-but-wrong is worse than absent.

If you notice the pattern, none of these are about the model being weak. They are about the context being crowded.
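
If you want to feel the dilution and the square-law arithmetic from mechanism 2, here is a toy sketch. It assumes a single softmax over made-up scores (one relevant token scored 4.0, every distractor scored 1.0), which is far simpler than a real model with many heads and layers. The function names are mine, for illustration only.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weight_on_relevant(num_distractors, relevant_score=4.0, distractor_score=1.0):
    """Share of attention that lands on the single relevant token (toy scores)."""
    scores = [relevant_score] + [distractor_score] * num_distractors
    return softmax(scores)[0]


def pairwise_pairs(num_tokens):
    """Token pairs that full self-attention has to score: n squared."""
    return num_tokens ** 2


for n in (10, 100, 1_000, 10_000):
    share = weight_on_relevant(n)
    print(f"{n:>6} distractors -> {share:.4f} of attention on the relevant token")

print(f"{pairwise_pairs(100_000):,} pairs at 100,000 tokens")  # 10,000,000,000
```

The relevant token's score never changes. Its share shrinks only because more tokens are splitting the same pool.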

Why This Matters

This shows up everywhere there is AI in the product.

  • Long conversations.

    Support bots and copilots that hold a session degrade as the chat grows. The bug your user hits at turn 20 is often context rot.

  • RAG that retrieves too much.

    Stuffing 30 documents into the context to be safe makes answers worse, not better. If you are using retrieval, this is the difference between a system that works and one that confidently gives you the wrong answer. I covered the basics in What Is RAG, and context rot is the reason "retrieve less, but the right things" beats "retrieve everything."

  • Long-running agents.

    An agent that runs for 40 steps accumulates a giant context of tool calls and observations. By step 40, it is reasoning over a rotting pile of its own history. This is one of the real reasons agents lose the plot, and it sits right next to the memory problem I wrote about in Why Your AI Agent Always Forgets Users.

If your product gets worse the longer it runs, suspect context rot first.

What Should You Do About It

You cannot turn off how attention works. But you can design the context so rot never gets a chance to set in. I use a simple checklist called TRIM.

TRIM: a simple checklist for designing context that resists rot.

  • T is for Trim.

    Do not dump. Pass the model the smallest amount of context that can answer the question. Retrieve the three right paragraphs, not the thirty safe ones. Every token you remove is a token that cannot distract.

  • R is for Rank.

    Position beats hope. Put the most important information at the very start or the very end of the context, never in the middle. If you have a critical instruction or a key document, anchor it at an edge.

  • I is for Isolate.

    Split big jobs so each piece runs in a small, clean context. Instead of one agent reasoning over 40 steps of history, use sub-agents or fresh sessions, each carrying only what it needs. Small context, sharp answer.

  • M is for Manage.

    For anything long-running, compact as you go. Summarise the conversation so far into a tight brief and carry that forward instead of the full transcript. Trade raw history for a clean summary before the history rots.
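
Trim and Rank are easy to picture in code. The sketch below is one way to do it, not the only way. It assumes your retriever already gives each chunk a relevance score (higher is better), and the names `Chunk`, `trim`, `rank_to_edges`, and `build_context` are hypothetical helpers I made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float  # relevance from your retriever (assumed, higher is better)


def trim(chunks, keep=3):
    """Trim: keep only the `keep` highest-scoring chunks."""
    return sorted(chunks, key=lambda c: c.score, reverse=True)[:keep]


def rank_to_edges(chunks):
    """Rank: strongest chunks go to the start and end, weakest to the middle."""
    ordered = sorted(chunks, key=lambda c: c.score, reverse=True)
    front, back = [], []
    for i, chunk in enumerate(ordered):
        # Alternate between the top edge and the bottom edge.
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


def build_context(instruction, chunks, question, keep=3):
    """Instruction at the top edge, question at the bottom edge, chunks between."""
    body = [c.text for c in rank_to_edges(trim(chunks, keep))]
    return "\n\n".join([instruction, *body, question])


chunks = [
    Chunk("clause A", 0.2),
    Chunk("clause B", 0.9),
    Chunk("clause C", 0.5),
    Chunk("clause D", 0.7),
    Chunk("clause E", 0.1),
]
context = build_context(
    "Answer only from the clauses below.",
    chunks,
    "What is the termination notice period?",
)
print(context)
```

Here clauses A and E never reach the model, and the two strongest clauses (B and D) sit next to the instruction and the question instead of in the belly.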

Here’s How To Get Started Right Away

You can pressure test your own product against context rot in an afternoon.

  1. Find one flow where the AI has a lot of context. It could be a long chat, a heavy RAG call, or a multi-step agent.

  2. Measure quality on the same task with a small context and with a large one. Does accuracy fall as the context grows? Now you have a baseline.

  3. Apply Trim. Cut what you feed it to the minimum that still answers. Then, re-measure.

  4. Apply Rank. Move your most important instruction or document to the start or end. Re-measure again.

  5. For long-running flows, add a compaction step that summarises history before it gets large.
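
The compaction step in 5 can be very small. Here is a minimal sketch, assuming the history is a plain list of message strings (oldest first) and that you supply your own `summarize` function, for example a model call. The word-count token estimate, the budget of 2000, and the `stub_summarize` placeholder are all assumptions for illustration.

```python
def count_tokens(text):
    """Crude estimate: whitespace-separated words. Swap in your real tokenizer."""
    return len(text.split())


def compact(history, summarize, budget=2000, keep_recent=4):
    """Replace older turns with one summary once the history exceeds `budget`.

    `history` is a list of message strings, oldest first.
    `summarize` is a callable you supply that maps a list of messages
    to one short brief.
    """
    total = sum(count_tokens(m) for m in history)
    if total <= budget or len(history) <= keep_recent:
        return history  # still small enough, carry it forward untouched
    split = len(history) - keep_recent
    older, recent = history[:split], history[split:]
    brief = summarize(older)
    return [f"Summary of earlier conversation: {brief}"] + recent


def stub_summarize(messages):
    # Placeholder so the sketch runs; a real version would call a model.
    return f"{len(messages)} earlier messages condensed"


history = [f"turn {i}: " + "word " * 50 for i in range(20)]
compacted = compact(history, stub_summarize, budget=500, keep_recent=4)
print(len(history), "->", len(compacted))  # 20 -> 5
```

Keeping the last few turns verbatim is a design choice: the most recent messages usually carry the live task, so only the older history gets traded for a brief.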

If quality goes up when you put less in, you found your context rot and just fixed it.
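
To make the measuring in step 2 concrete, here is a rough harness sketch. It assumes you wrap your own model call in an `ask_model(context, question)` function that returns a string. The `fake_model` below is only a stand-in so the script runs end to end, and its numbers say nothing about any real model. All names here are hypothetical.

```python
def build_haystack(needle, filler, num_filler, position=0.5):
    """Place one needed fact among `num_filler` copies of a filler paragraph."""
    paragraphs = [filler] * num_filler
    paragraphs.insert(int(position * num_filler), needle)
    return "\n\n".join(paragraphs)


def accuracy_by_size(ask_model, cases, sizes, filler):
    """Return {number of filler paragraphs: fraction of cases answered correctly}.

    `cases` is a list of (needle, question, expected) tuples.
    """
    results = {}
    for size in sizes:
        correct = 0
        for needle, question, expected in cases:
            context = build_haystack(needle, filler, size)
            answer = ask_model(context, question)
            correct += expected.lower() in answer.lower()
        results[size] = correct / len(cases)
    return results


def fake_model(context, question):
    # Stand-in that only "sees" the first and last 300 characters.
    visible = context[:300] + context[-300:]
    return "30 days" if "30 days" in visible else "not found"


cases = [("The notice period is 30 days.", "What is the notice period?", "30 days")]
filler = "This paragraph is about something else entirely."
results = accuracy_by_size(fake_model, cases, sizes=(0, 4, 40), filler=filler)
print(results)
```

Swap `fake_model` for your real call, use filler that looks like your real documents, and run the same cases before and after each TRIM change.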

Common Mistakes To Avoid:

Treating the window limit as the usable budget.
The spec says 1 million tokens, but the reliable zone is far smaller. Design for the reliable zone.

Adding more context to fix a wrong answer.
The instinct is to paste more. That usually makes it worse. Subtract before you add.

Putting the vital instruction in the middle.
The system prompt is long, the docs are long, and the rule that matters is buried at line 200. Move it to an edge.

Retrieving for safety.
"Grab 30 documents so we do not miss it" is the classic RAG mistake. Precision beats recall once rot is in play.
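
The arithmetic behind that last mistake, as a quick sketch. The document IDs and the two "relevant" documents are invented for illustration, and it assumes your retriever already ranks the relevant ones near the top.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are actually relevant."""
    if not retrieved:
        return 0.0
    return sum(doc in relevant for doc in retrieved) / len(retrieved)


def recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return sum(doc in retrieved for doc in relevant) / len(relevant)


relevant = {"doc_07", "doc_12"}  # made-up ground truth
top_3 = ["doc_07", "doc_12", "doc_03"]
top_30 = top_3 + [f"filler_{i:02d}" for i in range(27)]

for name, docs in (("top 3", top_3), ("top 30", top_30)):
    print(f"{name}: precision {precision(docs, relevant):.2f}, recall {recall(docs, relevant):.2f}")
```

In this made-up case, going from 3 to 30 documents buys no extra recall and adds 27 potential distractors. If your own numbers show recall rising with more documents, that is a retriever problem to fix upstream, not a reason to stuff the context.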

In a Nutshell

Bigger context windows did not end the context problem.

They made it easier to overload the model without noticing. Context rot is more context, worse answers, well before the window is full.

It happens because models lose the middle, dilute their attention, and get misled by near misses. The fix is discipline about what goes in.

Trim what you feed it. Rank the important things to the edges. Isolate big jobs into small contexts. And manage a long history with summaries.

The PMs who get this stop asking "which model has the biggest window" and start asking "what is the least I can give the model to get this right."

That second question is the whole game.

See you in the next edition,
— Sid
