How LLM Context Windows Actually Work Under the Hood

Your AI assistant just forgot what you said three paragraphs ago. The context window is supposed to prevent that. It doesn’t always.

Context windows are one of the most misunderstood features in AI. People treat them like a RAM spec: bigger is always better, and if your input fits inside the limit, the model read it. Neither of those things is reliably true. Here’s what’s actually happening.

The short answer

A context window is the maximum number of tokens a model can process at once: your prompt, any conversation history, any documents you attached, plus the model’s response so far. Everything the model “sees” at inference time must fit inside this window. Anything outside it doesn’t exist, from the model’s perspective.

Claude Sonnet currently supports a 200,000-token context window. That sounds large. Translated to real text, it’s roughly 150,000 words, or about two full-length novels. For comparison, GPT-4 Turbo supports 128,000 tokens. These are genuinely large numbers. They also come with asterisks.

The long answer

Tokens are not words

Before getting to how context windows work, it helps to be precise about tokens. A token is a chunk of text, but not necessarily a word. Common short words like “the” or “is” map to a single token. Longer or rarer words get split: “cryptocurrency” might be two or three tokens depending on the tokenizer. In practice, English text runs around 0.75 words per token, so a 200,000-token window holds roughly 150,000 words.
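To make the arithmetic concrete, here is a minimal sketch of the 0.75 words-per-token rule of thumb. The function names are made up for illustration, and the ratio is only an average for English; a real tokenizer will give different counts, especially for code or non-English text.

```python
import math

# Rough English average quoted in this article; real tokenizers vary.
WORDS_PER_TOKEN = 0.75

def estimate_tokens(text: str) -> int:
    """Estimate a token count from a whitespace word count (heuristic only)."""
    words = len(text.split())
    return math.ceil(words / WORDS_PER_TOKEN)

def estimate_words(token_budget: int) -> int:
    """Estimate how many English words fit in a given token budget."""
    return int(token_budget * WORDS_PER_TOKEN)
```

Calling estimate_words(200_000) gives 150,000, which is where the “two novels” figure comes from. For anything billing-sensitive, count with the provider’s actual tokenizer instead of this estimate.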

This matters because the window limit is a token limit, not a word limit, and it applies to everything: your system prompt, your entire conversation history, any documents you paste in, and the model’s own output so far. The counter is always running.
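One way to picture the running counter is as a simple budget. The sketch below is illustrative only: the class and field names are hypothetical, and the 200,000 default is just the window size used as an example in this article.

```python
from dataclasses import dataclass

@dataclass
class ContextUsage:
    """Token counts for each part of a request (hypothetical breakdown)."""
    system_prompt: int
    history: int
    documents: int
    output_so_far: int

    def total(self) -> int:
        # Everything counts against the same window.
        return (self.system_prompt + self.history
                + self.documents + self.output_so_far)

def remaining_tokens(usage: ContextUsage, window: int = 200_000) -> int:
    """Tokens left before the window is full; negative means overflow."""
    return window - usage.total()
```

A request with a 2,000-token system prompt, 50,000 tokens of history, a 120,000-token document, and 3,000 tokens of output so far has 25,000 tokens of headroom left.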

How the model reads: attention

The core mechanism inside a transformer is called self-attention. On every forward pass, each token looks at the other tokens in the context (in the decoder-style models behind most chat assistants, at the tokens that come before it) and decides how much “attention” to pay to each one. This produces a weighted summary of the context that the model uses to generate each next token.

Think of it like a room where everyone can hear everyone else simultaneously. A token representing the word “she” looks at the surrounding context and figures out which earlier noun it refers to. A token representing a number looks at nearby tokens to understand its units and meaning. Every token is in conversation with every other token, all at once.
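For readers who want to see the mechanics, here is a stripped-down sketch of single-head scaled dot-product attention in plain Python. It assumes the query, key, and value vectors are already computed (a real model derives them from learned projection matrices, omitted here), and it applies the causal mask typical of decoder-style models, so each position only looks at itself and earlier positions.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def causal_attention(queries, keys, values):
    """Single-head attention where position i attends to positions 0..i."""
    dim = len(queries[0])
    outputs = []
    for i, q in enumerate(queries):
        # Score this query against every key up to and including position i.
        scores = [dot(q, keys[j]) / math.sqrt(dim) for j in range(i + 1)]
        weights = softmax(scores)
        # Weighted sum of the corresponding value vectors.
        out = [sum(w * values[j][k] for j, w in enumerate(weights))
               for k in range(len(values[0]))]
        outputs.append(out)
    return outputs
```

The weights are the “how much attention to pay” numbers from the description above, and the output for each position is the weighted summary.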

This sounds elegant, and it is. It’s also computationally expensive. The compute required for standard attention scales with the square of the sequence length, and so does the memory if the full matrix of attention scores is stored. Double the context, quadruple the cost. This is why long-context models require more hardware to run and cost more at inference: the math gets heavier, not linearly, but quadratically.
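The quadratic growth is easy to check with a few lines of arithmetic. This sketch counts query-key pairs in a full attention matrix for a single head in a single layer; the function name is made up for illustration, and optimized attention kernels may avoid storing the whole matrix even though the number of pairs to compute stays the same.

```python
def attention_score_entries(seq_len: int) -> int:
    """Query-key pairs in a full attention matrix (one head, one layer)."""
    return seq_len * seq_len

def growth_factor(old_len: int, new_len: int) -> float:
    """How many times more pairs the longer sequence needs."""
    return attention_score_entries(new_len) / attention_score_entries(old_len)
```

Going from 4,000 to 8,000 tokens gives a factor of 4. Going from 4,000 to 200,000 tokens is a 50x increase in length and a 2,500x increase in pairs.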

The KV cache

To avoid recalculating attention for tokens you’ve already seen, transformers store something called a key-value (KV) cache. When you’re having a conversation, the model doesn’t have to reprocess every prior token from scratch for each new token it generates. It caches the key and value vectors computed for earlier tokens and reuses them. This is what makes long generations and multi-turn conversation practical.
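Here is a toy illustration of the idea, assuming a single attention head and precomputed query, key, and value vectors. The KVCache class is a hypothetical sketch, not any library’s API: each new token appends its key and value once, then attends over everything stored so far, so earlier keys and values are never recomputed.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for one query over stored keys/values."""
    dim = len(query)
    scores = [sum(a * b for a, b in zip(query, k)) / math.sqrt(dim)
              for k in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [sum((e / total) * v[i] for e, v in zip(exps, values))
            for i in range(len(values[0]))]

class KVCache:
    """Toy per-head cache: keys and values are stored once, then reused."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        # Only the new token's key and value are added; older ones are reused.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)
```

The result for the latest token is the same as running attention over the full sequence, but the per-token work only involves the new query against the stored entries.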

The KV cache also explains why prompt caching exists as a billing feature. If you have a long system prompt that doesn’t change between calls, providers can cache its KV representations and charge you less for re-reading it. The compute work was already done once.

The cache has limits. Cached representations occupy GPU memory, which is finite and expensive. The cache grows with every token in the context, so a request that fills a 200K window ties up far more memory than one that uses 4K. This is part of why serving long contexts costs significantly more.
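A back-of-the-envelope estimate shows how fast the cache grows. The formula below is the standard accounting (one key and one value vector per token, per layer, per KV head), but the model dimensions in the defaults are hypothetical round numbers chosen for illustration, not the specs of any particular model.

```python
def kv_cache_bytes(
    n_tokens: int,
    n_layers: int = 32,        # hypothetical model depth
    n_kv_heads: int = 8,       # hypothetical number of key/value heads
    head_dim: int = 128,       # hypothetical size of each head
    bytes_per_value: int = 2,  # 16-bit floats
) -> int:
    """Approximate KV cache size for one sequence."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return per_token * n_tokens

def to_gib(n_bytes: int) -> float:
    return n_bytes / 2**30
```

With these assumed dimensions, each token costs 128 KiB of cache. A 4,000-token context needs about 0.49 GiB and a 200,000-token context about 24.4 GiB, for a single sequence.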

What happens at the edges

A model doesn’t read its context the way you read a document. Research has found a consistent pattern: models tend to be better at using information from the beginning and end of their context than information buried in the middle. A 2023 paper from Stanford and other institutions studied this directly and found that performance on retrieval tasks degraded significantly when the relevant information was placed in the middle of a long context, even when it was well within the model’s stated limit.

The researchers called this “lost in the middle.” The effect showed up across several models, so it isn’t a single-model quirk. A common explanation is that the strong positional signals at the start and end of a context anchor the model’s attention more effectively than the diffuse middle, though how strong the effect is varies by model and task.

Practical implication: if you’re feeding a model a long document and asking a specific question, where you place the relevant information matters. Putting it near the start or end of the prompt tends to produce better recall than burying it on page 12 of a 20-page paste.
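As a rough sketch of that advice, the helper below assembles a prompt with the passages you believe are most relevant at the top and the question at the very end, leaving lower-priority background in the middle. The function and section labels are hypothetical, and it assumes you have already decided which passages are relevant (for example, through a search step).

```python
def build_prompt(question: str, relevant: list[str], background: list[str]) -> str:
    """Put key passages first and the question last; background goes in between."""
    parts = ["Relevant passages:"]
    parts.extend(relevant)
    parts.append("Background material:")
    parts.extend(background)
    parts.append("Question: " + question)
    return "\n\n".join(parts)
```

Whether this ordering helps, and by how much, depends on the model and the task, so it is worth testing against your own prompts.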

Why this matters in 2026

Context windows have grown dramatically in the past three years. Models that supported 4,000 tokens in early 2023 now support 128,000 or 200,000. This is a genuine capability leap, enabling things that were impossible before: feeding an entire codebase into a single prompt, having a 3-hour meeting transcript summarized in one call, or analyzing a full legal document without chunking.

But the growth of context windows has also created a misconception: that longer context automatically means better performance. It doesn’t. Longer contexts cost more because of the quadratic scaling of attention, and on some tasks they produce worse outputs because the model’s attention dilutes across more tokens. Smaller, focused prompts often outperform bloated ones.

The other shift happening in 2026 is the rise of agentic systems, where models run for many turns without human intervention. Claude Code Routines, launched this week in research preview, runs Claude as a persistent background agent on codebases. These agents accumulate context across runs: tool outputs, prior conversation turns, file contents. Managing context carefully isn’t a nice-to-have in these systems; it’s an engineering discipline. Run out of context mid-task, and the agent loses the thread.
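One simple pattern for this kind of discipline is to track estimated usage as the agent works and trigger a cleanup step before the window fills. The class below is a hypothetical sketch, not part of any agent framework, and the 80% threshold is an arbitrary assumption.

```python
class ContextTracker:
    """Tracks estimated token usage across agent steps (illustrative helper)."""

    def __init__(self, window: int, compact_at: float = 0.8):
        self.window = window          # total tokens the model accepts
        self.compact_at = compact_at  # fraction of the window that triggers cleanup
        self.used = 0

    def add(self, tokens: int) -> None:
        """Record tokens added by a tool output, file read, or message."""
        self.used += tokens

    def should_compact(self) -> bool:
        """True once usage reaches the cleanup threshold."""
        return self.used >= self.window * self.compact_at
```

When should_compact() returns True, the agent can summarize or drop older material deliberately instead of discovering the problem when the window runs out mid-task.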

Common misconceptions

“If it fits in the context, the model read it.” Technically true but practically misleading. “Read” in the attention sense means every token computed its attention weights relative to the other tokens. But attention doesn’t mean recall. Information in the middle of a long context is accessed less reliably than information at the edges, as the “lost in the middle” research demonstrates.

“A bigger context window means a smarter model.” Context window size is an engineering parameter, not an intelligence parameter. A model with a 200K context window isn’t inherently better at reasoning than a model with a 32K window. It’s better at processing more text in one shot. Those are different things.

“The context window is like memory.” It’s more like a whiteboard. Everything on it is visible at inference time (modulo the middle-attention issue), but once the conversation ends, it’s erased. There’s no persistent memory across sessions unless the system is explicitly designed for it. When you start a new chat, the model knows nothing about your last conversation.

“Hitting the context limit is a hard error.” Not necessarily. Some APIs reject an oversized request outright, but many chat applications and agent frameworks handle overflow by dropping the earliest messages in the conversation. This can cause the model to silently lose important context mid-task. Catching this in agentic systems requires explicit monitoring of token counts, not just assuming the session is intact.
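To make truncation visible instead of silent, a trimming step can report what it removed. The sketch below is one hypothetical way to do it: it assumes messages are dictionaries with “role” and “content” keys and that the caller supplies a count_tokens function, and it keeps system messages while dropping the oldest other messages until the total fits.

```python
def trim_history(messages, budget, count_tokens):
    """Drop the oldest non-system messages until the total fits the budget.

    Returns (kept, dropped) so the caller can log or summarize what was lost.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    dropped = []

    def total(msgs):
        return sum(count_tokens(m["content"]) for m in msgs)

    # Remove from the front (oldest first) while over budget.
    while rest and total(system + rest) > budget:
        dropped.append(rest.pop(0))
    return system + rest, dropped
```

If the dropped list is non-empty, the agent knows the session is no longer intact and can react, for example by summarizing the removed messages or flagging the run for review.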

Where to learn more

Attention Is All You Need (Vaswani et al., 2017): the original transformer paper, still readable and worth the effort
Lost in the Middle: How Language Models Use Long Contexts (Liu et al., 2023): the research on attention degradation in long-context settings