How LLM Context Windows Actually Work Under the Hood

When you paste a 50-page report into Claude and it answers a question about page 38, it didn’t read to page 38. That’s not how any of this works.

The short answer

A context window is the full body of text a language model can see during a single inference call. Every token inside it is available to the model simultaneously when generating each new token. Nothing outside the window exists. Claude 3.7 Sonnet supports up to 200,000 tokens at once. That’s roughly 150,000 words of English, about the length of a full novel, all held in view at the same time. The model doesn’t skim it, summarize it, or build notes as it goes. It just… sees all of it.

The long answer

What counts as a token

The OpenAI tokenizer is a useful tool for getting intuition here, though each model family uses its own tokenizer, so exact counts differ between models. A token is roughly 0.75 words in English prose on average, though it varies significantly by language and content type. A long word like “tokenization” typically splits into two or three tokens, depending on the tokenizer. “the” is usually one. Code tokenizes differently from prose, and emoji often consume two to four tokens each.
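The 0.75-words-per-token rule of thumb can be turned into a quick estimator. This is only a sketch: the function name and the whitespace-based word count are illustrative assumptions, and an accurate count requires the tokenizer of the specific model you are calling.

```python
import math


def estimate_tokens(text: str, words_per_token: float = 0.75) -> int:
    """Rough token estimate for English prose (hypothetical helper).

    Assumes the ~0.75 words-per-token average; real tokenizers will differ,
    especially for code, emoji, and non-English text.
    """
    word_count = len(text.split())
    # tokens ~= words / 0.75, rounded up so the estimate errs on the high side.
    return math.ceil(word_count / words_per_token)


# A 750-word passage comes out at roughly 1,000 tokens under this rule.
print(estimate_tokens("word " * 750))
```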

This matters because every piece of text in a single API call counts against the context window: the system prompt, the full conversation history, the document you pasted, and the model’s own previous replies. If you’re near the limit, something gets cut. Applications typically drop the oldest messages, but the exact behavior depends on how the app is built. Many users don’t realize they’ve pushed old context out until they notice the model has “forgotten” something it acknowledged earlier.
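The “drop the oldest messages” strategy can be sketched in a few lines. Everything here is an assumption for illustration: the message format (dicts with a "content" key), the `trim_history` name, and the crude word-based token estimate standing in for a real tokenizer. Actual applications may summarize, pin, or prioritize messages differently.

```python
import math
from typing import Dict, List


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: ~0.75 words per token."""
    return math.ceil(len(text.split()) / 0.75)


def trim_history(
    system_prompt: str, messages: List[Dict[str, str]], max_tokens: int
) -> List[Dict[str, str]]:
    """Keep the system prompt, then as many of the newest messages as fit."""
    budget = max_tokens - estimate_tokens(system_prompt)
    kept: List[Dict[str, str]] = []
    # Walk from newest to oldest and stop at the first message that does not fit.
    for message in reversed(messages):
        cost = estimate_tokens(message["content"])
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    kept.reverse()  # restore chronological order
    return kept
```

With a tight budget, the earliest turns silently disappear from the returned list, which is exactly the “forgetting” users notice.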

How attention actually works

The architecture behind nearly every major language model is the transformer, introduced in the 2017 paper “Attention Is All You Need” by Ashish Vaswani and colleagues, most of them at Google. The defining mechanism is self-attention.

The intuition: imagine every word in a document broadcasting a query to the other words, asking “are you relevant to me?” Each pairing gets a relevance score, and the model builds a weighted representation of the context from those scores. In the decoder-style models behind most chat assistants, a word can only look backward, at the text that precedes it. “President” attends heavily to “Biden” three sentences earlier. “error” attends to “function” and “line 47” two paragraphs up. “however” attends to the preceding claim it’s about to contrast with.
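The query-and-score intuition corresponds to scaled dot-product attention. Below is a minimal single-head sketch in plain Python, assuming the decoder-style causal masking used by GPT-style models, in which each position attends only to itself and earlier positions. The function names are illustrative, and the query, key, and value vectors are passed in directly; in a real model they come from learned projections of each token’s representation.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Turn raw relevance scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(
    queries: Matrix, keys: Matrix, values: Matrix
) -> Tuple[Matrix, Matrix]:
    """Single-head scaled dot-product attention with a causal mask."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs: Matrix = []
    all_weights: Matrix = []
    for i, query in enumerate(queries):
        # Position i scores itself and every earlier position (0..i).
        scores = [
            sum(q * k for q, k in zip(query, keys[j])) / scale
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # The output is the weighted average of the visible value vectors.
        outputs.append(
            [sum(w * values[j][d] for j, w in enumerate(weights)) for d in range(dim)]
        )
        all_weights.append(weights)
    return outputs, all_weights
```

Each row of the returned weights is the “relevance score” distribution for one token over the tokens it can see.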

This happens across multiple attention heads in parallel. No head is assigned a job in advance, but during training different heads tend to specialize in different types of relationships: syntactic, semantic, referential, positional. The outputs of all heads are combined, giving each token a representation that encodes how it relates to the other tokens in the window.

The result is a model that can connect a detail in paragraph 3 to a question about paragraph 47, provided both sit inside the context window. There’s no explicit lookup or cross-referencing logic built by engineers. It emerges from the attention scores learned during training.

The KV cache

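One of the key optimizations making long-context models economically viable is the KV (key-value) cache. During inference, the model computes key and value vectors for every token at every layer. Those vectors depend only on the token and the text before it, so once computed they can be stored: each newly generated token attends to the cached keys and values instead of recomputing them for the whole context. The same idea extends across calls. If the opening part of the context is identical across multiple calls, its keys and values don’t need to be recomputed each time.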
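A toy version of the idea, for a single attention head: each decoding step appends one new key and value to the cache and attends over everything stored so far, rather than rebuilding keys and values for the whole context. The class and function names are hypothetical, and the vectors are supplied directly; in a real model the saving comes from not re-running the projection layers over earlier tokens.

```python
import math
from typing import List

Vector = List[float]


class KVCache:
    """Stores the key and value vector of every token seen so far."""

    def __init__(self) -> None:
        self.keys: List[Vector] = []
        self.values: List[Vector] = []

    def append(self, key: Vector, value: Vector) -> None:
        self.keys.append(key)
        self.values.append(value)


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Scaled dot-product attention for one query over the given keys/values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum((e / total) * v[d] for e, v in zip(exps, values)) for d in range(dim)]


def decode_step(cache: KVCache, query: Vector, key: Vector, value: Vector) -> Vector:
    """Process one new token: cache its key/value, then attend over the cache."""
    cache.append(key, value)  # only the new token's vectors are added
    return attend(query, cache.keys, cache.values)
```

The output of each step matches what a full recomputation would give, because earlier keys and values never change once the tokens before them are fixed.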

This is the mechanism behind Anthropic’s prompt caching feature. A long, stable system prompt or document can be prefixed to requests in a cacheable block. You pay the full computation cost on the first call. Subsequent calls that hit the same prefix get a significant discount. For high-volume applications where many users send queries against the same large document, caching can reduce inference costs substantially.
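To get a feel for the savings, here is a back-of-the-envelope estimator. The multipliers are assumptions, not quoted prices: it supposes that writing a prefix to the cache costs somewhat more than a normal input token (1.25x here) and that reading it back costs a fraction (0.10x here). Check the provider’s current pricing before relying on either number. The sketch also assumes the cache never expires between calls.

```python
def input_cost_units(
    prefix_tokens: int,
    query_tokens: int,
    calls: int,
    cached: bool,
    write_multiplier: float = 1.25,  # assumed premium for the first (cache-writing) call
    read_multiplier: float = 0.10,   # assumed discount for later cache hits
) -> float:
    """Total input cost in 'base token units' (1.0 = one uncached input token).

    Expects calls >= 1.
    """
    if not cached:
        return float(calls * (prefix_tokens + query_tokens))
    first_call = prefix_tokens * write_multiplier + query_tokens
    later_calls = (calls - 1) * (prefix_tokens * read_multiplier + query_tokens)
    return first_call + later_calls


# Ten questions against the same 100,000-token document.
uncached = input_cost_units(100_000, 100, calls=10, cached=False)
cached = input_cost_units(100_000, 100, calls=10, cached=True)
print(f"uncached: {uncached:,.0f} units, cached: {cached:,.0f} units")
```

Under these assumed multipliers the cached workload costs roughly a fifth of the uncached one, and the gap widens as the number of calls against the same prefix grows.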

Why long context gets expensive

Self-attention doesn’t scale linearly with context length. Because every token is scored against every other token, the work grows with the square of the number of tokens: double the context length and the attention compute roughly quadruples. This is the core engineering challenge in building long-context models.
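The arithmetic is easy to check. This sketch counts the query-key pairs that full self-attention scores in one layer for one head; the function name is illustrative, and real cost also depends on model width, layer count, and implementation details.

```python
def attention_pair_count(context_tokens: int) -> int:
    """Number of (query, key) pairs scored by full self-attention."""
    return context_tokens * context_tokens


for n in (50_000, 100_000, 200_000):
    print(f"{n:>7,} tokens -> {attention_pair_count(n):,} pairs")

# Doubling the context multiplies the pair count by four.
ratio = attention_pair_count(200_000) / attention_pair_count(100_000)
print(ratio)
```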

Sparse attention architectures address this by having each token attend only to a subset of the context rather than to every other token. Some models implement sliding window attention, where tokens attend mainly to nearby tokens, with a small number of global tokens or full-attention layers handling long-range dependencies. These approaches trade some expressiveness for much better scaling behavior at lengths that would otherwise be impractical.
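A sliding window is easiest to see as a mask. The sketch below assumes a causal window in which each token sees itself and the tokens immediately before it, up to `window` positions in total; the function names are hypothetical, and production models combine this with other mechanisms for long-range links.

```python
from typing import List


def sliding_window_mask(context_tokens: int, window: int) -> List[List[bool]]:
    """mask[i][j] is True when token i may attend to token j."""
    return [
        [i - window < j <= i for j in range(context_tokens)]
        for i in range(context_tokens)
    ]


def attended_pairs(mask: List[List[bool]]) -> int:
    """Count how many (query, key) pairs the mask allows."""
    return sum(sum(row) for row in mask)


def full_causal_pairs(context_tokens: int) -> int:
    """Pairs allowed when every token sees itself and all earlier tokens."""
    return context_tokens * (context_tokens + 1) // 2
```

With a fixed window, the number of allowed pairs grows roughly linearly with context length (about `context_tokens * window`), instead of quadratically.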

The result is a landscape where context length and inference cost are genuinely in tension. Pasting your entire codebase into a 200K context is technically feasible, but whether it’s cost-effective depends heavily on your API volume and how stable that context is across calls.

What the model “remembers”

Here’s the subtle part. The context window is not memory in any human sense. The model doesn’t read your document once, form an understanding, and then answer from that understanding. It generates each output token by attending, in real time, to its internal representations of every token in the window.

Ask the model “what did the report say about Q2 revenue?” and it doesn’t recall a summary it formed while reading. It attends over the full context, picks up the passages that relate to “Q2 revenue”, and synthesizes a response. This is fast and powerful, but it has a specific failure mode.

When the relevant text is buried in the middle of a very long context, models can underweight it relative to text near the beginning and end of the window. Researchers at Stanford and collaborators (Liu et al., 2023) documented this pattern across multiple models: performance on retrieval tasks degrades for content placed in the middle of long contexts, even when the content fits comfortably inside the model’s context window. The researchers called it “lost in the middle.”

This is why retrieval-augmented generation (RAG) still has a role even as context windows expand. Retrieving and surfacing the two most relevant chunks often outperforms giving the model 50 chunks to sort through, because precision of context matters alongside size of context.
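The retrieval step can be sketched with nothing but the standard library. This version scores chunks by bag-of-words cosine similarity, which is a deliberate simplification: production RAG systems typically use learned embeddings and a vector index. All function names here are illustrative.

```python
import math
import re
from collections import Counter
from typing import List


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a crude stand-in for an embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    numerator = sum(a[term] * b[term] for term in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return numerator / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_chunks(question: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return the k chunks most similar to the question, best first."""
    query = bag_of_words(question)
    ranked = sorted(chunks, key=lambda c: cosine(query, bag_of_words(c)), reverse=True)
    return ranked[:k]
```

Only the returned chunks go into the prompt, so the model attends over a few relevant passages instead of the whole corpus.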

Why this matters in 2026

Context windows have grown so large that some teams are treating them as a substitute for retrieval infrastructure. For a codebase under 150,000 tokens, context stuffing is sometimes simpler and more accurate than maintaining a vector index, because the model sees everything in one shot rather than relying on similarity search to fetch the right pieces.

The economics don’t always follow, though. Filling a 200K-token context on every API call is expensive at current rates, especially at any meaningful request volume. The sweet spot for most production use cases is still some combination of retrieval (to narrow down what goes in the window) plus a reasonably sized context (to give the model enough room to reason across it).

What’s genuinely new in 2026 is how cheaply you can now get to 100K tokens of effective context versus three years ago. The cost curve has dropped dramatically. Long-context workflows that required expensive API tiers in 2023 are now accessible at standard pricing. That’s changing what kinds of applications are worth building.

Common misconceptions

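“The model reads the context before answering.” There’s no sequential read-through of the prompt. Attention is computed across the full context in parallel. The model doesn’t need to “get to” a part of the document: it sees it all at once in each layer. What is sequential is the output, which is generated one token at a time, with each new token attending to everything before it.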

“A larger context window means better answers.” Not automatically. More context can mean more noise. A model given 50 irrelevant pages alongside 2 relevant ones can perform worse than the same model given just the 2 relevant pages. Precision of context matters as much as volume.

“Tokens are just words.” They’re not. A single word can span multiple tokens, a single token can be a fragment of a word, and the same sequence of characters can tokenize differently depending on what surrounds it, for instance whether it follows a space. Short, common English words tend to be single tokens. Technical vocabulary, code, and non-English text often tokenize less efficiently.

“The model loses information near its context limit.” Not in the literal sense. Most current models encode position with relative schemes such as rotary position embeddings (RoPE) rather than a fixed table, and within the length a model was trained for, a token at position 200,000 is represented with the same capacity as a token at position 1. What changes is how attention weight is distributed, as in the “lost in the middle” degradation pattern described above. The model can see everything in the window; the issue is attention weight distribution, not literal information loss.