What Is a Transformer, Really?

Nearly every large language model you’ve used in the last five years is, at its core, a transformer. GPT, Claude, Gemini, Llama, Mistral: all transformers. The word sounds like it should describe a power grid component. Instead, it names the architecture that quietly replaced almost everything else in natural language processing.

The short answer

A transformer is a neural network architecture built around a mechanism called self-attention. Instead of processing words one at a time (as earlier recurrent networks did), a transformer processes all the words in a sequence in parallel and calculates, for each word, how much attention to pay to every other word in the context. That parallel computation, repeated across many stacked layers, turns out to be extraordinarily good at modeling language, writing code, working through logic problems, and a dozen other tasks nobody predicted in 2017.

The long answer

Before transformers: the recurrence problem

To understand why transformers matter, you need to understand what came before.

For most of the 2010s, sequence modeling relied on recurrent neural networks (RNNs) and their improved variant, LSTMs (Long Short-Term Memory networks). These models read text left to right, maintaining a “hidden state” that was supposed to carry forward information from earlier in the sequence.

The problem: that hidden state was a fixed-size vector. As sequences grew longer, important context from early in the passage got diluted or lost entirely. Ask an LSTM to translate a 200-word paragraph and by the time it reaches the last sentence, the first sentence has mostly evaporated from its working memory.

A second problem: sequential processing was slow. You couldn’t run token 5 until you’d finished token 4. Training was bottlenecked by this dependency chain.
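Both problems show up in even a toy recurrent loop. The sketch below is a deliberate simplification: the function names are hypothetical, and scalar weights stand in for the learned weight matrices a real RNN or LSTM would use. It is only meant to show the fixed-size state and the step-by-step dependency.

```python
import math

def rnn_step(hidden, token_value, w_hidden=0.5, w_input=1.0):
    """One recurrent step: the new state is computed from the previous state."""
    return [math.tanh(w_hidden * h + w_input * token_value) for h in hidden]

def run_rnn(token_values, hidden_size=4):
    hidden = [0.0] * hidden_size      # fixed-size state, however long the input is
    for value in token_values:        # strictly sequential: step t needs step t-1
        hidden = rnn_step(hidden, value)
    return hidden

short_state = run_rnn([0.1, 0.2, 0.3])
long_state = run_rnn([0.1] * 200)
print(len(short_state), len(long_state))  # prints: 4 4
```

A 3-token input and a 200-token input are squeezed into the same four numbers, and the loop cannot be parallelized across tokens because each step consumes the previous step’s output.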

The 2017 paper that changed everything

In June 2017, a team of researchers, most of them at Google, published Attention Is All You Need. The title was a provocation: forget recurrence, forget convolutions. Attention mechanisms alone, they argued, were sufficient to build a state-of-the-art translation model.

They were right.

The paper introduced the transformer architecture: an encoder that reads the input sequence, a decoder that generates the output sequence, and attention mechanisms connecting them. The key innovation was multi-head self-attention: the model could simultaneously attend to different parts of the sequence from multiple “perspectives” at once, then combine those views.

The model outscored previously published single models on the WMT 2014 English-to-German and English-to-French translation benchmarks. And because the computation was parallelizable, training was dramatically faster.

What self-attention actually does

Here’s the core idea. For every token (word-part) in a sequence, the model asks three questions:

What am I looking for? (the Query)
What do I have to offer? (the Key)
What information should I pass forward? (the Value)

Every token generates all three, each as a vector. Attention is then calculated by comparing each token’s Query against every token’s Key, its own included. The raw comparison scores are normalized so that they sum to 1 for each token. Tokens that are highly relevant to each other get high attention weights, and their Values count more heavily in the output.

Consider a sentence like: the bank approved the loan because it had strong reserves.

What does the pronoun refer to? A human reader knows immediately: the bank. A recurrent model might struggle if those words are far apart. A transformer computes attention across the full sentence at once. In a trained model, the Query for “it” can score highly against the Key for “bank,” so the representation of “it” absorbs information from “bank.” The pronoun resolution isn’t special-cased logic; it emerges from the attention calculation.
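The Query/Key/Value comparison described above can be sketched in a few lines of plain Python. This assumes the scaled dot-product form from the original paper: dot product, divide by the square root of the key size, then softmax. The function names are hypothetical and the vectors are hand-picked for illustration; in a real model they come from learned projections of the token embeddings.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    """Return (outputs, weights) for single-head scaled dot-product attention."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Compare this token's Query against every token's Key.
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The output is a weighted average of all the Value vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy tokens. Tokens 0 and 2 are given matching Query/Key directions.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
keys    = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
values  = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs, weights = self_attention(queries, keys, values)
for row in weights:
    print([round(w, 3) for w in row])
```

In the printed weights, token 2 puts more weight on token 0 than on token 1, purely because their Query and Key vectors line up. That is the same mechanism, at toy scale, that lets a pronoun pull in information from its referent.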

Positional encoding: the kludge that works

Here’s one thing the transformer doesn’t have by default: any sense of order. Self-attention sees all tokens in parallel and has no built-in notion of which word came before which.

The solution is positional encoding: giving the model a signal that encodes each token’s position in the sequence. The original paper added a fixed sinusoidal pattern to each token embedding. Later architectures have used learned positional embeddings and methods like RoPE and ALiBi, which apply position information inside the attention computation instead of adding it to the embeddings. The core idea is the same: inject order information explicitly, because the attention mechanism itself is order-agnostic.
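Here is a minimal sketch of the sinusoidal scheme from the original paper, in which even dimensions use a sine, odd dimensions use a cosine, and the wavelengths grow geometrically with a base of 10000. The function names and the toy embeddings are hypothetical.

```python
import math

def sinusoidal_encoding(position, d_model):
    """Position vector: sin on even dimensions, cos on odd dimensions."""
    encoding = []
    for i in range(d_model):
        pair = i // 2                                   # dimensions 2k and 2k+1 share a frequency
        angle = position / (10000 ** (2 * pair / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

def add_positions(embeddings):
    """Add the position vector to each token embedding, element by element."""
    d_model = len(embeddings[0])
    return [
        [e + p for e, p in zip(emb, sinusoidal_encoding(pos, d_model))]
        for pos, emb in enumerate(embeddings)
    ]

# The same toy embedding repeated at three positions.
same_word = [[0.2, 0.4, 0.6, 0.8]] * 3
for vector in add_positions(same_word):
    print([round(x, 3) for x in vector])
```

The three printed vectors differ even though the input embedding is identical, which is exactly the order signal that attention alone cannot supply.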

Stacking layers

One attention layer doesn’t get you a capable model. Modern transformers stack dozens of these layers. Each layer reads the output of the previous one, letting the network build progressively more abstract representations. As a rough rule, early layers tend to capture surface patterns and syntax, while deeper layers tend to capture semantics, reasoning, and world knowledge, though the division is not clean.

Neither GPT-4 nor Claude 3 has a published architecture, so claims about their layer counts or parameter counts are speculation. For a documented reference point, GPT-3 has 96 layers and 175 billion parameters. Whatever the exact figures, the recipe is the same: every parameter is adjusted during training to reduce prediction error, and for recent large models that training runs over trillions of tokens.
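To get a feel for how layer count turns into parameter count, here is a back-of-the-envelope sketch. It uses the common approximation of about 12 × d_model² parameters per layer, which assumes four attention projection matrices plus a feed-forward block four times as wide as the model, and which ignores embeddings, biases, and normalization. The function name is hypothetical. The inputs are GPT-3’s published figures: 96 layers and a model width of 12,288.

```python
def approx_transformer_params(n_layers, d_model):
    """Rough non-embedding parameter count: about 12 * d_model^2 per layer."""
    attention = 4 * d_model * d_model      # Query, Key, Value, and output projections
    feed_forward = 8 * d_model * d_model   # two matrices, assuming a 4x hidden width
    return n_layers * (attention + feed_forward)

gpt3_estimate = approx_transformer_params(n_layers=96, d_model=12288)
print(f"{gpt3_estimate / 1e9:.0f}B")  # prints: 174B
```

The estimate lands within about one percent of the published 175 billion, which is why this rule of thumb is popular for quick sizing. It says nothing about models whose architectures are undisclosed.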

Encoder vs. decoder models

Not all transformers are the same. The original paper had both an encoder and a decoder. But the field split:

Encoder-only models (like BERT) read a full sequence at once and build rich representations. They’re excellent at classification, extractive question answering, and search. They aren’t built for open-ended generation.

Decoder-only models (like the GPT series and Claude) generate text left to right, using masked attention so each token only sees what came before it. They’re what most people think of when they say “LLM.”

Encoder-decoder models (like the original transformer and T5) use both: an encoder processes the input, a decoder generates the output. These were long the default for translation and summarization, though large decoder-only models now handle those tasks well too.
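The masked attention that decoder-only models rely on is simple to sketch: when computing weights for token i, only positions 0 through i are allowed to take part. The function name below is hypothetical, and the slicing approach is one of several equivalent ways to write it; production implementations usually set the masked scores to negative infinity before the softmax, which produces the same weights.

```python
import math

def causal_attention_weights(scores):
    """scores[i][j] is the raw score of query token i against key token j."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                        # token i sees positions 0..i only
        peak = max(visible)
        exps = [math.exp(s - peak) for s in visible]  # softmax over the visible part
        total = sum(exps)
        masked = [0.0] * (len(row) - len(visible))    # future positions get zero weight
        weights.append([e / total for e in exps] + masked)
    return weights

# Equal raw scores everywhere, so only the mask shapes the result.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
for row in causal_attention_weights(scores):
    print([round(w, 2) for w in row])
# prints:
# [1.0, 0.0, 0.0]
# [0.5, 0.5, 0.0]
# [0.33, 0.33, 0.33]
```

The triangular pattern is what makes left-to-right generation possible: no token’s representation ever depends on a token that has not been produced yet. An encoder-only model skips the mask, so every row would spread its weight across all three positions.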

Why this architecture still dominates

The transformer architecture dates from 2017 and is still the foundation of every frontier language model. That’s unusual in machine learning, where architectures often get displaced within a few years.

The reason it’s held on: transformers scale predictably. As you add parameters, add data, and add compute, performance keeps improving in measurable ways. The scaling laws that describe this relationship are empirical power-law fits, and they have held across multiple orders of magnitude. So far, no alternative architecture has convincingly shown better scaling at frontier size.
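The shape of such a scaling law can be sketched as a power law in parameter count plus an irreducible floor. Everything numeric below is made up for illustration: the constants are hypothetical, not fitted values from any paper or model, and the function name is an assumption. Only the functional form is the point.

```python
def predicted_loss(n_params, scale=10.0, exponent=0.1, floor=1.5):
    """Illustrative power law: loss = floor + scale * n_params ** (-exponent)."""
    return floor + scale * n_params ** (-exponent)

for n in (1e8, 1e9, 1e10, 1e11):
    reducible = predicted_loss(n) - 1.5   # the part above the assumed floor
    print(f"{n:.0e} params: loss {predicted_loss(n):.3f}, reducible {reducible:.3f}")
```

With this form, every tenfold increase in parameters multiplies the reducible part of the loss by the same factor (about 0.79 for the made-up exponent of 0.1). That regularity is what “scale predictably” means: you can fit the curve on small models and extrapolate to larger ones before paying for them.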

What’s changing isn’t the core architecture but how it’s being extended. Mixture-of-experts layers (which send each token to a few specialized sub-networks), hybrids that mix attention with state-space layers such as Mamba, and sparse attention patterns are all attempts to make transformers cheaper to run or better at long contexts. Mixture-of-experts is already common in large models, but it extends the transformer; it doesn’t replace it. So far, nothing has definitively displaced attention-based transformers for frontier use cases.

Common misconceptions

Transformers understand language the way humans do. They don’t. Transformers are pattern-matching engines operating on token co-occurrence statistics across enormous corpora. The representations they learn are useful and sometimes surprising, but they’re not semantic understanding in any philosophical sense. How much this distinction matters in practice is an active debate.

The attention mechanism is how the model thinks. Not quite. Attention weights tell you which tokens are being attended to during a forward pass. Mechanistic interpretability research suggests the real computation is more distributed and harder to read off from attention patterns alone. Attention is a useful signal, not a complete window into the model’s reasoning.

A larger context window means better reasoning. Not necessarily. A larger context window means the model can take in more tokens per call. Whether it actually uses that information effectively is a separate question. Performance on long-context tasks often degrades as the input grows, even when it still fits inside the window.

Where to learn more

The Illustrated Transformer by Jay Alammar: the best visual walkthrough of attention mechanics for non-specialists.
Attention Is All You Need: the original paper is more readable than most academic ML papers. The abstract and introduction alone are worth the time.