What Is a Mixture of Experts Model?

Most of the frontier AI models you use every day don’t run all their parameters when processing your message. That’s not a bug. It’s the entire point.

The short answer

A mixture of experts (MoE) model is a neural network architecture that divides its parameters into discrete subnetworks called “experts.” When the model processes a token (a word or word-piece), a small learned network called a router decides which experts activate for that token. Only a fraction of the total experts fire at once, which means the model can be enormous in total size while remaining fast and cheap to run.

The key insight: a model with 100 billion total parameters might activate only 10 to 20 billion per token. You get the knowledge encoded in a very large model with the inference cost of a much smaller one.

The long answer

Where this came from

The concept of mixture of experts predates modern deep learning by decades. Statisticians used it for combining predictions from multiple models. What changed in 2017 was the work of Noam Shazeer and colleagues at Google, who introduced the sparsely-gated MoE layer in “Outrageously Large Neural Networks.” Their paper showed you could scale to 137 billion parameters while keeping training costs manageable by routing each token to just a handful of expert subnetworks out of thousands. That paper is where modern MoE architectures trace their lineage.

Google’s 2021 Switch Transformer pushed the idea further, scaling to as many as 1.6 trillion parameters while routing each token to only one expert per MoE layer. The trade-off for extreme sparsity is that each expert sees less total training data, so balancing expert load becomes a real engineering problem.

MoE became widely known when Mistral AI published Mixtral 8x7B in late 2023, an open-weight model with 8 experts per layer and 2 active per token. Total parameters: 46.7 billion. Active parameters per token: roughly 12.9 billion. Mixtral matched or exceeded Llama 2 70B on most benchmarks while running considerably faster at inference. That result made it hard for anyone to ignore the architecture.

How routing works

In the MoE layers of such a model (every transformer layer in some designs, every other layer in others), the single feed-forward network is replaced by N expert networks, often 8, 16, or 64 depending on the design. Alongside them sits a routing network: a small learned transformation that maps each token’s internal representation to a score for each expert.

The top-K experts by score (usually K=1 or K=2) receive the token. Each activated expert processes it and produces an output. Those outputs are combined in proportion to their normalized routing scores, and the result flows forward to the next layer. The router itself is tiny relative to the experts, so its compute overhead is negligible.
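To make the routing step concrete, here is a minimal sketch in plain Python. It assumes one common variant, in which the softmax is taken over only the selected experts’ scores; other designs normalize before selecting. The names moe_layer and make_toy_expert are hypothetical, and the toy experts simply scale their input instead of running a real feed-forward network.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def moe_layer(token, router_weights, experts, k=2):
    """Route one token vector through its top-k experts.

    router_weights: one weight row per expert (num_experts x d_model).
    experts: list of callables, each mapping a vector to a vector.
    Returns (output_vector, chosen_expert_indices, gate_weights).
    """
    # 1. The router scores every expert for this token.
    logits = matvec(router_weights, token)
    # 2. Keep only the k highest-scoring experts.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # 3. Normalize the selected scores so the gate weights sum to 1.
    gates = softmax([logits[i] for i in top])
    # 4. Run only the chosen experts and mix their outputs by gate weight.
    output = [0.0] * len(token)
    for gate, idx in zip(gates, top):
        expert_out = experts[idx](token)
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output, top, gates

def make_toy_expert(scale):
    """Hypothetical stand-in for a feed-forward expert: scales the input."""
    return lambda vec: [scale * x for x in vec]

rng = random.Random(0)
d_model, num_experts = 4, 8
router_weights = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(num_experts)]
experts = [make_toy_expert(i + 1) for i in range(num_experts)]
token = [rng.gauss(0, 1) for _ in range(d_model)]

output, chosen, gates = moe_layer(token, router_weights, experts, k=2)
print("experts used:", chosen, "gate weights:", [round(g, 3) for g in gates])
```

Only two of the eight toy experts are ever called for this token, which is the whole source of the compute savings. A production implementation would batch tokens per expert instead of looping one token at a time.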

Load balancing is where most of the engineering complexity lives. If one expert attracts the majority of tokens during training, the others see too little data and specialize poorly. Standard practice is to add an auxiliary load-balancing term to the training loss that penalizes uneven routing. Getting this right is one of the reasons MoE training is harder than dense training at equivalent scale.
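One widely cited form of that auxiliary term comes from the Switch Transformer paper: multiply, for each expert, the fraction of tokens sent to it by the average router probability it received, sum over experts, and scale by the number of experts. The sketch below assumes top-1 routing and a coefficient alpha of 0.01; both are illustrative choices, and other models use different formulations.

```python
def load_balancing_loss(router_probs, alpha=0.01):
    """Auxiliary loss in the style of the Switch Transformer (top-1 routing).

    router_probs: one list per token, holding that token's softmax
    probabilities over experts.
    Returns alpha * N * sum_i(f_i * P_i), where f_i is the fraction of tokens
    whose top choice is expert i and P_i is the mean probability for expert i.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    # f_i: share of tokens actually dispatched to each expert.
    counts = [0] * num_experts
    for probs in router_probs:
        counts[max(range(num_experts), key=lambda i: probs[i])] += 1
    fractions = [c / num_tokens for c in counts]
    # P_i: average router probability assigned to each expert.
    mean_probs = [
        sum(p[i] for p in router_probs) / num_tokens for i in range(num_experts)
    ]
    return alpha * num_experts * sum(f * p for f, p in zip(fractions, mean_probs))

# Balanced case: four tokens, each preferring a different expert.
balanced = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
# Collapsed case: every token prefers expert 0.
collapsed = [[0.7, 0.1, 0.1, 0.1] for _ in range(4)]

balanced_loss = load_balancing_loss(balanced)
collapsed_loss = load_balancing_loss(collapsed)
print("balanced:", balanced_loss, "collapsed:", collapsed_loss)
```

With these toy inputs the balanced batch scores about 0.01 and the collapsed batch about 0.028, so the penalty grows as tokens pile into one expert. In real training the gradient flows through the mean probabilities, since the dispatch fractions are not differentiable.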

Memory versus compute: the key distinction

This is the part that confuses most people.

Memory: An MoE model needs all its experts resident in memory during serving. Mixtral 8x7B’s 46.7 billion parameters require roughly 93 GB in half-precision (FP16), even though only 12.9 billion parameters activate per forward pass. You need the hardware to hold all of them.

Compute: Because only K experts fire per token, the actual matrix multiplications per forward pass track the active parameter count, not the total. This is why throughput is high relative to model capacity.
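The gap between total and active parameters can be worked out by hand. The sketch below uses layer shapes taken from Mixtral 8x7B’s publicly released model configuration (they are not stated in this article, so treat them as assumptions), and the common rule of thumb of roughly 2 FLOPs per active parameter per token, which ignores attention-score computation.

```python
# Shape values as published in Mixtral 8x7B's released model configuration.
D_MODEL = 4096      # hidden size
D_FF = 14336        # expert feed-forward inner size
N_LAYERS = 32
N_EXPERTS = 8
TOP_K = 2
VOCAB = 32000
N_KV_HEADS = 8      # grouped-query attention
HEAD_DIM = 128

def mixtral_style_param_counts():
    """Return (total, active_per_token) parameter counts for this layout."""
    expert = 3 * D_MODEL * D_FF  # gate, up, and down projections
    kv_dim = N_KV_HEADS * HEAD_DIM
    attention = 2 * D_MODEL * D_MODEL + 2 * D_MODEL * kv_dim  # q, o, k, v
    router = D_MODEL * N_EXPERTS
    norms = 2 * D_MODEL
    shared_per_layer = attention + router + norms
    # Input embedding table, output head, and the final norm.
    embeddings = 2 * VOCAB * D_MODEL + D_MODEL
    total = N_LAYERS * (shared_per_layer + N_EXPERTS * expert) + embeddings
    active = N_LAYERS * (shared_per_layer + TOP_K * expert) + embeddings
    return total, active

total, active = mixtral_style_param_counts()
print(f"total:  {total / 1e9:.1f}B parameters (all must sit in memory)")
print(f"active: {active / 1e9:.1f}B parameters per token (drives compute)")
print(f"rough compute: {2 * active / 1e9:.0f} GFLOPs per token")
```

Under these assumptions the count lands on the 46.7 billion total and 12.9 billion active figures quoted above. It also shows why “8x7B” does not mean 56 billion: attention, embeddings, and norms are shared across experts, and only the feed-forward blocks are replicated.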

The practical implication: MoE models are well-suited to data centers with abundant GPU memory and high request volume, where the per-token compute savings add up fast. Running Mixtral locally is harder. Even quantized to 8 bits per weight, the model needs around 47 GB for its weights alone, which puts it out of reach for most consumer hardware. A dense model of comparable capability, such as Llama 2 70B, would need even more memory at the same precision and run slower. Neither option is comfortable on a 16 GB laptop.
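The memory side is simple arithmetic: parameter count times bytes per parameter. A small sketch, assuming weights only (no KV cache, no activations, and none of the per-block overhead that real quantization formats add):

```python
TOTAL_PARAMS = 46.7e9  # Mixtral 8x7B total, as quoted above

def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gb(TOTAL_PARAMS, bits):.1f} GB")
```

Halving the precision halves the footprint, but every expert still has to be resident, so the total parameter count sets the floor no matter how few experts fire per token.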

Dense versus sparse

Dense models activate every parameter for every token. GPT-2, early Llama models, and most consumer-facing small models work this way. Dense is simpler to train, more predictable in its behavior, and makes fuller use of its memory at serving time, since every resident parameter does work on every token. The cost is that scaling capacity requires proportionally more compute.

MoE models can achieve higher capacity per compute dollar spent, but they come with failure modes dense models don’t have. Routing collapse during training (where all tokens pile into one or two experts) is a real risk. Expert load imbalance creates uneven specialization. Serving infrastructure needs to handle variable activation patterns efficiently, which complicates deployment.

The field has largely concluded that MoE is worth the complexity at scale. GPT-4 is widely reported to use an MoE architecture, Google has described Gemini 1.5 as one, and xAI released Grok-1 as an open-weight MoE model. Neither OpenAI nor Google has published complete architecture specifications, so many of the details remain unconfirmed.

Why this matters in 2026

MoE is not a niche trick anymore. It is the dominant architectural choice for frontier labs trying to push capability without proportionally increasing inference costs. When a model improves sharply on benchmarks without a corresponding jump in API pricing, MoE is often a contributing factor.

For developers building on AI APIs from Africa or other emerging markets, where dollar costs matter more relative to local revenue, the practical effect is favorable. MoE architectures are part of why API prices have fallen so fast. A provider can deploy a much more capable model than inference costs alone would suggest, and pass some of that efficiency to customers. Access to genuinely frontier models at consumer price points would have been economically implausible without architectural improvements of this kind.

The other implication: as open-weight MoE models continue to improve, the gap between what you can run locally and what the top API providers offer is narrowing in capability terms, even if memory requirements remain a barrier.

Common misconceptions

“MoE models are faster because they have fewer parameters.” They don’t have fewer parameters; the total count is every bit as large as the headline number. They’re faster because fewer parameters activate per forward pass. The distinction matters because serving cost scales with memory as well as compute.

“Each expert specializes in a domain, like one for coding and one for math.” Intuitive but not accurate. Experts develop statistical specializations from training data, not explicit topic labels. An expert might activate heavily for certain syntactic patterns rather than for a human-legible category. The specializations are real but they don’t map neatly to subject areas.

“You can keep adding experts to make a model smarter.” More experts increase total capacity, but each expert needs enough training data to specialize effectively. The router can become a bottleneck. Load balancing gets harder. Past a point, returns diminish and training instability increases.

“MoE only makes sense for giant models.” Recent work on smaller MoE variants, including MoE adaptations of vision-language models, shows the technique applies at more modest scales. The efficiency gains are less dramatic below a few billion parameters, but the architecture is viable there.