Categories
AI

What Is a Mixture of Experts Model?

Most of the largest AI models running today don’t use all their parameters when they answer you.

The short answer

A mixture of experts (MoE) model splits its neural network into multiple specialized sub-networks, called experts, and uses a learned router to activate only a small subset of them for each input token. Instead of every parameter firing for every token, only the relevant slice lights up. This makes large models faster and cheaper to run than their raw parameter counts suggest, and it is part of the reason frontier AI has gotten dramatically cheaper to serve over the past two years.

The long answer

How a conventional dense model processes tokens

In a standard transformer-based language model, every forward pass activates every parameter. Feed in a token and the full weight matrices do work on it, regardless of what the token is. At smaller model sizes this is manageable. At 70 billion or 100 billion parameters, the compute cost per token becomes significant: a forward pass costs roughly two floating-point operations per parameter, so a 70-billion-parameter model spends on the order of 140 billion operations on every single output token, whether the word is “the” or a rare technical term that plausibly needs more processing.

Dense models scale well up to a point, but the compute requirements grow proportionally with parameters. There’s no free lunch built into the architecture.

What “experts” actually are

In an MoE model, the dense feed-forward layers inside each transformer block are replaced by a set of parallel sub-networks: the experts. Structurally, each expert resembles the feed-forward network (FFN) you’d find in a conventional dense model. The difference is that instead of one giant FFN doing all the work, you have multiple smaller FFNs that the model can route tokens through selectively.

These experts aren’t hand-labeled or manually assigned subject areas. The model discovers its own routing patterns during training. Researchers have found that experts do develop behavioral specializations, though what those specializations look like from the outside isn’t always obvious. Some studies find syntactic or structural patterns. Others find topic-adjacent clustering. It varies by model and training setup.

What matters functionally is that the model learns to send different tokens to different experts, and the result is a distributed system that performs well across diverse inputs without requiring every component to handle every input.

The router

Between the attention layer and the experts sits a gating network, called the router. It takes the current token’s representation and outputs a probability distribution across all available experts. The model then selects the top-K experts (most commonly top-2) and routes the token through those and only those.

The outputs from the selected experts are weighted by their probability scores and combined to produce the layer’s final output. The rest of the experts do no work on that token.
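The routing step described above can be sketched in a few lines of plain Python. This is a toy illustration, not any particular model’s implementation: the function names, the four scaling “experts”, and the choice to renormalize the top-K probabilities so they sum to 1 are all assumptions made for the example.

```python
import math

def softmax(logits):
    """Turn raw router scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(token, router_logits, experts, k=2):
    """Route one token through its top-k experts and mix their outputs."""
    probs = softmax(router_logits)
    # Indices of the k most probable experts, best first.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Assumption: renormalize the selected probabilities so they sum to 1.
    norm = sum(probs[i] for i in chosen)
    output = [0.0] * len(token)
    for i in chosen:
        weight = probs[i] / norm
        expert_out = experts[i](token)  # only the selected experts run
        output = [o + weight * v for o, v in zip(output, expert_out)]
    return chosen, output

# Hypothetical experts: expert i just scales the token by (i + 1).
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(4)]
chosen, output = moe_layer([1.0, 2.0], [0.1, 2.0, -1.0, 1.0], experts, k=2)
```

Here experts 1 and 3 receive the token and experts 0 and 2 are never called, which is where the compute saving comes from.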

As Hugging Face’s overview of MoE architectures explains, the routing decision happens independently at each layer, which means a token might go through different expert combinations at different depths of the model.

Active vs total parameters: the core trade-off

Mixtral 8x7B, released by Mistral AI in December 2023, demonstrates the architecture concretely. The model has 46.7 billion total parameters, but only 12.9 billion are active per token, because top-2 routing means only 2 of the 8 experts in each layer process any given token.

That gap is the whole point. You pay the training cost across all 46.7 billion parameters: every expert needs to learn something useful, every expert participates in gradient updates. But you pay the inference compute cost against roughly 12.9 billion active parameters per token. The result is a model that competes with much larger dense models on benchmarks, while running inference at speeds and costs closer to a 12-13B dense model.
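A quick back-of-the-envelope calculation shows why “8x7B” comes to 46.7 billion rather than 56 billion. The sketch below assumes a simplified model in which parameters are either shared by every token (attention, embeddings, router) or belong to one of the 8 experts; the function name and that two-bucket split are assumptions for illustration, and the resulting figures are estimates derived from the two numbers above, not published values.

```python
def split_parameters(total_b, active_b, n_experts, top_k):
    """Solve the two-equation system (all values in billions):
         total  = shared + n_experts * per_expert
         active = shared + top_k     * per_expert
    """
    per_expert = (total_b - active_b) / (n_experts - top_k)
    shared = active_b - top_k * per_expert
    return shared, per_expert

shared, per_expert = split_parameters(46.7, 12.9, n_experts=8, top_k=2)
active_fraction = 12.9 / 46.7

print(f"shared parameters:  ~{shared:.1f}B")
print(f"each expert (all layers combined): ~{per_expert:.1f}B")
print(f"active fraction per token: {active_fraction:.0%}")
```

Under these assumptions each expert holds well under 7 billion parameters, and a little over a quarter of the model does work on any one token.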

This ratio between total and active parameters is now one of the key dimensions to understand when evaluating a model, separate from either number taken alone.

Load balancing and expert collapse

MoE training has one characteristic failure mode: if the router learns to favor one expert overwhelmingly, that expert gets the bulk of the gradient signal and improves, while the others receive little signal and stagnate. You end up with one expert doing most of the work, which defeats the purpose.

To prevent this, MoE training adds an auxiliary load-balancing loss that penalizes the model when traffic to experts becomes too unequal. The goal is to keep utilization distributed across the available experts.

The tradeoff is real: too aggressive a load-balancing constraint forces tokens to sub-optimal experts for the sake of fairness, which hurts quality. Too permissive and you risk collapse. Papers in this area treat the balance between quality and utilization as a live research problem, not a solved one.
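One common form of the auxiliary loss comes from the Switch Transformer paper: multiply, for each expert, the fraction of tokens sent to it by the average router probability it received, sum across experts, and scale by the number of experts. The sketch below uses that formulation with top-1 assignment for simplicity; it is one example of a load-balancing loss, not necessarily the one used by the models named in this article, and the `alpha` value is an arbitrary placeholder.

```python
def load_balancing_loss(router_probs, assignments, n_experts, alpha=0.01):
    """Switch-style auxiliary loss: alpha * N * sum_i(f_i * p_i).

    router_probs: one probability list per token (length n_experts each)
    assignments:  the expert index each token was actually sent to
    """
    n_tokens = len(router_probs)
    # f[i]: fraction of tokens dispatched to expert i
    f = [0.0] * n_experts
    for expert in assignments:
        f[expert] += 1.0 / n_tokens
    # p[i]: mean router probability given to expert i
    p = [sum(row[i] for row in router_probs) / n_tokens for i in range(n_experts)]
    return alpha * n_experts * sum(fi * pi for fi, pi in zip(f, p))

# Evenly spread traffic versus a router that sends everything to expert 0.
balanced = load_balancing_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1], n_experts=2)
collapsed = load_balancing_loss([[0.9, 0.1]] * 4, [0, 0, 0, 0], n_experts=2)
```

The collapsed router scores a higher loss than the balanced one, so gradient descent is nudged back toward spreading tokens out. The size of `alpha` is the knob behind the tradeoff described above.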

The memory constraint

Here’s what consistently surprises people. MoE models have lower compute requirements per inference step, but their memory requirements don’t shrink proportionally. All 46.7 billion parameters of Mixtral 8x7B need to sit in GPU memory, even though only 12.9 billion are doing active computation on any given token.

You need the VRAM to hold the full model. A GPU that can comfortably run a 13B dense model can’t run Mixtral 8x7B without quantization. The compute savings are real, but memory footprint stays tied to total parameter count, not active parameter count. Local deployment planning has to account for this gap.
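The gap is easy to estimate. The sketch below counts weight storage only, assuming 1 GB = 10^9 bytes and ignoring the KV cache, activations, and framework overhead, so real requirements will be higher. The function name and the precision choices are illustrative.

```python
def weight_memory_gb(params_billion, bits_per_param):
    """Approximate memory needed just to hold the weights, in GB."""
    return params_billion * bits_per_param / 8

mixtral_fp16 = weight_memory_gb(46.7, 16)   # all experts must be resident
mixtral_4bit = weight_memory_gb(46.7, 4)    # after 4-bit quantization
dense_13b_fp16 = weight_memory_gb(13.0, 16)

print(f"Mixtral 8x7B, 16-bit: ~{mixtral_fp16:.1f} GB")
print(f"Mixtral 8x7B, 4-bit:  ~{mixtral_4bit:.1f} GB")
print(f"13B dense, 16-bit:    ~{dense_13b_fp16:.1f} GB")
```

Sizing by the 12.9 billion active parameters instead would underestimate the 16-bit footprint by more than a factor of three.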

Why this matters in 2026

MoE is no longer experimental. Several of the most capable models publicly available use this architecture. Grok 1 from xAI is a confirmed MoE with 8 experts. DeepSeek V2 uses a refined variant called DeepSeekMoE that further separates shared from routed experts.

The pattern across all of them is the same: sparse activation allows teams to build models with far more total capacity than they could deploy economically as dense architectures. The representational richness of a very large model, at inference costs closer to a smaller one.

For developers and businesses in Africa integrating AI via API, this matters for a direct reason. The economics of sparse inference are better than dense inference at comparable capability levels. When frontier providers run enormous models behind APIs and still offer declining prices, MoE architecture is part of how that’s possible. The technology isn’t abstract: it’s in the cost structure of tools you’re already using.

The practical implication: stop comparing models by total parameter count as a proxy for capability or cost. A large MoE and a smaller dense model can be similar in inference cost while being very different in capability. Understanding which dimension matters for your use case requires knowing what kind of model you’re working with.

Common misconceptions

“Experts specialize by topic.” Appealing intuition, but not quite right. Experts do develop behavioral patterns, but they’re not cleanly sorted by subject matter in most analyses. You can’t say “expert 3 handles legal text” with confidence. The specializations are subtler and less interpretable than that.

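“MoE always beats dense models of the same total size.” The advantage of MoE is in matching performance while reducing inference FLOPs, not in getting more capability from a fixed parameter budget. A well-trained dense model with the same total parameter count is generally at least as capable; it simply costs more compute per token. The fair cost comparison is against a dense model of similar active size, and that is where MoE tends to come out ahead.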

“You can run a large MoE with the same memory as a model of its active size.” Wrong. Memory scales with total parameters. Compute scales with active parameters. People confuse these constantly. Plan for total size when estimating VRAM requirements.

“GPT-4 is definitely a MoE.” Plausible but unconfirmed by OpenAI. Architecture details for most frontier models remain proprietary. Concluding that every frontier model is MoE because some are is a leap the public evidence doesn’t support.


What Is a Transformer, Really?

Nearly every large language model you’ve used in the last five years is, at its core, a transformer. GPT, Claude, Gemini, Llama, Mistral: all transformers. The word sounds like it should describe a power grid component. It describes the architecture that quietly replaced everything else in natural language processing.

The short answer

A transformer is a neural network architecture built around a mechanism called self-attention. Instead of processing words one at a time (like earlier recurrent networks), a transformer processes all the words in a sequence simultaneously and calculates, for each word, how much attention to pay to every other word in the context. That parallel computation, combined with stacking many layers of it, turns out to be extraordinarily good at understanding language, writing code, reasoning about logic, and a dozen other tasks nobody predicted in 2017.

The long answer

Before transformers: the recurrence problem

To understand why transformers matter, you need to understand what came before.

For most of the 2010s, sequence modeling relied on recurrent neural networks (RNNs) and their improved variant, LSTMs (Long Short-Term Memory networks). These models read text left to right, maintaining a “hidden state” that was supposed to carry forward information from earlier in the sequence.

The problem: that hidden state was a fixed-size vector. As sequences grew longer, important context from early in the passage got diluted or lost entirely. Ask an LSTM to translate a 200-word paragraph and by the time it reaches the last sentence, the first sentence has mostly evaporated from its working memory.

A second problem: sequential processing was slow. You couldn’t run token 5 until you’d finished token 4. Training was bottlenecked by this dependency chain.

The 2017 paper that changed everything

In June 2017, a team of researchers, most of them at Google Brain and Google Research, published Attention Is All You Need. The title was a provocation: forget recurrence, forget convolutions. Attention mechanisms alone, they argued, were sufficient to build a state-of-the-art translation model.

They were right.

The paper introduced the transformer architecture: an encoder that reads the input sequence, a decoder that generates the output sequence, and attention mechanisms connecting them. The key innovation was multi-head self-attention: the model could simultaneously attend to different parts of the sequence from multiple “perspectives” at once, then combine those views.

The results on translation benchmarks were better than anything before it. And because the computation was parallelizable, training was dramatically faster.

What self-attention actually does

Here’s the core idea. For every token (word-part) in a sequence, the model asks three questions:

  1. What am I looking for? (the Query)
  2. What do I have to offer? (the Key)
  3. What information should I pass forward? (the Value)

Every token generates all three. Then attention is calculated by comparing each token’s Query against the Key of every token in the sequence, its own included. Tokens that are highly relevant to each other get high attention scores; their Values get weighted more heavily in the output.
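In code, that comparison is a dot product followed by a softmax. Here is a minimal single-head sketch of scaled dot-product attention in plain Python. The tiny two-dimensional vectors are made up for illustration, and a real model would first produce the Queries, Keys, and Values from learned projection matrices, which this sketch skips.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for one head, one sequence."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Compare this token's Query against every token's Key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Blend the Values according to the attention weights.
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Two toy tokens whose Queries match their own Keys.
q = k = [[1.0, 0.0], [0.0, 1.0]]
v = [[10.0, 0.0], [0.0, 10.0]]
outputs, weights = attention(q, k, v)
```

Each row of `weights` sums to 1, and each output row is a weighted average of the Value vectors, tilted toward the tokens whose Keys matched best.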

Consider a sentence like: the bank approved the loan because it had strong reserves.

What does the pronoun refer to? A human reader knows immediately: the bank. A recurrent model might struggle if those words are far apart. A transformer computes attention across the full sentence at once. The word “bank” and the pronoun “it” point at each other. The pronoun resolution isn’t special-cased logic; it emerges from the attention calculation.

Positional encoding: the kludge that works

Here’s one thing the transformer doesn’t have by default: any sense of order. Self-attention sees all tokens in parallel and has no built-in notion of which word came before which.

The solution is positional encoding: adding a signal to each token embedding that encodes its position in the sequence. The original paper used a specific sinusoidal function for this. Later architectures have experimented with learned positional embeddings and more sophisticated methods like RoPE and ALiBi, but the core idea is the same: inject order information explicitly, because the attention mechanism itself is order-agnostic.
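The sinusoidal scheme from the original paper is short enough to write out. Even dimensions use a sine and odd dimensions a cosine, with wavelengths that grow geometrically across the embedding. The function name and the small sizes below are illustrative.

```python
import math

def positional_encoding(n_positions, d_model):
    """Sinusoidal encoding: PE[pos][2i] = sin(pos / 10000^(2i/d_model)),
    PE[pos][2i+1] = cos(pos / 10000^(2i/d_model))."""
    table = []
    for pos in range(n_positions):
        row = []
        for dim in range(d_model):
            pair = dim // 2  # dimensions 2i and 2i+1 share one frequency
            angle = pos / (10000 ** (2 * pair / d_model))
            row.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(n_positions=4, d_model=4)
```

Each row of `pe` is added to the token embedding at that position, so two identical words at different positions enter the attention layers with different vectors.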

Stacking layers

One attention layer doesn’t get you a capable model. Modern transformers stack dozens of these layers. Each layer reads the output of the previous one, letting the network build progressively more abstract representations. As a rough tendency, early layers capture more surface-level and syntactic features, while deeper layers capture semantics, reasoning, and world knowledge.

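GPT-4 and Claude 3 are widely assumed to be at least as deep as GPT-3, whose published configuration has 96 layers and 175 billion parameters, but their exact architectures and parameter counts aren’t public. What is public is the scale of the largest openly documented models: hundreds of billions of parameters across those layers, each parameter adjusted during training to minimize prediction error across trillions of tokens.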

Encoder vs. decoder models

Not all transformers are the same. The original paper had both an encoder and a decoder. But the field split:

Encoder-only models (like BERT) read a full sequence at once and build rich representations. They’re excellent at classification, question answering, and search. They’re bad at generation.

Decoder-only models (like the GPT series and Claude) generate text left to right, using masked attention so each token only sees what came before it. They’re what most people think of when they say “LLM.”

Encoder-decoder models (like the original transformer and T5) use both: an encoder processes the input, a decoder generates the output. These dominate translation and summarization tasks.
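The masked attention that decoder-only models rely on is a small change to the softmax step: each token’s scores for later positions are simply excluded before normalizing. A minimal sketch, assuming the raw attention scores have already been computed as a square list of lists (the function name is illustrative):

```python
import math

def causal_attention_weights(scores):
    """Softmax each row over positions 0..i only; later positions get 0."""
    n = len(scores)
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]  # token i may look at itself and earlier tokens
        peak = max(visible)
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights

# With equal scores, each token spreads attention evenly over what it can see.
masked = causal_attention_weights([[0.0] * 3 for _ in range(3)])
```

The first token can only attend to itself, the second splits its attention between two positions, and so on. An encoder-only model such as BERT skips the mask and lets every token see the whole sequence.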

Why this architecture still dominates

The transformer architecture is nine years old and still the foundation of every frontier model. That’s unusual in machine learning, where architectures often get displaced within a few years.

The reason it’s held on: transformers scale predictably. As you add parameters, add data, and add compute, performance keeps improving in measurable ways. The scaling laws that describe this relationship have held across multiple orders of magnitude. Nobody has found a clean way to break them.

What’s changing isn’t the core architecture but how it’s being extended. Mixture-of-experts layers (which send different inputs to specialized sub-networks), state-space models such as Mamba and their transformer hybrids, and sparse attention patterns are all attempts to make transformers cheaper, longer-context, or more efficient. So far, these have extended the transformer rather than replaced it: frontier models that use sparse routing are still transformers at their core.

Common misconceptions

“Transformers understand language the way humans do.” They don’t. Transformers are pattern-matching engines operating on token co-occurrence statistics across enormous corpora. The representations they learn are useful and sometimes surprising, but they’re not semantic understanding in any philosophical sense. How much this distinction matters in practice is an active debate.

“The attention mechanism is how the model thinks.” Attention weights tell you what tokens are being attended to during a forward pass. Mechanistic interpretability research suggests the real computation is more distributed and harder to read off from attention patterns alone. Attention is a useful signal, not a complete window into the model’s reasoning.

“A larger context window means better reasoning.” A larger context window means the model can take in more tokens per call. Whether it actually uses that information effectively is a separate question. Performance on long-context tasks often degrades past a certain point even when the window allows it.

Where to learn more

  • The Illustrated Transformer by Jay Alammar: the best visual walkthrough of attention mechanics for non-specialists.
  • Attention Is All You Need: the original paper is more readable than most academic ML papers. The abstract and introduction alone are worth the time.


Claude Opus 4.7 Is Out, and Its Vision Score Jumped from 54.5% to 98.5%

Anthropic shipped Claude Opus 4.7 today, and its visual acuity score jumped from 54.5% to 98.5%.

That’s not a rounding error. The previous flagship could barely handle computer-use vision tasks reliably. This one is near-perfect on the same benchmark. That gap is the story.

Opus 4.7 is available today across Claude.ai, Anthropic’s API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Azure AI Foundry. Pricing is unchanged at $5 per million input tokens and $25 per million output tokens. The context window is 1 million tokens.

The vision improvement is tied to a direct resolution increase. Opus 4.7 now accepts images up to 2,576 pixels on the long edge, about 3.75 megapixels and more than three times the resolution prior Claude models accepted. The jump in visual acuity from 54.5% to 98.5% on computer-use benchmarks reflects that change directly.

The coding and agent numbers are also substantial. On Anthropic’s internal 93-task coding benchmark, Opus 4.7 scores 13% higher than Opus 4.6. On Rakuten-SWE-Bench, a production-task benchmark built on real software engineering work, it resolves three times as many tasks. There are four tasks that neither Opus 4.6 nor Claude Sonnet 4.6 could solve at all that Opus 4.7 now handles.

What Changed Under the Hood

The model ships with several updates beyond the resolution increase.

A new xhigh effort level sits between the existing high and max settings, giving developers finer-grained control over compute spend per task. Anthropic also added adaptive thinking, a feature that automatically adjusts reasoning depth based on task complexity. The intent is to avoid burning max-effort compute on simple requests.

Other improvements on Anthropic’s spec sheet: a 14% gain on complex multi-step agent workflows, 21% fewer errors on enterprise document reasoning via the OfficeQA Pro benchmark, and better finance reasoning (0.813 vs. 0.767 on the General Finance evaluation module).

There is also a tokenizer change. The same input now maps to between 1.0 and 1.35 times as many tokens as before because of the updated tokenizer. Anthropic says net token usage on coding evaluations still improved despite that increase, but developers integrating the model should expect higher token counts on text-heavy workloads.
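For budgeting, the tokenizer change is easiest to treat as a multiplier on input tokens. The sketch below uses the prices quoted earlier in this article ($5 and $25 per million tokens) and the 1.0 to 1.35 range. The request size is a made-up example, and applying the multiplier to input only is a simplifying assumption.

```python
INPUT_PRICE_PER_MTOK = 5.00    # USD per million input tokens (from the article)
OUTPUT_PRICE_PER_MTOK = 25.00  # USD per million output tokens (from the article)

def request_cost(input_tokens, output_tokens, tokenizer_multiplier=1.0):
    """Estimated USD cost of one request, scaling input for the new tokenizer."""
    adjusted_input = input_tokens * tokenizer_multiplier
    return (adjusted_input * INPUT_PRICE_PER_MTOK
            + output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000

# Hypothetical request: 200,000 old-tokenizer input tokens, 4,000 output tokens.
best_case = request_cost(200_000, 4_000, tokenizer_multiplier=1.0)
worst_case = request_cost(200_000, 4_000, tokenizer_multiplier=1.35)

print(f"best case:  ${best_case:.2f}")
print(f"worst case: ${worst_case:.2f}")
```

Measuring real token counts on a sample of your own documents will give a tighter estimate than the published range.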

Safety-wise, the profile is close to Opus 4.6. Prompt injection resistance is better. Cybersecurity capabilities have been intentionally reduced compared to Claude Mythos Preview.

Why We’re Watching

The vision upgrade is the part of this release that changes what African developers can build, not just how well they can build it. Fintech companies across Nigeria, Kenya, and Uganda process enormous volumes of degraded document scans: KYC submissions on worn government IDs, utility bills photographed at odd angles, bank statements exported from low-resolution PDFs. AI vision tools fail on these constantly, which pushes document verification back onto human queues. A jump from 54.5% to 98.5% visual acuity is not marginal. It’s the threshold where automated document processing becomes reliable enough to trust. Combine that with 21% fewer document reasoning errors, and the economics of building a compliant KYC pipeline on top of a frontier model shift meaningfully. The 1M context window has been there for a while. The vision quality to use it on real African documents was not.

The 3x production task gain on Rakuten-SWE-Bench matters separately. Synthetic benchmarks are easy to optimize for. A benchmark built on real production engineering tasks is not.

Watch the 30-day adoption numbers from agentic coding platforms. If Factory Droids report the same 10-15% task success lift Anthropic claims, this release will put measurable pressure on every competing frontier model. The metric to watch on the vision side is whether document-heavy African fintech workflows that required human fallback before start running straight through on Opus 4.7.


A Bitcoin Proposal Would Freeze 6.9 Million BTC to Survive Quantum Computing

Six Bitcoin developers want to freeze every coin that doesn’t upgrade before quantum computers arrive.

BIP-361, a draft proposal introduced by Jameson Lopp (Casa CTO) and five co-authors, lays out a three-phase migration from ECDSA and Schnorr signatures to quantum-resistant alternatives. The kicker: unmigrated coins become permanently unspendable at the consensus level. Not locked. Not recoverable through some future mechanism. Frozen by the network itself.

The numbers are startling. Over 34% of all bitcoin in circulation (roughly 6.9 million BTC) have exposed public keys on-chain. These are theoretically vulnerable to Shor’s algorithm once quantum hardware reaches sufficient scale. Google’s quantum research team now estimates that could happen with approximately 500,000 physical qubits, 20x fewer than previously thought. Their projected timeline: somewhere between 2027 and 2030.

The three phases span roughly five years:

  1. Phase A (~3 years): Block new transactions from sending funds to legacy address types. Users can still move coins out of vulnerable addresses.
  2. Phase B (~2 years after Phase A): Network nodes reject all ECDSA/Schnorr signatures at the consensus level. Anything not migrated is frozen.
  3. Phase C (timeline uncertain): A limited recovery mechanism using zero-knowledge proofs tied to BIP-39 seed phrases. This part is still under research.

“Even if Bitcoin is not a primary initial target of a cryptographically relevant quantum computer, widespread knowledge that such a computer exists and is capable of breaking Bitcoin’s cryptography will damage faith in the network.” (BIP-361 authors)

The elephant in the room is Satoshi Nakamoto’s estimated 1.1 million BTC (roughly $74 billion at current prices). Nobody has those keys. Under BIP-361, those coins would be frozen permanently, effectively removed from the supply. The proposal’s authors invoke Satoshi’s own logic: “Lost coins only make everyone else’s coins worth slightly more.”

Not everyone agrees this is the right approach. Blockstream CEO Adam Back is pushing for voluntary, optional quantum-resistant upgrades instead of a mandatory freeze. BitMEX Research proposed a “canary fund” as a middle ground: conditional freezes triggered only by demonstrated quantum capability.

Lopp himself describes BIP-361 as “a rough sketch” and says he doesn’t believe mandatory migration is necessary yet. That’s an unusual posture for a proposal author, and it signals this is a conversation starter, not a finished plan.

Why We’re Watching

This proposal forces a question Bitcoin has been avoiding for years: who actually owns dormant coins? Freezing 6.9 million BTC by consensus isn’t a technical upgrade. It’s a philosophical one. Bitcoin’s core promise is permissionless ownership. BIP-361 argues that promise has to bend when the cryptography underneath it breaks.

The urgency math doesn’t add up yet. If a capable quantum computer arrives by 2029, inside Google’s projected 2027 to 2030 window, and activation hasn’t happened, a five-year migration wouldn’t complete in time. That gap between “when we need to start” and “when we’ll agree to start” is the real vulnerability.
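The arithmetic behind that gap is simple enough to spell out. The sketch below takes the approximate phase lengths from the proposal summary above (about 3 years, then about 2 more) and treats them as exact. The activation year passed in is a hypothetical input, not a prediction.

```python
PHASE_A_YEARS = 3  # approximate, from the proposal summary
PHASE_B_YEARS = 2  # approximate, from the proposal summary

def freeze_year(activation_year):
    """Year in which unmigrated coins would be frozen (end of Phase B)."""
    return activation_year + PHASE_A_YEARS + PHASE_B_YEARS

def latest_safe_activation(quantum_year):
    """Latest activation year that still finishes before quantum_year."""
    return quantum_year - (PHASE_A_YEARS + PHASE_B_YEARS)

hypothetical_activation = 2026
print(freeze_year(hypothetical_activation))   # migration finishes in 2031
print(latest_safe_activation(2029))           # would have needed to start in 2024
```

Even an activation this year finishes after a 2029 arrival, which is why the start date matters more than the phase lengths.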

Watch NIST’s post-quantum signature adoption rate and Google’s qubit roadmap. If either accelerates, this conversation moves from theoretical to urgent overnight.


Claude Code Routines Let You Run Coding Agents on a Schedule, No Laptop Required

Anthropic just put Claude Code on autopilot.

A new feature called Routines, currently in research preview, lets developers define a coding task once and have Claude execute it automatically: on a schedule, in response to an HTTP POST, or triggered by a GitHub event like a new pull request. The agent runs on Anthropic-managed cloud infrastructure, so it keeps running whether or not your laptop is open.

Routines landed on Hacker News this week with 658 points and 372 comments, one of the largest discussions the Claude Code project has generated. The signal is real: developers are paying attention to this.

What a Routine Actually Does

The setup is a prompt, one or more GitHub repositories, and a set of triggers. Claude clones the repo fresh on each run, does whatever the prompt describes, and pushes its changes to a claude/-prefixed branch by default. You can allow unrestricted branch pushes if the workflow needs it, but the safe default keeps Claude from accidentally modifying main.

Three trigger types are available:

  • Scheduled: run hourly, daily, on weekdays, or weekly. Custom cron expressions are supported via the CLI.
  • API: a per-routine HTTP endpoint with a bearer token. Post a Sentry alert body or a failing test log to the endpoint, and Claude wakes up, reads the context, and opens a draft fix.
  • GitHub events: pull request opened, release published, or other repository events, with optional filters by author, title, label, or draft status.

A single routine can combine all three. The example in the docs describes a PR review routine that also runs nightly and can be triggered by a deploy script. Each run creates a standalone session, visible at claude.ai/code/routines, where you can review what Claude did, leave feedback, or continue the conversation.
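To make the API trigger concrete, here is a rough sketch of what a deploy script’s call might look like. Everything specific in it is a placeholder: the endpoint URL, the token, and the JSON payload shape are hypothetical and not taken from Anthropic’s documentation, so check the Routines docs for the real endpoint and request format. The code only builds the request and deliberately does not send it.

```python
import json
import urllib.request

def build_trigger_request(endpoint_url, token, payload):
    """Build (but do not send) an authenticated HTTP POST with a JSON body."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )

# Placeholder values: the real URL and payload schema come from the routine's settings.
payload = {"text": "failing test log or alert body goes here"}
req = build_trigger_request(
    "https://example.invalid/routines/ROUTINE_ID/trigger",
    "PLACEHOLDER_TOKEN",
    payload,
)
# To actually send it: urllib.request.urlopen(req)
```

Keep the bearer token in a secret store rather than in the script, since anyone holding it can start a run.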

Routines are available on Pro, Max, Team, and Enterprise plans with Claude Code on the web enabled. Usage counts against standard subscription limits, and there’s a daily cap on routine runs per account.

Why We’re Watching

The boring version of this announcement is “Anthropic added a cron interface to Claude Code.” That’s not what this is.

Routines are the first time Anthropic has shipped an architecture where Claude can take persistent, consequential actions on a codebase without any human initiating the run. The feature is scoped carefully: pushes go to prefixed branches, actions use your connected GitHub identity so you can audit them, and the safety rails are visible. But the direction is clear. Anthropic is building infrastructure for coding agents that exist independently of a chat session, more like services than tools.

The HN response is significant context. Previous Claude Code releases generated noise from AI enthusiasts. This one is drawing detailed technical questions from working engineers asking how to integrate Routines into their CI pipelines and incident response flows. That’s a different kind of attention.

Watch whether GitHub Actions and similar CI platforms respond with similar scheduling features. If they do, it means the category is validated. If they don’t, it means they see Routines as niche enough to ignore for now, which would itself be telling.
