Claude Mythos Preview’s benchmark leap: what the numbers actually tell us

Claude Mythos Preview scored 93.9% on SWE-bench Verified. It will never ship to you.

Buried inside today’s Project Glasswing announcement is the data Anthropic clearly wanted read closely: a benchmark sheet for Claude Mythos Preview, a frontier model Anthropic is not generally releasing. The numbers are a discontinuity, not an iteration, and they’re the reason the company simultaneously announced a 12-company coalition to work out what to do with the thing.

Anthropic published comparative scores for Mythos Preview against Claude Opus 4.6, its current flagship, across four coding and agentic benchmarks.

Benchmark	Opus 4.6	Mythos Preview	Delta (points)
SWE-bench Verified	80.8%	93.9%	+13.1
SWE-bench Pro	53.4%	77.8%	+24.4
Terminal-Bench 2.0	65.4%	82.0%	+16.6
CyberGym	66.6%	83.1%	+16.5
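
The Delta column is plain subtraction, but the same gain reads differently depending on how much headroom was left. The sketch below recomputes each delta from the table and adds a second lens, the share of previously unsolved tasks that Mythos Preview now solves. That second figure is our own derived metric, not one Anthropic published, and the names `scores` and `compare` are illustrative.

```python
# Scores copied from the table above (percent of tasks resolved):
# (Opus 4.6, Mythos Preview)
scores = {
    "SWE-bench Verified": (80.8, 93.9),
    "SWE-bench Pro": (53.4, 77.8),
    "Terminal-Bench 2.0": (65.4, 82.0),
    "CyberGym": (66.6, 83.1),
}


def compare(opus: float, mythos: float) -> tuple[float, float]:
    """Return (absolute gain in points, percent of Opus 4.6 failures eliminated)."""
    delta = round(mythos - opus, 1)
    # Headroom is whatever Opus 4.6 left unsolved; this is a derived
    # metric for illustration, not a figure from the announcement.
    headroom = 100.0 - opus
    failures_eliminated = round((mythos - opus) / headroom * 100.0, 1)
    return delta, failures_eliminated


summary = {name: compare(*pair) for name, pair in scores.items()}

for name, (delta, cut) in summary.items():
    print(f"{name}: +{delta} points, {cut}% of remaining failures eliminated")
```

On this view the smallest absolute gain, on SWE-bench Verified, removes the largest share of remaining failures, because so little headroom was left to begin with.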

Mythos Preview will be accessible only to the 12 Glasswing partners and roughly 40 additional critical-infrastructure maintainers. Once free credits run out, research-preview pricing is $25 per million input tokens and $125 per million output tokens, roughly 3× Opus 4.6’s rate.
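
To make the per-token rates concrete, here is a minimal cost sketch. The two rates come from the paragraph above. The token counts in the example are hypothetical, chosen only to illustrate a long agentic session, and `run_cost_usd` is our own helper name, not part of any Anthropic SDK.

```python
# Research-preview rates quoted above, in USD per million tokens.
INPUT_USD_PER_MTOK = 25.0
OUTPUT_USD_PER_MTOK = 125.0


def run_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_rate: float = INPUT_USD_PER_MTOK,
    output_rate: float = OUTPUT_USD_PER_MTOK,
) -> float:
    """Estimate the cost of one run, ignoring free credits and any discounts."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return (input_tokens / 1_000_000) * input_rate + (
        output_tokens / 1_000_000
    ) * output_rate


# Hypothetical session: 200k input tokens read, 20k output tokens written.
example_cost = run_cost_usd(200_000, 20_000)
print(f"Estimated cost: ${example_cost:.2f}")
```

With those assumed token counts the session comes to about $7.50: $5.00 for input and $2.50 for output.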

The SWE-bench Pro jump is the one to linger on. SWE-bench Verified is close to saturation: once you’re above 80%, the benchmark tells you little beyond the fact that the model is competent at coding. SWE-bench Pro is the harder variant, designed around senior-engineer tasks: larger codebases, ambiguous requirements, multi-file reasoning. Opus 4.6 scored 53.4% on it. Mythos Preview scored 77.8%.

A 24-point jump on a hard benchmark in a single model generation is not normal. For comparison, the gap between GPT-4 and GPT-4-Turbo on comparable hard coding evaluations was roughly 5 to 10 points. The gap between Claude 3.5 Sonnet and Claude Opus 4 on SWE-bench Verified was about 15 points. Mythos Preview’s Pro number is closer to a generational leap than a version bump.

Terminal-Bench 2.0 tells a similar story. That benchmark measures agentic, multi-step terminal work: plan, run, interpret, recover. An 82% score there means Mythos Preview can complete, unassisted, roughly four in five of the benchmark’s shell tasks, the kind of work most human engineers can handle.

“AI models have reached a level of coding capability where they can surpass all but the most skilled humans at finding and exploiting software vulnerabilities.”
– Project Glasswing announcement

The interesting move is what Anthropic didn’t publish: scores on general reasoning, multilingual, or multimodal benchmarks. The Mythos Preview system card focuses almost entirely on code and agentic capability. Either those are the only axes where the model is materially ahead of Opus 4.6, or Anthropic is deliberately keeping other capability disclosures out of the public announcement. Both options are informative. Our read is the first: Mythos Preview is a coding/agentic specialization, likely trained with heavy RL on software-engineering environments. That would explain both the benchmarks highlighted and the fact it’s being deployed to code-maintenance partners rather than general consumer or enterprise customers.

Why We’re Watching

The benchmarks are a lower bound on what the next public Opus model can do once safety mitigations are added. Anthropic has said a future Opus release will build on the Mythos Preview capability set with offensive-security guardrails. The public ceiling now sits somewhere between Opus 4.6 and Mythos Preview; the exact location depends on how aggressive the guardrails are. That gap is also a pricing tell. $25/$125 per million tokens is roughly 3× Opus 4.6, so either the public successor is priced similarly (ending the era in which the top Claude cost under $20 per million input tokens) or Anthropic subsidizes the capability to stay competitive with OpenAI’s coding-focused releases. Both answers reshape API economics for every team building agents.

For developers across African AI labs and product teams, the immediate implication is that the cheap-and-capable middle of the market just narrowed. Claude Sonnet at $3 per million input tokens is not going anywhere, and prompt caching still cuts the price of cached input by 90%. But the frontier is drifting upmarket, and the performance gap to the cheap tier will grow, not shrink.
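
The 90% figure applies only to the portion of a prompt served from cache, so the effective input rate depends on how much of each request is reused. A rough sketch of that arithmetic, assuming the $3 base rate and 90% discount quoted above and deliberately ignoring any surcharge for writing to the cache (the function name and the 80% example are illustrative):

```python
# Figures quoted above: Sonnet input rate and the cached-input discount.
SONNET_INPUT_USD_PER_MTOK = 3.0
CACHE_READ_DISCOUNT = 0.90


def blended_input_rate(
    cached_fraction: float,
    base_rate: float = SONNET_INPUT_USD_PER_MTOK,
    discount: float = CACHE_READ_DISCOUNT,
) -> float:
    """Effective USD per million input tokens for a given cached share.

    Simplification: cache-write costs are not modelled, so this is an
    optimistic estimate for workloads with frequent cache misses.
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    return base_rate * (1.0 - discount * cached_fraction)


# Hypothetical agent whose prompts are 80% reusable context.
rate_at_80_percent = blended_input_rate(0.8)
print(f"Effective input rate: ${rate_at_80_percent:.2f} per million tokens")
```

Under these assumptions an agent that reuses 80% of its prompt pays about $0.84 per million input tokens, and only a fully cached prompt reaches the $0.30 floor.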

Watch the research-preview data Anthropic releases over the next two quarters: fixes merged, zero-days caught, pricing decisions on the Opus successor. A preview that produces real patches validates the discontinuity. A preview that produces only announcements doesn’t.

Sources

Project Glasswing: Securing critical software for the AI era