Categories
AI

Write Your First MCP Server in 30 Minutes

Model Context Protocol (MCP) is the open standard that lets AI apps like Claude Desktop talk to your own systems: databases, APIs, files, anything you can code against.

This guide gets you from zero to a working MCP server in Python, connected to Claude Desktop and callable from a conversation. No prior experience with protocols required.

What you’ll need

  • Python 3.10 or higher (check with python --version)
  • uv or pip for package installation
  • Claude Desktop installed (free tier works)
  • A terminal you’re comfortable running commands in

Step 1, Install the MCP library

The official MCP Python SDK from Anthropic ships with a high-level FastMCP interface that handles the protocol wiring for you. You write Python functions. The library does the rest.

pip install "mcp[cli]"

If you’re using uv, which is faster:

uv add "mcp[cli]"

Create a file called server.py in a new folder. Everything in this guide builds on that single file.

Step 2, Write a minimal working server

Paste this into server.py:

from mcp.server.fastmcp import FastMCP

# Give your server a name (Claude Desktop shows this in its settings)
mcp = FastMCP("MyFirstServer")

@mcp.tool()
def add_numbers(a: int, b: int) -> int:
    '''Add two integers and return the result.'''
    return a + b

@mcp.tool()
def greet(name: str) -> str:
    '''Return a greeting for the given name.'''
    return f"Hello, {name}. Your MCP server is working."

if __name__ == "__main__":
    mcp.run(transport="stdio")

That’s a complete, functional MCP server. It exposes two tools: add_numbers and greet. FastMCP reads your Python type hints and generates the JSON Schema Claude needs to call the tools correctly. The docstring becomes the tool description that Claude reads when deciding whether to use the tool. Keep docstrings specific: something like Add two integers and return the result works better than Does math.

Run it to confirm it starts without errors:

python server.py

It will appear to hang. That’s expected. The server is waiting for input on stdin. Press Ctrl+C to stop it.

Step 3, Connect it to Claude Desktop

Claude Desktop reads server configurations from a JSON file at startup. On macOS the path is ~/Library/Application Support/Claude/claude_desktop_config.json. On Windows, it’s %APPDATA%\Claude\claude_desktop_config.json.

Create or open that file and add the following. If the file already exists with other servers, add the my-first-server block inside the existing mcpServers object.

{
  "mcpServers": {
    "my-first-server": {
      "command": "python",
      "args": ["/absolute/path/to/your/server.py"]
    }
  }
}

Replace /absolute/path/to/your/server.py with the actual path to your file. On macOS, running pwd in your terminal shows the current directory. On Windows, cd does the same. The path must be absolute, not relative.

Restart Claude Desktop. In the chat interface, you should see a small tools icon near the bottom of the window. Click it and you’ll find add_numbers and greet listed under “MyFirstServer.”

Verifying it works

Open a new conversation in Claude Desktop and type:

Use the add_numbers tool to compute 17 + 25.

Claude will call your server, receive the result, and report 42. If your tools don’t appear, work through this checklist:

  • The file path in claude_desktop_config.json is absolute (not relative like ./server.py)
  • The Python binary in command is the correct one for your environment (run which python or where python to confirm)
  • You restarted Claude Desktop after editing the config
  • server.py runs cleanly when you test it manually in the terminal

Adding a real data source

Two arithmetic tools aren’t much to show off. Here’s how to add something that retrieves live data. Add this to server.py:

import urllib.request
import json

@mcp.tool()
def get_btc_price() -> str:
    '''Fetch the current Bitcoin price in USD from a public API.'''
    url = "https://api.coingecko.com/api/v3/simple/price?ids=bitcoin&vs_currencies=usd"
    with urllib.request.urlopen(url) as resp:
        data = json.loads(resp.read())
        price = data["bitcoin"]["usd"]
        return f"Bitcoin is currently ${price:,} USD."

Restart Claude Desktop and ask Claude something like: what is the Bitcoin price right now? Claude will call your tool and return a live answer from CoinGecko‘s free API.

This is the core of what MCP enables. Your server can query a database, call an internal API, read a local file, run a calculation on private data. Anything Python can do, Claude can call, once the server exposes it as a tool.

Common pitfalls

  • Relative paths in config, claude_desktop_config.json requires absolute paths. Claude Desktop doesn’t have a working directory, so it can’t resolve ./server.py. Use the full path.
  • Missing type hints, FastMCP generates tool schemas from Python type hints. Omit them and the tool won’t appear in Claude’s list. Every parameter needs a type annotation.
  • Forgetting to restart, Claude Desktop caches server connections at startup. After any change to server.py or the config file, you must restart the app.
  • Wrong Python binary, If you’re using a virtual environment, the command field should point to the Python inside that environment, not the system Python. Run which python with the venv active to get the right path.
  • Vague docstrings, Claude reads your function docstrings to decide which tool to call. A vague description like Adds numbers is borderline. A specific one like Add two integers and return their sum, for arithmetic calculations is better. Specificity reduces wrong-tool calls.

Next steps

From here, you can expose data sources with @mcp.resource() using URI patterns, add reusable prompt templates with @mcp.prompt(), or deploy the server for remote access by switching transport="streamable-http" and updating the client config to use an HTTP URL instead of a command.

The MCP specification documents all three capability types in full. The reference servers repository on GitHub has production examples covering filesystems, databases, and third-party APIs. Most are under 200 lines. The protocol is young enough that simple API wrappers are still the most useful servers being built. If there’s an internal tool your team uses repeatedly, an MCP interface on top of it is usually a half-day project.

Sources

Categories
AI

How LLM Context Windows Actually Work Under the Hood

When you paste a 50-page report into Claude and it answers a question about page 38, it didn’t read to page 38. That’s not how any of this works.

The short answer

A context window is the full body of text a language model can see during a single inference call. Every token inside it is available to the model simultaneously when generating each new word. Nothing outside the window exists. Claude 3.7 Sonnet supports up to 200,000 tokens at once. That’s roughly the length of a full novel, all held in view at the same time. The model doesn’t skim it, summarize it, or build notes as it goes. It just… sees all of it.

The long answer

What counts as a token

The OpenAI tokenizer is a useful tool for getting intuition here. A token is roughly 0.75 words in English prose on average, though it varies significantly by language and content type. “tokenization” splits into three tokens. “the” is usually one. Code tokenizes differently from prose, and emoji often consume two to four tokens each.

This matters because every piece of text in a single API call counts against the context window: the system prompt, the full conversation history, the document you pasted, and the model’s own previous replies. If you’re near the limit, something gets cut. Applications typically drop the oldest messages, but the exact behavior depends on how the app is built. Many users don’t realize they’ve pushed old context out until they notice the model has “forgotten” something it acknowledged earlier.

How attention actually works

The architecture behind every major language model is the transformer, introduced in a 2017 paper by Google researchers Ashish Vaswani and colleagues. The defining mechanism is self-attention.

The intuition: imagine every word in a document broadcasting a query to every other word, asking “are you relevant to me?” Each word replies with a relevance score. The model then builds a weighted representation of the entire context based on those scores. “President” attends heavily to “Biden” three sentences later. “error” attends to “function” and “line 47” two paragraphs up. “however” attends to the contrasting claim it’s about to negate.

This happens across multiple attention heads in parallel, each trained to detect different types of relationships: syntactic, semantic, referential, positional. The outputs of all heads are combined, giving each token a representation that encodes how it relates to every other token in the window.

The result is a model that can connect a detail in paragraph 3 to a question about paragraph 47, provided both sit inside the context window. There’s no explicit lookup or cross-referencing logic built by engineers. It emerges from the attention scores learned during training.

The KV cache

One of the key optimizations making long-context models economically viable is the KV (key-value) cache. During inference, the model computes key and value matrices for every token. If part of the context is stable across multiple calls, those matrices don’t need to be recomputed each time.

This is the mechanism behind Anthropic’s prompt caching feature. A long, stable system prompt or document can be prefixed to requests in a cacheable block. You pay the full computation cost on the first call. Subsequent calls that hit the same prefix get a significant discount. For high-volume applications where many users send queries against the same large document, caching can reduce inference costs substantially.

Why long context gets expensive

Self-attention doesn’t scale linearly with context length. It’s roughly quadratic: double the context length and the compute cost roughly quadruples. This is the core engineering challenge in building long-context models.

Sparse attention architectures address this by having tokens attend only to the most relevant subset of the context rather than every other token. Some models implement sliding window attention, where tokens primarily attend to nearby tokens with periodic global attention heads for long-range dependencies. These approaches trade some expressiveness for much better scaling behavior at lengths that would otherwise be impractical.

The result is a landscape where context length and inference cost are genuinely in tension. Pasting your entire codebase into a 200K context is technically feasible, but whether it’s cost-effective depends heavily on your API volume and how stable that context is across calls.

What the model “remembers”

Here’s the subtle part. The context window is not memory in any human sense. The model doesn’t build a mental model of your document and then answer from that understanding. It generates each output token by attending to the raw input text in real time.

Ask the model “what did the report say about Q2 revenue?” and it doesn’t recall a summary it formed while reading. It attends to the raw text, finds patterns relating to “Q2 revenue” across the full context, and synthesizes a response. Fast and powerful, but with a specific failure mode.

When the relevant text is buried in the middle of a very long context, models can underweight it relative to text near the beginning and end of the window. Research from Stanford documented this pattern across multiple model architectures: performance on retrieval tasks degrades for content placed in the middle of long contexts, even when the model nominally has enough context window to see it. The researchers called it “lost in the middle.”

This is why retrieval-augmented generation (RAG) still has a role even as context windows expand. Retrieving and surfacing the two most relevant chunks often outperforms giving the model 50 chunks to sort through, because precision of context matters alongside size of context.

Why this matters in 2026

Context windows have grown so large that some teams are treating them as a substitute for retrieval infrastructure. For a codebase under 150,000 tokens, context stuffing is sometimes simpler and more accurate than maintaining a vector index, because the model sees everything in one shot rather than relying on similarity search to fetch the right pieces.

The economics don’t always follow, though. Filling a 200K-token context on every API call is expensive at current rates, especially at any meaningful request volume. The sweet spot for most production use cases is still some combination of retrieval (to narrow down what goes in the window) plus a reasonably sized context (to give the model enough room to reason across it).

What’s genuinely new in 2026 is how cheaply you can now get to 100K tokens of effective context versus three years ago. The cost curve has dropped dramatically. Long-context workflows that required expensive API tiers in 2023 are now accessible at standard pricing. That’s changing what kinds of applications are worth building.

Common misconceptions

“The model reads the context before answering.” There’s no sequential pass. Attention is computed across the full context simultaneously. The model doesn’t need to “get to” a part of the document: it sees it all at once in each layer.

“A larger context window means better answers.” Not automatically. More context can mean more noise. A model given 50 irrelevant pages alongside 2 relevant ones can perform worse than the same model given just the 2 relevant pages. Precision of context matters as much as volume.

“Tokens are just words.” They’re not. A single word can span multiple tokens, a single token can be a fragment of a word, and the same sequence of characters tokenizes differently depending on where it appears. Short, common English words tend to be single tokens. Technical vocabulary, code, and non-English text often tokenize less efficiently.

“The model loses information near its context limit.” Most models use fixed positional embeddings that treat position 1 and position 200,000 equivalently in terms of raw representational capacity. What changes is the “lost in the middle” degradation pattern described above. The model can see everything in the window; the issue is attention weight distribution, not literal information loss.

Where to learn more

Sources

Categories
News

CoW Swap Got DNS-Hijacked and $500K Drained. The Smart Contracts Were Fine

CoW Swap got hijacked on April 14, and the attack had nothing to do with DeFi.

Attackers took control of the Ethereum decentralized exchange’s domain at the DNS level and redirected users to a clone site. The fake interface prompted visitors to sign token approval transactions that gave the attacker permission to drain their wallets. Cybersecurity researcher Vladimir S. estimates roughly $500,000 was stolen from a small number of addresses. At least one user publicly claimed losses exceeding $50,000.

The protocol’s smart contracts, backend, and APIs were never compromised. CoW paused everything anyway as a precaution.

“We have evidence that a small number of users signed malicious approvals for very small amounts.”, MooKeeper, CoW Swap team member

Gnosis co-founder Martin Koppelmann confirmed the scope appeared limited: only users who visited the compromised site after approximately 14:54 UTC on April 14 and signed the malicious approvals were affected. The CoW team instructed anyone who interacted with the site during that window to immediately revoke all token approvals using Etherscan’s approval checker.

This is becoming a pattern. Curve Finance suffered the exact same attack vector in 2022 (roughly $570,000 drained) and again in May 2025 (DNS record manipulation, losses unspecified). Same playbook every time: hijack the domain, serve a malicious front-end, harvest approvals.

The irony is thick. DeFi protocols spend millions on smart contract audits, formal verification, and bug bounties. The contracts are battle-tested. Then the whole thing gets undone by a domain registrar compromise that any web2 phishing crew could pull off.

The fix isn’t complicated in theory. Decentralized front-ends, IPFS-hosted interfaces, ENS domains, client-side signature verification. But almost no major DeFi protocol actually ships these as defaults. The user experience gap between a traditional web interface and a decentralized one is still wide enough that protocols choose convenience over resilience.

Why We’re Watching

DNS hijacking is now the most reliable attack vector in DeFi, and it has nothing to do with blockchain security. Three major incidents in four years, same attack, same outcome. The smart contracts survive. The web infrastructure doesn’t. That’s a problem for every DeFi protocol that serves users through a traditional domain.

For African DeFi users who access protocols primarily through mobile browsers (often on slower connections where loading IPFS interfaces is impractical), front-end security is the entire security model. If the website you’re visiting isn’t the real one, your on-chain protections are meaningless.

Watch whether CoW Swap and Curve finally migrate to decentralized front-end hosting after this. If they don’t, the next $500,000 DNS hijack is a matter of when, not if.

Sources

Categories
AI

How LLM Context Windows Actually Work Under the Hood

Your AI assistant just forgot what you said three paragraphs ago. The context window is supposed to prevent that. It doesn’t always.

Context windows are one of the most misunderstood features in AI. People treat them like a RAM spec: bigger is always better, and if a number fits inside the limit, the model read it. Neither of those things is reliably true. Here’s what’s actually happening.

The short answer

A context window is the total number of tokens a model can process in a single forward pass: your prompt, any conversation history, any documents you attached, plus the model’s response so far. Everything the model “sees” at inference time must fit inside this window. Anything outside it doesn’t exist, from the model’s perspective.

Claude Sonnet currently supports a 200,000-token context window. That sounds large. Translated to real text, it’s roughly 150,000 words, or about two full-length novels. For comparison, GPT-4 Turbo supports 128,000 tokens. These are genuinely large numbers. They also come with asterisks.

The long answer

Tokens are not words

Before getting to how context windows work, it helps to be precise about tokens. A token is a chunk of text, but not necessarily a word. Common short words like the or is map to a single token. Longer or rarer words get split: cryptocurrency might be two or three tokens depending on the tokenizer. In practice, English text runs around 0.75 words per token, so a 200,000-token window holds roughly 150,000 words.

This matters because the window limit is a token limit, not a word limit, and it applies to everything: your system prompt, your entire conversation history, any documents you paste in, and the model’s own output so far. The counter is always running.

How the model reads: attention

The core mechanism inside a transformer is called self-attention. On every forward pass, every token in the context looks at every other token and decides how much “attention” to pay to it. This produces a weighted summary of the context that the model uses to generate each next token.

Think of it like a room where everyone can hear everyone else simultaneously. A token representing the word “she” looks at the surrounding context and figures out which earlier noun it refers to. A token representing a number looks at nearby tokens to understand its units and meaning. Every token is in conversation with every other token, all at once.

This sounds elegant, and it is. It’s also computationally expensive. The memory required for attention scales with the square of the sequence length. Double the context, quadruple the memory cost. This is why long-context models require more hardware to run and cost more per token at inference: the math gets heavier, not linearly, but quadratically.

The KV cache

To avoid recalculating attention for tokens you’ve already seen, transformers store something called a key-value (KV) cache. When you’re having a conversation, the model doesn’t re-read every prior message from scratch on each turn. It caches the attention representations and reuses them. This is what makes multi-turn conversation practical.

The KV cache also explains why prompt caching exists as a billing feature. If you have a long system prompt that doesn’t change between calls, providers can cache its KV representations and charge you less for re-reading it. The compute work was already done once.

The cache has limits. Cached representations occupy GPU memory, which is finite and expensive. This is part of why running a 200K-context model costs significantly more than running a 4K-context model, even if your actual query is short.

What happens at the edges

A model doesn’t read its context the way you read a document. Research has found a consistent pattern: models tend to be better at using information from the beginning and end of their context than information buried in the middle. A 2023 paper from Stanford and other institutions studied this directly and found that performance on retrieval tasks degraded significantly when the relevant information was placed in the middle of a long context, even when it was well within the model’s stated limit.

The researchers called this “lost in the middle.” It’s a real phenomenon, not a model-specific quirk. It reflects something fundamental about how attention distributes over long sequences: the strong positional signals at the start and end of a context anchor the model’s attention more effectively than the diffuse middle.

Practical implication: if you’re feeding a model a long document and asking a specific question, where you place the relevant information matters. Putting it near the start or end of the prompt tends to produce better recall than burying it on page 12 of a 20-page paste.

Why this matters in 2026

Context windows have grown dramatically in the past three years. Models that supported 4,000 tokens in early 2023 now support 128,000 or 200,000. This is a genuine capability leap, enabling things that were impossible before: feeding an entire codebase into a single prompt, having a 3-hour meeting transcript summarized in one call, or analyzing a full legal document without chunking.

But the growth of context windows has also created a misconception: that longer context automatically means better performance. It doesn’t. The quadratic scaling of attention means longer contexts cost more and, on some tasks, produce worse outputs because the model’s attention dilutes across more tokens. Smaller, focused prompts often outperform bloated ones.

The other shift happening in 2026 is the rise of agentic systems, where models run for many turns without human intervention. Claude Code Routines, launched this week in research preview, runs Claude as a persistent background agent on codebases. These agents accumulate context across runs: tool outputs, prior conversation turns, file contents. Managing context carefully isn’t a nice-to-have in these systems; it’s an engineering discipline. Run out of context mid-task, and the agent loses the thread.

Common misconceptions

If it fits in the context, the model read it. Technically true but practically misleading. Read in the attention sense means every token computed its attention weights relative to every other token. But attention doesn’t mean recall. Information in the middle of a long context is accessed less reliably than information at the edges, as the “lost in the middle” research demonstrates.

A bigger context window means a smarter model. Context window size is an engineering parameter, not an intelligence parameter. A model with a 200K context window isn’t inherently better at reasoning than a model with a 32K window. It’s better at processing more text in one shot. Those are different things.

The context window is like memory. It’s more like a whiteboard. Everything on it is equally visible at inference time (modulo the middle-attention issue), but once the conversation ends, it’s erased. There’s no persistent memory across sessions unless the system is explicitly designed for it. When you start a new chat, the model knows nothing about your last conversation.

Hitting the context limit is a hard error. Most production APIs handle context overflow by truncating the oldest part of the conversation, usually the earliest messages. This can cause the model to silently lose important context mid-task. Catching this in agentic systems requires explicit monitoring of token counts, not just assuming the session is intact.

Where to learn more

Sources

Categories
AI

Claude Mythos Preview’s benchmark leap, what the numbers actually tell us

Claude Mythos Preview scored 93.9% on SWE-bench Verified. It will never ship to you.

Buried inside today’s Project Glasswing announcement is the data Anthropic clearly wanted read closely: a benchmark sheet for Claude Mythos Preview, a frontier model Anthropic is not generally releasing. The numbers are a discontinuity, not an iteration, and they’re the reason the company simultaneously announced a 12-company coalition to work out what to do with the thing.

Anthropic published comparative scores for Mythos Preview against Claude Opus 4.6, its current flagship, across three coding and agentic benchmarks.

Benchmark Opus 4.6 Mythos Preview Delta
SWE-bench Verified 80.8% 93.9% +13.1
SWE-bench Pro 53.4% 77.8% +24.4
Terminal-Bench 2.0 65.4% 82.0% +16.6
CyberGym 66.6% 83.1% +16.5

Mythos Preview will be accessible only to the 12 Glasswing partners and roughly 40 additional critical-infrastructure maintainers. Research-preview pricing is $25 per million input tokens, $125 per million output tokens after free credits, roughly 3× Opus 4.6’s rate.

The SWE-bench Pro jump is the one to linger on. Verified SWE-bench is close to saturation, once you’re above 80%, the benchmark isn’t telling you much beyond the fact that the model is competent at coding. SWE-bench Pro is the harder variant, designed around senior-engineer tasks: larger codebases, ambiguous requirements, multi-file reasoning. Opus 4.6 scored 53.4% on it. Mythos Preview scored 77.8%.

A 24-point jump on a hard benchmark in a single model generation is not normal. For comparison, the gap between GPT-4 and GPT-4-Turbo on comparable hard coding evaluations was roughly 5 to 10 points. The gap between Claude Sonnet 3.5 and Claude Opus 4 on SWE-bench Verified was about 15. Mythos Preview’s Pro number is closer to a generational leap than a version bump.

Terminal-Bench 2.0 tells a similar story. That benchmark measures agentic, multi-step terminal work, plan, run, interpret, recover. An 82% score there means Mythos Preview can mostly do tasks most human engineers can mostly do, in a shell, unassisted.

“AI models have reached a level of coding capability where they can surpass all but the most skilled humans at finding and exploiting software vulnerabilities.”
– Project Glasswing announcement

The interesting move is what Anthropic didn’t publish: scores on general reasoning, multilingual, or multimodal benchmarks. The Mythos Preview system card focuses almost entirely on code and agentic capability. Either those are the only axes where the model is materially ahead of Opus 4.6, or Anthropic is deliberately keeping other capability disclosures out of the public announcement. Both options are informative. Our read is the first: Mythos Preview is a coding/agentic specialization, likely trained with heavy RL on software-engineering environments. That would explain both the benchmarks highlighted and the fact it’s being deployed to code-maintenance partners rather than general consumer or enterprise customers.

Why We’re Watching

The benchmarks are a lower bound on what the next public Opus model can do once safety mitigations are added. Anthropic has said a future Opus release will build on the Mythos Preview capability set with offensive-security guardrails. The public ceiling now sits somewhere between Opus 4.6 and Mythos Preview, the exact location depends on how aggressive the guardrails are. That gap is also a pricing tell. $25/$125 per million tokens is roughly 3× Opus 4.6, so either the public successor is priced similarly (ending the era where the top Claude cost under $20 per million input tokens) or Anthropic subsidizes the capability to stay competitive with OpenAI’s coding-focused releases. Both answers reshape API economics for every team building agents.

For developers across African AI labs and product teams, the immediate implication is that the cheap-and-capable middle of the market just narrowed. Claude Sonnet at $3 per million input tokens is not going anywhere, and prompt caching still cuts that by 90%. But the frontier is drifting upmarket and the performance gap to the cheap tier will grow, not shrink.

Watch the research-preview data Anthropic releases over the next two quarters, fixes merged, zero-days caught, pricing decisions on the Opus successor. A preview that produces real patches validates the discontinuity. A preview that produces only announcements doesn’t.

Sources

Categories
News

A fake Ledger app on the Mac App Store drained $9.5M. The self-custody question just got harder.

A fake Ledger Live app sat on Apple’s Mac App Store for six days and drained $9.5M.

A counterfeit app cloning Ledger Live drained more than $9.5 million in bitcoin and other crypto from at least 50 users, including musician G. Love, according to investigators tracking the theft. On-chain sleuth ZachXBT identified the scheme and traced the stolen funds through 150+ KuCoin addresses. The listing was live on Apple‘s Mac App Store from April 7 to April 13, 2026.

The mechanics aren’t new. Fake wallet apps have been a staple scam since 2018. What’s new is the distribution channel. This one passed the same review process Apple invokes when it defends the 30% take and argues against alternative stores. For the segment of crypto users who picked Ledger specifically because they didn’t trust browser extensions or random Telegram links, that’s the uncomfortable part.

Per ZachXBT’s thread and Decrypt‘s reporting, the attacker cloned the Ledger Live UI closely enough to pass a distracted user’s glance. Users who opened the fake app and entered their 24-word recovery phrase during what looked like a routine device setup handed over full control of every account seeded from that phrase. The app exfiltrated phrases to a server. Wallets drained within hours.

Fifty-plus victims are confirmed. The $9.5M total is likely to climb as more affected users self-identify. Stolen funds were laundered through 150+ KuCoin deposit addresses, a pattern consistent with a prepared off-ramp rather than an opportunistic heist. Apple pulled the listing on April 13 after ZachXBT’s public reporting. Ledger, the legitimate hardware-wallet maker, has confirmed the attack and reiterated that its real app never asks users to type a recovery phrase into a computer.

Ledger has reiterated in its public response that the real Ledger Live app never asks users to enter their 24-word recovery phrase, and any app that does is a scam. The hardware-wallet company distributes its software only through its own download page and vetted channels.

Direct market impact is modest. $9.5M across 50+ users doesn’t move price, and the on-chain trail is being watched. Protocol impact on Bitcoin or any affected chain is effectively zero. The distribution-channel impact is larger than the dollar figure suggests.

Why We’re Watching

The self-custody pitch takes a hit in an unexpected place. Canonical security advice has been to avoid hot wallets from unknown sources and use a hardware wallet like Ledger or Trezor. That still holds. The hardware wasn’t compromised. But the companion app is where users type recovery phrases during setup and recovery, and any channel that can serve a fake companion app is a channel that can drain users who did everything else right. The Mac App Store, as of this week, is one of those channels.

The review-process failure at Apple is the part to watch. iOS and macOS review is Apple’s answer to every regulator questioning the 30% tax. If a $9.5M wallet-cloning scam sits in the store for six days, the argument that review is worth the tax gets harder, and the argument that crypto apps should be allowed on alternative stores under the EU Digital Markets Act gets easier.

The audience expansion matters too. A 2021 fake Trezor wallet listing captured phrases for about $1M before being pulled. The 2026 delta is that crypto self-custody has broadened past the 2018 to 2021 cohort. Users coming in through stablecoin payments, tokenized assets, or AI agents that hold keys on their behalf aren’t steeped in the never type your seed phrase rule. The attack surface grew. Security literacy did not. In African markets where self-custody adoption is fastest via apps like Bitnob and Yellow Card, a high-trust-store incident like this reshapes onboarding copy, the wallet apps most people will actually touch are mobile-first, and the lesson needs to travel faster than the next clone.

The immediate rule for holders is narrower than most “crypto is unsafe” takes will suggest: no legitimate wallet app asks for your recovery phrase on the host machine. If it does, it’s fake. Your hardware wallet is fine. Your app store may not be.

Watch three things. Does Apple disclose how the listing passed review, or go silent and invite regulatory scrutiny. Does KuCoin freeze the deposit addresses fast, cooperation has been the pattern before. Does this show up in the MiCA secondary rulemaking or in US wallet-classification language, a clean App Store failure is exactly the case study that gets cited.

Sources

Categories
AI

Anthropic’s Project Glasswing pulls 12 tech giants into an AI security pact

Anthropic won’t ship its most capable coding model. It gave the keys to 12 infrastructure giants instead.

Anthropic announced Project Glasswing today, a 12-company consortium to arm maintainers of critical open-source software with a frontier Claude model capable of autonomously finding and patching zero-day vulnerabilities. The model driving it, Claude Mythos Preview, is not being released to the public. The coalition reads like a who’s-who of infrastructure providers that would normally be competing: AWS, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorgan Chase, Linux Foundation, Microsoft, NVIDIA, and Palo Alto Networks all signed on.

The triggering event, per Anthropic, is that Mythos Preview has gotten too good at offense. On CyberGym, it scored 83.1%, up from Opus 4.6’s 66.6%. In internal evaluations the model autonomously uncovered a 27-year-old vulnerability in OpenBSD and a 16-year-old flaw in FFmpeg that automated fuzzers had hit roughly five million times without catching. It also chained kernel-level bugs in Linux into a working privilege-escalation exploit, with no human steering.

Project Glasswing grants access to Mythos Preview to the 12 consortium partners plus 40+ organizations maintaining critical open-source infrastructure. Anthropic is contributing $100 million in model-usage credits and $4 million in cash donations to open-source security groups. Partners deploy it on their own codebases and the open-source components they depend on.

“By giving the maintainers of these critical open source codebases access to a new generation of AI models that can proactively identify and fix vulnerabilities at scale, Project Glasswing offers a credible path to changing that equation.”
– Jim Zemlin, CEO, Linux Foundation

The defensive mandate is specific: Mythos Preview points at the software supply chain, the libraries, protocols, and operating-system components that power everything else. The research-preview price is $25 per million input tokens, $125 per million output tokens after free credits, roughly 3× Opus 4.6’s rate. That price is Anthropic’s way of saying this is not a general-availability product even for the partners that have it.

Why We’re Watching

Anthropic is publicly declaring that a model it has built is too dangerous to ship without mitigations, and is still making it useful by gating access to vetted defenders. Until now, frontier labs have chosen between releasing broadly with guardrails (GPT-4, Claude Opus 4.6) or keeping a capability internal indefinitely. Glasswing is a third option: selective deployment to a named coalition under a charter. The pattern will repeat, biosecurity, financial-market manipulation, any domain where a capable agent tips an offense-defense balance. Whether you agree Mythos Preview sits on the wrong side of that tip is a separate question, what’s new is that the company is saying so publicly and building a deployment model around it. For the African security ecosystem, which already runs disproportionately on volunteer-maintained open source and rarely has enterprise security budgets, defenders getting frontier access early actually helps, provided the coalition extends past the current Western-enterprise roster. It hasn’t yet.

The competitor race is now about whether OpenAI, Google DeepMind, or Meta have something comparable internally that they’re choosing not to release. If yes, expect an equivalent announcement within two quarters. If no, this is the moment Anthropic pulled decisively ahead on agentic coding and everyone else has to explain why.

Watch the first six months for measurable wins: fixes merged into the Linux kernel, OpenBSD, FFmpeg, or the other named targets. If the coalition produces only announcements, Glasswing becomes a cautionary tale about hype cycles. Watch the guardrail design on the general-release Opus successor, Anthropic has said offensive capability will be throttled, how and whether those guardrails survive jailbreaks is the real technical disclosure. And watch who else signs on, CNCF, Apache, European CERTs, Chinese and Indian OSS foundations. The current 12 are Western enterprise. A coalition that stays Western enterprise is a marketing posture. A coalition that goes global is a norm.

Sources

Categories
AI Learn

What is prompt caching, and why is it the single biggest lever on your AI bill in 2026

If you’re running anything on top of a large language model in 2026, an agent, a RAG pipeline, a customer-support bot, a code assistant, and you aren’t using prompt caching, you are probably paying between 5× and 10× more than you need to. That’s not a marketing claim; it’s what the pricing tables on Anthropic’s, OpenAI’s, and Google’s own API docs say when you do the math.

This post is a plain-English walkthrough of what prompt caching is, how it actually works under the hood, which provider charges what, and the two or three ways teams still get it wrong. No prior ML background assumed.

The short answer

Prompt caching is a way to tell an LLM API that the first several thousand tokens of a request will be the same every time, so it should not re-read them from scratch but pick up from where it left off. The API does, and it charges you roughly 10% of the normal input-token price for the re-read portion. Anthropic gives the deepest discount (90% off reads), Google gives 75% off reads, OpenAI gives 50% off reads, and Google and OpenAI do it automatically for any prompt long enough to qualify.

In return, the first request (the “write”) costs a bit more than normal, 25% extra on Anthropic for a 5-minute cache, 100% extra for a 1-hour cache. You break even on Anthropic’s 5-minute cache after one hit, and on the 1-hour cache after two.

If your prompts reuse any significant prefix, a system prompt, a RAG context, tool schemas, few-shot examples, you should be caching. If they don’t, you shouldn’t. That’s really it.

The long answer

What the model is actually doing

A transformer-based LLM processes your prompt by turning each token into a vector, then running that vector through dozens of attention layers. At each layer, the model computes two big matrices, called Keys and Values, or K and V, that encode how every token in the prompt relates to every other token. Generating even a single new output token requires the K and V matrices for the entire prompt.

Normally, every API request recomputes K and V from scratch. That’s the expensive part, the matrix multiplications scale with the square of the prompt length, which is why long prompts are disproportionately slow and costly.

Prompt caching changes the contract. On the first request, the provider computes K and V for your prompt as usual, but also saves them in memory, keyed to the exact sequence of tokens that produced them. On subsequent requests that start with the same prefix, the provider jumps straight to the cached matrices and only computes fresh K and V for the tokens that come after the cache boundary. You pay for the skipped computation at a steep discount, because the cost to the provider is essentially zero, it’s just reading back memory it already had.

That’s why the discount is 90% rather than 100%: there’s still memory bandwidth, occasional cache misses where the entry was evicted, and some housekeeping. But it’s close to free.

The “bookmark” analogy

If the above was too dense, here’s the version that holds up:

Imagine you’re reading a dense reference book to answer questions for a stream of callers. The first caller asks something, and you read the whole introduction, slow, thorough, expensive. But now the introduction is loaded in your head. When the second caller asks a different question, you don’t reread the intro; you bookmark where you stopped and jump straight to their question, using the context you already have. The bookmark costs something to place, you had to pay attention to create it, but every subsequent jump is 90% cheaper than re-reading.

Prompt caching is the bookmark. The static prefix is the introduction. Your dynamic question is whatever comes after the bookmark.

What each provider charges in April 2026

Anthropic (Claude Sonnet 4.6 at $3 per million input tokens as the baseline):
– Write cost (5-min cache): 1.25× base = $3.75/MTok
– Write cost (1-hour cache): 2.0× base = $6.00/MTok
– Read cost: 0.10× base = $0.30/MTok (90% discount)
– Requires explicit cache_control on the prompt block
– Up to 4 cache breakpoints per request
– Minimum useful size: ~1,024 tokens

OpenAI (GPT-4 class, automatic):
– Write cost: same as normal input price (no premium)
– Read cost: 50% of input price
– Fully automatic for prompts ≥1,024 tokens, no code changes required
– Cache TTL: 5–10 minutes, not configurable

Google Gemini:
– Write cost: free to cache
– Read cost: 25% of input price (75% discount, most aggressive)
– Storage fees: charged per hour cached
– Minimum prompt size: 32,768 tokens (highest barrier to entry)
– Default TTL: 1 hour, configurable up to 24 hours

Three very different philosophies. Anthropic treats caching as an explicit optimization with the deepest payoff and the most knobs. OpenAI treats it as a transparent efficiency gain users shouldn’t have to think about. Google treats it as an explicit long-context service with the steepest read discount but a 32K minimum that rules out most conversational use.

When caching helps vs. when it hurts

Caching is a win when:
– Your system prompt is long and doesn’t change between requests (agents, RAG systems, persona-heavy chatbots)
– You run many-shot examples or tool schemas as part of every call
– You batch-process documents against the same instructions
– You hold code files in context across a coding session

Caching hurts, or does nothing, when:
– Your prompts are short (below the minimum threshold)
– Every request has a fundamentally different prefix
– You make requests infrequently enough that the cache expires between hits (Anthropic 5-min, OpenAI 5–10 min)
– Your workload is output-heavy with short inputs, the caching discount only applies to input tokens

Why this matters in 2026

Agentic AI is the reason. Every agent loop, whether it’s Claude Code iterating on a repo, a GPT-based research assistant walking through a browser, or a Gemini pipeline summarizing ten documents at once, shares a huge fixed prefix (system prompt, tool definitions, often a long document context) across dozens of model calls. Without caching, you pay full freight for that prefix on every turn. With caching, you pay it once per session plus 10% per turn thereafter. At typical agent loop lengths (10–50 turns), caching is the difference between an application that’s economical and one that isn’t.

The other reason is that the default cost ceiling is rising. As frontier models get more expensive per token (Anthropic’s Mythos Preview preview pricing is 3× Opus 4.6), the gap between the cached and uncached price grows in absolute terms. Teams that adopted caching in 2024 got a 5× discount on a cheap model. Teams adopting in 2026 are getting a 10× discount on a much more expensive one.

Common misconceptions

  1. All cached tokens cost 90% less. Only the read tokens. Writes cost extra (25–100% on Anthropic). If you never get a cache hit, one-off requests, infrequent workloads, prefixes that always change, caching makes your bill higher.

  2. Cache everything with the longest TTL. Longer TTLs cost more up front. Anthropic’s 1-hour cache doubles your write cost. If your reuse pattern is under 5 minutes, don’t pay for an hour.

  3. Placement doesn’t matter. Where you put the cache breakpoint is the whole game. If your static system prompt is followed by a timestamp, a user ID, or any dynamic content before the breakpoint, the cache key changes on every request and you never hit. Rule: stable content first, dynamic content last, cache breakpoint between them.

Where to learn more

Sources

Categories
News

Tether just launched a wallet. The stablecoin giant now competes with MetaMask.

Tether just built a wallet.

Tether, the issuer of USDT, launched its own consumer self-custody wallet this week. It supports USDT, bitcoin, and Tether’s gold-backed XAUT token at launch, and it sits directly on the slot MetaMask and Phantom currently occupy: the default app a normal person uses to hold and send crypto.

This is the kind of move that doesn’t trend on crypto Twitter because it isn’t a price story. It might be the most consequential distribution move in crypto this year.

Tether’s reach, via exchanges, remittance corridors, and merchants across emerging markets, is genuinely massive. Until this week, the company had no owned distribution to convert that funnel. That changed.

“With more than 570 million people already using Tether’s technology, the next step is making that digital infrastructure even more accessible and usable by the end users,”
Paolo Ardoino, CEO, Tether

Ardoino framed the product around removing complexity while keeping self-custody intact, positioning it as the people’s wallet for mainstream users rather than crypto natives.

The 570 million figure is Ardoino’s own count. Read it as a marketing number, it aggregates USDT holders, platform users, and partner-app reach rather than direct Tether Wallet users (currently zero, the product just launched). Even discounted substantially, it’s an enormous on-deck audience.

The product itself makes three choices worth naming. Email-style identifiers, so users see name@domain rather than 42-character hex. Non-custodial by default, Tether is not holding keys. And a deliberately narrow asset set, USDT, BTC, XAUT, with no ETH, no Solana, no meme coins. That last one is a positioning shot: this is a wallet for stored value, not a playground for speculation.

Nothing here is technically novel. ENS has offered email-style addresses since 2017. Non-custodial wallets are a commodity. What’s new is who’s shipping it and how they plan to distribute it.

Why We’re Watching

Tether owns the most powerful top-of-funnel in crypto and finally has a wallet to pipe it into. That matters nowhere more than in Africa, where USDT is already the default dollar. Chainalysis put Sub-Saharan Africa’s on-chain stablecoin activity at roughly 43% of the region’s total crypto volume in its 2024 Geography report, with Nigeria alone absorbing tens of billions in crypto transfers in the preceding year, the majority dollar-denominated. Triple-A pegs Nigerian crypto ownership at north of 10% of adults, among the highest in the world. These aren’t speculators. They’re people using stablecoins to get paid, to save, and to move money home. Every one of them currently holds USDT somewhere Tether does not control, Binance, Yellow Card, Bitnob, a MetaMask install. Tether Wallet is the first owned product that can convert that holder relationship into a retention relationship, and the first emerging market on the continent that tips hard toward it is the one where MetaMask’s narrative starts to crack.

The corresponding risk is concentration. A world where Tether is both the dominant stablecoin issuer and the wallet most users hold it in is a world where a single entity sits closer to the center of a lot of value transfer. That’s a structure MiCA, SEC, and Singapore’s MAS will have opinions about.

If Tether converts the funnel, it becomes a wallet company the way Apple became a payments company, by attaching a product to a distribution engine it already owned.

Watch the 90-day install numbers. Five million installs means MetaMask has a serious problem. Five hundred thousand means this is a footnote. Watch Nigeria, Kenya, and Argentina specifically, those three corridors will tell you whether emerging-market holders will follow the issuer into the issuer’s app.

Sources