spoonai

MiniMax M3 Is Here — It Revived the Sparse Attention It Once Killed to Hit 1M Context and 15× Faster Decoding

China's MiniMax released M3 on June 1 and switched the API on the same day. The core is a new sparse attention called MSA. By the company's figures, at 1M-token context, compute per token is ~1/20 of M2 and decoding is 15×+ faster. Weights are due within 10 days.

10 min read · Source: MiniMax M3 - Research Blog (MiniMax official)
MiniMax M3 — open-weight model built on MSA sparse attention
Source: MarkTechPost

They pulled back a technique they'd killed — and built a top-tier model with it

If you want a one-glance read on where the model race is heading, look at MiniMax M3, which China's MiniMax shipped on June 1. The company unveiled its next-gen large model and turned the API on the same day. But the real story isn't the performance numbers. It's how they got them.

M3's heart is a new sparse-attention design called MSA (MiniMax Sparse Attention). The twist: MiniMax had removed sparse attention in its previous generation, M2. They once decided "this isn't working" and dropped it — then refined it and brought it back for M3. The result is a 1M-token context window, meaning you can stuff multiple books or an entire codebase in at once and work over all of it.

The efficiency numbers are striking. By the company's own figures, at 1M-token context, compute per token falls to about 1/20 of the prior M2, with prefill 9×+ faster and decoding 15×+ faster. MiniMax positions M3 as the first open-weight model to bundle frontier coding, a 1M context window, and native image/video understanding into one model. That said, some outlets explicitly flagged these benchmarks as not independently verified — so factor that in.

The players — MiniMax, MSA, and 'open-weight Chinese models'

First, MiniMax. It is a Shanghai-based Chinese AI company and a multimodal player that does voice and video generation, not just text. It's one pillar of the past year's wave of Chinese open models (Qwen, DeepSeek, MiniMax), shipping large language/multimodal models quickly under its "M" series. M3 is the latest, and the payoff of a "took it out in M2, put it back in M3" experiment.

Next, MSA, which is sparse attention itself. Transformers, by default, have every token "look at" every other token (full attention), so compute grows quadratically with context length. Sparse attention says, "don't look at everything, just the important parts." Designed well, it processes long context far more cheaply; designed poorly, quality drops. MiniMax pulling it from M2 may be a sign that the trade-off didn't work at the time. Bringing it back in M3 is a claim that they've found the balance.
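To make the "quadratic versus sparse" point concrete, here is a minimal sketch that counts how many (query, key) pairs attention has to score. It compares causal full attention with a simple sliding-window pattern in the style of Longformer. The window pattern and the function names are assumptions chosen for illustration; nothing here describes how MSA actually selects tokens.

```python
def full_attention_pairs(n: int) -> int:
    """(query, key) pairs in causal full attention: token i attends to tokens 1..i."""
    return n * (n + 1) // 2


def windowed_attention_pairs(n: int, window: int) -> int:
    """Pairs when each token attends only to itself and the previous window-1 tokens.

    Illustrative local-window sparsity only; not MiniMax's MSA.
    """
    if window >= n:
        return full_attention_pairs(n)
    # The first `window` tokens still see everything before them;
    # every later token sees exactly `window` keys.
    return full_attention_pairs(window) + (n - window) * window


if __name__ == "__main__":
    # Hypothetical window size, picked only to show the scaling gap.
    window = 4096
    for n in (10_000, 100_000, 1_000_000):
        full = full_attention_pairs(n)
        sparse = windowed_attention_pairs(n, window)
        print(f"n={n:>9,}  full={full:,}  windowed={sparse:,}  ratio={full / sparse:.1f}x")
```

The full count grows with the square of the context length while the windowed count grows roughly linearly, which is why the gap widens as context gets longer. The quality risk mentioned above comes from whatever the sparse pattern leaves out.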

Last is the 'open-weight' stance. MiniMax says it will release M3's model weights and a technical report within 10 days of launch, so anyone can download the model and run it on their own servers. That's the opposite of OpenAI/Anthropic's closed frontier models. Trail slightly on peak performance but win on "cheap, fast, and open": that's the Chinese open-model playbook, and M3 pushes it again.

What's inside — what '1/20' and '15×' actually mean

The most important number is 1/20 the compute. At 1M-token context, M3's per-token compute is about 1/20 of M2's. Why does that matter? The cost of handling long context is the price of the service. Process a million-token job at one-twentieth the cost, and applications that were "too expensive to do" (whole-codebase analysis, querying long document bundles) suddenly become economical.
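The arithmetic behind "one-twentieth the cost" can be sketched in a few lines. This assumes serving cost scales linearly with per-token compute, which is a simplification, and the baseline price of 20.0 per million tokens is a made-up placeholder, not a real M2 or M3 price.

```python
def job_cost(tokens: int, cost_per_million: float, compute_ratio: float = 1.0) -> float:
    """Cost of processing `tokens`, scaled by a relative compute ratio.

    Assumes cost is proportional to compute (a simplification).
    """
    return tokens / 1_000_000 * cost_per_million * compute_ratio


if __name__ == "__main__":
    baseline_price = 20.0    # hypothetical currency units per 1M tokens
    claimed_ratio = 1 / 20   # company-reported figure at 1M-token context

    before = job_cost(1_000_000, baseline_price)
    after = job_cost(1_000_000, baseline_price, claimed_ratio)
    print(f"1M-token job: {before:.2f} -> {after:.2f}")
```

Real pricing also depends on memory, batching, and margins, so treat this as a way to reason about the claim rather than a forecast.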

Next, speed: prefill 9×+, decoding 15×+. Prefill is reading the input in one pass; decoding is generating the answer one token at a time. User-felt speed is mostly decoding, so 15× faster means much snappier responses on the same hardware, or far more concurrent users at the same speed. M3 is chasing two goals that usually pull against each other: long context and fast generation.
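A rough way to see how the two speedups combine: model response time as one prefill pass plus a per-token decoding cost. The sketch below uses that simple model; the baseline timings are hypothetical placeholders, and only the 9× and 15× factors come from the announcement.

```python
from dataclasses import dataclass


@dataclass
class LatencyProfile:
    prefill_seconds: float           # time to read the whole prompt once
    decode_seconds_per_token: float  # time to generate each output token

    def total_seconds(self, output_tokens: int) -> float:
        """End-to-end time: one prefill pass plus sequential decoding."""
        return self.prefill_seconds + output_tokens * self.decode_seconds_per_token


def apply_speedups(base: LatencyProfile, prefill_speedup: float,
                   decode_speedup: float) -> LatencyProfile:
    """Return a new profile with each phase divided by its speedup factor."""
    return LatencyProfile(
        prefill_seconds=base.prefill_seconds / prefill_speedup,
        decode_seconds_per_token=base.decode_seconds_per_token / decode_speedup,
    )


if __name__ == "__main__":
    # Hypothetical baseline for a long-context request; not measured numbers.
    baseline = LatencyProfile(prefill_seconds=90.0, decode_seconds_per_token=0.03)
    claimed = apply_speedups(baseline, prefill_speedup=9.0, decode_speedup=15.0)
    for name, profile in (("baseline", baseline), ("claimed", claimed)):
        print(f"{name}: {profile.total_seconds(1000):.1f} s for 1000 output tokens")
```

Because decoding cost is paid per generated token, the longer the answer, the more the decoding factor dominates what the user feels.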

Third, multimodal + coding in one. MiniMax positions M3 as the first open-weight model to bundle frontier coding, a 1M context window, and native image/video understanding. Usually you get a coding-specialized model here and a multimodal model there — the claim is that M3 fuses them.

Metric             | M3 (as announced)           | Why it matters
Context window     | 1,000,000 tokens            | whole codebases/long docs at once
Compute per token  | ~1/20 of M2 (at 1M)         | long-context cost collapses
Prefill speed      | 9×+                         | faster input processing
Decoding speed     | 15×+                        | felt response speed
Release            | weights + report in 10 days | open-weight
Caveat             | some benchmarks unverified  | recheck after weights drop

What each side gets — MiniMax, developers, the industry

For MiniMax, it reclaims the "technical narrative." The story of removing something in M2 and reviving it refined in M3 isn't just another model drop. It's a message: "we understand attention architecture deeply." Release it open-weight and developers and researchers worldwide validate and use it for free, building ecosystem and reputation along the way. Revenue may be thinner than with closed models, but it's effective at cementing the "China-origin efficiency frontier" brand.

For developers and startups, the key is "API live day one + weights within 10 days." Try it now via cloud API, and soon self-host the weights on your own infra. If you can use 1M context at 1/20 the cost, you can build products over long documents and large codebases far cheaper. Where data sovereignty matters, self-hosting an open-weight model becomes an option instead of a closed API.

For the industry, the axis of the frontier race shifts to efficiency again. After Qwen3 Coder Next and MiniMax M2.x Highspeed in late May, M3 keeps the Chinese camp pushing "how cheaply and quickly can you handle long context" over "top score." Even without the #1 absolute performance, dominating the value segment on efficiency eventually drags the closed camp toward lower prices and faster speeds.

Prior cases — the mixed history of sparse attention and open models

Sparse attention isn't a new idea. Its history shows why M3 is a gamble.

An old idea, repeated failures. Early sparse-attention research like Longformer and BigBird is years old. The promise of cheap long context was appealing, but in real large models it often hit a wall: "quality dips subtly versus full attention." MiniMax dropping it in M2 may fit the same pattern: good efficiency, too much quality cost. Bringing it back in M3 claims they've solved the trade-off; whether that's real will only become clear through independent verification after the weights drop.

A success in open-model catch-up. As DeepSeek showed, Chinese open models can create a global moment overnight with "efficiency + openness." Release the weights and the whole world dissects, validates, and fine-tunes them, and reputation and ecosystem explode in the process. M3 is aiming down the same path. But remember the flip side — when "announced numbers" diverge from "measured numbers," the backlash is big too.

The 'unverified benchmark' warning. Outlets like TechTimes explicitly noted M3's frontier claims aren't independently verified. AI model launches routinely cherry-pick favorable benchmarks. So treat numbers like 1/20 and 15× with a "company-reported" tag, and remember the real value surfaces only once weights and the technical report are out and the community runs them under equal conditions. Right now it's "impressive claim," not "verified fact."

Counter-plays — the closed camp and other open models

How do the closed frontier players (OpenAI, Anthropic, Google) respond? They compete on "peak performance + safety + integrated products." If efficiency open models like M3 eat the value segment, the closed camp eventually gets dragged into cutting prices or offering long context more cheaply. The recent price cuts in Flash/Mini-tier models read as the result of exactly that pressure. Their homework: keep the absolute-performance gap while defending the low-cost tier.

Against other Chinese/open players (Qwen, DeepSeek, etc.), MiniMax is both ally and rival. They all chant "efficiency + open," so differentiation narrows to "who serves longer context, cheaper, more multimodal." If M3's MSA truly cracks the trade-off, rival open models likely follow with similar sparse-attention designs. Attention-efficiency becomes a new front within the open camp.

Cloud and inference-infra vendors are a variable too. If a model that runs 1M context at 1/20 the cost becomes standard, the unit economics of inference services shift. More tokens per GPU is both an opportunity and a source of price pressure for serving providers. Who serves an efficient open model cheapest becomes its own race.

So what changes — by persona

If you build AI products, M3 signals "long context is getting cheap." Once weights are out, you can self-host and keep data sovereignty while using 1M context. But this is the announced-numbers stage, so before adopting, run your own benchmarks under equal conditions after weights drop. "Cheap and long" doesn't guarantee "accurate on your task."
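"Run your own benchmarks under equal conditions" can start as something this small: the same test cases, the same scoring rule, and the same timer for every model. In the sketch below, each model is just a Python callable that maps a prompt to a response. Those callables, the substring scoring rule, and the stub models are all hypothetical stand-ins; wire in whatever client you actually use.

```python
import time
from typing import Callable, Dict, List, Tuple

GenerateFn = Callable[[str], str]


def evaluate(models: Dict[str, GenerateFn],
             cases: List[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Run every model on the same cases and report accuracy and wall time.

    A case counts as correct when the expected string appears in the response.
    """
    results: Dict[str, Dict[str, float]] = {}
    for name, generate in models.items():
        correct = 0
        start = time.perf_counter()
        for prompt, expected in cases:
            if expected in generate(prompt):
                correct += 1
        elapsed = time.perf_counter() - start
        results[name] = {"accuracy": correct / len(cases), "seconds": elapsed}
    return results


if __name__ == "__main__":
    # Stub models for demonstration only; replace with real API or local calls.
    def echo_model(prompt: str) -> str:
        return prompt

    def empty_model(prompt: str) -> str:
        return ""

    cases = [("The capital of France is Paris.", "Paris"),
             ("2 + 2 = 4", "4")]
    for name, stats in evaluate({"echo": echo_model, "empty": empty_model}, cases).items():
        print(name, stats)
```

Substring matching is a crude score, so swap in a task-specific check for your own workload. The point is that every model sees identical inputs and is measured the same way.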

If you invest in or strategize around AI, the key is the acceleration of the efficiency axis. As the frontier race shifts from "top score" to "cost and speed per token," the battleground is unit price, not model bragging rights. If the Chinese open camp keeps pushing this axis, overall AI usage prices fall structurally — good for application markets (apps/agents), pressure on standalone model sales.

If you're just watching the tech, there's a fun lesson: "a shelved technique isn't always dead." Reviving in M3 the sparse attention dropped in M2 shows AI research isn't a straight line. It advances by oscillating between "works" and "doesn't." And every claim becomes real only when weights are out and anyone can verify. M3's real verdict starts once the weights land, in the community's hands.

FAQ — quick answers

Should I believe the 1/20 compute and 15× decoding claims? Not yet as verified fact — treat them as company-reported until the weights and technical report land (within ~10 days of launch) and the community reproduces them under equal conditions. Some outlets explicitly flagged the frontier claims as not independently verified. Impressive claim now; verified result later.

Why did MiniMax remove sparse attention in M2 and bring it back in M3? Sparse attention trades efficiency for a risk of quality loss. The M2 removal likely reflects a trade-off that didn't pay off; the M3 revival is a claim that they've refined the design enough to keep the efficiency without the quality hit. Whether that's real is exactly what independent testing will reveal.

What does "open-weight" change for me? A lot, if data sovereignty or cost matters. Once weights drop, you can self-host on your own infrastructure instead of routing data through a closed API — and use 1M-token context at a fraction of the prior cost. The catch: you take on the ops burden of serving a large model yourself.

What's the bigger trend here? The frontier race is shifting from "top benchmark score" to "cost and speed per token." The Chinese open camp keeps pushing this efficiency axis, which structurally lowers AI usage prices over time — good news for app and agent builders, pressure on standalone model sales.

Bottom line: M3 is a genuinely interesting architectural bet (reviving a discarded technique to chase long context cheaply) wrapped in claims that aren't verified yet. Read it as a strong signal of where the open-model race is heading (efficiency, openness, long context), but wait for the weights and independent benchmarks before treating any specific number as fact. The real story will be told once the weights are out and in the community's hands.
