This AI Rewrites Its Own Code — MiniMax M2.7's Self-Evolution Experiment
MiniMax M2.7 autonomously improved itself over 100+ iterations, scoring 56.22% on SWE-Pro — near Claude Opus 4.6 levels — at 1/50th the price.

56.22% — And Why That Number Matters
SWE-Pro is one of the hardest benchmarks in AI right now. It doesn't test whether a model can chat or write poetry — it tests whether a model can read real GitHub issues, understand codebases, and actually fix bugs. Think of it as a practical software engineering exam for AI.
A model just scored 56.22% on it. That's within striking distance of Claude Opus 4.6's best (57.3%) and on par with GPT-5.3 Codex.
The model is called M2.7. The company behind it is MiniMax — a Chinese AI startup most people outside China haven't heard of.
Here's the kicker: the way it reached that score was fundamentally different from anything we've seen before.
The Backstory — Who Is MiniMax?
MiniMax was founded in 2021 by Yan Junjie, a former VP at SenseTime, one of China's biggest computer vision companies. The startup initially focused on consumer AI products for the Chinese market — music generation, video creation, that sort of thing.
But as the LLM race heated up globally, MiniMax pivoted hard. Its M2.5 model, released in late 2025, showed surprisingly strong performance on coding and agent benchmarks. It was a signal that this company had bigger ambitions than consumer apps.
Then in early April 2026, M2.7 dropped. And the AI community took notice.
| Model | SWE-Pro | PinchBench | Price (input/output per 1M tokens) | Key Feature |
|---|---|---|---|---|
| MiniMax M2.7 | 56.22% | 86.2% | $0.30 / $1.20 | Self-evolving |
| Claude Opus 4.6 | 57.3% | 87.4% | $15 / $75 | Best all-around |
| GPT-5.3 Codex | 56.22% | 84.1% | $10 / $30 | Code-focused |
| GPT-5.4 | 57.7% | 85.3% | $12 / $60 | Latest frontier |
Look at those prices. M2.7's input tokens cost 1/50th of Claude Opus. Output tokens are 1/60th.
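Those ratios fall straight out of the table's headline per-1M-token prices. A quick sanity check (list prices only — a real bill depends on each workload's input/output token mix):

```python
# Headline per-1M-token prices from the comparison table above.
PRICES = {
    "MiniMax M2.7":    {"input": 0.30, "output": 1.20},
    "Claude Opus 4.6": {"input": 15.0, "output": 75.0},
}

def ratio(expensive: str, cheap: str, kind: str) -> float:
    """How many times more the expensive model costs per token of this kind."""
    return PRICES[expensive][kind] / PRICES[cheap][kind]

print(ratio("Claude Opus 4.6", "MiniMax M2.7", "input"))   # 50.0
print(ratio("Claude Opus 4.6", "MiniMax M2.7", "output"))  # 62.5
```

So "1/50th" is exact for input tokens, and "1/60th" rounds the 62.5x output-token gap.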
How Self-Evolution Actually Works
The Loop That Changes Everything
Normally, training an AI model requires heavy human involvement. Researchers prepare data, write training code, analyze results, tune hyperparameters, and repeat this cycle hundreds of times.
M2.7 handled 30–50% of this process on its own.
Here's the specific loop it ran autonomously:
- Analyze failed task trajectories
- Diagnose root causes of failures
- Modify its scaffold code (the framework it operates within)
- Run evaluations with the modified version
- Compare results against the previous version
- Keep improvements, revert regressions
It ran this cycle over 100 times without human intervention. The model was debugging itself, rewriting its own operational code, and measuring whether those changes actually helped.
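The six-step cycle above amounts to a greedy optimization loop over the scaffold. Here is a minimal Python sketch of that shape — everything in it is a hypothetical illustration (the function names, the noisy stand-in evaluator, and the keep/revert policy are assumptions, not MiniMax's actual implementation):

```python
import random

def evaluate(scaffold: dict) -> float:
    """Stand-in for running the benchmark suite against one scaffold
    version and returning a pass rate. Hypothetical placeholder."""
    return scaffold["quality"] + random.gauss(0, 0.01)

def propose_patch(scaffold: dict, failures: list) -> dict:
    """Stand-in for the model diagnosing failed trajectories and
    rewriting its own scaffold code. Hypothetical placeholder."""
    patched = dict(scaffold)
    patched["quality"] += random.gauss(0.001, 0.005)  # a patch may help or hurt
    return patched

def self_evolve(scaffold: dict, iterations: int = 100) -> dict:
    """Greedy evolve loop: keep improvements, revert regressions."""
    best_score = evaluate(scaffold)
    for _ in range(iterations):
        failures = []                        # failed task trajectories (stub)
        candidate = propose_patch(scaffold, failures)
        score = evaluate(candidate)
        if score > best_score:               # improvement: keep the patch
            scaffold, best_score = candidate, score
        # else: regression, revert (drop the candidate, keep current scaffold)
    return scaffold

final = self_evolve({"quality": 0.50})
```

The key property the article describes is the `if score > best_score` gate: changes survive only when the evaluation says they helped, so regressions get rolled back automatically with no human in the loop.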
Speed as a Weapon
M2.7 generates roughly 100 tokens per second — noticeably faster than Claude Opus 4.6 or GPT-5 class models. In agentic workflows, where a single task chains many sequential model calls, that speed compounds: each step finishes sooner, end-to-end runs shrink, and you complete more tasks per hour of infrastructure — which lowers pipeline costs across the board.
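A back-of-envelope wall-clock estimate shows why generation speed matters operationally. Only the ~100 tokens/second figure comes from the text; the step count, tokens per step, and the slower comparison rate are illustrative assumptions:

```python
def task_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock generation time for one agent step
    (ignores network latency and prompt-processing time)."""
    return output_tokens / tokens_per_second

# A hypothetical 20-step agent task emitting ~800 output tokens per step.
steps, tokens_per_step = 20, 800
fast = steps * task_seconds(tokens_per_step, 100)  # ~100 tok/s (M2.7, per the text)
slow = steps * task_seconds(tokens_per_step, 50)   # assumed slower frontier model

print(f"{fast:.0f}s vs {slow:.0f}s")  # 160s vs 320s
```

Halving per-step generation time halves the end-to-end run, and in a pipeline running hundreds of such tasks concurrently, that difference shows up directly in throughput.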
An AI that rewrites its own code. Right now it's handling 30–50% of its development workflow. But the direction matters more than the percentage — we're watching the role of human researchers shift from "builder" to "supervisor."
Built for Agents, Not Chat
M2.7 isn't trying to be your friendly chatbot. It's designed as an agent backend — something you plug into third-party tools and harnesses to execute complex tasks autonomously. That's why its benchmark focus is on SWE-Pro (coding ability) and PinchBench (agent capability) rather than general conversation quality.
The Bigger Picture — China's AI Counteroffensive
The timing of M2.7 is no coincidence. The AI model landscape is shifting fast in April 2026.
In January, Zhipu AI's GLM-5.1 — an open-source 744B parameter MoE model released under the MIT license — scored 58.4% on SWE-Pro, topping the global leaderboard and beating both GPT-5.4 and Claude Opus 4.6 in coding benchmarks.
Now MiniMax joins the charge. Chinese AI startups aren't just playing catch-up anymore — they're leading in specific domains, particularly coding and agentic tasks.
The real story here is economics. If M2.7 delivers Claude-level performance at 1/50th the price for agent workloads, the calculus for companies deploying AI agents at scale changes completely. Monthly API costs that used to run in the millions could drop to tens of thousands.
What This Means for You
Two things worth watching.
First, agent economics are about to get disrupted. The biggest barrier to putting agent systems in production has been API costs. At M2.7 pricing levels, running hundreds of concurrent agents becomes financially viable for the first time.
Second, the self-evolution paradigm is just getting started. Today it's scaffold code modifications. Tomorrow it could be models managing their own training pipelines end-to-end. The human researcher's job description is evolving from "hands-on builder" to "strategic supervisor."
There are clear limitations, of course. M2.7 is specialized for coding and agent tasks — it falls well short of Claude or GPT for general conversation and creative writing. And "self-evolution" sounds grander than it currently is. True self-evolution — where a model modifies its own architecture — remains far off.
But the direction is unmistakable. The very early signs of AI building AI are here.
