
Xiaomi MiMo V2.5 Pro: When a Model Learns to Work for 11 Hours Straight

Xiaomi's MiMo V2.5 Pro — a fully open-source 1.02T MoE model with 42B active params — wrote a complete compiler in 4.3 hours, built an 8,192-line video editor in 11.5 hours, and optimized an analog chip design. The real story isn't the benchmark scores.

The competition axis is shifting from scores to stamina

Model release claims keep inflating. Every week a new model claims to match or beat GPT, Claude, or DeepSeek on some benchmark. But Xiaomi’s MiMo V2.5 Pro, released today, chose a different path: it didn’t lead with benchmark tables. Instead, it defined itself through three concrete demonstrations — writing a complete SysY compiler unsupervised (233/233 perfect score, 672 tool calls, 4.3 hours), independently building an 8,192-line video editor (1,868 tool calls, 11.5 hours), and completing an analog chip design optimization. These three tasks measure a different kind of capability: the reliability to sustain work for half a day without human involvement.¹

This is a meaningful shift in the axis of competition. For the past two years, AI model evaluation has been overwhelmingly concentrated on static benchmark scores — GPQA, MMLU, Terminal-Bench. But MiMo V2.5 Pro’s release suggests the next battlefield may be the duration and complexity of autonomous work: how long a task chain a model can execute without intervention, how it manages its own context and tools across that span, and whether it can self-recover from errors along the way.

Three sets of numbers define this model

First set: scale and architecture. MiMo V2.5 Pro is a 1.02-trillion-parameter Mixture-of-Experts model, activating 42 billion parameters per inference pass. It uses a hybrid attention mechanism — local sliding-window attention interleaved with global attention at a 6:1 ratio, with a 128-token window. This design cuts KV-cache storage by nearly 7× under long-context conditions. It was pre-trained on 27 trillion tokens using FP8 mixed precision with native 32K sequence length, and the context window was expanded to 1 million tokens.¹
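That "nearly 7×" figure is consistent with the stated layout. A back-of-envelope sketch — assuming, purely for illustration, that every layer caches the same bytes per token and that the repeating block is six 128-token sliding-window layers per global layer:

```python
def kv_cache_ratio(context_len, window=128, local_per_global=6):
    """Estimate KV-cache reduction from interleaved local/global attention.

    Simplifying assumption (ours, not Xiaomi's published math): every layer
    caches the same bytes per token, and local layers keep only the last
    `window` tokens while global layers cache the full context.
    """
    layers = local_per_global + 1                      # one repeating block: 6 local + 1 global
    full_cache = layers * context_len                  # all-global baseline
    hybrid_cache = (local_per_global * min(window, context_len)
                    + context_len)                     # 6 windows + 1 full context
    return full_cache / hybrid_cache

# At the 1M-token context the release cites:
print(round(kv_cache_ratio(1_000_000), 2))  # ≈ 6.99, i.e. "nearly 7×"
```

At short contexts the ratio collapses toward 1×, which is why the savings only show up "under long-context conditions": the global layer's full-context cache dominates once the context far exceeds the 128-token window.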

Second set: autonomous task performance. The three tasks Xiaomi chose aren’t random — they span distinctly different intellectual forms:

  • SysY Compiler: This is a project from Peking University’s Compiler Principles course, requiring students to implement a complete compiler pipeline from scratch in Rust — lexer, parser, AST, Koopa IR codegen, RISC-V backend, performance optimization. The reference implementation typically takes a PKU CS major several weeks. MiMo V2.5 Pro finished in 4.3 hours across 672 tool calls, scoring a perfect 233/233 against hidden test suites. The telling detail: the very first compile passed 137/233 tests (59% cold-start pass rate), suggesting the architecture was designed correctly before a single test was run — not patched into correctness via trial and error. At turn 512, a refactoring pass regressed two tests; the model diagnosed the failures, recovered, and pushed forward.¹

  • Video Editor: From a few simple prompts, the model autonomously completed a fully functional multi-track video editor — timeline, trimming, cross-fades, audio mixing, and export pipeline. The final build: 8,192 lines of code, 1,868 tool calls, 11.5 hours of sustained work.¹

  • Analog Chip FVF-LDO Design: This is a graduate-level analog circuit EDA task — design and optimize a complete FVF-LDO (flipped-voltage-follower low-dropout regulator) from scratch in the TSMC 180nm CMOS process. The model iterated through an ngspice simulation loop — adjusting parameters, reading waveforms, re-adjusting — and within roughly an hour brought every target metric within spec, with four of them improved by an order of magnitude over its own initial attempt.¹
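Structurally, the FVF-LDO run above is a simple control loop: simulate, read the metrics, adjust parameters, repeat until everything is in spec. A minimal sketch of that loop — the `simulate`, `in_spec`, and `adjust` callables below are illustrative stand-ins, not the actual ngspice harness or a real LDO model:

```python
def run_loop(params, simulate, in_spec, adjust, max_iters=100):
    """Generic simulate -> check -> adjust loop, as in the FVF-LDO task."""
    for step in range(1, max_iters + 1):
        metrics = simulate(params)        # stand-in for an ngspice batch run
        if in_spec(metrics):
            return params, metrics, step  # every target metric within spec
        params = adjust(params, metrics)  # "read the waveform, re-tune"
    raise RuntimeError("no convergence within iteration budget")

# Toy example (hypothetical physics): widen a device ratio until
# dropout voltage falls below a 0.2 V spec.
simulate = lambda p: {"dropout_v": 1.0 / p["w_ratio"]}
in_spec  = lambda m: m["dropout_v"] < 0.2
adjust   = lambda p, m: {"w_ratio": p["w_ratio"] * 1.5}

params, metrics, steps = run_loop({"w_ratio": 1.0}, simulate, in_spec, adjust)
```

The hard part the model handled is everything hidden inside `adjust`: deciding *which* parameter to move, in which direction, based on reading simulation waveforms — not the loop scaffolding itself.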

Third set: token efficiency. On ClawEval, MiMo V2.5 Pro achieves 64% Pass³ at approximately 70,000 tokens per trajectory — using 40–60% fewer tokens than Claude Opus 4.6, Gemini 3.1 Pro, and GPT-5.4 at comparable capability levels. Developers get the same level of work output with significantly less inference overhead.¹

“Harness awareness” — a new capability dimension

Xiaomi introduced a notable concept in their release post: harness awareness. Their observation is that V2.5 Pro demonstrates a structured understanding of its operating environment during long autonomous runs — it knows what tool scaffold is mediating its actions, actively manages its context window, and even shapes how its own context gets populated in service of the final objective.

This isn’t traditional “tool use.” Traditional benchmarks test whether a model can correctly invoke a tool given its description. Harness awareness tests whether a model can sustain an understanding of — and optimize — its relationship with its tool environment across a thousand-step task. The difference is analogous to the gap between “knows how to write code in an IDE” and “can set up an environment from scratch on an unfamiliar system, build a toolchain, locate bugs, and deploy to production.”

The competitive implication is clear: as static metrics (MMLU, GPQA) approach saturation and gaps between models narrow, dynamic, long-horizon, autonomous task-completion capability is emerging as the new axis of differentiation. If this trend holds, the next generation of model evaluation will shift from “test-set accuracy” to “unsupervised run distance, duration, and reliability.”

Xiaomi’s AI strategy signal

MiMo V2.5 Pro’s release logic has a subtle symmetry with DeepSeek V4:

  • DeepSeek’s differentiation: cost structure revolution — frontier capability at 1/10th the price
  • MiMo V2.5 Pro’s differentiation: autonomous work stamina — higher engineering output through longer stable operating windows

Both are open-source, both release weights on Hugging Face under permissive licenses, both are redefining what “frontier” means — just from different angles. DeepSeek is redefining the cost of the frontier. Xiaomi is redefining its shape.

Equally noteworthy is Xiaomi’s training methodology: a three-stage post-training pipeline — supervised fine-tuning to establish the base, domain-specific expert training (separate teacher models for math, safety, agentic tool use, etc., each optimized independently), and multi-teacher on-policy distillation (a single student model learns from online samples drawn from multiple expert teachers). The elegance of this architecture: it does not pursue a single omni-capable teacher model. Instead, it lets multiple specialized teachers optimize their respective domains, then fuses them into one student through distillation.¹ This “divide-and-conquer-then-fuse” training strategy may be the key technical path to long-horizon task stability.
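The release is light on implementation detail, but the general shape of multi-teacher on-policy distillation can be sketched: the student generates its own samples, each sample is routed to the relevant domain teacher, and the student is pushed toward that teacher's token distribution. Everything below — the domain names, the routing-by-domain rule, single-position logits — is an illustrative toy, not Xiaomi's actual recipe:

```python
import math
import random

random.seed(0)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) over one token position's vocabulary distribution."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Toy setup: three domain teachers over an 8-token vocabulary,
# scored at a single sampled position from the student's own rollout.
VOCAB = 8
teachers = {d: [random.gauss(0, 1) for _ in range(VOCAB)]
            for d in ("math", "safety", "agentic")}
student_logits = [random.gauss(0, 1) for _ in range(VOCAB)]

def distill_loss(domain):
    """On-policy step: route the student's sample to its domain teacher;
    KL(teacher || student) is the signal training would minimize."""
    return kl(softmax(teachers[domain]), softmax(student_logits))

losses = {d: distill_loss(d) for d in teachers}  # one loss per routed domain
```

The design point the sketch makes concrete: the student never needs one teacher that is good at everything — each sample only ever gets compared against the specialist for its own domain.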

Competitive landscape: the devaluation of benchmark scores

MiMo V2.5 Pro’s benchmark table labels multiple metrics as “best open-source” or “best overall.” But in the current competitive landscape, the marginal information value of benchmark scores is declining — when five models differ by less than two percentage points on the same test, selection criteria shift from scores to ecosystem, cost, toolchain compatibility, and domain-specific reliability.

MiMo sidestepped that fight: rather than chasing first place on every benchmark, it found a dimension that hasn’t been fully exploited — long-duration autonomous work reliability — and built a quantifiable advantage there. 8,192 lines of code, 1,868 tool calls, 11.5 hours of crash-free operation — these aren’t traditional ML metrics, but they answer the question developers actually care about more directly than another 0.5% pass@1 improvement: can it do a full afternoon’s worth of work for me?

Whether this strategy works depends on two variables: how quickly other labs adopt long-horizon evaluation standards, and how fast the developer community incorporates “autonomous work duration” into their actual model selection decisions.

Footnotes

  1. Xiaomi MiMo Official — MiMo-V2.5-Pro release announcement, covering architecture details, autonomous task descriptions, benchmark results, and training methodology, April 27, 2026: https://mimo.xiaomi.com/mimo-v2-5-pro/