
Those who bet on a surge in token prices in 2026 have been proven wrong twice in just one week.
On May 22, DeepSeek announced a permanent price reduction for DeepSeek-V4-Pro; early this morning, Xiaomi's MiMo-V2.5 series followed suit with price cuts of up to 99%.
At the same time, Xiaomi reworked its Token Plan billing: subscription prices stay the same, but the usable quota has grown to 5 to 8 times the original amount.

Unsurprisingly, discussion of the Xiaomi MiMo price cut quickly surged on Reddit, X, and various overseas developer forums.

However, at a time when the entire industry is lamenting unbearable token costs, why is Xiaomi daring to go against the tide and lower prices? More importantly, where will this price reduction push the AI industry?
Token prices plummet, the AI industry welcomes its strictest father yet.
According to Xiaomi's announcement, the API for its MiMo-V2.5 series of large models is getting a permanent price reduction of up to 99%, and pricing will no longer vary with input length. The new prices took effect globally at 00:00 Beijing time on May 27.

However, a 99% reduction does not mean that every call will be charged at the lowest price; the key variable is whether the input cache is hit.
Taking MiMo-V2.5-Pro as an example, once the cache is hit, the input price drops to approximately 0.025 yuan per million tokens. However, if the input cache is not hit, the price remains at 3 yuan per million tokens, and the output price is 6 yuan per million tokens.
In other words, the extremely low price only materializes when a large share of input tokens hit the cache.
This pricing is very attractive for workloads with highly repetitive context: high-frequency agent calls, multi-round coding tasks, and batch inference. However, if your application has a poor cache hit rate, the actual cost will not come close to the lowest price.
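To see how much the hit rate matters, here is a minimal sketch that blends the three MiMo-V2.5-Pro prices quoted above. The function name, the workload size, and the idea of treating the hit rate as a simple fraction of input tokens are assumptions made for illustration, not Xiaomi's actual billing logic.

```python
# Prices quoted in the article for MiMo-V2.5-Pro, in yuan per million tokens.
CACHE_HIT_INPUT = 0.025
CACHE_MISS_INPUT = 3.0
OUTPUT = 6.0


def workload_cost_yuan(input_tokens, output_tokens, hit_rate):
    """Estimate cost, assuming `hit_rate` is the fraction of input tokens served from cache."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    hit_tokens = input_tokens * hit_rate
    miss_tokens = input_tokens - hit_tokens
    total = (
        hit_tokens * CACHE_HIT_INPUT
        + miss_tokens * CACHE_MISS_INPUT
        + output_tokens * OUTPUT
    )
    return total / 1_000_000


# Hypothetical workload: 100 million input tokens, 2 million output tokens.
for rate in (0.0, 0.5, 0.9, 0.99):
    cost = workload_cost_yuan(100_000_000, 2_000_000, rate)
    print(f"hit rate {rate:.0%}: about {cost:.2f} yuan")
```

Under these assumptions the same workload costs roughly 312 yuan with no cache hits and about 17.5 yuan at a 99% hit rate, so the bill is dominated by cache misses and output rather than by the headline cache-hit price.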
The Token Plan operates on a similar logic.

Xiaomi emphasized that pricing will remain unchanged, but credits will be significantly increased: the monthly fees for the four tiers of Lite, Standard, Pro, and Max will remain at 39 yuan, 99 yuan, 329 yuan, and 659 yuan, respectively. The credit limits will also be increased from 60 million, 200 million, 700 million, and 1.6 billion to 4.1 billion, 11 billion, 38 billion, and 82 billion, respectively.
Under the new conversion rates, MiMo-V2.5-Pro consumes 2.5 Credits per input token on a cache hit, 300 Credits per input token on a cache miss, and 600 Credits per output token.
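The conversion can be worked through in a few lines. This is a rough sketch using only the tier quotas and Credit rates quoted above; the function names and the sample workload are hypothetical, and real plans may carry limits the article does not mention.

```python
# Credit rates per token for MiMo-V2.5-Pro, as quoted in the article.
CREDITS_PER_TOKEN = {
    "cache_hit_input": 2.5,
    "cache_miss_input": 300.0,
    "output": 600.0,
}

# Monthly fee in yuan and new monthly Credit quota for each tier, as quoted.
PLANS = {
    "Lite": (39, 4_100_000_000),
    "Standard": (99, 11_000_000_000),
    "Pro": (329, 38_000_000_000),
    "Max": (659, 82_000_000_000),
}


def max_tokens(credits, kind):
    """Tokens a quota would cover if spent entirely on one kind of token."""
    return credits / CREDITS_PER_TOKEN[kind]


def credits_used(hit_tokens, miss_tokens, output_tokens):
    """Credits consumed by a workload with the given token mix."""
    return (
        hit_tokens * CREDITS_PER_TOKEN["cache_hit_input"]
        + miss_tokens * CREDITS_PER_TOKEN["cache_miss_input"]
        + output_tokens * CREDITS_PER_TOKEN["output"]
    )


for name, (fee, quota) in PLANS.items():
    hits = max_tokens(quota, "cache_hit_input")
    misses = max_tokens(quota, "cache_miss_input")
    print(f"{name} ({fee} yuan): up to {hits:,.0f} cached input tokens "
          f"or {misses:,.0f} uncached input tokens")
```

For the Lite tier, the same 4.1 billion Credits cover about 1.64 billion cached input tokens but fewer than 14 million uncached ones, which is the same cache-hit logic as the API price list.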


This is exactly the same strategy used by DeepSeek.
Here's a brief timeline: the DeepSeek V4 preview was released on April 24; the next day, V4-Pro was offered at 25% of its list price (a 75% discount); on April 26, the cache-hit price dropped to one-tenth of its initial level; and on May 22, the temporary discount became permanent, with V4-Pro fixed at one-quarter of its original price.
After these adjustments, the input cache-hit price of DeepSeek-V4-Pro dropped from 0.1 yuan to 0.025 yuan per million tokens. With Xiaomi MiMo-V2.5-Pro quickly following suit, the input cache-hit price for domestic models has effectively settled at this benchmark.

Both DeepSeek and Xiaomi have concentrated their most aggressive pricing on cache hits, and the reason is not complicated. Large models are shifting from chat to actual work, and agents are where token consumption truly multiplies.
In chat scenarios, the user asks a question and the model answers, making the cost relatively easy to estimate.
However, in an agent-based scenario, a task may involve long contexts, multiple rounds of inference, code generation, tool calls, webpage reading, file analysis, and result verification. The user only sees the final output, while the backend may have already processed multiple requests and a large number of context reads.
This is where cache hits are important.
Agents, code assistants, and long-context applications share a common characteristic: much of the content appears repeatedly. This includes system prompts, project code, API documentation, tool descriptions, conversation history, and dependency files. Recomputing this content on every request would be very costly; if it can be cached and billed at the cache-hit rate the next time it is used, the inference cost drops significantly.
In other words, the lower the cache hit price, the more suitable it is for real-world work scenarios involving high frequency, multiple rounds, and long contexts. The low prices offered by DeepSeek and Xiaomi are actually aimed at attracting developers and high-frequency applications, encouraging more agents, code assistants, and office automation applications to run on their models.
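A small simulation makes the effect concrete. The sketch below prices a hypothetical multi-turn agent session at the MiMo-V2.5-Pro rates quoted earlier, once with no caching and once assuming ideal prefix caching, where everything sent or generated in earlier turns is billed as a cache hit. The session shape, the function name, and the perfect-cache assumption are all illustrative; real hit rates depend on the provider's cache policy.

```python
# Yuan per million tokens for MiMo-V2.5-Pro, as quoted in the article.
HIT, MISS, OUT = 0.025, 3.0, 6.0


def session_cost_yuan(prefix, new_per_turn, out_per_turn, turns, cached):
    """Cost of an agent session that resends its whole context on every turn.

    prefix: tokens of system prompt, tool descriptions, and project files.
    cached=True assumes ideal prefix caching: all previously seen context
    is billed at the cache-hit price from the second turn onward.
    """
    total = 0.0
    context = prefix  # tokens carried over from earlier turns
    for turn in range(turns):
        if cached and turn > 0:
            hit, miss = context, new_per_turn
        else:
            hit, miss = 0, context + new_per_turn
        total += (hit * HIT + miss * MISS + out_per_turn * OUT) / 1_000_000
        context += new_per_turn + out_per_turn  # history grows each turn
    return total


# Hypothetical session: 50k-token prefix, 20 turns,
# 2k new input tokens and 1k output tokens per turn.
without_cache = session_cost_yuan(50_000, 2_000, 1_000, 20, cached=False)
with_cache = session_cost_yuan(50_000, 2_000, 1_000, 20, cached=True)
print(f"no caching:    {without_cache:.3f} yuan")
print(f"ideal caching: {with_cache:.3f} yuan")
```

With these made-up numbers the session costs about 4.95 yuan without caching and about 0.43 yuan with ideal caching, which is why a cache-hit discount matters far more for agents than for one-shot chat.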

Xiaomi previously used initiatives such as MiMo Orbit and the Trillion Token Creator Incentive Program to allow more people to experience MiMo and solve real-world problems. This Trillion Token Incentive Program, launched on April 28th, saw all 100T tokens distributed ahead of schedule by 16:08 on May 26th.
From the platform's perspective, low-priced tokens and free quotas generate a massive amount of real-world usage. That usage brings complex tasks, failure samples, user feedback, agent workflows, coding scenarios, and long-context data, all of which in turn help the model and the inference system iterate.
The phenomenon of "shrimp farming" in the community can also be understood within this logic. While maximizing their spending limits, users are also helping the platform create pressure, expose problems, and accumulate data.
Therefore, this cannot be analyzed solely based on the gross profit per inference. While short-term revenue is suppressed, the gains come from developer migration, increased call volume, and genuine feedback. For model vendors aiming for a position in the Agent ecosystem, this represents a very worthwhile platform investment.
Luo Fuli's "True Fragrance Law" is rooted in brute-force engineering.
However, having the will is not enough; the key is being able to afford lower prices. What makes Xiaomi's price cut notable is how sharply it contrasts with earlier public statements by Luo Fuli, who heads the MiMo large model effort.
A month ago, Luo Fuli publicly opposed the token price war. Her assessment at the time was that low-priced tokens combined with an open third-party agent framework could easily lead to uncontrolled costs for the platform.
She noted that third-party agent frameworks often have poor context management. A single user query can trigger multiple rounds of low-value tool calls, with each request carrying an excessively long context containing over 100,000 tokens. If the platform cannot constrain this waste, the actual API cost could be dozens of times the subscription price.

She also argued that global computing power supply can no longer keep up with the token demand driven by agents. If large model companies rush into blind price wars before clarifying the cost structure of programming and agent scenarios, the result will be throttling, degraded resources, and reduced stability, ultimately harming the user experience.
Xiaomi's price cut does not overturn that judgment; it changes the premise of the price war. Luo Fuli opposed low prices that had no cost structure to support them. What Xiaomi is now presenting is an engineering solution that it believes can support low prices.
According to Xiaomi's announcement, its technical team built on SGLang HiCache to add full support for SWA (Sliding Window Attention). This cuts the amount of KV cache data moved between storage tiers such as GPU memory, CPU memory, and SSD to nearly one-seventh of the pre-optimization level, and raises the number of cacheable tokens to nearly five times the previous figure.
Xiaomi also optimized its expert parallelism scheme and its input-length bucketing strategy to improve the cluster's input throughput. Without this level of engineering capability, low prices easily turn into unsustainable subsidies. Only a sufficiently robust infrastructure can turn low prices into a long-term advantage.
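Why sliding-window attention helps with caching can be shown with simple counting. In general, a full-attention layer must keep keys and values for every token in the sequence, while a sliding-window layer only needs those for the most recent window of tokens. The sketch below counts per-sequence KV entries for a hypothetical hybrid model; the layer counts, window size, and sequence length are invented for illustration and are not MiMo's actual configuration, nor does the result reproduce the figures in Xiaomi's announcement.

```python
def kv_entries(seq_len, full_layers, swa_layers, window):
    """Count (layer, token) KV entries that must be kept for one sequence.

    Full-attention layers keep every token; sliding-window layers keep
    at most the last `window` tokens.
    """
    return full_layers * seq_len + swa_layers * min(seq_len, window)


# Hypothetical model: 16 layers total, 32,768-token sequence.
seq_len = 32_768
all_full = kv_entries(seq_len, full_layers=16, swa_layers=0, window=0)
hybrid = kv_entries(seq_len, full_layers=4, swa_layers=12, window=4_096)

print(f"all full attention: {all_full:,} entries")
print(f"hybrid with SWA:    {hybrid:,} entries")
print(f"reduction factor:   {all_full / hybrid:.2f}x")
```

A cache layer that understands this structure has less data to store and move per sequence, so the same memory and bandwidth can hold more cached tokens. How much is saved in practice depends on the real architecture and on the cache implementation.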

Price wars test engineering capabilities, as well as the strength of the support system.
Unlike pure AI model companies, Xiaomi's smartphone, automotive, IoT, and consumer electronics businesses provide it with a longer investment cycle and greater strategic patience. It can view its large-scale model services as an entry point into the AI ecosystem, avoiding the pitfall of focusing solely on short-term API revenue.
This is bad news for small and medium-sized model companies. Without a core business to fund them, strong infrastructure capabilities, or enough scale to dilute costs, they will be unable to sustain these prices in the long run.
DeepSeek's low prices have directly threatened the market positioning of many domestic model providers. With Xiaomi MiMo following suit, more manufacturers with significant scale will be forced to adjust their prices or redefine their product value. Smaller model service providers may be pushed into narrower vertical markets.

This round of price cuts is, to some extent, a market selection that favors efficient model vendors. Companies with engineering capabilities, computing power scheduling capabilities, and ecosystem entry points can withstand the pressure of lower prices. Companies that have model capabilities but cannot reduce inference costs will find themselves increasingly on the defensive.
Furthermore, the room for further price reductions is narrowing: the closer prices get to the physical cost of inference, the less a simple price cut is worth. In the next stage, model quality, agent adaptation, developer tools, ecosystem integration, service stability, and enterprise delivery capabilities will all face a new round of intense competition.
Model capabilities determine the upper limit of AI development, while inference costs determine the scale of AI adoption. Only when genuinely affordable tokens flood the application layer will we see what the next era of AI growth looks like.