Chinese models have now overtaken US ones on OpenRouter
I'm even surprised US closed, paid models were so dominant compared to Chinese open-source, mostly free or cheap models. I don't get why so many people favored US models so much. After all, who wouldn't prefer something that's almost free over something you have to pay a lot for?
There is no shortage of chips in China. This is actually wrong thinking by the mainstream.
It's not that OpenRouter is free. Even a company running local models on its own AI servers is spending money on electricity and on expensive, depreciating hardware. The problem is that if the replacement costs more than the thing it replaces, there is no point in it. What companies seem to be doing is offloading tasks to less expensive models.
Most Chinese models are MoE and are built to run more efficiently, while US models are dense on purpose, to keep an "IQ" edge over non-US MoE models. That edge comes with an increasing price tag that cannot be subsidized forever if these companies want to go public.
What is interesting is that this seems to be an unexpected blowback from US export controls. I said back then that Chinese models were going to focus on efficiency and architecture rather than brute computational power. My guess is that if Chinese companies had had unrestricted access to Nvidia GPUs, they would have gone dense and their models would cost as much as US ones.
It really depends on what you are trying to evaluate.
This is an OpenRouter metric. Only people who don't want to use the official API (and don't self-host) use OpenRouter, so you're already looking at a small subset of Western users who are fine with massively overpaying for open models just to have them hosted on a Western-based server.
This again? The bench that claims DS cost $4 for 50k tokens? Yeah, we know Western models are desperate to keep their scam up, but they could put in more effort when faking their benches.
TL;DR
- Cost inflated ~5×: The benchmark bills all input tokens at the full cache-miss rate ($0.435/M). In reality, 78% of tokens in agent runs are cache hits, which DeepSeek charges at $0.003625/M (99.2% discount). A representative trial reported at $4.36 drops to ~$0.89 with proper cache pricing on the verifiable portion. An additional $0.41 in the reported cost is unexplained (could be reasoning tokens, OpenRouter markup, or both) — but the dominant error is the cache-pricing gap. The $4.22 leaderboard average is similarly inflated.
- We solved all three tasks they failed: Same model (deepseek-v4-pro), same task definitions, same test verifiers. Three tasks, three passes. Combined cache-adjusted API cost: ~$0.86 total ($0.37 bandit + $0.17 termenv + ~$0.31 superjson estimated). For context, DeepSWE reports $4.22/task for this model.
- OpenRouter privacy guardrail blocks DeepSeek by default: OpenRouter hides providers that may train on data. Without explicitly enabling DeepSeek in privacy settings, the API returns 404s. DeepSWE has no failsafe for this — related to issue . We reproduced the 404 loop.
- No effort tuning for DeepSeek: deepseek-v4-pro ran at "default" effort (reasoning_effort: null). Every other model on the leaderboard got tuned effort levels (xhigh, max, high, medium). Meanwhile thinking mode was ON by default, burning reasoning tokens at output rates without any configuration.
- We had zero verifier infrastructure failures: We ran tests directly on the host (no Docker). None of the issues documented in (browser timeouts, Go dependency failures) affected our runs.
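The cache-pricing point in the first bullet can be checked with a few lines of arithmetic. The sketch below uses only the two input rates and the 78% hit share quoted above. The run size of one million input tokens is a hypothetical figure chosen for readability, and output or reasoning tokens are deliberately left out, so it does not attempt to reproduce the $4.36 or $0.89 totals.

```python
# Per-million-token input rates quoted in the TL;DR above.
CACHE_MISS_USD_PER_M = 0.435
CACHE_HIT_USD_PER_M = 0.003625
CACHE_HIT_FRACTION = 0.78  # share of input tokens served from cache, per the TL;DR


def input_cost_usd(input_tokens: int, hit_fraction: float) -> float:
    """Input-side cost when hit_fraction of the tokens is billed at the cache-hit rate."""
    hit_tokens = input_tokens * hit_fraction
    miss_tokens = input_tokens - hit_tokens
    total = hit_tokens * CACHE_HIT_USD_PER_M + miss_tokens * CACHE_MISS_USD_PER_M
    return total / 1_000_000


# Hypothetical run size, chosen only to make the numbers easy to read.
tokens = 1_000_000

naive = input_cost_usd(tokens, hit_fraction=0.0)  # everything billed as a cache miss
cache_aware = input_cost_usd(tokens, hit_fraction=CACHE_HIT_FRACTION)
discount = 1 - CACHE_HIT_USD_PER_M / CACHE_MISS_USD_PER_M

print(f"all tokens at miss rate: ${naive:.4f}")
print(f"78% cache hits:          ${cache_aware:.4f}")
print(f"input-cost ratio:        {naive / cache_aware:.2f}x")
print(f"cache-hit discount:      {discount:.1%}")
```

On the input side alone, the ratio comes out at roughly 4.4x. That covers input tokens only, so it need not match the ~5× figure for a full trial, which also includes output tokens and the unexplained remainder.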