Artificial Intelligence thread

meedicx

Junior Member
Registered Member
I have ~2 million DS output tokens in May on 600 million input tokens and it cost me $14 out of my $50 deposit. I don't need you to tell me how much I'm spending lol

It's one thing to make up a custom benchmark to benchmax western models, it's another to make up costs that can be easily disproven.

There is no "if it keep on trying to find a solution". I use Kimi as main and, since last month, DS as secondary on a daily basis for actual enterprise level engineering work involving large mission critical codebases. There have been rare instances where DS failed to find a solution, but I have not had a single instance where Kimi did.

Do you know what "long horizon" coding means? It just means one-shots. Enterprises don't one-shot, nobody doing any production coding worth anything does one-shots because you can't guarantee a 100% success rate and the cost of not being 100% is catastrophic.

I can't speak for how well western models compare against Kimi / DS in one-shot tasks because I don't use them for that purpose and neither does anyone in real businesses. I can only tell you that for creating features, debugging, testing, refactoring, coming up with ideas, operating tools agentically, reading and writing documentation, things that are actually done in industry, there has not been a single instance where Kimi failed.

I mean why do you think US companies are all starting to limit Claude use and removing token use targets? Because in real enterprise environments one-shot benchmarks mean absolutely nothing, while the productivity of using AI to do real work is entirely a function of how much you have to spend per token. Enterprise management is notorious for having no clue what engineers actually do, that's why they set those targets. Now they know, but unless they discover Chinese models, they're going to be lapped by companies that use Chinese models.

Your point about one-shot tasks that benchmarks focus on versus iterative tasks more representative of real work is well made.

Here is an interesting comment from an LLM benchmark organization that supports your observation:
American frontier models are significantly smarter in raw one-shot intelligence. But when given a long horizon to perform agentic work, Chinese models have learned to excel in using arbitrary tools to dramatically improve their original response.

Tool use ability was traditionally Anthropic's edge (and still is, to a degree), but Chinese labs have surpassed them. The fact that this is true of almost all major Chinese releases suggests an innovation is being shared at the state level.

Interestingly, this also explains why Xiaomi MiMo / MiniMax is so polarizing, with some benchmarks and users saying it's terrible while others swear by it. Different usage patterns explain a lot.

 

Wrought

Captain
Registered Member
Lots of claims floating around about DeepSeek funding. The latest numbers are in the ballpark of 50 billion yuan, with a big chunk coming from Liang Wenfeng himself.

Chinese AI startup DeepSeek is set to raise about 50 billion yuan ($7.4 billion) in its first funding round from investors including […] and […], people with knowledge of the matter said. The fundraising could value the company after the investment at between 350 billion yuan and 400 billion yuan, or between $52 billion and $59 billion, the people said, declining to be identified because the information is confidential.

The startup’s founder, Liang Wenfeng, has committed 20 billion yuan of his own money, the people said, adding that tech conglomerate Tencent is considering 10 billion yuan and battery giant CATL is looking at 5 billion yuan, which would make them the largest external investors in the round. DeepSeek is also in final talks with China’s national artificial intelligence fund, gaming developer […] and e-commerce giant […], they said, noting that the planned number of investors was fewer than 10.

 

meedicx

Junior Member
Registered Member
Lol I'm sitting at 600 million total tokens with something around 2M output tokens on DS v4 Pro Max and my bill is currently $14
I'd like to know how they managed to spend $4 on 50k DSv4 tokens
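For anyone who wants to sanity check the gap, here is a rough sketch of the arithmetic in Python. It only uses the figures quoted in this thread ($14 for about 600 million total tokens, versus $4 for 50k tokens) and treats each as a single blended rate. That is an assumption: it ignores the input/output price split and any cache discounts, and the benchmark's $4 figure may be counting tokens differently. The function name is made up for illustration.

```python
def cost_per_million(total_usd: float, total_tokens: float) -> float:
    """Blended cost in USD per one million tokens (input and output lumped together)."""
    return total_usd / total_tokens * 1_000_000


# Figures as quoted in the thread, not independently verified.
poster_rate = cost_per_million(14.0, 600e6)    # $14 across ~600M total tokens
benchmark_rate = cost_per_million(4.0, 50e3)   # $4 across 50k tokens

# How many times more expensive the benchmark's implied rate is.
ratio = benchmark_rate / poster_rate

print(f"poster:    ${poster_rate:.4f} per 1M tokens")
print(f"benchmark: ${benchmark_rate:.2f} per 1M tokens")
print(f"ratio:     about {ratio:.0f}x")
```

Under those assumptions the two implied rates differ by three orders of magnitude, which is why the benchmark's number looks off even before accounting for how the tokens were counted.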

These guys don't even try to be credible do they?

Western AI is entirely dependent on maintaining investment and there are massive incentives to benchmax, but real world engineering only cares about outcomes and they simply lack the performance to deliver it

Your intuition is right again. Turns out the DeepSWE benchmark has issues which caused the DSv4 cost to be inflated by 4x. They also have bugs causing the model to fail early. So both the cost and the final results are wrong.


Some people think closed benchmarks are better, but this just shows you the advantage of open ones, where at least there are people who can attempt reproduction and find bugs.

If you look at the closed benchmarks run by EpochAI or the US Commerce Department that got big news attention recently, you do not know how they are introducing bias in favor of US models. They have all the financial incentives to be biased. Just reading the methodology should raise red flags: they call the official endpoints for US models but not for Chinese ones, which is the exact issue that was causing the DSv4 bugs in DeepSWE.
 