Artificial Intelligence thread

tphuang

General
Staff member
Super Moderator
VIP Professional
Registered Member
These results put the new R1 in the same generation as o3, Gemini 2.5 Pro, and Claude 4 in the pure programming / logic / math / reasoning space. It is still behind o3 and Gemini 2.5 Pro in multi-modal capability and context length, and indeed China currently lacks any frontier multi-modal model, as well as Deep Research (although the lack of Deep Research is mostly due to the failures of Baidu, given how tied up Deep Research is with search engines).

In an objective sense, DeepSeek right now is around the same level as Anthropic when it comes to the state of the art, and is single-handedly keeping China competitive, i.e. within ~2-3 months of the West on frontier models. However, the shift to multi-modal architectures by OpenAI and Google may spell trouble if those architectures ultimately prove superior. Hence the need for more strong Chinese competitors in this space.

We now wait for Elon's Grok 3.5, which should wrap up this generation of model releases.
China is the king of multi-modal models for any application that you'd want to use. What are you talking about?

DeepSeek has no real viable multi-modal model, because it has no audio model.

For applications, Qwen and MiniCPM are the best models you can find out there.
 

Eventine

Junior Member
Registered Member
China is the king of multi-modal models for any application that you'd want to use. What are you talking about?

DeepSeek has no real viable multi-modal model, because it has no audio model.

For applications, Qwen and MiniCPM are the best models you can find out there.
What do you mean? I'm talking about multi-modal models that are intrinsically integrated (cross-trained) with frontier LLMs, i.e. OpenAI's and Google's latest models. Gemini 2.5 is a natively multi-modal model (as are the newer OpenAI 4o-series models, and almost certainly GPT-5).

Qwen and MiniCPM are impressive for their size, but they are not frontier models. If you look at the linked chart, Chinese models are not even close to the top.
 

tphuang

General
Staff member
Super Moderator
VIP Professional
Registered Member
What do you mean? I'm talking about multi-modal models that are intrinsically integrated (cross-trained) with frontier LLMs, i.e. OpenAI's and Google's latest models. Gemini 2.5 is a natively multi-modal model (as are the newer OpenAI 4o-series models, and almost certainly GPT-5).

Qwen and MiniCPM are impressive for their size, but they are not frontier models. If you look at the linked chart, Chinese models are not even close to the top.
I don't use benchmarks. I use actual models on edge platforms. Multi-modal models are pointless if they can't run on edge devices.

Otherwise, you just do your own integration of different models.
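
A minimal sketch of what that kind of DIY integration can look like - a standalone vision model feeding a standalone LLM - with placeholder model names (just illustrative, not a recommendation):

```python
# Two-stage "roll your own" multi-modal pipeline:
# a standalone vision model produces text, and a standalone LLM reasons over it.
# Model names are placeholders; swap in whatever fits your edge device.
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
llm = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

def ask_about_image(image_path: str, question: str) -> str:
    # Stage 1: the vision model turns the image into a text description.
    caption = captioner(image_path)[0]["generated_text"]
    # Stage 2: the LLM answers the question using only that description.
    prompt = f"Image description: {caption}\nQuestion: {question}\nAnswer:"
    out = llm(prompt, max_new_tokens=64)[0]["generated_text"]
    return out[len(prompt):].strip()

print(ask_about_image("photo.jpg", "What is the person in the photo doing?"))
```

Each stage can be quantized and swapped independently, which is a big part of the appeal on edge hardware.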
 

tokenanalyst

Brigadier
Registered Member
What do you mean? I'm talking about multi-modal models that are intrinsically integrated (cross-trained) with frontier LLMs, i.e. OpenAI's and Google's latest models. Gemini 2.5 is a natively multi-modal model (as are the newer OpenAI 4o-series models, and almost certainly GPT-5).

Qwen and MiniCPM are impressive for their size, but they are not frontier models. If you look at the linked chart, Chinese models are not even close to the top.
ByteDance's SeedVL is number 10; I wouldn't call that "not even close".
InternVL 78B and Skywork are in the top 20.
It's only a matter of time until the Qwen team releases a multi-modal model that will blow away everything else.
 

tphuang

General
Staff member
Super Moderator
VIP Professional
Registered Member

My latest on DeepSeek's new release: it's already fully supported on Tencent's various apps. They are really the fastest at AI rollout, putting the latest models in the hands of users.

Btw, Qwen's multi-modal model is already awesome - it's really good. The 3B version is actually small enough to run on a lot of edge platforms that have less memory, which is quite critical.

If you've never tried running inference on these things locally, it's hard to comprehend just how much RAM is required to get semi-decent performance out of multi-modal models.
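
To make the RAM point concrete, a rough back-of-the-envelope estimate (the layer/head counts, context length, and vision-encoder overhead below are illustrative assumptions, not the published specs of any particular Qwen model):

```python
# Rough RAM estimate for local inference on a small multi-modal model:
# weights + KV cache + vision-encoder overhead. All figures are assumptions.
def estimate_ram_gb(params_b, bytes_per_weight, layers, kv_heads, head_dim,
                    context_len, kv_bytes=2, vision_overhead_gb=0.8):
    weights = params_b * 1e9 * bytes_per_weight
    # KV cache: 2 tensors (K and V) per layer, per KV head, per token.
    kv_cache = 2 * layers * kv_heads * head_dim * context_len * kv_bytes
    return (weights + kv_cache) / 1e9 + vision_overhead_gb

for bytes_per_weight, label in [(2.0, "fp16"), (1.0, "int8"), (0.5, "~4-bit")]:
    gb = estimate_ram_gb(3.0, bytes_per_weight, layers=36, kv_heads=4,
                         head_dim=128, context_len=8192)
    print(f"3B model @ {label}: ~{gb:.1f} GB")
```

Even at ~4-bit, a 3B multi-modal model wants a few GB before you've loaded anything else, which is exactly why the small variants matter on memory-constrained edge hardware.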
 

Eventine

Junior Member
Registered Member
I don't use benchmarks. I use actual models on edge platforms. Multi-modal models are pointless if they can't run on edge devices.

Otherwise, you just do your own integration of different models.
Multi-modal models are useful outside of edge devices.

I work in a content development industry. Multi-modal models are extremely disruptive for content generation. A model cross-trained between videos & text is not just useful for video recognition, it's also useful for generating videos. That's typically done on compute clusters.

The advantage over a model pipeline is that the video generator has a deeper & richer understanding of the association between words, objects, and motion, and can follow instructions much, much better than a typical video model that's only been trained on tags & descriptions. Thus, for instance, you can tell a multi-modal model to modify small details of a picture it generated, which is impossible for a tags-only image model like Stable Diffusion, where you'd have to do manual inpainting.
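
To illustrate the contrast: with a mask-based diffusion model you have to hand-draw a mask over the region you want changed before you can even describe the edit. A minimal sketch with diffusers (the checkpoint name is just illustrative):

```python
# Mask-based editing with a plain diffusion model: the model can't be told
# "make the car red" on its own - you must supply a hand-made mask of the car.
# Checkpoint name is illustrative; any SD inpainting checkpoint works similarly.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("scene.png").convert("RGB")
mask = Image.open("car_mask.png").convert("RGB")  # white = region to repaint

edited = pipe(prompt="a red car parked by the curb",
              image=image, mask_image=mask).images[0]
edited.save("scene_edited.png")
```

A natively multi-modal model skips the mask step entirely: you describe the change in language, because the model already grounds words to regions of the image it generated.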

There are also synergies when doing cross-training - e.g. LLMs' visuo-spatial knowledge is improved by reasoning in a latent space that's been trained on images & videos. Even Claude, which has been hyper-optimized for logic & coding, has native vision capabilities because of this.

Anyway, I'm not saying the future is necessarily native multi-modal models, but that this is a weakness in the Chinese AI ecosystem. Yes, ByteDance has Seed and Alibaba has Qwen, but neither is built on top of state-of-the-art LLMs. Consequently, when folks in my industry are looking at options for enterprise-quality content production, they're gravitating towards Google's and OpenAI's solutions, even those who formerly preferred Kling.
 

tokenanalyst

Brigadier
Registered Member
Multi-modal models are useful outside of edge devices.

I work in a content development industry. Multi-modal models are extremely disruptive for content generation. A model cross-trained between videos & text is not just useful for video recognition, it's also useful for generating videos. That's typically done on compute clusters.

The advantage over a model pipeline is that the video generator has a deeper & richer understanding of the association between words, objects, and motion, and can follow instructions much, much better than a typical video model that's only been trained on tags & descriptions. Thus, for instance, you can tell a multi-modal model to modify small details of a picture it generated, which is impossible for a tags-only image model like Stable Diffusion, where you'd have to do manual inpainting.

There are also synergies when doing cross-training - e.g. LLMs' visuo-spatial knowledge is improved by reasoning in a latent space that's been trained on images & videos. Even Claude, which has been hyper-optimized for logic & coding, has native vision capabilities because of this.

Anyway, I'm not saying the future is necessarily native multi-modal models, but that this is a weakness in the Chinese AI ecosystem. Yes, ByteDance has Seed and Alibaba has Qwen, but neither is built on top of state-of-the-art LLMs. Consequently, when folks in my industry are looking at options for enterprise-quality content production, they're gravitating towards Google's and OpenAI's solutions, even those who formerly preferred Kling.
Kling is not an LLM, it's a diffusion model, not even a multi-modal model. SeedVL, with 20B parameters, beats most closed models that came out just a few months ago. The field is just moving too fast.
 

Eventine

Junior Member
Registered Member
Kling is not an LLM, it's a diffusion model, not even a multi-modal model. SeedVL, with 20B parameters, beats most closed models that came out just a few months ago. The field is just moving too fast.
Yes, I realize Kling is not an LLM, which will be problematic for Kuaishou as the new breed of multi-modal models will be capable of substantially better detail control & instruction following. Seed is more promising in this respect, and I hope ByteDance can scale up on the LLM side, to the level of challenging Veo 3, which is the current favorite to "win" the closed AI video race.
 