Artificial Intelligence thread

tphuang

General
Staff member
Super Moderator
VIP Professional
Registered Member
These results put the new R1 in the same generation as o3, Gemini 2.5 Pro, and Claude 4 in the pure programming / logic / math / reasoning space. It is still behind o3 and Gemini 2.5 Pro in multi-modal capability and context length, and indeed China currently lacks any frontier multi-modal model, as well as Deep Research (although the lack of Deep Research is mostly due to the failures of Baidu, given how tied up Deep Research is with search engines).

In an objective sense, DeepSeek right now is around the same level as Anthropic when it comes to the state of the art, and is single-handedly keeping China competitive, i.e. within ~2-3 months of the West on frontier models. However, the shift to multi-modal architectures by OpenAI and Google may spell trouble if those architectures ultimately prove superior. Hence the need for more strong Chinese competitors in this space.

We now wait for Elon's Grok 3.5, which should wrap up this generation of model releases.
China is the king of multi-modal models for any application that you'd want to use. What are you talking about?

DeepSeek has no real viable multi-modal model, because it has no audio model.

For applications, Qwen and MiniCPM are the best models you can find out there.
 

Eventine

Junior Member
Registered Member
China is the king of multi-modal models for any application that you'd want to use. What are you talking about?

DeepSeek has no real viable multi-modal model, because it has no audio model.

For applications, Qwen and MiniCPM are the best models you can find out there.
What do you mean? I'm talking about multi-modal models that are intrinsically integrated (cross-trained) with frontier LLMs, i.e. OpenAI's and Google's latest models. Gemini 2.5 is a natively multi-modal model (as are the newer OpenAI 4o-series models, and almost certainly GPT-5).

Qwen and MiniCPM are impressive for their size, but they are not frontier models. If you look at the linked chart, Chinese models are not even close to the top.
 

tphuang

General
Staff member
Super Moderator
VIP Professional
Registered Member
What do you mean? I'm talking about multi-modal models that are intrinsically integrated (cross-trained) with frontier LLMs, i.e. OpenAI's and Google's latest models. Gemini 2.5 is a natively multi-modal model (as are the newer OpenAI 4o-series models, and almost certainly GPT-5).

Qwen and MiniCPM are impressive for their size, but they are not frontier models. If you look at the linked chart, Chinese models are not even close to the top.
I don't use benchmarks. I use actual models on edge platforms. Multi-modal models are pointless if they can't run on edge devices.

Otherwise, you just do your own integration of different models.
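
A minimal sketch of what that kind of DIY integration can look like - a standalone vision model feeding a standalone LLM - with placeholder model names (just illustrative, not a recommendation):

```python
# Two-stage "roll your own" multi-modal pipeline:
# a standalone vision model produces text, and a standalone LLM reasons over it.
# Model names are placeholders; swap in whatever fits your edge device.
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
llm = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

def ask_about_image(image_path: str, question: str) -> str:
    # Stage 1: the vision model turns the image into a text description.
    caption = captioner(image_path)[0]["generated_text"]
    # Stage 2: the LLM answers the question using only that description.
    prompt = f"Image description: {caption}\nQuestion: {question}\nAnswer:"
    out = llm(prompt, max_new_tokens=64)[0]["generated_text"]
    return out[len(prompt):].strip()

print(ask_about_image("photo.jpg", "What is the person in the photo doing?"))
```

Each stage can be quantized and swapped independently, which is a big part of the appeal on edge hardware.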
 

tokenanalyst

Brigadier
Registered Member
What do you mean? I'm talking about multi-modal models that are intrinsically integrated (cross-trained) with frontier LLMs, i.e. OpenAI's and Google's latest models. Gemini 2.5 is a natively multi-modal model (as are the newer OpenAI 4o-series models, and almost certainly GPT-5).

Qwen and MiniCPM are impressive for their size, but they are not frontier models. If you look at the linked chart, Chinese models are not even close to the top.
ByteDance's SeedVL is number 10; I wouldn't call that "not even close".
InternVL 78B and Skywork are in the top 20.
It's only a matter of time until the Qwen team releases a multi-modal model that will blow away everything else.
 

tphuang

General
Staff member
Super Moderator
VIP Professional
Registered Member

My latest on DeepSeek's new release: it's already fully supported on Tencent's various apps. They are really the fastest at AI rollout, putting the latest models in the hands of users.

Btw, Qwen's multi-modal model is already awesome - it's really good. The 3B version is actually small enough to run on a lot of edge platforms that have less memory, which is quite critical.

If you've never tried running inference on these things locally, it's hard to comprehend just how much RAM is required to get semi-decent performance out of multi-modal models.
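
To make the RAM point concrete, a rough back-of-the-envelope estimate (the layer/head counts, context length, and vision-encoder overhead below are illustrative assumptions, not the published specs of any particular Qwen model):

```python
# Rough RAM estimate for local inference on a small multi-modal model:
# weights + KV cache + vision-encoder overhead. All figures are assumptions.
def estimate_ram_gb(params_b, bytes_per_weight, layers, kv_heads, head_dim,
                    context_len, kv_bytes=2, vision_overhead_gb=0.8):
    weights = params_b * 1e9 * bytes_per_weight
    # KV cache: 2 tensors (K and V) per layer, per KV head, per token.
    kv_cache = 2 * layers * kv_heads * head_dim * context_len * kv_bytes
    return (weights + kv_cache) / 1e9 + vision_overhead_gb

for bytes_per_weight, label in [(2.0, "fp16"), (1.0, "int8"), (0.5, "~4-bit")]:
    gb = estimate_ram_gb(3.0, bytes_per_weight, layers=36, kv_heads=4,
                         head_dim=128, context_len=8192)
    print(f"3B model @ {label}: ~{gb:.1f} GB")
```

Even at ~4-bit, a 3B multi-modal model wants a few GB before you've loaded anything else, which is exactly why the small variants matter on memory-constrained edge hardware.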
 

Eventine

Junior Member
Registered Member
I don't use benchmarks. I use actual models on edge platforms. Multi-modal models are pointless if they can't run on edge devices.

Otherwise, you just do your own integration of different models.
Multi-modal models are useful outside of edge devices.

I work in a content development industry. Multi-modal models are extremely disruptive for content generation. A model cross-trained between videos & text is not just useful for video recognition, it's also useful for generating videos. That's typically done on compute clusters.

The advantage over a model pipeline is that the video generator has a deeper & richer understanding of the association between words, objects, and motion, and can follow instructions much, much better than a typical video model that's only been trained on tags & descriptions. Thus, for instance, you can tell a multi-modal model to modify small details of a picture it generated, which is impossible for a tags-only image model like Stable Diffusion, where you'd have to do manual inpainting.
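
To illustrate the contrast: with a mask-based diffusion model you have to hand-draw a mask over the region you want changed before you can even describe the edit. A minimal sketch with diffusers (the checkpoint name is just illustrative):

```python
# Mask-based editing with a plain diffusion model: the model can't be told
# "make the car red" on its own - you must supply a hand-made mask of the car.
# Checkpoint name is illustrative; any SD inpainting checkpoint works similarly.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("scene.png").convert("RGB")
mask = Image.open("car_mask.png").convert("RGB")  # white = region to repaint

edited = pipe(prompt="a red car parked by the curb",
              image=image, mask_image=mask).images[0]
edited.save("scene_edited.png")
```

A natively multi-modal model skips the mask step entirely: you describe the change in language, because the model already grounds words to regions of the image it generated.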

There are also synergies when doing cross-training - e.g. LLMs' visuo-spatial knowledge is improved by reasoning in a latent space that's been trained on images & videos. Even Claude, which has been hyper-optimized for logic & coding, has native vision capabilities because of this.

Anyway, I'm not saying the future is necessarily native multi-modal models, but that this is a weakness in the Chinese AI ecosystem. Yes, ByteDance has Seed and Alibaba has Qwen, but neither is built on top of state-of-the-art LLMs. Consequently, when folks in my industry are looking at options for enterprise-quality content production, they're gravitating towards Google's and OpenAI's solutions, even those who formerly preferred Kling.
 

tokenanalyst

Brigadier
Registered Member
Multi-modal models are useful outside of edge devices.

I work in a content development industry. Multi-modal models are extremely disruptive for content generation. A model cross-trained between videos & text is not just useful for video recognition, it's also useful for generating videos. That's typically done on compute clusters.

The advantage over a model pipeline is that the video generator has a deeper & richer understanding of the association between words, objects, and motion, and can follow instructions much, much better than a typical video model that's only been trained on tags & descriptions. Thus, for instance, you can tell a multi-modal model to modify small details of a picture it generated, which is impossible for a tags-only image model like Stable Diffusion, where you'd have to do manual inpainting.

There are also synergies when doing cross-training - e.g. LLMs' visuo-spatial knowledge is improved by reasoning in a latent space that's been trained on images & videos. Even Claude, which has been hyper-optimized for logic & coding, has native vision capabilities because of this.

Anyway, I'm not saying the future is necessarily native multi-modal models, but that this is a weakness in the Chinese AI ecosystem. Yes, ByteDance has Seed and Alibaba has Qwen, but neither is built on top of state-of-the-art LLMs. Consequently, when folks in my industry are looking at options for enterprise-quality content production, they're gravitating towards Google's and OpenAI's solutions, even those who formerly preferred Kling.
Kling is not an LLM, it's a diffusion model, not even a multi-modal model. SeedVL, with 20B parameters, beats most closed models that came out just a few months ago. The field is just moving too fast.
 

Eventine

Junior Member
Registered Member
Kling is not an LLM, it's a diffusion model, not even a multi-modal model. SeedVL, with 20B parameters, beats most closed models that came out just a few months ago. The field is just moving too fast.
Yes, I realize Kling is not an LLM, which will be problematic for Kuaishou as the new breed of multi-modal models will be capable of substantially better detail control & instruction following. Seed is more promising in this respect, and I hope ByteDance can scale up on the LLM side, to the level of challenging Veo 3, which is the current favorite to "win" the closed AI video race.
 