For video and image creation, Kling does the job quite well. There are others that handle tasks like watching video and extracting audio. But if you are just looking at creating ads, short dramas and things like that, Kling serves the Chinese market quite well.
Yes, I realize Kling is not an LLM, which will be problematic for Kuaishou, as the new breed of multi-modal models will be capable of substantially better detail control & instruction following. Seed is more promising in this respect, and I hope ByteDance can scale up on the LLM side to the point of challenging Veo 3, which is the current favorite to "win" the closed AI video race.
Remember, there are many Chinese models that are simply not being benchmarked by these third-party lists.
Reasoning frankly is overrated.
Multi-modal models are useful outside of edge devices.
I work in the content development industry. Multi-modal models are extremely disruptive for content generation. A model cross-trained on videos & text is not just useful for video recognition, it's also useful for generating videos. That's typically done on compute clusters.
The advantage over a model pipeline is that the video generator has a deeper & richer understanding of the association between words, objects, and motion, and can follow instructions much, much better than a typical video model that's only been trained on tags & descriptions. Thus, for instance, you can tell a multi-modal model to modify small details of a picture it generated, which is impossible for a tags-only image model like Stable Diffusion, where you'd have to do manual inpainting.
There are also synergies when doing cross-training - e.g. LLMs' visuo-spatial knowledge is improved by reasoning in a latent space that's been trained on images & videos. Even Claude, which has been hyper-optimized for logic & coding, has native vision capabilities because of this.
Anyway, I'm not saying the future is necessarily native multi-modal models, but that it is a weakness in the Chinese AI ecosystem. Yes, ByteDance has Seed and Alibaba has Qwen, but neither is built off of state-of-the-art LLMs. Consequently, when folks in my industry are looking at options for enterprise-quality content production, they're gravitating towards Google's and OpenAI's solutions, even those who formerly preferred Kling.
In most AI applications, you don't have the time to generate a bunch of reasoning tokens.
Think about actual multi-modal applications like ADAS, autonomous delivery robots, drones, humanoid robots and things like that. America is quite far behind in these areas. None of that can run on compute clusters.
Most people in the AI community in America have zero clue what it's like to put AI in physical objects that do stuff. I guarantee you very few of the talking heads who follow LLMs understand this stuff.