Qwen 3.7 is like GLM-5.1 in that it was the first Qwen model to look like it has some RSI potential. The problem for the Alibaba team is whether they got good enough data, since they didn’t open things up and make it as cheap and accessible to developers as Kimi and GLM did.
I'm skeptical about the user data argument. People have been saying since ChatGPT-3.5 that the model with the most users will snowball by accumulating a data advantage, but this turned out to be false. There are major filtering, distribution, and data quality challenges with using user-provided data for training. It's also unlikely that Z.AI acquired enough user data in a few months to explain such a big leap.
It seems to me the breakthrough for GLM-5.2 is a change in how they do RL, especially the focus on long-horizon tasks. The technical blog mentions switching RL algorithms from GRPO to critic-based PPO. DeepSeek R1 saw a major breakthrough using GRPO to enable reasoning, and GLM-5.2 may have achieved a breakthrough of similar scale using critic-based PPO for long-horizon task training.
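For anyone unfamiliar with the distinction, here's a rough sketch of how the two approaches compute advantages. This is the textbook form of GRPO's group-relative advantage and of GAE (the estimator commonly paired with a PPO critic), not whatever GLM-5.2 actually implements. The function names, the use of population standard deviation, and the default `gamma`/`lam` values are my own placeholders.

```python
from statistics import mean, pstdev


def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages (GRPO-style, no critic).

    Each sampled response to the same prompt gets ONE scalar advantage:
    its reward normalized against the rest of the group.
    Assumption: population std; implementations differ on this detail.
    """
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]


def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Critic-based advantages via GAE (commonly used with PPO).

    `rewards` has one entry per step; `values` holds the critic's
    estimate for each step plus one trailing bootstrap value, so
    len(values) == len(rewards) + 1.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step accumulates discounted future TD errors.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


if __name__ == "__main__":
    # Four responses to one prompt, scored pass/fail.
    print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
    # One three-step trajectory with a reward only at the end.
    print(gae_advantages([0.0, 0.0, 1.0], [0.5, 0.5, 0.5, 0.0]))
```

The relevant difference: the GRPO-style version hands the whole response a single signal, while the critic gives a per-step estimate. That per-step credit assignment is plausibly why a critic would help on long-horizon tasks, though that's my reading rather than something the blog states.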
The Qwen3.7 Max blog also puts a strong emphasis on RL for long-horizon tasks and describes overcoming similar challenges in preventing reward hacking. So they are already on a similar track to GLM.
By being open-weight and transparent about its RL process, GLM-5.2 will lift every other LLM developer, just as DeepSeek R1 did. I expect all the major Chinese LLMs to improve massively by the end of this year as they distill GLM-5.2 and adopt its RL techniques.