Artificial Intelligence thread

mossen

Junior Member
Registered Member
Qwen 3 is an efficient model, but it doesn't do great on factuality (SimpleQA).

1.png

By comparison, R1 gets around 30%. Qwen reaches 11% at best and is often below 10%. Gemini Flash 2.5 also gets over 30%, and Gemini Pro gets 53%. I think Qwen's biggest strength is how efficient it is at its smaller sizes. But it's not as good as the leading models, not even the leading open source model (R1). And DeepSeek is likely to release a new model very shortly.

Qwen is far better than Llama, but DeepSeek is still the open source king.
 

Wrought

Senior Member
Registered Member
DeepSeek returns to South Korea following a two-month ban.

SEOUL, April 28 (Reuters) - Chinese artificial intelligence service DeepSeek became available again on South Korean app markets on Monday for the first time in about two months, when downloads were suspended after authorities cited breaches in data protection rules.
South Korea's Personal Information Protection Commission said on Thursday that DeepSeek transferred [linked text not shown] and prompts without permission when the service first launched in South Korea in January.

Downloading the app was suspended in February after the questions over personal data protection surfaced, but the service was available for download again on South Korea's app market including via Apple's App Store and Google Play Store.

 

european_guy

Junior Member
Registered Member
SimpleQA is an OpenAI dataset that is anything but simple.

Here are some random questions from it:

- Who received the IEEE Frank Rosenblatt Award in 2010? (Michio Sugeno)

- How much money, in euros, was the surgeon held responsible for Stella Obasanjo's death ordered to pay her son? (120,000)

- What were the month and year when Obama told Christianity Today, "I am a Christian, and I am a devout Christian. I believe in the redemptive death and resurrection of Jesus Christ"? (January 2008)

These are far from simple. Here "SimpleQA" means that they are single-fact questions, where the answer is just a couple of words.
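To make the "single fact, couple of words" point concrete, here is a rough Python sketch of how a score on questions like these could be computed. It is only an illustration under my own assumptions: `normalize`, `grade`, `accuracy` and `toy_model` are hypothetical names, and plain normalized string matching stands in for whatever grading the real benchmark uses. The three samples are the ones quoted above.

```python
import re

# The three sample questions quoted in this post, with their reference answers.
SAMPLES = [
    ("Who received the IEEE Frank Rosenblatt Award in 2010?", "Michio Sugeno"),
    ("How much money, in euros, was the surgeon held responsible for "
     "Stella Obasanjo's death ordered to pay her son?", "120,000"),
    ('What were the month and year when Obama told Christianity Today, '
     '"I am a Christian, and I am a devout Christian. I believe in the '
     'redemptive death and resurrection of Jesus Christ"?', "January 2008"),
]


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9 ]", "", text.lower())
    return " ".join(text.split())


def grade(predicted, reference):
    """Assumption: an answer counts as correct only on a normalized exact match."""
    return normalize(predicted) == normalize(reference)


def accuracy(answer_fn, samples=SAMPLES):
    """Fraction of questions that answer_fn (question -> answer string) gets right."""
    correct = sum(grade(answer_fn(q), ref) for q, ref in samples)
    return correct / len(samples)


def toy_model(question):
    """Hypothetical stand-in for a model: it has memorized one fact only."""
    known = {SAMPLES[0][0]: "michio sugeno."}
    return known.get(question, "I don't know")


if __name__ == "__main__":
    print(f"toy model accuracy: {accuracy(toy_model):.2f}")
```

The toy model only "knows" one of the three facts, so it scores one out of three. There is no way to reason your way to the other two answers: either the fact is stored or it isn't.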

Small models, those without many hundreds or thousands of billions of parameters, cannot perform well on this test, because all these little facts are stored in the model's parameters, and you need tons of parameters to learn every small, single fact that happened in the world at that level of detail.

This is a good test for indirectly getting some hints about model size. For instance, for closed models like the OpenAI ones, if model A has a better result on SimpleQA than model B, then almost certainly model A is bigger than B.

This test does not measure how "smart" a model is or how good it is at instruction following, and it does not measure how useful a model is for day-to-day usage.