Google recently released a paper called TurboQuant that found an elegant way of quantizing the KV cache itself, allowing roughly 4x to 6x longer context at near-lossless quality. I recompiled llama.cpp with this, and now on my rtx 5070ti super, which only has 16GB of VRAM, I can run qwen3.5 9b at the FULL 256k token context window. That's wild, because if y'all recall, when ChatGPT first came out it only had 8k context, and for the longest time 200k context was something only Anthropic offered to enterprise customers...
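For a rough feel of why quantizing the KV cache buys so much context, here's a back-of-the-envelope sketch in Python. The layer count, KV head count and head dimension are made-up placeholder values, not the real qwen3.5 9b config, and it ignores the small per-block overhead (scales etc.) that real quantization schemes carry. Treat it as an estimate, not as what TurboQuant or llama.cpp actually do internally.

```python
GIB = 2**30


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bits_per_value):
    """Approximate KV cache size: one K and one V tensor per attention layer."""
    values = 2 * n_layers * n_kv_heads * head_dim * n_tokens
    return values * bits_per_value / 8


def max_context(budget_bytes, n_layers, n_kv_heads, head_dim, bits_per_value):
    """How many tokens fit in a given memory budget at a given precision."""
    per_token = kv_cache_bytes(n_layers, n_kv_heads, head_dim, 1, bits_per_value)
    return int(budget_bytes // per_token)


# Hypothetical model shape (assumption, NOT the actual qwen3.5 9b config).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
CONTEXT = 256 * 1024  # 256k tokens

if __name__ == "__main__":
    for bits in (16, 8, 4, 3):
        size = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT, bits)
        print(f"{bits:>2}-bit KV cache at 256k tokens: {size / GIB:.1f} GiB")
```

With that made-up shape, a 16-bit cache at 256k tokens works out to 32 GiB, while 4-bit drops it to 8 GiB, which is the kind of gap that decides whether a 16GB card can hold the full window alongside the weights. Models that swap some attention layers for linear-attention ones keep a KV cache for fewer layers, so the real number can be lower still.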
Alright, I tried running qwen3.6 27B locally and it worked fine. It came with a nice web GUI that I could use. I'm happy with it.
On Artificial Analysis, qwen3.5 9b is only a few points under Gemini 2.5 Pro in the Intelligence benchmarks, and Gemini 2.5 Pro was SOTA just last year, in 2025...
It won't be long before everyone can run near-frontier models on their own hardware, especially with the new 1-bit models coming out. Pair that with the likes of TurboQuant, Qwen's DeltaNet, and DeepSeek's Engram technique of separating facts (knowledge) from reasoning/logic/intelligence to make LLMs much more capable at a much smaller size/footprint, and pretty soon offline will be as good as online, especially if the online versions keep going up in cost and end up gated (Mythos) or deliberately handicapped (Opus 4.7 got intentionally handicapped in the IT security domain, etc.)...
If Anthropic really cared about mankind, they would release the Opus and Mythos models as open weights to the world.
