Artificial Intelligence thread

tphuang · May 1, 2026

https://twitter.com/i/web/status/2050254833802793201

my look at Ascend SuperPoD + large SSD storage + networking + how that helps Caching + cache hit on DeepSeek V4

meedicx said:
Refining a DeepSeek model on your own code bakes the code base directly into the weights, avoids context scaling issues, and may end up performing better than the most expensive OAI or Claude models.
And just coincidentally, Ascend is marketing CPT support

just so that we provide a link of this content

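Ascend adds same-day support for, and open-sources, a continued pre-training reference implementation for DeepSeek V4's complex Sparse Attention + mHC architecture, working with Autofuse to make graph-mode training easy and optimized out of the box
Ascend CANN has completed 0-day adaptation support for continued pre-training (CPT) of the DeepSeek V4-Flash model on an A3 64-card SuperPoD. Through deep coordination between the TorchTitan-NPU plugin and Autofuse automatic fusion, measured model throughput reaches up to 1100 tokens/p/s, giving training performance that is optimized out of the box. This out-of-the-box result comes mainly from system-level optimization along the following three dimensions: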

Highlights

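The training framework uses a plugin-based TorchTitan + TorchTitan-NPU approach with a lean, SuperPoD-friendly parallel partitioning strategy of large EP + pure FSDP. This achieves the best memory footprint at very low adaptation cost and communication overhead, and strikes a good balance between ease of use and performance.

TorchTitan-NPU is deeply adapted to the torch.compile mechanism, which enables graph-mode training. Relying on Inductor + AutoFuse (a codegen backend based on Ascend C), it achieves end-to-end automatic fusion of Vector operators, bringing an out-of-the-box performance gain of up to 31.8% across the whole network.

For complex structures such as sparse attention, several efficient NPU fused operators were developed, including SparseAttnSharedkv and LightningIndexer. They are co-optimized along dimensions such as load-balanced per-core partitioning and the balance between memory and compute, to fully release the chip's sparse compute capability.

Based on these optimizations, CANN now supports DeepSeek-V4-Flash training on TorchTitan. On an A3 cluster with BF16 precision, 64 cards, 4K sequence length + MTP1, training throughput reaches 1100 tokens/p/s.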
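To put the per-card figure in cluster terms, here is a rough back-of-the-envelope sketch. It assumes "tokens/p/s" means tokens per card per second, which the announcement does not spell out, and that all 64 cards run at the same rate. The function name is made up for illustration.

```python
def cluster_throughput(tokens_per_card_per_s, cards):
    """Aggregate tokens/s, assuming every card sustains the same rate."""
    return tokens_per_card_per_s * cards

SECONDS_PER_DAY = 24 * 60 * 60

# 1100 tokens/p/s and 64 cards are the figures quoted in the announcement.
per_second = cluster_throughput(1100, 64)
per_day = per_second * SECONDS_PER_DAY

print(f"{per_second:,} tokens/s across the cluster")
print(f"{per_day / 1e9:.2f} billion tokens/day")
```

Under those assumptions the 64-card SuperPoD processes roughly 70 thousand tokens per second, or about 6 billion tokens per day of continued pre-training.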

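Future outlook
Profiling data from the A3 cluster at this stage shows that compute time is overwhelmingly dominant, at 84.68% of total time, and that Vector-type operators alone account for nearly 40% of whole-network time. Further performance optimization can therefore focus on the following directions:

Adopt a finer-grained, on-demand recomputation strategy to avoid redundant recomputation of Vector operators whose activations do not need to be retained

Further expand the coverage of AutoFuse automatic fusion to compress compute latency

The current TorchTitan-NPU version does not yet integrate MC2 or an operator-pipeline-based mechanism for overlapping communication and computation in the EP domain; this capability will be filled in at a later stage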
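The profiling figure above (Vector operators at nearly 40% of whole-network time) gives a feel for how much further fusion could buy. The sketch below applies Amdahl's law; the 0.40 fraction is a rounding of "nearly 40%", and the local speedup factors are hypothetical, not measurements from the announcement.

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law: whole-step speedup when only `fraction` of the
    time is accelerated by a factor of `local_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)

VECTOR_FRACTION = 0.40  # "nearly 40%" in the announcement, rounded

# Hypothetical speedups of the Vector-operator portion only.
for s in (1.25, 1.5, 2.0, 4.0):
    gain = overall_speedup(VECTOR_FRACTION, s)
    print(f"Vector ops {s:.2f}x faster -> training step {gain:.2f}x faster")

# Upper bound if Vector-operator time dropped to zero.
ceiling = 1.0 / (1.0 - VECTOR_FRACTION)
print(f"ceiling: {ceiling:.2f}x")
```

Even in the limit, shrinking the Vector portion alone cannot speed up a step by more than about 1.67x under this assumption, which is consistent with the list also targeting recomputation and communication overlap.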

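On the feature side, the plan is to follow the evolution directions described in the DeepSeek-V4 technical report:

For the next-generation A5 platform, support FP8 and A8W4 low-precision quantized training, making use of the A5 generation's MxFP8/MxFP4 low-precision compute and communication capabilities

Integrate Muon optimizer support into TorchTitan-NPU in pursuit of faster model convergence, along with matching AutoFuse and fused-operator acceleration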

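Performance results and future outlook
Performance results on a 64-card A3 SuperPoD
Relying on the CANN platform and the TorchTitan-NPU plugin, we quickly completed basic training performance tuning of the DeepSeek-V4-Flash model on an A3 64-card cluster. The solution uses a minimal large EP + pure FSDP parallel partitioning strategy, integrates the fused operators developed for the sparse attention module, and combines them with the Ascend C AutoFuse automatic fusion mechanism, raising model throughput from an initial 397 tokens/p/s to 1100 tokens/p/s. Of this, the custom fused operators and AutoFuse contributed throughput gains of 90% and 30% respectively.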
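A quick sanity check of those numbers. The announcement does not say how the 90% and 30% gains combine or what baseline each is measured against, so the multiplication below is an assumption, not their accounting.

```python
baseline = 397.0  # tokens/p/s before tuning (from the announcement)
final = 1100.0    # tokens/p/s after tuning (from the announcement)

total = final / baseline
print(f"end-to-end speedup: {total:.2f}x")

# Assumption: the two quoted gains multiply rather than add.
fused_ops = 1.90  # +90% from custom fused operators
autofuse = 1.30   # +30% from AutoFuse
compounded = fused_ops * autofuse
print(f"1.90 x 1.30 compounded: {compounded:.2f}x")
print(f"not covered by the two quoted gains: {total / compounded:.2f}x")
```

The end-to-end gain works out to roughly 2.8x, a bit more than the two quoted contributions multiplied together, so either the percentages use different baselines or some other tuning also contributed. The announcement does not say which.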

Looks like there is still plenty of work to do here. Even they admit the 64-card cluster has only gone through basic training performance tuning.

bsdnf · May 1, 2026

tphuang said:
https://twitter.com/i/web/status/2050254833802793201

my look at Ascend SuperPoD + large SSD storage + networking + how that helps Caching + cache hit on DeepSeek V4

just so that we provide a link of this content

Looks like there is still plenty of work to do here. Even they admit the 64-card cluster has only gone through basic training performance tuning.

A 95% hit rate has essentially reached the theoretical limit; further hardware improvements offer little benefit in terms of cache hit rates.

tphuang · May 1, 2026

bsdnf said:
A 95% hit rate has essentially reached the theoretical limit; further hardware improvements offer little benefit in terms of cache hit rates.

If you can speed up cache hit retrieval, that frees up CPU cycles for other tasks. Every cycle the processor spends busy-waiting is wasted time.
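A rough way to see both points at once: with the hit rate pinned at 95%, average retrieval time still drops as the hit path gets faster. The latencies below are hypothetical placeholders, not measurements of any Ascend or DeepSeek system; only the 95% figure comes from the thread.

```python
def expected_latency(hit_rate, hit_ms, miss_ms):
    """Average retrieval latency for a given cache hit rate."""
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms

HIT_RATE = 0.95   # figure from the thread
MISS_MS = 200.0   # hypothetical cost of a miss (recompute from scratch)

# Hypothetical hit-path latencies, e.g. from faster storage or networking.
for hit_ms in (20.0, 10.0, 5.0):
    avg = expected_latency(HIT_RATE, hit_ms, MISS_MS)
    print(f"hit path {hit_ms:>4.1f} ms -> average {avg:.2f} ms")
```

With these made-up numbers, halving the hit-path latency cuts the average by about a third even though the hit rate never moves, because 95% of requests take the hit path.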

tokenanalyst · May 1, 2026

This guy Matthew Berman is getting a lot of flak for saying that the US should open source to win the "AI race", if there is one.

But outside of the AI race thing, I think he is missing the point of why the US probably won't go as open as China does, and why companies like Anthropic and OpenAI will in fact fight the creation of more capable open models and push for legislation against Chinese open models. The reason is money. The speculative valuation of these companies is just too high, and they need to show overly hyped investors that their subscription business models are viable and that there will be revenue in the future.

Today, training the model has become relatively the cheapest cost, and serving the model to millions of users on long-horizon tasks is now the second largest cost. But the biggest worry right now is how to keep happy the investors who have poured in trillions of dollars of speculative money based on speculative future revenue, and giving away free AI models to be run locally by the public, governments and private companies is not going to help with that.

bsdnf · May 1, 2026

They can distill and open-source smaller models, but the giants are only willing to release scraps like Gemma4 and GPT-OSS.

Anthropic? Their ambition is for no one but themselves to control LLMs; forget about it.

Wrought · May 2, 2026

Would you look at that, the rent-seekers are trying to seek rent.

Strahle labeled the post an advertisement, but she didn’t disclose what organization had paid for it. It turns out the funding came from Build American AI, a dark-money group tied to Leading the Future, a super PAC supported by, and in some cases directly funded by, tech figures affiliated with companies like OpenAI and Palantir.

Marketing agencies are pitching influencers deals such as $5,000 per TikTok video to amplify Build American AI’s messaging about how China’s technological rise should be seen as a threat. The goal, according to a staffer from SM4, the influencer marketing agency running the campaign on behalf of Build American AI, is to subtly shift public debate by framing China’s AI advancement as a serious risk to the safety and well-being of Americans. “They want a push to mention China and America and why beating China is so important,” says the staffer.

siegecrossbow · May 2, 2026

Why you shouldn’t disclose that you’ve been watching pron at work…

ForcedTrend · May 2, 2026

siegecrossbow said:
Why you shouldn’t disclose that you’ve been watching pron at work…

How do they access the footage? Is it uploaded directly to the cloud with no local-only option?

meedicx · May 2, 2026

Wrought said:
Would you look at that, the rent-seekers are trying to seek rent.

Exhibit 1: OpenAI / Palantir backed influence campaign against Chinese models (quoted)

Exhibit 2: US Government report using biased benchmarking to dump on Chinese AI

Exhibit 3: Treasury Secretary remarks: “If China is the casino, artificial intelligence is the table stakes. If we don’t win in AI, then it’s game over”

All signals that show the US government and industry are deeply anxious right now about Chinese AI and about maintaining their lead.

9dashline · May 2, 2026

meedicx said:
Exhibit 1: OpenAI / Palantir backed influence campaign against Chinese models (quoted)

Exhibit 2: US Government report using biased benchmarking to dump on Chinese AI

Exhibit 3: Treasury Secretary remarks: “If China is the casino, artificial intelligence is the table stakes. If we don’t win in AI, then it’s game over”

All signals that show the US government and industry are deeply anxious right now about Chinese AI and about maintaining their lead.

And Mythos is too expensive to serve publicly at 10 trillion in size, and still nowhere near "ASI/AGI"... It's called Myth for a reason...

They would have to charge 2x the price of Opus, and Anthropic is already compute bottlenecked and losing money on Opus anyway (heavily subsidized by VC money).

The entire US hegemony depends on the US maintaining dominance in AI... This is why they are panicking... The house of cards is collapsing before their eyes.
