Artificial Intelligence thread

tphuang

General
Staff member
Super Moderator
VIP Professional
Registered Member

my look at Ascend SuperPoD + large SSD storage + networking + how that helps Caching + cache hit on DeepSeek V4

Refining a DeepSeek model on your own code bakes the code base directly into the weights, which avoids context scaling issues and may end up performing better than the most expensive OAI or Claude models.
And just coincidentally, Ascend is marketing CPT support.
just so that we provide a link of this content

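Ascend simultaneously supports and open-sources a continued-training reference implementation for DeepSeek V4's complex Sparse Attention + mHC architecture, teaming up with Autofuse to make graph-mode training easy and optimized out of the box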

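Ascend CANN has officially completed 0-day adaptation support for continued pre-training (CPT) of the DeepSeek V4-Flash model on an A3 64-card supernode. Through deep coordination between the TorchTitan-NPU plugin and the Autofuse automatic fusion technology, measured model throughput reaches up to 1100 tokens/p/s, giving training performance that is optimized out of the box. This strong out-of-the-box result comes mainly from hardcore system-level optimizations along the following three dimensions: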


Highlights​

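  • The training framework uses a plugin-based TorchTitan + TorchTitan-NPU setup with a lean, supernode-friendly parallelism strategy of large EP + pure FSDP, reaching optimal memory usage with very low adaptation cost and communication overhead, and striking a good balance between ease of use and performance
  • TorchTitan-NPU is deeply adapted to the torch.compile mechanism, enabling graph-mode training; relying on Inductor + AutoFuse (a Codegen backend based on Ascend C), it achieves end-to-end automatic fusion of Vector operators, bringing up to a 31.8% out-of-the-box performance gain to the whole network
  • For complex structures such as sparse attention, several efficient NPU fused operators were developed, including SparseAttnSharedkvLightningIndexer, co-optimized along dimensions such as load-balanced per-core partitioning and memory/compute balance to fully unlock the chip's sparse compute capability
  • Based on the optimizations above, CANN now supports DeepSeek-V4-Flash training on top of TorchTitan; on an A3 cluster at BF16 precision with 64 cards, 4K sequences + MTP1, training throughput reaches 1100 tokens/p/s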
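
To put the quoted 1100 tokens/p/s in cluster terms, here is a rough back-of-the-envelope sketch. It assumes "tokens/p/s" means tokens per card per second and that "4K" means 4096 tokens; neither is spelled out in the announcement, and the 10B-token corpus is a purely hypothetical example size.

```python
# Figures quoted in the announcement.
TOKENS_PER_CARD_PER_SEC = 1100  # "tokens/p/s", read as per card (assumption)
NUM_CARDS = 64                  # A3 64-card supernode
SEQ_LEN = 4096                  # "4K sequence", assumed to be 4096 tokens

SECONDS_PER_DAY = 86_400


def cluster_tokens_per_sec(per_card: float, cards: int) -> float:
    """Aggregate throughput if every card sustains the per-card rate."""
    return per_card * cards


def sequences_per_sec(tokens_per_sec: float, seq_len: int) -> float:
    """Training sequences consumed per second at a fixed sequence length."""
    return tokens_per_sec / seq_len


def days_to_train(total_tokens: float, tokens_per_sec: float) -> float:
    """Wall-clock days to process total_tokens at a steady rate."""
    return total_tokens / tokens_per_sec / SECONDS_PER_DAY


total = cluster_tokens_per_sec(TOKENS_PER_CARD_PER_SEC, NUM_CARDS)
print(f"cluster throughput: {total:,.0f} tokens/s")
print(f"sequences per second: {sequences_per_sec(total, SEQ_LEN):.2f}")

# Hypothetical CPT corpus size, chosen only for illustration.
HYPOTHETICAL_CORPUS_TOKENS = 10_000_000_000
print(f"days for 10B tokens: {days_to_train(HYPOTHETICAL_CORPUS_TOKENS, total):.2f}")
```

Under those assumptions the 64-card pod moves about 70,400 tokens/s, so a code-base-sized CPT run would be a matter of days rather than weeks.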

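Future outlook

Current profiling data from the A3 cluster shows that compute time is overwhelmingly dominant at 84.68% of total time, with Vector operators alone accounting for nearly 40% of whole-network time. Further performance optimization can therefore focus on the following directions:

  • Adopt a finer-grained, on-demand recomputation strategy to avoid redundant recomputation of Vector operators whose activations do not need to be retained
  • Further expand the coverage of AutoFuse automatic fusion to reduce compute latency
  • The current TorchTitan-NPU version does not yet integrate MC2 or an EP-domain communication/computation overlap mechanism based on operator pipeline orchestration; this capability will be added in a later phase
On the feature side, follow-up work plans to track the directions laid out in the DeepSeek-V4 technical report:

  • For the next-generation A5 platform, support FP8 and A8W4 low-precision quantized training, taking advantage of the A5 generation's MxFP8/MxFP4 low-precision compute and communication capabilities
  • Integrate Muon optimizer support into TorchTitan-NPU in pursuit of faster model convergence, with accompanying AutoFuse and fused-operator acceleration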
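
The profiling split quoted above (compute at 84.68% of step time, Vector operators near 40%) invites an Amdahl's-law estimate of how much headroom is left. A minimal sketch, assuming the "nearly 40%" Vector share is exactly 0.40 and that the 15.32% outside compute could in principle be fully hidden by overlap; both are simplifications, not figures from the announcement.

```python
def amdahl_speedup(fraction: float, local_speedup: float) -> float:
    """Overall speedup when `fraction` of the runtime gets `local_speedup`x faster."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    if local_speedup <= 0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


VECTOR_SHARE = 0.40            # "nearly 40%" rounded up (assumption)
NON_COMPUTE_SHARE = 1 - 0.8468  # everything that is not compute time

# What better fusion or smarter recomputation of Vector ops could buy.
for local in (1.25, 1.5, 2.0):
    gain = amdahl_speedup(VECTOR_SHARE, local)
    print(f"Vector ops {local:.2f}x faster -> step {gain:.3f}x faster")

# Ceiling if Vector time vanished entirely (approximated with a huge factor).
print(f"Vector ceiling: {amdahl_speedup(VECTOR_SHARE, 1e12):.3f}x")

# Ceiling if all non-compute time were hidden behind compute,
# e.g. by the communication/computation overlap that is not yet integrated.
print(f"Overlap ceiling: {amdahl_speedup(NON_COMPUTE_SHARE, 1e12):.3f}x")
```

On these numbers the Vector side looks like the larger lever: even a perfect overlap of the non-compute time is capped below 1.2x, while halving Vector time alone gives 1.25x.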

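Performance results and future outlook

Performance results on a 64-card A3 SuperPoD

Relying on the CANN platform and the TorchTitan-NPU plugin, we quickly completed basic training performance tuning of the DeepSeek-V4-Flash model on an A3 64-card cluster. The approach uses a minimal parallelism strategy of large EP + pure FSDP, integrates the fused operators developed for the sparse attention module, and combines them with the Ascend C AutoFuse automatic fusion mechanism, raising model throughput from an initial 397 tokens/p/s to 1100 tokens/p/s. Of this, the custom fused operators and AutoFuse contributed throughput gains of 90% and 30% respectively.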
Looks like there is still plenty of work to do here. Even they admit that only basic training performance tuning has been done on the 64-card cluster.
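
For what it's worth, the quoted gains can be sanity-checked with a few lines of arithmetic. The sketch below assumes the 90% and 30% figures are relative throughput gains; the announcement does not say which baseline each is measured against, so both a compounding and an additive reading are shown.

```python
BASELINE = 397.0      # initial tokens/p/s, from the announcement
FINAL = 1100.0        # tuned tokens/p/s, from the announcement
FUSED_OP_GAIN = 0.90  # quoted gain from custom fused operators
AUTOFUSE_GAIN = 0.30  # quoted gain from AutoFuse


def overall_speedup(baseline: float, final: float) -> float:
    """End-to-end speedup implied by the two throughput figures."""
    return final / baseline


def compounded(*gains: float) -> float:
    """Speedup if each gain multiplies on top of the previous ones."""
    result = 1.0
    for g in gains:
        result *= 1.0 + g
    return result


def additive(*gains: float) -> float:
    """Speedup if every gain is measured against the same original baseline."""
    return 1.0 + sum(gains)


print(f"reported overall: {overall_speedup(BASELINE, FINAL):.2f}x")
for name, fn in (("compounded", compounded), ("additive", additive)):
    factor = fn(FUSED_OP_GAIN, AUTOFUSE_GAIN)
    print(f"{name}: {factor:.2f}x -> {BASELINE * factor:.0f} tokens/p/s")
```

Neither simple reading lands on 1100 tokens/p/s, so the percentages are presumably measured against different intermediate baselines, or other tuning contributed as well; the source does not say.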
 

bsdnf

Senior Member
Registered Member

my look at Ascend SuperPoD + large SSD storage + networking + how that helps Caching + cache hit on DeepSeek V4
just so that we provide a link of this content
looks like there is still plenty of work here. Even they admit, the 64-card cluster has only done basic training performance improvement.
A 95% cache hit rate has essentially reached the theoretical limit; further hardware improvements offer little benefit in terms of cache hit rates.
 

tokenanalyst

Lieutenant General
Registered Member
A guy called Matthew Berman is getting a lot of flak for saying that the US should go open source to win the "AI race", if there is one.

But outside of the AI race thing, I think he is missing the point of why the US probably won't go as open as China does, and why companies like Anthropic and OpenAI will in fact fight the creation of more capable open models and push for legislation against Chinese open models. The reason is money. The speculative valuation of these companies is just too high, and they need to show overly hyped investors that their subscription business models are viable and that there will be revenue in the future.

Today, training the model has become relatively the cheapest cost, and serving the model to millions of users on long-horizon tasks is now the second largest cost. But the biggest worry right now is how to keep happy the investors who have poured trillions of dollars of speculative money into these companies based on speculative future revenue, and giving away free AI models to be run locally by the public, governments and private companies is not going to help with that.

 

Wrought

Captain
Registered Member
Would you look at that, the rent-seekers are trying to seek rent.

Strahle labeled the post an advertisement, but she didn’t disclose what organization had paid for it. It turns out the funding came from Build American AI, a dark-money group tied to Leading the Future, a super PAC supported by, and in some cases directly funded by, tech figures affiliated with companies like OpenAI and Palantir.

Marketing agencies are pitching influencers deals such as $5,000 per TikTok video to amplify Build American AI’s messaging about how China’s technological rise should be seen as a threat. The goal, according to a staffer from SM4, the influencer marketing agency running the campaign on behalf of Build American AI, is to subtly shift public debate by framing China’s AI advancement as a serious risk to the safety and well-being of Americans. “They want a push to mention China and America and why beating China is so important,” says the staffer.

 

meedicx

Junior Member
Registered Member
Would you look at that, the rent-seekers are trying to seek rent.


Exhibit 1: OpenAI / Palantir backed influence campaign against Chinese models (quoted)

Exhibit 2: US Government report using biased benchmarking to dump on Chinese AI

Exhibit 3: Treasury Secretary remarks: “If China is the casino, artificial intelligence is the table stakes. If we don’t win in AI, then it’s game over”

All signals that show the US government and industry are deeply anxious right now about Chinese AI and about maintaining their lead.
 

9dashline

Captain
Registered Member
Exhibit 1: OpenAI / Palantir backed influence campaign against Chinese models (quoted)

Exhibit 2: US Government report using biased benchmarking to dump on Chinese AI

Exhibit 3: Treasury Secretary remarks: “If China is the casino, artificial intelligence is the table stakes. If we don’t win in AI, then it’s game over”

All signals that show the US government and industry are deeply anxious right now about Chinese AI and about maintaining their lead.
And Mythos is too expensive to serve publicly, at 10 trillion in size, and still nowhere near "ASI/AGI"... It's called Myth for a reason...

They would have to charge 2x the price of Opus, and Anthropic is already compute bottlenecked and losing money on Opus anyway (heavily subsidized by VC money).

The entire US hegemony depends on the US maintaining dominance in AI... This is why they are panicking... The house of cards is collapsing before their eyes.
 