It's notable that DeepSeek implemented the kernel using TileLang instead of CUDA. They also implemented DeepSeek Sparse Attention in DeepSeek-V3.2 using TileLang.
It seems the "moat" for CUDA isn't as big as the $NVDA boosters on social media claim.
It is notable, and was noted some months ago. For example: