Artificial Intelligence thread

9dashline

Captain
Registered Member
A board fight and a talent drain to other startups is what has happened.
Since I'm paying $200/month, I tested it according to its alleged strength by having it write me a short fictional story, the same test I gave R1 and o1-pro, and yet its 4.5 is the worst of the three. It's also unable to output long responses, unlike o1-pro or Deep Research, and its knowledge cutoff for a lot of recent tech is still back in 2023.

 

tphuang

General
Staff member
Super Moderator
VIP Professional
Registered Member

China Telecom's 息壤 (Xirang) platform currently has 22 EFLOPS of compute, or 27 EFLOPS if third-party compute is included. Not sure whether this is FP16 or FP64.

It's providing DeepSeek R1/V3 to customers like Sinopec, China Railway Material Group, the Dongguan government's digital management, and other government entities.


Alibaba's B2B platform has an AI search function, Accio, that uses DeepSeek's reasoning model.


Baidu has fully onboarded the full version of DeepSeek-R1. Customers on web, PC, and apps can all use it.


Baidu Library and Baidu Cloud have both onboarded the full version of DeepSeek-R1.
 

Eventine

Junior Member
Registered Member
A new startup from the founders of Oculus and Ubiquity6 and former Meta Reality Labs people is focusing on direct voice-to-voice AI chat (as opposed to voice-to-text / text-to-voice), challenging OpenAI and Grok 3.

They've put up a demo, and it seems quite impressive. They promise to open-source the model, which is smart considering what's been happening. The startup is backed financially by Marc Andreessen, among others. Assuming their open-source version isn't crippled, this could lead to wide adoption; too bad a Chinese lab didn't get to it first.
 

Fatty

Junior Member
Registered Member

Well, this is pretty crazy. DeepSeek apparently has a 545% cost-profit margin on inference for R1 and is using fewer than 300 H800 nodes (8 H800s per node) to serve the insane amount of demand they're getting.

Not sure whether they're just insanely good or if OpenAI and Anthropic are scamming people with their pricing.
 

OptimusLion

Junior Member
Registered Member
Official explanation of the DeepSeek-V3/R1 inference system: the optimization goals are higher throughput and lower latency.

DeepSeek officially published an article on Zhihu today, "Overview of the DeepSeek-V3/R1 Inference System," detailing how it uses large-scale cross-node expert parallelism (Expert Parallelism / EP) to increase batch size, how it hides communication latency behind computation, and more.

 

OptimusLion

Junior Member
Registered Member
DeepSeek Day 6: Overview of the DeepSeek-V3/R1 Inference System.
Let's follow DeepSeek's sixth-day update and dig into the DeepSeek V3/R1 inference system. It is not just an inference engine, but a carefully orchestrated symphony of parallelism, communication optimization, and load balancing, all designed with a single purpose: to squeeze every drop of performance out of High-Flyer's existing H800 GPUs. This is true MoE (mixture-of-experts) magic, and DeepSeek has laid it out for the world in the sixth-day update.
DeepSeek's goal is to achieve high throughput (handling a large number of requests) and low latency (fast responses) at the same time, the eternal problem of large language model (LLM) serving. The difficulty is that MoE models like DeepSeek-V3/R1 are very large: they have an enormous number of parameters, and running them naively would be slow and consume a lot of memory.
DeepSeek's solution is expert parallelism (EP), and not ordinary EP but large-scale, cross-node EP. Instead of having every GPU hold the entire model, EP spreads the model's experts across many GPUs, so each GPU only stores a subset of the experts and its memory footprint shrinks. DeepSeek-V3/R1 has 256 experts per layer, but only 8 experts are activated for any given token. This sparsity is key, but it requires large batch sizes to ensure that each expert has enough work to do. To achieve huge batch sizes and distribute experts efficiently, DeepSeek spreads experts across multiple nodes (servers with multiple GPUs), which in turn requires cross-node communication, overlapping that communication with computation, data parallelism (DP), and load balancing between DP instances.
Imagine a huge pizza (the model), where each slice represents an expert. Instead of trying to eat the whole pizza yourself, you share it with friends (GPUs) at a party. Everyone gets a manageable portion, and together the group finishes the whole pizza (processes a batch of data).
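To make the routing idea concrete, here is a minimal toy sketch in PyTorch of top-k MoE routing with experts sharded across GPUs. The shapes, names, and the round-robin expert placement are my own illustrative assumptions, not DeepSeek's actual implementation:

```python
import torch

NUM_EXPERTS = 256   # routed experts per MoE layer, as described above
TOP_K = 8           # experts activated per token
NUM_GPUS = 32       # hypothetical EP group size for this sketch

# Static placement: expert i lives on GPU (i % NUM_GPUS), so each GPU
# stores only NUM_EXPERTS / NUM_GPUS = 8 experts instead of all 256.
expert_to_gpu = torch.arange(NUM_EXPERTS) % NUM_GPUS

def route_tokens(router_logits):
    """Pick top-k experts per token and group the work by destination GPU.

    router_logits: [num_tokens, NUM_EXPERTS] gating scores.
    Returns {gpu_id: (token_indices, expert_indices)}, i.e. the dispatch
    plan that a real cross-node EP system would send over the interconnect.
    """
    _, topk_experts = router_logits.topk(TOP_K, dim=-1)    # [tokens, TOP_K]
    plan = {}
    for gpu in range(NUM_GPUS):
        mask = expert_to_gpu[topk_experts] == gpu           # [tokens, TOP_K]
        token_idx, slot_idx = mask.nonzero(as_tuple=True)
        plan[gpu] = (token_idx, topk_experts[token_idx, slot_idx])
    return plan

# Example: route 16 tokens. With only 8 of 256 experts active per token,
# each GPU receives just a handful of tokens, hence the need for big batches.
plan = route_tokens(torch.randn(16, NUM_EXPERTS))
```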
DeepSeek uses a prefill-decode disaggregation architecture: initial prompt processing (prefill) and token generation (decode) are handled separately, which lets them tailor the parallelism strategy to each stage:
Prefill stage: [Routed Expert EP32, MLA / Shared Expert DP32]. EP32: each deployment unit spans 4 nodes with 32 redundant routed experts; each GPU handles 9 routed experts and 1 shared expert. DP32: data parallelism across 32 GPUs for MLA (multi-head latent attention) and the shared experts. Prefill is compute-intensive, so the parallelism here is tuned to maximize throughput.
Decoding stage: [Routed Expert EP144, MLA / Shared Expert DP144]. EP144: each deployment unit spans 18 nodes with 32 redundant routed experts; each GPU manages 2 routed experts and 1 shared expert. DP144: data parallelism across 144 GPUs for MLA and the shared experts. Decoding is more latency-sensitive, and a higher EP degree reduces each GPU's memory footprint and latency.
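As a sanity check, the per-GPU expert counts quoted above follow from the deployment-unit sizes alone, assuming 8 GPUs per node, 256 routed experts per layer, and 32 redundant experts per unit (all stated in the post):

```python
# Quick arithmetic check of the per-GPU expert counts; not DeepSeek code.
ROUTED_EXPERTS = 256
REDUNDANT_EXPERTS = 32
GPUS_PER_NODE = 8

def routed_experts_per_gpu(nodes):
    gpus = nodes * GPUS_PER_NODE
    return (ROUTED_EXPERTS + REDUNDANT_EXPERTS) / gpus

print(routed_experts_per_gpu(4))    # prefill unit, EP32:  288 / 32  = 9.0
print(routed_experts_per_gpu(18))   # decode unit, EP144:  288 / 144 = 2.0
# Plus one shared expert replicated on every GPU in both configurations.
```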
Large-scale EP introduces a lot of communication: GPUs need to exchange which experts are activated and the results of their computations, and this traffic can easily become a bottleneck. DeepSeek's solution is compute-communication overlap: the computation workflow is designed so that communication latency is hidden behind computation.
Prefill phase: dual micro-batch overlap. Each batch is split into two micro-batches; while one micro-batch is being computed, the other micro-batch's communication runs in the background. Micro-batch 1: compute, then communicate; micro-batch 2: communicate, then compute. It is like juggling, catching one ball while throwing the other (Figure 2; code example in Figure 4).
The decoding phase is more complicated because the execution times of the different stages are uneven. DeepSeek splits the attention layer into two steps and uses a 5-stage pipeline to achieve seamless overlap (Figure 3): each stage performs part of the computation and then starts its communication while the next stage begins computing.
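A stripped-down sketch of the dual micro-batch idea, using two CUDA streams in PyTorch. This is a conceptual skeleton under my own assumptions (the real system overlaps at the kernel and collective level); compute_fn and communicate_fn are hypothetical placeholders for the expert kernels and all-to-all collectives:

```python
import torch

# Default stream runs expert computation; a side stream carries the
# all-to-all dispatch/combine traffic of the *other* micro-batch.
comm_stream = torch.cuda.Stream()

def overlapped_step(micro_a, micro_b, compute_fn, communicate_fn):
    """One step of dual micro-batch overlap.

    While micro-batch A computes on the default stream, micro-batch B's
    communication proceeds concurrently on comm_stream. The roles swap
    on the next step.
    """
    with torch.cuda.stream(comm_stream):
        b_in_flight = communicate_fn(micro_b)   # B: communicate in background
    a_out = compute_fn(micro_a)                 # A: compute in foreground
    torch.cuda.current_stream().wait_stream(comm_stream)
    return a_out, b_in_flight
```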
Even with EP and communication overlap, the system can still bottleneck if some GPUs are overloaded while others sit idle. DeepSeek implements three load balancers to address this:
Prefill load balancer: handles the uneven number of requests and sequence lengths across DP instances; it balances core attention computation across GPUs and equalizes the number of input tokens per GPU.
Decode load balancer: handles the uneven number of requests and sequence lengths across DP instances (which leads to uneven KVCache usage); it balances KVCache usage across GPUs and equalizes the number of requests per GPU.
Expert-parallel load balancer: handles the fact that some experts are inherently more popular than others (leading to unbalanced expert computation across GPUs); it balances expert computation on each GPU, i.e., minimizes the maximum dispatch-receive load. Figure 5 shows a very basic load-balancing example (a similar toy sketch follows below); in practice DeepSeek uses a more sophisticated algorithm to minimize the maximum load on any single GPU.
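Here is one such very basic approach, a greedy longest-processing-time heuristic. It is my own illustrative sketch, not DeepSeek's actual balancer:

```python
import heapq

def balance_experts(expert_loads, num_gpus):
    """Greedy heuristic: place the hottest experts first on whichever GPU
    currently has the least total load, approximating the goal of
    minimizing the maximum per-GPU load.

    expert_loads: estimated token count per expert.
    Returns assignment[i] = GPU index for expert i.
    """
    heap = [(0, gpu) for gpu in range(num_gpus)]   # (current load, gpu id)
    heapq.heapify(heap)
    assignment = [None] * len(expert_loads)
    for expert in sorted(range(len(expert_loads)),
                         key=lambda e: -expert_loads[e]):
        load, gpu = heapq.heappop(heap)
        assignment[expert] = gpu
        heapq.heappush(heap, (load + expert_loads[expert], gpu))
    return assignment

# Example: 8 experts with very skewed popularity spread over 4 GPUs.
print(balance_experts([100, 80, 60, 40, 20, 10, 5, 5], num_gpus=4))
# -> [0, 1, 2, 3, 3, 2, 3, 3]  (per-GPU loads: 100, 80, 70, 70)
```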
Finally, the most striking part of DeepSeek's write-up: DeepSeek runs its inference service on H800 GPUs and squeezes every drop of performance out of them by mixing FP8 and BF16 precision. Specifically, matrix multiplications and dispatch transmissions use the same FP8 format as training, while core MLA computations and combine transmissions use BF16 to preserve service quality.
In addition, because service load is high during the day and low at night, DeepSeek deploys inference across all nodes during daytime peak hours; during the low-load period at night it scales down the inference nodes and reallocates the resources to research and training (Figure 6). Over the past 24 hours (12:00 noon UTC+8 on February 27, 2025 to 12:00 noon on February 28, 2025), the combined peak node occupancy of the V3 and R1 inference services reached 278 nodes, with an average occupancy of 226.75 nodes (each node contains 8 H800 GPUs). Assuming an H800 rental cost of $2 per GPU-hour, the total daily cost is $87,072.
The numbers speak for themselves:
73.7k input tokens per second per H800 node (prefill, including cache hits)
14.8k output tokens per second per H800 node (decode)
These figures include all user requests from the web, apps, and the API. If every token were billed at DeepSeek-R1 pricing (*), the theoretical total daily revenue would be $562,027, a cost-profit ratio of 545%.
(*) R1 pricing: $0.14/M input tokens (cache hits), $0.55/M input tokens (cache misses), $2.19/M output tokens.
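For anyone checking the arithmetic, the cost and margin figures follow directly from the numbers above (the $2/hour H800 rental price is the post's own assumption):

```python
# Reproducing the post's cost and margin figures from its stated numbers.
avg_nodes     = 226.75        # average node occupancy over the 24h window
gpus_per_node = 8
gpu_hour_cost = 2.0           # assumed H800 rental price, $/GPU-hour

daily_cost = avg_nodes * gpus_per_node * gpu_hour_cost * 24
print(round(daily_cost))      # -> 87072  ($ per day)

theoretical_revenue = 562_027  # if every token were billed at R1 pricing
margin = (theoretical_revenue - daily_cost) / daily_cost
print(f"{margin:.0%}")        # -> 545%  (the quoted cost-profit ratio)
```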
Of course, DeepSeek acknowledges that actual revenue is far lower than this theoretical figure, for the following reasons (Figure 7):
DeepSeek-V3 is priced significantly lower than R1;
only part of the service is monetized (web and app access is still free);
nighttime discounts are automatically applied during off-peak hours.


[Attached: figures from DeepSeek's article, including Figures 2-7 referenced above]
 