DeepSeek Day 6: Overview of the DeepSeek-V3/R1 Inference System.
TL;DR: Following DeepSeek's Day 6 update, let's dig into the essence of the DeepSeek-V3/R1 inference system. It is not just an inference engine but a carefully orchestrated symphony of parallelism, communication optimization, and load balancing, all designed with a single purpose: to squeeze every drop of performance out of High-Flyer's existing H800 GPUs. This is true MoE (Mixture of Experts) magic, and DeepSeek unveiled it to the world in the Day 6 update.
DeepSeek's goal is to achieve high throughput (handling a large number of requests) and low latency (fast responses) at the same time, the eternal trade-off in large language model (LLM) serving. The problem: MoE models like DeepSeek-V3/R1 are enormous. They have a huge number of parameters, and running them naively would be slow and consume a great deal of memory.
DeepSeek's solution is expert parallelism (EP), and not ordinary EP, but large-scale, cross-node EP (cross-node expert parallelism). Instead of having every GPU store the entire model, EP spreads the model's experts across multiple GPUs; each GPU only stores a subset of the experts, which reduces the per-GPU memory footprint. MoE models have a large number of experts: DeepSeek-V3/R1 has 256 experts per layer, but only 8 are activated for any given token. This sparsity is the key, but it requires large batch sizes to ensure that each expert has enough work to do. To achieve huge batch sizes and distribute experts efficiently, DeepSeek spreads experts across multiple nodes (servers with multiple GPUs each), which introduces cross-node communication, demands overlapping communication with computation, and, combined with data parallelism (DP), requires load balancing across DP instances.
Imagine: you have a huge pizza (the model), and each slice represents an expert. Instead of trying to eat the whole pizza yourself, you share it with friends (the GPUs in a cluster) at a party. Everyone gets a manageable slice, and together the party finishes the whole pizza (processes a batch of data).
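To make the routing concrete, here is a minimal sketch in plain PyTorch, assuming the 256-expert / top-8 configuration described above and a hypothetical hidden size of 7168; it is an illustration, not DeepSeek's code:

```python
# Minimal sketch of MoE top-k routing under expert parallelism (EP).
# Hypothetical shapes and names for illustration; not DeepSeek's actual code.
import torch

num_experts = 256      # experts per MoE layer in DeepSeek-V3/R1
top_k = 8              # experts activated per token
num_gpus = 32          # EP degree: experts are sharded across these GPUs
experts_per_gpu = num_experts // num_gpus  # 8 experts resident on each GPU

tokens = torch.randn(16, 7168)             # a small batch of token hidden states
router = torch.nn.Linear(7168, num_experts, bias=False)

# The router scores each token against every expert and keeps the top-8.
scores = router(tokens).softmax(dim=-1)
topk_weights, topk_experts = scores.topk(top_k, dim=-1)   # [16, 8] expert ids

# Under EP, expert e lives on GPU e // experts_per_gpu, so each (token, expert)
# pair must be dispatched to that GPU, computed there, and combined back,
# weighted by topk_weights.
target_gpu = topk_experts // experts_per_gpu
print(target_gpu[0])   # which GPUs token 0's hidden state would be sent to
```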
DeepSeek uses a prefill-decode disaggregation architecture, meaning initial prompt processing (prefill) and token generation (decode) are handled separately. This lets them tailor the parallelism strategy to each stage:
Prefill stage [Routed Expert EP32, MLA/Shared Expert DP32]: each deployment unit spans 4 nodes with 32 redundant routed experts; each GPU handles 9 routed experts and 1 shared expert. DP32 means data parallelism across 32 GPUs for MLA (Multi-head Latent Attention) and the shared experts. Prefill is compute-intensive, and the EP degree is chosen to maximize throughput.
Decoding stage [Routed Expert EP144, MLA/Shared Expert DP144]: each deployment unit spans 18 nodes with 32 redundant routed experts; each GPU manages 2 routed experts and 1 shared expert. DP144 means data parallelism across 144 GPUs for MLA and the shared experts. Decode is latency-sensitive; a higher EP degree reduces each GPU's memory footprint and helps lower latency.
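Restating the two configurations as a small Python dict (field names are my own; the numbers come from the description above) makes it easy to sanity-check that each deployment unit hosts the 256 experts plus the 32 redundant ones:

```python
# The two deployment configurations above, restated as a plain Python dict.
# Field names are my own; the numbers come from DeepSeek's Day 6 post.
DEPLOYMENT = {
    "prefill": {
        "nodes_per_unit": 4,
        "gpus_per_unit": 32,                # 4 nodes x 8 H800 GPUs
        "routed_expert_parallelism": 32,    # EP32
        "mla_shared_data_parallelism": 32,  # DP32
        "redundant_routed_experts": 32,
        "routed_experts_per_gpu": 9,
        "shared_experts_per_gpu": 1,
    },
    "decode": {
        "nodes_per_unit": 18,
        "gpus_per_unit": 144,                # 18 nodes x 8 H800 GPUs
        "routed_expert_parallelism": 144,    # EP144
        "mla_shared_data_parallelism": 144,  # DP144
        "redundant_routed_experts": 32,
        "routed_experts_per_gpu": 2,
        "shared_experts_per_gpu": 1,
    },
}

# Sanity check: routed expert replicas per unit = 256 experts + redundancy.
for stage, cfg in DEPLOYMENT.items():
    hosted = cfg["gpus_per_unit"] * cfg["routed_experts_per_gpu"]
    assert hosted == 256 + cfg["redundant_routed_experts"], stage
    print(stage, "hosts", hosted, "routed expert replicas")
```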
Large-scale EP introduces a lot of communication: GPUs need to exchange data about which experts are activated and to send back the computation results, and this communication can easily become the bottleneck. DeepSeek's solution? Compute-communication overlap. The computation workflow is cleverly designed so that communication latency is hidden behind computation.
Prefill phase: dual micro-batch overlap. Each batch is split into two micro-batches; while one micro-batch is computing, the other micro-batch's communication runs in the background (micro-batch 1: compute -> communicate, micro-batch 2: communicate -> compute). It is like juggling: you catch one ball while throwing the other (Figure 2; see Figure 4 for a code example).
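Conceptually, the overlap can be sketched with two CUDA streams in PyTorch. This is a toy (the compute and communicate functions are placeholders, and a CUDA device is assumed), not DeepSeek's actual pipeline:

```python
# Toy illustration of dual micro-batch overlap using two CUDA streams.
# "compute" and "communicate" stand in for expert GEMMs and all-to-all
# dispatch/combine; this is not DeepSeek's actual implementation.
import torch

compute_stream = torch.cuda.Stream()
comm_stream = torch.cuda.Stream()

def compute(x):
    return x @ x.t()            # placeholder for MLA / expert computation

def communicate(x):
    return x.clone()            # placeholder for all-to-all dispatch/combine

mb0 = torch.randn(1024, 1024, device="cuda")
mb1 = torch.randn(1024, 1024, device="cuda")

for step in range(4):
    # While one micro-batch computes on its stream...
    with torch.cuda.stream(compute_stream):
        mb0 = compute(mb0)
    # ...the other micro-batch's communication proceeds on the second stream.
    with torch.cuda.stream(comm_stream):
        mb1 = communicate(mb1)
    # Swap roles so each micro-batch alternates compute and communication.
    torch.cuda.synchronize()
    mb0, mb1 = mb1, mb0
```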
The decode phase is trickier because the execution times of its stages are uneven. DeepSeek splits the attention layer into two steps and uses a 5-stage pipeline to achieve seamless overlap (Figure 3): each stage performs part of the computation and then kicks off its communication while the next stage begins computing.
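For intuition, a tiny schedule printer shows how micro-batches march through a 5-stage pipeline so that one micro-batch's communication-heavy stage lines up with another's compute-heavy stage; the stage names here are hypothetical, not DeepSeek's exact decomposition:

```python
# Rough illustration of a staged pipeline: while micro-batch m occupies stage s,
# micro-batch m+1 occupies stage s-1, so communication of one micro-batch can
# hide behind computation of another. Stage names are hypothetical.
STAGES = ["attn_part1", "dispatch", "expert_mlp", "combine", "attn_part2"]
NUM_MICROBATCHES = 4

for tick in range(len(STAGES) + NUM_MICROBATCHES - 1):
    active = []
    for mb in range(NUM_MICROBATCHES):
        stage_idx = tick - mb
        if 0 <= stage_idx < len(STAGES):
            active.append(f"mb{mb}:{STAGES[stage_idx]}")
    print(f"tick {tick}: " + ", ".join(active))
```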
Even with EP and communication overlap, the system can become bottlenecked if some GPUs are overloaded while others are idle. DeepSeek implements three load balancers to address this issue:
Prefill load balancer: addresses the uneven number of requests and sequence lengths between DP instances, balances core attention computations between GPUs, and balances the number of input tokens for each GPU.
Decoding load balancer: addresses the uneven number of requests and sequence lengths between DP instances (resulting in inconsistent KVCache usage), balances KVCache usage between GPUs, and balances the number of requests for each GPU.
Expert-parallel load balancer: addresses the fact that some experts are inherently more popular than others (which skews expert computation across GPUs) by balancing expert load on each GPU, i.e. minimizing the maximum dispatch receive load. Figure 5 shows a very basic balancing method (a similarly simple greedy sketch follows below); in practice DeepSeek uses a more sophisticated algorithm to minimize the maximum load on any single GPU.
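For intuition only, here is a minimal greedy placement sketch assuming some per-expert load statistics: the heaviest experts are assigned first, each to the currently least-loaded GPU (the classic LPT heuristic). It illustrates the objective of minimizing the maximum per-GPU load, not DeepSeek's actual algorithm:

```python
# Greedy "heaviest expert to least-loaded GPU" placement (LPT heuristic).
# Illustrative only: DeepSeek's balancer also handles redundant experts and
# dispatch traffic, and uses a more sophisticated objective.
import heapq
import random

NUM_EXPERTS = 256
NUM_GPUS = 32

# Assumed per-expert load statistics (e.g. token counts routed to each expert).
random.seed(0)
expert_load = {e: random.randint(1, 1000) for e in range(NUM_EXPERTS)}

# Min-heap of (current_load, gpu_id); place heaviest experts first.
heap = [(0, gpu) for gpu in range(NUM_GPUS)]
heapq.heapify(heap)
placement = {gpu: [] for gpu in range(NUM_GPUS)}

for expert, load in sorted(expert_load.items(), key=lambda kv: -kv[1]):
    gpu_load, gpu = heapq.heappop(heap)
    placement[gpu].append(expert)
    heapq.heappush(heap, (gpu_load + load, gpu))

loads = [sum(expert_load[e] for e in experts) for experts in placement.values()]
print("max GPU load:", max(loads), "min GPU load:", min(loads))
```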
Finally, the wildest part of DeepSeek's disclosure: DeepSeek runs its inference service on H800 GPUs, squeezing every drop of performance out of them by cleverly mixing FP8 and BF16 precision. Specifically, matrix multiplications and dispatch transmission use the same FP8 format as training, while core MLA computation and combine transmission use BF16, to ensure optimal service performance.
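As a rough illustration of the FP8 side of this trade-off, here is a per-tensor E4M3 quantize/dequantize round trip in PyTorch (it assumes a build that exposes torch.float8_e4m3fn; DeepSeek's actual kernels are far more involved):

```python
# Per-tensor FP8 (E4M3) quantize/dequantize round trip, to illustrate where
# precision is traded for bandwidth and compute. Not DeepSeek's kernels;
# requires a PyTorch build that exposes torch.float8_e4m3fn.
import torch

x = torch.randn(128, 7168, dtype=torch.bfloat16)   # BF16 activations

# Scale the tensor so its max magnitude fits the E4M3 range (~448).
amax = x.abs().max().float()
scale = 448.0 / amax
x_fp8 = (x.float() * scale).to(torch.float8_e4m3fn)   # 1 byte per value

# Dequantize back to BF16 where higher precision is needed (e.g. MLA, combine).
x_back = (x_fp8.float() / scale).to(torch.bfloat16)
print("max abs error:", (x.float() - x_back.float()).abs().max().item())
```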
In addition, because the service load is high during the day and low at night, DeepSeek deploys inference services across all nodes during daytime peak hours, and during the low-load period at night it scales down the inference nodes and reallocates the resources to research and training (Figure 6). Over the past 24 hours (12:00 PM February 27, 2025 to 12:00 PM February 28, 2025, UTC+8), the combined peak node occupancy of the V3 and R1 inference services reached 278 nodes, with an average occupancy of 226.75 nodes (each node contains 8 H800 GPUs). Assuming a rental cost of $2 per H800 GPU per hour, the total daily cost is $87,072.
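The daily cost figure can be reproduced directly from the stated numbers:

```python
# Reproducing the stated daily GPU cost from the figures above.
avg_nodes = 226.75
gpus_per_node = 8
usd_per_gpu_hour = 2.0
hours = 24

daily_cost = avg_nodes * gpus_per_node * usd_per_gpu_hour * hours
print(f"${daily_cost:,.0f}")   # $87,072
```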
The numbers speak for themselves:
73.7k input tokens per second per H800 node (during prefill, including cache hits)
14.8k output tokens per second per H800 node (during decode)
The above data includes all user requests from the web, the app, and the API. If all tokens were billed at DeepSeek-R1 pricing (*), the theoretical total daily revenue would be $562,027, a cost-profit ratio of 545%.
(*) R1 pricing: $0.14/M input tokens (cache hits), $0.55/M input tokens (cache misses), $2.19/M output tokens.
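And the quoted 545% cost-profit ratio follows from the theoretical revenue and the daily cost computed above:

```python
# Cost-profit ratio implied by the theoretical daily revenue and daily cost.
revenue = 562_027
cost = 87_072
profit_ratio = (revenue - cost) / cost
print(f"{profit_ratio:.0%}")   # ~545%
```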
Of course, DeepSeek admits that its actual revenue is far lower than this figure, for the following reasons (Figure 7):
DeepSeek-V3 is priced significantly lower than R1,
Only a small portion of the services is monetized (web and app access remain free),
Nighttime discounts are automatically applied during off-peak hours.
