DeepSeek, a Chinese artificial intelligence startup, says the AI models it trains are comparable to leading models from heavyweights such as OpenAI, Meta, and Anthropic, but require roughly 11 times less GPU compute and cost. These claims have not yet been fully verified, but the striking announcement suggests that while US sanctions have affected the availability of AI hardware in China, clever engineers are working to extract the best possible performance from the limited hardware they can get, softening the impact of the chokehold on China's AI chip supply.
DeepSeek trained its DeepSeek-V3 Mixture-of-Experts (MoE) language model, with 671 billion parameters, in just two months on a cluster of 2,048 Nvidia H800 GPUs, representing about 2.8 million GPU hours, according to its paper. By comparison, it took Meta 11 times the compute (30.8 million GPU hours) to train Llama 3, with 405 billion parameters, over 54 days on a cluster of 16,384 H100 GPUs.
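For a rough sense of the gap, the headline ratio can be checked directly from the figures above. This is only a back-of-the-envelope calculation (and H800 and H100 are different parts, so GPU hours are not a perfect apples-to-apples measure):

```python
# Quick sanity check of the compute gap, using the GPU-hour figures quoted above.
deepseek_v3_gpu_hours = 2.8e6    # ~2.8 million H800 GPU hours
llama3_405b_gpu_hours = 30.8e6   # ~30.8 million H100 GPU hours

ratio = llama3_405b_gpu_hours / deepseek_v3_gpu_hours
print(f"Llama 3 405B used roughly {ratio:.0f}x the GPU hours of DeepSeek-V3")
# -> roughly 11x
```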
DeepSeek says it combined advanced pipeline algorithms, an optimized communication framework, and FP8 low-precision computation and communication to significantly reduce the compute and memory demands typically needed for a model of this size.
The company used a cluster of 2,048 Nvidia H800 GPUs connected by NVLink for GPU-to-GPU communication within a node and by InfiniBand for node-to-node communication. In this kind of setup, communication between GPUs inside a node is quite fast, but communication between nodes is not, so optimization is key to performance and efficiency. While DeepSeek implemented dozens of optimization techniques to reduce the compute requirements of DeepSeek-V3, several key techniques enabled its impressive results.
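To see why node-to-node traffic is the pain point, a rough comparison of the two links helps. The bandwidth figures below are our own ballpark assumptions for H800-class NVLink and InfiniBand NICs, not numbers from DeepSeek's paper:

```python
# Illustrative only: rough per-link bandwidth assumptions (not from the DeepSeek paper).
NVLINK_GBPS = 400   # approximate intra-node NVLink bandwidth per H800, GB/s
IB_GBPS = 50        # approximate inter-node InfiniBand bandwidth per NIC, GB/s

payload_gb = 1.0    # e.g., one micro-batch worth of expert activations

print(f"intra-node transfer: ~{payload_gb / NVLINK_GBPS * 1e3:.1f} ms")
print(f"inter-node transfer: ~{payload_gb / IB_GBPS * 1e3:.1f} ms")
# The order-of-magnitude gap is why hiding inter-node traffic behind computation matters so much.
```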
DeepSeek uses the DualPipe algorithm to overlap the computation and communication phases within and across forward and backward micro-batches, thereby reducing pipeline stalls. In particular, the dispatch (routing tokens to experts) and combine (aggregating results) operations are handled in parallel with computation using custom PTX (Parallel Thread Execution) instructions, which means writing low-level, GPU-specific code that sits below Nvidia's CUDA and fine-tunes how operations run on the hardware. According to DeepSeek, the DualPipe algorithm minimizes training bottlenecks, particularly for the cross-node expert parallelism required by the MoE architecture, and this optimization allowed the cluster to process 14.8 trillion tokens during pre-training with near-zero communication overhead.
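DeepSeek has not released DualPipe as a simple snippet, but the core idea of overlapping communication with computation can be sketched in plain PyTorch using separate CUDA streams. This is a simplified illustration, not the bidirectional pipeline schedule or the custom PTX kernels the paper describes; dispatch_fn, expert_fn, and combine_fn are hypothetical placeholders:

```python
import torch

comm_stream = torch.cuda.Stream()             # side stream for dispatch/combine traffic
compute_stream = torch.cuda.current_stream()  # default stream for expert computation

def moe_layer_overlapped(micro_batches, dispatch_fn, expert_fn, combine_fn):
    """Launch the dispatch (routing) communication for each micro-batch on a side
    stream so it runs while the experts are still computing on the previous one."""
    results, pending = [], None
    for mb in micro_batches:
        comm_stream.wait_stream(compute_stream)           # mb must be ready before routing it
        with torch.cuda.stream(comm_stream):
            dispatched = dispatch_fn(mb)                  # communication kernels (e.g. all-to-all)
        if pending is not None:
            results.append(combine_fn(expert_fn(pending)))  # overlaps with the dispatch above
        compute_stream.wait_stream(comm_stream)           # 'dispatched' is now safe to consume
        pending = dispatched
    if pending is not None:
        results.append(combine_fn(expert_fn(pending)))
    return results
```

The real DualPipe goes much further, scheduling forward and backward passes against each other across the whole pipeline, but the stream pattern above captures why communication can effectively disappear behind computation.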
In addition to implementing DualPipe, DeepSeek restricted each token to a maximum of four nodes, capping the number of nodes involved in communication. This reduces traffic and ensures that communication and computation can overlap effectively.
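Conceptually, node-limited routing just means choosing each token's experts from a restricted set of nodes. The sketch below is our own simplified version in PyTorch: it scores nodes by the best expert affinity they offer (the paper uses a slightly different per-node criterion) and then takes the top-k experts only from the selected nodes:

```python
import torch

def node_limited_topk(scores, expert_to_node, k=8, max_nodes=4):
    """Pick the top-k experts for each token, drawn from at most `max_nodes` nodes.

    scores:         (num_tokens, num_experts) routing affinities
    expert_to_node: (num_experts,) long tensor mapping each expert to its node id
    """
    num_nodes = int(expert_to_node.max().item()) + 1

    # Score each node, here by the best expert affinity it offers to the token.
    node_scores = torch.full((scores.size(0), num_nodes), float("-inf"), device=scores.device)
    node_scores.scatter_reduce_(1, expert_to_node.unsqueeze(0).expand_as(scores),
                                scores, reduce="amax")
    top_nodes = node_scores.topk(max_nodes, dim=1).indices        # (num_tokens, max_nodes)

    # Mask out experts living on nodes this token is not allowed to reach.
    allowed = (expert_to_node.view(1, -1, 1) == top_nodes.unsqueeze(1)).any(-1)
    masked_scores = scores.masked_fill(~allowed, float("-inf"))
    return masked_scores.topk(k, dim=1)                           # experts span <= max_nodes nodes
```

With k=8 and max_nodes=4 this roughly mirrors the configuration DeepSeek describes: eight routed experts per token, spread over no more than four nodes.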
A key factor in reducing the computational and communication requirements was the use of low-precision training techniques. DeepSeek uses an FP8 mixed-precision framework, which enables faster operations and lower memory usage without compromising numerical stability. Key operations such as matrix multiplications are performed in FP8, while sensitive components such as embeddings and normalization layers retain higher precision (BF16 or FP32) to preserve accuracy. This approach reduces memory requirements while keeping results robust, with the relative training loss error consistently below 0.25%.
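The general shape of such a scheme can be sketched as follows. This is a simulated, per-tensor-scaled illustration only; DeepSeek's framework uses finer-grained (tile- and block-wise) scaling and real FP8 tensor-core GEMMs rather than the dequantize-and-multiply fallback shown here:

```python
import torch

FP8_MAX = 448.0  # largest representable magnitude of torch.float8_e4m3fn

def quantize_fp8(x):
    """Per-tensor scaling into the FP8 range; returns the FP8 payload plus its scale."""
    scale = x.abs().amax().clamp(min=1e-12) / FP8_MAX
    return (x / scale).to(torch.float8_e4m3fn), scale

def fp8_linear(x_bf16, w_bf16):
    """Simulated FP8 GEMM: quantize both operands, multiply, then rescale.
    A real kernel would feed the FP8 tensors straight to the GPU's FP8 tensor cores."""
    xq, sx = quantize_fp8(x_bf16)
    wq, sw = quantize_fp8(w_bf16)
    out = xq.to(torch.bfloat16) @ wq.to(torch.bfloat16).t()
    return out * (sx * sw)

# Sensitive components stay in higher precision, as the paper describes.
norm = torch.nn.LayerNorm(4096, dtype=torch.float32)
```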
In terms of performance, the company says the DeepSeek-V3 MoE language model is on par with or better than GPT-4x, Claude-3.5-Sonnet, and Llama-3.1, depending on the benchmark. Of course, these benchmark results will need to be verified by third parties. The company has open-sourced the model and weights, so we can expect testing to emerge soon.
Although DeepSeek-V3 may lag behind cutting-edge models such as GPT-4o or o3 in parameter count or reasoning capabilities, DeepSeek's achievement demonstrates that it is possible to train an advanced MoE language model with relatively limited resources. Of course, this requires a great deal of optimization and low-level programming, but the results appear to be surprisingly good.
The DeepSeek team acknowledges that deploying the DeepSeek-V3 model requires advanced hardware and a serving strategy that separates the prefill and decode stages, something smaller companies may not be able to afford for lack of resources.
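A disaggregated prefill/decode setup can be sketched at a high level as follows. The pool classes and their methods here are hypothetical stand-ins for illustration, not DeepSeek's serving stack:

```python
# Minimal sketch of disaggregated serving: one pool of workers handles prefill
# (processing the full prompt at once, compute-bound), another handles decode
# (one token at a time, memory-bandwidth-bound), and the KV cache is handed off.

class PrefillPool:
    def run_prefill(self, prompt):
        # Placeholder: a real pool would run the full prompt through the model
        # in one pass and return the attention KV cache plus the first token.
        kv_cache = {"prompt": prompt}
        return kv_cache, "<first-token>"

class DecodePool:
    def run_decode_step(self, kv_cache, last_token):
        # Placeholder: a real pool would run one incremental decoding step,
        # batched across many requests for latency-sensitive generation.
        return "<eos>"

def serve_request(prompt, prefill_pool, decode_pool, max_new_tokens=256):
    kv_cache, first_token = prefill_pool.run_prefill(prompt)   # stage 1: prefill
    tokens = [first_token]
    for _ in range(max_new_tokens - 1):                        # stage 2: decode
        next_token = decode_pool.run_decode_step(kv_cache, tokens[-1])
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens

print(serve_request("Hello", PrefillPool(), DecodePool()))
```

Because the two stages have very different hardware profiles, running them on separately sized worker pools is what makes the recommended deployment unit so large.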
“While acknowledging its strong performance and cost-effectiveness, we also recognize that DeepSeek-V3 has some limitations, especially in terms of deployment,” the company’s paper reads. “First, to ensure efficient inference, the recommended deployment unit for DeepSeek-V3 is relatively large, which may pose a burden for small teams. Second, although our deployment strategy for DeepSeek-V3 has achieved an end-to-end generation speed more than twice that of DeepSeek-V2, there is still room for further enhancement. Fortunately, these limitations are expected to be naturally addressed with the development of more advanced hardware.”