Name: DeepSeek-V3
Author: DeepSeek

DeepSeek-V3 is a Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated per token. It was released in December 2024. The code is MIT-licensed, and the model weights are covered by a separate DeepSeek Model License that permits commercial use. In DeepSeek's reported evaluations it outperforms other open-source models and is comparable to leading closed-source models (GPT-4o, Claude 3.5 Sonnet) on most benchmarks, with particularly strong results on math and code tasks. It uses Multi-head Latent Attention (MLA) and the DeepSeekMoE architecture for efficient inference. It was pretrained on 14.8T diverse tokens, and the full training run is reported to have taken 2.788M H800 GPU hours. It pioneers an auxiliary-loss-free load-balancing strategy and adds a multi-token prediction training objective.
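To put the headline numbers in proportion, here is a small back-of-the-envelope sketch in Python. It uses only the figures quoted above (671B total parameters, 37B activated, 14.8T tokens, 2.788M H800 GPU hours). The function names are illustrative and not part of any DeepSeek tooling. Dividing GPU hours by tokens gives only a rough average, not a measured pretraining throughput.

```python
# Figures quoted in the description above.
TOTAL_PARAMS_B = 671.0      # total parameters, in billions
ACTIVE_PARAMS_B = 37.0      # parameters activated per token, in billions
TRAIN_TOKENS_T = 14.8       # pretraining tokens, in trillions
GPU_HOURS = 2.788e6         # reported H800 GPU hours


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the model's parameters used for any single token."""
    return active_b / total_b


def gpu_hours_per_trillion_tokens(gpu_hours: float, tokens_t: float) -> float:
    """Rough average GPU hours spent per trillion training tokens."""
    return gpu_hours / tokens_t


if __name__ == "__main__":
    frac = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
    rate = gpu_hours_per_trillion_tokens(GPU_HOURS, TRAIN_TOKENS_T)
    print(f"Active parameters per token: {frac:.1%}")
    print(f"GPU hours per trillion tokens (rough average): {rate:,.0f}")
```

Roughly one parameter in eighteen is active for a given token. That sparsity is why the per-token compute is much closer to that of a 37B dense model than a 671B one.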

Strengths

Caveats

Capabilities

Resources