
DeepSeek (Architecture & Significance)

The Chinese AI lab that matched GPT-4 for 6% of the training cost.


Definition

DeepSeek is a Chinese AI research lab (founded 2023, backed by High-Flyer Capital) that released a series of models in 2024–2025 that shocked the AI industry. DeepSeek V3 (December 2024) matched or exceeded GPT-4o on most benchmarks while reportedly costing only $5.6 million to train, compared to estimates of $100M+ for GPT-4. DeepSeek R1 (January 2025) matched OpenAI's o1 on math and coding reasoning while releasing its weights openly. The releases triggered a global stock market selloff and forced every major AI lab to revisit its cost assumptions.

Multi-Head Latent Attention (MLA): DeepSeek's architectural innovation

Standard transformer attention stores one Key-Value (KV) cache entry per attention head per token — this grows linearly with context length and becomes the memory bottleneck for long-context inference. DeepSeek V3 uses Multi-Head Latent Attention (MLA), which compresses the KV cache into a low-dimensional latent vector, then reconstructs the per-head keys and values via learned up-projection matrices. This reduces KV cache memory by ~13.5x.

MLA: c_t^KV = W^DKV · h_t compresses the hidden state h_t into a low-rank latent, then k_t = W^UK · c_t^KV and v_t = W^UV · c_t^KV reconstruct the per-head keys and values. Only c_t^KV is cached, never the full K and V.
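The compression step above can be sketched in a few lines of NumPy. The dimensions and random projection matrices below are illustrative stand-ins only, not DeepSeek's actual weights (V3 itself uses a 7168-dim hidden state and a 512-dim KV latent):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (toy) dimensions, not DeepSeek's real sizes.
d_model, n_heads, d_head, d_latent = 256, 8, 32, 16

# Learned projections (random stand-ins here).
W_DKV = rng.standard_normal((d_model, d_latent)) * 0.02           # down-projection
W_UK = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02   # up-projection for K
W_UV = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02   # up-projection for V

h_t = rng.standard_normal((d_model,))  # hidden state for one token

# Cache only the low-rank latent...
c_kv = h_t @ W_DKV  # shape (d_latent,)

# ...and reconstruct per-head keys/values on the fly at attention time.
K = (c_kv @ W_UK).reshape(n_heads, d_head)
V = (c_kv @ W_UV).reshape(n_heads, d_head)

# Per-token cache: the latent vs. full K and V across all heads.
print(c_kv.size, 2 * n_heads * d_head)  # 16 vs 512 -> 32x smaller here
```

The key point: the up-projection matrices are shared model weights, so only the small latent needs to live in the per-token cache.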

| Model | Attention type | KV cache per token | Context memory |
|---|---|---|---|
| GPT-4 | Multi-head (MHA) | Full K/V per head | High |
| LLaMA 3 | Grouped-query (GQA) | Shared across head groups | Medium |
| DeepSeek V3 | Multi-head latent (MLA) | Compressed latent | Very low |
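To make the table concrete, here is back-of-the-envelope per-token cache arithmetic for a hypothetical 32-layer model with 32 heads of dimension 128 in fp16; the GQA group count and the 512-dim MLA latent (which in V3 also carries a small decoupled RoPE key, omitted here) are illustrative assumptions:

```python
# Hypothetical config: 32 layers, 32 heads, head dim 128, fp16 (2 bytes).
layers, heads, d_head, bytes_per = 32, 32, 128, 2

mha = layers * 2 * heads * d_head * bytes_per         # full K and V per head
gqa = layers * 2 * (heads // 4) * d_head * bytes_per  # K/V shared by 8 KV groups
mla = layers * 512 * bytes_per                        # one 512-dim latent per layer

for name, b in [("MHA", mha), ("GQA", gqa), ("MLA", mla)]:
    print(f"{name}: {b // 1024} KiB/token")
# MHA: 512 KiB/token, GQA: 128 KiB/token, MLA: 32 KiB/token
```

Over a 128K-token context those per-token differences compound into tens of gigabytes, which is why the cache, not compute, bounds long-context serving.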

Why $5.6M training cost matters — and the catch

The $5.6M figure refers to the final pre-training run on 2,048 H800 GPUs over 2 months. It excludes: research and experimentation costs, the cost of training previous model versions (V1, V2, Coder), infrastructure, salaries, and Nvidia chip purchases. Still, DeepSeek proved that algorithmic efficiency improvements (MLA, MoE routing, FP8 training) can dramatically compress training costs — a finding that has significant implications for AI lab economics globally.
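The headline number itself is simple arithmetic from the figures in DeepSeek's V3 technical report: roughly 2.788 million H800 GPU-hours, priced at an assumed rental rate of $2 per GPU-hour:

```python
gpu_hours = 2.788e6  # H800 GPU-hours reported for the final V3 pre-training run
rate = 2.0           # assumed rental price in $/GPU-hour (DeepSeek's own assumption)
gpus = 2048

cost = gpu_hours * rate
print(f"${cost / 1e6:.2f}M")                    # -> $5.58M
print(f"{gpu_hours / gpus / 24:.0f} days")      # wall-clock on 2,048 GPUs
```

Note that the figure is a rental-price estimate of one training run, not capital expenditure, which is exactly why the exclusions listed above matter.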

FP8 training

DeepSeek V3 used FP8 (8-bit floating point) mixed-precision training throughout — a technique most labs had avoided due to numerical instability. DeepSeek developed custom stability techniques (fine-grained per-tile scaling and higher-precision accumulation) that made FP8 training viable at scale, roughly halving memory use and arithmetic cost relative to BF16.
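A heavily simplified way to see what FP8 (E4M3: 4 exponent bits, 3 mantissa bits, max normal value 448) does to a tensor is to simulate the quantize-then-dequantize round trip in fp32. This sketch only models per-tile scaling and mantissa rounding; it ignores exponent clamping and is nothing like DeepSeek's actual GPU kernels:

```python
import numpy as np

def fake_fp8_e4m3(x, tile=128):
    """Simulate tile-wise FP8 (E4M3) quantize->dequantize in fp32.

    Rough sketch only: keeps 1+3 mantissa bits and scales each tile so
    its max maps to 448 (E4M3's largest normal value). Exponent-range
    clamping and DeepSeek's real mixed-precision recipe are omitted.
    """
    x = np.asarray(x, dtype=np.float32).copy()
    flat = x.reshape(-1)
    for i in range(0, flat.size, tile):
        blk = flat[i:i + tile]
        scale = max(np.abs(blk).max() / 448.0, 1e-12)  # per-tile scale factor
        y = blk / scale
        m, e = np.frexp(y)            # y = m * 2**e with 0.5 <= |m| < 1
        m = np.round(m * 16) / 16     # round mantissa to 1+3 bits
        flat[i:i + tile] = np.ldexp(m, e) * scale
    return x

x = np.random.default_rng(1).standard_normal(1000).astype(np.float32)
q = fake_fp8_e4m3(x)
print(np.max(np.abs(q - x) / np.abs(x)))  # relative error stays under ~6%
```

With only 3 mantissa bits the worst-case relative rounding error is about 2^-4 ≈ 6%, which is why naive FP8 training diverges and per-tile scaling plus careful accumulation are needed to keep gradients usable.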

