DeepSeek R1 Architecture: How Reinforcement Learning Creates Reasoning
Deep dive into DeepSeek R1's architecture: how pure RL training enables chain-of-thought reasoning, the GRPO algorithm, MoE design, and knowledge distillation.
DeepSeek R1 demonstrated something remarkable: you can teach a language model to reason through pure reinforcement learning, without any human-labeled reasoning examples. The model discovers chain-of-thought reasoning on its own, developing behaviors like self-verification and reflection that weren’t explicitly programmed.
This tutorial breaks down how DeepSeek R1 works—from its Mixture of Experts foundation to the GRPO algorithm that makes training efficient, to the distillation process that transfers reasoning to smaller models.
The Key Innovation: Reasoning from RL Alone
Most reasoning models (like OpenAI’s o1) use supervised fine-tuning on human-labeled chain-of-thought examples. DeepSeek took a different approach with R1-Zero: apply reinforcement learning directly to the base model with no supervised fine-tuning.
The result? The model spontaneously developed reasoning capabilities:
Query: "What is 15% of 80? Then add 25% tax."
<think>
Let me work through this step by step.
First, I need to find 15% of 80.
15% = 0.15
0.15 × 80 = 12
Now I need to add 25% tax to 12.
25% of 12 = 0.25 × 12 = 3
12 + 3 = 15
Wait, let me verify this is correct...
15% of 80 = 12 ✓
25% of 12 = 3 ✓
12 + 3 = 15 ✓
</think>
The answer is 15.
Notice the self-verification at the end—the model learned to check its own work. This wasn’t programmed; it emerged from RL training.
Base Architecture: DeepSeek-V3
DeepSeek R1 is built on the DeepSeek-V3 base model, which introduces two key architectural innovations:
Mixture of Experts (MoE)
| Specification | Value |
|---|---|
| Total Parameters | 671 billion |
| Activated Parameters | 37 billion per token |
| Expert Configuration | 1 shared + 256 routed experts per layer (top 8 routed activated per token) |
| Context Length | 128K tokens |
The MoE architecture is why R1 can have 671B total parameters but only activate 37B per forward pass—making inference dramatically more efficient than a dense model of the same size.
Token → Router → Select 8 experts → Compute → Combine outputs
                        ↓
                 1 shared expert (always active)
               + 8 routed experts (selected per token)
DeepSeekMoE Innovation: Unlike standard MoE which uses fewer, larger experts, DeepSeek uses many fine-grained experts. This decomposition allows more specialized knowledge storage.
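To make the routing concrete, here is a toy sketch of top-k routing with an always-active shared expert. The dimensions, expert count, and random "experts" are purely illustrative stand-ins, not DeepSeek's implementation:
# Toy top-k MoE routing with a shared expert (illustrative sizes only)
import numpy as np

rng = np.random.default_rng(0)
d_model, num_routed, top_k = 16, 64, 8   # toy values; DeepSeek-V3 uses far more routed experts

# Each "expert" is just a random linear map here, purely for illustration
shared_expert = rng.normal(size=(d_model, d_model))
routed_experts = rng.normal(size=(num_routed, d_model, d_model))
router = rng.normal(size=(d_model, num_routed))   # maps a token to one score per expert

def moe_forward(token):
    scores = token @ router                               # router logits, shape (num_routed,)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    top = np.argsort(probs)[-top_k:]                      # indices of the k highest-scoring experts
    gates = probs[top] / probs[top].sum()                 # renormalized gate weights over the chosen experts
    routed_out = sum(g * (token @ routed_experts[i]) for g, i in zip(gates, top))
    return token @ shared_expert + routed_out             # shared expert always contributes

out = moe_forward(rng.normal(size=d_model))
print(out.shape)   # (16,) -- only top_k of the num_routed experts did any work
Only the selected experts' weights participate in the matrix multiplications, which is where the 671B-total / 37B-active split comes from.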
Multi-Head Latent Attention (MLA)
Standard attention caches full key-value tensors for each head, consuming massive memory during inference. MLA compresses K and V into a latent space:
Standard Attention:
KV Cache Size: O(batch × seq_len × num_heads × head_dim)
Multi-Head Latent Attention:
1. Compress K, V → Latent vector (much smaller)
2. Cache compressed vectors
3. Decompress on-the-fly during inference
KV Cache Size: Reduced to 5-13% of standard
This is why DeepSeek R1 can handle 128K context on reasonable hardware—the KV cache doesn’t explode.
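A back-of-envelope comparison shows why this matters at 128K context. The head counts and latent size below are hypothetical stand-ins, chosen only to land in the same ballpark as the ratio quoted above:
# Back-of-envelope KV-cache sizes in fp16 (2 bytes per value); all numbers illustrative
batch, seq_len = 1, 128_000
num_heads, head_dim = 64, 128     # hypothetical attention shape
latent_dim = 1024                 # hypothetical compressed KV latent per token
bytes_per_value = 2

# Standard MHA caches full K and V for every head at every position
standard_bytes = batch * seq_len * num_heads * head_dim * 2 * bytes_per_value

# MLA caches one compressed latent per position and decompresses on the fly
latent_bytes = batch * seq_len * latent_dim * bytes_per_value

print(f"standard KV cache: {standard_bytes / 1e9:.1f} GB per layer")
print(f"latent  KV cache:  {latent_bytes / 1e9:.2f} GB per layer")
print(f"ratio: {latent_bytes / standard_bytes:.1%}")   # ~6% with these toy numbers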
Training Pipeline: From Base Model to Reasoner
DeepSeek R1’s training follows a multi-stage pipeline:
Stage 1: Cold Start (SFT)
The DeepSeek-V3 base model is fine-tuned on thousands of chain-of-thought examples to establish basic reasoning patterns:
# Conceptual example of cold start data
examples = [
    {
        "input": "Solve: 2x + 5 = 13",
        "output": """<think>
I need to solve for x.
2x + 5 = 13
2x = 13 - 5
2x = 8
x = 4
</think>
x = 4"""
    },
    # ... thousands more examples
]
This seeding gives the model the “format” of reasoning but not the capability itself.
Stage 2: Reasoning-Oriented RL
This is where the magic happens. Large-scale reinforcement learning trains the model to actually reason, using rule-based rewards:
Reward Function Design:
| Reward Type | What It Measures | Implementation |
|---|---|---|
| Accuracy | Is the answer correct? | Rule-based verification |
| Format | Does output use <think> tags? | Pattern matching |
def compute_reward(response, ground_truth):
    reward = 0.0
    # Accuracy reward
    if extract_answer(response) == ground_truth:
        reward += 1.0
    # Format reward
    if "<think>" in response and "</think>" in response:
        reward += 0.2
    return reward
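The compute_reward sketch above leans on an extract_answer helper that the paper doesn't spell out. A minimal hypothetical version might read the text after the closing </think> tag and prefer a \boxed{} answer when present:
import re

def extract_answer(response):
    """Hypothetical extractor: prefer a \\boxed{...} answer, else take the last number."""
    answer_region = response.split("</think>")[-1]            # text after the reasoning block
    boxed = re.findall(r"\\boxed\{([^}]*)\}", answer_region)
    if boxed:
        return boxed[-1].strip()
    numbers = re.findall(r"-?\d+(?:\.\d+)?", answer_region)
    return numbers[-1] if numbers else None
Accuracy rewards like this only work where answers can be checked deterministically, which is why the reasoning-RL stage focuses on math, code, and similar verifiable tasks.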
Stage 3: Rejection Sampling + SFT
High-quality reasoning outputs from Stage 2 are collected and used for supervised fine-tuning, which extends reasoning capabilities to broader domains (a conceptual sketch follows the list):
- Generate many responses per prompt
- Keep only those with correct answers and clean formatting
- Fine-tune on this curated dataset
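A minimal sketch of that curation loop, reusing the hypothetical extract_answer helper from earlier (names, filters, and sample counts are illustrative):
# Hypothetical rejection-sampling curation loop for Stage 3
def build_sft_dataset(model, prompts, ground_truths, samples_per_prompt=16):
    dataset = []
    for prompt, truth in zip(prompts, ground_truths):
        candidates = [model.generate(prompt) for _ in range(samples_per_prompt)]
        # Keep only responses that are correct *and* cleanly formatted
        kept = [
            c for c in candidates
            if extract_answer(c) == truth and "<think>" in c and "</think>" in c
        ]
        dataset.extend((prompt, c) for c in kept)
    return dataset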
Stage 4: Human Preference Alignment
Final RL stage aligns the model with human preferences for helpfulness, safety, and readability.
GRPO: The Training Algorithm
DeepSeek uses Group Relative Policy Optimization (GRPO) instead of standard PPO. This is crucial for making RL training practical at scale.
Why Not PPO?
Standard PPO requires a critic network (value function) that’s typically as large as the policy model:
PPO Training Cost:
- Policy model: 671B parameters
- Critic model: ~671B parameters
- Total: ~1.3 trillion parameters to train
Memory: 2x the model size
Compute: 2x forward passes per step
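A quick back-of-envelope on weights alone (bf16, 2 bytes per parameter; gradients, optimizer states, and activations multiply these numbers several times over) shows why dropping the critic matters:
# Weights-only memory in bf16; illustrative, ignores optimizer states and activations
params_policy = 671e9
params_critic = 671e9                 # PPO critic sized like the policy

ppo_gb = (params_policy + params_critic) * 2 / 1e9
grpo_gb = params_policy * 2 / 1e9
print(f"PPO  (policy + critic): {ppo_gb:,.0f} GB")    # ~2,684 GB
print(f"GRPO (policy only):     {grpo_gb:,.0f} GB")   # ~1,342 GB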
How GRPO Works
GRPO eliminates the critic by using group relative advantages:
import numpy as np

def grpo_step(model, prompt, ground_truth, num_samples=64):
    # 1. Sample multiple responses from the current policy
    responses = [model.generate(prompt) for _ in range(num_samples)]

    # 2. Compute a reward for each response
    rewards = [compute_reward(r, ground_truth) for r in responses]

    # 3. Normalize rewards within the group (this replaces the critic!)
    mean_reward = np.mean(rewards)
    std_reward = np.std(rewards)
    advantages = [(r - mean_reward) / (std_reward + 1e-8) for r in rewards]

    # 4. Update the policy: positive advantages push a response's probability up,
    #    negative advantages push it down
    for response, advantage in zip(responses, advantages):
        update_policy(model, prompt, response, weight=advantage)
Key insight: By comparing responses within a group, GRPO estimates which responses are “better than average” without needing a learned value function.
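The sketch above is conceptual. In the paper, GRPO optimizes a PPO-style clipped surrogate in which the group-normalized advantage stands in for the critic's value estimate, plus a KL penalty against a reference policy. A simplified, sequence-level version (KL term omitted; the log-probabilities and rewards are hypothetical inputs):
import numpy as np

def grpo_surrogate_loss(logprobs_new, logprobs_old, rewards, clip_eps=0.2):
    """Simplified sequence-level GRPO surrogate; the KL penalty term is omitted."""
    rewards = np.asarray(rewards, dtype=float)
    # Group-relative advantage: normalize rewards within the sampled group
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # PPO-style clipped importance ratio between new and old policy
    ratio = np.exp(np.asarray(logprobs_new) - np.asarray(logprobs_old))
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return -np.mean(np.minimum(unclipped, clipped))   # minimize the negative objective

# Toy usage: 4 sampled responses for one prompt
loss = grpo_surrogate_loss(
    logprobs_new=[-12.1, -15.3, -11.8, -14.0],
    logprobs_old=[-12.0, -15.0, -12.0, -14.2],
    rewards=[1.2, 0.0, 1.0, 0.2],
)
print(round(float(loss), 4))
The real objective averages over tokens and adds the KL regularizer, but the key mechanics, group normalization and clipping, are the same.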
GRPO Efficiency
| Metric | PPO | GRPO |
|---|---|---|
| Critic Model Required | Yes (as large as policy) | No |
| Memory Overhead | ~2x | ~1x |
| Compute per Step | 2 forward passes | 1 forward pass |
Emergent Reasoning Behaviors
During Stage 2 (reasoning RL), DeepSeek R1 spontaneously developed several reasoning patterns:
1. Self-Verification
The model learns to check its own work:
<think>
Let me calculate the compound interest.
Principal: $1000
Rate: 5% per year
Time: 3 years

A = P(1 + r)^t
A = 1000(1 + 0.05)^3
A = 1000(1.05)^3
A = 1000 × 1.157625
A = 1157.63

Let me verify:
Year 1: 1000 × 1.05 = 1050
Year 2: 1050 × 1.05 = 1102.50
Year 3: 1102.50 × 1.05 = 1157.625 ✓

The calculation is correct.
</think>
The final amount is $1,157.63
2. Reflection and Backtracking
When the model detects an error, it backtracks:
<think>
Hmm, that doesn't seem right. Let me reconsider...
Actually, I made an error in step 3. The formula should be...
</think>
3. Dynamic Strategy Adaptation
The model tries different approaches when one doesn’t work:
<think>
I tried factoring but it didn't work cleanly.
Let me try the quadratic formula instead...
</think>
These behaviors weren’t in the training data—they emerged because RL rewards correct answers, and these strategies help achieve correctness.
Knowledge Distillation
DeepSeek released distilled versions that transfer R1’s reasoning to smaller models:
| Model | Base | Parameters | AIME 2024 | MATH-500 |
|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | Qwen 2.5 | 1.5B | 28.9% | 83.9% |
| DeepSeek-R1-Distill-Qwen-7B | Qwen 2.5 | 7B | 55.5% | 92.8% |
| DeepSeek-R1-Distill-Qwen-14B | Qwen 2.5 | 14B | 69.7% | 93.9% |
| DeepSeek-R1-Distill-Qwen-32B | Qwen 2.5 | 32B | 72.6% | 94.3% |
| DeepSeek-R1-Distill-Llama-8B | Llama 3.1 | 8B | 50.4% | 89.1% |
| DeepSeek-R1-Distill-Llama-70B | Llama 3.3 | 70B | 70.0% | 94.5% |
| DeepSeek-R1 (full) | V3 | 671B | 79.8% | 97.3% |
AIME 2024 = American Invitational Mathematics Examination. MATH-500 = subset of MATH benchmark with 500 problems. Source: DeepSeek R1 paper.
How Distillation Works
- Generate training data: Sample 800,000 reasoning examples from DeepSeek R1
- Supervised fine-tuning: Train smaller models on this data
- No additional RL: Unlike R1 training, distilled models don’t go through RL
# Conceptual distillation process
teacher_model = load_model("DeepSeek-R1")
student_model = load_model("Qwen-2.5-32B")

# Generate reasoning examples from the teacher
training_data = []
for prompt in diverse_prompts:
    response = teacher_model.generate(prompt)
    if is_high_quality(response):
        training_data.append((prompt, response))

# Fine-tune the student on the teacher's outputs
student_model.finetune(training_data)
Distilled Model Performance
The 32B distilled model notably outperforms OpenAI o1-mini on several benchmarks:
| Benchmark | R1-Distill-Qwen-32B | o1-mini |
|---|---|---|
| AIME 2024 | 72.6% | 63.6% |
| MATH-500 | 94.3% | 90.0% |
| LiveCodeBench | 57.2% | 53.8% |
Source: DeepSeek R1 paper. o1-mini scores from OpenAI’s technical announcements (December 2024).
This is remarkable: a 32B open-source model outperforming a proprietary reasoning model.
Implementation Details
Inference Configuration
For optimal results with DeepSeek R1:
# Recommended settings from DeepSeek
config = {
    "temperature": 0.6,     # Range: 0.5-0.7, prevents repetition
    "top_p": 0.95,
    "max_tokens": 32768,    # Allow long reasoning chains
}

# Force reasoning by starting with <think>
prompt = "Your question here"
response = model.generate(
    prompt,
    prefix="<think>\n"      # Forces model to reason
)
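If you serve R1 or a distilled checkpoint behind an OpenAI-compatible endpoint (vLLM, Ollama, and most hosted providers expose one), the same settings map onto a standard chat-completions call. The base_url, API key, and model name below are placeholders:
# Sketch: applying the recommended sampling settings via an OpenAI-compatible client.
# base_url, api_key, and the model name are placeholders -- substitute your provider's values.
from openai import OpenAI

client = OpenAI(base_url="https://your-endpoint/v1", api_key="YOUR_KEY")

response = client.chat.completions.create(
    model="deepseek-r1-distill-qwen-32b",   # whatever name your server exposes
    messages=[{"role": "user", "content": "Your question here"}],
    temperature=0.6,
    top_p=0.95,
    max_tokens=32768,
)
print(response.choices[0].message.content)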
Prompting for Math Problems
prompt = """
Solve the following problem. Please reason step by step,
and put your final answer within \\boxed{}.
Problem: Find all real solutions to x^3 - 6x^2 + 11x - 6 = 0
"""
The \boxed{} format helps the model structure its output and makes answer extraction easier.
Architecture Comparison (January 2025)
How did DeepSeek R1 compare to other reasoning models at its release?
| Aspect | DeepSeek R1 | OpenAI o1 | Claude 3.5 Sonnet |
|---|---|---|---|
| Training Approach | RL + SFT | Unknown (likely RL) | SFT + RLHF (no dedicated reasoning RL) |
| Reasoning Visible | Yes (<think> tags) | No (hidden) | No |
| Open Source | Yes (MIT) | No | No |
| Open Weights | Yes | No | No |
| Base Architecture | MoE (671B/37B) | Unknown | Dense |
| Local Deployment | Yes (distilled) | No | No |
Practical Implications
When to Use DeepSeek R1
Good use cases:
- Complex mathematical reasoning
- Multi-step problem solving
- Code debugging and review
- Logical deduction tasks
- When you need to understand the model’s reasoning
Consider alternatives for:
- Creative writing (Claude excels here)
- Simple Q&A (overkill, use faster models)
- Tasks requiring current information (no web access)
Cost Efficiency (January 2025)
The MoE architecture made R1 surprisingly efficient at launch:
| Model | Total Params | Active Params | API Cost (per 1M input tokens) |
|---|---|---|---|
| DeepSeek R1 | 671B | 37B | ~$0.55 |
| GPT-4 Turbo | Undisclosed | Dense | ~$10 |
| Claude 3.5 Sonnet | Undisclosed | Dense | ~$3 |
R1 launched at roughly one-fifth to one-twentieth of the per-token API price of contemporaneous proprietary models while delivering competitive reasoning performance.
What’s Next
You’ve learned how DeepSeek R1 achieves reasoning through:
- Pure RL training (R1-Zero) proving reasoning can emerge without human examples
- GRPO algorithm making large-scale RL training practical
- MoE + MLA architecture enabling efficient inference
- Distillation transferring reasoning to smaller, deployable models
Key takeaways:
- Reasoning capabilities can be incentivized through RL alone
- Rule-based rewards (no neural reward model) reduce the risk of reward hacking
- GRPO eliminates the critic network, halving memory and compute requirements
- Distillation effectively transfers reasoning to smaller models
Further Reading:
- DeepSeek V3.2 Architecture - Latest DeepSeek with DSA and integrated reasoning
- DeepSeek R1 Getting Started - Run R1 on Bedrock or locally
- DeepSeek R1 Paper - Full technical details
- DeepSeek V3 Technical Report - Base architecture details
- GRPO Deep Dive - Detailed algorithm explanation