DeepSeek R1 Architecture: How Reinforcement Learning Creates Reasoning
Deep dive into DeepSeek R1's architecture: how pure RL training enables chain-of-thought reasoning, the GRPO algorithm, MoE design, and knowledge distillation.
DeepSeek R1 demonstrated something remarkable: you can teach a language model to reason through pure reinforcement learning, without any human-labeled reasoning examples. The model discovers chain-of-thought reasoning on its own, developing behaviors like self-verification and reflection that weren’t explicitly programmed.
This tutorial breaks down how DeepSeek R1 works—from its Mixture of Experts foundation to the GRPO algorithm that makes training efficient, to the distillation process that transfers reasoning to smaller models.
The Key Innovation: Reasoning from RL Alone
Most reasoning models (like OpenAI’s o1) use supervised fine-tuning on human-labeled chain-of-thought examples. DeepSeek took a different approach with R1-Zero: apply reinforcement learning directly to the base model with no supervised fine-tuning.
The result? The model spontaneously developed reasoning capabilities:
Query: "What is 15% of 80? Then add 25% tax."
<think>
Let me work through this step by step.
First, I need to find 15% of 80.
15% = 0.15
0.15 × 80 = 12
Now I need to add 25% tax to 12.
25% of 12 = 0.25 × 12 = 3
12 + 3 = 15
Wait, let me verify this is correct...
15% of 80 = 12 ✓
25% of 12 = 3 ✓
12 + 3 = 15 ✓
</think>
The answer is 15.
Notice the self-verification at the end—the model learned to check its own work. This wasn’t programmed; it emerged from RL training.
Base Architecture: DeepSeek-V3
DeepSeek R1 is built on the DeepSeek-V3 base model, which introduces two key architectural innovations:
Mixture of Experts (MoE)
| Specification | Value |
|---|---|
| Total Parameters | 671 billion |
| Activated Parameters | 37 billion per token |
| Expert Configuration | 1 shared + 256 routed experts per layer (top 8 routed activated per token) |
| Context Length | 128K tokens |
The MoE architecture is why R1 can have 671B total parameters but only activate 37B per forward pass—making inference dramatically more efficient than a dense model of the same size.
Token → Router → Select 8 experts → Compute → Combine outputs
                        ↓
                 1 shared expert (always active)
               + 8 routed experts (selected per token)
DeepSeekMoE Innovation: Unlike standard MoE which uses fewer, larger experts, DeepSeek uses many fine-grained experts. This decomposition allows more specialized knowledge storage.
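To make the routing concrete, here is a toy sketch of top-k routing with an always-active shared expert. The dimensions, expert count, and random "experts" are purely illustrative stand-ins, not DeepSeek's implementation:
# Toy top-k MoE routing with a shared expert (illustrative sizes only)
import numpy as np

rng = np.random.default_rng(0)
d_model, num_routed, top_k = 16, 64, 8   # toy values; DeepSeek-V3 uses far more routed experts

# Each "expert" is just a random linear map here, purely for illustration
shared_expert = rng.normal(size=(d_model, d_model))
routed_experts = rng.normal(size=(num_routed, d_model, d_model))
router = rng.normal(size=(d_model, num_routed))   # maps a token to one score per expert

def moe_forward(token):
    scores = token @ router                               # router logits, shape (num_routed,)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    top = np.argsort(probs)[-top_k:]                      # indices of the k highest-scoring experts
    gates = probs[top] / probs[top].sum()                 # renormalized gate weights over the chosen experts
    routed_out = sum(g * (token @ routed_experts[i]) for g, i in zip(gates, top))
    return token @ shared_expert + routed_out             # shared expert always contributes

out = moe_forward(rng.normal(size=d_model))
print(out.shape)   # (16,) -- only top_k of the num_routed experts did any work
Only the selected experts' weights participate in the matrix multiplications, which is where the 671B-total / 37B-active split comes from.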
Multi-Head Latent Attention (MLA)
Standard attention caches full key-value tensors for each head, consuming massive memory during inference. MLA compresses K and V into a latent space:
Standard Attention:
KV Cache Size: O(batch × seq_len × num_heads × head_dim)
Multi-Head Latent Attention:
1. Compress K, V → Latent vector (much smaller)
2. Cache compressed vectors
3. Decompress on-the-fly during inference
KV Cache Size: Reduced to 5-13% of standard
This is why DeepSeek R1 can handle 128K context on reasonable hardware—the KV cache doesn’t explode.
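A back-of-envelope comparison shows why this matters at 128K context. The head counts and latent size below are hypothetical stand-ins, chosen only to land in the same ballpark as the ratio quoted above:
# Back-of-envelope KV-cache sizes in fp16 (2 bytes per value); all numbers illustrative
batch, seq_len = 1, 128_000
num_heads, head_dim = 64, 128     # hypothetical attention shape
latent_dim = 1024                 # hypothetical compressed KV latent per token
bytes_per_value = 2

# Standard MHA caches full K and V for every head at every position
standard_bytes = batch * seq_len * num_heads * head_dim * 2 * bytes_per_value

# MLA caches one compressed latent per position and decompresses on the fly
latent_bytes = batch * seq_len * latent_dim * bytes_per_value

print(f"standard KV cache: {standard_bytes / 1e9:.1f} GB per layer")
print(f"latent  KV cache:  {latent_bytes / 1e9:.2f} GB per layer")
print(f"ratio: {latent_bytes / standard_bytes:.1%}")   # ~6% with these toy numbers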
Training Pipeline: From Base Model to Reasoner
DeepSeek R1’s training follows a multi-stage pipeline:
Stage 1: Cold Start (SFT)
The DeepSeek-V3 base model is fine-tuned on thousands of chain-of-thought examples to establish basic reasoning patterns:
# Conceptual example of cold start data
examples = [
    {
        "input": "Solve: 2x + 5 = 13",
        "output": """<think>
I need to solve for x.
2x + 5 = 13
2x = 13 - 5
2x = 8
x = 4
</think>
x = 4"""
    },
    # ... thousands more examples
]
This seeding gives the model the “format” of reasoning but not the capability itself.
Stage 2: Reasoning-Oriented RL
This is where the magic happens. Large-scale reinforcement learning trains the model to actually reason, using rule-based rewards:
Reward Function Design:
| Reward Type | What It Measures | Implementation |
|---|---|---|
| Accuracy | Is the answer correct? | Rule-based verification |
| Format | Does output use <think> tags? | Pattern matching |
def compute_reward(response, ground_truth):
    reward = 0.0
    # Accuracy reward
    if extract_answer(response) == ground_truth:
        reward += 1.0
    # Format reward
    if "<think>" in response and "</think>" in response:
        reward += 0.2
    return reward
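The compute_reward sketch above leans on an extract_answer helper that the paper doesn't spell out. A minimal hypothetical version might read the text after the closing </think> tag and prefer a \boxed{} answer when present:
import re

def extract_answer(response):
    """Hypothetical extractor: prefer a \\boxed{...} answer, else take the last number."""
    answer_region = response.split("</think>")[-1]            # text after the reasoning block
    boxed = re.findall(r"\\boxed\{([^}]*)\}", answer_region)
    if boxed:
        return boxed[-1].strip()
    numbers = re.findall(r"-?\d+(?:\.\d+)?", answer_region)
    return numbers[-1] if numbers else None
Accuracy rewards like this only work where answers can be checked deterministically, which is why the reasoning-RL stage focuses on math, code, and similar verifiable tasks.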
Stage 3: Rejection Sampling + SFT
High-quality reasoning outputs from Stage 2 are collected and used for supervised fine-tuning, which extends reasoning capabilities to broader domains (a conceptual sketch follows the list):
- Generate many responses per prompt
- Keep only those with correct answers and clean formatting
- Fine-tune on this curated dataset
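A minimal sketch of that curation loop, reusing the hypothetical extract_answer helper from earlier (names, filters, and sample counts are illustrative):
# Hypothetical rejection-sampling curation loop for Stage 3
def build_sft_dataset(model, prompts, ground_truths, samples_per_prompt=16):
    dataset = []
    for prompt, truth in zip(prompts, ground_truths):
        candidates = [model.generate(prompt) for _ in range(samples_per_prompt)]
        # Keep only responses that are correct *and* cleanly formatted
        kept = [
            c for c in candidates
            if extract_answer(c) == truth and "<think>" in c and "</think>" in c
        ]
        dataset.extend((prompt, c) for c in kept)
    return dataset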
Stage 4: Human Preference Alignment
Final RL stage aligns the model with human preferences for helpfulness, safety, and readability.
GRPO: The Training Algorithm
DeepSeek uses Group Relative Policy Optimization (GRPO) instead of standard PPO. This is crucial for making RL training practical at scale.
Why Not PPO?
Standard PPO requires a critic network (value function) that’s typically as large as the policy model:
PPO Training Cost:
- Policy model: 671B parameters
- Critic model: ~671B parameters
- Total: ~1.3 trillion parameters to train
Memory: 2x the model size
Compute: 2x forward passes per step
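A quick back-of-envelope on weights alone (bf16, 2 bytes per parameter; gradients, optimizer states, and activations multiply these numbers several times over) shows why dropping the critic matters:
# Weights-only memory in bf16; illustrative, ignores optimizer states and activations
params_policy = 671e9
params_critic = 671e9                 # PPO critic sized like the policy

ppo_gb = (params_policy + params_critic) * 2 / 1e9
grpo_gb = params_policy * 2 / 1e9
print(f"PPO  (policy + critic): {ppo_gb:,.0f} GB")    # ~2,684 GB
print(f"GRPO (policy only):     {grpo_gb:,.0f} GB")   # ~1,342 GB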
How GRPO Works
GRPO eliminates the critic by using group relative advantages:
import numpy as np

def grpo_step(model, prompt, ground_truth, num_samples=64):
    # 1. Sample multiple responses from the current policy
    responses = [model.generate(prompt) for _ in range(num_samples)]

    # 2. Compute a reward for each response
    rewards = [compute_reward(r, ground_truth) for r in responses]

    # 3. Normalize rewards within the group (this replaces the critic!)
    mean_reward = np.mean(rewards)
    std_reward = np.std(rewards)
    advantages = [(r - mean_reward) / (std_reward + 1e-8) for r in rewards]

    # 4. Update the policy: positive advantages push a response's probability up,
    #    negative advantages push it down
    for response, advantage in zip(responses, advantages):
        update_policy(model, prompt, response, weight=advantage)
Key insight: By comparing responses within a group, GRPO estimates which responses are “better than average” without needing a learned value function.
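The sketch above is conceptual. In the paper, GRPO optimizes a PPO-style clipped surrogate in which the group-normalized advantage stands in for the critic's value estimate, plus a KL penalty against a reference policy. A simplified, sequence-level version (KL term omitted; the log-probabilities and rewards are hypothetical inputs):
import numpy as np

def grpo_surrogate_loss(logprobs_new, logprobs_old, rewards, clip_eps=0.2):
    """Simplified sequence-level GRPO surrogate; the KL penalty term is omitted."""
    rewards = np.asarray(rewards, dtype=float)
    # Group-relative advantage: normalize rewards within the sampled group
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # PPO-style clipped importance ratio between new and old policy
    ratio = np.exp(np.asarray(logprobs_new) - np.asarray(logprobs_old))
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return -np.mean(np.minimum(unclipped, clipped))   # minimize the negative objective

# Toy usage: 4 sampled responses for one prompt
loss = grpo_surrogate_loss(
    logprobs_new=[-12.1, -15.3, -11.8, -14.0],
    logprobs_old=[-12.0, -15.0, -12.0, -14.2],
    rewards=[1.2, 0.0, 1.0, 0.2],
)
print(round(float(loss), 4))
The real objective averages over tokens and adds the KL regularizer, but the key mechanics, group normalization and clipping, are the same.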
GRPO Efficiency
| Metric | PPO | GRPO |
|---|---|---|
| Critic Model Required | Yes (as large as policy) | No |
| Memory Overhead | ~2x | ~1x |
| Compute per Step | 2 forward passes | 1 forward pass |
Emergent Reasoning Behaviors
During Stage 2 (reasoning RL), DeepSeek R1 spontaneously developed several reasoning patterns:
1. Self-Verification
The model learns to check its own work:
<think>
Let me calculate the compound interest.
Principal: $1000
Rate: 5% per year
Time: 3 years

A = P(1 + r)^t
A = 1000(1 + 0.05)^3
A = 1000(1.05)^3
A = 1000 × 1.157625
A = 1157.63

Let me verify:
Year 1: 1000 × 1.05 = 1050
Year 2: 1050 × 1.05 = 1102.50
Year 3: 1102.50 × 1.05 = 1157.625 ✓

The calculation is correct.
</think>
The final amount is $1,157.63
2. Reflection and Backtracking
When the model detects an error, it backtracks:
<think>
Hmm, that doesn't seem right. Let me reconsider...
Actually, I made an error in step 3. The formula should be...
</think>
3. Dynamic Strategy Adaptation
The model tries different approaches when one doesn’t work:
<think>
I tried factoring but it didn't work cleanly.
Let me try the quadratic formula instead...
</think>
These behaviors weren’t in the training data—they emerged because RL rewards correct answers, and these strategies help achieve correctness.
Knowledge Distillation
DeepSeek released distilled versions that transfer R1’s reasoning to smaller models:
| Model | Base | Parameters | AIME 2024 | MATH-500 |
|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | Qwen 2.5 | 1.5B | 28.9% | 83.9% |
| DeepSeek-R1-Distill-Qwen-7B | Qwen 2.5 | 7B | 55.5% | 92.8% |
| DeepSeek-R1-Distill-Qwen-14B | Qwen 2.5 | 14B | 69.7% | 93.9% |
| DeepSeek-R1-Distill-Qwen-32B | Qwen 2.5 | 32B | 72.6% | 94.3% |
| DeepSeek-R1-Distill-Llama-8B | Llama 3.1 | 8B | 50.4% | 89.1% |
| DeepSeek-R1-Distill-Llama-70B | Llama 3.3 | 70B | 70.0% | 94.5% |
| DeepSeek-R1 (full) | V3 | 671B | 79.8% | 97.3% |
AIME 2024 = American Invitational Mathematics Examination. MATH-500 = subset of MATH benchmark with 500 problems. Source: DeepSeek R1 paper.
How Distillation Works
- Generate training data: Sample 800,000 reasoning examples from DeepSeek R1
- Supervised fine-tuning: Train smaller models on this data
- No additional RL: Unlike R1 training, distilled models don’t go through RL
# Conceptual distillation process
teacher_model = load_model("DeepSeek-R1")
student_model = load_model("Qwen-2.5-32B")

# Generate reasoning examples from the teacher
training_data = []
for prompt in diverse_prompts:
    response = teacher_model.generate(prompt)
    if is_high_quality(response):
        training_data.append((prompt, response))

# Fine-tune the student on the teacher's outputs
student_model.finetune(training_data)
Distilled Model Performance
The 32B distilled model notably outperforms OpenAI o1-mini on several benchmarks:
| Benchmark | R1-Distill-Qwen-32B | o1-mini |
|---|---|---|
| AIME 2024 | 72.6% | 63.6% |
| MATH-500 | 94.3% | 90.0% |
| LiveCodeBench | 57.2% | 53.8% |
Source: DeepSeek R1 paper. o1-mini scores from OpenAI’s technical announcements (December 2024).
This is remarkable: a 32B open-source model outperforming a proprietary reasoning model.
Implementation Details
Inference Configuration
For optimal results with DeepSeek R1:
# Recommended settings from DeepSeek
config = {
    "temperature": 0.6,     # Range: 0.5-0.7, prevents repetition
    "top_p": 0.95,
    "max_tokens": 32768,    # Allow long reasoning chains
}

# Force reasoning by starting with <think>
prompt = "Your question here"
response = model.generate(
    prompt,
    prefix="<think>\n"      # Forces model to reason
)
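If you serve R1 or a distilled checkpoint behind an OpenAI-compatible endpoint (vLLM, Ollama, and most hosted providers expose one), the same settings map onto a standard chat-completions call. The base_url, API key, and model name below are placeholders:
# Sketch: applying the recommended sampling settings via an OpenAI-compatible client.
# base_url, api_key, and the model name are placeholders -- substitute your provider's values.
from openai import OpenAI

client = OpenAI(base_url="https://your-endpoint/v1", api_key="YOUR_KEY")

response = client.chat.completions.create(
    model="deepseek-r1-distill-qwen-32b",   # whatever name your server exposes
    messages=[{"role": "user", "content": "Your question here"}],
    temperature=0.6,
    top_p=0.95,
    max_tokens=32768,
)
print(response.choices[0].message.content)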
Prompting for Math Problems
prompt = """
Solve the following problem. Please reason step by step,
and put your final answer within \\boxed{}.
Problem: Find all real solutions to x^3 - 6x^2 + 11x - 6 = 0
"""
The \boxed{} format helps the model structure its output and makes answer extraction easier.
Architecture Comparison (January 2025)
How did DeepSeek R1 compare to other reasoning models at its release?
| Aspect | DeepSeek R1 | OpenAI o1 | Claude 3.5 Sonnet |
|---|---|---|---|
| Training Approach | RL + SFT | Unknown (likely RL) | SFT + RLHF (no dedicated reasoning RL) |
| Reasoning Visible | Yes (<think> tags) | No (hidden) | No |
| Open Source | Yes (MIT) | No | No |
| Open Weights | Yes | No | No |
| Base Architecture | MoE (671B/37B) | Unknown | Dense |
| Local Deployment | Yes (distilled) | No | No |
Practical Implications
When to Use DeepSeek R1
Good use cases:
- Complex mathematical reasoning
- Multi-step problem solving
- Code debugging and review
- Logical deduction tasks
- When you need to understand the model’s reasoning
Consider alternatives for:
- Creative writing (Claude excels here)
- Simple Q&A (overkill, use faster models)
- Tasks requiring current information (no web access)
Cost Efficiency (January 2025)
The MoE architecture made R1 surprisingly efficient at launch:
| Model | Total Params | Active Params | API Cost (per 1M input tokens) |
|---|---|---|---|
| DeepSeek R1 | 671B | 37B | ~$0.55 |
| GPT-4 Turbo | Undisclosed | Dense | ~$10 |
| Claude 3.5 Sonnet | Undisclosed | Dense | ~$3 |
R1 launched at roughly one-fifth to one-twentieth of the per-token API price of contemporaneous proprietary models while delivering competitive reasoning performance.
What’s Next
You’ve learned how DeepSeek R1 achieves reasoning through:
- Pure RL training (R1-Zero) proving reasoning can emerge without human examples
- GRPO algorithm making large-scale RL training practical
- MoE + MLA architecture enabling efficient inference
- Distillation transferring reasoning to smaller, deployable models
Key takeaways:
- Reasoning capabilities can be incentivized through RL alone
- Rule-based rewards (no neural reward model) reduce the risk of reward hacking
- GRPO eliminates the critic network, halving memory and compute requirements
- Distillation effectively transfers reasoning to smaller models
Further Reading:
- DeepSeek V3.2 Architecture - Latest DeepSeek with DSA and integrated reasoning
- DeepSeek R1 Getting Started - Run R1 on Bedrock or locally
- DeepSeek R1 Paper - Full technical details
- DeepSeek V3 Technical Report - Base architecture details
- GRPO Deep Dive - Detailed algorithm explanation