A multi-stage pipeline that integrates DeepSeek Reasoner and Anthropic Claude for enhanced chain-of-thought data generation, built on Qwen2.5 language models. Inspired by the DeepSeek-R1 paper, it showcases:
- Hybrid CoT Generation using DeepSeek + Anthropic expansions
- Cold-Start SFT using enhanced CoT data
- Reasoning-Oriented RL (GRPO) for improved correctness
- Rejection Sampling to gather top responses
- Additional SFT on filtered data
- Final RL for broad scenarios
- Optional Distillation to smaller Qwen2.5 checkpoints
Super Chain of Thought (Super CoT) is an enhanced reasoning framework that combines DeepSeek's chain-of-thought capabilities with selective Anthropic expansions and reinforcement learning. Unlike traditional CoT, which simply shows reasoning steps, Super CoT:
- Structured Reasoning: Uses DeepSeek's `<think>` tags, with Anthropic expansions for uncertain steps
- Iterative Refinement: Applies RL to improve reasoning quality over multiple stages
- Quality Control: Implements rejection sampling to filter and keep only the best reasoning paths
- Knowledge Distillation: Transfers learned reasoning patterns to smaller, more efficient models
The integration of Anthropic Claude adds a crucial layer of reasoning enhancement to our pipeline:
- Automatic Detection: Scans reasoning steps for uncertainty markers (see the sketch after this list), such as:
- "maybe", "not sure", "guess", "uncertain", "unsure"
- Length-based heuristics for complex steps
- Domain-specific uncertainty signals
- Selective Expansion: Only expands steps that need clarification
- Preservation of Clear Reasoning: Leaves well-reasoned steps untouched
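As a rough sketch of this kind of detection (the exact heuristics in the script's `is_uncertain_step` may differ), consider:

```python
# Illustrative sketch only -- the thresholds and markers in the actual script may differ.
UNCERTAINTY_MARKERS = ("maybe", "not sure", "guess", "uncertain", "unsure")

def is_uncertain_step(step_text: str, max_confident_len: int = 400) -> bool:
    """Flag a reasoning step for Anthropic expansion."""
    lowered = step_text.lower()
    # 1) Explicit hedging language
    if any(marker in lowered for marker in UNCERTAINTY_MARKERS):
        return True
    # 2) Length-based heuristic: very long steps tend to hide shaky logic
    if len(step_text) > max_confident_len:
        return True
    return False
```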
- Input Processing:
# Original DeepSeek step
<think>This might be related to quantum tunneling, but I'm not sure...</think>
# Anthropic expansion request
"Please provide a factual grounding of why this step might be correct..."
# Final expanded format
<think>Original step <explanation>Anthropic's detailed grounding</explanation></think>
- Integration Points:
- During initial CoT collection
- In rejection sampling phase
- During final model distillation
- API Integration:
# Setup
anthropic_client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
# Expansion call (Claude 3.5 models use the Messages API)
expansion = anthropic_client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    messages=[{"role": "user", "content": f"Explain why this step is valid: {uncertain_step}"}],
)
- Error Handling (see the sketch after this list):
- Graceful fallbacks to original reasoning
- Rate limiting protection
- Context length management
- Expansion validation
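A sketch of how these guardrails could be wired around the expansion call; `safe_expand` and its parameters are illustrative names, not the script's exact API:

```python
import time
import anthropic

def safe_expand(client: anthropic.Anthropic, uncertain_step: str,
                retries: int = 3, max_tokens: int = 512) -> str:
    """Ask Claude to ground an uncertain step; fall back to the original reasoning."""
    prompt = f"Explain briefly and factually why this reasoning step may be valid:\n{uncertain_step}"
    for attempt in range(retries):
        try:
            msg = client.messages.create(
                model="claude-3-5-sonnet-20241022",
                max_tokens=max_tokens,
                messages=[{"role": "user", "content": prompt}],
            )
            expansion = msg.content[0].text.strip()
            if expansion:  # basic validation before accepting the expansion
                return f"<think>{uncertain_step} <explanation>{expansion}</explanation></think>"
        except anthropic.RateLimitError:
            time.sleep(2 ** attempt)  # simple exponential backoff on rate limits
        except anthropic.APIError:
            break
    # Graceful fallback: keep the original reasoning untouched
    return f"<think>{uncertain_step}</think>"
```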
- Best Practices:
- Keep expansions concise (≤512 tokens)
- Focus on factual grounding
- Maintain reasoning flow
- Preserve original insights
Performance highlights reported in the DeepSeek-R1 paper:
- Math & Reasoning:
- 79.8% Pass@1 on AIME 2024 (surpassing OpenAI-o1-1217)
- 97.3% on MATH-500 (on par with OpenAI-o1-1217)
- Strong performance on MMLU, MMLU-Pro, and GPQA Diamond
- Coding:
- 2,029 Elo rating on Codeforces (96.3 percentile)
- Strong performance on LiveCodeBench
- Competitive results on software engineering tasks
- Distilled Models:
- DeepSeek-R1-Distill-Qwen-7B: 55.5% on AIME 2024
- DeepSeek-R1-Distill-Qwen-32B: 72.6% on AIME 2024, 94.3% on MATH-500
- Hybrid CoT Generation:
- DeepSeek Reasoner for base chain-of-thought
- Anthropic Claude for expanding "uncertain" steps
- Clean conversation history management
- Automatic uncertainty detection
- Enhanced GRPO Implementation (see the sketch after this list):
- Group-based advantage computation
- KL-constrained optimization
- Reference model comparison
- Stable policy updates
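For intuition, the group-based advantage and KL-constrained objective can be sketched as follows. This is a simplified approximation (names and the exact KL estimator are illustrative), not the script's verbatim `rl_training_grpo` code:

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, clip_ratio=0.2, kl_coef=0.04):
    """Group-relative policy loss for one group of sampled responses.

    logp_new / logp_old / logp_ref: summed log-probs of each response under the
    current, behaviour, and frozen reference policies (shape: [group_size]).
    rewards: scalar reward per response (shape: [group_size]).
    """
    # Group-based advantage: normalize rewards within the group (no critic needed)
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Clipped importance-weighted surrogate objective (PPO-style)
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # KL penalty against the reference model keeps updates stable
    kl_penalty = (logp_new - logp_ref).mean()
    return policy_loss + kl_coef * kl_penalty
```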
- Prompting Best Practices:
- Zero-shot prompting recommended
- Direct problem description preferred
- Avoid few-shot examples (can degrade performance)
- Clear output format specification
- Known Limitations:
- Language Mixing: Optimized for Chinese and English; prompts in other languages may yield mixed-language reasoning
- Prompt Sensitivity: Performance varies with prompt structure
- Software Engineering: Limited RL application due to evaluation time
- Function Calling: May need additional fine-tuning for specific formats
This repository provides a Qwen2.5-based implementation inspired by the DeepSeek-R1 paper, enhanced with Anthropic expansions. We focus on making the core ideas accessible through a single, well-documented Python script. Here's how we implement the key concepts:
- Uses Qwen2.5-7B as the foundation model
- Implements GRPO (Group Relative Policy Optimization)
- Integrates Anthropic Claude for uncertain step expansions
- Efficient training without separate critic models
The pipeline begins by gathering high-quality chain-of-thought data from DeepSeek Reasoner and selectively expanding uncertain steps with Anthropic Claude:
- Response Format:
Question: {prompt}
<reasoning_process>
  <think>Step-by-step logical deduction</think>
  <explanation>Anthropic expansion for uncertain steps</explanation>
</reasoning_process>
<summary>
  Final concise answer
</summary>
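A minimal structure check for samples in this format might look like this (illustrative; the script's own validation may be stricter):

```python
import re

REQUIRED_BLOCKS = ("<reasoning_process>", "</reasoning_process>", "<summary>", "</summary>")

def has_valid_structure(sample: str) -> bool:
    """Check that a training sample contains the expected reasoning/summary blocks."""
    if not all(tag in sample for tag in REQUIRED_BLOCKS):
        return False
    # At least one <think> step must appear inside the reasoning process
    reasoning = re.search(r"<reasoning_process>(.*?)</reasoning_process>", sample, re.S)
    return bool(reasoning and "<think>" in reasoning.group(1))
```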
- API Integration:
- DeepSeek Reasoner for base CoT
- Anthropic Claude for uncertain step expansion
- Clean conversation history
- Automatic uncertainty detection
- Error Handling:
- API failures trigger fallbacks
- Rate limiting protection
- Response validation
- Expansion integration checks
Initial supervised fine-tuning on enhanced CoT data:
- Data Processing:
- Tokenization with proper padding
- Sequence length management
- Batch collation
- Expansion preservation
- Training Loop (see the sketch after this list):
- Linear learning rate warmup
- Gradient clipping
- Progress tracking
- Validation of reasoning structure
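A condensed sketch of such a loop, assuming a Hugging Face causal LM and a `DataLoader` yielding `input_ids`/`attention_mask` batches (the script's `supervised_fine_tune` has more options):

```python
import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

def supervised_fine_tune_sketch(model, loader, epochs=2, lr=1e-5, warmup_ratio=0.1, device="cuda"):
    """Minimal SFT loop: linear warmup, gradient clipping, causal-LM loss."""
    model.to(device).train()
    optimizer = AdamW(model.parameters(), lr=lr)
    total_steps = epochs * len(loader)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, int(warmup_ratio * total_steps), total_steps
    )
    for _ in range(epochs):
        for batch in loader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            # Labels are the inputs themselves for causal-LM fine-tuning
            outputs = model(input_ids, attention_mask=attention_mask, labels=input_ids)
            outputs.loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
```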
Group Relative Policy Optimization (GRPO):
- Policy Architecture:
- Language model as base policy
- Token-level probability computation
- Group advantage estimation
- KL divergence constraints
- Reward Structure (see the sketch after this list):
- +1.0 for correct answers
- +0.2 for proper reasoning format
- Bonus for utilizing expansions
- Normalized advantages within groups
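The reward structure above could be implemented roughly as follows; the exact weights beyond +1.0/+0.2 (e.g. the size of the expansion bonus) are illustrative assumptions, not the script's exact `compute_reward`:

```python
def compute_reward(response: str, ground_truth: str) -> float:
    """Sketch of the reward described above: correctness + format + expansion bonus."""
    reward = 0.0
    if ground_truth and ground_truth.strip() in response:
        reward += 1.0   # correct final answer
    if "<reasoning_process>" in response and "<summary>" in response:
        reward += 0.2   # proper reasoning format
    if "<explanation>" in response:
        reward += 0.1   # bonus for making use of Anthropic expansions (assumed value)
    return reward
```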
Quality-focused data augmentation:
- Sampling Strategy (see the sketch after this list):
- Multiple candidates per question
- Temperature-controlled generation
- Reward-based filtering
- Expansion preservation check
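A sketch of this sampling strategy, reusing the `compute_reward` sketch above (generation parameters are illustrative):

```python
import torch

@torch.no_grad()
def rejection_sample(model, tokenizer, question, ground_truth,
                     num_candidates=4, temperature=0.8, device="cuda"):
    """Generate several candidates and keep the highest-reward one."""
    inputs = tokenizer(question, return_tensors="pt").to(device)
    outputs = model.generate(
        **inputs,
        do_sample=True,
        temperature=temperature,
        max_new_tokens=512,
        num_return_sequences=num_candidates,
        pad_token_id=tokenizer.eos_token_id,
    )
    candidates = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    scored = [(compute_reward(c, ground_truth), c) for c in candidates]
    best_reward, best_candidate = max(scored, key=lambda x: x[0])
    return best_candidate if best_reward > 0 else None  # drop samples that earn no reward
```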
- Additional Training:
- Fine-tuning on best samples
- Shorter training cycle
- Preservation of reasoning structure
- Integration of expansions
Comprehensive reinforcement learning:
- Policy Updates:
- KL-constrained optimization
- Reference model comparison
- Stable policy improvement
- Expansion-aware updates
- Monitoring:
- Reward tracking
- Loss curves
- Policy divergence checks
- Expansion utilization metrics
Knowledge transfer to smaller models:
- Student Selection:
- Smaller Qwen2.5 variants
- Architecture preservation
- Memory optimization
- Expansion handling capability
- Training Process (see the sketch after this list):
- Teacher prediction generation
- Student mimicry learning
- Checkpoint management
- CoT structure preservation
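In sketch form, sequence-level distillation amounts to letting the teacher generate CoT targets and fine-tuning the student on them (model names and parameters below are illustrative):

```python
import torch

@torch.no_grad()
def build_distillation_set(teacher, tokenizer, prompts, device="cuda"):
    """Teacher generates full CoT answers; these become the student's SFT targets."""
    teacher.to(device).eval()
    samples = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(device)
        output = teacher.generate(**inputs, max_new_tokens=512, do_sample=False)
        samples.append(tokenizer.decode(output[0], skip_special_tokens=True))
    return samples
```

The student (a smaller Qwen2.5 checkpoint) is then trained on these samples with the same supervised fine-tuning loop used for the cold-start stage.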
- Memory Efficiency:
- Gradient checkpointing by default
- Automatic mixed precision (AMP)
- Dynamic batch sizing
- Efficient attention patterns
- Training Stability:
- Group advantage normalization
- KL divergence constraints
- Reference model comparisons
- Progress monitoring
- Current Status: Our implementation demonstrates the core DeepSeek-R1 concepts and the Anthropic expansion integration; it is intended as a starting point for experimenting with the paper's methodology rather than as a production system.
Enhanced knowledge distillation approach:
- Teacher: Trained Qwen2.5-7B model
- Student: Smaller Qwen2.5 variants (1.5B to 32B)
- Training: Supervised learning with CoT preservation
- Focus: Maintaining reasoning capabilities
- Python 3.8+
- GPU (recommended for RL):
- Minimum: Single GPU with 24GB VRAM
- Recommended: 40GB+ VRAM (A40, A100)
- CPU: 32+ cores
- RAM: 64GB+
- API Keys:
# DeepSeek API for CoT generation
export DEEPSEEK_API_KEY="your-key-here"
# Anthropic API for expansions
export ANTHROPIC_API_KEY="your-key-here"
- Dependencies:
pip install -r requirements.txt
- Setup:
python -m venv venv
source venv/bin/activate  # or `venv\Scripts\activate` on Windows
pip install -r requirements.txt
- API Setup:
export DEEPSEEK_API_KEY="your-key-here"
export ANTHROPIC_API_KEY="your-key-here"
- Run Pipeline:
python deepseek_qwen2_5_integration_r1.py
Modify `prompts` in `main()`:
prompts = [
"Explain quantum entanglement",
"Solve the traveling salesman problem",
"Derive the quadratic formula"
]
- DeepSeek Usage:
# Extract both reasoning and final answer
reasoning_cot = choice.reasoning_content  # Contains <think> tags
final_text = choice.content               # Final answer only
- Anthropic Integration:
# Expand uncertain steps
if is_uncertain_step(reasoning_text):
    expansion = call_anthropic_expansion(
        client,
        model="claude-3-5-sonnet-20241022",
        raw_thought=reasoning_text,
    )
Key parameters to adjust:
# SFT parameters
supervised_fine_tune(
epochs=5, # More epochs for better convergence
batch_size=4, # Increase for faster training
lr=5e-6, # Lower learning rate for stability
warmup_ratio=0.1 # Longer warmup for complex tasks
)
# RL parameters
rl_training_grpo(
num_rl_steps=100, # More steps for better policy
group_size=8, # Larger groups for stable updates
lr=1e-6, # Conservative learning rate
clip_ratio=0.15 # Tighter clipping for safety
)
This repository provides a single-script implementation inspired by the DeepSeek-R1 paper, making it easy to understand and modify. Key features:
- All-in-One Design:
- Complete pipeline in `deepseek_qwen2_5_integration_r1.py`
- No complex dependencies or distributed setup required
- Easy to modify and experiment with
- Hardware Requirements:
- Minimum: Single GPU with 24GB VRAM (e.g., RTX 3090)
- Recommended: 40GB+ VRAM (e.g., A40, A100)
- CPU: 32+ cores recommended
- RAM: 64GB+ recommended
- Training Time Estimates:
- Cold-Start SFT: ~2-4 hours
- Initial RL: ~8-12 hours
- Rejection Sampling: ~2-3 hours
- Additional SFT: ~4-6 hours
- Final RL: ~12-24 hours
- Optional Distillation: ~6-8 hours per model size
- Memory Optimization (see the sketch after this list):
- Gradient checkpointing enabled by default
- Automatic mixed precision (AMP) training
- Efficient attention implementation
- Dynamic batch sizing based on available VRAM
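Dynamic batch sizing can be approximated from free VRAM; the thresholds below are illustrative assumptions, not the script's exact values:

```python
import torch

def pick_batch_size(per_sample_gib: float = 2.0, reserve_gib: float = 4.0) -> int:
    """Choose a batch size from currently free VRAM (illustrative heuristic)."""
    if not torch.cuda.is_available():
        return 1
    free_bytes, _total = torch.cuda.mem_get_info()
    free_gib = free_bytes / 1024**3
    return max(1, int((free_gib - reserve_gib) // per_sample_gib))
```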
- Customization Points:
- Reward functions in `compute_reward()`
- Model architectures in policy classes
- Training hyperparameters in each stage
- Data collection and filtering strategies
Resource Note: For users with limited GPU resources, the script includes flags to run smaller experiments or skip certain stages. The minimal version can run on a 16GB GPU but with reduced performance.
- Overview
- Features
- Requirements
- Project Structure
- Usage
- Pipeline Stages
- Key Code Snippets
- Advanced Topics
- Citing & Acknowledgments
- License
R1 + Super CoT: ChainForge follows the methodology of DeepSeek-R1 to enhance a Qwen2.5 model's reasoning abilities via reinforcement learning (RL). We:
- Retrieve high-quality chain-of-thought (CoT) from DeepSeek Reasoner's `reasoning_content`
- Use it for a "cold-start" supervised fine-tuning (SFT)
- Conduct Reasoning-Oriented RL to boost correctness and clarity
- Utilize rejection sampling to pick the best RL outputs
- Perform additional SFT on these curated samples
- Optionally distill the final large model into a smaller Qwen2.5 checkpoint
Note: This is a reference pipeline. For production usage, expand datasets, scale RL steps, and incorporate advanced reward modeling.
- DeepSeek Reasoner Integration:
- Automates CoT collection via `reasoning_content`
- Properly handles `<think>` tags in chain-of-thought
- Maintains clean conversation history without reasoning feedback
- Qwen2.5-7B Base Model: Hugging Face model with RoPE and large context support
- Group-based RL: A GRPO-like approach for stable reinforcement training
- Rejection Sampling: Extracts best RL completions for further SFT
- Distillation: Compress final RL knowledge into smaller Qwen2.5 variants
.
├── deepseek_qwen2_5_integration_r1.py # Main pipeline implementation
├── requirements.txt # Python dependencies
└── README.md # Documentation
- DeepSeek Integration (`gather_cot_data_from_deepseek`):
- Automated CoT collection using `reasoning_content`
- Proper handling of `<think>` tags
- Clean conversation history management
- Error handling and fallbacks
- Dataset Classes:
- `ChainOfThoughtDataset`: for initial SFT
- `MockRLReasoningDataset`: for RL training
- `AdditionalSFTDataset`: for post-RL fine-tuning
- RL Components (see the sketch after this list):
- `GRPOTorchPolicy`: policy wrapper
- `compute_reward`: reward function
- `sample_responses`: response generation
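For orientation, a policy wrapper in the spirit of `GRPOTorchPolicy` mainly has to score generated sequences under the current model; a minimal sketch (not the script's exact class):

```python
import torch
import torch.nn.functional as F

class PolicyWrapper:
    """Minimal wrapper that scores generated sequences under a causal LM."""

    def __init__(self, model):
        self.model = model

    def sequence_logprob(self, input_ids, attention_mask, prompt_len):
        """Summed log-prob of the response tokens (everything after the prompt)."""
        logits = self.model(input_ids, attention_mask=attention_mask).logits
        # Shift so that logits at position t predict token t+1
        log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)
        target = input_ids[:, 1:]
        token_logp = log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1)
        # Only count tokens belonging to the generated response
        response_mask = attention_mask[:, 1:].clone()
        response_mask[:, : prompt_len - 1] = 0
        return (token_logp * response_mask).sum(dim=-1)
```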
- Setup Environment:
# Create virtual environment
python -m venv venv
source venv/bin/activate  # or `venv\Scripts\activate` on Windows
# Install dependencies
pip install -r requirements.txt
- Set API Keys:
export DEEPSEEK_API_KEY="your-key-here"
export ANTHROPIC_API_KEY="your-key-here"
- Run Pipeline:
python deepseek_qwen2_5_integration_r1.py
Modify `deepseek_prompts` in `main()`:
deepseek_prompts = [
"Explain quantum entanglement",
"Solve the traveling salesman problem",
"Derive the quadratic formula"
]
Important notes for using DeepSeek Reasoner:
- Handling `reasoning_content`:
# Extract both reasoning and final answer
reasoning_cot = choice.reasoning_content  # Contains <think> tags
final_text = choice.content               # Final answer only
# Never feed reasoning_content back into the conversation
messages.append({"role": "assistant", "content": final_text})
- Supported Parameters:
# Only use these parameters
response = openai.ChatCompletion.create(
    model="deepseek-reasoner",
    messages=messages,
    max_tokens=1024
)
- Conversation History:
- Only append final answers (`content`)
- Never include `reasoning_content` in history
- Keep track of turns properly
- Distributed Training:
# Add to model configuration
device_map = "auto"  # or specific device mapping
- Dataset Expansion:
- Collect more DeepSeek CoT samples
- Gather targeted Anthropic expansions
- Implement custom reward models
- Add task-specific datasets
- Gradient Checkpointing:
model.gradient_checkpointing_enable()
- Mixed Precision:
from torch.cuda.amp import autocast

with autocast():
    outputs = model(input_ids)
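For training (rather than inference-only) passes, autocast is typically paired with a gradient scaler; a minimal sketch:

```python
import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

def amp_training_step(model, optimizer, input_ids):
    """One mixed-precision optimization step for a causal LM."""
    optimizer.zero_grad()
    with autocast():
        loss = model(input_ids, labels=input_ids).loss
    scaler.scale(loss).backward()  # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```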
Implement domain-specific rewards (here `base_reward` and `domain_specific_score` stand for user-supplied scoring functions):
def compute_domain_reward(response, ground_truth, has_expansion=False):
reward = base_reward(response, ground_truth)
if has_expansion:
reward *= 1.1 # Bonus for utilizing expansions
reward += domain_specific_score(response)
return reward
If you use this code, please cite:
@misc{deepseekai2025r1,
  title={DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning},
  author={DeepSeek-AI},
  year={2025},
  eprint={2501.12948},
  archivePrefix={arXiv}
}
- Nicolas W Schlaepfer (Initial Implementation)
- DeepSeek Team (Original R1 Methodology)
- Qwen Team (Base Models)
- Anthropic (Claude Integration)
MIT License. See LICENSE for details.