A multi-stage pipeline that integrates DeepSeek Reasoner and Anthropic Claude for enhanced chain-of-thought data generation, built on Qwen2.5 language models. Inspired by the DeepSeek-R1 paper, it showcases:
- Hybrid CoT Generation using DeepSeek + Anthropic expansions
- Cold-Start SFT using enhanced CoT data
- Reasoning-Oriented RL (GRPO) for improved correctness
- Rejection Sampling to gather top responses
- Additional SFT on filtered data
- Final RL for broad scenarios
- Optional Distillation to smaller Qwen2.5 checkpoints
Super Chain of Thought (Super CoT) is an enhanced reasoning framework that combines DeepSeek's chain-of-thought capabilities with selective Anthropic expansions and reinforcement learning. Unlike traditional CoT, which simply shows reasoning steps, Super CoT:
- Structured Reasoning: Uses DeepSeek's `<think>` tags, with Anthropic expansions for uncertain steps
- Iterative Refinement: Applies RL to improve reasoning quality over multiple stages
- Quality Control: Implements rejection sampling to filter and keep only the best reasoning paths
- Knowledge Distillation: Transfers learned reasoning patterns to smaller, more efficient models
The integration of Anthropic Claude adds a crucial layer of reasoning enhancement to our pipeline:
- Automatic Detection: Scans reasoning steps for uncertainty markers (see the sketch after this list), such as:
- "maybe", "not sure", "guess", "uncertain", "unsure"
- Length-based heuristics for complex steps
- Domain-specific uncertainty signals
- Selective Expansion: Only expands steps that need clarification
- Preservation of Clear Reasoning: Leaves well-reasoned steps untouched
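As a rough sketch of this kind of detection (the exact heuristics in the script's `is_uncertain_step` may differ), consider:

```python
# Illustrative sketch only -- the thresholds and markers in the actual script may differ.
UNCERTAINTY_MARKERS = ("maybe", "not sure", "guess", "uncertain", "unsure")

def is_uncertain_step(step_text: str, max_confident_len: int = 400) -> bool:
    """Flag a reasoning step for Anthropic expansion."""
    lowered = step_text.lower()
    # 1) Explicit hedging language
    if any(marker in lowered for marker in UNCERTAINTY_MARKERS):
        return True
    # 2) Length-based heuristic: very long steps tend to hide shaky logic
    if len(step_text) > max_confident_len:
        return True
    return False
```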
- Input Processing:
# Original DeepSeek step
<think>This might be related to quantum tunneling, but I'm not sure...</think>
# Anthropic expansion request
"Please provide a factual grounding of why this step might be correct..."
# Final expanded format
<think>Original step <explanation>Anthropic's detailed grounding</explanation></think>
- Integration Points:
- During initial CoT collection
- In rejection sampling phase
- During final model distillation
- API Integration:
# Setup
anthropic_client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
# Expansion call (Claude 3.5 models use the Messages API)
expansion = anthropic_client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    messages=[{"role": "user", "content": f"Explain why this step is valid: {uncertain_step}"}],
)
- Error Handling (see the sketch after this list):
- Graceful fallbacks to original reasoning
- Rate limiting protection
- Context length management
- Expansion validation
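A sketch of how these guardrails could be wired around the expansion call; `safe_expand` and its parameters are illustrative names, not the script's exact API:

```python
import time
import anthropic

def safe_expand(client: anthropic.Anthropic, uncertain_step: str,
                retries: int = 3, max_tokens: int = 512) -> str:
    """Ask Claude to ground an uncertain step; fall back to the original reasoning."""
    prompt = f"Explain briefly and factually why this reasoning step may be valid:\n{uncertain_step}"
    for attempt in range(retries):
        try:
            msg = client.messages.create(
                model="claude-3-5-sonnet-20241022",
                max_tokens=max_tokens,
                messages=[{"role": "user", "content": prompt}],
            )
            expansion = msg.content[0].text.strip()
            if expansion:  # basic validation before accepting the expansion
                return f"<think>{uncertain_step} <explanation>{expansion}</explanation></think>"
        except anthropic.RateLimitError:
            time.sleep(2 ** attempt)  # simple exponential backoff on rate limits
        except anthropic.APIError:
            break
    # Graceful fallback: keep the original reasoning untouched
    return f"<think>{uncertain_step}</think>"
```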
- Best Practices:
- Keep expansions concise (≤512 tokens)
- Focus on factual grounding
- Maintain reasoning flow
- Preserve original insights
Performance highlights reported in the DeepSeek-R1 paper:
- Math & Reasoning:
- 79.8% Pass@1 on AIME 2024 (surpassing OpenAI-o1-1217)
- 97.3% on MATH-500 (on par with OpenAI-o1-1217)
- Strong performance on MMLU, MMLU-Pro, and GPQA Diamond
- Coding:
- 2,029 Elo rating on Codeforces (96.3 percentile)
- Strong performance on LiveCodeBench
- Competitive results on software engineering tasks
- Distilled Models:
- DeepSeek-R1-Distill-Qwen-7B: 55.5% on AIME 2024
- DeepSeek-R1-Distill-Qwen-32B: 72.6% on AIME 2024, 94.3% on MATH-500
- Hybrid CoT Generation:
- DeepSeek Reasoner for base chain-of-thought
- Anthropic Claude for expanding "uncertain" steps
- Clean conversation history management
- Automatic uncertainty detection
- Enhanced GRPO Implementation (see the sketch after this list):
- Group-based advantage computation
- KL-constrained optimization
- Reference model comparison
- Stable policy updates
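For intuition, the group-based advantage and KL-constrained objective can be sketched as follows. This is a simplified approximation (names and the exact KL estimator are illustrative), not the script's verbatim `rl_training_grpo` code:

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, clip_ratio=0.2, kl_coef=0.04):
    """Group-relative policy loss for one group of sampled responses.

    logp_new / logp_old / logp_ref: summed log-probs of each response under the
    current, behaviour, and frozen reference policies (shape: [group_size]).
    rewards: scalar reward per response (shape: [group_size]).
    """
    # Group-based advantage: normalize rewards within the group (no critic needed)
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Clipped importance-weighted surrogate objective (PPO-style)
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # KL penalty against the reference model keeps updates stable
    kl_penalty = (logp_new - logp_ref).mean()
    return policy_loss + kl_coef * kl_penalty
```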
- Prompting Best Practices:
- Zero-shot prompting recommended
- Direct problem description preferred
- Avoid few-shot examples (can degrade performance)
- Clear output format specification
- Known Limitations:
- Language Mixing: Optimized for Chinese and English; prompts in other languages may yield mixed-language reasoning
- Prompt Sensitivity: Performance varies with prompt structure
- Software Engineering: Limited RL application due to evaluation time
- Function Calling: May need additional fine-tuning for specific formats
This repository provides a Qwen2.5-based implementation inspired by the DeepSeek-R1 paper, enhanced with Anthropic expansions. We focus on making the core ideas accessible through a single, well-documented Python script. Here's how we implement the key concepts:
- Uses Qwen2.5-7B as the foundation model
- Implements GRPO (Group Relative Policy Optimization)
- Integrates Anthropic Claude for uncertain step expansions
- Efficient training without separate critic models
The pipeline begins by gathering high-quality chain-of-thought data from DeepSeek Reasoner and selectively expanding uncertain steps with Anthropic Claude:
- Response Format:
Question: {prompt}
<reasoning_process>
  <think>Step-by-step logical deduction</think>
  <explanation>Anthropic expansion for uncertain steps</explanation>
</reasoning_process>
<summary>
  Final concise answer
</summary>
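A minimal structure check for samples in this format might look like this (illustrative; the script's own validation may be stricter):

```python
import re

REQUIRED_BLOCKS = ("<reasoning_process>", "</reasoning_process>", "<summary>", "</summary>")

def has_valid_structure(sample: str) -> bool:
    """Check that a training sample contains the expected reasoning/summary blocks."""
    if not all(tag in sample for tag in REQUIRED_BLOCKS):
        return False
    # At least one <think> step must appear inside the reasoning process
    reasoning = re.search(r"<reasoning_process>(.*?)</reasoning_process>", sample, re.S)
    return bool(reasoning and "<think>" in reasoning.group(1))
```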
- API Integration:
- DeepSeek Reasoner for base CoT
- Anthropic Claude for uncertain step expansion
- Clean conversation history
- Automatic uncertainty detection
- Error Handling:
- API failures trigger fallbacks
- Rate limiting protection
- Response validation
- Expansion integration checks
Initial supervised fine-tuning on enhanced CoT data:
- Data Processing:
- Tokenization with proper padding
- Sequence length management
- Batch collation
- Expansion preservation
- Training Loop (see the sketch after this list):
- Linear learning rate warmup
- Gradient clipping
- Progress tracking
- Validation of reasoning structure
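A condensed sketch of such a loop, assuming a Hugging Face causal LM and a `DataLoader` yielding `input_ids`/`attention_mask` batches (the script's `supervised_fine_tune` has more options):

```python
import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

def supervised_fine_tune_sketch(model, loader, epochs=2, lr=1e-5, warmup_ratio=0.1, device="cuda"):
    """Minimal SFT loop: linear warmup, gradient clipping, causal-LM loss."""
    model.to(device).train()
    optimizer = AdamW(model.parameters(), lr=lr)
    total_steps = epochs * len(loader)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, int(warmup_ratio * total_steps), total_steps
    )
    for _ in range(epochs):
        for batch in loader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            # Labels are the inputs themselves for causal-LM fine-tuning
            outputs = model(input_ids, attention_mask=attention_mask, labels=input_ids)
            outputs.loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
```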
Group Relative Policy Optimization (GRPO):
- Policy Architecture:
- Language model as base policy
- Token-level probability computation
- Group advantage estimation
- KL divergence constraints
- Reward Structure (see the sketch after this list):
- +1.0 for correct answers
- +0.2 for proper reasoning format
- Bonus for utilizing expansions
- Normalized advantages within groups
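The reward structure above could be implemented roughly as follows; the exact weights beyond +1.0/+0.2 (e.g. the size of the expansion bonus) are illustrative assumptions, not the script's exact `compute_reward`:

```python
def compute_reward(response: str, ground_truth: str) -> float:
    """Sketch of the reward described above: correctness + format + expansion bonus."""
    reward = 0.0
    if ground_truth and ground_truth.strip() in response:
        reward += 1.0   # correct final answer
    if "<reasoning_process>" in response and "<summary>" in response:
        reward += 0.2   # proper reasoning format
    if "<explanation>" in response:
        reward += 0.1   # bonus for making use of Anthropic expansions (assumed value)
    return reward
```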
Quality-focused data augmentation:
- Sampling Strategy (see the sketch after this list):
- Multiple candidates per question
- Temperature-controlled generation
- Reward-based filtering
- Expansion preservation check
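A sketch of this sampling strategy, reusing the `compute_reward` sketch above (generation parameters are illustrative):

```python
import torch

@torch.no_grad()
def rejection_sample(model, tokenizer, question, ground_truth,
                     num_candidates=4, temperature=0.8, device="cuda"):
    """Generate several candidates and keep the highest-reward one."""
    inputs = tokenizer(question, return_tensors="pt").to(device)
    outputs = model.generate(
        **inputs,
        do_sample=True,
        temperature=temperature,
        max_new_tokens=512,
        num_return_sequences=num_candidates,
        pad_token_id=tokenizer.eos_token_id,
    )
    candidates = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    scored = [(compute_reward(c, ground_truth), c) for c in candidates]
    best_reward, best_candidate = max(scored, key=lambda x: x[0])
    return best_candidate if best_reward > 0 else None  # drop samples that earn no reward
```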
- Additional Training:
- Fine-tuning on best samples
- Shorter training cycle
- Preservation of reasoning structure
- Integration of expansions
Comprehensive reinforcement learning:
- Policy Updates:
- KL-constrained optimization
- Reference model comparison
- Stable policy improvement
- Expansion-aware updates
- Monitoring:
- Reward tracking
- Loss curves
- Policy divergence checks
- Expansion utilization metrics
Knowledge transfer to smaller models:
- Student Selection:
- Smaller Qwen2.5 variants
- Architecture preservation
- Memory optimization
- Expansion handling capability
- Training Process (see the sketch after this list):
- Teacher prediction generation
- Student mimicry learning
- Checkpoint management
- CoT structure preservation
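In sketch form, sequence-level distillation amounts to letting the teacher generate CoT targets and fine-tuning the student on them (model names and parameters below are illustrative):

```python
import torch

@torch.no_grad()
def build_distillation_set(teacher, tokenizer, prompts, device="cuda"):
    """Teacher generates full CoT answers; these become the student's SFT targets."""
    teacher.to(device).eval()
    samples = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(device)
        output = teacher.generate(**inputs, max_new_tokens=512, do_sample=False)
        samples.append(tokenizer.decode(output[0], skip_special_tokens=True))
    return samples
```

The student (a smaller Qwen2.5 checkpoint) is then trained on these samples with the same supervised fine-tuning loop used for the cold-start stage.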
- Memory Efficiency:
- Gradient checkpointing by default
- Automatic mixed precision (AMP)
- Dynamic batch sizing
- Efficient attention patterns
- Training Stability:
- Group advantage normalization
- KL divergence constraints
- Reference model comparisons
- Progress monitoring
- Current Status: Our implementation demonstrates the core DeepSeek-R1 concepts and the Anthropic expansion integration; it is intended as a starting point for experimenting with the paper's methodology rather than as a production system.
Enhanced knowledge distillation approach:
- Teacher: Trained Qwen2.5-7B model
- Student: Smaller Qwen2.5 variants (1.5B to 32B)
- Training: Supervised learning with CoT preservation
- Focus: Maintaining reasoning capabilities
- Python 3.8+
- GPU (recommended for RL):
- Minimum: Single GPU with 24GB VRAM
- Recommended: 40GB+ VRAM (A40, A100)
- CPU: 32+ cores
- RAM: 64GB+
- API Keys:
# DeepSeek API for CoT generation
export DEEPSEEK_API_KEY="your-key-here"
# Anthropic API for expansions
export ANTHROPIC_API_KEY="your-key-here"
- Dependencies:
pip install -r requirements.txt
- Setup:
python -m venv venv
source venv/bin/activate  # or `venv\Scripts\activate` on Windows
pip install -r requirements.txt
- API Setup:
export DEEPSEEK_API_KEY="your-key-here"
export ANTHROPIC_API_KEY="your-key-here"
- Run Pipeline:
python deepseek_qwen2_5_integration_r1.py
Modify `prompts` in `main()`:
prompts = [
"Explain quantum entanglement",
"Solve the traveling salesman problem",
"Derive the quadratic formula"
]
- DeepSeek Usage:
# Extract both reasoning and final answer
reasoning_cot = choice.reasoning_content  # Contains <think> tags
final_text = choice.content               # Final answer only
- Anthropic Integration:
# Expand uncertain steps
if is_uncertain_step(reasoning_text):
    expansion = call_anthropic_expansion(
        client,
        model="claude-3-5-sonnet-20241022",
        raw_thought=reasoning_text,
    )
Key parameters to adjust:
# SFT parameters
supervised_fine_tune(
epochs=5, # More epochs for better convergence
batch_size=4, # Increase for faster training
lr=5e-6, # Lower learning rate for stability
warmup_ratio=0.1 # Longer warmup for complex tasks
)
# RL parameters
rl_training_grpo(
num_rl_steps=100, # More steps for better policy
group_size=8, # Larger groups for stable updates
lr=1e-6, # Conservative learning rate
clip_ratio=0.15 # Tighter clipping for safety
)
This repository provides a single-script implementation inspired by the DeepSeek-R1 paper, making it easy to understand and modify. Key features:
- All-in-One Design:
- Complete pipeline in `deepseek_qwen2_5_integration_r1.py`
- No complex dependencies or distributed setup required
- Easy to modify and experiment with
- Hardware Requirements:
- Minimum: Single GPU with 24GB VRAM (e.g., RTX 3090)
- Recommended: 40GB+ VRAM (e.g., A40, A100)
- CPU: 32+ cores recommended
- RAM: 64GB+ recommended
- Training Time Estimates:
- Cold-Start SFT: ~2-4 hours
- Initial RL: ~8-12 hours
- Rejection Sampling: ~2-3 hours
- Additional SFT: ~4-6 hours
- Final RL: ~12-24 hours
- Optional Distillation: ~6-8 hours per model size
- Memory Optimization (see the sketch after this list):
- Gradient checkpointing enabled by default
- Automatic mixed precision (AMP) training
- Efficient attention implementation
- Dynamic batch sizing based on available VRAM
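Dynamic batch sizing can be approximated from free VRAM; the thresholds below are illustrative assumptions, not the script's exact values:

```python
import torch

def pick_batch_size(per_sample_gib: float = 2.0, reserve_gib: float = 4.0) -> int:
    """Choose a batch size from currently free VRAM (illustrative heuristic)."""
    if not torch.cuda.is_available():
        return 1
    free_bytes, _total = torch.cuda.mem_get_info()
    free_gib = free_bytes / 1024**3
    return max(1, int((free_gib - reserve_gib) // per_sample_gib))
```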
- Customization Points:
- Reward functions in `compute_reward()`
- Model architectures in policy classes
- Training hyperparameters in each stage
- Data collection and filtering strategies
Resource Note: For users with limited GPU resources, the script includes flags to run smaller experiments or skip certain stages. The minimal version can run on a 16GB GPU but with reduced performance.
- Overview
- Features
- Requirements
- Project Structure
- Usage
- Pipeline Stages
- Key Code Snippets
- Advanced Topics
- Citing & Acknowledgments
- License
R1 + Super CoT: ChainForge follows the methodology of DeepSeek-R1 to enhance a Qwen2.5 model's reasoning abilities via reinforcement learning (RL). We:
- Retrieve high-quality chain-of-thought (CoT) from DeepSeek Reasoner's `reasoning_content`
- Use it for a "cold-start" supervised fine-tuning (SFT)
- Conduct Reasoning-Oriented RL to boost correctness and clarity
- Utilize rejection sampling to pick the best RL outputs
- Perform additional SFT on these curated samples
- Optionally distill the final large model into a smaller Qwen2.5 checkpoint
Note: This is a reference pipeline. For production usage, expand datasets, scale RL steps, and incorporate advanced reward modeling.
- DeepSeek Reasoner Integration:
- Automates CoT collection via `reasoning_content`
- Properly handles `<think>` tags in chain-of-thought
- Maintains clean conversation history without reasoning feedback
- Qwen2.5-7B Base Model: Hugging Face model with RoPE and large context support
- Group-based RL: A GRPO-like approach for stable reinforcement training
- Rejection Sampling: Extracts best RL completions for further SFT
- Distillation: Compress final RL knowledge into smaller Qwen2.5 variants
.
├── deepseek_qwen2_5_integration_r1.py # Main pipeline implementation
├── requirements.txt # Python dependencies
└── README.md # Documentation
- DeepSeek Integration (`gather_cot_data_from_deepseek`):
- Automated CoT collection using `reasoning_content`
- Proper handling of `<think>` tags
- Clean conversation history management
- Error handling and fallbacks
- Dataset Classes:
- `ChainOfThoughtDataset`: for initial SFT
- `MockRLReasoningDataset`: for RL training
- `AdditionalSFTDataset`: for post-RL fine-tuning
- RL Components (see the sketch after this list):
- `GRPOTorchPolicy`: policy wrapper
- `compute_reward`: reward function
- `sample_responses`: response generation
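For orientation, a policy wrapper in the spirit of `GRPOTorchPolicy` mainly has to score generated sequences under the current model; a minimal sketch (not the script's exact class):

```python
import torch
import torch.nn.functional as F

class PolicyWrapper:
    """Minimal wrapper that scores generated sequences under a causal LM."""

    def __init__(self, model):
        self.model = model

    def sequence_logprob(self, input_ids, attention_mask, prompt_len):
        """Summed log-prob of the response tokens (everything after the prompt)."""
        logits = self.model(input_ids, attention_mask=attention_mask).logits
        # Shift so that logits at position t predict token t+1
        log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)
        target = input_ids[:, 1:]
        token_logp = log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1)
        # Only count tokens belonging to the generated response
        response_mask = attention_mask[:, 1:].clone()
        response_mask[:, : prompt_len - 1] = 0
        return (token_logp * response_mask).sum(dim=-1)
```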
- Setup Environment:
# Create virtual environment
python -m venv venv
source venv/bin/activate  # or `venv\Scripts\activate` on Windows
# Install dependencies
pip install -r requirements.txt
- Set API Keys:
export DEEPSEEK_API_KEY="your-key-here"
export ANTHROPIC_API_KEY="your-key-here"
- Run Pipeline:
python deepseek_qwen2_5_integration_r1.py
Modify `deepseek_prompts` in `main()`:
deepseek_prompts = [
"Explain quantum entanglement",
"Solve the traveling salesman problem",
"Derive the quadratic formula"
]
Important notes for using DeepSeek Reasoner:
- Handling `reasoning_content`:
# Extract both reasoning and final answer
reasoning_cot = choice.reasoning_content  # Contains <think> tags
final_text = choice.content               # Final answer only
# Never feed reasoning_content back into the conversation
messages.append({"role": "assistant", "content": final_text})
- Supported Parameters:
# Only use these parameters
response = openai.ChatCompletion.create(
    model="deepseek-reasoner",
    messages=messages,
    max_tokens=1024
)
- Conversation History:
- Only append final answers (`content`)
- Never include `reasoning_content` in history
- Keep track of turns properly
- Distributed Training:
# Add to model configuration
device_map = "auto"  # or specific device mapping
- Dataset Expansion:
- Collect more DeepSeek CoT samples
- Gather targeted Anthropic expansions
- Implement custom reward models
- Add task-specific datasets
- Gradient Checkpointing:
model.gradient_checkpointing_enable()
- Mixed Precision:
from torch.cuda.amp import autocast

with autocast():
    outputs = model(input_ids)
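For training (rather than inference-only) passes, autocast is typically paired with a gradient scaler; a minimal sketch:

```python
import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

def amp_training_step(model, optimizer, input_ids):
    """One mixed-precision optimization step for a causal LM."""
    optimizer.zero_grad()
    with autocast():
        loss = model(input_ids, labels=input_ids).loss
    scaler.scale(loss).backward()  # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```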
Implement domain-specific rewards (here `base_reward` and `domain_specific_score` stand for user-supplied scoring functions):
def compute_domain_reward(response, ground_truth, has_expansion=False):
reward = base_reward(response, ground_truth)
if has_expansion:
reward *= 1.1 # Bonus for utilizing expansions
reward += domain_specific_score(response)
return reward
If you use this code, please cite:
@misc{deepseekai2025r1,
  title={DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning},
  author={DeepSeek-AI},
  year={2025},
  eprint={2501.12948},
  archivePrefix={arXiv}
}
- Nicolas W Schlaepfer (Initial Implementation)
- DeepSeek Team (Original R1 Methodology)
- Qwen Team (Base Models)
- Anthropic (Claude Integration)
MIT License. See LICENSE for details.