A rule-based speculative decoding framework that accelerates code generation by predicting common patterns like indentation, brackets, and code structure without requiring a separate draft model.
RSD (Rule-based Speculative Decoding) accelerates large language model inference for code generation tasks. Unlike traditional speculative decoding, which requires a separate draft model, RSD uses rule-based heuristics to predict common code patterns, with a focus on Python code generation with Llama-3 models.
- Rule-based Draft Generation: Predicts common code patterns without requiring a separate draft model
- Python Code Optimization: Specifically designed for Python code generation with intelligent indentation prediction
- Llama-3 Compatibility: Optimized for Llama-3 tokenization and code generation patterns
- KV Cache Support: Full support for KV cache to accelerate inference
- Visual Token Analysis: Interactive notebook for analyzing token patterns and rule effectiveness
- Performance Monitoring: Real-time performance metrics and acceptance rate tracking
- SpeculativeDecoding Class: Main implementation with both rule-based and traditional speculative decoding
- Rule-based Draft Generator: Analyzes code context to predict indentation and common patterns
- Token Visualization: Interactive tools for analyzing token patterns and rule effectiveness
- Performance Metrics: Comprehensive benchmarking and analysis tools
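As a rough illustration of how a rule-based draft can be checked when `temperature=0`, the sketch below applies the standard greedy acceptance rule from speculative decoding: draft tokens are kept while they match the large model's own greedy choices, and the first mismatch is replaced by the large model's token. The function name `accept_draft` and the plain integer token ids are assumptions for illustration, not code taken from the `SpeculativeDecoding` class.

```python
from typing import List, Sequence


def accept_draft(draft: Sequence[int], target_greedy: Sequence[int]) -> List[int]:
    """Greedy verification of a draft (hypothetical helper).

    target_greedy must hold len(draft) + 1 token ids: the large model's
    argmax at every draft position, plus its prediction for the position
    right after the draft. All of these come from one forward pass.
    """
    if len(target_greedy) != len(draft) + 1:
        raise ValueError("target_greedy must have len(draft) + 1 entries")

    accepted: List[int] = []
    for i, token in enumerate(draft):
        if token != target_greedy[i]:
            # First mismatch: keep the large model's token and stop.
            accepted.append(target_greedy[i])
            return accepted
        accepted.append(token)

    # Every draft token matched, so the extra prediction comes for free.
    accepted.append(target_greedy[len(draft)])
    return accepted
```

Under this rule the output is identical to plain greedy decoding; the draft only changes how many tokens each large-model forward pass yields.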
The framework implements several rule-based predictions:
- Indentation Prediction: After a colon (`:`) in Python code, predicts the appropriate indentation level
- Space Token Optimization: Pre-encodes common space patterns for faster lookup
- Context Analysis: Analyzes current code structure to determine appropriate next tokens
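The space-token pre-encoding idea can be sketched as a lookup table built once at start-up. This is a minimal sketch under stated assumptions: `build_space_table` and `toy_encode` are hypothetical names, and the toy encoder stands in for a real tokenizer so the example runs without model files. With a Hugging Face tokenizer one would likely pass something like `lambda s: tokenizer.encode(s, add_special_tokens=False)` instead.

```python
from typing import Callable, Dict, List


def build_space_table(
    encode: Callable[[str], List[int]],
    max_width: int = 16,
    step: int = 4,
) -> Dict[int, List[int]]:
    """Encode runs of spaces once so drafting becomes a dict lookup.

    Keys are indentation widths in characters; values are token id lists.
    """
    return {width: encode(" " * width) for width in range(step, max_width + 1, step)}


def toy_encode(text: str) -> List[int]:
    # Stand-in encoder for illustration only: one fake id per character.
    # A real tokenizer may merge several spaces into a single token.
    return [ord(ch) for ch in text]


SPACE_TABLE = build_space_table(toy_encode)
draft_ids = SPACE_TABLE[8]  # draft for an 8-space indent, no encoding at draft time
```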
# Clone the repository
git clone <repository-url>
cd rsd
# Install dependencies
pip install torch transformers human-eval matplotlib ipywidgets

from main import SpeculativeDecoding
# Initialize the decoder
decoder = SpeculativeDecoding(
    small_model_name="",  # Not used in rule-based mode
    large_model_name="/path/to/llama-3-8b-instruct",
    gamma=4,
    device="cuda",
    use_rule_based_only=True  # Enable rule-based mode
)
# Generate code
result = decoder.generate_text(
    prompt="def fibonacci(n):",
    max_length=512,
    temperature=0,
    use_speculative=True
)
print(f"Generated {result['total_tokens']} tokens in {result['elapsed_time']:.2f}s")
print(f"Speed: {result['tokens_per_second']:.2f} tokens/s")

python main.py --use_speculative

This runs the framework on HumanEval dataset examples and reports performance comparisons.
Use the included Jupyter notebook (find_token.ipynb) to analyze token patterns:
from find_token import colorize_tokens
# Visualize token patterns
colorize_tokens(tokenizer, generated_tokens, max_colors=20)

This provides interactive visualization of:
- Token boundaries and patterns
- Space and indentation tokens
- Code structure analysis
- Rule effectiveness evaluation
The framework tracks the following key metrics:
- Tokens per Second: Generation speed
- Latency: End-to-end generation time
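One way these metrics could be computed is sketched below. The dictionary keys mirror the `result` fields used in the usage example (`total_tokens`, `elapsed_time`, `tokens_per_second`); the helper names `measure_generation` and `acceptance_rate` are hypothetical, and the acceptance-rate definition (accepted draft tokens divided by proposed draft tokens) is an assumption rather than the framework's documented formula.

```python
import time
from typing import Callable, Dict, List


def measure_generation(generate: Callable[[], List[int]]) -> Dict[str, float]:
    """Time one generation call and derive throughput (hypothetical helper)."""
    start = time.perf_counter()
    tokens = generate()
    elapsed = time.perf_counter() - start  # end-to-end latency in seconds
    return {
        "total_tokens": len(tokens),
        "elapsed_time": elapsed,
        "tokens_per_second": len(tokens) / elapsed if elapsed > 0 else 0.0,
    }


def acceptance_rate(accepted: int, proposed: int) -> float:
    """Share of proposed draft tokens that the large model accepted."""
    return accepted / proposed if proposed else 0.0


# Usage with a dummy generator standing in for a real model call.
stats = measure_generation(lambda: list(range(100)))
```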
The framework predicts indentation in the following steps:
- Context Analysis: Analyzes the current code structure and indentation level
- Colon Detection: Identifies when a colon (`:`) indicates the need for indentation
- Indentation Calculation: Computes the appropriate indentation level based on context
- Token Generation: Generates the correct number of space tokens
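The steps above can be sketched on plain text as follows. This is a simplified illustration, not the framework's implementation: `predict_indentation` is a hypothetical name, and it assumes a 4-space indent, space-only indentation, and that the rule fires once the newline after the colon has been generated. The returned string would then be turned into space tokens for the draft.

```python
def predict_indentation(generated_text: str, indent_width: int = 4) -> str:
    """Return the whitespace expected on the next line, or "" if no rule fires."""
    # The rule only applies at the start of a fresh line.
    if not generated_text.endswith("\n"):
        return ""

    # Context analysis: look at the line that was just completed.
    last_line = generated_text[:-1].split("\n")[-1]

    # Colon detection: the line must end with a block-opening colon.
    if not last_line.rstrip().endswith(":"):
        return ""

    # Indentation calculation: one level deeper than the current line.
    current_indent = len(last_line) - len(last_line.lstrip(" "))
    return " " * (current_indent + indent_width)


draft = predict_indentation("def fibonacci(n):\n")  # four spaces
```

Lines that end in a trailing comment after the colon are not handled by this sketch; a fuller rule would need to strip comments first.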
This framework is particularly useful for:
- Code Generation Research: Analyzing token patterns in code generation
- Speculative Decoding Studies: Comparing rule-based vs. model-based approaches
- Performance Optimization: Identifying bottlenecks in code generation
- Tokenization Analysis: Understanding how different models tokenize code
We evaluated the framework on 10 HumanEval examples using the current indentation-only rule implementation:
- Standard Decoding: 37.6652 tokens/s
- Rule-based Speculation: 38.6796 tokens/s
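For reference, the relative speedup implied by these two throughput figures can be computed directly; it comes to roughly 1.027x, or about 2.7% faster than standard decoding.

```python
# Throughput figures reported above (tokens per second).
standard_tps = 37.6652
rule_based_tps = 38.6796

speedup = rule_based_tps / standard_tps
print(f"Speedup: {speedup:.3f}x ({(speedup - 1) * 100:.1f}% faster)")
# prints "Speedup: 1.027x (2.7% faster)"
```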
The framework provides real-time visualization of rule-based predictions during generation:
- Green Text: Marks an accepted indentation prediction. Because whitespace itself cannot be visibly highlighted, the green token is the first token following the successfully predicted indentation.
- Normal Text: Standard generation without rule-based prediction
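A terminal version of this highlighting scheme might look like the sketch below, which wraps selected tokens in ANSI color codes. The function `render_with_highlights` and its arguments are hypothetical; the framework's own display code may work differently.

```python
from typing import Iterable, Set

GREEN = "\033[32m"  # ANSI escape: green foreground
RESET = "\033[0m"   # ANSI escape: reset attributes


def render_with_highlights(pieces: Iterable[str], highlighted: Set[int]) -> str:
    """Join decoded token strings, coloring the positions in `highlighted`.

    `highlighted` holds the index of the first token after each accepted
    indentation prediction.
    """
    out = []
    for i, piece in enumerate(pieces):
        out.append(f"{GREEN}{piece}{RESET}" if i in highlighted else piece)
    return "".join(out)


# "return" is the first token after an accepted 4-space indentation draft.
pieces = ["def", " f", "():", "\n", "    ", "return", " 1"]
print(render_with_highlights(pieces, {5}))
```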
We plan to extend the framework with more sophisticated rules including bracket prediction, common code patterns, and multi-language support. The goal is to achieve 1.2-1.5x speedup with comprehensive rule implementations.
We welcome contributions! You can help by:
- Implementing new rules for code patterns
- Adding support for more programming languages
- Improving performance and evaluation tools
- Contributing to documentation and examples
Feel free to fork the repository and submit pull requests. Together we can build a comprehensive rule-based speculative decoding framework!
