Skip to content

aslonv/Number-Theoretic-Transform---CPU-acceleration-with-AVX2

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Number Theoretic Transform (NTT) - CPU Acceleration with AVX2

License: Apache-2.0 Rust AVX2

A high-performance implementation of the forward and backward Number Theoretic Transform (NTT) over Z_Q[X]/(X^N + 1), where N is a power of two and Q is an NTT-friendly prime satisfying Q ≡ 1 mod 2N.

This library provides both scalar and SIMD-optimized implementations, with the AVX2 version delivering significant performance improvements for suitable workloads.

Table of Contents

Features

  • 🚀 High Performance: AVX2-optimized implementation for 31-bit primes with SIMD acceleration
  • 🔢 Wide Range Support: Base implementation supports primes up to 61 bits using scalar operations
  • ⚙️ Configurable: Modular design with support for custom parameters (prime modulus Q, root of unity ψ, maximum transform size N)
  • 🧪 Well Tested: Comprehensive unit tests ensuring correctness across different scenarios
  • 📊 Benchmarked: Built-in performance benchmarks to measure and compare implementations
  • 🔧 Montgomery Arithmetic: Efficient modular multiplication using Montgomery reduction
  • ✅ Robust: Proper handling of edge cases and reduction during computations

Performance

The AVX2 implementation provides significant speedup over the base scalar version:

  • Up to 4x faster for 31-bit primes on AVX2-capable CPUs
  • Vectorized operations processing 8 elements simultaneously
  • Optimized memory access patterns for better cache utilization

Benchmark results on typical hardware show consistent performance gains across different transform sizes (N = 2^11 to 2^17).

Installation

Prerequisites

  • Rust 1.75+: Install from rustup.rs
  • AVX2 CPU: For optimal performance (Intel Haswell 2013+, AMD Excavator 2015+)
  • 64-bit architecture: Currently supports x86_64 only

Adding to Your Project

Add this to your Cargo.toml:

[dependencies]
app = { git = "https://github.com/aslonv/Number-Theoretic-Transform---CPU-acceleration-with-AVX2" }

Or for local development:

git clone https://github.com/aslonv/Number-Theoretic-Transform---CPU-acceleration-with-AVX2.git
cd Number-Theoretic-Transform---CPU-acceleration-with-AVX2
cargo build --release

Quick Start

Basic Usage

use app::dft::{ntt::Table, DFT};

fn main() {
    // Create NTT table with default 61-bit prime
    let ntt = Table::new();
    
    // Your data (must be power of 2 length)
    let mut data = vec![1, 2, 3, 4, 5, 6, 7, 8];
    
    // Forward transform
    ntt.forward_inplace(&mut data);
    println!("Forward NTT: {:?}", data);
    
    // Backward transform (inverse)
    ntt.backward_inplace(&mut data);
    println!("Recovered data: {:?}", data);
}

Using Custom Parameters

use app::dft::{ntt::Table, DFT};

fn main() {
    // Custom 31-bit prime for AVX2 optimization
    let q = 0xFFF00001u64;  // 31-bit NTT-friendly prime
    let psi = 0x7A329F10u64; // Primitive 2^20-th root of unity
    let max_log_n = 20;     // Support transforms up to 2^20
    
    let ntt = Table::with_params(q, psi, max_log_n);
    
    let mut data: Vec<u64> = (0..16).collect();
    
    ntt.forward_inplace(&mut data);
    ntt.backward_inplace(&mut data);
    
    // Data should be recovered (modulo q)
    for (i, &val) in data.iter().enumerate() {
        assert_eq!(val % q, i as u64 % q);
    }
}

API Documentation

Core Types

Table<u64>

The main NTT implementation supporting both scalar and AVX2 operations.

Constructors:

  • Table::new() - Creates table with default 61-bit prime
  • Table::with_params(q, psi, max_log_n) - Creates table with custom parameters

Methods:

  • forward_inplace(&self, data: &mut [u64]) - In-place forward NTT
  • backward_inplace(&self, data: &mut [u64]) - In-place inverse NTT
  • forward_inplace_lazy(&self, data: &mut [u64]) - Forward NTT without full reduction
  • backward_inplace_lazy(&self, data: &mut [u64]) - Inverse NTT without full reduction
  • q(&self) -> &u64 - Get the prime modulus

Requirements

  • Input size: Must be a power of 2 (e.g., 2, 4, 8, 16, ..., 2^20)
  • Element range: Values should be in range [0, q) for correct results
  • Prime validation: Q must be NTT-friendly (Q ≡ 1 mod 2N)

AVX2 Acceleration

The library automatically detects AVX2 support and uses optimized routines when:

  1. CPU supports AVX2 instructions
  2. Prime modulus is ≤ 31 bits
  3. Running on x86_64 architecture

Project Structure

src/
├── lib.rs              # Library root and exports
├── dft.rs              # DFT trait definition
└── dft/
    └── ntt.rs          # NTT implementation (scalar + AVX2)
benches/
└── ntt.rs              # Performance benchmarks
tests/                  # Integration tests (if any)

Key Components

  • Montgomery Arithmetic: Efficient modular reduction for AVX2 path
  • Bit Reversal: Optimized permutation for both scalar and SIMD
  • Twiddle Factor Precomputation: Cached roots of unity for performance
  • SIMD Intrinsics: Hand-optimized AVX2 assembly for critical loops

Benchmarking

Run comprehensive benchmarks comparing scalar vs AVX2 performance:

cargo bench

This will benchmark both implementations across multiple transform sizes and output detailed performance metrics.

Sample Results

NTT/Base/2048          time: [15.2 μs 15.3 μs 15.4 μs]
NTT/AVX2/2048          time: [4.1 μs 4.2 μs 4.3 μs]   # ~3.7x speedup

Contributing

We welcome contributions! Please see our contributing guidelines:

Development Setup

  1. Fork and clone the repository
  2. Install dependencies: cargo build
  3. Run tests: cargo test
  4. Check formatting: cargo fmt --check
  5. Run lints: cargo clippy

Contribution Areas

  • Algorithm optimizations: Improve existing implementations
  • Platform support: Add support for other SIMD instruction sets (AVX-512, ARM NEON)
  • Testing: Add more edge cases and validation tests
  • Documentation: Improve examples and API documentation
  • Benchmarking: Add more comprehensive performance analysis

Code Style

  • Follow standard Rust formatting (cargo fmt)
  • Add documentation for public APIs
  • Include tests for new functionality
  • Benchmark performance-critical changes

Troubleshooting

Common Issues

Q: Getting incorrect results after NTT round-trip

  • Ensure input size is a power of 2
  • Verify input values are in range [0, q)
  • Check that q is NTT-friendly for your transform size

Q: AVX2 optimizations not activating

  • Verify CPU supports AVX2: check /proc/cpuinfo on Linux
  • Ensure using 31-bit or smaller prime modulus
  • Confirm running on x86_64 architecture

Q: Build errors on older Rust versions

  • Update to Rust 1.75+ using rustup update

Q: Performance lower than expected

  • Enable release mode: cargo run --release
  • Verify AVX2 is being used with CPU-appropriate primes
  • Check for thermal throttling on benchmarking system

Getting Help

  • Create an issue for bugs or feature requests
  • Check existing issues for known problems
  • Review benchmarks for performance expectations

License

Licensed under the Apache License, Version 2.0. See LICENSE for details.

Copyright 2025 Begali Aslonov

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Author: Begali Aslonov
Contact: [email protected]
GitHub: @aslonv
X: @aslonv

About

High-performance NTT. Supports 61-bit primes, with AVX2-optimized code for 31-bit primes

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •  

Languages