# Number Theoretic Transform with AVX2 CPU Acceleration

A high-performance implementation of the forward and backward Number Theoretic Transform (NTT) over Z_Q[X]/(X^N + 1), where N is a power of two and Q is an NTT-friendly prime satisfying Q ≡ 1 (mod 2N).
This library provides both scalar and SIMD-optimized implementations, with the AVX2 version delivering significant performance improvements for suitable workloads.
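As background, the negacyclic convolution this transform accelerates can be written down naively in a few lines. The reference sketch below is not part of the library's API; it simply shows the product in Z_q[X]/(X^n + 1) that an NTT-based multiply computes in O(n log n) instead of O(n^2):

```rust
/// Naive negacyclic convolution in Z_q[X]/(X^n + 1).
/// Reference implementation for clarity only, not the library's code path.
fn negacyclic_mul(a: &[u64], b: &[u64], q: u64) -> Vec<u64> {
    let n = a.len();
    let mut out = vec![0u64; n];
    for i in 0..n {
        for j in 0..n {
            let k = (i + j) % n;
            let prod = (a[i] as u128 * b[j] as u128 % q as u128) as u64;
            if i + j < n {
                out[k] = (out[k] + prod) % q;
            } else {
                // The wrap-around term picks up a minus sign because X^n ≡ -1.
                out[k] = (out[k] + q - prod) % q;
            }
        }
    }
    out
}

fn main() {
    // (1 + 2X)(3 + 4X) = 3 + 10X + 8X^2 ≡ (3 - 8) + 10X mod (X^2 + 1, q = 17)
    assert_eq!(negacyclic_mul(&[1, 2], &[3, 4], 17), vec![12, 10]);
    println!("ok");
}
```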
## Table of Contents

- Features
- Performance
- Installation
- Quick Start
- API Documentation
- Project Structure
- Benchmarking
- Contributing
- Troubleshooting
- License
## Features

- 🚀 High Performance: AVX2-optimized implementation for 31-bit primes with SIMD acceleration
- 🔢 Wide Range Support: Base implementation supports primes up to 61 bits using scalar operations
- ⚙️ Configurable: Modular design with support for custom parameters (prime modulus Q, root of unity ψ, maximum transform size N)
- 🧪 Well Tested: Comprehensive unit tests ensuring correctness across different scenarios
- 📊 Benchmarked: Built-in performance benchmarks to measure and compare implementations
- 🔧 Montgomery Arithmetic: Efficient modular multiplication using Montgomery reduction
- ✅ Robust: Proper handling of edge cases and reduction during computations
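As a rough illustration of the Montgomery arithmetic mentioned above, here is a minimal REDC sketch with R = 2^32 for odd moduli below 2^31. The function names and structure are illustrative assumptions; the library's internals may differ:

```rust
/// Precompute -q^{-1} mod 2^32 by Newton iteration (q must be odd).
/// Illustrative helper, not the library's actual API.
fn neg_inv_mod_r(q: u32) -> u32 {
    let mut inv: u32 = 1;
    for _ in 0..5 {
        // Each step doubles the number of correct low bits: 1 -> 2 -> 4 -> 8 -> 16 -> 32.
        inv = inv.wrapping_mul(2u32.wrapping_sub(q.wrapping_mul(inv)));
    }
    inv.wrapping_neg()
}

/// REDC: for t < q * 2^32, return t * 2^{-32} mod q.
fn montgomery_reduce(t: u64, q: u32, q_inv_neg: u32) -> u32 {
    let m = (t as u32).wrapping_mul(q_inv_neg); // m = t * (-q^{-1}) mod 2^32
    let r = ((t + (m as u64) * (q as u64)) >> 32) as u32; // (t + m*q) is divisible by 2^32
    if r >= q { r - q } else { r }
}

fn main() {
    let q: u32 = 17; // tiny odd modulus, for illustration only
    let q_inv_neg = neg_inv_mod_r(q);
    // Montgomery form of a is a * 2^32 mod q.
    let a = 3u64;
    let a_mont = ((a << 32) % q as u64) as u32;
    // montgomery_reduce(a_mont * b) = a * b mod q.
    let b = 5u64;
    let prod = montgomery_reduce(a_mont as u64 * b, q, q_inv_neg);
    assert_eq!(prod as u64, (a * b) % q as u64); // 15
    println!("{} * {} mod {} = {}", a, b, q, prod);
}
```

The payoff in the AVX2 path is that the reduction needs only multiplies, adds, and shifts, all of which vectorize well.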
## Performance

The AVX2 implementation provides a significant speedup over the base scalar version:
- Up to 4x faster for 31-bit primes on AVX2-capable CPUs
- Vectorized operations processing 8 elements simultaneously
- Optimized memory access patterns for better cache utilization
Benchmark results on typical hardware show consistent performance gains across different transform sizes (N = 2^11 to 2^17).
## Installation

### Prerequisites

- Rust 1.75+: install from rustup.rs
- AVX2 CPU: For optimal performance (Intel Haswell 2013+, AMD Excavator 2015+)
- 64-bit architecture: Currently supports x86_64 only
Add this to your `Cargo.toml`:

```toml
[dependencies]
app = { git = "https://github.com/aslonv/Number-Theoretic-Transform---CPU-acceleration-with-AVX2" }
```
Or for local development:
```sh
git clone https://github.com/aslonv/Number-Theoretic-Transform---CPU-acceleration-with-AVX2.git
cd Number-Theoretic-Transform---CPU-acceleration-with-AVX2
cargo build --release
```
## Quick Start

### Basic Usage

```rust
use app::dft::{ntt::Table, DFT};

fn main() {
    // Create an NTT table with the default 61-bit prime
    let ntt = Table::new();

    // Your data (length must be a power of 2)
    let mut data = vec![1, 2, 3, 4, 5, 6, 7, 8];

    // Forward transform
    ntt.forward_inplace(&mut data);
    println!("Forward NTT: {:?}", data);

    // Backward transform (inverse)
    ntt.backward_inplace(&mut data);
    println!("Recovered data: {:?}", data);
}
```
### Custom Parameters

```rust
use app::dft::{ntt::Table, DFT};

fn main() {
    // Custom 31-bit prime for AVX2 optimization
    let q = 0xFFF00001u64; // 31-bit NTT-friendly prime
    let psi = 0x7A329F10u64; // Primitive 2^20-th root of unity
    let max_log_n = 20; // Support transforms up to 2^20

    let ntt = Table::with_params(q, psi, max_log_n);

    let mut data: Vec<u64> = (0..16).collect();
    ntt.forward_inplace(&mut data);
    ntt.backward_inplace(&mut data);

    // Data should be recovered (modulo q)
    for (i, &val) in data.iter().enumerate() {
        assert_eq!(val % q, i as u64 % q);
    }
}
```
## API Documentation

### `Table`

The main NTT implementation, supporting both scalar and AVX2 operations.

Constructors:

- `Table::new()` - Creates a table with the default 61-bit prime
- `Table::with_params(q, psi, max_log_n)` - Creates a table with custom parameters

Methods:

- `forward_inplace(&self, data: &mut [u64])` - In-place forward NTT
- `backward_inplace(&self, data: &mut [u64])` - In-place inverse NTT
- `forward_inplace_lazy(&self, data: &mut [u64])` - Forward NTT without full reduction
- `backward_inplace_lazy(&self, data: &mut [u64])` - Inverse NTT without full reduction
- `q(&self) -> &u64` - Returns the prime modulus
### Input Requirements

- Input size: Must be a power of 2 (e.g., 2, 4, 8, 16, ..., 2^20)
- Element range: Values should be in range [0, q) for correct results
- Prime validation: Q must be NTT-friendly (Q ≡ 1 mod 2N)
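The prime-validation rule above reduces to a one-line check. `is_ntt_friendly` is a hypothetical helper for illustration, not part of the library's API:

```rust
/// The negacyclic NTT over Z_q[X]/(X^n + 1) needs a primitive 2n-th root of
/// unity mod q, which exists for prime q exactly when q ≡ 1 (mod 2n).
/// Hypothetical helper; the library may validate parameters differently.
fn is_ntt_friendly(q: u64, n: u64) -> bool {
    n.is_power_of_two() && q % (2 * n) == 1
}

fn main() {
    // 998244353 = 119 * 2^23 + 1, a well-known NTT prime
    assert!(is_ntt_friendly(998_244_353, 1 << 22));
    // 13 ≢ 1 (mod 16), so length-8 negacyclic transforms are unsupported
    assert!(!is_ntt_friendly(13, 8));
    println!("checks passed");
}
```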
### AVX2 Acceleration

The library automatically detects AVX2 support and uses the optimized routines when:
- CPU supports AVX2 instructions
- Prime modulus is ≤ 31 bits
- Running on x86_64 architecture
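These conditions amount to a simple runtime check. The sketch below (a hypothetical `would_use_avx2`, not the crate's actual dispatch code) shows one way to express it:

```rust
/// Illustrative dispatch rule: AVX2 path only for sub-2^31 moduli on
/// AVX2-capable x86_64 CPUs. Name and structure are assumptions, not the
/// library's internals.
fn would_use_avx2(q: u64) -> bool {
    let prime_fits = q < (1u64 << 31); // modulus must fit in 31 bits

    #[cfg(target_arch = "x86_64")]
    let cpu_ok = is_x86_feature_detected!("avx2"); // runtime CPU feature check

    #[cfg(not(target_arch = "x86_64"))]
    let cpu_ok = false; // no AVX2 path on other architectures

    prime_fits && cpu_ok
}

fn main() {
    // A 61-bit modulus always falls back to the scalar path, on any CPU.
    assert!(!would_use_avx2(1u64 << 60));
    println!(
        "AVX2 path for a 31-bit modulus on this machine: {}",
        would_use_avx2((1 << 31) - 1)
    );
}
```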
## Project Structure

```text
src/
├── lib.rs        # Library root and exports
├── dft.rs        # DFT trait definition
└── dft/
    └── ntt.rs    # NTT implementation (scalar + AVX2)
benches/
└── ntt.rs        # Performance benchmarks
tests/            # Integration tests (if any)
```
### Key Components

- Montgomery Arithmetic: Efficient modular reduction for the AVX2 path
- Bit Reversal: Optimized permutation for both scalar and SIMD
- Twiddle Factor Precomputation: Cached roots of unity for performance
- SIMD Intrinsics: Hand-tuned AVX2 intrinsics for the critical loops
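For reference, a generic (untuned) bit-reversal permutation of the kind mentioned above looks like this; the library's scalar and SIMD versions are presumably more optimized:

```rust
/// In-place bit-reversal permutation for a power-of-two-length slice — the
/// reordering that a decimation-in-time butterfly network relies on.
/// Generic sketch, not the library's tuned implementation.
fn bit_reverse_permute<T>(data: &mut [T]) {
    let n = data.len();
    assert!(n.is_power_of_two());
    if n <= 1 {
        return;
    }
    // Reverse the full machine word, then keep only the low log2(n) bits.
    let shift = usize::BITS - n.trailing_zeros();
    for i in 0..n {
        let j = i.reverse_bits() >> shift;
        if i < j {
            data.swap(i, j); // swap each pair exactly once
        }
    }
}

fn main() {
    let mut v: Vec<u32> = (0..8).collect();
    bit_reverse_permute(&mut v);
    assert_eq!(v, vec![0, 4, 2, 6, 1, 5, 3, 7]);
    println!("{:?}", v);
}
```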
## Benchmarking

Run comprehensive benchmarks comparing scalar and AVX2 performance:

```sh
cargo bench
```

This benchmarks both implementations across multiple transform sizes and reports detailed performance metrics. Example output:

```text
NTT/Base/2048   time: [15.2 μs 15.3 μs 15.4 μs]
NTT/AVX2/2048   time: [4.1 μs 4.2 μs 4.3 μs]   # ~3.7x speedup
```
## Contributing

We welcome contributions! To get started:

- Fork and clone the repository
- Build the project (this also fetches dependencies): `cargo build`
- Run the tests: `cargo test`
- Check formatting: `cargo fmt --check`
- Run the lints: `cargo clippy`
Areas where contributions are especially welcome:

- Algorithm optimizations: Improve existing implementations
- Platform support: Add support for other SIMD instruction sets (AVX-512, ARM NEON)
- Testing: Add more edge cases and validation tests
- Documentation: Improve examples and API documentation
- Benchmarking: Add more comprehensive performance analysis
Code guidelines:

- Follow standard Rust formatting (`cargo fmt`)
- Add documentation for public APIs
- Include tests for new functionality
- Benchmark performance-critical changes
## Troubleshooting

**Q: Getting incorrect results after an NTT round-trip**

- Ensure the input length is a power of 2
- Verify input values are in the range [0, q)
- Check that q is NTT-friendly for your transform size

**Q: AVX2 optimizations not activating**

- Verify the CPU supports AVX2 (check `/proc/cpuinfo` on Linux)
- Ensure the prime modulus is 31 bits or smaller
- Confirm you are running on x86_64

**Q: Build errors on older Rust versions**

- Update to Rust 1.75+ with `rustup update`

**Q: Performance lower than expected**

- Build and run in release mode: `cargo run --release`
- Verify the AVX2 path is active (31-bit or smaller prime modulus)
- Check for thermal throttling on the benchmarking machine
Still need help?

- Create an issue for bugs or feature requests
- Check existing issues for known problems
- Review benchmarks for performance expectations
## License

Licensed under the Apache License, Version 2.0. See LICENSE for details.

Copyright 2025 Begali Aslonov

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

## Contact

- Author: Begali Aslonov
- Contact: [email protected]
- GitHub: @aslonv
- X: @aslonv