Number Theoretic Transform (NTT) - CPU Acceleration with AVX2

A high-performance implementation of the forward and backward Number Theoretic Transform (NTT) over Z_Q[X]/(X^N + 1), where N is a power of two and Q is an NTT-friendly prime satisfying Q ≡ 1 mod 2N.

This library provides both scalar and SIMD-optimized implementations, with the AVX2 version delivering significant performance improvements for suitable workloads.

Features

🚀 High Performance: AVX2-optimized implementation for 31-bit primes with SIMD acceleration
🔢 Wide Range Support: Base implementation supports primes up to 61 bits using scalar operations
⚙️ Configurable: Modular design with support for custom parameters (prime modulus Q, root of unity ψ, maximum transform size N)
🧪 Well Tested: Comprehensive unit tests ensuring correctness across different scenarios
📊 Benchmarked: Built-in performance benchmarks to measure and compare implementations
🔧 Montgomery Arithmetic: Efficient modular multiplication using Montgomery reduction
✅ Robust: Proper handling of edge cases and reduction during computations

Performance

The AVX2 implementation provides significant speedup over the base scalar version:

Up to 4x faster for 31-bit primes on AVX2-capable CPUs
Vectorized operations processing 8 elements simultaneously
Optimized memory access patterns for better cache utilization

Benchmark results on typical hardware show consistent performance gains across different transform sizes (N = 2^11 to 2^17).

Installation

Prerequisites

Rust 1.75+: Install from rustup.rs
AVX2 CPU: For optimal performance (Intel Haswell 2013+, AMD Excavator 2015+)
64-bit architecture: Currently supports x86_64 only

Adding to Your Project

Add this to your Cargo.toml:

[dependencies]
app = { git = "https://github.com/aslonv/Number-Theoretic-Transform---CPU-acceleration-with-AVX2" }

Or for local development:

git clone https://github.com/aslonv/Number-Theoretic-Transform---CPU-acceleration-with-AVX2.git
cd Number-Theoretic-Transform---CPU-acceleration-with-AVX2
cargo build --release

Quick Start

Basic Usage

use app::dft::{ntt::Table, DFT};

fn main() {
    // Create NTT table with default 61-bit prime
    let ntt = Table::new();
    
    // Your data (must be power of 2 length)
    let mut data = vec![1, 2, 3, 4, 5, 6, 7, 8];
    
    // Forward transform
    ntt.forward_inplace(&mut data);
    println!("Forward NTT: {:?}", data);
    
    // Backward transform (inverse)
    ntt.backward_inplace(&mut data);
    println!("Recovered data: {:?}", data);
}

Using Custom Parameters

use app::dft::{ntt::Table, DFT};

fn main() {
    // Custom 31-bit prime for AVX2 optimization
    let q = 0xFFF00001u64;  // 31-bit NTT-friendly prime
    let psi = 0x7A329F10u64; // Primitive 2^20-th root of unity
    let max_log_n = 20;     // Support transforms up to 2^20
    
    let ntt = Table::with_params(q, psi, max_log_n);
    
    let mut data: Vec<u64> = (0..16).collect();
    
    ntt.forward_inplace(&mut data);
    ntt.backward_inplace(&mut data);
    
    // Data should be recovered (modulo q)
    for (i, &val) in data.iter().enumerate() {
        assert_eq!(val % q, i as u64 % q);
    }
}

API Documentation

Core Types

`Table<u64>`

The main NTT implementation supporting both scalar and AVX2 operations.

Constructors:

Table::new() - Creates table with default 61-bit prime
Table::with_params(q, psi, max_log_n) - Creates table with custom parameters

Methods:

forward_inplace(&self, data: &mut [u64]) - In-place forward NTT
backward_inplace(&self, data: &mut [u64]) - In-place inverse NTT
forward_inplace_lazy(&self, data: &mut [u64]) - Forward NTT without full reduction
backward_inplace_lazy(&self, data: &mut [u64]) - Inverse NTT without full reduction
q(&self) -> &u64 - Get the prime modulus

Requirements

Input size: Must be a power of 2 (e.g., 2, 4, 8, 16, ..., 2^20)
Element range: Values should be in range [0, q) for correct results
Prime validation: Q must be NTT-friendly (Q ≡ 1 mod 2N)

AVX2 Acceleration

The library automatically detects AVX2 support and uses optimized routines when:

CPU supports AVX2 instructions
Prime modulus is ≤ 31 bits
Running on x86_64 architecture

Project Structure

src/
├── lib.rs              # Library root and exports
├── dft.rs              # DFT trait definition
└── dft/
    └── ntt.rs          # NTT implementation (scalar + AVX2)
benches/
└── ntt.rs              # Performance benchmarks
tests/                  # Integration tests (if any)

Key Components

Montgomery Arithmetic: Efficient modular reduction for AVX2 path
Bit Reversal: Optimized permutation for both scalar and SIMD
Twiddle Factor Precomputation: Cached roots of unity for performance
SIMD Intrinsics: Hand-optimized AVX2 assembly for critical loops

Benchmarking

Run comprehensive benchmarks comparing scalar vs AVX2 performance:

cargo bench

This will benchmark both implementations across multiple transform sizes and output detailed performance metrics.

Sample Results

NTT/Base/2048          time: [15.2 μs 15.3 μs 15.4 μs]
NTT/AVX2/2048          time: [4.1 μs 4.2 μs 4.3 μs]   # ~3.7x speedup

Contributing

We welcome contributions! Please see our contributing guidelines:

Development Setup

Fork and clone the repository
Install dependencies: cargo build
Run tests: cargo test
Check formatting: cargo fmt --check
Run lints: cargo clippy

Contribution Areas

Algorithm optimizations: Improve existing implementations
Platform support: Add support for other SIMD instruction sets (AVX-512, ARM NEON)
Testing: Add more edge cases and validation tests
Documentation: Improve examples and API documentation
Benchmarking: Add more comprehensive performance analysis

Code Style

Follow standard Rust formatting (cargo fmt)
Add documentation for public APIs
Include tests for new functionality
Benchmark performance-critical changes

Troubleshooting

Common Issues

Q: Getting incorrect results after NTT round-trip

Ensure input size is a power of 2
Verify input values are in range [0, q)
Check that q is NTT-friendly for your transform size

Q: AVX2 optimizations not activating

Verify CPU supports AVX2: check /proc/cpuinfo on Linux
Ensure using 31-bit or smaller prime modulus
Confirm running on x86_64 architecture

Q: Build errors on older Rust versions

Update to Rust 1.75+ using rustup update

Q: Performance lower than expected

Enable release mode: cargo run --release
Verify AVX2 is being used with CPU-appropriate primes
Check for thermal throttling on benchmarking system

Getting Help

Create an issue for bugs or feature requests
Check existing issues for known problems
Review benchmarks for performance expectations

License

Licensed under the Apache License, Version 2.0. See LICENSE for details.

Copyright 2025 Begali Aslonov

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Author: Begali Aslonov
Contact: [email protected]
GitHub: @aslonv
X: @aslonv

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
benches		benches
src		src
tests		tests
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Number Theoretic Transform (NTT) - CPU Acceleration with AVX2

Table of Contents

Features

Performance

Installation

Prerequisites

Adding to Your Project

Quick Start

Basic Usage

Using Custom Parameters

API Documentation

Core Types

`Table<u64>`

Requirements

AVX2 Acceleration

Project Structure

Key Components

Benchmarking

Sample Results

Contributing

Development Setup

Contribution Areas

Code Style

Troubleshooting

Common Issues

Getting Help

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 2

Uh oh!

Languages

License

aslonv/Number-Theoretic-Transform---CPU-acceleration-with-AVX2

Folders and files

Latest commit

History

Repository files navigation

Number Theoretic Transform (NTT) - CPU Acceleration with AVX2

Table of Contents

Features

Performance

Installation

Prerequisites

Adding to Your Project

Quick Start

Basic Usage

Using Custom Parameters

API Documentation

Core Types

Table<u64>

Requirements

AVX2 Acceleration

Project Structure

Key Components

Benchmarking

Sample Results

Contributing

Development Setup

Contribution Areas

Code Style

Troubleshooting

Common Issues

Getting Help

License

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 2

Uh oh!

Languages

`Table<u64>`

Packages