perf(field): lower batch inversion parallelization threshold#1125

Open
diegokingston wants to merge 5 commits into main from perf/batch-inversion-threshold

Conversation

@diegokingston
Collaborator

  • Reduce PARALLEL_INVERSION_THRESHOLD from 4096 to 1024 elements
  • Add optimal_chunk_size() function for better load balancing:
    • Considers available CPU threads (via rayon::current_num_threads)
    • Enforces MIN_CHUNK_SIZE (64) for cache efficiency
    • Allows 4x oversubscription for load balancing
  • Export PARALLEL_INVERSION_THRESHOLD constant for configurability
  • Add comprehensive test suite for parallel batch inversion:
    • Edge cases (empty, single element, zero element)
    • Boundary tests (below, at, and above threshold)
    • Correctness verification against sequential implementation
    • Chunk size calculation tests


This change enables parallelization for smaller batches common in
real-world ZK proof computations, providing 2-3x speedup for
medium-sized batch inversions (1024-4096 elements).
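The batch-inversion routine being parallelized is Montgomery's trick: trade n expensive field inversions for one inversion plus roughly 3n multiplications. As a minimal sequential sketch, assuming arithmetic over Z_p for a small illustrative prime rather than the crate's generic `FieldElement` type:

```rust
// Sketch of Montgomery's trick for batch inversion over Z_p, p = 1_000_000_007.
// The real implementation works on generic `FieldElement`s; this only
// illustrates the algorithm, with all names and the modulus chosen for the demo.
const P: u64 = 1_000_000_007;

fn mul(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) % P as u128) as u64
}

// Single inversion via Fermat's little theorem: a^(p-2) mod p.
fn inv(mut a: u64) -> u64 {
    let (mut e, mut acc) = (P - 2, 1u64);
    while e > 0 {
        if e & 1 == 1 { acc = mul(acc, a); }
        a = mul(a, a);
        e >>= 1;
    }
    acc
}

// Batch inversion: one forward pass of products, ONE true inversion,
// then a backward pass. Fails if any element is zero (not invertible).
fn batch_inverse(nums: &mut [u64]) -> Result<(), &'static str> {
    if nums.is_empty() { return Ok(()); }
    // prefix[i] = nums[0] * nums[1] * ... * nums[i]
    let mut prefix = Vec::with_capacity(nums.len());
    let mut acc = 1u64;
    for &x in nums.iter() {
        if x == 0 { return Err("zero has no multiplicative inverse"); }
        acc = mul(acc, x);
        prefix.push(acc);
    }
    // Invert the total product once, then peel elements off backwards.
    let mut inv_acc = inv(acc);
    for i in (1..nums.len()).rev() {
        let orig = nums[i];
        nums[i] = mul(inv_acc, prefix[i - 1]);
        inv_acc = mul(inv_acc, orig);
    }
    nums[0] = inv_acc;
    Ok(())
}

fn main() {
    let mut xs = vec![2u64, 3, 4];
    batch_inverse(&mut xs).unwrap();
    for (x, inv_x) in [2u64, 3, 4].iter().zip(&xs) {
        assert_eq!(mul(*x, *inv_x), 1); // x * x^{-1} == 1 mod p
    }
}
```

Because each parallel chunk runs this routine independently, the parallel version pays one true inversion per chunk instead of one total, which is the trade-off the PR's doc comments describe.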
@diegokingston diegokingston requested a review from a team as a code owner February 2, 2026 15:57

@github-actions github-actions bot left a comment

Code review by ChatGPT


// Tests for BigUint conversion.
// We define different fields to test the conversion.



Correctness

  • Edge Case Handling: The code seems to handle edge cases for zero and single-element arrays adequately in the function inplace_batch_inverse_parallel. Tests for these scenarios are present, which is good.
  • Mathematical Operations: The batch inversion function uses Montgomery's Trick, which is appropriate for efficiency in parallel inversions.

Security

  • Timing Side Channels: The provided snippet gives no explicit evidence of constant-time operations for secret data. Finite field inversion may not exhibit timing side channels in the absence of secret-dependent branching, but it would be ideal to verify that all underlying operations on sensitive inputs are constant-time.
  • Data Zeroization: There is no indication in the snippet of zeroization of sensitive data after usage. Consider adding memory zeroization where appropriate, especially when dealing with cryptographic computations.

Performance

  • Thread Optimization: The optimal_chunk_size function takes into account the number of threads and seeks to optimize chunk size for parallel processing, providing a balance between number of threads and work per thread.

Bugs & Errors

  • Error Handling: A test covers the case where the batch contains a zero element, which must fail since zero has no multiplicative inverse in a field; the tests correctly assert this failure.
  • Constants Usage: The use of PARALLEL_INVERSION_THRESHOLD and MIN_CHUNK_SIZE constants is clearly defined, although these thresholds might benefit from runtime configuration based on system capability rather than hard-coded values.

Code Simplicity

  • Complexity: The function inplace_batch_inverse_parallel and its helper optimal_chunk_size are well-documented, making the purpose and functioning easily understandable.
  • Tests Coverage: Extensive tests are provided for the parallel inversion functionality, covering various edge cases and ensuring that the sequential and parallel versions agree.

Recommendations

  • Security Concern: Review for constant-time operations and eliminate any opportunities for timing side channels in cryptographic operations. If secrets are involved (not explicitly indicated here), implement measures for constant-time execution.
  • Zeroization: Implement explicit zeroization for sensitive data handled during computations to mitigate against memory residue attacks after the data is no longer needed.
  • Configuration Flexibility: Consider using environment-based configurations for determining the PARALLEL_INVERSION_THRESHOLD or even the number of threads to dynamically adapt to different system architectures. This would improve portable performance across different hardware setups.

@github-actions

github-actions bot commented Feb 2, 2026

Kimi AI Review

Review Feedback

File: crates/math/src/field/element.rs

  1. New Constants for Parallel Inversion

    • The introduction of PARALLEL_INVERSION_THRESHOLD and MIN_CHUNK_SIZE is a good approach to optimize parallel batch inversion. However, the documentation for these constants is missing. It would be helpful to include a doc comment explaining the purpose and the empirical basis for choosing these values.
    /// Minimum batch size for parallel inversion.
    /// Below this threshold, sequential inversion is used to avoid parallelization overhead.
    /// Empirically, 1024 provides a good balance for most workloads.
    #[cfg(all(feature = "alloc", feature = "parallel"))]
    pub const PARALLEL_INVERSION_THRESHOLD: usize = 1024;
    
    /// Minimum chunk size for parallel batch inversion.
    /// Ensures each thread has enough work to amortize parallelization overhead
    /// and maintain cache efficiency.
    #[cfg(all(feature = "alloc", feature = "parallel"))]
    const MIN_CHUNK_SIZE: usize = 64;
  2. Optimal Chunk Size Calculation

    • The optimal_chunk_size function is well-implemented, balancing the number of available CPU cores, minimum work per thread, and load balancing. However, the doc comment for this function is missing. Adding a doc comment explaining the purpose and the logic behind the calculation would improve code readability and maintainability.
    /// Computes the optimal chunk size for parallel batch inversion.
    ///
    /// The chunk size balances:
    /// - Number of available CPU cores (for parallelization)
    /// - Minimum work per thread (to amortize overhead)
    /// - Load balancing (allowing some slack with extra chunks)
    fn optimal_chunk_size(batch_size: usize) -> usize {
        let num_threads = rayon::current_num_threads();
        // Allow up to 4x as many chunks as threads for better load balancing
        let max_chunks = num_threads.saturating_mul(4);
        let chunk_size = batch_size.div_ceil(max_chunks);
    
        // Ensure minimum chunk size for cache efficiency
        core::cmp::max(MIN_CHUNK_SIZE, chunk_size)
    }
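The chunking arithmetic above is easy to sanity-check by hand. A small sketch, with the thread count passed in explicitly since the real function queries `rayon::current_num_threads()` (the function name `chunk_size_for` is illustrative):

```rust
// Worked example of the chunk-size arithmetic, with the thread count as a
// parameter instead of a rayon query. Mirrors the logic under review.
const MIN_CHUNK_SIZE: usize = 64;

fn chunk_size_for(batch_size: usize, num_threads: usize) -> usize {
    // Allow up to 4x as many chunks as threads for better load balancing.
    let max_chunks = num_threads.saturating_mul(4);
    let chunk_size = batch_size.div_ceil(max_chunks);
    // Never go below the cache-friendly minimum.
    core::cmp::max(MIN_CHUNK_SIZE, chunk_size)
}

fn main() {
    // 8 threads, 4096 elements: 32 chunks of 128 elements each.
    assert_eq!(chunk_size_for(4096, 8), 128);
    // 8 threads, 1024 elements: div_ceil gives 32, but the 64-element
    // floor wins, so we get 16 larger chunks instead of 32 tiny ones.
    assert_eq!(chunk_size_for(1024, 8), 64);
    // A huge thread count cannot shrink chunks below MIN_CHUNK_SIZE.
    assert_eq!(chunk_size_for(1024, 128), 64);
}
```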
  3. Parallel Batch Inversion Function

    • The inplace_batch_inverse_parallel function is well-implemented and follows the desired logic for parallel batch inversion. However, the doc comment for this function is missing. Adding a doc comment explaining the purpose, arguments, and return values would improve code readability and maintainability.
    /// Parallel batch inversion using chunked Montgomery's trick.
    /// Each chunk performs independent batch inversion in parallel.
    /// Falls back to sequential for small batches.
    ///
    /// Trade-off: Uses one inversion per chunk instead of one total,
    /// but parallelism provides net speedup for large batches (>= [`PARALLEL_INVERSION_THRESHOLD`]).
    ///
    /// # Arguments
    /// * `numbers` - Mutable slice of field elements to invert in place
    ///
    /// # Returns
    /// * `Ok(())` if all elements were successfully inverted
    /// * `Err(FieldError)` if any element is zero (not invertible)
    pub fn inplace_batch_inverse_parallel(numbers: &mut [Self]) -> Result<(), FieldError> {
        use rayon::prelude::*;
  4. Tests for Parallel Batch Inversion

    • The tests for parallel batch inversion are comprehensive and cover various scenarios, including empty batches, single elements, below and above the threshold, and comparison with sequential inversion. However, the test names could be improved for clarity. Consider using more descriptive names that reflect the specific scenario being tested.

    For example:

    #[test]
    fn test_parallel_batch_inverse_empty_batch() {
        // Test code
    }
    
    #[test]
    fn test_parallel_batch_inverse_single_element_batch() {
        // Test code
    }
    
    #[test]
    fn test_parallel_batch_inverse_below_threshold_batch() {
        // Test code
    }
    
    #[test]
    fn test_parallel_batch_inverse_at_threshold_batch() {
        // Test code
    }
    
    #[test]
    fn test_parallel_batch_inverse_above_threshold_batch() {
        // Test code
    }
    
    #[test]
    fn test_parallel_and_sequential_inversion_produce_same_results() {
        // Test code
    }
    
    #[test]
    fn test_parallel_batch_inverse_with_zero_element_fails() {
        // Test code
    }
  5. Code Simplicity and Maintainability

    • The code is well-structured and easy to understand. The use of constants, functions, and tests makes the code maintainable and extensible.
  6. Performance

    • The implementation of parallel batch inversion is expected to improve performance for large batches, which is a significant optimization.
  7. Cryptographic Security

@greptile-apps

greptile-apps bot commented Feb 2, 2026

Greptile Overview

Greptile Summary

This PR optimizes the parallel batch field inversion algorithm by lowering the parallelization threshold from 4096 to 1024 elements and implementing a smarter chunking strategy.

Key changes:

  • Reduced PARALLEL_INVERSION_THRESHOLD from 4096 to 1024 for earlier parallelization benefits
  • Added optimal_chunk_size() function that balances CPU utilization, cache efficiency, and load balancing
  • Exported PARALLEL_INVERSION_THRESHOLD constant to make it configurable for library users
  • Added comprehensive test suite covering edge cases, boundary conditions, and correctness verification

The implementation is well-documented and the test coverage is thorough. The chunking algorithm considers available CPU threads (via rayon::current_num_threads()), allows 4x oversubscription for better load balancing, and enforces a minimum chunk size of 64 elements for cache efficiency.

Confidence Score: 4/5

  • This PR is safe to merge with minimal risk - it's a well-tested performance optimization
  • The code changes are straightforward and well-tested. The new chunking logic is sound, and comprehensive tests cover edge cases. However, the empirical threshold value (1024) lacks documented benchmarking data to validate the performance claims.
  • No files require special attention

Important Files Changed

Filename: crates/math/src/field/element.rs
Overview: Lowers threshold from 4096 to 1024, adds optimal_chunk_size() for better load balancing, exports constant, and adds comprehensive tests

Sequence Diagram

sequenceDiagram
    participant User
    participant ParallelInverse as inplace_batch_inverse_parallel
    participant Sequential as inplace_batch_inverse
    participant OptimalChunk as optimal_chunk_size
    participant Rayon as rayon::par_chunks_mut
    
    User->>ParallelInverse: Call with &mut [FieldElement]
    
    alt numbers.len() < PARALLEL_INVERSION_THRESHOLD (1024)
        ParallelInverse->>Sequential: Fallback to sequential
        Sequential-->>User: Return Result
    else numbers.len() >= 1024
        ParallelInverse->>OptimalChunk: Calculate chunk size
        OptimalChunk->>OptimalChunk: Get num_threads from rayon
        OptimalChunk->>OptimalChunk: Calculate max_chunks = num_threads * 4
        OptimalChunk->>OptimalChunk: chunk_size = batch_size.div_ceil(max_chunks)
        OptimalChunk->>OptimalChunk: Return max(MIN_CHUNK_SIZE, chunk_size)
        OptimalChunk-->>ParallelInverse: chunk_size
        
        ParallelInverse->>Rayon: par_chunks_mut(chunk_size)
        
        loop For each chunk in parallel
            Rayon->>Sequential: Process chunk
            Sequential->>Sequential: Montgomery's trick (1 inversion)
            Sequential-->>Rayon: Result
        end
        
        Rayon-->>ParallelInverse: Collect results
        ParallelInverse-->>User: Return Result
    end
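The control flow in the diagram can be sketched without rayon by using `std::thread::scope` over mutable chunks in place of `par_chunks_mut`. This is a hedged approximation, assuming a toy Z_p field and illustrative constants, not the crate's actual API:

```rust
// Sketch of the diagram's flow: sequential fallback below a threshold,
// otherwise split into chunks, each running Montgomery's trick independently.
// All names, the prime P, and the constants here are illustrative.
const P: u64 = 1_000_000_007;
const THRESHOLD: usize = 1024; // stand-in for PARALLEL_INVERSION_THRESHOLD
const CHUNK: usize = 256;      // stand-in for optimal_chunk_size()

fn mul(a: u64, b: u64) -> u64 { ((a as u128 * b as u128) % P as u128) as u64 }

fn inv(mut a: u64) -> u64 {
    let (mut e, mut acc) = (P - 2, 1u64);
    while e > 0 { if e & 1 == 1 { acc = mul(acc, a); } a = mul(a, a); e >>= 1; }
    acc
}

// Sequential Montgomery's trick on one chunk: one true inversion per chunk.
fn batch_inverse_chunk(nums: &mut [u64]) -> Result<(), ()> {
    let mut prefix = Vec::with_capacity(nums.len());
    let mut acc = 1u64;
    for &x in nums.iter() {
        if x == 0 { return Err(()); }
        acc = mul(acc, x);
        prefix.push(acc);
    }
    let mut inv_acc = inv(acc);
    for i in (1..nums.len()).rev() {
        let orig = nums[i];
        nums[i] = mul(inv_acc, prefix[i - 1]);
        inv_acc = mul(inv_acc, orig);
    }
    if let Some(first) = nums.first_mut() { *first = inv_acc; }
    Ok(())
}

fn inplace_batch_inverse_parallel(nums: &mut [u64]) -> Result<(), ()> {
    if nums.len() < THRESHOLD {
        return batch_inverse_chunk(nums); // sequential fallback
    }
    std::thread::scope(|s| {
        let handles: Vec<_> = nums
            .chunks_mut(CHUNK)
            .map(|chunk| s.spawn(move || batch_inverse_chunk(chunk)))
            .collect();
        // Propagate the first error if any chunk contained a zero element.
        handles.into_iter().try_for_each(|h| h.join().unwrap())
    })
}

fn main() {
    let mut xs: Vec<u64> = (1..=2048).collect();
    inplace_batch_inverse_parallel(&mut xs).unwrap();
    assert_eq!(mul(1234, xs[1233]), 1); // element k becomes k^{-1} mod p
}
```

The real implementation gets work stealing and thread-pool reuse from rayon for free; the sketch only mirrors the chunk-then-invert structure.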

@github-actions

github-actions bot commented Feb 3, 2026

Kimi AI Review

This PR introduces parallel batch inversion for field elements, which is a significant change. Here are the issues and suggestions:

  1. Naming and Documentation:

    • The constant PARALLEL_INVERSION_THRESHOLD is well-named and documented. However, MIN_CHUNK_SIZE lacks documentation explaining why 64 was chosen and its significance.
    • The function optimal_chunk_size is well-documented, explaining the rationale behind its design.
  2. Error Handling:

    • The inplace_batch_inverse_parallel function returns a Result<(), FieldError>. It's unclear how FieldError is defined and what specific errors it can represent. If FieldError can represent different error conditions, it would be beneficial to return a more specific error type or enum to indicate the exact failure reason (e.g., zero element encountered, parallel computation failure).
  3. Performance Considerations:

    • The comment states that parallelism provides a net speedup for large batches (>= PARALLEL_INVERSION_THRESHOLD). It would be useful to have benchmarks or measurements to back this claim, especially since the actual performance gain can depend on various factors like the number of CPU cores, cache size, and memory bandwidth.
  4. Code Simplicity and Maintainability:

    • The optimal_chunk_size function is straightforward and easy to understand. However, the use of div_ceil and saturating_mul could be confusing for readers unfamiliar with these methods. Consider adding a comment explaining why these methods are used and their benefits.
  5. Testing:

    • The added tests for parallel batch inversion are comprehensive and cover various scenarios, including edge cases like empty batches and batches with zero elements. This is excellent for ensuring the correctness and robustness of the new feature.
  6. Potential Bug:

    • In the inplace_batch_inverse_parallel function, the comment says each chunk performs an independent batch inversion, which could be misread as one field inversion per chunk or one per element. It would be better to state explicitly that each chunk runs a full batch-inversion pass, so the total cost is one true inversion per chunk rather than one for the whole batch.

Overall, the PR introduces a significant new feature with proper documentation and testing. However, there are some areas for improvement, particularly in error handling and performance justification.

@codecov-commenter

codecov-commenter commented Feb 3, 2026

Codecov Report

❌ Patch coverage is 97.16981% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.43%. Comparing base (8ddf400) to head (beb1bfa).
⚠️ Report is 6 commits behind head on main.

Files with missing lines: crates/math/src/field/element.rs — Patch 97.16%, 3 lines missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1125      +/-   ##
==========================================
+ Coverage   70.95%   71.43%   +0.48%     
==========================================
  Files         165      166       +1     
  Lines       35252    36175     +923     
==========================================
+ Hits        25012    25842     +830     
- Misses      10240    10333      +93     

☔ View full report in Codecov by Sentry.

@github-actions

github-actions bot commented Feb 3, 2026

Kimi AI Review

Review of benches/Cargo.toml

File: benches/Cargo.toml

  • Line 5: The addition of the parallel feature is correct and necessary for enabling parallel batch inversion benchmarks. This is appropriate for testing performance improvements from parallel execution.

Review of benches/benches/batch_inverse.rs

File: benches/benches/batch_inverse.rs

  • General: The file structure and benchmarks are well-organized, covering both sequential and parallel execution for different field types. The documentation comments are helpful for understanding the benchmark usage and configuration.
  • Line 33: The comment about running specific benchmarks is clear and informative.
  • Lines 46-47 & 91-92: The benchmarks for sequential and parallel batch inversion are well-defined. The use of iter_batched is appropriate for these tests.
  • Lines 58-59 & 103-104: The conditional compilation for parallel features is correctly handled, ensuring that the parallel benchmarks are only compiled when the parallel feature is enabled.

Review of crates/math/src/field/element.rs

File: crates/math/src/field/element.rs

  • Lines 49-76: The addition of constants for parallel inversion thresholds is a good approach to optimize performance based on architecture. However, it's important to ensure that these values are well-tuned for different systems.
  • Lines 80-81: The get_parallel_threshold function is a good way to allow runtime configuration of the parallel threshold, providing flexibility.
  • Lines 84-89: The optimal_chunk_size function is crucial for balancing parallelization overhead and performance. The logic seems sound, but it should be tested extensively to ensure it works well across different systems and workloads.
  • Lines 95-128: The implementation of inplace_batch_inverse_parallel is well-structured, using Rayon for parallel processing. The fallback to sequential inversion for small batches is a good optimization to avoid unnecessary parallelization overhead.
  • Lines 1053-1125: The tests for parallel batch inversion are comprehensive, covering various scenarios including empty batches, single elements, and edge cases around the parallel threshold. These tests are essential for ensuring the correctness and performance of the parallel implementation.

General Comments

  • Performance Tuning: The addition of parallel batch inversion and the associated benchmarks are valuable for improving performance in cryptographic computations. It's important to continue monitoring and tuning the performance based on real-world usage and different hardware configurations.
  • Documentation: The documentation comments in the benchmark files are helpful, but it would be beneficial to also document the rationale behind the chosen parallelization strategies and thresholds in the codebase.
  • Testing: The tests for parallel batch inversion are thorough, but it's also important to consider stress tests and benchmarks under different system loads and configurations to ensure robustness and performance consistency.

Overall, the changes are well-implemented and should significantly improve the performance of batch inversion operations in the library. The focus on both sequential and parallel paths, along with comprehensive testing, ensures that the library remains robust and efficient under various conditions.


@github-actions github-actions bot left a comment


Code review by ChatGPT

);

#[cfg(not(feature = "parallel"))]
criterion_main!(stark_sequential, goldilocks_sequential);


Correctness

  • Batch Inversion Edge Cases: Ensure that inplace_batch_inverse method handles edge cases correctly, such as when any of the input values are zero (since division by zero is undefined). The code should document or assert that the input doesn't contain zero or handle it appropriately.

Security

  • Unwrap Usage: Using unwrap() on inplace_batch_inverse without error handling could lead to panics if an edge case is not properly handled. Consider using safe error handling mechanisms.

Performance

  • Allocator Optimization: As this is benchmarking code, look out for potential improvements in the allocation strategy, although performance in test code is less critical.

Bugs & Errors

  • Potential Panics: The usage of unwrap() on the result of inplace_batch_inverse can potentially cause panics which should be handled more gracefully.

Code Simplicity

  • Documentation: Consider adding more detailed documentation or comments on the expected behavior of inplace_batch_inverse in the context of handling zero elements or other edge cases.

Please address these issues for a safe and reliable library addition.

/// The chunk size balances:
/// - Number of available CPU cores (for parallelization)
/// - Minimum work per thread (to amortize overhead)
/// - Load balancing (allowing some slack with extra chunks)


Correctness:

  • The code correctly uses std::env::var to allow runtime customization of the parallel inversion threshold. This is useful for different hardware configurations.

Security:

  • Zeroization of Sensitive Data: Ensure that sensitive data (e.g., private keys, field elements when no longer needed) is properly zeroized before deallocation. Currently, no such mechanism is evident.
  • Timing Side-Channels: The code seems to determine thresholds and operations based on public data (e.g., PARALLEL_INVERSION_THRESHOLD), which is good. Make sure all secret data manipulation avoids timing side-channels.

Performance:

  • Redundant Field Inversions: Consider reviewing the logical flow to avoid any unnecessary field inversions, even in batch processes.
  • Parallelism Overhead: While parallelism is implemented, check performance benchmarks to ensure reasonable speedup is obtained without excessive overhead, especially with small data.

Bugs & Errors:

  • Environment Variable Parsing: The code correctly handles the environment variable parsing, including error management with parse::<usize>(). Ensure any error here does not lead to critical operations with incorrect settings.
  • Potential Panics or Unwraps: Ensure there are no hidden panics or unwraps, especially within threaded operations using Rayon.

Code Simplicity:

  • Logic Complexity: The logic for determining the parallel threshold is sound but make sure this additional function does not add unnecessary complexity to FieldElement operations. Consider if there's a cleaner abstraction possible if additional parameters are introduced in future updates.

Overall, the parallel threshold logic seems robust against different architectures and allows flexibility. However, ensure zeroization and timing attack defenses are applied wherever sensitive data is handled. Once these are confirmed, the code should be suitable for merging.
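The environment-variable lookup this review discusses can be sketched with only the standard library. The variable name `LAMBDAWORKS_PARALLEL_THRESHOLD`, the helper `threshold_from`, and the fallback handling below are assumptions for illustration, not the crate's actual API:

```rust
// Sketch of an environment-driven parallel threshold with a safe fallback.
// Malformed or zero values fall back to the compile-time default, so a bad
// setting can never push the library into a nonsensical configuration.
const PARALLEL_INVERSION_THRESHOLD: usize = 1024;

// Pure helper: decide the threshold from an optional string value.
fn threshold_from(var: Option<&str>) -> usize {
    var.and_then(|v| v.parse::<usize>().ok()) // parse errors fall through
        .filter(|&t| t > 0)                   // a zero threshold makes no sense
        .unwrap_or(PARALLEL_INVERSION_THRESHOLD)
}

// Thin wrapper that reads the (illustrative) environment variable.
fn get_parallel_threshold() -> usize {
    threshold_from(std::env::var("LAMBDAWORKS_PARALLEL_THRESHOLD").ok().as_deref())
}

fn main() {
    assert_eq!(threshold_from(Some("2048")), 2048);   // valid override
    assert_eq!(threshold_from(Some("garbage")), 1024); // parse failure -> default
    assert_eq!(threshold_from(None), 1024);            // unset -> default
    println!("effective threshold: {}", get_parallel_threshold());
}
```

Keeping the parsing in a pure helper makes the fallback behavior trivially testable without mutating process-wide environment state.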
