perf(field): lower batch inversion parallelization threshold #1125
diegokingston wants to merge 5 commits into main from perf/batch-inversion-threshold
Conversation
diegokingston
commented
Feb 2, 2026
- Reduce PARALLEL_INVERSION_THRESHOLD from 4096 to 1024 elements
- Add optimal_chunk_size() function for better load balancing:
  - Considers available CPU threads (via rayon::current_num_threads)
  - Enforces MIN_CHUNK_SIZE (64) for cache efficiency
  - Allows 4x oversubscription for load balancing
- Export PARALLEL_INVERSION_THRESHOLD constant for configurability
- Add comprehensive test suite for parallel batch inversion:
  - Edge cases (empty, single element, zero element)
  - Boundary tests (below, at, and above threshold)
  - Correctness verification against sequential implementation
  - Chunk size calculation tests
This change enables parallelization for smaller batches common in real-world ZK proof computations, providing a 2-3x speedup for medium-sized batch inversions (1024-4096 elements).
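The dispatch described above can be sketched as follows. This is a simplified model, not the crate's code: u64 arithmetic mod a small prime stands in for FieldElement, the thread count is passed as a parameter instead of being read from rayon, and plain chunks_mut appears where the real implementation uses rayon's par_chunks_mut.

```rust
const PARALLEL_INVERSION_THRESHOLD: usize = 1024;
const MIN_CHUNK_SIZE: usize = 64;
const P: u64 = 65537; // small prime stand-in for the real field modulus

// Single inversion via Fermat's little theorem: x^(P-2) mod P.
fn inverse(x: u64) -> u64 {
    let (mut b, mut e, mut acc) = (x, P - 2, 1u64);
    while e > 0 {
        if e & 1 == 1 {
            acc = acc * b % P;
        }
        b = b * b % P;
        e >>= 1;
    }
    acc
}

fn inplace_batch_inverse(xs: &mut [u64]) {
    for x in xs.iter_mut() {
        *x = inverse(*x);
    }
}

fn inplace_batch_inverse_parallel(xs: &mut [u64], num_threads: usize) {
    if xs.len() < PARALLEL_INVERSION_THRESHOLD {
        // Small batches: thread overhead dominates, stay sequential.
        inplace_batch_inverse(xs);
        return;
    }
    // 4x oversubscription, clamped to a cache-friendly minimum chunk size.
    let chunk = xs.len().div_ceil(num_threads * 4).max(MIN_CHUNK_SIZE);
    // The real code hands these chunks to rayon's par_chunks_mut;
    // chunks_mut keeps this sketch dependency-free.
    for c in xs.chunks_mut(chunk) {
        inplace_batch_inverse(c);
    }
}
```

Because each chunk is inverted independently, the parallel path must produce exactly the same result as the sequential path; the PR's boundary tests check this agreement on both sides of the threshold.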
// Tests for BigUint conversion.
// We define different fields to test the conversion.
Correctness
- Edge Case Handling: The code handles edge cases for zero and single-element arrays adequately in inplace_batch_inverse_parallel, and tests for these scenarios are present.
- Mathematical Operations: The batch inversion function uses Montgomery's Trick, which is appropriate for efficient batch inversion.
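Montgomery's Trick computes n inverses with a single field inversion plus roughly 3n multiplications. A minimal sketch over integers mod a small prime (a stand-in for the crate's FieldElement; the function names here are illustrative, not the crate's API):

```rust
const P: u64 = 65537; // small prime stand-in for the real field modulus

// x^(P-2) mod P via square-and-multiply (Fermat's little theorem).
fn pow_mod(mut b: u64, mut e: u64) -> u64 {
    let mut acc = 1u64;
    while e > 0 {
        if e & 1 == 1 {
            acc = acc * b % P;
        }
        b = b * b % P;
        e >>= 1;
    }
    acc
}

// Montgomery's trick: invert all elements with ONE modular inversion.
fn batch_inverse(xs: &mut [u64]) -> Result<(), &'static str> {
    // Prefix products: prefix[i] = x0 * x1 * ... * x(i-1), prefix[0] = 1.
    let mut prefix = Vec::with_capacity(xs.len() + 1);
    prefix.push(1u64);
    for &x in xs.iter() {
        if x == 0 {
            return Err("zero has no multiplicative inverse");
        }
        prefix.push(prefix.last().unwrap() * x % P);
    }
    // The single inversion: invert the product of all elements.
    let mut inv = pow_mod(*prefix.last().unwrap(), P - 2);
    // Sweep backwards, peeling one factor off the running inverse each step.
    for i in (0..xs.len()).rev() {
        let xi = xs[i];
        xs[i] = inv * prefix[i] % P; // (x0..xi)^-1 * (x0..x(i-1)) = xi^-1
        inv = inv * xi % P; // now inv = (x0..x(i-1))^-1
    }
    Ok(())
}
```

Zero is rejected up front, matching the review's point that a zero element has no inverse and should surface as an error rather than a panic.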
Security
- Timing Side Channels: The snippet gives no explicit constant-time guarantees for secret data. Finite field inversion without secret-dependent branching may not leak timing, but ensuring all underlying operations on sensitive inputs are constant-time would be ideal.
- Data Zeroization: There is no indication in the snippet of zeroization of sensitive data after usage. Consider adding memory zeroization where appropriate, especially when dealing with cryptographic computations.
Performance
- Thread Optimization: The optimal_chunk_size function takes the number of available threads into account and balances thread count against work per thread when choosing a chunk size for parallel processing.
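The heuristic the review describes can be sketched as pure arithmetic. Note the assumption: the real function reads rayon::current_num_threads() itself, while here the thread count is a parameter so the calculation is easy to test in isolation.

```rust
const MIN_CHUNK_SIZE: usize = 64; // keep chunks large enough to be cache-friendly

// Sketch of the chunk-size heuristic: aim for up to 4 chunks per thread
// (oversubscription lets rayon's work-stealing balance uneven chunks),
// but never shrink a chunk below MIN_CHUNK_SIZE.
fn optimal_chunk_size(batch_size: usize, num_threads: usize) -> usize {
    let max_chunks = num_threads * 4;
    batch_size.div_ceil(max_chunks).max(MIN_CHUNK_SIZE)
}
```

For example, 4096 elements on 8 threads yields 32 target chunks of 128 elements each, while 1024 elements on 8 threads would target 32-element chunks and instead gets clamped up to 64.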
Bugs & Errors
- Error Handling: Batch inversion of an input containing zero must fail, since zero has no multiplicative inverse in a field; the tests correctly assert this.
- Constants Usage: The PARALLEL_INVERSION_THRESHOLD and MIN_CHUNK_SIZE constants are clearly defined, although these thresholds might benefit from runtime configuration based on system capability rather than hard-coded values.
Code Simplicity
- Complexity: The function inplace_batch_inverse_parallel and its helper optimal_chunk_size are well-documented, making their purpose and behavior easy to understand.
- Test Coverage: Extensive tests cover the parallel inversion functionality, including various edge cases, and verify that the sequential and parallel versions agree.
Recommendations
- Security Concern: Review for constant-time operations and eliminate any opportunities for timing side channels in cryptographic operations. If secrets are involved (not explicitly indicated here), implement measures for constant-time execution.
- Zeroization: Explicitly zeroize sensitive data after use to mitigate memory residue attacks once the data is no longer needed.
- Configuration Flexibility: Consider using environment-based configuration to determine PARALLEL_INVERSION_THRESHOLD, or even the number of threads, so the code can adapt dynamically to different system architectures. This would improve portable performance across hardware setups.
Kimi AI Review: Review Feedback
Greptile Overview
Greptile Summary
This PR optimizes parallel batch field inversion by lowering the parallelization threshold from 4096 to 1024 elements and implementing a smarter chunking strategy.
The implementation is well-documented and the test coverage is thorough. The chunking algorithm considers available CPU threads (via rayon::current_num_threads). Confidence Score: 4/5
| Filename | Overview |
|---|---|
| crates/math/src/field/element.rs | Lowers threshold from 4096 to 1024, adds optimal_chunk_size() for better load balancing, exports constant, and adds comprehensive tests |
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant ParallelInverse as inplace_batch_inverse_parallel
    participant Sequential as inplace_batch_inverse
    participant OptimalChunk as optimal_chunk_size
    participant Rayon as rayon::par_chunks_mut
    User->>ParallelInverse: Call with &mut [FieldElement]
    alt numbers.len() < PARALLEL_INVERSION_THRESHOLD (1024)
        ParallelInverse->>Sequential: Fallback to sequential
        Sequential-->>User: Return Result
    else numbers.len() >= 1024
        ParallelInverse->>OptimalChunk: Calculate chunk size
        OptimalChunk->>OptimalChunk: Get num_threads from rayon
        OptimalChunk->>OptimalChunk: Calculate max_chunks = num_threads * 4
        OptimalChunk->>OptimalChunk: chunk_size = batch_size.div_ceil(max_chunks)
        OptimalChunk->>OptimalChunk: Return max(MIN_CHUNK_SIZE, chunk_size)
        OptimalChunk-->>ParallelInverse: chunk_size
        ParallelInverse->>Rayon: par_chunks_mut(chunk_size)
        loop For each chunk in parallel
            Rayon->>Sequential: Process chunk
            Sequential->>Sequential: Montgomery's trick (1 inversion)
            Sequential-->>Rayon: Result
        end
        Rayon-->>ParallelInverse: Collect results
        ParallelInverse-->>User: Return Result
    end
```
…ss/lambdaworks into perf/batch-inversion-threshold
Kimi AI Review
This PR introduces parallel batch inversion for field elements, which is a significant change. Here are the issues and suggestions:
Overall, the PR introduces a significant new feature with proper documentation and testing. However, there are some areas for improvement, particularly in error handling and performance justification.
Codecov Report
❌ Patch coverage is
Additional details and impacted files

```diff
@@ Coverage Diff @@
##             main    #1125      +/-   ##
==========================================
+ Coverage   70.95%   71.43%    +0.48%
==========================================
  Files         165      166        +1
  Lines       35252    36175      +923
==========================================
+ Hits        25012    25842      +830
- Misses      10240    10333       +93
```

☔ View full report in Codecov by Sentry.
Kimi AI Review: Review of
#[cfg(not(feature = "parallel"))]
criterion_main!(stark_sequential, goldilocks_sequential);
Correctness
- Batch Inversion Edge Cases: Ensure that the inplace_batch_inverse method handles edge cases correctly, such as inputs containing zero (division by zero is undefined). The code should document that the input must not contain zero, assert it, or handle it appropriately.
Security
- Unwrap Usage: Calling unwrap() on the result of inplace_batch_inverse without error handling can panic if an edge case is not properly handled. Consider safe error handling mechanisms instead.
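One way to address this is to propagate the error with `?` rather than unwrap. A hedged sketch follows; the Result-returning signature of inplace_batch_inverse and the FieldError type are modeled on the review comments, not the crate's exact API.

```rust
// Hypothetical error type; the real crate defines its own.
#[derive(Debug, PartialEq)]
enum FieldError {
    ZeroInverse,
}

// Stand-in for the real method: fails if any element is zero,
// since zero has no multiplicative inverse in a field.
fn inplace_batch_inverse(xs: &mut [u64]) -> Result<(), FieldError> {
    if xs.contains(&0) {
        return Err(FieldError::ZeroInverse);
    }
    // ... Montgomery's trick would run here; elided in this sketch ...
    Ok(())
}

fn invert_all(xs: &mut [u64]) -> Result<(), FieldError> {
    // `?` propagates the error to the caller instead of panicking.
    inplace_batch_inverse(xs)?;
    Ok(())
}
```

In benchmark code a panic may be tolerable, but library callers get a recoverable error this way.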
Performance
- Allocator Optimization: As this is benchmarking code, look out for potential improvements in the allocation strategy, although performance in test code is less critical.
Bugs & Errors
- Potential Panics: Calling unwrap() on the result of inplace_batch_inverse can cause panics; handle the error more gracefully.
Code Simplicity
- Documentation: Consider adding more detailed documentation or comments on the expected behavior of inplace_batch_inverse when the input contains zero elements or other edge cases.
Please address these issues for a safe and reliable library addition.
/// The chunk size balances:
/// - Number of available CPU cores (for parallelization)
/// - Minimum work per thread (to amortize overhead)
/// - Load balancing (allowing some slack with extra chunks)
Correctness:
- The code correctly uses std::env::var to allow runtime customization of the parallel inversion threshold, which is useful for different hardware configurations.
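A minimal sketch of this pattern, assuming a variable name of LAMBDAWORKS_INV_THRESHOLD (illustrative only; the PR does not name the actual variable here):

```rust
use std::env;

const DEFAULT_THRESHOLD: usize = 1024; // compile-time fallback

// Read the threshold from the environment, falling back to the default
// when the variable is unset or fails to parse as usize.
fn parallel_inversion_threshold() -> usize {
    env::var("LAMBDAWORKS_INV_THRESHOLD")
        .ok()
        .and_then(|s| s.parse::<usize>().ok()) // malformed values fall back
        .unwrap_or(DEFAULT_THRESHOLD)
}
```

Falling back silently on a parse error is the safe default the review asks for: a malformed setting can never push the code into running with a nonsensical threshold.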
Security:
- Zeroization of Sensitive Data: Ensure that sensitive data (e.g., private keys, field elements when no longer needed) is properly zeroized before deallocation. Currently, no such mechanism is evident.
- Timing Side-Channels: The code determines thresholds and operations based on public data (e.g., PARALLEL_INVERSION_THRESHOLD), which is good. Make sure all secret-data manipulation avoids timing side-channels.
Performance:
- Redundant Field Inversions: Consider reviewing the logical flow to avoid any unnecessary field inversions, even in batch processes.
- Parallelism Overhead: While parallelism is implemented, check performance benchmarks to ensure reasonable speedup is obtained without excessive overhead, especially with small data.
Bugs & Errors:
- Environment Variable Parsing: The code correctly handles environment variable parsing, including error management with parse::<usize>(). Ensure a parsing error never lets critical operations run with incorrect settings.
- Potential Panics or Unwraps: Ensure there are no hidden panics or unwraps, especially within threaded operations using Rayon.
Code Simplicity:
- Logic Complexity: The logic for determining the parallel threshold is sound, but make sure this additional function does not add unnecessary complexity to FieldElement operations. Consider whether a cleaner abstraction is possible if additional parameters are introduced in future updates.
Overall, the parallel threshold logic seems robust against different architectures and allows flexibility. However, ensure zeroization and timing attack defenses are applied wherever sensitive data is handled. Once these are confirmed, the code should be suitable for merging.