Direct to APC WIP#52
Conversation
…ith non-APC but not sure if APC works
… simple example works
…new, so that we can skip subs calculation in non apc path
…p passes for 10 APC
- Add ApcParams struct to bundle APC-related parameters (subs, opt_widths,
post_opt_offsets, height, width, calls_per_row) with helper methods
is_apc(), thread_count(), effective_height()
- Add RowSlice::create_apc_aware() static factory method to encapsulate
APC-aware row slice construction logic
- Add FILL_DUMMY_ROW_APC macro to handle dummy row filling for both
APC and non-APC cases
- Update all RV32IM CUDA tracegen kernels (alu, auipc, beq, blt, divrem,
jal_lui, jalr, less_than, load_sign_extend, loadstore, mul, mulh, shift)
to use the new abstractions
- Update cuda_abi.rs FFI bindings to pass ApcParams struct instead of
individual parameters
- Remove cuda from default features in Cargo.toml
This reduces code duplication and makes APC handling more maintainable
across all chip implementations.
4714b6a to
8fa26d7
Compare
- Remove field and redundant APC height checks from tracegen - Refactor to handle dummy rows internally - Replace FILL_DUMMY_ROW macro with direct fill_zero_no_offset calls - Remove row_print_buffer.cuh debug utility and to_debug_uint helper - Simplify CUDA kernel launchers by removing apc_height parameter - Clean up .rs files to match simplified CUDA interface
d628408 to
feb6138
Compare
| size_t num_records | ||
| ) { | ||
| if (apc.is_apc()) { | ||
| // Beyond APC buffer - nothing to do |
There was a problem hiding this comment.
Tried a hard error but unfortunately didn't work because OVM rounds # of threads launched from kernel to multiple of MAX_THREADS = 1024, so there wlll be idle threads regardless.
inline std::pair<dim3, dim3> kernel_launch_params(
size_t count,
size_t threads_per_block = MAX_THREADS
) {
size_t block = std::min(count, threads_per_block);
size_t grid = div_ceil(count, block);
return std::make_pair(dim3(grid, 1, 1), dim3(block, 1, 1));
}
| // Dummy row - fill zeros and return null | ||
| if (idx >= num_records) { | ||
| row.fill_zero_no_offset(0, apc.opt_widths[slot]); | ||
| return RowSlice::null(); |
There was a problem hiding this comment.
Is it ok to access a null row slice? If not, how do we know it's not accessed?
There was a problem hiding this comment.
Yes it is and guaranteed to be never accessed. See response here: https://github.com/powdr-labs/openvm/pull/52/files/183c96d3c5cf1bb213866c2b685ab93c59d94b12#r2656757779
|
|
||
| __device__ __forceinline__ RowSlice shift_row(size_t n) const { | ||
| return RowSlice(ptr + n, stride); | ||
| return RowSlice(ptr + n, stride, optimized_offset, dummy_offset, subs, is_apc); |
There was a problem hiding this comment.
Can you explain why none of the other members change when shifting? Is it because we look at a different row of the same table?
There was a problem hiding this comment.
It's simply because this API is never used in practice for APC (it's only used in poseidon2.cu).
I added an !is_apc assertion here.
There was a problem hiding this comment.
Also made the added arguments 0, 0, nullptr, false.
| RowSlice row = RowSlice::create_apc_aware( | ||
| d_trace, height, idx, sizeof(Rv32BaseAluCols<uint8_t>), apc, d_records.len() | ||
| ); | ||
| if (!row.is_valid()) return; |
There was a problem hiding this comment.
Is this because filling with zeros is handled by create_apc_aware?
There was a problem hiding this comment.
Yes exactly and also as answer to your other question, we return a nullptr in create_apc_aware after filling in the zeros, so that here we can catch those cases and immediately return, so nullptr is guaranteed to be never accessed.
This is because we can't return from alu_tracegen inside create_apc_aware, so have to do via this nullptr indirection, and again this is because we handle the filling zeros inside create_apc_aware, as one of your prior comments suggested (and I agreed because this reduces diffs).
|
All comments should be fixed :) |
| for (int i = 0; i < limbs_len; i++) { | ||
| uint32_t limb_u32 = x & mask; | ||
| limbs[i] = limb_u32; | ||
| limbs.write(i, limb_u32); |
| __device__ __forceinline__ void fill_zero_no_offset(size_t column_index_from, size_t length) const { | ||
| #pragma unroll | ||
| for (size_t i = 0, c = column_index_from; i < length; i++, c++) { | ||
| ptr[c * stride] = 0; |
There was a problem hiding this comment.
| ptr[c * stride] = 0; | |
| write(c, 0); |
?
| /// @param start Starting index in the substitution array | ||
| /// @param len Number of entries to scan | ||
| /// @return Number of entries equal to UINT32_MAX in the range | ||
| __device__ __forceinline__ size_t number_of_gaps_in(const uint32_t *sub, size_t start, size_t len) { |
There was a problem hiding this comment.
Would there be a way to make this function more local? It's called only once, but it is defined here and in trace_access.h.
Ready for review.