Skip to content

Direct to APC WIP#52

Open
qwang98 wants to merge 39 commits into
v1.4.2-upgradefrom
skip-dummy-wip
Open

Direct to APC WIP#52
qwang98 wants to merge 39 commits into
v1.4.2-upgradefrom
skip-dummy-wip

Conversation

@qwang98
Copy link
Copy Markdown

@qwang98 qwang98 commented Dec 23, 2025

Ready for review.

…new, so that we can skip subs calculation in non apc path
Comment thread extensions/rv32im/circuit/cuda/include/rv32im/cores/shift.cuh Outdated
Comment thread extensions/rv32im/circuit/cuda/src/alu.cu Outdated
Comment thread extensions/rv32im/circuit/cuda/src/alu.cu Outdated
Comment thread extensions/rv32im/circuit/src/auipc/cuda.rs
Comment thread extensions/rv32im/circuit/src/cuda_abi.rs Outdated
      - Add ApcParams struct to bundle APC-related parameters (subs, opt_widths,
        post_opt_offsets, height, width, calls_per_row) with helper methods
        is_apc(), thread_count(), effective_height()

      - Add RowSlice::create_apc_aware() static factory method to encapsulate
        APC-aware row slice construction logic

      - Add FILL_DUMMY_ROW_APC macro to handle dummy row filling for both
        APC and non-APC cases

      - Update all RV32IM CUDA tracegen kernels (alu, auipc, beq, blt, divrem,
        jal_lui, jalr, less_than, load_sign_extend, loadstore, mul, mulh, shift)
        to use the new abstractions

      - Update cuda_abi.rs FFI bindings to pass ApcParams struct instead of
        individual parameters

      - Remove cuda from default features in Cargo.toml

      This reduces code duplication and makes APC handling more maintainable
      across all chip implementations.
   - Remove  field and redundant APC height checks from tracegen
   - Refactor  to handle dummy rows internally
   - Replace FILL_DUMMY_ROW macro with direct fill_zero_no_offset calls
   - Remove row_print_buffer.cuh debug utility and to_debug_uint helper
   - Simplify CUDA kernel launchers by removing apc_height parameter
   - Clean up .rs files to match simplified CUDA interface
Comment thread Cargo.toml Outdated
Comment thread crates/circuits/primitives/cuda/include/primitives/histogram.cuh Outdated
Comment thread crates/circuits/primitives/cuda/include/primitives/trace_access.h
size_t num_records
) {
if (apc.is_apc()) {
// Beyond APC buffer - nothing to do
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be a hard error?

Copy link
Copy Markdown
Author

@qwang98 qwang98 Jan 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried a hard error but unfortunately didn't work because OVM rounds # of threads launched from kernel to multiple of MAX_THREADS = 1024, so there wlll be idle threads regardless.

inline std::pair<dim3, dim3> kernel_launch_params(
    size_t count,
    size_t threads_per_block = MAX_THREADS
) {
    size_t block = std::min(count, threads_per_block);
    size_t grid = div_ceil(count, block);
    return std::make_pair(dim3(grid, 1, 1), dim3(block, 1, 1));
}

// Dummy row - fill zeros and return null
if (idx >= num_records) {
row.fill_zero_no_offset(0, apc.opt_widths[slot]);
return RowSlice::null();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it ok to access a null row slice? If not, how do we know it's not accessed?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread crates/circuits/primitives/cuda/include/primitives/trace_access.h Outdated
Comment thread crates/circuits/primitives/cuda/include/primitives/trace_access.h Outdated
Comment thread crates/circuits/primitives/cuda/include/primitives/trace_access.h

__device__ __forceinline__ RowSlice shift_row(size_t n) const {
return RowSlice(ptr + n, stride);
return RowSlice(ptr + n, stride, optimized_offset, dummy_offset, subs, is_apc);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain why none of the other members change when shifting? Is it because we look at a different row of the same table?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's simply because this API is never used in practice for APC (it's only used in poseidon2.cu).

I added an !is_apc assertion here.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also made the added arguments 0, 0, nullptr, false.

RowSlice row = RowSlice::create_apc_aware(
d_trace, height, idx, sizeof(Rv32BaseAluCols<uint8_t>), apc, d_records.len()
);
if (!row.is_valid()) return;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this because filling with zeros is handled by create_apc_aware?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes exactly and also as answer to your other question, we return a nullptr in create_apc_aware after filling in the zeros, so that here we can catch those cases and immediately return, so nullptr is guaranteed to be never accessed.

This is because we can't return from alu_tracegen inside create_apc_aware, so have to do via this nullptr indirection, and again this is because we handle the filling zeros inside create_apc_aware, as one of your prior comments suggested (and I agreed because this reduces diffs).

Comment thread extensions/rv32im/circuit/cuda/src/alu.cu Outdated
@qwang98
Copy link
Copy Markdown
Author

qwang98 commented Jan 2, 2026

All comments should be fixed :)

@qwang98 qwang98 requested a review from Schaeff January 2, 2026 04:53
for (int i = 0; i < limbs_len; i++) {
uint32_t limb_u32 = x & mask;
limbs[i] = limb_u32;
limbs.write(i, limb_u32);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this required?

__device__ __forceinline__ void fill_zero_no_offset(size_t column_index_from, size_t length) const {
#pragma unroll
for (size_t i = 0, c = column_index_from; i < length; i++, c++) {
ptr[c * stride] = 0;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ptr[c * stride] = 0;
write(c, 0);

?

/// @param start Starting index in the substitution array
/// @param len Number of entries to scan
/// @return Number of entries equal to UINT32_MAX in the range
__device__ __forceinline__ size_t number_of_gaps_in(const uint32_t *sub, size_t start, size_t len) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would there be a way to make this function more local? It's called only once, but it is defined here and in trace_access.h.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants