Description
`cudaMalloc` shouldn't be used to allocate temporary memory for a CUDA kernel. `cudaMalloc` is very slow, and it synchronizes the device, which is catastrophic when multiple kernels are running at the same time.
We need a sub-allocator that reserves a block of device memory once, at the start of the program, and hands out temporary storage from that block instead of calling `cudaMalloc` each time.
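As a rough illustration of the idea, here is a minimal sketch of a bump (arena) sub-allocator. The class name `ScratchArena`, its methods, and the 256-byte default alignment are assumptions made for this example, not existing code in this project. It only does offset bookkeeping on the host and never dereferences the pointers it returns, so the base pointer could come from a single `cudaMalloc` at startup.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical bump (arena) sub-allocator for temporary kernel storage.
// It never dereferences `base`, so `base` may be a device pointer obtained
// from one cudaMalloc call at program start (assumption: `base` itself is
// already aligned to the largest alignment that will be requested).
class ScratchArena {
public:
    ScratchArena(void* base, std::size_t capacity)
        : base_(static_cast<std::uint8_t*>(base)), capacity_(capacity), offset_(0) {}

    // Returns `bytes` of scratch space whose offset from `base` is a multiple
    // of `alignment` (must be a power of two), or nullptr if the arena is full.
    // No CUDA API call is made here, so nothing synchronizes the device.
    void* allocate(std::size_t bytes, std::size_t alignment = 256) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        std::size_t aligned = (offset_ + alignment - 1) & ~(alignment - 1);
        if (aligned > capacity_ || bytes > capacity_ - aligned) return nullptr;
        offset_ = aligned + bytes;
        return base_ + aligned;
    }

    // Releases every allocation at once. The caller must make sure that no
    // kernel is still using memory handed out by this arena.
    void reset() { offset_ = 0; }

    // Number of bytes consumed so far, including alignment padding.
    std::size_t used() const { return offset_; }

private:
    std::uint8_t* base_;
    std::size_t capacity_;
    std::size_t offset_;
};
```

In this sketch the program would call `cudaMalloc` once at startup, wrap the result in a `ScratchArena`, call `allocate` before each kernel launch, and call `reset` after the relevant work has completed. A bump allocator cannot free individual blocks, and sharing one arena across concurrent streams would need extra care, so a real implementation might prefer a pooling or caching scheme.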