CUDA functions should not use `CudaMalloc` for temporary memory

`CudaMalloc` shouldn't be used to allocate temporary memory for a CUDA kernel. It is very slow, and it synchronizes the device, which is catastrophic when multiple kernels are running at the same time, because concurrent work has to drain before the allocation returns.

We need a sub-allocator that reserves a block of device memory once, at the start of the program, and serves all temporary storage requests from that block.
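One possible shape for such a sub-allocator is a simple bump ("arena") allocator. The sketch below is only an illustration: the class name `ScratchArena`, its methods, and the 256-byte default alignment are assumptions, not existing project code. It does nothing but pointer arithmetic on a block it is handed, so the block can be device memory obtained from a single allocation at startup.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical bump ("arena") sub-allocator over one pre-allocated block.
// It never dereferences the memory, so `base` may be a device pointer
// obtained from a single allocation made at program start.
class ScratchArena {
public:
    ScratchArena(void* base, std::size_t capacity)
        : base_(reinterpret_cast<std::uintptr_t>(base)),
          capacity_(capacity),
          offset_(0) {}

    // Returns an aligned pointer into the block, or nullptr if the block
    // cannot satisfy the request. `alignment` must be a power of two;
    // the default of 256 bytes is an assumption for this sketch.
    void* allocate(std::size_t bytes, std::size_t alignment = 256) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        const std::uintptr_t current = base_ + offset_;
        const std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
        const std::uintptr_t aligned = (current + mask) & ~mask;
        const std::size_t padding = static_cast<std::size_t>(aligned - current);
        const std::size_t remaining = capacity_ - offset_;
        if (padding > remaining || bytes > remaining - padding) {
            return nullptr;  // out of scratch space
        }
        offset_ += padding + bytes;
        return reinterpret_cast<void*>(aligned);
    }

    // Releases every temporary allocation at once. The caller must make
    // sure no kernel is still using the scratch memory.
    void reset() { offset_ = 0; }

    // Bytes consumed so far, including alignment padding.
    std::size_t used() const { return offset_; }

private:
    std::uintptr_t base_;
    std::size_t capacity_;
    std::size_t offset_;
};
```

In a CUDA program the block would be allocated once at startup and passed to the arena; each function needing temporary memory would then call `allocate` instead of allocating from the driver, and `reset` would run at a point where all kernels using the scratch space have completed. This sketch is not thread-safe, so code that launches from several host threads or streams would likely need one arena per stream or some locking.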