[GPU_X] 'cudaErrorNotSupported': 'operation not supported' #47270

Open
iarspider opened this issue Feb 5, 2025 · 5 comments

@iarspider
Contributor

RelVals 29634.402, 29634.403, 29634.404, 29634.406, 29661.402, 29834.402, 29834.403, 29834.404 failed in CMSSW_15_0_GPU_X_2025-02-04-2300 with StdException:

----- Begin Fatal Exception 05-Feb-2025 02:33:03 CET-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 1 lumi: 8 event: 703 stream: 3
   [1] Running path 'MC_Ele5_Open_Unseeded'
   [2] Calling method for module HGCalSoARecHitsLayerClustersProducer@alpaka/'hltHgcalSoARecHitsLayerClustersProducer'
Exception Message:
A std::exception was thrown.
/data/cmsbld/jenkins/workspace/build-any-ib/w/el8_amd64_gcc12/external/alpaka/1.2.0-92470b733c547768aafa79b7bf3f2362/include/alpaka/mem/buf/BufUniformCudaHipRt.hpp(302) 'TApi::mallocAsync( &memPtr, static_cast<std::size_t>(width) * sizeof(TElem), queue.getNativeHandle())' returned error  : 'cudaErrorNotSupported': 'operation not supported'!
----- End Fatal Exception -------------------------------------------------

@iarspider
Contributor Author

assign heterogeneous

@cmsbuild
Contributor

cmsbuild commented Feb 5, 2025

New categories assigned: heterogeneous

@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Contributor

cmsbuild commented Feb 5, 2025

cms-bot internal usage

@cmsbuild
Contributor

cmsbuild commented Feb 5, 2025

A new Issue was created by @iarspider.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@fwyzard
Contributor

fwyzard commented Feb 5, 2025

Looks like the jobs are running on a 1/8th slice of an H100, with only 12 GB of GPU memory:

CUDA device 0: NVIDIA H100L-1-12C MIG 1g.12gb (sm_90)

Maybe that is not enough for the Phase-2 workflow with 4 concurrent streams?

@rovere ?
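
For a rough back-of-the-envelope check of the memory budget, here is a minimal sketch that reads the MIG profile out of the device name quoted above and divides the memory by the number of streams. The helper name `parse_mig_profile`, the `MIG <N>g.<M>gb` name pattern, and the even split across streams are assumptions made for illustration only; the split is a simplification, not a description of how device memory is actually allocated by the job.

```python
import re


def parse_mig_profile(device_name):
    """Return (compute_slices, memory_gb) from a name containing 'MIG <N>g.<M>gb'.

    Hypothetical helper: returns None when the name carries no such profile.
    """
    match = re.search(r"MIG (\d+)g\.(\d+)gb", device_name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Device string as reported in the job log.
device = "NVIDIA H100L-1-12C MIG 1g.12gb (sm_90)"
slices, memory_gb = parse_mig_profile(device)

# Assumption: memory is shared evenly by the concurrent streams.
streams = 4
per_stream_gb = memory_gb / streams

print(f"{slices} compute slice(s), {memory_gb} GB, ~{per_stream_gb:.1f} GB per stream")
# prints "1 compute slice(s), 12 GB, ~3.0 GB per stream"
```

Under these assumptions each of the 4 streams would have roughly 3 GB available, which gives a first number to compare against the workflow's per-stream footprint.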
