Out of memory errors don't work properly #496

Closed
shkodm opened this issue Jan 16, 2024 · 5 comments

@shkodm
Member

shkodm commented Jan 16, 2024

Found on the develop branch when trying to run a large (automatically generated) example.
It can be reproduced, for instance, with the following lattice sizes:

256 x 512 x 256: [ ] [0] Cumulative allocation of 168231424 b (30.2 GB). Works fine.
256 x 800 x 256: Cumulative allocation of -24577536 b (18446744073.7 GB). Throws the out of memory error correctly, but reports the total allocated memory incorrectly.
512 x 800 x 256: Tries to allocate an incorrect (much smaller) total amount of memory and throws a different error:

Cumulative allocation of -154486272 b (25.6 GB)
[  ] Initializing Lattice ...
[ ] FATAL ERROR: an illegal memory access was encountered in Lattice.hpp at line 445

Expected behaviour: the correct memory allocation is reported and the correct error is thrown (the behaviour of the master branch). Probably related to some casting or overflow.
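
The reported byte counts look consistent with a total that was truncated to 32 bits before printing. The sketch below is a hedged illustration only, not code from the project: the helper names are hypothetical, the "true" totals are inferred from the GB figures in the logs, and GB is assumed to mean 1e9 bytes.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: keep only the low 32 bits of a byte count and read
// them as a signed 32-bit value, which is what a log line would show if the
// total had passed through an int.
std::int32_t wrapped_to_int32(std::uint64_t bytes) {
    // Unsigned narrowing is well defined: it reduces the value modulo 2^32.
    const std::uint32_t low = static_cast<std::uint32_t>(bytes);
    if (low <= 0x7FFFFFFFu) {
        return static_cast<std::int32_t>(low);
    }
    // Map the upper half of the unsigned range onto the negative numbers.
    return static_cast<std::int32_t>(static_cast<std::int64_t>(low) - 4294967296LL);
}

// Hypothetical helper: convert an already wrapped (possibly negative) value
// to a 64-bit unsigned size and express it in decimal gigabytes.
double wrapped_as_gigabytes(std::int32_t wrapped) {
    // A negative value becomes a number close to 2^64 after this conversion.
    const std::uint64_t as_unsigned =
        static_cast<std::uint64_t>(static_cast<std::int64_t>(wrapped));
    return static_cast<double>(as_unsigned) / 1e9;
}
```

Under these assumptions, a total of 30233002496 bytes (about 30.2 GB) wraps to 168231424, a total of 25615317504 bytes (about 25.6 GB) wraps to -154486272, and converting -24577536 to an unsigned 64-bit size gives roughly 18446744073.7 GB, matching the three log lines above.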

@llaniewski llaniewski added the bug label Jan 16, 2024
@llaniewski llaniewski self-assigned this Jan 16, 2024
@llaniewski
Member

It seems the problem was introduced here:
101e6c2#diff-d31965790d0025ccd455cffbe6c4c9fdcdcc33946b62479b634f87f9d8a574f9R197
when @kubagalecki deleted the (size_t) conversion in the calculation of the size. I'll make a pull request to fix this (and fix the printing at the same time).
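
As a rough illustration of why the (size_t) conversion matters, here is a minimal sketch. It is not the code from the linked commit: `lattice_bytes`, `format_allocation`, and the bytes-per-node parameter are hypothetical names, and a 64-bit `size_t` is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

static_assert(sizeof(std::size_t) >= 8, "sketch assumes a 64-bit size_t");

// Hypothetical helper: total bytes for an nx * ny * nz lattice that stores
// bytes_per_node bytes per node.
std::size_t lattice_bytes(int nx, int ny, int nz, int bytes_per_node) {
    assert(nx >= 0 && ny >= 0 && nz >= 0 && bytes_per_node >= 0);
    // Converting the first operand makes every following multiplication
    // happen in size_t. Without it the product is computed in int and can
    // overflow before it is ever widened.
    return static_cast<std::size_t>(nx) * ny * nz * bytes_per_node;
}

// Hypothetical helper: print the byte count with a size_t format specifier
// and derive the GB figure (1e9 bytes) from the same 64-bit value.
std::string format_allocation(std::size_t bytes) {
    char buffer[64];
    std::snprintf(buffer, sizeof buffer, "%zu b (%.1f GB)",
                  bytes, static_cast<double>(bytes) / 1e9);
    return std::string(buffer);
}
```

For example, `lattice_bytes(512, 800, 256, 200)` yields 20971520000, a value that does not fit in a 32-bit int; the 200 bytes per node is an arbitrary figure chosen for the illustration, not the footprint of any real model.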

@shkodm
Member Author

shkodm commented Jan 17, 2024

@llaniewski it is probably a different bug, but some things still don't work as expected (the same happens on the master branch). I run on 2 V100 GPUs on Bunya, each with 80GB; my case is large, so I split it between the 2.
I get:
Cumulative allocation of 63.GB)
and then
an illegal memory access was encountered in Lattice.hpp at line 279

The error is the same even if I try to split between 3 GPUs (40GB each, so plenty of space even if there is some unaccounted memory).

@llaniewski
Member

@shkodm Just to clarify, do these large cases run on the master branch?

@shkodm
Member Author

shkodm commented Jan 17, 2024

@llaniewski no, they also don't work on the master branch. The error happens in Lattice.cu at the same place (CUDA kernel synchronisation). I ran with the d3q27_pf_velocity model.

@llaniewski
Member

Closing this issue and moving the discussion of the size limitation to #499. Addressing it is a bigger task and would need performance testing.
