This repository was archived by the owner on Dec 16, 2022. It is now read-only.
Hi,
Are you saying that there are other people on the same machine using the GPUs at the same time?
I am currently trying to train an ELMo model using the AllenNLP package. Executing the training command leads to a CUDA out-of-memory error. The full dataset I am trying to train on is ~2 million clinical notes. A toy run on 10 rows previously executed without error.
I tried setting `batch_size=1` and `max_instances_in_memory=2` to reduce memory usage, but the same issue persists. When I increased to `batch_size=2`, the amount of memory that PyTorch tries to allocate remained the same, so I'm unsure whether reducing the batch size can help.
I have 4 GPUs on the server that I'm working on. I tried both distributed and non-distributed training, but the same issue persists. Even with distributed training, PyTorch tries to allocate the same amount of memory. (Intuitively, I thought that distributing across 2 devices would halve the memory usage too.)
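One possible reading of the unchanged allocation, offered as rough arithmetic rather than a diagnosis: in data-parallel training each process keeps a full replica of the weights, gradients, and optimizer state, and only the batch-dependent activations shrink when the batch is split or reduced. The sketch below illustrates that split. The parameter count, the two optimizer states per parameter (Adam-style), and the per-example activation size are all hypothetical placeholders, not measurements from this model.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def estimate_gpu_bytes(n_params, batch_size, act_bytes_per_example,
                       bytes_per_param=4, optimizer_states=2):
    """Rough per-GPU memory estimate for one data-parallel replica."""
    # Fixed cost: weights + gradients + optimizer states, replicated on every GPU.
    fixed = n_params * bytes_per_param * (2 + optimizer_states)
    # Variable cost: activations, which scale with the per-GPU batch size.
    variable = batch_size * act_bytes_per_example
    return fixed + variable


# Hypothetical numbers, chosen only to show the shape of the trade-off.
N_PARAMS = 100_000_000
ACT_BYTES_PER_EXAMPLE = 50 * MIB

for batch_size in (1, 2):
    total = estimate_gpu_bytes(N_PARAMS, batch_size, ACT_BYTES_PER_EXAMPLE)
    print(f"batch_size={batch_size}: ~{total / GIB:.2f} GiB per GPU")
```

Under these assumptions the fixed term dominates at very small batch sizes, so going from `batch_size=2` to `batch_size=1`, or adding a second device, would barely move the per-GPU total.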
I am currently using a customised DatasetReader to read my serialised files, which are stored in pkl format as lists. The DatasetReader is not set to be lazy; I tried implementing that but did not see any improvement.
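For reference, a minimal sketch of what lazy reading over such files could look like in plain Python. It assumes the data is split across several `.pkl` files that each hold one list of records; `iter_records` is a hypothetical helper, not part of AllenNLP, and the actual DatasetReader would still need to turn each record into an instance.

```python
import pickle
from pathlib import Path
from typing import Any, Iterable, Iterator


def iter_records(paths: Iterable[Path]) -> Iterator[Any]:
    """Yield records one at a time, loading a single pickle file at once."""
    for path in paths:
        with open(path, "rb") as fh:
            # Assumption: each file contains one pickled list of records.
            records = pickle.load(fh)
        # Only this file's list is held in memory while its records are yielded.
        yield from records
```

A call such as `iter_records(sorted(Path("data").glob("*.pkl")))` (the `data` directory is a placeholder) would stream the notes file by file. This bounds host (CPU) memory only; on its own it would not be expected to change how much GPU memory is allocated, which may be why switching to lazy reading showed no difference.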
Before the CUDA out-of-memory error appears, the program runs for about a minute and then terminates. In the logs I was able to see `Worker 0 memory usage: 4.5G` with `GPU 0 memory usage: 3.6G`. Does this mean that there is insufficient GPU memory? However, I've read online about people training ELMo with Tesla V100/P100 too. How can I optimise my code?
I am wondering whether this is an issue with my implementation or with the available GPU memory. As seen in the `nvidia-smi` log, there is memory left on certain GPUs, but when I use them the memory is reported as "allocated by PyTorch" because other users are using the GPUs. Any suggestions on how I can tackle this issue?
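Since the GPUs are shared, one thing that might help is checking free memory per device right before launching. The sketch below parses the CSV output of `nvidia-smi --query-gpu=index,memory.free --format=csv,noheader,nounits` and picks the device with the most free memory. The helper names are hypothetical, and the values are whatever `nvidia-smi` reports (MiB when `nounits` is used).

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.free",
    "--format=csv,noheader,nounits",
]


def parse_free_memory(csv_text):
    """Map GPU index -> free memory (MiB) from 'index, free' CSV lines."""
    free = {}
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue  # skip blank lines
        index, mib = (field.strip() for field in line.split(","))
        free[int(index)] = int(mib)
    return free


def freest_gpu(free_by_index):
    """Return the index of the GPU with the most free memory."""
    return max(free_by_index, key=free_by_index.get)


def query_free_memory():
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_free_memory(result.stdout)
```

The index returned by `freest_gpu(query_free_memory())` could then be exported as `CUDA_VISIBLE_DEVICES` before starting training. On a shared machine this is only a snapshot: another user's job can claim the memory between the query and the launch.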