CUDA Out of Memory with Batch Size=1 #5603

gohjiayi · 2022-03-22T04:50:30Z

I am trying to train an ELMo model using the AllenNLP package. Running the training command leads to a CUDA out-of-memory error. The full dataset is ~2 million clinical notes. A toy run on 10 rows previously completed without error.

I tried setting batch_size=1 and max_instances_in_memory=2 to reduce memory usage, but the issue persists. When I increased to batch_size=2, the amount of memory that PyTorch tries to allocate stayed the same, so I'm unsure whether reducing the batch size helps at all.

The server I'm working on has 4 GPUs. I tried both distributed and non-distributed training, but the issue persists. Even with distributed training, PyTorch tries to allocate the same amount of memory. (Intuitively, I expected that distributing across 2 devices would halve the memory usage.)

I am using a customised DatasetReader to read my serialised files, which are stored as lists in pkl format. The DatasetReader is not set to be lazy; I tried implementing lazy reading but did not see any improvement.
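
For reference, a minimal sketch of what lazy reading over pickled lists could look like, using only the standard library. The function name `iter_records` and the one-list-per-file layout are assumptions based on the description above; this is not AllenNLP API. Laziness of this kind limits host RAM, so it would not be expected to change how much GPU memory a single batch needs.

```python
import pickle
from pathlib import Path
from typing import Any, Iterable, Iterator


def iter_records(paths: Iterable[Path]) -> Iterator[Any]:
    """Yield records one at a time from pickle files that each hold a list.

    Files are loaded one by one as the iterator advances, rather than
    all up front.
    """
    for path in paths:
        with open(path, "rb") as handle:
            records = pickle.load(handle)  # assumed to be a list of notes
        for record in records:
            yield record
```

A reader built this way would call `iter_records(sorted(Path("data").glob("*.pkl")))` (the `data` directory is hypothetical) and convert each record to an instance as it arrives.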

Before the CUDA out-of-memory error appears, the program runs for a minute and then terminates. The logs report Worker 0 memory usage: 4.5G and GPU 0 memory usage: 3.6G. Does this mean there is insufficient GPU memory? I've read online about people training ELMo on Tesla V100/P100 GPUs too. How can I optimise my code?

Is this an issue with my implementation or with the available GPU memory? As seen in the nvidia-smi log, some GPUs have memory left, but when I use them the memory is reported as "allocated by PyTorch" because other users are using the GPU. Any suggestions on how I can tackle this?

  File "/home/jiayi/anaconda3/envs/elmo/lib/python3.8/site-packages/allennlp/training/gradient_descent_trainer.py", line 793, in _try_train
    train_metrics = self._train_epoch(epoch)
  File "/home/jiayi/anaconda3/envs/elmo/lib/python3.8/site-packages/allennlp/training/gradient_descent_trainer.py", line 529, in _train_epoch
    MixedPrecisionBackwardCallback(self._serialization_dir).on_backward(
  File "/home/jiayi/anaconda3/envs/elmo/lib/python3.8/site-packages/allennlp/training/callbacks/backward.py", line 25, in on_backward
    trainer._scaler.scale(batch_outputs["loss"]).backward()  # type: ignore
  File "/home/jiayi/anaconda3/envs/elmo/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/jiayi/anaconda3/envs/elmo/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA out of memory. Tried to allocate 1.67 GiB (GPU 2; 15.90 GiB total capacity; 13.27 GiB already allocated; 693.75 MiB free; 13.84 GiB reserved in total by PyTorch)
loading instances: 39it [00:09,  4.13it/s]
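
The figures in the RuntimeError above can be broken down with a short script. This is a sketch only: the regular expression is written against the exact message shown here, and reading "free" as the device-wide free memory at the moment of failure is an assumption about this PyTorch version.

```python
import re

MESSAGE = (
    "CUDA out of memory. Tried to allocate 1.67 GiB (GPU 2; 15.90 GiB total "
    "capacity; 13.27 GiB already allocated; 693.75 MiB free; 13.84 GiB "
    "reserved in total by PyTorch)"
)

UNIT_TO_GIB = {"MiB": 1 / 1024, "GiB": 1.0}


def parse_oom(message: str) -> dict:
    """Extract the GPU index and the sizes (converted to GiB) from the message."""
    pattern = (
        r"Tried to allocate (?P<requested>[\d.]+ [MG]iB) "
        r"\(GPU (?P<gpu>\d+); (?P<total>[\d.]+ [MG]iB) total capacity; "
        r"(?P<allocated>[\d.]+ [MG]iB) already allocated; "
        r"(?P<free>[\d.]+ [MG]iB) free; "
        r"(?P<reserved>[\d.]+ [MG]iB) reserved"
    )
    match = re.search(pattern, message)
    if match is None:
        raise ValueError("message does not match the expected format")
    sizes = {}
    for key, text in match.groupdict().items():
        if key == "gpu":
            sizes[key] = int(text)
        else:
            value, unit = text.split()
            sizes[key] = float(value) * UNIT_TO_GIB[unit]
    return sizes


info = parse_oom(MESSAGE)
# How far the request exceeds what the device reported as free.
shortfall = info["requested"] - info["free"]
# Memory reserved by this process's allocator but not currently in use.
cached = info["reserved"] - info["allocated"]
# Memory in use on the device that this process's allocator does not hold.
outside_allocator = info["total"] - info["free"] - info["reserved"]

print(f"GPU {info['gpu']}: short by {shortfall:.2f} GiB, "
      f"{cached:.2f} GiB cached, {outside_allocator:.2f} GiB held elsewhere")
```

Under those assumptions, the request exceeds the reported free memory by about 1 GiB, and about 1.4 GiB of the card is in use outside this process's allocator.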

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05    Driver Version: 455.23.05    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   28C    P0    30W / 250W |    512MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  On   | 00000000:5E:00.0 Off |                    0 |
| N/A   40C    P0   108W / 250W |   5641MiB / 16280MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100S-PCI...  On   | 00000000:AF:00.0 Off |                    0 |
| N/A   32C    P0    36W / 250W |  29251MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  On   | 00000000:D8:00.0 Off |                    0 |
| N/A   62C    P0   187W / 250W |  26755MiB / 32510MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
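
The table above can be reduced to free memory per GPU with a small helper. This is a sketch: `free_mib_per_gpu` and `query_free_mib` are hypothetical names, and the sample values are copied from the table rather than queried live. The `--query-gpu` and `--format` options are standard nvidia-smi flags.

```python
import subprocess
from typing import Dict

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def free_mib_per_gpu(csv_text: str) -> Dict[int, int]:
    """Map GPU index to free MiB, given 'index, used, total' CSV lines."""
    free = {}
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field) for field in line.split(","))
        free[index] = total - used
    return free


def query_free_mib() -> Dict[int, int]:
    """Run nvidia-smi and return free MiB per GPU (needs an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return free_mib_per_gpu(result.stdout)


# Used/total values copied from the nvidia-smi table above.
SAMPLE = "0, 512, 16280\n1, 5641, 16280\n2, 29251, 32510\n3, 26755, 32510\n"
free = free_mib_per_gpu(SAMPLE)
best = max(free, key=free.get)
print(free, "-> most free memory on GPU", best)
```

One caveat when acting on the result: nvidia-smi numbers GPUs in PCI bus order, while CUDA by default orders them fastest first, so a PyTorch device index may refer to a different card. The traceback's "GPU 2; 15.90 GiB total capacity" matches a 16280 MiB card, whereas index 2 in this table is a 32510 MiB card. Setting `CUDA_DEVICE_ORDER=PCI_BUS_ID` together with `CUDA_VISIBLE_DEVICES` before the process starts should make the two numberings agree.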

0 | 2022-03-22 12:34:24,628 - INFO - allennlp.training.gradient_descent_trainer - Worker 0 memory usage: 4.5G
0 | 2022-03-22 12:34:24,628 - INFO - allennlp.training.gradient_descent_trainer - Worker 1 memory usage: 4.5G
/home/jiayi/anaconda3/envs/elmo/lib/python3.8/site-packages/torch/cuda/memory.py:260: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
  warnings.warn(
/home/jiayi/anaconda3/envs/elmo/lib/python3.8/site-packages/torch/cuda/memory.py:260: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
  warnings.warn(
0 | 2022-03-22 12:34:24,638 - INFO - allennlp.training.gradient_descent_trainer - GPU 0 memory usage: 3.6G
0 | 2022-03-22 12:34:24,638 - INFO - allennlp.training.gradient_descent_trainer - GPU 1 memory usage: 3.6G

epwalsh · 2022-03-24T17:06:07Z

Hi,

> As seen in the nvidia-smi log, some GPUs have memory left, but when I use them the memory is reported as "allocated by PyTorch" because other users are using the GPU.

Are you saying that there are other people on the same machine using the GPUs at the same time?

3 replies

gohjiayi Mar 24, 2022
Author

Hi there, yes. I am working on a remote server whose GPUs are shared by multiple users. At the time I captured the nvidia-smi log, most of the processes belonged to other users.

epwalsh Mar 24, 2022

That's probably the issue then. I'd suggest you coordinate with the other users to find a time when no one else is using the GPUs.

gohjiayi Mar 24, 2022
Author

Thank you for your insights. I will let you know whether that works out.
