
Commit ff7305f

[bugfix] Prevent a DDP failure using copy (#9239)
1 parent 3e71046 commit ff7305f

3 files changed: 14 additions & 5 deletions


CHANGELOG.md

Lines changed: 4 additions & 0 deletions
```diff
@@ -273,9 +273,13 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 
 - Fixed not setting a default value for `max_epochs` if `max_time` was specified on the `Trainer` constructor ([#9072](https://github.com/PyTorchLightning/pytorch-lightning/pull/9072))
 
+
 - Fixed the CometLogger, no longer modifies the metrics in place. Instead creates a copy of metrics before performing any operations ([#9150](https://github.com/PyTorchLightning/pytorch-lightning/pull/9150))
 
 
+- Fixed `DDP` "CUDA error: initialization error" due to a `copy` instead of `deepcopy` on `ResultCollection` ([#9239](https://github.com/PyTorchLightning/pytorch-lightning/pull/9239))
+
+
 ## [1.4.3] - 2021-08-17
 
 - Fixed plateau scheduler stepping on incomplete epoch ([#8861](https://github.com/PyTorchLightning/pytorch-lightning/pull/8861))
```
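A quick illustration of why the changelog entry above distinguishes `copy` from `deepcopy`: a shallow copy of a container is a new container, but it still references the very same tensor objects, so any state tied to those tensors stays shared with the original. This is a minimal standalone sketch using a plain dict as a stand-in for `ResultCollection` (it does not touch any Lightning code):

```python
import copy

import torch

# A plain dict of tensors stands in for the metrics held by a result collection.
results = {"loss": torch.tensor(0.25), "acc": torch.tensor(0.9)}

shallow = copy.copy(results)
deep = copy.deepcopy(results)

# The shallow copy is a new dict, but it points at the exact same tensor objects,
# so anything attached to those tensors is still shared with the original.
assert shallow["loss"] is results["loss"]

# The deep copy clones the tensors as well, giving fully independent storage.
assert deep["loss"] is not results["loss"]
assert torch.equal(deep["loss"], results["loss"])
```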

pytorch_lightning/loops/batch/training_batch_loop.py

Lines changed: 3 additions & 3 deletions
```diff
@@ -11,7 +11,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-from copy import copy
+from copy import deepcopy
 from functools import partial
 from typing import Any, Callable, Dict, List, Optional, Tuple
 
@@ -142,12 +142,12 @@ def advance(self, batch, batch_idx):
 
                 result = self._run_optimization(batch_idx, split_batch, opt_idx, optimizer)
                 if result:
-                    self.batch_outputs[opt_idx].append(copy(result.result_collection))
+                    self.batch_outputs[opt_idx].append(deepcopy(result.result_collection))
         else:
             # in manual optimization, there is no looping over optimizers
             result = self._run_optimization(batch_idx, split_batch)
             if result:
-                self.batch_outputs[0].append(copy(result.result_collection))
+                self.batch_outputs[0].append(deepcopy(result.result_collection))
 
     def teardown(self) -> None:
         # release memory
```
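To see why the loop above switches to `deepcopy` when storing per-batch outputs, note that a shallow copy of a mutable results object still shares its internal containers, so later in-place logging silently rewrites what was already appended. The sketch below uses a hypothetical `FakeResults` class as a stand-in; it is not the real `ResultCollection` API:

```python
from copy import copy, deepcopy

import torch


class FakeResults:
    """Hypothetical stand-in for a mutable per-step results container."""

    def __init__(self) -> None:
        self.metrics = {}

    def log(self, name: str, value: torch.Tensor) -> None:
        self.metrics[name] = value


live = FakeResults()
outputs_shallow, outputs_deep = [], []

for batch_idx in range(2):
    live.log("loss", torch.tensor(float(batch_idx)))
    outputs_shallow.append(copy(live))      # shares `live.metrics`
    outputs_deep.append(deepcopy(live))     # independent snapshot

# The shallow copies all point at the same dict, so batch 0's "snapshot"
# now shows batch 1's loss; the deep copies keep the original values.
print(outputs_shallow[0].metrics["loss"])  # tensor(1.)
print(outputs_deep[0].metrics["loss"])     # tensor(0.)
```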

pytorch_lightning/trainer/connectors/logger_connector/result.py

Lines changed: 7 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -17,6 +17,7 @@
1717
from typing import Any, Callable, Dict, List, Mapping, Optional, Tuple, Union
1818

1919
import torch
20+
from torch.functional import Tensor
2021
from torchmetrics import Metric
2122

2223
from pytorch_lightning.core.mixins import DeviceDtypeModuleMixin
@@ -435,8 +436,12 @@ def log(
435436
) -> None:
436437
"""See :meth:`~pytorch_lightning.core.lightning.LightningModule.log`"""
437438
# no metrics should be logged with graphs
438-
if not enable_graph and isinstance(value, torch.Tensor):
439-
value = value.detach()
439+
if not enable_graph:
440+
441+
def detach_fn(tensor: Tensor) -> Tensor:
442+
return tensor.detach()
443+
444+
value = apply_to_collection(value, Tensor, detach_fn)
440445

441446
# move metrics to cpu on TPU.
442447
if isinstance(value, torch.Tensor) and value.device.type == "xla":
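The second hunk generalizes the detach step: the previous code only detached a bare top-level tensor, whereas `apply_to_collection` also reaches tensors nested inside dicts, lists, and tuples. Here is a small sketch of that behavior; it assumes the `apply_to_collection` import path used by PyTorch Lightning 1.4.x (`pytorch_lightning.utilities.apply_func`), which may differ in other versions:

```python
import torch
from pytorch_lightning.utilities.apply_func import apply_to_collection

x = torch.ones(2, requires_grad=True)
logged = {"loss": (x * 2).sum(), "extra": [x.mean(), 3]}  # tensors nested in a dict/list

# Apply `.detach()` to every tensor found anywhere in the collection;
# non-tensor leaves (like the int 3) are passed through untouched.
detached = apply_to_collection(logged, torch.Tensor, lambda t: t.detach())

assert detached["loss"].grad_fn is None
assert detached["extra"][0].grad_fn is None
assert detached["extra"][1] == 3

# The old `isinstance(value, torch.Tensor)` check would have skipped this dict
# entirely, leaving the nested tensors attached to the autograd graph.
assert logged["loss"].grad_fn is not None
```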
