Logging synchronization and confusion matrix log #21475
Unanswered
scy-helbling asked this question in DDP / multi-GPU / multi-node
Replies: 0 comments
I am running distributed training on 2 GPUs and want to use auto-synchronized torchmetrics for logging, as well as a confusion matrix.
Single-script minimal example:
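For context, here is a minimal sketch of the kind of setup being described. This is a reconstruction under assumptions, not the original script: the module, metric choices, and dummy data are placeholders, and the comments marking problem lines A and B reflect my reading of the description below.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

import pytorch_lightning as pl
from torchmetrics import Accuracy
from torchmetrics.classification import MulticlassConfusionMatrix


class LitClassifier(pl.LightningModule):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.model = nn.Linear(32, num_classes)
        # TorchMetric whose epoch value Lightning aggregates across ranks
        self.val_acc = Accuracy(task="multiclass", num_classes=num_classes)
        self.confmat = MulticlassConfusionMatrix(num_classes=num_classes)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.cross_entropy(self.model(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self.model(x)
        self.val_acc(logits, y)
        # Problem line A: epoch-level logging of the auto-synced TorchMetric
        self.log("val_acc", self.val_acc, on_step=False, on_epoch=True)
        self.confmat.update(logits, y)

    def on_validation_epoch_end(self):
        # Problem line B: .plot() calls .compute() internally, which
        # synchronizes the confusion-matrix state across processes
        fig, ax = self.confmat.plot()
        self.confmat.reset()

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == "__main__":
    x, y = torch.randn(256, 32), torch.randint(0, 10, (256,))
    loader = DataLoader(TensorDataset(x, y), batch_size=32)
    trainer = pl.Trainer(max_epochs=2, accelerator="gpu", devices=2, strategy="ddp")
    trainer.fit(LitClassifier(), train_dataloaders=loader, val_dataloaders=loader)
```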
When one of the two problem lines (A or B) is commented out, the script runs fine. However, when both problem lines A and B are uncommented, training halts at the end of the first epoch, presumably due to a race condition between the two different synchronization systems at work: one for the TorchMetric epoch aggregation and one for the confusion matrix's .compute() (which I guess is called internally when I use .plot()).

Versions and environment: