
Commit dab48e4

yunjiangster authored and facebook-github-bot committed
Avoid nan by using zeros instead of empty dummy tensor (#2648)
Summary:
Pull Request resolved: #2648

This appears to solve the nan issue that shows up when we enable `torch.autograd.set_detect_anomaly(True)`.

The error below appears non-deterministically after a few training steps, presumably because `self._dummy_tensor` can start out very large but not nan, and reaches `nan` after a few iterations.

```
RuntimeError: Function 'All2All_Seq_Req_WaitBackward' returned nan values in its 0th output; num_outputs = 1; num_inputs = 0; outputs[0].shape = [1, ]; outputs[i] = nan [ torch.cuda.FloatTensor{1} ]
```

Reviewed By: iamzainhuda

Differential Revision: D67535635

fbshipit-source-id: 50a70163afa8d17b3ed6f6c59c118315193c9839
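For context, a minimal sketch (not part of the commit message) of the anomaly-detection setup the summary refers to: with `torch.autograd.set_detect_anomaly(True)`, autograd raises as soon as a backward function produces nan, which is how the `All2All_Seq_Req_WaitBackward` failure above surfaces. The tensor value here is a hypothetical stand-in for a dummy tensor whose uninitialized memory has drifted to nan.

```python
import torch

# Enable anomaly detection so nan outputs of backward functions raise immediately.
torch.autograd.set_detect_anomaly(True)

# Stand-in for a dummy tensor whose uninitialized memory has become nan.
x = torch.tensor([float("nan")], requires_grad=True)
y = (x * x).sum()

# Raises: RuntimeError: Function 'MulBackward0' returned nan values in its 0th output.
y.backward()
```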
1 parent 1f0681e commit dab48e4

File tree

1 file changed: +7 -4 lines changed


torchrec/distributed/comm_ops.py

+7 -4
```diff
@@ -107,10 +107,13 @@ def __init__(self, pg: dist.ProcessGroup, device: torch.device) -> None:
         # This dummy tensor is used to build the autograd graph between
         # CommOp-Req and CommOp-Await. The actual forward tensors, and backwards gradient tensors
         # are stored in self.tensor
-        self.dummy_tensor: torch.Tensor = torch.empty(
-            1,
-            requires_grad=True,
-            device=device,
+        # torch.zeros is a call_function, not placeholder, hence fx.trace incompatible.
+        self.dummy_tensor: torch.Tensor = torch.zeros_like(
+            torch.empty(
+                1,
+                requires_grad=True,
+                device=device,
+            )
         )
 
     def _wait_impl(self) -> W:
```
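As a rough illustration of why the change helps (my own sketch, not code from the patch): `torch.empty` leaves the tensor's memory uninitialized, so the dummy value can be arbitrarily large, inf, or already nan on any given run, whereas zeroing it guarantees a deterministic, finite starting value.

```python
import torch

# torch.empty does not initialize memory: the value is whatever bytes happened
# to be there, and may be huge, inf, or nan on any given run.
uninitialized = torch.empty(1)

# Zeroing the same kind of allocation gives a deterministic starting value.
zeroed = torch.zeros_like(torch.empty(1))

print(uninitialized)  # run-dependent, e.g. tensor([4.2039e-45]) or tensor([nan])
print(zeroed)         # always tensor([0.])
```

The patch routes through `torch.zeros_like(torch.empty(...))` rather than a plain `torch.zeros(...)` for the fx-tracing reason noted in the in-diff comment.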
