[einsum] Call view instead of sum to remediate MPS regression (pytorch#87135)

janeyx99 · pytorchmergebot · commit 80790ecee4f0 · 2022-10-18T23:01:28.000Z
Fixes pytorch#87010. It turns out that squeeze is much faster than sum, and view is faster than squeeze, so we should default to that whenever possible. Benchmarking results show that, on MPS, we would be going from the following code taking **29.89ms instead of the current 1466ms, almost a 50x speedup**. ``` q = torch.rand(16, 4096, 40, device='mps', dtype=torch.float) k = torch.rand(16, 4096, 40, device='mps', dtype=torch.float) torch.einsum('b i d, b j d -> b i j', q, k).max().item() ``` And a regular einsum will now take **.506ms instead of 2.76ms.** ``` q = torch.rand(16, 4096, 40, device='mps', dtype=torch.float) k = torch.rand(16, 4096, 40, device='mps', dtype=torch.float) torch.einsum('b i d, b j d -> b i j', q, k) ``` Special thanks to @soulitzer for helping me experiment + figure out how to squash the remaining 5x regression due to squeeze being slower than view!! Pull Request resolved: pytorch#87135 Approved by: https://github.com/soulitzer, https://github.com/malfet, https://github.com/albanD
diff --git a/aten/src/ATen/native/Linear.cpp b/aten/src/ATen/native/Linear.cpp
@@ -545,9 +545,17 @@ Tensor einsum(c10::string_view equation, TensorList operands, at::OptionalIntArr
 
   // Sum out contraction dims
   if (perm_index - out_num_dim > 0) {
-    std::vector<int64_t> sum_dims(perm_index - out_num_dim);
-    std::iota(sum_dims.begin(), sum_dims.end(), out_num_dim);
-    ops[0] = ops[0].sum(sum_dims);
+    if (num_ops > 1) {
+      auto sizes = ops[0].sym_sizes().vec();
+      for (auto dim = perm_index - 1; dim >= out_num_dim; --dim) {
+        sizes.erase(sizes.begin() + dim);
+      }
+      return ops[0].view_symint(sizes);
+    } else {
+      std::vector<int64_t> sum_dims(perm_index - out_num_dim);
+      std::iota(sum_dims.begin(), sum_dims.end(), out_num_dim);
+      return ops[0].sum(sum_dims);
+    }
   }
 
   return ops[0];