
Commit 6061664

iupaikov-amd authored and pytorchmergebot committed
Enabled force_shape_pad for triton tests in test_kernel_benchmark (pytorch#147620)
During ROCm runs these tests naturally show that the padding path is slower on our architectures, so pad_mm opts out of padding and the tests fail. As I understand it, these tests are not meant to check WHETHER an operation should be padded in the first place, but HOW it is padded and whether that is done correctly. Moreover, the tests should not be hardware dependent or carry hardware-specific conditions.

Similar PR for reference: pytorch#141768

Pull Request resolved: pytorch#147620
Approved by: https://github.com/jeffdaily, https://github.com/chenyang78, https://github.com/shunting314
1 parent 651e6aa commit 6061664
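For context, here is a minimal sketch of what the patched decorator does: with force_shape_pad=True, Inductor's pad_mm always takes the padding path instead of first asking whether padding is profitable on the current hardware. The config knobs (config.patch, max_autotune, max_autotune_gemm_backends, force_shape_pad) are the ones touched by this commit; the run_padded_mm helper and the K/N sizes are hypothetical (only M = 2048 appears in the diff below), and this is not the benchmark test itself.

    # Sketch only, under the assumptions stated above: a CUDA/ROCm build of
    # PyTorch with Inductor and Triton available.
    import torch
    from torch._inductor import config


    @config.patch(
        max_autotune=True, max_autotune_gemm_backends="TRITON", force_shape_pad=True
    )
    def run_padded_mm():
        # Compile a plain matmul so Inductor's pad_mm pass can rewrite it;
        # force_shape_pad skips the "is padding worth it here?" heuristic.
        @torch.compile
        def f(a, b):
            return a @ b

        a = torch.randn(2048, 1000, device="cuda", dtype=torch.float16)  # hypothetical K
        b = torch.randn(1000, 5120, device="cuda", dtype=torch.float16)  # hypothetical N
        return f(a, b)


    if __name__ == "__main__":
        run_padded_mm()

Using config.patch as a decorator, as the tests below do, scopes the settings to a single callable, so other tests in the same process keep the default padding heuristic.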

File tree

1 file changed: +6 -2 lines changed


test/inductor/test_kernel_benchmark.py (+6 -2)
@@ -137,7 +137,9 @@ def f(x):
     # TODO: Currently the Triton mm template + relu fusion causes slowdown on XPU,
     # Need to refine the template and config for XPU.
     @expectedFailureXPU
-    @config.patch(max_autotune=True, max_autotune_gemm_backends="TRITON")
+    @config.patch(
+        max_autotune=True, max_autotune_gemm_backends="TRITON", force_shape_pad=True
+    )
     @fresh_inductor_cache()
     def test_matmul_triton_kernel_benchmark(self):
         M = 12544
@@ -153,7 +155,9 @@ def f(a, b):
         f(a, b)
         self.verify_compiled_kernels()

-    @config.patch(max_autotune=True, max_autotune_gemm_backends="TRITON")
+    @config.patch(
+        max_autotune=True, max_autotune_gemm_backends="TRITON", force_shape_pad=True
+    )
     @fresh_inductor_cache()
     def test_mm_triton_kernel_benchmark(self):
         M = 2048
