
draft ukernel selection logic #1652 (Draft)

metascroy wants to merge 4 commits into main

Conversation

@metascroy (Contributor) commented Feb 3, 2025:

This is a draft of ukernel selection logic based on cpu_info.

This relates to #1376
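For context, a minimal sketch of what cpuinfo-driven selection could look like (the get_kleidi_ai_config / get_universal_config helpers are hypothetical placeholders; only the cpuinfo calls are real API):

#include <cpuinfo.h>

// Minimal sketch, not the PR's implementation: choose a ukernel config from
// runtime cpu features. get_kleidi_ai_config() / get_universal_config() are
// hypothetical stand-ins for whatever kernel_selector.h ends up exposing.
UKernelConfig select_ukernel_config() {
  TORCHAO_CHECK(cpuinfo_initialize(), "cpuinfo initialization failed");
  TORCHAO_CHECK(cpuinfo_has_arm_neon_dot(),
      "these ukernels assume Arm dot-product support");

#if defined(TORCHAO_ENABLE_KLEIDI)
  // Prefer KleidiAI kernels when they are compiled in.
  return get_kleidi_ai_config();
#else
  return get_universal_config();
#endif
}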

@metascroy requested a review from @digantdesai on February 3, 2025 03:44

pytorch-bot bot commented Feb 3, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1652

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 1 New Failure

As of commit 7a43be4 with merge base 7815262:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Feb 3, 2025
@@ -98,7 +98,7 @@ LinearTilingParams get_default_linear_tiling_params(
TORCHAO_CHECK(num_threads >= 1, "num_threads must be >= 1");

tiling_params.mc_by_mr = 1;
- int mc = tiling_params.mc_by_mr * ukernel_config.mr;
+ int mc = tiling_params.mc_by_mr * ukernel_config.kernels[0].mr;
@metascroy (author):

ukernel_config now includes an array of kernels indexed by mr. Still need to add mr selection logic here; for now it just selects the first one.
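For what it's worth, one possible shape for that selection logic (a sketch only; the policy of picking the largest mr that does not exceed m is an assumption, and field names follow the UKernelConfig in this PR):

// Hypothetical mr-selection policy, not part of this PR: use the registered
// kernel with the largest mr that still fits m, falling back to kernels[0]
// when m is smaller than every registered mr.
int select_kernel_idx(const UKernelConfig& config, int m) {
  int best = 0;
  for (int i = 0; i < static_cast<int>(config.kernels.size()); i++) {
    int mr = config.kernels[i].mr;
    if (mr == 0) break;  // unused trailing slots
    if (mr <= m && mr > config.kernels[best].mr) best = i;
  }
  return best;
}

// e.g. int mc = tiling_params.mc_by_mr * config.kernels[select_kernel_idx(config, m)].mr;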

static UKernelConfigCacheType ukernel_config_cache;

// Check cache
auto it = ukernel_config_cache.find(header);
@metascroy (author):

If we want uarch specific kernel per core, we can add uarch to cache key and look up uarch before looking in cache, e.g.,

auto uarch = get_current_core_uarch();
auto it = ukernel_config_cache.find({header, uarch});
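Concretely, that might look something like the following (a sketch; cpuinfo_get_current_uarch_index() is real cpuinfo API, while the pair-keyed cache, its hash, and the header type name are illustrative):

// Illustrative only: key the config cache on (packed-weights header, uarch
// index of the core currently executing), so big and little cores can be
// given different kernels.
using UKernelConfigKey = std::pair<PackedWeightsHeader, uint32_t>;

UKernelConfigKey current_key(const PackedWeightsHeader& header) {
  // cpuinfo must already be initialized; returns the uarch of the calling core.
  return {header, cpuinfo_get_current_uarch_index()};
}

auto it = ukernel_config_cache.find(current_key(header));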

@@ -22,7 +22,7 @@ if(NOT TORCHAO_INCLUDE_DIRS)
set(TORCHAO_INCLUDE_DIRS ${CMAKE_CURRENT_SOURCE_DIR}/../..)
endif()

-option(TORCHAO_BUILD_KLEIDIAI "Download, build, and link against Arm KleidiAI library (arm64 only)" OFF)
+option(TORCHAO_BUILD_KLEIDIAI "Download, build, and link against Arm KleidiAI library (arm64 only)" ON)
@metascroy (author):

TODO: nocommit

# print(f"actual_val={actual_val}, expected_val={expected_val}")
# self.assertTrue(torch.allclose(actual_val, expected_val, atol=1e-6))

self.assertTrue(torch.abs(actual_val - expected_val) < 0.05)
@metascroy (author):

Do not commit this change. It is needed because kleidi uses bf16 instead of fp32.

0});
}

struct KleidiAIPackingParams {
@metascroy (author):

TODO: check if these packing params are sufficient for all kleidi.
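For reference, the discussion elsewhere in this thread indexes packing by nr/kr/sr, so a starting point might be the following (the exact field set is a guess, which is what the TODO is about):

// Hypothetical field set; whether this is sufficient for every KleidiAI
// kernel is the open question in the TODO above.
struct KleidiAIPackingParams {
  int nr;  // output-channel tile width used by weight packing
  int kr;  // k-dimension packing factor
  int sr;  // additional split factor used by the packing routine
};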

ukernel_config_cache[key] = torchao::ops::linear_8bit_act_xbit_weight::UKernelConfig{
/*preferred_alignment*/16,
/*weight_packing*/
{
@metascroy (author):

We can rework the kleidiai integration to share weight packing, rather than repeating it in each namespace.

@digantdesai:

It is shared in code, but exposed along with the kernel so you don't have to map it back to the kernel at call sites.

@metascroy (author):

It is in shared code, but not in a way that is convenient to access with shared mr kernels because the same packing function (indexed by nr, kr, sr) is given 4 different names (based on namespace).

So we could refactor it to expose one packing function in kai_matmul_clamp_f32_qai8dxp_qsi4c32p, rather than having copies in the more specific namespaces?
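In other words, something along these lines (a hypothetical refactor sketch; the accessor name is not KleidiAI API):

// One packing accessor in the shared kai_matmul_clamp_f32_qai8dxp_qsi4c32p
// helper code, keyed by (nr, kr, sr), that every mr-variant registration
// reuses instead of re-exposing the same routine under four
// namespace-specific names.
namespace kai_matmul_clamp_f32_qai8dxp_qsi4c32p {

weight_packing_config get_weight_packing(int nr, int kr, int sr);

}  // namespace kai_matmul_clamp_f32_qai8dxp_qsi4c32p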

/*kernels*/
{{
{
/*mr*/static_cast<int>(uk.get_m_step()),
@metascroy (author):

List of methods indexed by mr.

@digantdesai (Contributor) left a comment:

Good start. Please also think some more about code organization, with an eye toward supporting many more kernels and scalability in general.

@@ -8,13 +8,23 @@ cmake_minimum_required(VERSION 3.19)

include(${CMAKE_CURRENT_SOURCE_DIR}/../../Utils.cmake)

add_compile_options(-Wno-unused-function -Wno-unused-variable) # For some reason cpuinfo package has unused functions/variables
@digantdesai:

Fix it upstream?

assert (sr == uk.get_sr());

ukernel_config_cache[key] = torchao::ops::linear_8bit_act_xbit_weight::UKernelConfig{
/*preferred_alignment*/16,
@digantdesai:

nit

Suggested change:
- /*preferred_alignment*/16,
+ /*preferred_alignment*/uk.get_preferred_alignment(),

#if defined(TORCHAO_ENABLE_KLEIDI)
if (!target || *target == "kleidi_ai") {
if (weight_nbit == 4 && !has_weight_zeros) {
return torchao::ops::linear_8bit_act_xbit_weight::get_packed_weights_format_kleidi_ai(weight_nbit, has_weight_zeros, /*has_bias*/true, /*nr*/8, /*kr*/16, /*sr*/2);
@digantdesai:

In the future we would have to choose nr based on the cpu type (or make some static choice like this for AOT weight packing), and register kernels per mr, which you are already planning.

@metascroy (author):

Yes, we can use any method in cpuinfo to select packed_weights_format, including any packing params like nr. This is not entirely static because universal is only selected if cpuinfo_has_arm_neon_dot is available. We could also use fields from uarch to select things here I guess?

I wonder if we should pass n and k as params in addition to target. Implementers can then take into account matrix size when selecting nr?
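As a sketch of that idea (the extended signature, the return type name, the universal fallback helper, and the shape-based nr heuristic are all assumptions, not part of this PR):

// Hypothetical: let the format selector see n and k in addition to target,
// so implementers can pick packing params like nr from matrix shape as well
// as from cpuinfo. Only get_packed_weights_format_kleidi_ai comes from the
// PR; the rest is illustrative.
PackedWeightsFormat select_packed_weights_format(
    std::optional<std::string> target,
    int weight_nbit, bool has_weight_zeros, int n, int k) {
#if defined(TORCHAO_ENABLE_KLEIDI)
  if ((!target || *target == "kleidi_ai") && weight_nbit == 4 &&
      !has_weight_zeros && cpuinfo_has_arm_neon_dot()) {
    // Illustrative heuristic: wider nr only when n is large enough to fill it.
    int nr = (n >= 64) ? 8 : 4;
    return torchao::ops::linear_8bit_act_xbit_weight::
        get_packed_weights_format_kleidi_ai(
            weight_nbit, has_weight_zeros, /*has_bias*/true, nr, /*kr*/16, /*sr*/2);
  }
#endif
  (void)k;  // k could feed a similar heuristic
  return get_packed_weights_format_universal(weight_nbit, has_weight_zeros);
}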

// ukernel must behave correctly no matter how buffers are aligned
size_t preferred_alignment{0};
weight_packing_config weight_packing;
std::array<kernel_config, 4> kernels;
@digantdesai:

Nit

Suggested change:
- std::array<kernel_config, 4> kernels;
+ std::array<kernel_config, MAX_MR_TYPES> kernels;

weight_data_size_fn_type weight_data_size_fn{nullptr};
prepare_weight_data_fn_type prepare_weight_data_fn{nullptr};
};
struct kernel_config {
@digantdesai:

It makes sense that you have one packing kernel and N gemm kernels indexed by mr, but the naming makes this confusing to read, i.e. ukernel->kernel[mr].mr.

// preferred_alignment for activation and weight data
// Integration surfaces are not required to respect this alignment, and the
// ukernel must behave correctly no matter how buffers are aligned
size_t preferred_alignment{0};
@digantdesai:

We have to make sure this is the same for all mr variants, i.e. document and test it.

/*kernels*/
{{
{
/*mr*/static_cast<int>(uk.get_m_step()),
@digantdesai:

In the future, when querying for multiple mr values, we should ensure their weight packing function pointers are the same.

@metascroy (author):

This goes to the comment about reworking the kleidiAI integration I guess?
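A minimal sketch of the registration-time check being suggested above (kernels_by_mr and packing_fn_for are hypothetical helpers):

// Illustrative only: when gathering kleidi ukernels for several mr values,
// confirm they all report the same weight packing routine before folding
// them into one UKernelConfig with a single weight_packing entry.
auto reference_packing = packing_fn_for(kernels_by_mr[0]);
for (const auto& uk : kernels_by_mr) {
  TORCHAO_CHECK(packing_fn_for(uk) == reference_packing,
      "all mr variants must share one weight packing function");
}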

@metascroy (author):

> Good start. Please also think some more about code organization, with an eye toward supporting many more kernels and scalability in general.

Let me give it some more thought about breaking some of the code out.

@metascroy (author):

Adding @kimishpatel because he was curious about the PR. kernel_selector.h is the main code to pay attention to for runtime kernel selection.
