draft ukernel selection logic #1652
base: main
Conversation
Dr. CI: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1652. As of commit 7a43be4 with merge base 7815262: 1 new job failure, 1 active SEV. This comment updates every 15 minutes.
@@ -98,7 +98,7 @@ LinearTilingParams get_default_linear_tiling_params(
   TORCHAO_CHECK(num_threads >= 1, "num_threads must be >= 1");

   tiling_params.mc_by_mr = 1;
-  int mc = tiling_params.mc_by_mr * ukernel_config.mr;
+  int mc = tiling_params.mc_by_mr * ukernel_config.kernels[0].mr;
ukernel_config now includes an array of kernels indexed by mr. We still need to add mr selection logic here; for now it just selects the first one.
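A minimal sketch of what that selection could look like, assuming `kernels` is sorted by ascending mr, unused trailing slots are zero-initialized, and `kernel_config` is visible at this scope (all assumptions, not guarantees of this PR):

```cpp
// Hypothetical mr selection: pick the variant with the largest mr that does
// not exceed the row count m, so small m does not pay for a wide kernel;
// fall back to the first (smallest) entry.
inline const kernel_config& select_kernel_for_m(
    const UKernelConfig& ukernel_config, int m) {
  const kernel_config* best = &ukernel_config.kernels[0];
  for (const auto& k : ukernel_config.kernels) {
    if (k.mr == 0) {
      break; // past the last registered variant
    }
    if (k.mr <= m) {
      best = &k; // largest mr that still fits
    }
  }
  return *best;
}
```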
static UKernelConfigCacheType ukernel_config_cache;

// Check cache
auto it = ukernel_config_cache.find(header);
If we want a uarch-specific kernel per core, we can add uarch to the cache key and look up the uarch before checking the cache, e.g.,

auto uarch = get_current_core_uarch();
auto it = ukernel_config_cache.find({header, uarch});
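One hedged way to spell that out with cpuinfo; `get_current_core_uarch()` and `UKernelConfigCacheKey` are assumed helpers that would still have to be written:

```cpp
#include <cpuinfo.h>

// Assumed helper (not an existing API): map the calling core to its uarch.
// Requires cpuinfo_initialize() to have succeeded.
static enum cpuinfo_uarch get_current_core_uarch() {
  return cpuinfo_get_uarch(cpuinfo_get_current_uarch_index())->uarch;
}

// Hypothetical composite cache key; a matching std::hash specialization
// would also be needed for the unordered cache.
struct UKernelConfigCacheKey {
  torchao::ops::PackedWeightsHeader header;
  enum cpuinfo_uarch uarch;
  bool operator==(const UKernelConfigCacheKey& other) const {
    return header == other.header && uarch == other.uarch;
  }
};
```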
@@ -22,7 +22,7 @@ if(NOT TORCHAO_INCLUDE_DIRS)
   set(TORCHAO_INCLUDE_DIRS ${CMAKE_CURRENT_SOURCE_DIR}/../..)
 endif()

-option(TORCHAO_BUILD_KLEIDIAI "Download, build, and link against Arm KleidiAI library (arm64 only)" OFF)
+option(TORCHAO_BUILD_KLEIDIAI "Download, build, and link against Arm KleidiAI library (arm64 only)" ON)
TODO: nocommit
# print(f"actual_val={actual_val}, expected_val={expected_val}") | ||
# self.assertTrue(torch.allclose(actual_val, expected_val, atol=1e-6)) | ||
|
||
self.assertTrue(torch.abs(actual_val - expected_val) < 0.05) |
Do not commit this change. It is needed because KleidiAI uses bf16 instead of fp32.
      0});
}

struct KleidiAIPackingParams {
TODO: check whether these packing params are sufficient for all KleidiAI kernels.
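For reference, the fields one would expect here, inferred from how packing is indexed elsewhere in this PR (an assumption; the real definition may carry more):

```cpp
// Assumed minimal shape of the struct: KleidiAI packing is parameterized by
// (nr, kr, sr) throughout this PR, so at minimum these must round-trip.
struct KleidiAIPackingParams {
  int nr; // output-channel block size used by weight packing
  int kr; // reduction-dim block size used by weight packing
  int sr; // split ratio within a kr block
};
```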
ukernel_config_cache[key] = torchao::ops::linear_8bit_act_xbit_weight::UKernelConfig{
    /*preferred_alignment*/16,
    /*weight_packing*/
    {
We can rework the KleidiAI integration to share weight packing, rather than repeating it in each namespace.
It is shared in code, but exposed along with the kernel so you don't have to map it back to the kernel at call sites.
It is in shared code, but not in a way that is convenient to access with shared mr kernels, because the same packing function (indexed by nr, kr, sr) is given 4 different names (one per namespace).

So we could refactor it into one packing function in kai_matmul_clamp_f32_qai8dxp_qsi4c32p, rather than having copies in the more specific namespaces?
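A sketch of that refactor; the argument list is assumed, not taken from the PR:

```cpp
#include <cstdint>

// Hypothetical: one packing entry point in the shared namespace,
// parameterized by (nr, kr, sr), which the four kernel-specific namespaces
// would forward to instead of each defining their own copy.
namespace kai_matmul_clamp_f32_qai8dxp_qsi4c32p {

void pack_weights(
    int nr, int kr, int sr,          // geometry that currently splits the copies
    void* packed_weights,            // destination buffer
    int n, int k, int group_size,    // matrix and quantization-group shape
    const int8_t* weight_qvals,      // quantized weights
    const float* weight_scales,      // per-group scales
    const float* bias);              // optional bias, may be nullptr

} // namespace kai_matmul_clamp_f32_qai8dxp_qsi4c32p
```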
    /*kernels*/
    {{
      {
        /*mr*/static_cast<int>(uk.get_m_step()),
List of methods indexed by mr.
Good start. Please also think some more about code organization, for taking on many more kernels, and about scalability in general.
@@ -8,13 +8,23 @@ cmake_minimum_required(VERSION 3.19)

 include(${CMAKE_CURRENT_SOURCE_DIR}/../../Utils.cmake)

+add_compile_options(-Wno-unused-function -Wno-unused-variable) # For some reason cpuinfo package has unused functions/variables
Fix it upstream?
assert(sr == uk.get_sr());

ukernel_config_cache[key] = torchao::ops::linear_8bit_act_xbit_weight::UKernelConfig{
    /*preferred_alignment*/16,
nit:
-    /*preferred_alignment*/16,
+    /*preferred_alignment*/uk.get_preferred_alignment(),
#if defined(TORCHAO_ENABLE_KLEIDI)
  if (!target || *target == "kleidi_ai") {
    if (weight_nbit == 4 && !has_weight_zeros) {
      return torchao::ops::linear_8bit_act_xbit_weight::get_packed_weights_format_kleidi_ai(weight_nbit, has_weight_zeros, /*has_bias*/true, /*nr*/8, /*kr*/16, /*sr*/2);
In the future we would have to make a choice of nr based on CPU type (or some static choice for AOT weight packing like this), and register kernels per mr, which you are already planning.
Yes, we can use any method in cpuinfo to select packed_weights_format, including any packing params like nr. This is not entirely static, because universal is only selected if cpuinfo_has_arm_neon_dot is available. We could also use fields from uarch to select things here, I guess?

I wonder if we should pass n and k as params in addition to target. Implementers could then take matrix size into account when selecting nr.
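If n and k were passed through, the hook might look like this; the function and return type names are assumed from the surrounding code, and the current signature in the PR takes only target plus the weight parameters:

```cpp
#include <optional>
#include <string>

// Hypothetical shape-aware selection hook: n and k let an implementer pick
// nr (and other packing params) based on matrix size, not just CPU features.
torchao::ops::PackedWeightsFormat select_packed_weights_format(
    std::optional<std::string> target,
    int weight_nbit,
    bool has_weight_zeros,
    int n,  // proposed addition: number of output features
    int k); // proposed addition: number of input features
```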
  // ukernel must behave correctly no matter how buffers are aligned
  size_t preferred_alignment{0};
  weight_packing_config weight_packing;
  std::array<kernel_config, 4> kernels;
Nit:
-  std::array<kernel_config, 4> kernels;
+  std::array<kernel_config, MAX_MR_TYPES> kernels;
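If the suggestion is taken, the bound needs a named definition next to the struct, e.g.:

```cpp
// Named bound for the kernels array (currently a magic 4); the value 4 is
// carried over from the existing code.
inline constexpr int MAX_MR_TYPES = 4;
```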
  weight_data_size_fn_type weight_data_size_fn{nullptr};
  prepare_weight_data_fn_type prepare_weight_data_fn{nullptr};
};
struct kernel_config {
It makes sense that you have one packing kernel and N GEMM kernels indexed by mr, but the naming makes this confusing to read, i.e. ukernel->kernels[mr].mr.
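One hypothetical rename that would address the readability concern (names are suggestions, not the PR's):

```cpp
#include <array>
#include <cstddef>

// "kernels" becomes "mr_variants", so call sites read
// ukernel_config.mr_variants[i].mr instead of ukernel_config.kernels[i].mr.
struct UKernelConfig {
  size_t preferred_alignment{0};
  weight_packing_config weight_packing;
  std::array<kernel_config, 4> mr_variants; // was: kernels
};
```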
  // preferred_alignment for activation and weight data
  // Integration surfaces are not required to respect this alignment, and the
  // ukernel must behave correctly no matter how buffers are aligned
  size_t preferred_alignment{0};
We have to make sure this is the same for all mr variants, i.e., document and test it.
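Since preferred_alignment is a single field on the config, a registration-time guard would cover this; a sketch, where `config` and `uk` are hypothetical local names and `uk.get_preferred_alignment()` is taken from the nit above:

```cpp
// Every mr variant merged into one UKernelConfig must report the same
// preferred alignment; fail loudly at registration time otherwise.
TORCHAO_CHECK(
    config.preferred_alignment ==
        static_cast<size_t>(uk.get_preferred_alignment()),
    "preferred_alignment must be identical across all mr variants");
```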
    /*kernels*/
    {{
      {
        /*mr*/static_cast<int>(uk.get_m_step()),
In the future, when querying for mr(s), we should ensure their weight packing function pointer is the same.
This goes back to the comment about reworking the KleidiAI integration, I guess? Let me give some more thought to breaking some code out.
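A sketch of that pointer-equality check at registration time; field names come from the weight_packing_config snippet above, while `key` and `new_config` are hypothetical locals:

```cpp
// When adding an mr variant to an existing cache entry, require the shared
// weight packing function to match across variants.
auto it = ukernel_config_cache.find(key);
if (it != ukernel_config_cache.end()) {
  TORCHAO_CHECK(
      it->second.weight_packing.prepare_weight_data_fn ==
          new_config.weight_packing.prepare_weight_data_fn,
      "all mr variants must share the same weight packing function");
}
```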
Adding @kimishpatel because he was curious about the PR. kernel_selector.h is the main code to pay attention to for runtime kernel selection.
This is a draft of ukernel selection based on cpuinfo.
This relates to #1376