
draft ukernel selection logic #1652 (Draft)

metascroy wants to merge 4 commits into main

Conversation

@metascroy (Contributor) commented Feb 3, 2025:

This is a draft of ukernel selection logic based on cpu_info.

This relates to #1376
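For context, a minimal sketch of what cpuinfo-driven selection could look like (the get_kleidi_ai_config / get_universal_config helpers are hypothetical placeholders; only the cpuinfo calls are real API):

#include <cpuinfo.h>

// Minimal sketch, not the PR's implementation: choose a ukernel config from
// runtime cpu features. get_kleidi_ai_config() / get_universal_config() are
// hypothetical stand-ins for whatever kernel_selector.h ends up exposing.
UKernelConfig select_ukernel_config() {
  TORCHAO_CHECK(cpuinfo_initialize(), "cpuinfo initialization failed");
  TORCHAO_CHECK(cpuinfo_has_arm_neon_dot(),
      "these ukernels assume Arm dot-product support");

#if defined(TORCHAO_ENABLE_KLEIDI)
  // Prefer KleidiAI kernels when they are compiled in.
  return get_kleidi_ai_config();
#else
  return get_universal_config();
#endif
}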

@metascroy requested a review from @digantdesai on February 3, 2025 03:44

pytorch-bot bot commented Feb 3, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1652

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 1 New Failure

As of commit 7a43be4 with merge base 7815262:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Feb 3, 2025
@@ -98,7 +98,7 @@ LinearTilingParams get_default_linear_tiling_params(
TORCHAO_CHECK(num_threads >= 1, "num_threads must be >= 1");

tiling_params.mc_by_mr = 1;
- int mc = tiling_params.mc_by_mr * ukernel_config.mr;
+ int mc = tiling_params.mc_by_mr * ukernel_config.kernels[0].mr;
@metascroy (author):

ukernel_config now includes an array of kernels indexed by mr. Still need to add mr selection logic here; for now it just selects the first one.
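For what it's worth, one possible shape for that selection logic (a sketch only; the policy of picking the largest mr that does not exceed m is an assumption, and field names follow the UKernelConfig in this PR):

// Hypothetical mr-selection policy, not part of this PR: use the registered
// kernel with the largest mr that still fits m, falling back to kernels[0]
// when m is smaller than every registered mr.
int select_kernel_idx(const UKernelConfig& config, int m) {
  int best = 0;
  for (int i = 0; i < static_cast<int>(config.kernels.size()); i++) {
    int mr = config.kernels[i].mr;
    if (mr == 0) break;  // unused trailing slots
    if (mr <= m && mr > config.kernels[best].mr) best = i;
  }
  return best;
}

// e.g. int mc = tiling_params.mc_by_mr * config.kernels[select_kernel_idx(config, m)].mr;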

static UKernelConfigCacheType ukernel_config_cache;

// Check cache
auto it = ukernel_config_cache.find(header);
@metascroy (author):

If we want uarch specific kernel per core, we can add uarch to cache key and look up uarch before looking in cache, e.g.,

auto uarch = get_current_core_uarch();
auto it = ukernel_config_cache.find({header, uarch});
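Concretely, that might look something like the following (a sketch; cpuinfo_get_current_uarch_index() is real cpuinfo API, while the pair-keyed cache, its hash, and the header type name are illustrative):

// Illustrative only: key the config cache on (packed-weights header, uarch
// index of the core currently executing), so big and little cores can be
// given different kernels.
using UKernelConfigKey = std::pair<PackedWeightsHeader, uint32_t>;

UKernelConfigKey current_key(const PackedWeightsHeader& header) {
  // cpuinfo must already be initialized; returns the uarch of the calling core.
  return {header, cpuinfo_get_current_uarch_index()};
}

auto it = ukernel_config_cache.find(current_key(header));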

@@ -22,7 +22,7 @@ if(NOT TORCHAO_INCLUDE_DIRS)
set(TORCHAO_INCLUDE_DIRS ${CMAKE_CURRENT_SOURCE_DIR}/../..)
endif()

-option(TORCHAO_BUILD_KLEIDIAI "Download, build, and link against Arm KleidiAI library (arm64 only)" OFF)
+option(TORCHAO_BUILD_KLEIDIAI "Download, build, and link against Arm KleidiAI library (arm64 only)" ON)
@metascroy (author):

TODO: nocommit

# print(f"actual_val={actual_val}, expected_val={expected_val}")
# self.assertTrue(torch.allclose(actual_val, expected_val, atol=1e-6))

self.assertTrue(torch.abs(actual_val - expected_val) < 0.05)
@metascroy (author):

Do not commit this change. It is needed because kleidi uses bf16 instead of fp32.

0});
}

struct KleidiAIPackingParams {
@metascroy (author):

TODO: check if these packing params are sufficient for all kleidi.
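For reference, the discussion elsewhere in this thread indexes packing by nr/kr/sr, so a starting point might be the following (the exact field set is a guess, which is what the TODO is about):

// Hypothetical field set; whether this is sufficient for every KleidiAI
// kernel is the open question in the TODO above.
struct KleidiAIPackingParams {
  int nr;  // output-channel tile width used by weight packing
  int kr;  // k-dimension packing factor
  int sr;  // additional split factor used by the packing routine
};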

ukernel_config_cache[key] = torchao::ops::linear_8bit_act_xbit_weight::UKernelConfig{
/*preferred_alignment*/16,
/*weight_packing*/
{
@metascroy (author):

We can rework the kleidiai integration to share weight packing, rather than repeating it in each namespace.

@digantdesai:

It is shared in code, but exposed along with the kernel so you don't have to map it back to the kernel at call sites.

@metascroy (author):

It is in shared code, but not in a way that is convenient to access with shared mr kernels because the same packing function (indexed by nr, kr, sr) is given 4 different names (based on namespace).

So we could refactor it to expose one packing function in kai_matmul_clamp_f32_qai8dxp_qsi4c32p, rather than having copies in the more specific namespaces?
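In other words, something along these lines (a hypothetical refactor sketch; the accessor name is not KleidiAI API):

// One packing accessor in the shared kai_matmul_clamp_f32_qai8dxp_qsi4c32p
// helper code, keyed by (nr, kr, sr), that every mr-variant registration
// reuses instead of re-exposing the same routine under four
// namespace-specific names.
namespace kai_matmul_clamp_f32_qai8dxp_qsi4c32p {

weight_packing_config get_weight_packing(int nr, int kr, int sr);

}  // namespace kai_matmul_clamp_f32_qai8dxp_qsi4c32p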

/*kernels*/
{{
{
/*mr*/static_cast<int>(uk.get_m_step()),
@metascroy (author):

List of methods indexed by mr.

@digantdesai (Contributor) left a comment:

Good start. Please also think some more about code organization, with an eye toward supporting many more kernels and scalability in general.

@@ -8,13 +8,23 @@ cmake_minimum_required(VERSION 3.19)

include(${CMAKE_CURRENT_SOURCE_DIR}/../../Utils.cmake)

add_compile_options(-Wno-unused-function -Wno-unused-variable) # For some reason cpuinfo package has unused functions/variables
@digantdesai:

Fix it upstream?

assert (sr == uk.get_sr());

ukernel_config_cache[key] = torchao::ops::linear_8bit_act_xbit_weight::UKernelConfig{
/*preferred_alignment*/16,
@digantdesai:

nit

Suggested change:
- /*preferred_alignment*/16,
+ /*preferred_alignment*/uk.get_preferred_alignment(),

#if defined(TORCHAO_ENABLE_KLEIDI)
if (!target || *target == "kleidi_ai") {
if (weight_nbit == 4 && !has_weight_zeros) {
return torchao::ops::linear_8bit_act_xbit_weight::get_packed_weights_format_kleidi_ai(weight_nbit, has_weight_zeros, /*has_bias*/true, /*nr*/8, /*kr*/16, /*sr*/2);
@digantdesai:

In the future we would have to choose nr based on the cpu type (or make some static choice like this for AOT weight packing), and register kernels per mr, which you are already planning.

@metascroy (author):

Yes, we can use any method in cpuinfo to select packed_weights_format, including any packing params like nr. This is not entirely static because universal is only selected if cpuinfo_has_arm_neon_dot is available. We could also use fields from uarch to select things here I guess?

I wonder if we should pass n and k as params in addition to target. Implementers can then take into account matrix size when selecting nr?
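As a sketch of that idea (the extended signature, the return type name, the universal fallback helper, and the shape-based nr heuristic are all assumptions, not part of this PR):

// Hypothetical: let the format selector see n and k in addition to target,
// so implementers can pick packing params like nr from matrix shape as well
// as from cpuinfo. Only get_packed_weights_format_kleidi_ai comes from the
// PR; the rest is illustrative.
PackedWeightsFormat select_packed_weights_format(
    std::optional<std::string> target,
    int weight_nbit, bool has_weight_zeros, int n, int k) {
#if defined(TORCHAO_ENABLE_KLEIDI)
  if ((!target || *target == "kleidi_ai") && weight_nbit == 4 &&
      !has_weight_zeros && cpuinfo_has_arm_neon_dot()) {
    // Illustrative heuristic: wider nr only when n is large enough to fill it.
    int nr = (n >= 64) ? 8 : 4;
    return torchao::ops::linear_8bit_act_xbit_weight::
        get_packed_weights_format_kleidi_ai(
            weight_nbit, has_weight_zeros, /*has_bias*/true, nr, /*kr*/16, /*sr*/2);
  }
#endif
  (void)k;  // k could feed a similar heuristic
  return get_packed_weights_format_universal(weight_nbit, has_weight_zeros);
}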

// ukernel must behave correctly no matter how buffers are aligned
size_t preferred_alignment{0};
weight_packing_config weight_packing;
std::array<kernel_config, 4> kernels;
@digantdesai:

Nit

Suggested change:
- std::array<kernel_config, 4> kernels;
+ std::array<kernel_config, MAX_MR_TYPES> kernels;

weight_data_size_fn_type weight_data_size_fn{nullptr};
prepare_weight_data_fn_type prepare_weight_data_fn{nullptr};
};
struct kernel_config {
@digantdesai:

It makes sense that you have one packing kernel and N gemm kernels indexed by mr, but the naming makes this confusing to read, i.e. ukernel->kernel[mr].mr.

// preferred_alignment for activation and weight data
// Integration surfaces are not required to respect this alignment, and the
// ukernel must behave correctly no matter how buffers are aligned
size_t preferred_alignment{0};
@digantdesai:

We have to make sure this is the same for all mr variants, i.e. document and test it.

/*kernels*/
{{
{
/*mr*/static_cast<int>(uk.get_m_step()),
@digantdesai:

In the future, when querying for multiple mr values, we should ensure their weight packing function pointers are the same.

@metascroy (author):

This goes to the comment about reworking the kleidiAI integration I guess?
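A minimal sketch of the registration-time check being suggested above (kernels_by_mr and packing_fn_for are hypothetical helpers):

// Illustrative only: when gathering kleidi ukernels for several mr values,
// confirm they all report the same weight packing routine before folding
// them into one UKernelConfig with a single weight_packing entry.
auto reference_packing = packing_fn_for(kernels_by_mr[0]);
for (const auto& uk : kernels_by_mr) {
  TORCHAO_CHECK(packing_fn_for(uk) == reference_packing,
      "all mr variants must share one weight packing function");
}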

@metascroy (author):

> Good start. Please also think some more about code organization, with an eye toward supporting many more kernels and scalability in general.

Let me give it some more thought about breaking some of the code out.

@metascroy (author):

Adding @kimishpatel because he was curious about the PR. kernel_selector.h is the main code to pay attention to for runtime kernel selection.
