
[LoweringStrategy] Refactor to take num of rows/cols as inputs #955

Merged: 5 commits into nod-ai:main on Dec 5, 2024

Conversation

@yzhang93 (Contributor) commented Dec 3, 2024

This PR adds support for taking the number of cores (via the flags --iree-amdaie-num-rows and --iree-amdaie-num-cols) as input and generating L2 tile sizes accordingly. It also addresses some remaining issues when switching to 4x4 array usage.

  • In the AIR pipeline, matmul ops now use the 4x4 array by default. However, the matmul-elementwise tests don't work with the 4x4 array, so flags were added to use a 2x2 array in CI. CC @erwei-xilinx
  • Also renamed the option usePassPipeline to useTilePipeline.

maxCoreCols = 8;
break;
default:
llvm::errs() << "unhandled NPU partitioning.\n";
Contributor:

Isn't this supposed to bail out or exit at this point? I'm not sure what llvm::errs() does besides printing the message. If it acts as an assert and bails out, then this is okay.

Contributor:

Yes, I'd put an assert(false && "unhandled target device, we must specify the array size here").

But isn't this info available already from somewhere like [linked code snippet]?

@yzhang93 (Contributor Author):

It's not just printing the message; it bails out like an assert. And llvm::errs() is used throughout this file.

@newling @jtuyls Yes, we can use deviceModel.configPtr.NumCols to get num_rows/cols, but it still needs some hardcoded adjustments to get the number of rows/cols with cores. For example, npu1_4col returns 5 cols and 6 rows, so we need to subtract 1 col and 2 rows for cores, while npu4 returns 8 cols and 6 rows, so we only need to subtract 2 rows for cores.
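For reference, a standalone sketch (plain C++, no LLVM headers; the enum and helper names are hypothetical) of the "print then bail out" pattern being discussed. Note that llvm::errs() by itself only writes to stderr; it is the explicit abort (or llvm::report_fatal_error / assert) that actually terminates:

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>

enum class NpuPartition { FourCol, EightCol };  // hypothetical stand-in

// Sketch of the switch in the lowering strategy: the default path should
// terminate, since printing alone would let compilation continue with an
// unset maxCoreCols.
uint32_t maxCoreColsFor(NpuPartition p) {
  switch (p) {
    case NpuPartition::FourCol:
      return 4;
    case NpuPartition::EightCol:
      return 8;
  }
  // Unreachable for the enum above; mirrors the "unhandled NPU
  // partitioning" default case in the PR.
  std::fprintf(stderr, "unhandled NPU partitioning.\n");  // like llvm::errs()
  std::abort();  // like assert(false) / llvm::report_fatal_error
}
```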

Collaborator:

You can create a new function in the device model, getNumCoreRows, that returns the number of rows with cores. For the 5 columns, we probably just need to change that in the device model.

@yzhang93 (Contributor Author) commented Dec 4, 2024:

Okay, I figured out we can use deviceModel.columns() to get the correct number of cols from the device. It returns 4 cols for npu1_4col and 5 cols for npu1. But I think that's okay; we'd never use 5 cols for the npu1 device.
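A minimal sketch (plain C++; helper names hypothetical, geometries taken from the numbers quoted in this thread) of the row/col adjustments being discussed. One shim row and one mem-tile row sit below the core rows, hence the hardcoded offset of 2 that the PR's getNumCoreRows() uses:

```cpp
#include <cstdint>

// Hypothetical mock of the device-model queries; the real values come from
// the device configuration (configPtr).
struct DeviceGeometry {
  uint32_t rows;  // total rows: shim row + mem-tile row + core rows
  uint32_t cols;  // columns visible to the partition
};

// Core rows start above the shim row (row 0) and the mem-tile row (row 1).
uint32_t numCoreRows(const DeviceGeometry &g) { return g.rows - 2; }
uint32_t numCoreCols(const DeviceGeometry &g) { return g.cols; }

// npu1_4col: deviceModel.columns() == 4, rows() == 6 -> 4x4 core array.
// npu4:      8 cols, 6 rows                          -> 4x8 core array.
```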

// CHECK-PACK-PEEL{LITERAL}: #packingConfig = #amdaie.packing_config<packing_config = [{packedSizes = [44, 64, 64], transposePackIndices = [1], unpackEmpty = [false], innerPerm = [[1, 0]], outerPerm = [[0, 1]]}, {packedSizes = [0, 0, 0, 4, 4, 8], transposePackIndices = [0, 1, 2], unpackEmpty = [false, false, true], innerPerm = [[0, 1], [1, 0], [0, 1]], outerPerm = [[0, 1, 3, 2], [0, 1, 3, 2], [0, 1, 3, 2]]}]>
// CHECK-PACK-PEEL{LITERAL}: #packingConfig = #amdaie.packing_config<packing_config = [{packedSizes = [44, 32, 64], transposePackIndices = [1], unpackEmpty = [false], innerPerm = [[1, 0]], outerPerm = [[0, 1]]}, {packedSizes = [0, 0, 0, 4, 4, 8], transposePackIndices = [0, 1, 2], unpackEmpty = [false, false, true], innerPerm = [[0, 1], [1, 0], [0, 1]], outerPerm = [[0, 1, 3, 2], [0, 1, 3, 2], [0, 1, 3, 2]]}]>
Contributor:

Why did it change from 64 -> 32?

Wouldn't this warrant inclusion of the row/col flags?

@yzhang93 (Contributor Author):

No, it's because we changed the default core usage from 2x2 to 4x4 for the AIR path. If you use the flags --iree-amdaie-num-rows=2 --iree-amdaie-num-cols=2 you will see the previous sizes.
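A toy illustration (hypothetical numbers and formula, not the strategy's actual computation) of why a packed size halves when the core array doubles along a dimension: the same problem dimension is split across twice as many cores.

```cpp
#include <cstdint>

// Toy model: the L2 tile along a dimension is the problem size divided by
// the number of cores along that dimension (the real lowering strategy
// applies additional constraints).
uint32_t l2Tile(uint32_t dim, uint32_t numCores) { return dim / numCores; }

// With a hypothetical dimension of 128:
//   2 cores along it -> l2Tile(128, 2) == 64
//   4 cores along it -> l2Tile(128, 4) == 32
```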

self.add_aie_compilation_flags(
[
"--iree-amdaie-matmul-elementwise-fusion",
"--iree-amdaie-num-rows=2",
Contributor:

Doesn't this change the default for all matmuls from 4x4 to 2x2? Maybe not what we want?

I'd be interested to know how many of the tests

for n_rows in [1,4]:
  for n_cols in [1,2,3,4]:
    # run matmul with these values  (on npu1_4col for say M=N=K=3*256).

work

@jtuyls (Collaborator) commented Dec 4, 2024:

It's being shown in a confusing way here, but these changes are actually in MatmulThinBias, not Matmul.

@yzhang93 (Contributor Author):

Yes, it's added to MatmulThinBias class and used for matmul-elementwise tests in AIR.

Contributor:

Ha, I totally missed that...

[screenshot]

@newling (Contributor) commented Dec 4, 2024:

Can the flags --xrt_lite_n_core_rows and --xrt_lite_n_core_cols be reused, instead of adding new flags --iree-amdaie-num-rows and --iree-amdaie-num-cols ? When would I use different values for these?

@jtuyls (Collaborator) commented Dec 4, 2024:

> Can the flags --xrt_lite_n_core_rows and --xrt_lite_n_core_cols be reused, instead of adding new flags --iree-amdaie-num-rows and --iree-amdaie-num-cols ? When would I use different values for these?

--xrt_lite_n_core_rows and --xrt_lite_n_core_cols are iree-run-module flags. I don't think they can be reused in iree-compile.

@newling (Contributor) commented Dec 4, 2024:

> > Can the flags --xrt_lite_n_core_rows and --xrt_lite_n_core_cols be reused, instead of adding new flags --iree-amdaie-num-rows and --iree-amdaie-num-cols ? When would I use different values for these?
>
> --xrt_lite_n_core_rows and --xrt_lite_n_core_cols are iree-run-module flags. I don't think they can be reused in iree-compile.

That makes sense. Maybe (unrelated to this PR) xrt_lite_n_core_rows and xrt_lite_n_core_cols can be inferred from the device? i.e. always just take the maximum, so for npu1_4col, infer that xrt_lite_n_core_rows = xrt_lite_n_core_cols = 4? There was probably a reason Maks didn't do it this way, though.

@newling (Contributor) left a comment:

Looks good, thanks for this.

I tried this locally with run.py with M=N=512, K=4096 and different numbers of columns and rows, and I see

--iree-amdaie-num-rows=4 --iree-amdaie-num-cols=1
error: 'amdaie.logicalobjectfifo.from_memref' op should have at least one tile candidate

--iree-amdaie-num-rows=4 --iree-amdaie-num-cols=2
<unknown>:0: error: 'aie.memtile_dma' op could not find and assign a valid BD id

--iree-amdaie-num-rows=4 --iree-amdaie-num-cols=3
error: 'amdaie.logicalobjectfifo.from_memref' op should have at least one tile candidate

--iree-amdaie-num-rows=1 --iree-amdaie-num-cols=2
error: 'amdaie.logicalobjectfifo.from_memref' op should have at least one tile candidate

@yzhang93 have you got any example with n_rows != n_cols to run end to end? I'm just curious; I'm happy for progress to be made on this as a follow-up.

@@ -368,6 +368,9 @@ struct AMDAIEDeviceModel {
return deviceConfig.maxVectorSizeBits;
}

uint32_t getNumCoreRows() const { return rows() - 2; }
Contributor:

/// Hardcoded row_offset == 2 -> AIE core rows start from 2

I'm wondering if there might ever be more than 1 row of L2 memory tiles, or some other reason this might not work in the future.

Collaborator:

It should just be configPtr.AieTileNumRows

uint32_t maxCoreRows = deviceModel.getNumCoreRows();
uint32_t maxCoreCols = deviceModel.getNumCoreCols();
if (options.AMDAIENumRows <= 0 || options.AMDAIENumRows > maxCoreRows) {
llvm::report_fatal_error("option numRows is out of range\n");
Contributor:

Suggested change
llvm::report_fatal_error("option numRows is out of range\n");
llvm::report_fatal_error(llvm::Twine("Invalid number of core rows (") +
std::to_string(options.AMDAIENumRows) +
"), must be in the range [1, " +
std::to_string(maxCoreRows) + "] for device " +
stringifyEnum(deviceModel.device));

Might save you in the future! A lit test of this would be nice if possible.
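A standalone sketch (plain C++, using an exception in place of llvm::report_fatal_error; names hypothetical) of the range check with the richer message suggested above:

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Returns normally if numRows is in [1, maxCoreRows]; otherwise throws with
// a message in the suggested style. The real code calls
// llvm::report_fatal_error, which terminates the process -- which is also
// why an expected-error lit test is hard to write for it.
void checkNumRows(int numRows, uint32_t maxCoreRows,
                  const std::string &device) {
  if (numRows <= 0 || static_cast<uint32_t>(numRows) > maxCoreRows)
    throw std::invalid_argument(
        "Invalid number of core rows (" + std::to_string(numRows) +
        "), must be in the range [1, " + std::to_string(maxCoreRows) +
        "] for device " + device);
}
```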

@yzhang93 (Contributor Author) commented Dec 5, 2024:

Thanks for the suggestion! I've adopted it and manually tested the error messages. However, I don't think I can add a lit test because the program crashes with the error; expected-error doesn't work in such situations.

@@ -59,6 +59,8 @@ struct AMDAIEOptions {
bool insertLoopAroundCoreBlock{false};
bool matmulElementwiseFusion{false};
AMDAIEDevice AMDAIETargetDevice{AMDAIEDevice::npu1_4col};
int AMDAIENumRows{4};
Contributor:

These default values can be inferred from the default AMDAIETargetDevice: create device model and then call getNumCoreRows() and getNumCoreCols()

My thinking is that the fewer defaults to twiddle when we one day change to npu4 as the default, the better.

@@ -231,6 +233,14 @@ struct AMDAIEOptions {
clEnumValN(AMDAIEDevice::npu4, "npu4",
"Strix B0 NPU with 8 columns and 6 rows")));

binder.opt<int>("iree-amdaie-num-rows", AMDAIENumRows,
llvm::cl::cat(category),
llvm::cl::desc("Number of rows used in an AIE core array"));
Contributor:

Suggested change
llvm::cl::desc("Number of rows used in an AIE core array"));
llvm::cl::desc("Number of rows used in an AIE core array. The compiler will choose a tiling strategy that uses no more than this number of rows. However, some workloads (like convolution) currently ignore this flag, and use a hardcoded number of rows."));

or something like that, for users without the context

@yzhang93 (Contributor Author) commented Dec 4, 2024:

> [@newling's review comment above, quoted in full]

Thanks for trying different combinations! I think we had some e2e tests running on 4x2 previously in a dev branch, but now, as I tried, these failed with the same errors as above. @jtuyls It looks like we'll need to generalize the AssignTiles pass.

@Abhishek-Varma (Contributor) left a comment:

Thanks @yzhang93 !

LGTM % one comment.

Comment on lines 39 to 40
OpPassManager &variantPassManager, AMDAIEDevice device, int numRows,
int numCols, TilePassPipeline useTilePipeline,
Contributor:

Suggested change
OpPassManager &variantPassManager, AMDAIEDevice device, int numRows,
int numCols, TilePassPipeline useTilePipeline,
OpPassManager &variantPassManager, AMDAIEDevice device, uint32_t numRows,
uint32_t numCols, TilePassPipeline useTilePipeline,

@yzhang93 yzhang93 enabled auto-merge (squash) December 5, 2024 06:52
@yzhang93 yzhang93 merged commit 63bc340 into nod-ai:main Dec 5, 2024
7 checks passed
@newling (Contributor) left a comment:

LGTM
