
Vectorization of linalg.fill #1095

Open · newling wants to merge 7 commits into main from towards_vectorized_fill

Conversation

@newling (Contributor) commented on Feb 11, 2025

This PR contains multiple changes needed to get vectorized assembly through peano; I will split it into multiple PRs.

Eyeballing the first test in the performance benchmark:

Before:

matmul_512_512_4096_bf16_f32_O2_npu1_4col_benchmark
--------------------------------------------------------------------------------------------------
Benchmark                                        Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------
BM_matmul/process_time/real_time_mean         2673 us         66.3 us            5 items_per_second=374.194/s
BM_matmul/process_time/real_time_median       2668 us         63.1 us            5 items_per_second=374.766/s
BM_matmul/process_time/real_time_stddev       19.3 us         16.7 us            5 items_per_second=2.69245/s
--------------------------------------------------------------------------------------------------
The largest program memory size (read from byte 72 of elf files) is 11184 bytes

After:

matmul_512_512_4096_bf16_f32_O2_npu1_4col_benchmark
--------------------------------------------------------------------------------------------------
Benchmark                                        Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------
BM_matmul/process_time/real_time_mean         2601 us         48.6 us            5 items_per_second=385.252/s
BM_matmul/process_time/real_time_median       2540 us         39.1 us            5 items_per_second=393.696/s
BM_matmul/process_time/real_time_stddev        132 us         21.0 us            5 items_per_second=18.3503/s
--------------------------------------------------------------------------------------------------
The largest program memory size (read from byte 72 of elf files) is 10112 bytes

So a nice saving on program memory, and maybe a marginal throughput boost. There is a consistent saving of about 1 KB of program memory across all (non-ukernel) benchmarks.

@newling force-pushed the towards_vectorized_fill branch 2 times, most recently from 9c5d008 to 14e0dbb on February 12, 2025 00:10
newling added a commit that referenced this pull request on Feb 20, 2025: … ops (#1117)

This is part of the PR to vectorize linalg.fill:
#1095

Basically, one of the patterns introduced in #1095 means that in one of the subsequent passes (lowering to the LLVM dialect) a cast operation is introduced outside of an `aie.core`, and it needs to be inside the `aie.core` for core-to-standard to work. In other words, we need to sink an operation into an `aie.core`. Before this PR there was already a pass to sink operations into `amdaie.core`; this PR refactors that pass so that it can be reused to sink into `aie.core` (or any other regioned op).
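
For context, a minimal sketch of what sinking an op into a regioned op involves in MLIR; this is not the refactored pass itself, and the helper and variable names are illustrative assumptions:

```cpp
#include "mlir/IR/Operation.h"
#include "llvm/ADT/STLExtras.h"

using namespace mlir;

// Rough sketch only: if an op defined outside a regioned op (e.g. aie.core)
// is used only inside that op's body, move it to the top of the body so that
// all of its users remain dominated by it.
static void sinkIntoRegion(Operation *opToSink, Operation *regionedOp) {
  Region &body = regionedOp->getRegion(0);
  bool onlyUsedInside =
      llvm::all_of(opToSink->getUsers(), [&](Operation *user) {
        return body.isAncestor(user->getParentRegion());
      });
  if (!onlyUsedInside) return;
  // Move to the start of the entry block; operands defined in the enclosing
  // region still dominate the new location.
  opToSink->moveBefore(&body.front(), body.front().begin());
}
```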
@newling force-pushed the towards_vectorized_fill branch from 14e0dbb to 9c12e38 on February 20, 2025 15:47
@newling marked this pull request as ready for review on February 20, 2025 22:34
@Abhishek-Varma (Contributor) left a comment:
Nice! A few comments.

Comment on lines +449 to +453
    assert(initialVectorType && "vector must be of vector type");
    assert(writeDestinationType.getElementType() ==
               initialVectorType.getElementType() &&
           "element types must match");
Contributor:
For my understanding: why aren't these a candidate for returning a match failure instead of asserting?

newling (Contributor, Author):

IMO, a match failure is for ops that don't seem 'broken'; they just don't fit a particular pattern. But I think a transfer_read that doesn't satisfy the checks here suggests something is very wrong, and we/the user should take action to assess the situation. If this weren't a pattern-based pass, I'd probably signalPassFailure(), but that's not an option with the pattern-based approach.
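
For illustration only (not the PR's actual pattern), a sketch of how the two outcomes differ inside a rewrite pattern; the pattern name and the minor-identity check are assumptions made up for the example:

```cpp
#include "mlir/Dialect/Vector/IR/VectorOps.h"
#include "mlir/IR/PatternMatch.h"

using namespace mlir;

// Hypothetical pattern illustrating the distinction discussed above.
struct ExampleTransferWritePattern
    : public OpRewritePattern<vector::TransferWriteOp> {
  using OpRewritePattern::OpRewritePattern;

  LogicalResult matchAndRewrite(vector::TransferWriteOp writeOp,
                                PatternRewriter &rewriter) const override {
    // Out-of-scope IR: not broken, just not a case this pattern handles.
    // Return a match failure so other patterns can still apply.
    if (!writeOp.getPermutationMap().isMinorIdentity())
      return rewriter.notifyMatchFailure(writeOp,
                                         "not a minor identity transfer");

    // Broken IR: the written value must be a vector by construction, so a
    // violation here means something upstream went very wrong -- assert.
    auto vectorType = dyn_cast<VectorType>(writeOp.getVector().getType());
    assert(vectorType && "vector must be of vector type");
    (void)vectorType;

    // (actual rewrite omitted in this sketch)
    return failure();
  }
};
```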

@@ -65,41 +66,37 @@ void AMDAIEVectorizationPass::runOnOperation() {
SmallVector<Operation *> candidates;
funcOp.walk([&](Operation *op) {
// Only vectorize linalg ops (for now)
-    if (!isa<linalg::LinalgOp>(op)) return;
+    if (!isa<linalg::LinalgOp>(op)) return WalkResult::advance();
Contributor:

Here and elsewhere: WalkResult::skip() perhaps?

newling (Contributor, Author):

I'm not sure; it might work (and would in theory be more efficient). But I'd prefer to play with this in another PR -- I'd like to try to get vectorization working for all of these ops eventually.
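
As a rough illustration of the walk-result options under discussion (the `vectorizable` predicate is hypothetical, and this is not the pass's actual logic):

```cpp
// advance() continues the walk into nested regions, skip() moves on without
// visiting the op's regions, and interrupt() stops the whole walk.
funcOp.walk([&](Operation *op) {
  if (!isa<linalg::LinalgOp>(op))
    return WalkResult::advance();  // keep descending: linalg ops may be nested inside
  if (!vectorizable(op))           // hypothetical predicate
    return WalkResult::skip();     // a rejected linalg op has nothing else worth visiting
  candidates.push_back(op);
  return WalkResult::advance();
});
```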

@newling force-pushed the towards_vectorized_fill branch from 691a16e to d9f51aa on February 21, 2025 16:14
Comment on lines 457 to 467
arith::ConstantOp constantVectorSource = [&writeOp]() -> arith::ConstantOp {
  Value current = writeOp.getVector();
  while (Operation *op = current.getDefiningOp()) {
    if (auto cOp = dyn_cast<arith::ConstantOp>(op)) return cOp;
    if (op->getNumOperands() != 1) return {};
    current = op->getOperand(0);
  }
  return {};
}();
if (!constantVectorSource) {
  return rewriter.notifyMatchFailure(
      writeOp, "vector isn't derived from arith.constant");
}
Contributor:

The comment on maybeSplat applies here as well. No need to define such an inlined function when it's really just invoked once.

newling (Contributor, Author):

I'll change it, but I actually find the approach with lambdas clearer to read: (1) my eye can skip to the end of the function if I'm not interested in the details of the traversal, and (2) name encapsulation.

newling (Contributor, Author):
Ok, done! It does look nice now.
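
For illustration, the refactor being discussed might look roughly like the following; the helper name `getConstantVectorSource` is an assumption for the example, not necessarily what the PR ended up with:

```cpp
// Hypothetical standalone helper replacing the immediately-invoked lambda
// above: walk back through single-operand defining ops from the written
// vector and return the arith.constant it originates from, if any.
static arith::ConstantOp getConstantVectorSource(Value vector) {
  Value current = vector;
  while (Operation *op = current.getDefiningOp()) {
    if (auto cOp = dyn_cast<arith::ConstantOp>(op)) return cOp;
    if (op->getNumOperands() != 1) return {};
    current = op->getOperand(0);
  }
  return {};
}

// At the call site in the pattern, the lambda call becomes an ordinary call:
arith::ConstantOp constantVectorSource =
    getConstantVectorSource(writeOp.getVector());
if (!constantVectorSource) {
  return rewriter.notifyMatchFailure(
      writeOp, "vector isn't derived from arith.constant");
}
```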

@newling force-pushed the towards_vectorized_fill branch from a5680a7 to 8bc90eb on February 21, 2025 19:37
@newling force-pushed the towards_vectorized_fill branch from 8bc90eb to f696e6c on February 21, 2025 23:42