[LV][POC] Use umin to avoid second-to-last iteration problems with EVL #143434
base: main
Conversation
This patch is a proof of concept. More than anything else, its purpose is to ask: am I missing something here? I'm mostly just hoping to trigger some discussion.

I make no claims the patch is generally correct, or well structured. I'm happy to iterate on a real patch once the direction is agreed upon.

Background

RISCV supports VL predication via the VL configuration register. The specification has a special case, when a vsetvl instruction is executed, for a requested AVL between VLMAX and VLMAX*2. We have existing support for the VP family of intrinsics, which take an AVL that (by assumption) must be less than VLMAX. This means the vectorizer needs to adjust the requested AVL to reconcile these different requirements.

The current approach in the EVL vectorization path is to introduce an intrinsic which fairly directly maps to the vsetvli behavior described above, and to adjust the induction variable of the vector loop to advance not by VF*UF, but by AVL (in this case, the result AVL, not the requested AVL). This means that the last *two* iterations can have fewer active lanes than VF*UF. This breaks a bunch of assumptions in the vectorizer, and has required a bunch of work to address, with some of that work still in progress.

Proposal

This patch implements a conceptually simple idea: we use a umin to compute the requested AVL for the VP intrinsics such that the requested AVL is VF*UF on all but the last iteration. (Compared to the existing EVL strategy, this basically redistributes active lanes between the last two iterations.) Doing so means we may end up with an extra "umin" in the loop, but simplifies the implementation since the canonical IV does not need to be changed.

The major advantage is that all remaining EVL-specific restrictions can be removed, and that this variant can vectorize and predicate anything mask predication can handle.

Results

On TSVC, the current EVL strategy is 6.7% worse (in geomean across all tests) than the default main-loop + scalar-epilogue approach. The umin-EVL variant (this patch) is 2.2% worse than the default.

Interestingly, that improvement comes primarily from unblocking vectorization cases that the current EVL code can't handle. Without relaxing the restrictions (which are no longer needed for correctness) we only see a slight gain from the IV structure simplification.

Looking into the reason for the remaining delta, there appear to be a few features in the vectorizer which don't work with *any* of the supported predication styles. These appear to be independently fixable, and I suspect all of the variants would improve slightly.

For comparison, mask-based predication is 18.5% worse than the scalar-epilogue approach. That does improve slightly if mask generation is improved (in a hacky way which is only safe on TSVC), but the two EVL variants are still significantly ahead.

(All measurements on a BP3 w/ -mcpu=spacemit-x60 -mllvm -force-tail-folding-style=XYZ -mllvm -prefer-predicate-over-epilogue=predicate-else-scalar-epilogue)

Discussion

A few bits to note.

This relies on having the umin instruction, so probably only makes sense with zbb.

This requires materializing the VF*UF term as a loop-invariant expression. This will involve a read of vlenb, which may be slow on certain hardware. There's an alternate phrasing I can choose which doesn't require either the umin or the read of vlenb, but it's significantly more complicated.

The umin formulation can probably be peeled off the last iteration of the loop now that Florian has implemented reverse peeling. The current code hasn't been updated to do this yet, but likely could be without a major redesign. The net result would look like a vectorized (unpredicated) main body followed by a predicated vector epilogue.

This formulation will likely require an extra register over the existing formulation (since you have to materialize the umin), but a quick glance doesn't reveal this being a systemic problem.

A few of the sub-tests in TSVC require more instructions in the loop. A couple I glanced at seem to be LSR bugs, but I haven't looked at them in any depth (and frankly, don't plan to, as the results already look quite decent).
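To make the redistribution concrete, here is a minimal standalone C++ sketch (no LLVM dependency; the ceil-split below models just one legal vsetvli outcome for VLMAX < AVL < 2*VLMAX, and the constants are illustrative, not measured):

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

// Existing EVL strategy: models the vsetvli AVL rule, where an implementation
// may return ceil(AVL/2) when VLMAX < AVL < 2*VLMAX, so the last *two*
// iterations can both run with fewer than VLMAX active lanes.
static std::vector<uint64_t> splitVsetvli(uint64_t TC, uint64_t VLMAX) {
  std::vector<uint64_t> Chunks;
  for (uint64_t Remaining = TC; Remaining != 0;) {
    uint64_t VL = Remaining > 2 * VLMAX ? VLMAX
                  : Remaining > VLMAX   ? (Remaining + 1) / 2
                                        : Remaining;
    Chunks.push_back(VL);
    Remaining -= VL;
  }
  return Chunks;
}

// umin strategy (this patch): every iteration but the last requests exactly
// VLMAX (standing in for VF*UF) lanes; only the final iteration is short.
static std::vector<uint64_t> splitUmin(uint64_t TC, uint64_t VLMAX) {
  std::vector<uint64_t> Chunks;
  for (uint64_t Remaining = TC; Remaining != 0;) {
    uint64_t VL = std::min(Remaining, VLMAX);
    Chunks.push_back(VL);
    Remaining -= VL;
  }
  return Chunks;
}

int main() {
  const uint64_t TC = 21, VLMAX = 8;
  std::cout << "vsetvli split:";
  for (uint64_t VL : splitVsetvli(TC, VLMAX))
    std::cout << ' ' << VL; // 8 7 6 -> the last two iterations are short
  std::cout << "\numin split:   ";
  for (uint64_t VL : splitUmin(TC, VLMAX))
    std::cout << ' ' << VL; // 8 8 5 -> only the last iteration is short
  std::cout << '\n';
  return 0;
}
```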
@llvm/pr-subscribers-vectorizers @llvm/pr-subscribers-llvm-transforms

Author: Philip Reames (preames)

Changes

Patch is 376.35 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/143434.diff

24 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
index ea617f042566b..51703599edb5d 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
@@ -2166,9 +2166,7 @@ static void transformRecipestoEVLRecipes(VPlan &Plan, VPValue &EVL) {
VPRegionBlock *LoopRegion = Plan.getVectorLoopRegion();
VPBasicBlock *Header = LoopRegion->getEntryBasicBlock();
- // Create a scalar phi to track the previous EVL if fixed-order recurrence is
- // contained.
- VPInstruction *PrevEVL = nullptr;
+ VPValue *PrevEVL = nullptr;
bool ContainsFORs =
any_of(Header->phis(), IsaPred<VPFirstOrderRecurrencePHIRecipe>);
if (ContainsFORs) {
@@ -2183,8 +2181,7 @@ static void transformRecipestoEVLRecipes(VPlan &Plan, VPValue &EVL) {
VFSize > 32 ? Instruction::Trunc : Instruction::ZExt, MaxEVL,
Type::getInt32Ty(Ctx), DebugLoc());
}
- Builder.setInsertPoint(Header, Header->getFirstNonPhi());
- PrevEVL = Builder.createScalarPhi({MaxEVL, &EVL}, DebugLoc(), "prev.evl");
+ PrevEVL = MaxEVL;
}
for (VPUser *U : to_vector(Plan.getVF().users())) {
@@ -2273,22 +2270,13 @@ bool VPlanTransforms::tryAddExplicitVectorLength(
// The transform updates all users of inductions to work based on EVL, instead
// of the VF directly. At the moment, widened inductions cannot be updated, so
// bail out if the plan contains any.
- bool ContainsWidenInductions = any_of(
- Header->phis(),
- IsaPred<VPWidenIntOrFpInductionRecipe, VPWidenPointerInductionRecipe>);
- if (ContainsWidenInductions)
- return false;
auto *CanonicalIVPHI = Plan.getCanonicalIV();
- VPValue *StartV = CanonicalIVPHI->getStartValue();
- // Create the ExplicitVectorLengthPhi recipe in the main loop.
- auto *EVLPhi = new VPEVLBasedIVPHIRecipe(StartV, DebugLoc());
- EVLPhi->insertAfter(CanonicalIVPHI);
VPBuilder Builder(Header, Header->getFirstNonPhi());
// Compute original TC - IV as the AVL (application vector length).
VPValue *AVL = Builder.createNaryOp(
- Instruction::Sub, {Plan.getTripCount(), EVLPhi}, DebugLoc(), "avl");
+ Instruction::Sub, {Plan.getTripCount(), CanonicalIVPHI}, DebugLoc(), "avl");
if (MaxSafeElements) {
// Support for MaxSafeDist for correct loop emission.
VPValue *AVLSafe = Plan.getOrAddLiveIn(
@@ -2296,32 +2284,20 @@ bool VPlanTransforms::tryAddExplicitVectorLength(
VPValue *Cmp = Builder.createICmp(ICmpInst::ICMP_ULT, AVL, AVLSafe);
AVL = Builder.createSelect(Cmp, AVL, AVLSafe, DebugLoc(), "safe_avl");
}
- auto *VPEVL = Builder.createNaryOp(VPInstruction::ExplicitVectorLength, AVL,
- DebugLoc());
- auto *CanonicalIVIncrement =
- cast<VPInstruction>(CanonicalIVPHI->getBackedgeValue());
- Builder.setInsertPoint(CanonicalIVIncrement);
- VPSingleDefRecipe *OpVPEVL = VPEVL;
- if (unsigned IVSize = CanonicalIVPHI->getScalarType()->getScalarSizeInBits();
- IVSize != 32) {
- OpVPEVL = Builder.createScalarCast(
- IVSize < 32 ? Instruction::Trunc : Instruction::ZExt, OpVPEVL,
- CanonicalIVPHI->getScalarType(), CanonicalIVIncrement->getDebugLoc());
- }
- auto *NextEVLIV = Builder.createOverflowingOp(
- Instruction::Add, {OpVPEVL, EVLPhi},
- {CanonicalIVIncrement->hasNoUnsignedWrap(),
- CanonicalIVIncrement->hasNoSignedWrap()},
- CanonicalIVIncrement->getDebugLoc(), "index.evl.next");
- EVLPhi->addOperand(NextEVLIV);
+ // This is just a umin pattern
+ VPValue &VFxUF = Plan.getVFxUF();
+ VPValue *Cmp = Builder.createICmp(ICmpInst::ICMP_ULT, AVL, &VFxUF);
+ auto *VPEVL = Builder.createSelect(Cmp, AVL, &VFxUF, DebugLoc());
+
+ unsigned BitWidth = CanonicalIVPHI->getScalarType()->getScalarSizeInBits();
+ LLVMContext &Ctx = CanonicalIVPHI->getScalarType()->getContext();
+ VPEVL = Builder.createScalarCast(
+ BitWidth > 32 ? Instruction::Trunc : Instruction::ZExt, VPEVL,
+ Type::getInt32Ty(Ctx), DebugLoc());
transformRecipestoEVLRecipes(Plan, *VPEVL);
- // Replace all uses of VPCanonicalIVPHIRecipe by
- // VPEVLBasedIVPHIRecipe except for the canonical IV increment.
- CanonicalIVPHI->replaceAllUsesWith(EVLPhi);
- CanonicalIVIncrement->setOperand(0, CanonicalIVPHI);
// TODO: support unroll factor > 1.
Plan.setUF(1);
return true;
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/evl-compatible-loops.ll b/llvm/test/Transforms/LoopVectorize/RISCV/evl-compatible-loops.ll
index e40f51fd7bd70..86ce6a19e4747 100644
--- a/llvm/test/Transforms/LoopVectorize/RISCV/evl-compatible-loops.ll
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/evl-compatible-loops.ll
@@ -8,14 +8,53 @@ define void @test_wide_integer_induction(ptr noalias %a, i64 %N) {
; CHECK-LABEL: define void @test_wide_integer_induction(
; CHECK-SAME: ptr noalias [[A:%.*]], i64 [[N:%.*]]) #[[ATTR0:[0-9]+]] {
; CHECK-NEXT: entry:
+; CHECK-NEXT: [[TMP0:%.*]] = sub i64 -1, [[N]]
+; CHECK-NEXT: [[TMP1:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP2:%.*]] = mul i64 [[TMP1]], 2
+; CHECK-NEXT: [[TMP3:%.*]] = icmp ult i64 [[TMP0]], [[TMP2]]
+; CHECK-NEXT: br i1 [[TMP3]], label [[SCALAR_PH:%.*]], label [[ENTRY:%.*]]
+; CHECK: vector.ph:
+; CHECK-NEXT: [[TMP4:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 2
+; CHECK-NEXT: [[TMP6:%.*]] = sub i64 [[TMP5]], 1
+; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[N]], [[TMP6]]
+; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP5]]
+; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
+; CHECK-NEXT: [[TMP7:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP8:%.*]] = mul i64 [[TMP7]], 2
+; CHECK-NEXT: [[TMP9:%.*]] = call <vscale x 2 x i64> @llvm.stepvector.nxv2i64()
+; CHECK-NEXT: [[TMP10:%.*]] = mul <vscale x 2 x i64> [[TMP9]], splat (i64 1)
+; CHECK-NEXT: [[INDUCTION:%.*]] = add <vscale x 2 x i64> zeroinitializer, [[TMP10]]
+; CHECK-NEXT: [[TMP11:%.*]] = mul i64 1, [[TMP8]]
+; CHECK-NEXT: [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 2 x i64> poison, i64 [[TMP11]], i64 0
+; CHECK-NEXT: [[DOTSPLAT:%.*]] = shufflevector <vscale x 2 x i64> [[DOTSPLATINSERT]], <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer
; CHECK-NEXT: br label [[FOR_BODY:%.*]]
-; CHECK: for.body:
-; CHECK-NEXT: [[IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[IV_NEXT:%.*]], [[FOR_BODY]] ]
+; CHECK: vector.body:
+; CHECK-NEXT: [[IV:%.*]] = phi i64 [ 0, [[ENTRY]] ], [ [[IV_NEXT:%.*]], [[FOR_BODY]] ]
+; CHECK-NEXT: [[VEC_IND:%.*]] = phi <vscale x 2 x i64> [ [[INDUCTION]], [[ENTRY]] ], [ [[VEC_IND_NEXT:%.*]], [[FOR_BODY]] ]
+; CHECK-NEXT: [[AVL:%.*]] = sub i64 [[N]], [[IV]]
+; CHECK-NEXT: [[TMP12:%.*]] = icmp ult i64 [[AVL]], [[TMP8]]
+; CHECK-NEXT: [[TMP13:%.*]] = select i1 [[TMP12]], i64 [[AVL]], i64 [[TMP8]]
+; CHECK-NEXT: [[TMP14:%.*]] = trunc i64 [[TMP13]] to i32
; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[IV]]
-; CHECK-NEXT: store i64 [[IV]], ptr [[ARRAYIDX]], align 8
-; CHECK-NEXT: [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
-; CHECK-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], [[N]]
-; CHECK-NEXT: br i1 [[EXITCOND_NOT]], label [[FOR_COND_CLEANUP:%.*]], label [[FOR_BODY]]
+; CHECK-NEXT: [[TMP16:%.*]] = getelementptr inbounds i64, ptr [[ARRAYIDX]], i32 0
+; CHECK-NEXT: call void @llvm.vp.store.nxv2i64.p0(<vscale x 2 x i64> [[VEC_IND]], ptr align 8 [[TMP16]], <vscale x 2 x i1> splat (i1 true), i32 [[TMP14]])
+; CHECK-NEXT: [[IV_NEXT]] = add i64 [[IV]], [[TMP8]]
+; CHECK-NEXT: [[VEC_IND_NEXT]] = add <vscale x 2 x i64> [[VEC_IND]], [[DOTSPLAT]]
+; CHECK-NEXT: [[TMP17:%.*]] = icmp eq i64 [[IV_NEXT]], [[N_VEC]]
+; CHECK-NEXT: br i1 [[TMP17]], label [[MIDDLE_BLOCK:%.*]], label [[FOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK: middle.block:
+; CHECK-NEXT: br label [[FOR_COND_CLEANUP:%.*]]
+; CHECK: scalar.ph:
+; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 0, [[ENTRY1:%.*]] ]
+; CHECK-NEXT: br label [[FOR_BODY1:%.*]]
+; CHECK: for.body:
+; CHECK-NEXT: [[IV1:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[IV_NEXT1:%.*]], [[FOR_BODY1]] ]
+; CHECK-NEXT: [[ARRAYIDX1:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[IV1]]
+; CHECK-NEXT: store i64 [[IV1]], ptr [[ARRAYIDX1]], align 8
+; CHECK-NEXT: [[IV_NEXT1]] = add nuw nsw i64 [[IV1]], 1
+; CHECK-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT1]], [[N]]
+; CHECK-NEXT: br i1 [[EXITCOND_NOT]], label [[FOR_COND_CLEANUP]], label [[FOR_BODY1]], !llvm.loop [[LOOP3:![0-9]+]]
; CHECK: for.cond.cleanup:
; CHECK-NEXT: ret void
;
@@ -39,16 +78,60 @@ define void @test_wide_ptr_induction(ptr noalias %a, ptr noalias %b, i64 %N) {
; CHECK-LABEL: define void @test_wide_ptr_induction(
; CHECK-SAME: ptr noalias [[A:%.*]], ptr noalias [[B:%.*]], i64 [[N:%.*]]) #[[ATTR0]] {
; CHECK-NEXT: entry:
+; CHECK-NEXT: [[TMP0:%.*]] = sub i64 -1, [[N]]
+; CHECK-NEXT: [[TMP1:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP2:%.*]] = mul i64 [[TMP1]], 2
+; CHECK-NEXT: [[TMP3:%.*]] = icmp ult i64 [[TMP0]], [[TMP2]]
+; CHECK-NEXT: br i1 [[TMP3]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
+; CHECK: vector.ph:
+; CHECK-NEXT: [[TMP4:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 2
+; CHECK-NEXT: [[TMP6:%.*]] = sub i64 [[TMP5]], 1
+; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[N]], [[TMP6]]
+; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP5]]
+; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
+; CHECK-NEXT: [[TMP7:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP8:%.*]] = mul i64 [[TMP7]], 2
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
+; CHECK: vector.body:
+; CHECK-NEXT: [[EVL_BASED_IV:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_EVL_NEXT:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT: [[POINTER_PHI:%.*]] = phi ptr [ [[B]], [[VECTOR_PH]] ], [ [[PTR_IND:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT: [[TMP9:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP10:%.*]] = mul i64 [[TMP9]], 2
+; CHECK-NEXT: [[TMP11:%.*]] = mul i64 8, [[TMP8]]
+; CHECK-NEXT: [[TMP12:%.*]] = mul i64 [[TMP10]], 0
+; CHECK-NEXT: [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 2 x i64> poison, i64 [[TMP12]], i64 0
+; CHECK-NEXT: [[DOTSPLAT:%.*]] = shufflevector <vscale x 2 x i64> [[DOTSPLATINSERT]], <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer
+; CHECK-NEXT: [[TMP13:%.*]] = call <vscale x 2 x i64> @llvm.stepvector.nxv2i64()
+; CHECK-NEXT: [[TMP14:%.*]] = add <vscale x 2 x i64> [[DOTSPLAT]], [[TMP13]]
+; CHECK-NEXT: [[TMP15:%.*]] = mul <vscale x 2 x i64> [[TMP14]], splat (i64 8)
+; CHECK-NEXT: [[VECTOR_GEP:%.*]] = getelementptr i8, ptr [[POINTER_PHI]], <vscale x 2 x i64> [[TMP15]]
+; CHECK-NEXT: [[AVL:%.*]] = sub i64 [[N]], [[EVL_BASED_IV]]
+; CHECK-NEXT: [[TMP16:%.*]] = icmp ult i64 [[AVL]], [[TMP8]]
+; CHECK-NEXT: [[TMP17:%.*]] = select i1 [[TMP16]], i64 [[AVL]], i64 [[TMP8]]
+; CHECK-NEXT: [[TMP18:%.*]] = trunc i64 [[TMP17]] to i32
+; CHECK-NEXT: [[TMP19:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[EVL_BASED_IV]]
+; CHECK-NEXT: [[TMP20:%.*]] = getelementptr inbounds ptr, ptr [[TMP19]], i32 0
+; CHECK-NEXT: call void @llvm.vp.store.nxv2p0.p0(<vscale x 2 x ptr> [[VECTOR_GEP]], ptr align 8 [[TMP20]], <vscale x 2 x i1> splat (i1 true), i32 [[TMP18]])
+; CHECK-NEXT: [[INDEX_EVL_NEXT]] = add i64 [[EVL_BASED_IV]], [[TMP8]]
+; CHECK-NEXT: [[PTR_IND]] = getelementptr i8, ptr [[POINTER_PHI]], i64 [[TMP11]]
+; CHECK-NEXT: [[TMP21:%.*]] = icmp eq i64 [[INDEX_EVL_NEXT]], [[N_VEC]]
+; CHECK-NEXT: br i1 [[TMP21]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
+; CHECK: middle.block:
+; CHECK-NEXT: br label [[FOR_COND_CLEANUP:%.*]]
+; CHECK: scalar.ph:
+; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ]
+; CHECK-NEXT: [[BC_RESUME_VAL1:%.*]] = phi ptr [ [[B]], [[ENTRY]] ]
+; CHECK-NEXT: br label [[FOR_BODY:%.*]]
; CHECK: for.body:
-; CHECK-NEXT: [[EVL_BASED_IV:%.*]] = phi i64 [ 0, [[VECTOR_PH:%.*]] ], [ [[INDEX_EVL_NEXT:%.*]], [[VECTOR_BODY]] ]
-; CHECK-NEXT: [[ADDR:%.*]] = phi ptr [ [[INCDEC_PTR:%.*]], [[VECTOR_BODY]] ], [ [[B]], [[VECTOR_PH]] ]
+; CHECK-NEXT: [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], [[FOR_BODY]] ]
+; CHECK-NEXT: [[ADDR:%.*]] = phi ptr [ [[INCDEC_PTR:%.*]], [[FOR_BODY]] ], [ [[BC_RESUME_VAL1]], [[SCALAR_PH]] ]
; CHECK-NEXT: [[INCDEC_PTR]] = getelementptr inbounds i8, ptr [[ADDR]], i64 8
-; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[EVL_BASED_IV]]
+; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[IV]]
; CHECK-NEXT: store ptr [[ADDR]], ptr [[ARRAYIDX]], align 8
-; CHECK-NEXT: [[INDEX_EVL_NEXT]] = add nuw nsw i64 [[EVL_BASED_IV]], 1
-; CHECK-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDEX_EVL_NEXT]], [[N]]
-; CHECK-NEXT: br i1 [[EXITCOND_NOT]], label [[FOR_COND_CLEANUP:%.*]], label [[VECTOR_BODY]]
+; CHECK-NEXT: [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
+; CHECK-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], [[N]]
+; CHECK-NEXT: br i1 [[EXITCOND_NOT]], label [[FOR_COND_CLEANUP]], label [[FOR_BODY]], !llvm.loop [[LOOP5:![0-9]+]]
; CHECK: for.cond.cleanup:
; CHECK-NEXT: ret void
;
@@ -68,3 +151,11 @@ for.body:
for.cond.cleanup:
ret void
}
+;.
+; CHECK: [[LOOP0]] = distinct !{[[LOOP0]], [[META1:![0-9]+]], [[META2:![0-9]+]]}
+; CHECK: [[META1]] = !{!"llvm.loop.isvectorized", i32 1}
+; CHECK: [[META2]] = !{!"llvm.loop.unroll.runtime.disable"}
+; CHECK: [[LOOP3]] = distinct !{[[LOOP3]], [[META2]], [[META1]]}
+; CHECK: [[LOOP4]] = distinct !{[[LOOP4]], [[META1]], [[META2]]}
+; CHECK: [[LOOP5]] = distinct !{[[LOOP5]], [[META2]], [[META1]]}
+;.
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/inloop-reduction.ll b/llvm/test/Transforms/LoopVectorize/RISCV/inloop-reduction.ll
index 8e90287bac2a2..368143cada36d 100644
--- a/llvm/test/Transforms/LoopVectorize/RISCV/inloop-reduction.ll
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/inloop-reduction.ll
@@ -132,21 +132,20 @@ define i32 @add_i16_i32(ptr nocapture readonly %x, i32 %n) {
; IF-EVL-OUTLOOP-NEXT: [[TMP4:%.*]] = mul i32 [[TMP3]], 4
; IF-EVL-OUTLOOP-NEXT: br label [[VECTOR_BODY:%.*]]
; IF-EVL-OUTLOOP: vector.body:
-; IF-EVL-OUTLOOP-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; IF-EVL-OUTLOOP-NEXT: [[EVL_BASED_IV:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_EVL_NEXT:%.*]], [[VECTOR_BODY]] ]
; IF-EVL-OUTLOOP-NEXT: [[VEC_PHI:%.*]] = phi <vscale x 4 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP10:%.*]], [[VECTOR_BODY]] ]
; IF-EVL-OUTLOOP-NEXT: [[AVL:%.*]] = sub i32 [[N]], [[EVL_BASED_IV]]
-; IF-EVL-OUTLOOP-NEXT: [[TMP5:%.*]] = call i32 @llvm.experimental.get.vector.length.i32(i32 [[AVL]], i32 4, i1 true)
+; IF-EVL-OUTLOOP-NEXT: [[TMP6:%.*]] = icmp ult i32 [[AVL]], [[TMP4]]
+; IF-EVL-OUTLOOP-NEXT: [[TMP5:%.*]] = select i1 [[TMP6]], i32 [[AVL]], i32 [[TMP4]]
; IF-EVL-OUTLOOP-NEXT: [[TMP7:%.*]] = getelementptr inbounds i16, ptr [[X:%.*]], i32 [[EVL_BASED_IV]]
; IF-EVL-OUTLOOP-NEXT: [[TMP8:%.*]] = getelementptr inbounds i16, ptr [[TMP7]], i32 0
; IF-EVL-OUTLOOP-NEXT: [[VP_OP_LOAD:%.*]] = call <vscale x 4 x i16> @llvm.vp.load.nxv4i16.p0(ptr align 2 [[TMP8]], <vscale x 4 x i1> splat (i1 true), i32 [[TMP5]])
; IF-EVL-OUTLOOP-NEXT: [[TMP9:%.*]] = sext <vscale x 4 x i16> [[VP_OP_LOAD]] to <vscale x 4 x i32>
; IF-EVL-OUTLOOP-NEXT: [[VP_OP:%.*]] = add <vscale x 4 x i32> [[VEC_PHI]], [[TMP9]]
; IF-EVL-OUTLOOP-NEXT: [[TMP10]] = call <vscale x 4 x i32> @llvm.vp.merge.nxv4i32(<vscale x 4 x i1> splat (i1 true), <vscale x 4 x i32> [[VP_OP]], <vscale x 4 x i32> [[VEC_PHI]], i32 [[TMP5]])
-; IF-EVL-OUTLOOP-NEXT: [[INDEX_EVL_NEXT]] = add nuw i32 [[TMP5]], [[EVL_BASED_IV]]
-; IF-EVL-OUTLOOP-NEXT: [[INDEX_NEXT]] = add nuw i32 [[INDEX]], [[TMP4]]
-; IF-EVL-OUTLOOP-NEXT: [[TMP11:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
-; IF-EVL-OUTLOOP-NEXT: br i1 [[TMP11]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; IF-EVL-OUTLOOP-NEXT: [[INDEX_EVL_NEXT]] = add nuw i32 [[EVL_BASED_IV]], [[TMP4]]
+; IF-EVL-OUTLOOP-NEXT: [[TMP14:%.*]] = icmp eq i32 [[INDEX_EVL_NEXT]], [[N_VEC]]
+; IF-EVL-OUTLOOP-NEXT: br i1 [[TMP14]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
; IF-EVL-OUTLOOP: middle.block:
; IF-EVL-OUTLOOP-NEXT: [[TMP12:%.*]] = call i32 @llvm.vector.reduce.add.nxv4i32(<vscale x 4 x i32> [[TMP10]])
; IF-EVL-OUTLOOP-NEXT: br label [[FOR_COND_CLEANUP_LOOPEXIT:%.*]]
@@ -163,7 +162,7 @@ define i32 @add_i16_i32(ptr nocapture readonly %x, i32 %n) {
; IF-EVL-OUTLOOP-NEXT: [[ADD]] = add nsw i32 [[R_07]], [[CONV]]
; IF-EVL-OUTLOOP-NEXT: [[INC]] = add nuw nsw i32 [[I_08]], 1
; IF-EVL-OUTLOOP-NEXT: [[EXITCOND:%.*]] = icmp eq i32 [[INC]], [[N]]
-; IF-EVL-OUTLOOP-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP_LOOPEXIT]], label [[FOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
+; IF-EVL-OUTLOOP-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP_LOOPEXIT]], label [[FOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]
; IF-EVL-OUTLOOP: for.cond.cleanup.loopexit:
; IF-EVL-OUTLOOP-NEXT: [[ADD_LCSSA:%.*]] = phi i32 [ [[ADD]], [[FOR_BODY]] ], [ [[TMP12]], [[MIDDLE_BLOCK]] ]
; IF-EVL-OUTLOOP-NEXT: br label [[FOR_COND_CLEANUP]]
@@ -188,20 +187,19 @@ define i32 @add_i16_i32(ptr nocapture readonly %x, i32 %n) {
; IF-EVL-INLOOP-NEXT: [[TMP4:%.*]] = mul i32 [[TMP3]], 8
; IF-EVL-INLOOP-NEXT: br label [[VECTOR_BODY:%.*]]
; IF-EVL-INLOOP: vector.body:
-; IF-EVL-INLOOP-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; IF-EVL-INLOOP-NEXT: [[EVL_BASED_IV:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_EVL_NEXT:%.*]], [[VECTOR_BODY]] ]
; IF-EVL-INLOOP-NEXT: [[VEC_PHI:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[TMP11:%.*]], [[VECTOR_BODY]] ]
; IF-EVL-INLOOP-NEXT: [[TMP5:%.*]] = sub i32 [[N]], [[EVL_BASED_IV]]
-; IF-EVL-INLOOP-NEXT: [[TMP6:%.*]] = call i32 @llvm.experimental.get.vector.length.i32(i32 [[TMP5]], i32 8, i1 true)
+; IF-EVL-INLOOP-NEXT: [[TMP7:%.*]] = icmp ult i32 [[TMP5]], [[TMP4]]
+; IF-EVL-INLOOP-NEXT: [[TMP6:%.*]] = select i1 [[TMP7]], i32 [[TMP5]], i32 [[TMP4]]
; IF-EVL-INLOOP-NEXT: [[TMP8:%.*]] = getelementptr inbounds i16, ptr [[X:%.*]], i32 [[EVL_BASED_IV]]
; IF-EVL-INLOOP-NEXT: [[TMP9:%.*]] = getelementptr inbounds i16, ptr [[TMP8]], i32 0
; IF-EVL-INLOOP-NEXT: [[VP_OP_LOAD:%.*]] = call <vscale x 8 x i16> @llvm.vp.load.nxv8i16.p0(ptr align 2 [[TMP9]], <vscale x 8 x i1> splat (i1 true), i32 [[TMP6]])
; IF-EVL-INLOOP-NEXT: [[TMP14:%.*]] = sext <vscale x 8 x i16> [[VP_OP_LOAD]] to <vscale x 8 x i32>
; IF-EVL-INLOOP-NEXT: [[TMP10:%.*]] = call i32 @llvm.vp.reduce.add.nxv8i32(i32 0, <vscale x 8 x i32> [[TMP14]], <vscale x 8 x i1> splat (i1 true), i32 [[TMP6]])
; IF-EVL-INLOOP-NEXT: [[TMP11]] = add i32 [[TMP10]], [[VEC_PHI]]
-; IF-EVL-INLOOP-NEXT: [[INDEX_EVL_NEXT]] = add nuw i32 [[TMP6]], [[EVL_BASED_IV]]
-; IF-EVL-INLOOP-NEXT: [[INDEX_NEXT]] = add nuw i32 [[INDEX]], [[TMP4]]
-; IF-EVL-INLOOP-NEXT: [[TMP12:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
+; IF-EVL-INLOOP-NEXT: [[INDEX_EVL_NEXT]] = add nuw i32 [[EVL_BASED_IV]], [[TMP4]]
+; IF-EVL-INLOOP-NEXT: [[TMP12:%.*]] = icmp eq i32 [[INDEX_EVL_NEXT]], [[N_VEC]]
; IF-EVL-INLOOP-NEXT: br i1 [[TMP12]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
; IF-EVL-INLOOP: middle.block:
; IF-EVL-INLOOP-NEXT: br label [[FOR_COND_CLEANUP_LOOPEXIT:%.*]]
@@ -218,7 +216,7 @@ define i32 @add_i16_i32(ptr nocapture readonly %x, i32 %n) {
; IF-EVL-INLOOP-NEXT: [[ADD]] = add nsw i32 [[R_07]], [[CONV]]
; IF-EVL-INLOOP-NEXT: [[INC]] = add nuw nsw i32 [[I_08]], 1
; IF-EVL-INLOOP-NEXT: [[EXITCOND:%.*]] = icmp eq i32 [[INC]], [[N]]
-; IF-EVL-INLOOP-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP_LOOP...
[truncated]
You can test this locally with the following command:

git-clang-format --diff HEAD~1 HEAD --extensions cpp -- llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp

View the diff from clang-format here.

diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
index 161cf8246..733d54c36 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
@@ -2285,8 +2285,9 @@ bool VPlanTransforms::tryAddExplicitVectorLength(
VPBuilder Builder(Header, Header->getFirstNonPhi());
// Compute original TC - IV as the AVL (application vector length).
- VPValue *AVL = Builder.createNaryOp(
- Instruction::Sub, {Plan.getTripCount(), CanonicalIVPHI}, DebugLoc(), "avl");
+ VPValue *AVL = Builder.createNaryOp(Instruction::Sub,
+ {Plan.getTripCount(), CanonicalIVPHI},
+ DebugLoc(), "avl");
if (MaxSafeElements) {
// Support for MaxSafeDist for correct loop emission.
VPValue *AVLSafe = Plan.getOrAddLiveIn(
@@ -2302,9 +2303,9 @@ bool VPlanTransforms::tryAddExplicitVectorLength(
unsigned BitWidth = CanonicalIVPHI->getScalarType()->getScalarSizeInBits();
LLVMContext &Ctx = CanonicalIVPHI->getScalarType()->getContext();
- VPEVL = Builder.createScalarCast(
- BitWidth > 32 ? Instruction::Trunc : Instruction::ZExt, VPEVL,
- Type::getInt32Ty(Ctx), DebugLoc());
+ VPEVL = Builder.createScalarCast(BitWidth > 32 ? Instruction::Trunc
+ : Instruction::ZExt,
+ VPEVL, Type::getInt32Ty(Ctx), DebugLoc());
transformRecipestoEVLRecipes(Plan, *VPEVL);
Just want to make sure I understand correctly: when you said an extra umin here, you meant an extra umin instruction in the final machine code, right? Because at the LLVM IR level (i.e. in the vectorizer) you're replacing get.vector.length with umin. If that's the case, I guess my (daring) question is: can we get rid of that umin instruction (in the codegen pipeline, ofc)?
where a0 is the AVL. This code, as you also mentioned, always yields VLMAX in a2 except on the last iteration, forcing VL to be the values we favor regardless of the hardware implementation. But what if we optimize it to
Sure, this alters the behavior, because now the a2 values in the last two iterations might be different (compared with the minu + vsetvli). But will that actually be unsafe? If it's dictating a memory operation, the total number of elements, and hence the address range, should be the same, so I don't think there will be a page fault.
I think I'm less concerned about this, as we're already reading vlenb in our current approach.
Yes. I'd somewhat meant in both places, but the important one is the final assembly. The get.vector.length is expected to become a vsetvli, but the umin doesn't replace the vsetvli - we need both the umin and then the vsetvli.
If we do this at the IR level, I believe what you're describing is the existing EVL implementation approach. If we don't change the IR, and do this as a late rewrite, then doing so in a sound manner is tricky.
I agree, just noting the potential issue.
Just chatted with Craig and realized that I blunderingly forgot that my EVLIndVarSimplify pass has effectively removed the vlenb reads.
I meant doing it (getting rid of the minu instruction) later in the codegen pipeline, potentially after RISCVInsertVSETVLI, and I'm wondering what we should do to ensure soundness.
EVLPhi->addOperand(NextEVLIV);
// This is just a umin pattern
VPValue &VFxUF = Plan.getVFxUF();
VPValue *Cmp = Builder.createICmp(ICmpInst::ICMP_ULT, AVL, &VFxUF);
Why can't we use the umin intrinsic?
I think it's because VPBuilder currently doesn't provide a way to create min/max intrinsics, and VPInstruction doesn't have a recipe for min/max yet either.
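(For illustration only, and not part of this patch: at the plain LLVM-IR level the same clamp can be emitted as a single llvm.umin call through IRBuilder; the helper name below is made up.)

```cpp
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Intrinsics.h"

using namespace llvm;

// Hypothetical helper: emit min(AVL, VFxUF) as one llvm.umin intrinsic call,
// the IR-level counterpart of the icmp ult + select pair created in the patch.
static Value *emitRequestedAVL(IRBuilderBase &Builder, Value *AVL,
                               Value *VFxUF) {
  return Builder.CreateBinaryIntrinsic(Intrinsic::umin, AVL, VFxUF);
}
```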
This is an interesting approach. I have a few questions on my end:
This breaks a bunch of assumptions in the vectorizer, and has required a bunch of work to address, with some of that work still in progress.
Besides the work on the widen Induction recipe, is there anything else currently related to the penultimate EVL issue?
Interestingly, that improvement comes primarily by unblocking vectorization cases that the current EVL code can't handle.
Did you exclude the cases that the current EVL code can't vectorize when comparing the numbers? I think that would give a clearer picture of the actual difference between the two approaches.
RISCV supports VL predication via the VL configuration register. The specification has a special case, when a vsetvl instruction is executed, for a requested AVL between VLMAX and VLMAX*2.
The last question might be more related to the ISA spec. I've always been curious about this particular design in the RISC-V spec—what exactly is its purpose? Is it meant to benefit hardware performance, or is there another reason behind it? If this design does provide advantages for certain hardware, should we consider preserving the original approach when implementing the new EVL method, and allow the choice between the two via TTI or an option?
PrevEVL = Builder.createScalarPhi({MaxEVL, &EVL}, DebugLoc(), "prev.evl");
PrevEVL = MaxEVL;
The code in if (ContainsFOR) and vp.splice can be removed if VLOPT works well.
Thanks for sharing! On a high-level it seems like this would help simplify things by getting rid of a special case, which would be great.
I am not familiar with how important the existing functionality is for performance on RISCV cores. IIRC, when the support was added initially, the understanding was that it would be beneficial, so it would be great for people familiar with the actual HW implementations to chime in.
CanonicalIVIncrement->getDebugLoc(), "index.evl.next");
EVLPhi->addOperand(NextEVLIV);
// This is just a umin pattern
VPValue &VFxUF = Plan.getVFxUF();
IIRC we currently restrict UF to 1 for EVL tail folding. Do you have any data on whether this works if we remove the restriction?
I believe so, though I'm honestly missing why the current strategy can't do it too. You just need to compute a separate remaining iteration count (and thus EVL) for each unrolled iteration (since all but the first could be zero). I have not looked into how the code structure would look here.
The key bit is that the VP intrinsics do claim to support a possibly-zero EVL argument. We might have some bugs to flush out there (possibly), since that codepath isn't being tested.
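A rough sketch of the arithmetic (plain C++, hypothetical, and not VPlan code), assuming UF parts of VF lanes each per vector iteration:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Per-part EVLs for UF > 1 under the umin scheme: part J of the current
// vector iteration gets min(max(Remaining - J*VF, 0), VF) lanes, so every
// part outside the final iteration gets exactly VF, and trailing parts in
// the final iteration may legitimately receive an EVL of zero.
static std::vector<uint64_t> perPartEVLs(uint64_t Remaining, uint64_t VF,
                                         unsigned UF) {
  std::vector<uint64_t> EVLs(UF);
  for (unsigned J = 0; J < UF; ++J) {
    uint64_t Consumed = uint64_t(J) * VF;
    uint64_t Left = Remaining > Consumed ? Remaining - Consumed : 0;
    EVLs[J] = std::min(Left, VF); // may be 0 for trailing parts
  }
  return EVLs;
}
// e.g. perPartEVLs(11, 8, 2) -> {8, 3}; perPartEVLs(5, 8, 2) -> {5, 0}
```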