-
Notifications
You must be signed in to change notification settings - Fork 737
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fast-Path String and Vector Hash Code Methods on Power #21081
base: master
Are you sure you want to change the base?
Conversation
b8ac845
to
bf75642
Compare
bf75642
to
d61d6a1
Compare
Fast-path ArraysSupport.vectorizedHashCode and String.hashCodeImplCompressed methods on Power. The String.hashCodeImplDecompressed method has already been fast-pathed. Since the other methods use the same logic, the existing code can be modified to recognise and accomodate them. Signed-off-by: Luke Li <[email protected]>
d61d6a1
to
ef1ccb2
Compare
Weirdly, the baseline build is outperforming the fast path implementation on small arrays. Vectors:
Strings:
|
Some updated string data with compressed and uncompressed strings Compressed:
Decompressed:
|
It seems like |
fyi @zl-wang |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you Luke, I have one area of the code that I'm not sure about, but the rest are suggestions or things to bring into attention.
Also, wondering if you tested the code by comparing hashcode output with and without the fast-path to make sure the hashing is the same.
|
||
// Skip header of the array | ||
intptr_t hdrSize = TR::Compiler->om.contiguousArrayHeaderSizeInBytes(); | ||
generateTrg1Src1ImmInstruction(cg, TR::InstOpCode::addi, node, valueReg, valueReg, hdrSize); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems to me that adding a load of dataAddr pointer instead of adding headerSize here is only thing holding enabling this for OffHeap, @zl-wang. Not sure if worth doing it separately here or with other platforms later.
loadConstant(cg, node, 0xF, tempReg); | ||
generateTrg1Src2Instruction(cg, TR::InstOpCode::AND, node, tempReg, valueReg, tempReg); | ||
generateTrg1Src1ImmInstruction(cg, TR::InstOpCode::cmpi4, node, condReg, tempReg, 0x0); | ||
generateConditionalBranchInstruction(cg, TR::InstOpCode::beq, node, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although its from the older code, but wondering if and_r
(and.
) instruction can be used to not need a separate cmp
instruction, and if its worth it.
generateTrg1Src2Instruction(cg, TR::InstOpCode::AND, node, tempReg, valueReg, tempReg); | ||
generateTrg1Src1ImmInstruction(cg, TR::InstOpCode::cmpi4, node, condReg, tempReg, 0x0); | ||
generateConditionalBranchInstruction(cg, TR::InstOpCode::beq, node, | ||
VSXLabel, condReg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have the generateConditionalBranchInstruction
in one line.
TR::Register *vtmp2Reg = cg->allocateRegister(TR_VRF); | ||
TR::Register *vconstant0Reg = cg->allocateRegister(TR_VRF); | ||
TR::Register *vconstantNegReg = cg->allocateRegister(TR_VRF); | ||
TR::Register *vunpackMaskReg = cg->allocateRegister(TR_VRF); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that the vunpackMaskReg
register and some instructions that use it are generated and not used, when type is Int32
and for all types when isSigned==true
.
generateTrg1Src2Instruction(cg, TR::InstOpCode::add, node, endReg, valueReg, vendReg); | ||
|
||
// temp = 0 | ||
generateTrg1Src2Instruction(cg, TR::InstOpCode::XOR, node, tempReg, tempReg, tempReg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to me that initializing temp to null is not necessary, the two paths it can go after this both re-initialize temp with some value before using it.
// load unaligned v, mask out unwanted part | ||
// for example, if value = 0x12345, we mask out 0x12340~0x12344, keep 0x12345~0x1234F | ||
|
||
// vtmp1Reg = mem(valueReg & 0xFFFFFFFFFFFFFFF0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is a bit misleading, because you still load valueReg as-is into vtmp1Reg. Comment insinuate that you're loading a lower address than valueReg (data start address) which can be a problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But doing lvx
on valueReg does mean loading the lower addresses in the case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, lvx
does the ANDing. I'm usually careful of loading a lower than given address but here it's 16byte aligned so it won't cross into another memory page.
Might need to check for OffHeap if dataAddress is at the beginning of some segment that we shouldn't access earlier address, still will be in the same 16byte alignment so I believe it would be fine.
generateTrg1Src2Instruction(cg, TR::InstOpCode::vmuluwm, node, vtmp2Reg, vtmp2Reg, multiplierReg); | ||
// restore the multiplier reg | ||
generateTrg1MemInstruction(cg, TR::InstOpCode::lxvw4x, node, multiplierReg, | ||
TR::MemoryReference::createWithIndexReg(cg, multiplierAddrReg, constant0Reg, 16)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although correct, I think you can delay loading the multiplierReg
initially in line 10535 to not have to load it twice. Currently it's loaded (and not used), then over-written here, then restored, which is redundant.
} | ||
generateTrg1Src1Instruction(cg, TR::InstOpCode::mtvsrwz, node, initialValueReg, hashReg); | ||
generateTrg1Src2ImmInstruction(cg, TR::InstOpCode::vsldoi, node, initialValueReg, vconstant0Reg, initialValueReg, 8); | ||
generateTrg1Src2ImmInstruction(cg, TR::InstOpCode::vsldoi, node, initialValueReg, initialValueReg, vconstant0Reg, 12); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these 3 instructions can be done in 2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do I do that? This is the only way I can think of to load in the highest word in a vector register, while keeping all the other words 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The one I thought of is using mtvsrws
then setting the non-highest words to 0 (with vsldoi
, or not sure if there's something else/better).
if (isLE) { | ||
generateTrg1Src1ImmInstruction(cg, TR::InstOpCode::addi, node, tempReg, tempReg, 0xF); | ||
} else { // for BE, we want the correct value to be in the third word | ||
generateTrg1Src1ImmInstruction(cg, TR::InstOpCode::addi, node, tempReg, tempReg, 0xF-12); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not fully understanding the calculations here, especially the 0xF-12
. Also worried given that the calculated address is loaded with lxvw4x
and depend on the min/max possible values, if it might load outside the static multiplierVectors and if that would cause an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The BE one is probably incorrect; I just never tested it in BE. It should be fixable.
I will add four extra zeros to the static arrays to ensure loading will not be outside of them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be BE machines that you can test in, I'll send you details.
Yep, the code was tested and the values were correct for all the scenario I could think of. |
i can come back to review this later, but i would start by suggesting you read some of the vectorized fast-path implementations, e.g. String.equal/compareTo or String.indexOf. At least, you are able to handle the misaligned part much better on POWER10 (there are vector load/store with length instructions ... or look at arrayCopy helper code on POWER10). |
Fast-path ArraysSupport.vectorizedHashCode and String.hashCodeImplCompressed methods on Power. The String.hashCodeImplDecompressed method has already been fast-pathed. Since the other methods use the same logic, the existing code can be modified to recognise and accomodate them.