Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fast-Path String and Vector Hash Code Methods on Power #21081

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

luke-li-2003
Copy link
Contributor

Fast-path ArraysSupport.vectorizedHashCode and String.hashCodeImplCompressed methods on Power. The String.hashCodeImplDecompressed method has already been fast-pathed. Since the other methods use the same logic, the existing code can be modified to recognise and accomodate them.

@luke-li-2003 luke-li-2003 changed the title giFast-Path String and Vector Hash Code Methods on Power Fast-Path String and Vector Hash Code Methods on Power Feb 6, 2025
@luke-li-2003
Copy link
Contributor Author

Fast-path ArraysSupport.vectorizedHashCode and String.hashCodeImplCompressed
methods on Power. The String.hashCodeImplDecompressed method has already
been fast-pathed. Since the other methods use the same logic, the existing
code can be modified to recognise and accomodate them.

Signed-off-by: Luke Li <[email protected]>
@luke-li-2003
Copy link
Contributor Author

luke-li-2003 commented Feb 7, 2025

Weirdly, the baseline build is outperforming the fast path implementation on small arrays.

Vectors:

Data Type Array Length Base Build Fast Path
Byte 4 159M 78M
Byte 8 107M 60M
Byte 16 52M 62M
Byte 128 5M 27M
Int 4 170M 125M
Int 8 122M 109M
Int 16 62M 79M
Int 128 6M 13M

Strings:

Length Base Build Fast Path
10 65M 48M
20 37M 38M
40 14M 37M
80 7M 16M

@luke-li-2003
Copy link
Contributor Author

Some updated string data with compressed and uncompressed strings

Compressed:

Length Base Build Fast Path
8 69M 70M
16 37M 52M
32 20M 50M

Decompressed:

Length Base Build Fast Path
8 81M 78M
16 79M 83M
32 40M 42M

@luke-li-2003
Copy link
Contributor Author

It seems like String.hashCodeImplDecompressed that has already been implemented shares the behaviour of my changes, namely it is slower than a call-out for strings shorter than 8.

@luke-li-2003
Copy link
Contributor Author

fyi @zl-wang

Copy link
Contributor

@rmnattas rmnattas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you Luke, I have one area of the code that I'm not sure about, but the rest are suggestions or things to bring into attention.

Also, wondering if you tested the code by comparing hashcode output with and without the fast-path to make sure the hashing is the same.


// Skip header of the array
intptr_t hdrSize = TR::Compiler->om.contiguousArrayHeaderSizeInBytes();
generateTrg1Src1ImmInstruction(cg, TR::InstOpCode::addi, node, valueReg, valueReg, hdrSize);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to me that adding a load of dataAddr pointer instead of adding headerSize here is only thing holding enabling this for OffHeap, @zl-wang. Not sure if worth doing it separately here or with other platforms later.

loadConstant(cg, node, 0xF, tempReg);
generateTrg1Src2Instruction(cg, TR::InstOpCode::AND, node, tempReg, valueReg, tempReg);
generateTrg1Src1ImmInstruction(cg, TR::InstOpCode::cmpi4, node, condReg, tempReg, 0x0);
generateConditionalBranchInstruction(cg, TR::InstOpCode::beq, node,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although its from the older code, but wondering if and_r (and.) instruction can be used to not need a separate cmp instruction, and if its worth it.

generateTrg1Src2Instruction(cg, TR::InstOpCode::AND, node, tempReg, valueReg, tempReg);
generateTrg1Src1ImmInstruction(cg, TR::InstOpCode::cmpi4, node, condReg, tempReg, 0x0);
generateConditionalBranchInstruction(cg, TR::InstOpCode::beq, node,
VSXLabel, condReg);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have the generateConditionalBranchInstruction in one line.

TR::Register *vtmp2Reg = cg->allocateRegister(TR_VRF);
TR::Register *vconstant0Reg = cg->allocateRegister(TR_VRF);
TR::Register *vconstantNegReg = cg->allocateRegister(TR_VRF);
TR::Register *vunpackMaskReg = cg->allocateRegister(TR_VRF);
Copy link
Contributor

@rmnattas rmnattas Feb 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that the vunpackMaskReg register and some instructions that use it are generated and not used, when type is Int32 and for all types when isSigned==true.

generateTrg1Src2Instruction(cg, TR::InstOpCode::add, node, endReg, valueReg, vendReg);

// temp = 0
generateTrg1Src2Instruction(cg, TR::InstOpCode::XOR, node, tempReg, tempReg, tempReg);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that initializing temp to null is not necessary, the two paths it can go after this both re-initialize temp with some value before using it.

// load unaligned v, mask out unwanted part
// for example, if value = 0x12345, we mask out 0x12340~0x12344, keep 0x12345~0x1234F

// vtmp1Reg = mem(valueReg & 0xFFFFFFFFFFFFFFF0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is a bit misleading, because you still load valueReg as-is into vtmp1Reg. Comment insinuate that you're loading a lower address than valueReg (data start address) which can be a problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But doing lvx on valueReg does mean loading the lower addresses in the case?

Copy link
Contributor

@rmnattas rmnattas Feb 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, lvx does the ANDing. I'm usually careful of loading a lower than given address but here it's 16byte aligned so it won't cross into another memory page.
Might need to check for OffHeap if dataAddress is at the beginning of some segment that we shouldn't access earlier address, still will be in the same 16byte alignment so I believe it would be fine.

generateTrg1Src2Instruction(cg, TR::InstOpCode::vmuluwm, node, vtmp2Reg, vtmp2Reg, multiplierReg);
// restore the multiplier reg
generateTrg1MemInstruction(cg, TR::InstOpCode::lxvw4x, node, multiplierReg,
TR::MemoryReference::createWithIndexReg(cg, multiplierAddrReg, constant0Reg, 16));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although correct, I think you can delay loading the multiplierReg initially in line 10535 to not have to load it twice. Currently it's loaded (and not used), then over-written here, then restored, which is redundant.

}
generateTrg1Src1Instruction(cg, TR::InstOpCode::mtvsrwz, node, initialValueReg, hashReg);
generateTrg1Src2ImmInstruction(cg, TR::InstOpCode::vsldoi, node, initialValueReg, vconstant0Reg, initialValueReg, 8);
generateTrg1Src2ImmInstruction(cg, TR::InstOpCode::vsldoi, node, initialValueReg, initialValueReg, vconstant0Reg, 12);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these 3 instructions can be done in 2?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do I do that? This is the only way I can think of to load in the highest word in a vector register, while keeping all the other words 0.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The one I thought of is using mtvsrws then setting the non-highest words to 0 (with vsldoi, or not sure if there's something else/better).

if (isLE) {
generateTrg1Src1ImmInstruction(cg, TR::InstOpCode::addi, node, tempReg, tempReg, 0xF);
} else { // for BE, we want the correct value to be in the third word
generateTrg1Src1ImmInstruction(cg, TR::InstOpCode::addi, node, tempReg, tempReg, 0xF-12);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not fully understanding the calculations here, especially the 0xF-12. Also worried given that the calculated address is loaded with lxvw4x and depend on the min/max possible values, if it might load outside the static multiplierVectors and if that would cause an issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The BE one is probably incorrect; I just never tested it in BE. It should be fixable.

I will add four extra zeros to the static arrays to ensure loading will not be outside of them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be BE machines that you can test in, I'll send you details.

@luke-li-2003
Copy link
Contributor Author

Yep, the code was tested and the values were correct for all the scenario I could think of.

@zl-wang
Copy link
Contributor

zl-wang commented Feb 12, 2025

i can come back to review this later, but i would start by suggesting you read some of the vectorized fast-path implementations, e.g. String.equal/compareTo or String.indexOf. At least, you are able to handle the misaligned part much better on POWER10 (there are vector load/store with length instructions ... or look at arrayCopy helper code on POWER10).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants