New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Fast-Path String and Vector Hash Code Methods on Power #21081

Open

luke-li-2003 wants to merge 1 commit into eclipse-openj9:master from luke-li-2003:FastPathHashCode

+661 −245

Contributor

luke-li-2003 commented Feb 6, 2025

Fast-path ArraysSupport.vectorizedHashCode and String.hashCodeImplCompressed methods on Power. The String.hashCodeImplDecompressed method has already been fast-pathed. Since the other methods use the same logic, the existing code can be modified to recognise and accomodate them.

luke-li-2003 changed the title ~~giFast-Path String and Vector Hash Code Methods on Power~~ Fast-Path String and Vector Hash Code Methods on Power

luke-li-2003 force-pushed the FastPathHashCode branch from b8ac845 to bf75642 Compare

February 6, 2025 23:09

Contributor Author

luke-li-2003 commented Feb 6, 2025

Issue: https://github.ibm.com/runtimes/openj9-jit-power/issues/408

luke-li-2003 force-pushed the FastPathHashCode branch from bf75642 to d61d6a1 Compare

February 7, 2025 17:37


          Fast-Path String and Vector Hash Code Methods on Power

ef1ccb2

Fast-path ArraysSupport.vectorizedHashCode and String.hashCodeImplCompressed
methods on Power. The String.hashCodeImplDecompressed method has already
been fast-pathed. Since the other methods use the same logic, the existing
code can be modified to recognise and accomodate them.

Signed-off-by: Luke Li <[email protected]>

luke-li-2003 force-pushed the FastPathHashCode branch from d61d6a1 to ef1ccb2 Compare

February 7, 2025 19:09

Contributor Author

luke-li-2003 commented Feb 7, 2025 •

edited

Loading

Weirdly, the baseline build is outperforming the fast path implementation on small arrays.

Vectors:

Data Type	Array Length	Base Build	Fast Path
Byte	4	159M	78M
Byte	8	107M	60M
Byte	16	52M	62M
Byte	128	5M	27M
Int	4	170M	125M
Int	8	122M	109M
Int	16	62M	79M
Int	128	6M	13M

Strings:

Length	Base Build	Fast Path
10	65M	48M
20	37M	38M
40	14M	37M
80	7M	16M

Contributor Author

luke-li-2003 commented Feb 10, 2025

Some updated string data with compressed and uncompressed strings

Compressed:

Length	Base Build	Fast Path
8	69M	70M
16	37M	52M
32	20M	50M

Decompressed:

Length	Base Build	Fast Path
8	81M	78M
16	79M	83M
32	40M	42M

Contributor Author

luke-li-2003 commented Feb 10, 2025

It seems like String.hashCodeImplDecompressed that has already been implemented shares the behaviour of my changes, namely it is slower than a call-out for strings shorter than 8.

Contributor Author

luke-li-2003 commented Feb 10, 2025

rmnattas reviewed

View reviewed changes

Contributor

rmnattas left a comment

Thank you Luke, I have one area of the code that I'm not sure about, but the rest are suggestions or things to bring into attention.

Also, wondering if you tested the code by comparing hashcode output with and without the fast-path to make sure the hashing is the same.

runtime/compiler/p/codegen/J9TreeEvaluator.cpp

+                 // Skip header of the array
+                 intptr_t hdrSize = TR::Compiler->om.contiguousArrayHeaderSizeInBytes();
+                 generateTrg1Src1ImmInstruction(cg, TR::InstOpCode::addi, node, valueReg, valueReg, hdrSize);

Contributor

rmnattas Feb 12, 2025

Seems to me that adding a load of dataAddr pointer instead of adding headerSize here is only thing holding enabling this for OffHeap, @zl-wang. Not sure if worth doing it separately here or with other platforms later.

runtime/compiler/p/codegen/J9TreeEvaluator.cpp

+                 loadConstant(cg, node, 0xF, tempReg);
+                 generateTrg1Src2Instruction(cg, TR::InstOpCode::AND, node, tempReg, valueReg, tempReg);
+                 generateTrg1Src1ImmInstruction(cg, TR::InstOpCode::cmpi4, node, condReg, tempReg, 0x0);
+                 generateConditionalBranchInstruction(cg, TR::InstOpCode::beq, node,

Contributor

rmnattas Feb 12, 2025

Although its from the older code, but wondering if and_r (and.) instruction can be used to not need a separate cmp instruction, and if its worth it.

runtime/compiler/p/codegen/J9TreeEvaluator.cpp

+                 generateTrg1Src2Instruction(cg, TR::InstOpCode::AND, node, tempReg, valueReg, tempReg);
+                 generateTrg1Src1ImmInstruction(cg, TR::InstOpCode::cmpi4, node, condReg, tempReg, 0x0);
+                 generateConditionalBranchInstruction(cg, TR::InstOpCode::beq, node,
+                    VSXLabel, condReg);

Contributor

rmnattas Feb 12, 2025

Have the generateConditionalBranchInstruction in one line.

runtime/compiler/p/codegen/J9TreeEvaluator.cpp

+                 TR::Register *vtmp2Reg = cg->allocateRegister(TR_VRF);
+                 TR::Register *vconstant0Reg = cg->allocateRegister(TR_VRF);
+                 TR::Register *vconstantNegReg = cg->allocateRegister(TR_VRF);
+                 TR::Register *vunpackMaskReg = cg->allocateRegister(TR_VRF);

Contributor

rmnattas Feb 12, 2025 •

edited

Loading

It seems that the vunpackMaskReg register and some instructions that use it are generated and not used, when type is Int32 and for all types when isSigned==true.

runtime/compiler/p/codegen/J9TreeEvaluator.cpp

+                 generateTrg1Src2Instruction(cg, TR::InstOpCode::add, node, endReg, valueReg, vendReg);
+                 // temp = 0
+                 generateTrg1Src2Instruction(cg, TR::InstOpCode::XOR, node, tempReg, tempReg, tempReg);

Contributor

rmnattas Feb 12, 2025

It seems to me that initializing temp to null is not necessary, the two paths it can go after this both re-initialize temp with some value before using it.

runtime/compiler/p/codegen/J9TreeEvaluator.cpp

+                 // load unaligned v, mask out unwanted part
+                 // for example, if value = 0x12345, we mask out 0x12340~0x12344, keep 0x12345~0x1234F
+                 // vtmp1Reg = mem(valueReg & 0xFFFFFFFFFFFFFFF0)

Contributor

rmnattas Feb 12, 2025

This comment is a bit misleading, because you still load valueReg as-is into vtmp1Reg. Comment insinuate that you're loading a lower address than valueReg (data start address) which can be a problem.

Contributor Author

luke-li-2003 Feb 12, 2025

But doing lvx on valueReg does mean loading the lower addresses in the case?

Contributor

rmnattas Feb 12, 2025 •

edited

Loading

I see, lvx does the ANDing. I'm usually careful of loading a lower than given address but here it's 16byte aligned so it won't cross into another memory page.
Might need to check for OffHeap if dataAddress is at the beginning of some segment that we shouldn't access earlier address, still will be in the same 16byte alignment so I believe it would be fine.

runtime/compiler/p/codegen/J9TreeEvaluator.cpp

+                    generateTrg1Src2Instruction(cg, TR::InstOpCode::vmuluwm, node, vtmp2Reg, vtmp2Reg, multiplierReg);
+                    // restore the multiplier reg
+                    generateTrg1MemInstruction(cg, TR::InstOpCode::lxvw4x, node, multiplierReg,
+                       TR::MemoryReference::createWithIndexReg(cg, multiplierAddrReg, constant0Reg, 16));

Contributor

rmnattas Feb 12, 2025

Although correct, I think you can delay loading the multiplierReg initially in line 10535 to not have to load it twice. Currently it's loaded (and not used), then over-written here, then restored, which is redundant.

runtime/compiler/p/codegen/J9TreeEvaluator.cpp

+                          }
+                       generateTrg1Src1Instruction(cg, TR::InstOpCode::mtvsrwz, node, initialValueReg, hashReg);
+                       generateTrg1Src2ImmInstruction(cg, TR::InstOpCode::vsldoi, node, initialValueReg, vconstant0Reg, initialValueReg, 8);
+                       generateTrg1Src2ImmInstruction(cg, TR::InstOpCode::vsldoi, node, initialValueReg, initialValueReg, vconstant0Reg, 12);

Contributor

rmnattas Feb 12, 2025

I think these 3 instructions can be done in 2?

Contributor Author

luke-li-2003 Feb 12, 2025

How do I do that? This is the only way I can think of to load in the highest word in a vector register, while keeping all the other words 0.

Contributor

rmnattas Feb 12, 2025

The one I thought of is using mtvsrws then setting the non-highest words to 0 (with vsldoi, or not sure if there's something else/better).

runtime/compiler/p/codegen/J9TreeEvaluator.cpp

+                    if (isLE) {
+                       generateTrg1Src1ImmInstruction(cg, TR::InstOpCode::addi, node, tempReg, tempReg, 0xF);
+                    } else { // for BE, we want the correct value to be in the third word
+                       generateTrg1Src1ImmInstruction(cg, TR::InstOpCode::addi, node, tempReg, tempReg, 0xF-12);

Contributor

rmnattas Feb 12, 2025

I'm not fully understanding the calculations here, especially the 0xF-12. Also worried given that the calculated address is loaded with lxvw4x and depend on the min/max possible values, if it might load outside the static multiplierVectors and if that would cause an issue.

Contributor Author

luke-li-2003 Feb 12, 2025

The BE one is probably incorrect; I just never tested it in BE. It should be fixable.

I will add four extra zeros to the static arrays to ensure loading will not be outside of them.

Contributor

rmnattas Feb 12, 2025

There should be BE machines that you can test in, I'll send you details.

Contributor Author

luke-li-2003 commented Feb 12, 2025

Yep, the code was tested and the values were correct for all the scenario I could think of.

Contributor

zl-wang commented Feb 12, 2025

i can come back to review this later, but i would start by suggesting you read some of the vectorized fast-path implementations, e.g. String.equal/compareTo or String.indexOf. At least, you are able to handle the misaligned part much better on POWER10 (there are vector load/store with length instructions ... or look at arrayCopy helper code on POWER10).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet