Accelerate StringCoding.hasNegatives for JDK 11, 17, StringCoding.countPositives for JDK 21+ on x86 #21121

dylanjtuttle · 2025-02-12T14:34:26Z

This PR accelerates intrinsic candidates StringCoding.hasNegatives and StringCoding.countPositives on x86, the former on JDK 9-18 and the latter on JDK 19+.

This PR is incremental in a few ways.

The acceleration currently only delivers the desired performance boost for arrays of up to 31 elements. For arrays of 32 elements and longer, the acceleration still outperforms default OpenJ9, but can no longer keep up with HotSpot. For that, I will need to take advantage of larger SIMD instructions.
I have discovered a strange performance anomaly in OpenJ9 that causes it to perform significantly faster than expected for hasNegatives for arrays of 0-8 elements on JDK 19+. It performs so well for these short arrays that implementing my 'acceleration' would actually cause a performance regression there. While this anomaly is investigated, I will not be accelerating hasNegatives on JDK 19+.

In the interest of taking advantage of the performance boost as it currently stands for the 0.51 release, this PR will deliver these changes in their incremental state, with plans for another PR or two down the road to close the aforementioned gaps.

dylanjtuttle · 2025-02-12T14:36:11Z

Paging @vijaysun-omr, @0xdaryl for review and @r30shah, @dchopra001 as requested for comparison to acceleration on Z

vijaysun-omr · 2025-02-12T16:46:23Z

Looks fine to me from a quick review. I'll probably defer to @hzongaro for the review since he has much more direct awareness of this work than I do, and can easily review the inliner parts as well (that I skimmed over too).

dylanjtuttle · 2025-02-13T15:44:47Z

It appears that the crash is due to a recent OMR commit and is therefore unrelated to my changes. I can build successfully with the openj9 branch of the openj9-omr repo, and can now confirm all of the performance testing checks out.

BradleyWood · 2025-02-13T00:45:49Z

runtime/compiler/x/codegen/J9TreeEvaluator.cpp

+   // AND the residual bytes with the new mask
+   generateRegRegInstruction(TR::InstOpCode::TEST8RegReg, node, chunk, mask, cg);
+
+   // If the result is nonzero (i.e. at least one of the sign bits is set), return true


I think you can simplify this code like the following, no branching required.

    xor   resultReg, resultReg   ; Set result to 0
    test  chunk, mask
    setne resultReg              ; Set result to 1, if not eq
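For readers less familiar with the xor/test/setne idiom, the same branch-free check can be sketched in C++. This is only an illustration: the helper name `residueHasNegative`, the 8-byte chunk width, and the way the mask is built from a byte count are assumptions, not code from the PR.

```cpp
#include <cstdint>

// Hypothetical helper (not from the PR): returns 1 if any of the low
// `validBytes` bytes of `chunk` has its sign bit set, and 0 otherwise.
inline int residueHasNegative(std::uint64_t chunk, unsigned validBytes)
{
    const std::uint64_t allSignBits = 0x8080808080808080ULL;

    // Keep only the sign bits of the bytes that actually belong to the array.
    // Assumption: the residual bytes sit in the low-order end of the chunk.
    const std::uint64_t mask = (validBytes >= 8)
        ? allSignBits
        : (allSignBits & ((1ULL << (8 * validBytes)) - 1));

    // Same effect as: xor result, result; test chunk, mask; setne result.
    // The comparison yields 0 or 1 directly, so no branch is needed.
    return (chunk & mask) != 0;
}
```

A negative byte outside the valid range is ignored, so `residueHasNegative(0x800000ULL, 2)` is 0 while `residueHasNegative(0x8000ULL, 2)` is 1.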

BradleyWood · 2025-02-13T00:45:56Z

runtime/compiler/x/codegen/J9TreeEvaluator.cpp

+   auto xmmmask = cg->allocateRegister();
+   auto mask = cg->allocateRegister();
+   auto i = cg->allocateRegister();
+   auto xmmchunk = cg->allocateRegister();


For vector registers you need to call cg->allocateRegister(TR_VRF); otherwise register assignment and register spilling may not work correctly.

BradleyWood · 2025-02-13T00:47:08Z

runtime/compiler/x/codegen/J9TreeEvaluator.cpp

+   auto limit = cg->allocateRegister();
+   auto xmmmask = cg->allocateRegister();
+   auto mask = cg->allocateRegister();
+   auto i = cg->allocateRegister();


Use more descriptive names: i -> indexReg, ba -> bufReg, etc.

BradleyWood · 2025-02-13T00:49:06Z

runtime/compiler/x/codegen/J9TreeEvaluator.cpp

+   auto xmmchunk = cg->allocateRegister();
+   auto chunk = cg->allocateRegister();
+   auto bytes_left = cg->allocateRegister();
+   auto ecx = cg->allocateRegister();


Why do you need ecx here? It's not used, and as far as I can tell, none of these instructions implicitly use it.

BradleyWood · 2025-02-13T00:49:26Z

runtime/compiler/x/codegen/J9TreeEvaluator.cpp

+   auto xmmchunk = cg->allocateRegister();
+   auto chunk = cg->allocateRegister();
+   auto bytes_left = cg->allocateRegister();
+   auto ecx = cg->allocateRegister();


Why do you need ecx here? It's not used, and as far as I can tell, none of these instructions implicitly use it.

BradleyWood · 2025-02-13T03:00:28Z

runtime/compiler/x/codegen/J9TreeEvaluator.cpp

+   generateRegRegInstruction(TR::InstOpCode::PTESTRegReg, node, xmmchunk, xmmmask, cg);
+
+   // If the result is nonzero (i.e. at least one of the sign bits is set), break and return index
+   generateLabelInstruction(TR::InstOpCode::JNE4, node, returnIndexLabel, cg);


"at least one of the sign bits is set"

Don't you need to keep track of the index of last positive element? This function should find the number of leading positive elements so the location of that negative matters. You cannot return (i - off) as you do in the residue processing because you don't know where in the vector the first negative is.

dylanjtuttle · 2025-02-14

The key here was pointed out by Henry early on when I was first trying to figure out how to accelerate countPositives:

/**
 * Count the number of leading positive bytes in the range.
 *
 * @implSpec the implementation must return len if there are no negative
 * bytes in the range. If there are negative bytes, the implementation must return
 * a value that is less than or equal to the index of the first negative byte
 * in the range.
 */

countPositives only needs to return a value less than or equal to the index of the first negative byte. There are a few places in String.java that call countPositives and then do the work of finding the exact index themselves. I suppose it was designed this way to save a little time in situations where the exact index doesn't matter (like when it is called by hasNegatives).
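To make the contract concrete, here is a minimal C++ sketch contrasting an exact scan with a coarser one that still satisfies the @implSpec. The function names, the block size of 16, and the hasNegatives wrapper are illustrative assumptions, not the JDK or OpenJ9 implementation.

```cpp
#include <cstdint>

// Exact reference: index of the first negative byte, or len if there is none.
inline int countPositivesExact(const std::int8_t *ba, int off, int len)
{
    for (int i = 0; i < len; ++i)
        if (ba[off + i] < 0)
            return i;
    return len;
}

// Coarse variant (hypothetical): scans whole blocks and, when a block contains
// a negative byte, returns the index where that block starts. That value is
// less than or equal to the index of the first negative byte.
inline int countPositivesCoarse(const std::int8_t *ba, int off, int len, int block = 16)
{
    int i = 0;
    for (; i + block <= len; i += block) {
        std::int8_t acc = 0;
        for (int j = 0; j < block; ++j)
            acc = static_cast<std::int8_t>(acc | ba[off + i + j]); // OR the sign bits together
        if (acc < 0)
            return i; // some byte in this block is negative
    }
    // Residue: fewer than `block` bytes remain, so scan them one by one.
    for (; i < len; ++i)
        if (ba[off + i] < 0)
            return i;
    return len;
}

// A caller that only needs a yes/no answer can compare the result with len.
inline bool hasNegativesSketch(const std::int8_t *ba, int off, int len)
{
    return countPositivesCoarse(ba, off, len) != len;
}
```

With a 40-byte array whose only negative byte is at index 21, the exact scan gives 21 and the coarse scan gives 16. Both are valid results under the contract, and a caller that needs the exact index can continue scanning from the returned value.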

BradleyWood · 2025-02-13T05:24:00Z

runtime/compiler/x/codegen/J9TreeEvaluator.cpp

-
+      case TR::java_lang_StringCoding_countPositives:
+         {
+            if (comp->target().cpu.supportsAVX())


fix indentation

BradleyWood · 2025-02-13T17:09:32Z

runtime/compiler/x/codegen/J9CodeGenerator.cpp

+      cg->setSupportsInlineStringCodingHasNegatives();
+      }
+   static bool disableInlineStringCodingCountPositives = feGetEnv("TR_DisableInlineStringCodingCountPositives") != NULL;
+   if (comp->target().cpu.supportsAVX() && !disableInlineStringCodingCountPositives &&


You don't use any AVX instructions; as far as I can tell, the highest requirement you have is SSE 4.1. However, this can be done with just SSE 2 if you use pcmpgtb.

BradleyWood · 2025-02-13T17:52:38Z

runtime/compiler/x/codegen/J9TreeEvaluator.cpp

+
+   // Prepare a 16 byte sign bit mask
+   static uint8_t dqMaskBytes[] = { 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80 };
+   auto dqMaskMR = generateX86MemoryReference(cg->findOrCreate16ByteConstant(node, dqMaskBytes), cg);


A couple of things with the main loop. You don't need to load this mask to track sign bits. You can compare each byte to 0 (is 0 > arr[i]) using the pcmpgtb instruction. pcmpgtb produces a bit mask which can be converted into an integer with pmovmskb. If the mask is zero, then all vector elements are positive and you continue the loop. Otherwise, you can find the index of the first negative element using the tzcnt instruction (use bsf if BMI1 is not available).

Here is some rough pseudo-code:

index = off
while (index < loop_limit)
    movdqu   data_xmm, [base_ptr + index]  ; Load 16 bytes from arr[i]
    pxor     tmp_xmm, tmp_xmm              ; create zero vector
    pcmpgtb  tmp_xmm, data_xmm             ; Compare: 0 > arr[i] -> 0xFF if negative, 0x00 otherwise
    pmovmskb neg_mask, tmp_xmm             ; Extract bitmask of negative values
    test     neg_mask, neg_mask            ; Check if any negative values exist
    jnz      .foundNegative                ; If mask is nonzero, handle first negative
    add      index, 16                     ; all positive
    ; loop
.foundNegative
    count = tzcount(neg_mask) + index - off
    jmp .done
.residue
    ...
.done
    ...

Secondly, you do not need to track bytes_left; it's like having two loop index counters. Before entering the loop, calculate the loop limit as (len + off) & ~15. You only need to align to the vector length. If (i < loop_limit), then execute the loop.
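The loop described above can be sketched with SSE2 intrinsics in C++. This is a rough illustration under stated assumptions, not the code generator sequence from the PR: the function name is hypothetical, a plain bit-scan loop stands in for tzcnt/bsf, and the loop limit is computed as off + (len & ~15), a deliberate deviation that keeps every 16-byte load inside the range even when off is not a multiple of 16. On targets without SSE2 the sketch falls back to the scalar residue loop.

```cpp
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_HAS_SSE2 1
#endif

// Hypothetical sketch: returns the number of leading non-negative bytes in
// arr[off .. off+len), using one index and one precomputed loop limit.
inline int countPositivesSketch(const std::int8_t *arr, int off, int len)
{
    int index = off;
    const int end = off + len;

#ifdef SKETCH_HAS_SSE2
    // Assumption: limit chosen so that index + 16 <= end for every vector load.
    const int loopLimit = off + (len & ~15);
    const __m128i zero = _mm_setzero_si128();              // pxor tmp, tmp

    while (index < loopLimit) {
        const __m128i data = _mm_loadu_si128(
            reinterpret_cast<const __m128i *>(arr + index)); // movdqu
        const __m128i neg = _mm_cmpgt_epi8(zero, data);      // pcmpgtb: 0 > arr[i]
        const int negMask = _mm_movemask_epi8(neg);          // pmovmskb

        if (negMask != 0) {
            // Stand-in for tzcnt/bsf: position of the lowest set bit.
            int bit = 0;
            while (((negMask >> bit) & 1) == 0)
                ++bit;
            return index + bit - off;
        }
        index += 16;                                         // all 16 bytes non-negative
    }
#endif

    // Residue: fewer than 16 bytes remain (or the whole range without SSE2).
    for (; index < end; ++index)
        if (arr[index] < 0)
            return index - off;
    return len;
}
```

Because the lowest set bit of the pmovmskb result corresponds to the lowest-addressed negative byte, this variant returns the exact index, which is stricter than the countPositives contract requires.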

BradleyWood · 2025-02-13T18:01:31Z

runtime/compiler/x/codegen/J9TreeEvaluator.cpp

+               }
+         }
+         break;
+#if JAVA_SPEC_VERSION < 19


Excluding this call for JDK < 19 is OK, but why not do the same for the method declaration?

0xdaryl · 2025-02-14T00:19:21Z

runtime/compiler/x/codegen/J9TreeEvaluator.cpp

+   {
+   // Arguments to countPositives
+   // Byte array
+   auto ba = cg->evaluate(node->getChild(0));


I will let you process and action all of Brad's comments before reviewing this thoroughly, but one thing I will mention right away is my strong opinion on the use of the auto keyword and its impact on future code understanding for anyone but the author. I may be known to let one or two slip in on occasion, but since you have many (12) in succession I'll ask that you specify the data type for each (they look to be either TR::Node *'s or TR::Register *'s in your case).

I'll echo Brad's comments about variable naming as well. It is easier to read the code later if the variables that contain nodes are suffixed with Node and registers are suffixed with Reg.

dylanjtuttle added 4 commits February 12, 2025 06:38

base implementation of countPositives and hasNegatives

164c730

minor tweaks to hopefully speed things up a bit

493c0d7

disable implementation of hN, enable inlining of both for jdk 19+

2956a35

disable inlining of countPositives unless its caller is hasNegatives

233cf24

dylanjtuttle force-pushed the countPositivesIntrinsic branch from 543b8ac to 233cf24 Compare February 12, 2025 14:49

hzongaro self-assigned this Feb 12, 2025

hzongaro added the comp:jit label Feb 12, 2025

BradleyWood suggested changes Feb 13, 2025


0xdaryl reviewed Feb 14, 2025


