x86: Implement fma intrinsic #21118

Open: wants to merge 1 commit into base: master
12 changes: 12 additions & 0 deletions runtime/compiler/x/codegen/J9CodeGenerator.cpp
@@ -441,6 +441,18 @@ J9::X86::CodeGenerator::suppressInliningOfRecognizedMethod(TR::RecognizedMethod
switch (method)
{
case TR::java_lang_Object_clone:
return true;
case TR::java_lang_Math_fma_F:
case TR::java_lang_Math_fma_D:
case TR::java_lang_StrictMath_fma_F:
case TR::java_lang_StrictMath_fma_D:
{
static bool disableInlineFMA = feGetEnv("TR_DisableInlineFMA") != NULL;

if (disableInlineFMA || !self()->comp()->target().cpu.supportsFeature(OMR_FEATURE_X86_FMA))
return false;
}
Comment on lines +445 to +454
Member:

This isn't a comment on the approach you've taken to handle code generation of these methods. It's more an observation about the diverse mechanisms we seem to have in OMR and OpenJ9 for indicating whether the implementation of a recognized method that exists in the Java class libraries should be inlined as is, or a preferred implementation should be inlined by code generation.

Sometimes we have special opcodes that represent the precise semantics of a method that is often implemented using single hardware instructions, as is the case for java.lang.Integer.numberOfLeadingZeros(), or there might be a very specific query indicating whether the method can be inlined by code generation, as is the case for java.lang.Class.isAssignableFrom().

The fused multiply-add methods feel like something that ought to be represented using operations in the IL, but I see that we've not been handling it that way with the existing code generation support for the java.lang.Math.fma and java.lang.StrictMath.fma methods. I think this whole area could stand some clean up to reduce the number of ways these choices are communicated/managed between code generation and inlining.

Contributor:

I think this whole area could stand some clean up to reduce the number of ways these choices are communicated/managed between code generation and inlining.

I agree, and I've been working on a PR for a little over a year now to simplify and make intrinsic handling more consistent and grok-able. Sadly, it's taking a while, but this problem is being worked on.

Member Author:

The fused multiply-add methods feel like something that ought to be represented using operations in the IL

The issue I have with this is that I don't think there would be any other place we could use an FMA opcode. You can't just translate a * b + c into FMA, because the fused form rounds once instead of twice and would therefore violate strictfp.
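
To make the rounding point concrete, here is a minimal C++ sketch (not part of the PR) that compares a separately rounded multiply and add against `std::fma`. The function names are hypothetical, and the `volatile` temporary is only there on the assumption that the C++ compiler might otherwise contract `a * b + c` into a fused instruction on its own.

```cpp
#include <cassert>
#include <cmath>

// Unfused: the product is rounded to double before the addition happens.
// The volatile temporary keeps the compiler from fusing the two operations.
double mulThenAdd(double a, double b, double c)
{
    volatile double product = a * b;
    return product + c;
}

// Fused: a * b + c is computed exactly and rounded only once.
double fusedMulAdd(double a, double b, double c)
{
    return std::fma(a, b, c);
}

// Inputs chosen so that the exact product 1 + 2^-26 + 2^-54 loses its
// lowest term when rounded to double, which the fused form preserves.
void demonstrateRoundingDifference()
{
    const double a = 1.0 + std::ldexp(1.0, -27);
    const double c = -(1.0 + std::ldexp(1.0, -26));

    assert(mulThenAdd(a, a, c) == 0.0);
    assert(fusedMulAdd(a, a, c) == std::ldexp(1.0, -54));
}
```

Because the two results differ for the same inputs, a JIT that silently rewrote `a * b + c` as a fused operation would change observable program behaviour, which is why the fused form is only emitted for explicit `fma` calls.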


return true;
default:
return false;
156 changes: 156 additions & 0 deletions runtime/compiler/x/codegen/J9TreeEvaluator.cpp
@@ -9065,6 +9065,150 @@ inlineNanoTime(
#endif
#endif // LINUX

TR::Register* J9::X86::TreeEvaluator::inlineMathFma(TR::Node* node, TR::CodeGenerator* cg)
{
TR::Node *firstChild = node->getFirstChild();
TR::Node *secondChild = node->getSecondChild();
TR::Node *thirdChild = node->getThirdChild();

TR::Register *lhsReg = NULL;
TR::Register *midReg = NULL;
TR::Register *rhsReg = NULL;
TR::Register *result = cg->allocateRegister(TR_FPR);

bool memLoadLhs = !firstChild->getRegister() && firstChild->getReferenceCount() == 1
&& firstChild->getOpCode().isLoadVar();

bool memLoadMiddle = !secondChild->getRegister() && secondChild->getReferenceCount() == 1
&& secondChild->getOpCode().isLoadVar();

bool memLoadRhs = !thirdChild->getRegister() && thirdChild->getReferenceCount() == 1
&& thirdChild->getOpCode().isLoadVar();

bool is64Bit = node->getDataType().isDouble();

TR::InstOpCode::Mnemonic fpMovRegRegOpcode = is64Bit ? TR::InstOpCode::MOVSDRegReg : TR::InstOpCode::MOVSSRegReg;
result->setIsSinglePrecision(!is64Bit);

TR_ASSERT_FATAL(cg->comp()->target().cpu.supportsFeature(OMR_FEATURE_X86_FMA), "Cannot generate inline fma implementation without FMA extensions");

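// Choose fma instruction carefully, based on operand form, to reduce number of copies.
// In the "fma = ..." comments below, a, b and c name the instruction operands in order
// (a is the result register, c may be a memory operand), not the Java arguments.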
if (memLoadLhs)
{
TR::InstOpCode::Mnemonic opcode = is64Bit ? TR::InstOpCode::VFMADD231SDRegRegMem : TR::InstOpCode::VFMADD231SSRegRegMem;
TR::MemoryReference *lhsMR = generateX86MemoryReference(firstChild, cg);

if (memLoadRhs)
{
// fma = b * c + a
TR::MemoryReference *rhsMR = generateX86MemoryReference(thirdChild, cg);
generateRegMemInstruction(TR::InstOpCode::MOVSRegMem(is64Bit), node, result, rhsMR, cg);

midReg = cg->evaluate(secondChild);
memLoadMiddle = false; // Both memory slots are taken (rhs preload, lhs operand), so the middle child must be evaluated
generateRegRegMemInstruction(opcode, node, result, midReg, lhsMR, cg);
}
else if (memLoadMiddle)
{
opcode = is64Bit ? TR::InstOpCode::VFMADD132SDRegRegMem : TR::InstOpCode::VFMADD132SSRegRegMem;
// fma = a * c + b

TR::MemoryReference *midMR = generateX86MemoryReference(secondChild, cg);
rhsReg = cg->evaluate(thirdChild);

generateRegMemInstruction(TR::InstOpCode::MOVSRegMem(is64Bit), node, result, lhsMR, cg);
generateRegRegMemInstruction(opcode, node, result, rhsReg, midMR, cg);
}
else
{
// fma = b * c + a
midReg = cg->evaluate(secondChild);
rhsReg = cg->evaluate(thirdChild);
generateRegRegInstruction(fpMovRegRegOpcode, node, result, rhsReg, cg);
generateRegRegMemInstruction(opcode, node, result, midReg, lhsMR, cg);
}
}
else if (memLoadMiddle)
{
TR::MemoryReference *midMR = generateX86MemoryReference(secondChild, cg);
lhsReg = cg->evaluate(firstChild);

if (memLoadRhs)
{
// fma = b * a + c
TR::InstOpCode::Mnemonic opcode = is64Bit ? TR::InstOpCode::VFMADD213SDRegRegMem : TR::InstOpCode::VFMADD213SSRegRegMem;
TR::MemoryReference *rhsMR = generateX86MemoryReference(thirdChild, cg);

generateRegMemInstruction(TR::InstOpCode::MOVSRegMem(is64Bit), node, result, midMR, cg);
generateRegRegMemInstruction(opcode, node, result, lhsReg, rhsMR, cg);
}
else
{
// fma = a * c + b
TR::InstOpCode::Mnemonic opcode = is64Bit ? TR::InstOpCode::VFMADD132SDRegRegMem : TR::InstOpCode::VFMADD132SSRegRegMem;
rhsReg = cg->evaluate(thirdChild);

generateRegRegInstruction(fpMovRegRegOpcode, node, result, lhsReg, cg);
generateRegRegMemInstruction(opcode, node, result, rhsReg, midMR, cg);
}
}
else if (memLoadRhs)
{
TR::InstOpCode::Mnemonic opcode = is64Bit ? TR::InstOpCode::VFMADD213SDRegRegMem : TR::InstOpCode::VFMADD213SSRegRegMem;
// fma = b * a + c

TR::MemoryReference *rhsMR = generateX86MemoryReference(thirdChild, cg);
lhsReg = cg->evaluate(firstChild);
midReg = cg->evaluate(secondChild);

generateRegRegInstruction(fpMovRegRegOpcode, node, result, lhsReg, cg);
generateRegRegMemInstruction(opcode, node, result, midReg, rhsMR, cg);
}
else
{
TR::InstOpCode::Mnemonic opcode = is64Bit ? TR::InstOpCode::VFMADD213SDRegRegReg : TR::InstOpCode::VFMADD213SSRegRegReg;
// fma = b * a + c

lhsReg = cg->evaluate(firstChild);
midReg = cg->evaluate(secondChild);
rhsReg = cg->evaluate(thirdChild);

generateRegRegInstruction(fpMovRegRegOpcode, node, result, lhsReg, cg);
generateRegRegRegInstruction(opcode, node, result, midReg, rhsReg, cg);
}

if (memLoadLhs)
{
cg->recursivelyDecReferenceCount(firstChild);
}
else
{
cg->decReferenceCount(firstChild);
}

if (memLoadMiddle)
{
cg->recursivelyDecReferenceCount(secondChild);
}
else
{
cg->decReferenceCount(secondChild);
}

if (memLoadRhs)
{
cg->recursivelyDecReferenceCount(thirdChild);
}
else
{
cg->decReferenceCount(thirdChild);
}

node->setRegister(result);

return result;
}
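
The branch structure above can be checked against the operand semantics of the three instruction forms. The following is a scalar C++ sketch, not OpenJ9 code: the `vfmadd*` helpers model the documented operand orderings with `std::fma`, and the per-branch function names are hypothetical. In each branch function, `a`, `b`, `c` are the Java arguments of `fma(a, b, c)` and `result` plays the role of the preloaded result register.

```cpp
#include <cmath>

// Scalar models of the three operand orderings (op1 is the destination).
double vfmadd132(double op1, double op2, double op3) { return std::fma(op1, op3, op2); }
double vfmadd213(double op1, double op2, double op3) { return std::fma(op2, op1, op3); }
double vfmadd231(double op1, double op2, double op3) { return std::fma(op2, op3, op1); }

// First and third child in memory: preload c, multiply b by a from memory.
double lhsAndRhsFromMemory(double a, double b, double c)
{
    double result = c;
    return vfmadd231(result, b, a);
}

// First and second child in memory: preload a, multiply by b from memory.
double lhsAndMiddleFromMemory(double a, double b, double c)
{
    double result = a;
    return vfmadd132(result, c, b);
}

// Only the first child in memory: copy c, multiply b by a from memory.
double lhsFromMemory(double a, double b, double c)
{
    double result = c;
    return vfmadd231(result, b, a);
}

// Second and third child in memory: preload b, add c from memory.
double middleAndRhsFromMemory(double a, double b, double c)
{
    double result = b;
    return vfmadd213(result, a, c);
}

// Only the second child in memory: copy a, multiply by b from memory.
double middleFromMemory(double a, double b, double c)
{
    double result = a;
    return vfmadd132(result, c, b);
}

// Third child in memory, or everything in registers: copy a, add c last.
double rhsFromMemoryOrAllRegisters(double a, double b, double c)
{
    double result = a;
    return vfmadd213(result, b, c);
}
```

Every branch function reduces to `a * b + c` with a single rounding, so the choice between the 132, 213 and 231 forms only affects which operand has to be copied into the result register first.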

// Convert serial String.hashCode computation into vectorization copy and implement with SSE instruction
//
// Conversion process example:
@@ -12133,6 +12277,18 @@ J9::X86::TreeEvaluator::directCallEvaluator(TR::Node *node, TR::CodeGenerator *c
return TR::TreeEvaluator::inlineStringLatin1Inflate(node, cg);
}
break;
case TR::java_lang_Math_fma_F:
case TR::java_lang_Math_fma_D:
case TR::java_lang_StrictMath_fma_F:
case TR::java_lang_StrictMath_fma_D:
{
static bool disableInlineFMA = feGetEnv("TR_DisableInlineFMA") != NULL;

if (!disableInlineFMA && cg->comp()->target().cpu.supportsFeature(OMR_FEATURE_X86_FMA))
return inlineMathFma(node, cg);

break;
}
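
The same condition (option not set, CPU supports FMA) is tested here and in suppressInliningOfRecognizedMethod, and the two sites need to agree. A minimal sketch of sharing that decision through one predicate follows. It is an illustration only: `CpuModel`, `shouldInlineFma` and the two wrapper functions are hypothetical names, and the environment value is passed in as a plain pointer as a stand-in for what `feGetEnv` returns.

```cpp
// Hypothetical stand-in for the CPU feature query used in the PR.
struct CpuModel
{
    bool hasFMA;
};

// The option counts as set whenever the variable exists, even if it is empty.
bool isOptionSet(const char *value)
{
    return value != nullptr;
}

// Single source of truth for "code generation will emit an inline fma".
bool shouldInlineFma(const CpuModel &cpu, const char *disableValue)
{
    return !isOptionSet(disableValue) && cpu.hasFMA;
}

// Inliner side: suppress inlining the Java body only if codegen takes over.
bool suppressJavaInlining(const CpuModel &cpu, const char *disableValue)
{
    return shouldInlineFma(cpu, disableValue);
}

// Evaluator side: emit the instruction under exactly the same condition.
bool evaluatorInlinesFma(const CpuModel &cpu, const char *disableValue)
{
    return shouldInlineFma(cpu, disableValue);
}
```

In a real process the second argument would come from something like `std::getenv("TR_DisableInlineFMA")`. Keeping one predicate avoids the case where the inliner leaves the call alone but the evaluator then declines to handle it.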
case TR::jdk_internal_util_ArraysSupport_vectorizedHashCode:
{
if (cg->getSupportsInlineVectorizedHashCode())
1 change: 1 addition & 0 deletions runtime/compiler/x/codegen/J9TreeEvaluator.hpp
@@ -146,6 +146,7 @@ class OMR_EXTENSIBLE TreeEvaluator: public J9::TreeEvaluator
static TR::Register *awrtbarEvaluator(TR::Node *node, TR::CodeGenerator *cg);
static TR::Register *awrtbariEvaluator(TR::Node *node, TR::CodeGenerator *cg);
static TR::Register *inlineStringLatin1Inflate(TR::Node *node, TR::CodeGenerator *cg);
static TR::Register *inlineMathFma(TR::Node* node, TR::CodeGenerator* cg);
static TR::Register *inlineVectorizedHashCode(TR::Node* node, TR::CodeGenerator* cg);
static TR::Register *vectorizedHashCodeReductionHelper(TR::Node* node,
TR::Register **vectorRegisters,
1 change: 1 addition & 0 deletions test/functional/Java9andUp/playlist.xml
@@ -318,6 +318,7 @@
<variations>
<variation>-Xint</variation>
<variation>-Xjit:count=1,disableAsyncCompilation</variation>
<variation>-Xjit:count=1,disableGRA,disableLinkageRegisterAllocation,disableLocalCSE,disableAsyncCompilation</variation>
</variations>
<command>$(JAVA_COMMAND) $(JVM_OPTIONS) \
-cp $(Q)$(RESOURCES_DIR)$(P)$(TESTNG)$(P)$(TEST_RESROOT)$(D)GeneralTest.jar$(P)$(LIB_DIR)$(D)asm-all.jar$(Q) \