cmdLineTester_jvmtitests_hcr_OSRG_nongold_0 rc018 ppc64le segfault in internal tests #19813

cjjdespres · 2024-07-04T18:44:12Z

Failure link

A sample, as it's been failing very frequently in the JITServer nightly tests:

https://hyc-runtimes-jenkins.swg-devops.com/job/Test_openjdk17_j9_extended.functional_ppc64le_linux_jit_Personal/991/
https://hyc-runtimes-jenkins.swg-devops.com/job/Test_openjdk11_j9_extended.functional_ppc64le_linux_jit_Personal/1078/

The failure in cmdLineTester_jvmtitests_hcr_OSRG_nongold_SE80_0 on ppc64le JDK8 in this test from July 4 also fails in rc018 in the same way.

Optional info

I initially noticed this in the JITServer nightly tests, but a 50x grinder without JITAAS had a 43/50 failure rate, so this is not JITServer-specific. I have only seen this failure on ppc64le so far.

Failure output (captured from console output)

        Testing: rc018
        Test start time: 2024/07/04 01:34:28 Pacific Standard Time
        Running command: "/home/jenkins/workspace/Test_openjdk17_j9_extended.functional_ppc64le_linux_jit_Personal_testList_0/jdkbinary/j2sdk-image/bin/java" -XX:+UseJITServer -Xgcpolicy:optthruput -Xdebug -Xrunjdwp:transport=dt_socket,address=8888,server=y,onthrow=no.pkg.foo,launch=echo -Xjit:enableOSR,enableOSROnGuardFailure,count=1,disableAsyncCompilation  -Xdump    -agentlib:jvmtitest=test:rc018 -cp "/home/jenkins/workspace/Test_openjdk17_j9_extended.functional_ppc64le_linux_jit_Personal_testList_0/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/jvmtitests/jvmtitest.jar:/home/jenkins/workspace/Test_openjdk17_j9_extended.functional_ppc64le_linux_jit_Personal_testList_0/../../testDependency/lib/asm-all.jar" com.ibm.jvmti.tests.util.TestRunner
        Time spent starting: 8 milliseconds
        Time spent executing: 8339 milliseconds
        Test result: FAILED
        Output from test:
         [OUT] *** Testing [1/1]:	testReflectRedefineAtSameTime
         [OUT] starting reflect worker threads
         [OUT] starting to populate java heap
         [ERR] Unhandled exception
         [ERR] Type=Segmentation error vmState=0x00040000
         [ERR] J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000001
         [ERR] Handler1=00007FFF9FD518F0 Handler2=00007FFFA41C97C0
         [ERR] R0=00007FFF9F37E780 R1=00007FFFA473C1B0 R2=00007FFF9F76EB00 R3=0000000000000000
         [ERR] R4=00000000001CFE20 R5=0000000000000000 R6=0000000000000000 R7=00007FFF85280074
         [ERR] R8=00007FFF83269B86 R9=00007FFFA00D87A0 R10=00007FFF83269B86 R11=0000000000000000
         [ERR] R12=00007FFF8318F2A8 R13=00007FFFA47468E0 R14=000000000017B700 R15=0000000000095800
         [ERR] R16=00007FFF816E0038 R17=0000000000000054 R18=0000000000000000 R19=0000000089F7B8E8
         [ERR] R20=0000000000827FE1 R21=000000000000009C R22=0000000088D0E0B0 R23=0000000048000001
         [ERR] R24=48000DC548000001 R25=00007FFFA00D8210 R26=00007FFF8318F2A8 R27=0000000000000001
         [ERR] R28=00007FFF85280074 R29=0000000000000000 R30=00007FFFA00D87A0 R31=0000000000000000
         [ERR] NIP=00007FFF9F37E824 MSR=800000000280F033 ORIG_GPR3=D00000000282056C CTR=00007FFF9EDBAD80
         [ERR] LINK=00007FFF9F37E820 XER=0000000000000000 CCR=0000000028242422 SOFTE=0000000000000001
         [ERR] TRAP=0000000000000300 DAR=0000000000000020 dsisr=0000000042000000 RESULT=0000000000000000
         [ERR] FPR0=000000000017ba00 (f: 1554944.000000, d: 7.682444e-318)
         [ERR] FPR1=4056bb7820000000 (f: 536870912.000000, d: 9.092921e+01)
         [ERR] FPR2=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR3=3fee666660000000 (f: 1610612736.000000, d: 9.500000e-01)
         [ERR] FPR4=3fce840b4ac4e4d2 (f: 1254417664.000000, d: 2.384047e-01)
         [ERR] FPR5=bfe7154748bef6c8 (f: 1220474624.000000, d: -7.213475e-01)
         [ERR] FPR6=3fe62e42fefa39ef (f: 4277811712.000000, d: 6.931472e-01)
         [ERR] FPR7=bfb1973c5a611ccc (f: 1516313856.000000, d: -6.871392e-02)
         [ERR] FPR8=6552747365745f38 (f: 1702125312.000000, d: 1.196544e+180)
         [ERR] FPR9=6552646e41666564 (f: 1097229696.000000, d: 1.192487e+180)
         [ERR] FPR10=324f5f7463656c66 (f: 1667591296.000000, d: 2.327369e-66)
         [ERR] FPR11=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR12=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR13=4020800000000000 (f: 0.000000, d: 8.250000e+00)
         [ERR] FPR14=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR15=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR16=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR17=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR18=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR19=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR20=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR21=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR22=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR23=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR24=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR25=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR26=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR27=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR28=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR29=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR30=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR31=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] Module=/home/jenkins/workspace/Test_openjdk17_j9_extended.functional_ppc64le_linux_jit_Personal_testList_0/jdkbinary/j2sdk-image/lib/default/libj9jit29.so
         [ERR] Module_base_address=00007FFF9E7E0000
         [ERR] Target=2_90_20240704_1047 (Linux 4.18.0-553.5.1.el8_10.ppc64le)
         [ERR] CPU=ppc64le (4 logical CPUs) (0x1dcc50000 RAM)
         [ERR] ----------- Stack Backtrace -----------
         [ERR] _ZN3OMR9CodeCache17replaceTrampolineEP20TR_OpaqueMethodBlockPvS3_S3_b+0xe4 (0x00007FFF9F37E824 [libj9jit29.so+0xb9e824])
         [ERR] _ZN3OMR16CodeCacheManager17replaceTrampolineEP20TR_OpaqueMethodBlockPvS3_S3_S3_b+0x94 (0x00007FFF9F381044 [libj9jit29.so+0xba1044])
         [ERR] mcc_replaceTrampoline+0x58 (0x00007FFF9ED65B98 [libj9jit29.so+0x585b98])
         [ERR] _Z15ppcCodePatchingPvS_S_S_S_S_+0x310 (0x00007FFF9EDBB090 [libj9jit29.so+0x5db090])
         [ERR] _ZN3OMR9CodeCache14patchCallPointEP20TR_OpaqueMethodBlockPvS3_S3_+0x100 (0x00007FFF9F37DB50 [libj9jit29.so+0xb9db50])
         [ERR] mcc_callPointPatching_unwrapper+0x50 (0x00007FFF9ED659F0 [libj9jit29.so+0x5859f0])
         [ERR] old_slow_jitCallCFunction+0x54 (0x00007FFF9F435644 [libj9jit29.so+0xc55644])
         [ERR]  (0x00007FFF9F46CF90 [libj9jit29.so+0xc8cf90])
         [ERR] sidecarInvokeReflectMethodImpl+0x508 (0x00007FFF9FD2BD98 [libj9vm29.so+0x1bd98])
         [ERR] sidecarInvokeReflectMethod+0x3c (0x00007FFF9FD2CCEC [libj9vm29.so+0x1ccec])
         [ERR] JVM_InvokeMethod_Impl+0xa8 (0x00007FFF9E17B7B8 [libjclse29.so+0x8b7b8])
         [ERR] JVM_InvokeMethod+0x3c (0x00007FFFA46664BC [libjvm.so+0x264bc])
         [ERR] JVM_InvokeMethod+0x30 (0x00007FFFA47473B0 [libjvm.so+0x73b0])
         [ERR] Java_jdk_internal_reflect_NativeMethodAccessorImpl_invoke0+0x24 (0x00007FFF9E02DC84 [libjava.so+0xdc84])
         [ERR]  (0x00007FFF9FF6C9B8 [libj9vm29.so+0x25c9b8])
         [ERR] ffi_call_int+0xd4 (0x00007FFF9FF6C684 [libj9vm29.so+0x25c684])
         [ERR] debugBytecodeLoopCompressed+0x16f2c (0x00007FFF9FE367EC [libj9vm29.so+0x1267ec])
         [ERR]  (0x00007FFF9FED6118 [libj9vm29.so+0x1c6118])
         [ERR] runCallInMethod+0x258 (0x00007FFF9FD2B528 [libj9vm29.so+0x1b528])
         [ERR] gpProtectedRunCallInMethod+0x54 (0x00007FFF9FD55C04 [libj9vm29.so+0x45c04])
         [ERR] signalProtectAndRunGlue+0x28 (0x00007FFF9FEEA248 [libj9vm29.so+0x1da248])
         [ERR] omrsig_protect+0x3e4 (0x00007FFFA41CAC74 [libj9prt29.so+0x3ac74])
         [ERR] gpProtectAndRun+0xa8 (0x00007FFF9FEEA318 [libj9vm29.so+0x1da318])
         [ERR] gpCheckCallin+0xc4 (0x00007FFF9FD582C4 [libj9vm29.so+0x482c4])
         [ERR] callStaticVoidMethod+0x48 (0x00007FFF9FD55218 [libj9vm29.so+0x45218])
         [ERR] JavaMain+0x1210 (0x00007FFFA4A27E60 [libjli.so+0x7e60])
         [ERR] ThreadJavaMain+0x18 (0x00007FFFA4A2E0F8 [libjli.so+0xe0f8])
         [ERR] start_thread+0xf8 (0x00007FFFA49D9718 [libpthread.so.0+0x9718])
         [ERR] clone+0x74 (0x00007FFFA48BB798 [libc.so.6+0x13b798])

The text was updated successfully, but these errors were encountered:

cjjdespres · 2024-07-04T18:57:51Z

@pshipton @tajila The earliest I've been able to find these failures is the nightly test run of June 27, which tested openj9 commits from June 26 and earlier, so I think this may be related to #19710 and maybe #19778. I'm not sure if the test options need to be adjusted as in #19792 or if it's something else.

cjjdespres · 2024-07-04T19:00:53Z

The version info for the early failure in #19813 (comment) is:

# DETECTED_JAVA_VERSION=openjdk version "11.0.24-internal" 2024-07-16
# OpenJDK Runtime Environment (build 11.0.24-internal+0-adhoc.jenkins.BuildJDK11ppc64lelinuxjitPersonal)
# Eclipse OpenJ9 VM (build master-979ce920c7c, JRE 11 Linux ppc64le-64-Bit Compressed References 20240627_1784 (JIT enabled, AOT enabled)
# OpenJ9   - 979ce920c7c
# OMR      - 47a9d248db0
# JCL      - 35dab526d05 based on jdk-11.0.24+6)

tajila · 2024-07-04T19:01:50Z

@pshipton @tajila The earliest I've been able to find these failures is the nightly test run of June 27, which tested openj9 commits from June 26 and earlier, so I think this may be related to #19710 and maybe #19778. I'm not sure if the test options need to be adjusted as in #19792 or if it's something else.

Ill take a look

tajila · 2024-07-04T19:06:44Z

Grinder with option added, https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder/41855/ - passed

If it passes that will confirm its related to the HCR changes.

However, the crash is unexpected regardless of options specified so that will need investigation.

pshipton · 2024-07-05T11:58:49Z

Also seen in non-jitserver builds as a timeout.
https://hyc-runtimes-jenkins.swg-devops.com/job/Test_openjdk11_j9_extended.functional_ppc64le_linux_testList_1/709/
https://hyc-runtimes-jenkins.swg-devops.com/job/Test_openjdk17_j9_extended.functional_ppc64le_linux_testList_0/626/

Related to eclipse-openj9#19710 Issue eclipse-openj9#19765 Issue eclipse-openj9#19813 Signed-off-by: Peter Shipton <[email protected]>

pshipton · 2024-07-05T14:52:58Z

Created #19819 to add -XX:+EnableExtendedHCR
and #19820 for 0.46

tajila · 2024-07-05T15:50:41Z

@cjjdespres I still think the crash needs to be investigated. The extended HCR options may have just exposed an issue. Disabling extended HCR prevents redefinition from changing the class shape, any attempt to do so should result in a JVMTI error. The fact that it is crashing means an assumption was broken.

I looked at the stacktrace and I couldn't find anything suspicious, but thats probably because Im not very familiar with that code. Do you know why it is crashing? I think we still need to answer this question.

cjjdespres · 2024-07-05T16:53:26Z

I agree that this should be investigated. I think this option influences jvmtiRedefineClasses() and related functions, right? I'm also not at all familiar with that code, or with power.

tajila · 2024-07-05T17:04:43Z

@hzongaro @zl-wang Do you know who can investigate this?

zl-wang · 2024-07-08T12:38:39Z

@tajila could you briefly describe the difference between previous HCR and extended HCR? in particular, method recompilation and its implications. from the stack back-trace, it crashed trying to replace a trampoline (could be many different reasons). presumably, this is across platform, but it happened to be on ppc64le due to certain recompilations by-chance.

zl-wang · 2024-07-08T12:39:41Z

@IBMJimmyk i will ask jimmy to take a further look.

tajila · 2024-07-08T13:35:30Z

@tajila could you briefly describe the difference between previous HCR and extended HCR? in particular, method recompilation and its implications. from the stack back-trace, it crashed trying to replace a trampoline (could be many different reasons). presumably, this is across platform, but it happened to be on ppc64le due to certain recompilations by-chance.

The basic difference is that exended HCR allows one to change the class shape (ie. add/remove fields and methods). This means we dont redefine the J9Class in place, as we do with fast HCR. We create a new J9Class, then we need to do a heap walk to replace all object headers that are affected, update class heirarchies, array component type, etc. This also means that things like the vtable/jitvtable will have a new address, j9methods will also have a new address, etc.

tajila · 2024-07-08T13:36:00Z

WRT "method recompilation and its implications." I'll let @gacholio add anything I may have missed.

zl-wang · 2024-07-08T13:44:11Z

i.e. as part of the HCR hooked actions, every such-typed existing object is re-constructed?

tajila · 2024-07-08T14:00:37Z

i.e. as part of the HCR hooked actions, every such-typed existing object is re-constructed?

Yes, the J9Class in the header of affected object intances are updated to the new J9Class in the extended HCR case.

zl-wang · 2024-07-08T14:11:21Z

Yes, the J9Class in the header of affected object intances are updated to the new J9Class in the extended HCR case.

I meant more than that. since fields can be added or removed, every such object needs to be reconstructed literally. size can be different. it sounds like a newObject() being required. without new size, is the old-object allowed to continue running?

tajila · 2024-07-08T14:20:40Z

I meant more than that. since fields can be added or removed, every such object needs to be reconstructed literally. size can be different. it sounds like a newObject() being required. without new size, is the old-object allowed to continue running?

Sorry, you can add/remove static fields, but not instance fields. So only the instance header needs to be updated.

IBMJimmyk · 2024-07-23T19:57:22Z

I've been trying to take a look at this problem.

rc018 takes the class rc018_testRedefAndReflect_O1 and redefines it with rc018_testRedefAndReflect_O2. The two classes are the same except with different names. To be more specific, other than the class name the only difference in the java files are whitespace changes. The different names doesn't matter since the rc018 test will edit the class bytes of rc018_testRedefAndReflect_O2 to change the name to rc018_testRedefAndReflect_O1 before the redefinition. So the redefinition should be valid with and without extended HCR. But, it seems like something specific to the codepath to handle non-extended HCR is causing a problem.

Attempts to recreate the problem on a local system have been unsuccessful but it fails very regularly (at least 80+%) via Grinders. So I've been looking at the artifacts from one of my grinders (-XX:-EnableExtendedHCR is set):
https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder/42275/

In this, the problem shows up as a hang.

        Testing: rc018
        Test start time: 2024/07/22 12:50:59 Pacific Standard Time
        Running command: "/home/jenkins/workspace/Grinder/jdkbinary/j2sdk-image/bin/java"  -Xgcpolicy:optthruput -Xdebug -Xrunjdwp:transport=dt_socket,address=8888,server=y,onthrow=no.pkg.foo,launch=echo -Xjit:enableOSR,enableOSROnGuardFailure,count=1,disableAsyncCompilation -XX:-EnableExtendedHCR -Xdump    -agentlib:jvmtitest=test:rc018 -cp "/home/jenkins/workspace/Grinder/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/jvmtitests/jvmtitest.jar:/home/jenkins/workspace/Grinder/../../testDependency/lib/asm-all.jar" com.ibm.jvmti.tests.util.TestRunner
        Time spent starting: 4 milliseconds
        ***[TEST INFO 2024/07/22 12:54:59] ProcessKiller detected a timeout after 240000 milliseconds!***

So I tried looking at what threads were currently running and this one stuck out:

3XMTHREADINFO      "Populating java heap" J9VMThread:0x0000000000241500, omrthread_t:0x000076D7AC2669B0, java/lang/Thread:0x0000000080D9E6F8, state:R, prio=5
3XMJAVALTHREAD            (java/lang/Thread getId:0x19, isDaemon:false)
3XMJAVALTHRCCL            jdk/internal/loader/ClassLoaders$AppClassLoader(0x0000000080D1C860)
3XMTHREADINFO1            (native thread ID:0x19C56C, native priority:0x5, native policy:UNKNOWN, vmstate:CW, vm thread flags:0x00000081)
3XMTHREADINFO2            (native stack address range from:0x000076D789790000, to:0x000076D789810000, size:0x80000)
3XMCPUTIME               CPU usage total: 0.625400978 secs, current category="Application"
3XMHEAPALLOC             Heap bytes allocated since last GC cycle=41922488 (0x27FAFB8)
3XMTHREADINFO3           Java callstack:
4XESTACKTRACE                at com/ibm/jvmti/tests/redefineClasses/rc018$populate.run(rc018.java:59)
3XMTHREADINFO3           Native callstack:
4XENATIVESTACK                (0x000076D7B1E97564 [libj9prt29.so+0x37564])
4XENATIVESTACK                (0x000076D7B1E9E948 [libj9prt29.so+0x3e948])
4XENATIVESTACK                (0x000076D7B1E97B94 [libj9prt29.so+0x37b94])
4XENATIVESTACK                (0x000076D7B1E97184 [libj9prt29.so+0x37184])
4XENATIVESTACK                (0x000076D7B1E9E948 [libj9prt29.so+0x3e948])
4XENATIVESTACK                (0x000076D7B1E97298 [libj9prt29.so+0x37298])
4XENATIVESTACK                (0x000076D7B1E99098 [libj9prt29.so+0x39098])
4XENATIVESTACK               __kernel_sigtramp_rt64+0x0 (0x000076D7B2AC04C8 [linux-vdso64.so.1+0x4c8])
4XENATIVESTACK               pthread_cond_wait+0x19c (0x000076D7B2A22E0C [libpthread.so.0+0x12e0c])
4XENATIVESTACK               omrthread_monitor_wait+0x38c (0x000076D7B1E2B04C [libj9thr29.so+0xb04c])
4XENATIVESTACK                (0x000076D7B1FD7538 [libj9vm29.so+0xa7538])
4XENATIVESTACK                (0x000076D7B1FD7910 [libj9vm29.so+0xa7910])
4XENATIVESTACK                (0x000076D7B1F44344 [libj9vm29.so+0x14344])
4XENATIVESTACK                (0x000076D7B203AAA8 [libj9vm29.so+0x10aaa8])
4XENATIVESTACK                (0x000076D7B21001E0 [libj9vm29.so+0x1d01e0])
4XENATIVESTACK                (0x000076D7B1F4AEE0 [libj9vm29.so+0x1aee0])
4XENATIVESTACK                (0x000076D7B1FE8BF0 [libj9vm29.so+0xb8bf0])
4XENATIVESTACK                (0x000076D7B1E9E948 [libj9prt29.so+0x3e948])
4XENATIVESTACK                (0x000076D7B1FE3EE4 [libj9vm29.so+0xb3ee4])
4XENATIVESTACK                (0x000076D7B1E313E8 [libj9thr29.so+0x113e8])
4XENATIVESTACK                (0x000076D7B2A18838 [libpthread.so.0+0x8838])
4XENATIVESTACK               clone+0x74 (0x000076D7B28ABA44 [libc.so.6+0x14ba44])

What this thread does is create objects for 20 seconds then stops:

        class populate extends Thread {
                long patience = 20000000000L; /* 20 seconds in nanoseconds */

                populate() {
                        super("Populating java heap");
                }

                public void run() {
                        long startTime = System.nanoTime();
                        while (((System.nanoTime() - startTime) < patience)) { //<<-------rc018.java:59 is here
                                Object array[] = new Object[1000];
                        }

                }
        }

So whatever is causing the hang causes this thread to stop as well even though it isn't directly involved with the rc018_testRedefAndReflect_O1 class that is getting redefined. Even after 20 seconds have passed, this thread is still running.

IBMJimmyk · 2024-07-24T20:54:44Z

I ended up jumping to this this test failure since it was a segfault instead of a crash:
https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder/42395/

https://github.com/eclipse/omr/blob/d18121d17c5750b62d057c61a84dd72dd9584e90/compiler/runtime/OMRCodeCache.cpp#L629-L638
The call to findResolvedMethod returns null. oldTrampoline is also null.
The crash is when it uses entry at entry->_info._resolved._currentTrampoline = trampoline;.

These symptoms look very similar to: #12120 (comment)
In that old issue, cpIndex is an 18-bit field. Some places sign extended it to 32bit and some places zero extended it. This caused mismatch where the cpIndex should have matched but didn't. The fix for that problem was checked in a long time ago but maybe I missed a spot or someone else introduced a new variation on the same problem.

Next step is to try adding some debug prints to the locations that are expected to update _resolvedMethodHT and see if that sheds any light on the issue. Based on my old notes, I've identified a bunch of places to added debug prints that can potentially help out.

This bug seems like it connects a bad cpIndex, non-extended HCR and class redefintion. So if anyone knows a place in the code that might be related to all three, it is likely a good place to look as well.

IBMJimmyk · 2024-07-25T21:15:43Z

I'm still trying to figure out what is going on. One thing I discovered is when I run the test locally, the path through replaceTrampoline doesn't seem to be taken which is important because that's where the failure happens.

On failing grinder run with my new trace messages I see this:

#INFO:  zzz replaceTrampoline - method: 0000000000120220, entry: 0000000000000000

It interacts with this lookup:

entry = _resolvedMethodHT->findResolvedMethod(method);

This means at the start of replaceTrampoline, there is a lookup for method 0x120220 and it returns null for entry which leads to the crash.

In my local run, that line is never printed out so there are no calls to replaceTrampoline. Since it never goes there, it seems to avoid the failure condition and the test passes.

Another thing is, on the failing grinder runs, going to replaceTrampoline seems to guarantee failure. I checked 5 runs, in all 5 runs the call out to replaceTrampoline happens once each and it fails in that one try.

So my next step is to figure out why replaceTrampoline is being called. This may lead to discovering what the problem is or at least may make local reproduction possible if I can recreate the conditions that cause replaceTrampoline to be called.

zl-wang · 2024-07-25T21:28:03Z

method 0120220 is one of the methods in the replaced or replacing classes? running into replaceTrampoline is also by chance (it needs to be recompiled, called previous version to trigger). if it belongs to the replaced class, that might be the source reason (it should disappear altogether already).

IBMJimmyk · 2024-07-26T14:03:35Z

(kca) j9m 0x120220
Method   {ClassPath/Name.MethodName}: {java/util/concurrent/ConcurrentHashMap.casTabAt}
                           Signature: ([Ljava/util/concurrent/ConcurrentHashMap$Node;ILjava/util/concurrent/ConcurrentHashMap$Node;Ljava/util/concurrent/ConcurrentHashMap$Node;)Z
                              Access: Static Final
                    J9Class/J9Method: 0x0000000000121000 / 0x0000000000120220
               Compiled Method Start: 0x00007fff720300f0 (127 instructions)
                      ByteCode Start: 0x00007fff8c2378d4 (21 bytes)
                   ROM Constant Pool: 0x00007fff8c236410 (307 entries)
                       Constant Pool: 0x000000000011ee30 (303 entries)

It doesn't seem to be related to the class being replaced at all.

The class being redefined, rc018_testRedefAndReflect_O1, does not use ConcurrentHashMap. It's a simple class that doesn't have any methods. It just initializes a bunch of fields.

The test rc018 doesn't even create objects of type rc018_testRedefAndReflect_O1.

        class ReflectThread extends Thread {
                long patience = 20000000000L; /* 20 seconds in nanoseconds */

                ReflectThread() {
                        super("Reflecting Thread");
                }

                public void run() {
                        long startTime = System.nanoTime();
                        try {
                                while (((System.nanoTime() - startTime) < patience)) {
                                        rc018_testRedefAndReflect_O1.class.getDeclaredFields();
                                }
                        } catch (SecurityException e) {
                                System.out.println("getDeclaredFields hit an security exception");
                        }
                }

        }

The test tries to read the class's fields over and over again and doesn't even check the result.

IBMJimmyk · 2024-07-26T14:15:40Z

From the vlog:

! (warm) java/util/concurrent/ConcurrentHashMap.casTabAt([Ljava/util/concurrent/ConcurrentHashMap$Node;ILjava/util/concurrent/ConcurrentHashMap$Node;Ljava/util/concurrent/ConcurrentHashMap$Node;)Z Q_SZ=1 Q_SZI=1 QW=18 j9m=0000000000120220 time=5190us compilationRecoverableTrampolineFailure memLimit=262144 KB compThreadID=0
+ (warm) java/util/concurrent/ConcurrentHashMap.casTabAt([Ljava/util/concurrent/ConcurrentHashMap$Node;ILjava/util/concurrent/ConcurrentHashMap$Node;Ljava/util/concurrent/ConcurrentHashMap$Node;)Z @ 00007FFF720300F0-00007FFF72030240 OrdinaryMethod - Q_SZ=1 Q_SZI=1 QW=18 j9m=0000000000120220 bcsz=21 sync FSD OSR remote compThreadID=0 CpuLoad=321%(80%avg) JvmCpu=300%
#INFO:  zzz replaceTrampoline - method: 0000000000120220, entry: 0000000000000000

It looks like the ConcurrentHashMap.casTabAt compilation first fails due to compilationRecoverableTrampolineFailure and then is successful. It then calls replaceTrampoline to fix the trampolines and it doesn't work. casTabAt can not be found in the _resolvedMethodHT.

zl-wang · 2024-07-26T14:16:48Z

find out how many versions this method has been compiled;
from core file, find out the call-site where replaceTrampoline is triggered and crashed;
which version is it calling right now and whether that is going through trampoline;

these may lead us to question why it is not in the calling codeCache's hashTable

zl-wang · 2024-07-26T14:23:31Z

HashTable entry (unresolved or resolved) should be established by caller(s) when the caller(s) are compiled, rather than by itself. Trampoline is created at compile-time or runtime, depending on the callee has been compiled at that time or not (of course, whether it is reachable by bl instruction).

zl-wang · 2024-07-26T14:28:52Z

not exactly sure what compilationRecoverableTrampolineFailure means. find that out, and see if we can figure something out.

zl-wang · 2024-07-26T14:30:15Z

HashTable entry could be created at runtime as well (from unresolved becoming resolved) ... just FYI.

IBMJimmyk · 2024-07-28T15:02:40Z

I made a build with more debug info a got a new corefile. The failure is still similar.

#INFO:  zzz replaceTrampoline - _resolvedMethodHT: 00007FFFB80D8770, method: 0000000000120840, entry: 0000000000000000

It's still a null entry after lookup into _resolvedMethodHT while in replaceTrampoline.
Method 0x120840 is ConcurrentHashMap.helpTransfer. It has two failed attempts at compilation and one successful one.

! (warm) java/util/concurrent/ConcurrentHashMap.helpTransfer([Ljava/util/concurrent/ConcurrentHashMap$Node;Ljava/util/concurrent/ConcurrentHashMap$Node;)[Ljava/util/concurrent/ConcurrentHashMap$Node; Q_SZ=0 Q_SZI=0 QW=12 j9m=0000000000120840 time=8336us compilationInterrupted memLimit=262144 KB compThreadID=0
! (warm) java/util/concurrent/ConcurrentHashMap.helpTransfer([Ljava/util/concurrent/ConcurrentHashMap$Node;Ljava/util/concurrent/ConcurrentHashMap$Node;)[Ljava/util/concurrent/ConcurrentHashMap$Node; Q_SZ=0 Q_SZI=0 QW=12 j9m=0000000000120840 time=11933us compilationRecoverableTrampolineFailure memLimit=262144 KB compThreadID=0
+ (warm) java/util/concurrent/ConcurrentHashMap.helpTransfer([Ljava/util/concurrent/ConcurrentHashMap$Node;Ljava/util/concurrent/ConcurrentHashMap$Node;)[Ljava/util/concurrent/ConcurrentHashMap$Node; @ 00007FFF9DAB00F4-00007FFF9DAB040C OrdinaryMethod - Q_SZ=0 Q_SZI=0 QW=12 j9m=0000000000120840 bcsz=124 sync FSD OSR remote compThreadID=0 CpuLoad=350%(87%avg) JvmCpu=316%

method, callSite and newStartPC are passed into patchCallPoint:
https://github.com/eclipse/omr/blob/1a724b50baacb7ad7173c75fe9385e3439f5d4e7/compiler/runtime/OMRCodeCache.cpp#L731-L734
Their values are:

(kca) (0x2806e0)/llx
%8 = 0x00000000002806e0: 0000000000120840 //method
(kca) (0x2806e0+8)/llx
%9 = 0x00000000002806e8: 00007fff9b8b684c //callSite
(kca) (0x2806e0+0x10)/llx
%1 = 0x00000000002806f0: 00007fff9dab00f4 //newStartPC

method matches the 0x120840 from the vlog output.

callSite is a location inside ConcurrentHashMap.putVal. It's a call to ConcurrentHashMap.helpTransfer.

0x7fff9b8b6824 {java/.../ConcurrentHashMap.putVal} +861               00009e60 ori       r30, r4, 0 <<< +474
0x7fff9b8b6828 {java/.../ConcurrentHashMap.putVal} +862               0000c462 ori       r4, r22, 0
0x7fff9b8b682c {java/.../ConcurrentHashMap.putVal} +863               00007661 ori       r22, r11, 0
0x7fff9b8b6830 {java/.../ConcurrentHashMap.putVal} +864               0000b960 ori       r25, r5, 0
0x7fff9b8b6834 {java/.../ConcurrentHashMap.putVal} +865               c001cef8 std       r6, 0x1c0(r14)
0x7fff9b8b6838 {java/.../ConcurrentHashMap.putVal} +866               a801eef8 std       r7, 0x1a8(r14)
0x7fff9b8b683c {java/.../ConcurrentHashMap.putVal} +867               b0010ef9 std       r8, 0x1b0(r14)
0x7fff9b8b6840 {java/.../ConcurrentHashMap.putVal} +868               c8012ef9 std       r9, 0x1c8(r14)
0x7fff9b8b6844 {java/.../ConcurrentHashMap.putVal} +869               08026ee8 ld        r3, 0x208(r14)
0x7fff9b8b6848 {java/.../ConcurrentHashMap.putVal} +870               0000e562 ori       r5, r23, 0
0x7fff9b8b684c {java/.../ConcurrentHashMap.putVal} +871  -1:120       c90e0048 bl        0x7fff9b8b7714 U>> +1817 Snippet-> Trampoline {libj9jit29.so}{endproc._interpreterSyncGPR3StaticGlue} +0    // invokespecial 56 {java/.../ConcurrentHashMap.helpTransfer([Ljava/.../shMcurrentHashMap$Node;Ljava/.../ConcurrentHashMap$Node;)[Ljava/.../ConcurrentHashMap$Node;}

That trampoline goes to the interpreter:

(kca) (0x7fff9b8b7714)/5i
0x7fff9b8b7714 {java/.../ConcurrentHashMap.putVal} +1817               10006ef8 std       r3, 0x10(r14) <<< +871
0x7fff9b8b7718 {java/.../ConcurrentHashMap.putVal} +1818               08008ef8 std       r4, 8(r14)
0x7fff9b8b771c {java/.../ConcurrentHashMap.putVal} +1819               0000aef8 std       r5, 0(r14)
0x7fff9b8b7720 {java/.../ConcurrentHashMap.putVal} +1820               81811f48 bl        0x7fff9baaf8a0 Trampoline {libj9jit29.so}{endproc._interpreterSyncGPR3StaticGlue} +0
0x7fff9b8b7724 {java/.../ConcurrentHashMap.putVal} +1821               50688b9b stb       r28, 0x6850(r11) -> {java/.../ConcurrentHashMap.putVal} +872

That's because from the vlog, ConcurrentHashMap.putVal has one compilation and was compiled before ConcurrentHashMap.helpTransfer.

+ (warm) java/util/concurrent/ConcurrentHashMap.putVal(Ljava/lang/Object;Ljava/lang/Object;Z)Ljava/lang/Object; @ 00007FFF9B8B5AB0-00007FFF9B8B7908 OrdinaryMethod - Q_SZ=0 Q_SZI=0 QW=12 j9m=00000000001203C0 bcsz=432 sync FSD OSR remote compThreadID=0 CpuLoad=104%(26%avg) JvmCpu=60%
...
! (warm) java/util/concurrent/ConcurrentHashMap.helpTransfer([Ljava/util/concurrent/ConcurrentHashMap$Node;Ljava/util/concurrent/ConcurrentHashMap$Node;)[Ljava/util/concurrent/ConcurrentHashMap$Node; Q_SZ=0 Q_SZI=0 QW=12 j9m=0000000000120840 time=8336us compilationInterrupted memLimit=262144 KB compThreadID=0
! (warm) java/util/concurrent/ConcurrentHashMap.helpTransfer([Ljava/util/concurrent/ConcurrentHashMap$Node;Ljava/util/concurrent/ConcurrentHashMap$Node;)[Ljava/util/concurrent/ConcurrentHashMap$Node; Q_SZ=0 Q_SZI=0 QW=12 j9m=0000000000120840 time=11933us compilationRecoverableTrampolineFailure memLimit=262144 KB compThreadID=0
+ (warm) java/util/concurrent/ConcurrentHashMap.helpTransfer([Ljava/util/concurrent/ConcurrentHashMap$Node;Ljava/util/concurrent/ConcurrentHashMap$Node;)[Ljava/util/concurrent/ConcurrentHashMap$Node; @ 00007FFF9DAB00F4-00007FFF9DAB040C OrdinaryMethod - Q_SZ=0 Q_SZI=0 QW=12 j9m=0000000000120840 bcsz=124 sync FSD OSR remote compThreadID=0 CpuLoad=350%(87%avg) JvmCpu=316%

newStartPC is the start of the newly compiled ConcurrentHashMap.helpTransfer. It matches the start address in the vlog output as well.

(kca) (0x7fff9dab00f4)/8i
0x7fff9dab00f4 {java/.../ConcurrentHashMap.helpTransfer} +0                 10006ee8 ld        r3, 0x10(r14)
0x7fff9dab00f8 {java/.../ConcurrentHashMap.helpTransfer} +1                 08008ee8 ld        r4, 8(r14)
0x7fff9dab00fc {java/.../ConcurrentHashMap.helpTransfer} +2                 0000aee8 ld        r5, 0(r14)
0x7fff9dab0100 {java/.../ConcurrentHashMap.helpTransfer} +3                 10006ef8 std       r3, 0x10(r14)
0x7fff9dab0104 {java/.../ConcurrentHashMap.helpTransfer} +4                 08008ef8 std       r4, 8(r14)
0x7fff9dab0108 {java/.../ConcurrentHashMap.helpTransfer} +5                 0000aef8 std       r5, 0(r14)
0x7fff9dab010c {java/.../ConcurrentHashMap.helpTransfer} +6                 50006fe9 ld        r11, 0x50(r15) J9VMThread.stackOverflowMark // stack or async check
0x7fff9dab0110 {java/.../ConcurrentHashMap.helpTransfer} +7                 60ffce39 addi      r14, r14, -0xa0

zl-wang · 2024-07-28T15:50:58Z

Good info. Missing HT entry happened in the codeCache where putVal() was compiled into. From its current status, we are not sure yet whether the call (to helpTransfer) was originally unresolved (to begin with) or only interpreted. Look down the snippet whether there were cp/cpindex in it to tell. That in turn indicated whether the resolved HT entry (if it had existed) should be established (created) at compile-time (putVal) or runtime (during picBuilder resolution .. from unresolved HT to resolved HT), although I highly suspected it was resolved to begin with (from same class). If that is the case, i cannot think of a scenario why resolved HT entry is not created at compile-time (because we knew we are calling a static method), such that we cannot look it up (also making sure the current codeCache searched is the putVal one). There is no possibility of runtime error causing it. Then, the only weird thing to me is: both compilations were marked as sync FSD OSR remote ... what is the implication?

zl-wang · 2024-07-28T15:51:37Z

maybe that is the problem ... is remote meaning JIT-server? maybe HT entry missed in that case somehow.

IBMJimmyk · 2024-07-28T16:34:40Z

From the same vlog:

#INFO:  zzz reserveResolvedTrampoline - _resolvedMethodHT: 00007FFFB80D8770, method: 0000000000120840, entry: 0000000000000000
#INFO:  zzz addResolvedMethod - _resolvedMethodHT: 00007FFFB80D8770, method: 0000000000120840, entry: 00007FFFB80D9A70
+ (warm) java/util/concurrent/ConcurrentHashMap.putVal(Ljava/lang/Object;Ljava/lang/Object;Z)Ljava/lang/Object; @ 00007FFF9B8B5AB0-00007FFF9B8B7908 OrdinaryMethod - Q_SZ=0 Q_SZI=0 QW=12 j9m=00000000001203C0 bcsz=432 sync FSD OSR remote compThreadID=0 CpuLoad=104%(26%avg) JvmCpu=60%
...
#INFO:  zzz resetTrampolines - _resolvedMethodHT: 00007FFFB80D8770
...
! (warm) java/util/concurrent/ConcurrentHashMap.helpTransfer([Ljava/util/concurrent/ConcurrentHashMap$Node;Ljava/util/concurrent/ConcurrentHashMap$Node;)[Ljava/util/concurrent/ConcurrentHashMap$Node; Q_SZ=0 Q_SZI=0 QW=12 j9m=0000000000120840 time=8336us compilationInterrupted memLimit=262144 KB compThreadID=0
! (warm) java/util/concurrent/ConcurrentHashMap.helpTransfer([Ljava/util/concurrent/ConcurrentHashMap$Node;Ljava/util/concurrent/ConcurrentHashMap$Node;)[Ljava/util/concurrent/ConcurrentHashMap$Node; Q_SZ=0 Q_SZI=0 QW=12 j9m=0000000000120840 time=11933us compilationRecoverableTrampolineFailure memLimit=262144 KB compThreadID=0
+ (warm) java/util/concurrent/ConcurrentHashMap.helpTransfer([Ljava/util/concurrent/ConcurrentHashMap$Node;Ljava/util/concurrent/ConcurrentHashMap$Node;)[Ljava/util/concurrent/ConcurrentHashMap$Node; @ 00007FFF9DAB00F4-00007FFF9DAB040C OrdinaryMethod - Q_SZ=0 Q_SZI=0 QW=12 j9m=0000000000120840 bcsz=124 sync FSD OSR remote compThreadID=0 CpuLoad=350%(87%avg) JvmCpu=316%
#INFO:  zzz patchCallPoint - _resolvedMethodHT: 00007FFFB80D8770, method: 0000000000120840, entry: 0000000000000000
#INFO:  zzz replaceTrampoline - _resolvedMethodHT: 00007FFFB80D8770, method: 0000000000120840, entry: 0000000000000000

https://github.com/IBMJimmyk/omr/blob/06aeef119cfbdcf7542d916f8935beed4dd50347/compiler/runtime/OMRCodeCache.cpp#L562-L577
reserveResolvedTrampoline is called with method 0x120840, the entry in _resolvedMethodHT for this method is null so addResolvedMethod is called to add the entry.
https://github.com/IBMJimmyk/omr/blob/06aeef119cfbdcf7542d916f8935beed4dd50347/compiler/runtime/OMRCodeCache.cpp#L958-L969
This happens around the time ConcurrentHashMap.putVal is compiled.

Sometime later, resetTrampolines is called on _resolvedMethodHT:
https://github.com/IBMJimmyk/openj9/blob/906b86e12a14cab6b697d9306bb8cb7c2facb3f7/runtime/compiler/runtime/J9CodeCache.cpp#L739-L742
I believe this clears out all of the entries in _resolvedMethodHT. I'm not sure why resetTrampolines is called.

A bit later on, ConcurrentHashMap.helpTransfer finishes compiling and needs to patch the trampoline.
patchCallPoint is called but the entry is null for method 0x120840 in _resolvedMethodHT. This leads to the crash.

So my question is, is something supposed to create a new _resolvedMethodHT entry for method 0x120840 or is the call to resetTrampolines unexpected?

As for the remote tag, I'm not sure what it means at this moment. I'll try asking people more familiar with JIT server if they have an idea.

IBMJimmyk · 2024-07-28T17:12:21Z

I didn't realize that I had the JITAAS flag set on my grinders until now. It is supposed to fail even without that flag so I started a new grinder without it. The new grinder has a failed result already and looks to be a similar failure. I'll try looking at this one to see if anything is different.

zl-wang · 2024-07-28T17:29:42Z

JITAAS means JIT-Server?
anyway, you should look into when/under what conditions resetTrampolines is called. It sounds to me that is the root issue of missing HT entry. I am not familiar with it specifically, but I had impression: when the current compilation found out the designated codeCache had no capacity to house the JIT-code anymore, it needs to switch to another codeCache. Then, the current compilation's trampoline reservations (HT entries) are thrown away and re-reserve them against the new codeCache. Not sure if that amounts to resetTrampolines. I don't think resetTrampolines existed when I did these things. It must be added later for some reason.

IBMJimmyk · 2024-07-28T17:43:03Z

Yes it does. The JITAAS option causes the -XX:+UseJITServer option to be set when running the test.

Looking into resetTrampolines sounds like a good idea.

IBMJimmyk · 2024-07-29T20:42:58Z

The rc018 repeatedly defines classes. As a result it goes through jitClassesRedefined:

openj9/runtime/compiler/control/HookedByTheJit.cpp

Line 2413 in 48f557e

    
           void jitClassesRedefined(J9VMThread * currentThread, UDATA classCount, J9JITRedefinedClass *classList, UDATA extensionsUsed)

jitClassesRedefined checks for FSD here:

openj9/runtime/compiler/control/HookedByTheJit.cpp

Line 2456 in 48f557e

if (!TR::Options::getCmdLineOptions()->getOption(TR_FullSpeedDebug))

FSD is enabled so all the trampolines gets deleted here:

openj9/runtime/compiler/control/HookedByTheJit.cpp

Line 2531 in 48f557e

TR::CodeCacheManager::instance()->onFSDDecompile();

I added a debug line to print the flag to the verbose log to confirm that the FSD flag is set.

If FSD was disabled, I think it would have gone here:

openj9/runtime/compiler/control/HookedByTheJit.cpp

Lines 2482 to 2483 in 48f557e

    
           TR::CodeCacheManager::instance()->onClassRedefinition(reinterpret_cast<TR_OpaqueMethodBlock *>(staleMethod), 
        
                                                                 reinterpret_cast<TR_OpaqueMethodBlock *>(freshMethod));

onClassRedefinition removes the old method from _resolvedMethodHT and adds in the new one.

With all this, I am now able to reproduce the problem locally with a small test:

import java.io.*;
import java.lang.instrument.*;

public class HCRTrampoline {

    byte[] classByteArray;

    public static void main (String[] args) {
        HCRTrampoline t = new HCRTrampoline();
        try {
           t.runTest();
        } catch (Exception e) {
            System.out.println("***FAILED*** - Unable to find class");
        }
        System.out.println("Done");
    }

    HCRTrampoline() {
        try {
            File classFile = new File("InitObject.class");
            FileInputStream inputStream = new FileInputStream(classFile);

            classByteArray = new byte[(int)classFile.length()];
            inputStream.read(classByteArray);
        } catch (Exception e) {
            System.out.println("***FAILED*** - Unable to set up testing.");
            System.exit(0);
        }
    }

    private void runTest() throws Exception {
        for (int i = 0; i < 15; i++) {
            callA(i);
        }
        RedefAgent.instru.redefineClasses(new ClassDefinition(InitObject.class, classByteArray));
        for (int i = 0; i < 150; i++) {
            callA(i);
        }
    }

    private int callA(int a) {
        if (a >= 5) {
            a = callB(a);
        }
        return a;
    }

    private int callB(int b) {
        b = b + 1;
        return b;
    }
}

Run with:

java -javaagent:RedefAgent.jar -Xjit:"stressTrampolines,forceFullSpeedDebug,limit={*HCRTrampoline*},optlevel=warm,count=10,disableinlining,disableasynccompilation,disableDynamicLoopTransfer,compilationThreads=1,verbose,vlog=trace/trampoline.vlog" -Xshareclasses:none -Xnoaot HCRTrampoline

The test runs callA until a compilation is triggered for callA.
callB is intentionally called a fewer number of times since I don't want it to be compiled yet.
Inlining is disabled since I don't want callB to be inlined.
stressTrampolines forces the creation of a trampoline.
forceFullSpeedDebug forces FSD mode.

After callA is compiled, redefineClasses is called to redefine InitObject.class with itself. I just wanted to trigger a method recompilation.

callA is called a few more time since I now want callB to get compiled. This tries to patch the trampoline in callA and crashes.

The verbose log looks like this:

#INFO:  zzz reserveResolvedTrampoline - _resolvedMethodHT: 0000758AC82063C0, method: 00000000002A4848, entry: 0000000000000000
#INFO:  zzz addResolvedMethod - _resolvedMethodHT: 0000758AC82063C0, method: 00000000002A4848, entry: 0000758AC8206D10
#INFO:  zzz getResolvedTrampoline current - _resolvedMethodHT: 0000758AC82063C0, method: 00000000002A4848, compMethod: HCRTrampoline.callA(I)I
+ (warm) HCRTrampoline.callA(I)I @ 0000758A9C140078-0000758A9C14015C OrdinaryMethod - Q_SZ=1 Q_SZI=1 QW=12 j9m=00000000002A4828 bcsz=13 sync FSD OSR compThreadID=0 CpuLoad=7%(0%avg) JvmCpu=0%
#INFO:  zzz jitClassesRedefined call resetCodeCache - FSD flag: 1
#INFO:  zzz onFSDDecompile call resetCodeCache - _resolvedMethodHT: 0000758AC82063C0
#INFO:  zzz resetTrampolines - _resolvedMethodHT: 0000758AC82063C0
+ (warm) HCRTrampoline.callB(I)I @ 0000758A9C1401F8-0000758A9C140240 OrdinaryMethod - Q_SZ=0 Q_SZI=0 QW=2 j9m=00000000002A4848 bcsz=6 sync FSD OSR compThreadID=0 CpuLoad=7%(0%avg) JvmCpu=0%
#INFO:  zzz patchCallPoint - _resolvedMethodHT: 0000758AC82063C0, method: 00000000002A4848, entry: 0000000000000000
#INFO:  zzz replaceTrampoline - _resolvedMethodHT: 0000758AC82063C0, method: 00000000002A4848, entry: 0000000000000000

The compilation of callA gets method 2A4848(callB) added to _resolvedMethodHT.
The class redefinition under FSD causes the resetTrampolines call to empty out _resolvedMethodHT.
When replaceTrampoline tries to get entry from method 2A4848, null is returned.
This is a segfault inside replaceTrampoline.

FSD mode throwing out the trampolines seems to be intentional.
So the question is, how are things supposed to be handled from there? Is something else supposed to fill in _resolvedMethodHT so the trampoline can be fixed or is the existing compilation of callA supposed to be thrown out?

zl-wang · 2024-07-29T21:07:01Z

throwing out all trampoline reservations without throwing out all reservation creators as well will just not work. it is wrong to do that to begin with. my impression with FSD decompile might be out of dated, but i remembered when decompile is triggered, everything should be thrown out (essentially all codecache(s) are cleared ... starting as if the run is about to begin from JIT perspective). @tajila and @gacholio please chime in.

gacholio · 2024-07-30T18:26:07Z

From the VM side, FSD HCR causes all running methods to be tagged for decompilation and all compiled methods are marked for retranslation. I assume the discard/reclaim of the code cache is done by jitClassesRedefined.

IBMJimmyk · 2024-07-30T18:50:22Z

I was asking @jdmpapin about this and he pointed me towards these lines of code:

openj9/runtime/util/hshelp.c

Lines 4226 to 4229 in bd018c0

    
           if (extensionsEnabled) { 
        
           	/* Toss the whole code cache */ 
        
           	jitConfig->jitHotswapOccurred(currentThread); 
        
           }

extensionsEnabled used to always be true if FSD is enabled. There's even a comment above the method that mentions this. I believe this is no longer the case after the switch to disabling Extended HCR by default. Enabling Extended HCR (-XX:+EnableExtendedHCR) causes the problem to stop occurring which in line with what happened with the original test.

As a quick test, I tried adding a check for J9_FSD_ENABLED(vm) so the code cache is also now tossed if the FSD option is set and this also makes the problem stop.

In both cases, callB no longer attempts to patch the trampoline in callA.

zl-wang · 2024-07-30T19:36:14Z

Fixing this cross-platform issue is not a problem anymore ...
The remaining question beyond this issue itself is: should we try to clear/recycle the codecache(s)? currently, it was obviously intended to do, but it wasn't done (only dealing with trampolines now). sure, if we want to plug this hole, another issue is needed.

IBMJimmyk · 2024-07-31T17:17:32Z

I opened a PR for the fix:
#19949

jitClassRedefineEvent now checks for both extensions or FSD and dump the code cache is both cases.

I ran a Grinder with my change and all 5 runs passes:
https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder/42515/

I was also asked why extended HCR is being disabled by default in the first place since that's what seems to have caused the bug. I believe it traces back to this: #17857

It was one step in the process of trying to remove extended HCR altogether.

jdmpapin · 2024-07-31T19:03:19Z

There are maintenance benefits to removing extended HCR, (edit: some of) which I described in that issue, but I think the more proximal reason for the change is that with extended HCR the static fields will move on redefinition, which interferes with an ongoing effort to simplify unsafe accesses

hzongaro · 2024-09-15T22:40:08Z

@IBMJimmyk wrote:

I opened a PR for the fix:
#19949

Does that resolve the problem? Can this issue be closed or is there still some problem remaining?

IBMJimmyk · 2024-09-16T18:38:37Z

This can be closed. The PR fixes the problem. I was not able to reproduce the problem after the fix and have not heard of any other issues.

cjjdespres added the test failure label Jul 4, 2024

pshipton added this to the Release 0.46 (Java 8, 11, 17, 21, 22) July refresh milestone Jul 5, 2024

pshipton added a commit to pshipton/openj9 that referenced this issue Jul 5, 2024

Add -XX:+EnableExtendedHCR to jvmtitests_hcr rc018, rtc001

32db9dc

Related to eclipse-openj9#19710 Issue eclipse-openj9#19765 Issue eclipse-openj9#19813 Signed-off-by: Peter Shipton <[email protected]>

pshipton added a commit to pshipton/openj9 that referenced this issue Jul 5, 2024

Add -XX:+EnableExtendedHCR to jvmtitests_hcr rc018, rtc001

b74221b

Related to eclipse-openj9#19710 Issue eclipse-openj9#19765 Issue eclipse-openj9#19813 Signed-off-by: Peter Shipton <[email protected]>

pshipton mentioned this issue Jul 5, 2024

Add -XX:+EnableExtendedHCR to jvmtitests_hcr rc018, rtc001 #19819

Merged

pshipton added a commit to pshipton/openj9 that referenced this issue Jul 5, 2024

Add -XX:+EnableExtendedHCR to jvmtitests_hcr rc018, rtc001

115150a

Related to eclipse-openj9#19710 Issue eclipse-openj9#19765 Issue eclipse-openj9#19813 Signed-off-by: Peter Shipton <[email protected]>

pshipton mentioned this issue Jul 5, 2024

(0.46) Add -XX:+EnableExtendedHCR to addtional tests #19820

Merged

tajila mentioned this issue Jul 5, 2024

cmdLineTester_jvmtitests_hcr_OSRG_nongold rtc001 crash vmState=0x000501ff #19765

Closed

pshipton modified the milestones: Release 0.46 (Java 8, 11, 17, 21, 22) July refresh, Release 0.48 (Java 8, 11, 17, 21, 23) October refresh Jul 5, 2024

tajila added the comp:jit label Jul 9, 2024

IBMJimmyk mentioned this issue Jul 31, 2024

Adds FSD check to jitClassRedefineEvent #19949

Merged

jdmpapin closed this as completed Sep 16, 2024

jdmpapin mentioned this issue Nov 15, 2024

Segmentation-Error during code change in debug session #20229

Closed

cmdLineTester_jvmtitests_hcr_OSRG_nongold_0 rc018 ppc64le segfault in internal tests #19813

cmdLineTester_jvmtitests_hcr_OSRG_nongold_0 rc018 ppc64le segfault in internal tests #19813

Comments

cjjdespres commented Jul 4, 2024

Failure link

Optional info

Failure output (captured from console output)

cjjdespres commented Jul 4, 2024

cjjdespres commented Jul 4, 2024

tajila commented Jul 4, 2024

tajila commented Jul 4, 2024 • edited by pshipton Loading

pshipton commented Jul 5, 2024 • edited Loading

pshipton commented Jul 5, 2024

tajila commented Jul 5, 2024 • edited Loading

cjjdespres commented Jul 5, 2024

tajila commented Jul 5, 2024

zl-wang commented Jul 8, 2024

zl-wang commented Jul 8, 2024

tajila commented Jul 8, 2024

tajila commented Jul 8, 2024 • edited Loading

zl-wang commented Jul 8, 2024

tajila commented Jul 8, 2024

zl-wang commented Jul 8, 2024

tajila commented Jul 8, 2024

IBMJimmyk commented Jul 23, 2024

IBMJimmyk commented Jul 24, 2024

IBMJimmyk commented Jul 25, 2024

zl-wang commented Jul 25, 2024

IBMJimmyk commented Jul 26, 2024

IBMJimmyk commented Jul 26, 2024

zl-wang commented Jul 26, 2024

zl-wang commented Jul 26, 2024 • edited Loading

zl-wang commented Jul 26, 2024

zl-wang commented Jul 26, 2024

IBMJimmyk commented Jul 28, 2024

zl-wang commented Jul 28, 2024

zl-wang commented Jul 28, 2024 • edited Loading

IBMJimmyk commented Jul 28, 2024

IBMJimmyk commented Jul 28, 2024

zl-wang commented Jul 28, 2024

IBMJimmyk commented Jul 28, 2024

IBMJimmyk commented Jul 29, 2024

zl-wang commented Jul 29, 2024

gacholio commented Jul 30, 2024 • edited Loading

IBMJimmyk commented Jul 30, 2024

zl-wang commented Jul 30, 2024

IBMJimmyk commented Jul 31, 2024

jdmpapin commented Jul 31, 2024 • edited Loading

hzongaro commented Sep 15, 2024

IBMJimmyk commented Sep 16, 2024

tajila commented Jul 4, 2024 •

edited by pshipton

Loading

pshipton commented Jul 5, 2024 •

edited

Loading

tajila commented Jul 5, 2024 •

edited

Loading

tajila commented Jul 8, 2024 •

edited

Loading

zl-wang commented Jul 26, 2024 •

edited

Loading

zl-wang commented Jul 28, 2024 •

edited

Loading

gacholio commented Jul 30, 2024 •

edited

Loading

jdmpapin commented Jul 31, 2024 •

edited

Loading