Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmdLineTester_jvmtitests_hcr_OSRG_nongold_0 rc018 ppc64le segfault in internal tests #19813

Closed
cjjdespres opened this issue Jul 4, 2024 · 43 comments

Comments

@cjjdespres
Copy link
Contributor

Failure link

A sample, as it's been failing very frequently in the JITServer nightly tests:

https://hyc-runtimes-jenkins.swg-devops.com/job/Test_openjdk17_j9_extended.functional_ppc64le_linux_jit_Personal/991/
https://hyc-runtimes-jenkins.swg-devops.com/job/Test_openjdk11_j9_extended.functional_ppc64le_linux_jit_Personal/1078/

The failure in cmdLineTester_jvmtitests_hcr_OSRG_nongold_SE80_0 on ppc64le JDK8 in this test from July 4 also fails in rc018 in the same way.

Optional info

I initially noticed this in the JITServer nightly tests, but a 50x grinder without JITAAS had a 43/50 failure rate, so this is not JITServer-specific. I have only seen this failure on ppc64le so far.

Failure output (captured from console output)

        Testing: rc018
        Test start time: 2024/07/04 01:34:28 Pacific Standard Time
        Running command: "/home/jenkins/workspace/Test_openjdk17_j9_extended.functional_ppc64le_linux_jit_Personal_testList_0/jdkbinary/j2sdk-image/bin/java" -XX:+UseJITServer -Xgcpolicy:optthruput -Xdebug -Xrunjdwp:transport=dt_socket,address=8888,server=y,onthrow=no.pkg.foo,launch=echo -Xjit:enableOSR,enableOSROnGuardFailure,count=1,disableAsyncCompilation  -Xdump    -agentlib:jvmtitest=test:rc018 -cp "/home/jenkins/workspace/Test_openjdk17_j9_extended.functional_ppc64le_linux_jit_Personal_testList_0/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/jvmtitests/jvmtitest.jar:/home/jenkins/workspace/Test_openjdk17_j9_extended.functional_ppc64le_linux_jit_Personal_testList_0/../../testDependency/lib/asm-all.jar" com.ibm.jvmti.tests.util.TestRunner
        Time spent starting: 8 milliseconds
        Time spent executing: 8339 milliseconds
        Test result: FAILED
        Output from test:
         [OUT] *** Testing [1/1]:	testReflectRedefineAtSameTime
         [OUT] starting reflect worker threads
         [OUT] starting to populate java heap
         [ERR] Unhandled exception
         [ERR] Type=Segmentation error vmState=0x00040000
         [ERR] J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000001
         [ERR] Handler1=00007FFF9FD518F0 Handler2=00007FFFA41C97C0
         [ERR] R0=00007FFF9F37E780 R1=00007FFFA473C1B0 R2=00007FFF9F76EB00 R3=0000000000000000
         [ERR] R4=00000000001CFE20 R5=0000000000000000 R6=0000000000000000 R7=00007FFF85280074
         [ERR] R8=00007FFF83269B86 R9=00007FFFA00D87A0 R10=00007FFF83269B86 R11=0000000000000000
         [ERR] R12=00007FFF8318F2A8 R13=00007FFFA47468E0 R14=000000000017B700 R15=0000000000095800
         [ERR] R16=00007FFF816E0038 R17=0000000000000054 R18=0000000000000000 R19=0000000089F7B8E8
         [ERR] R20=0000000000827FE1 R21=000000000000009C R22=0000000088D0E0B0 R23=0000000048000001
         [ERR] R24=48000DC548000001 R25=00007FFFA00D8210 R26=00007FFF8318F2A8 R27=0000000000000001
         [ERR] R28=00007FFF85280074 R29=0000000000000000 R30=00007FFFA00D87A0 R31=0000000000000000
         [ERR] NIP=00007FFF9F37E824 MSR=800000000280F033 ORIG_GPR3=D00000000282056C CTR=00007FFF9EDBAD80
         [ERR] LINK=00007FFF9F37E820 XER=0000000000000000 CCR=0000000028242422 SOFTE=0000000000000001
         [ERR] TRAP=0000000000000300 DAR=0000000000000020 dsisr=0000000042000000 RESULT=0000000000000000
         [ERR] FPR0=000000000017ba00 (f: 1554944.000000, d: 7.682444e-318)
         [ERR] FPR1=4056bb7820000000 (f: 536870912.000000, d: 9.092921e+01)
         [ERR] FPR2=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR3=3fee666660000000 (f: 1610612736.000000, d: 9.500000e-01)
         [ERR] FPR4=3fce840b4ac4e4d2 (f: 1254417664.000000, d: 2.384047e-01)
         [ERR] FPR5=bfe7154748bef6c8 (f: 1220474624.000000, d: -7.213475e-01)
         [ERR] FPR6=3fe62e42fefa39ef (f: 4277811712.000000, d: 6.931472e-01)
         [ERR] FPR7=bfb1973c5a611ccc (f: 1516313856.000000, d: -6.871392e-02)
         [ERR] FPR8=6552747365745f38 (f: 1702125312.000000, d: 1.196544e+180)
         [ERR] FPR9=6552646e41666564 (f: 1097229696.000000, d: 1.192487e+180)
         [ERR] FPR10=324f5f7463656c66 (f: 1667591296.000000, d: 2.327369e-66)
         [ERR] FPR11=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR12=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR13=4020800000000000 (f: 0.000000, d: 8.250000e+00)
         [ERR] FPR14=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR15=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR16=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR17=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR18=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR19=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR20=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR21=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR22=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR23=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR24=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR25=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR26=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR27=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR28=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR29=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR30=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR31=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] Module=/home/jenkins/workspace/Test_openjdk17_j9_extended.functional_ppc64le_linux_jit_Personal_testList_0/jdkbinary/j2sdk-image/lib/default/libj9jit29.so
         [ERR] Module_base_address=00007FFF9E7E0000
         [ERR] Target=2_90_20240704_1047 (Linux 4.18.0-553.5.1.el8_10.ppc64le)
         [ERR] CPU=ppc64le (4 logical CPUs) (0x1dcc50000 RAM)
         [ERR] ----------- Stack Backtrace -----------
         [ERR] _ZN3OMR9CodeCache17replaceTrampolineEP20TR_OpaqueMethodBlockPvS3_S3_b+0xe4 (0x00007FFF9F37E824 [libj9jit29.so+0xb9e824])
         [ERR] _ZN3OMR16CodeCacheManager17replaceTrampolineEP20TR_OpaqueMethodBlockPvS3_S3_S3_b+0x94 (0x00007FFF9F381044 [libj9jit29.so+0xba1044])
         [ERR] mcc_replaceTrampoline+0x58 (0x00007FFF9ED65B98 [libj9jit29.so+0x585b98])
         [ERR] _Z15ppcCodePatchingPvS_S_S_S_S_+0x310 (0x00007FFF9EDBB090 [libj9jit29.so+0x5db090])
         [ERR] _ZN3OMR9CodeCache14patchCallPointEP20TR_OpaqueMethodBlockPvS3_S3_+0x100 (0x00007FFF9F37DB50 [libj9jit29.so+0xb9db50])
         [ERR] mcc_callPointPatching_unwrapper+0x50 (0x00007FFF9ED659F0 [libj9jit29.so+0x5859f0])
         [ERR] old_slow_jitCallCFunction+0x54 (0x00007FFF9F435644 [libj9jit29.so+0xc55644])
         [ERR]  (0x00007FFF9F46CF90 [libj9jit29.so+0xc8cf90])
         [ERR] sidecarInvokeReflectMethodImpl+0x508 (0x00007FFF9FD2BD98 [libj9vm29.so+0x1bd98])
         [ERR] sidecarInvokeReflectMethod+0x3c (0x00007FFF9FD2CCEC [libj9vm29.so+0x1ccec])
         [ERR] JVM_InvokeMethod_Impl+0xa8 (0x00007FFF9E17B7B8 [libjclse29.so+0x8b7b8])
         [ERR] JVM_InvokeMethod+0x3c (0x00007FFFA46664BC [libjvm.so+0x264bc])
         [ERR] JVM_InvokeMethod+0x30 (0x00007FFFA47473B0 [libjvm.so+0x73b0])
         [ERR] Java_jdk_internal_reflect_NativeMethodAccessorImpl_invoke0+0x24 (0x00007FFF9E02DC84 [libjava.so+0xdc84])
         [ERR]  (0x00007FFF9FF6C9B8 [libj9vm29.so+0x25c9b8])
         [ERR] ffi_call_int+0xd4 (0x00007FFF9FF6C684 [libj9vm29.so+0x25c684])
         [ERR] debugBytecodeLoopCompressed+0x16f2c (0x00007FFF9FE367EC [libj9vm29.so+0x1267ec])
         [ERR]  (0x00007FFF9FED6118 [libj9vm29.so+0x1c6118])
         [ERR] runCallInMethod+0x258 (0x00007FFF9FD2B528 [libj9vm29.so+0x1b528])
         [ERR] gpProtectedRunCallInMethod+0x54 (0x00007FFF9FD55C04 [libj9vm29.so+0x45c04])
         [ERR] signalProtectAndRunGlue+0x28 (0x00007FFF9FEEA248 [libj9vm29.so+0x1da248])
         [ERR] omrsig_protect+0x3e4 (0x00007FFFA41CAC74 [libj9prt29.so+0x3ac74])
         [ERR] gpProtectAndRun+0xa8 (0x00007FFF9FEEA318 [libj9vm29.so+0x1da318])
         [ERR] gpCheckCallin+0xc4 (0x00007FFF9FD582C4 [libj9vm29.so+0x482c4])
         [ERR] callStaticVoidMethod+0x48 (0x00007FFF9FD55218 [libj9vm29.so+0x45218])
         [ERR] JavaMain+0x1210 (0x00007FFFA4A27E60 [libjli.so+0x7e60])
         [ERR] ThreadJavaMain+0x18 (0x00007FFFA4A2E0F8 [libjli.so+0xe0f8])
         [ERR] start_thread+0xf8 (0x00007FFFA49D9718 [libpthread.so.0+0x9718])
         [ERR] clone+0x74 (0x00007FFFA48BB798 [libc.so.6+0x13b798])
@cjjdespres
Copy link
Contributor Author

@pshipton @tajila The earliest I've been able to find these failures is the nightly test run of June 27, which tested openj9 commits from June 26 and earlier, so I think this may be related to #19710 and maybe #19778. I'm not sure if the test options need to be adjusted as in #19792 or if it's something else.

@cjjdespres
Copy link
Contributor Author

The version info for the early failure in #19813 (comment) is:

# DETECTED_JAVA_VERSION=openjdk version "11.0.24-internal" 2024-07-16
# OpenJDK Runtime Environment (build 11.0.24-internal+0-adhoc.jenkins.BuildJDK11ppc64lelinuxjitPersonal)
# Eclipse OpenJ9 VM (build master-979ce920c7c, JRE 11 Linux ppc64le-64-Bit Compressed References 20240627_1784 (JIT enabled, AOT enabled)
# OpenJ9   - 979ce920c7c
# OMR      - 47a9d248db0
# JCL      - 35dab526d05 based on jdk-11.0.24+6)

@tajila
Copy link
Contributor

tajila commented Jul 4, 2024

@pshipton @tajila The earliest I've been able to find these failures is the nightly test run of June 27, which tested openj9 commits from June 26 and earlier, so I think this may be related to #19710 and maybe #19778. I'm not sure if the test options need to be adjusted as in #19792 or if it's something else.

Ill take a look

@tajila
Copy link
Contributor

tajila commented Jul 4, 2024

Grinder with option added, https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder/41855/ - passed

If it passes that will confirm its related to the HCR changes.

However, the crash is unexpected regardless of options specified so that will need investigation.

@pshipton
Copy link
Member

pshipton commented Jul 5, 2024

Created #19819 to add -XX:+EnableExtendedHCR
and #19820 for 0.46

@tajila
Copy link
Contributor

tajila commented Jul 5, 2024

@cjjdespres I still think the crash needs to be investigated. The extended HCR options may have just exposed an issue. Disabling extended HCR prevents redefinition from changing the class shape, any attempt to do so should result in a JVMTI error. The fact that it is crashing means an assumption was broken.

I looked at the stacktrace and I couldn't find anything suspicious, but thats probably because Im not very familiar with that code. Do you know why it is crashing? I think we still need to answer this question.

@cjjdespres
Copy link
Contributor Author

I agree that this should be investigated. I think this option influences jvmtiRedefineClasses() and related functions, right? I'm also not at all familiar with that code, or with power.

@tajila
Copy link
Contributor

tajila commented Jul 5, 2024

@hzongaro @zl-wang Do you know who can investigate this?

@zl-wang
Copy link
Contributor

zl-wang commented Jul 8, 2024

@tajila could you briefly describe the difference between previous HCR and extended HCR? in particular, method recompilation and its implications. from the stack back-trace, it crashed trying to replace a trampoline (could be many different reasons). presumably, this is across platform, but it happened to be on ppc64le due to certain recompilations by-chance.

@zl-wang
Copy link
Contributor

zl-wang commented Jul 8, 2024

@IBMJimmyk i will ask jimmy to take a further look.

@tajila
Copy link
Contributor

tajila commented Jul 8, 2024

@tajila could you briefly describe the difference between previous HCR and extended HCR? in particular, method recompilation and its implications. from the stack back-trace, it crashed trying to replace a trampoline (could be many different reasons). presumably, this is across platform, but it happened to be on ppc64le due to certain recompilations by-chance.

The basic difference is that exended HCR allows one to change the class shape (ie. add/remove fields and methods). This means we dont redefine the J9Class in place, as we do with fast HCR. We create a new J9Class, then we need to do a heap walk to replace all object headers that are affected, update class heirarchies, array component type, etc. This also means that things like the vtable/jitvtable will have a new address, j9methods will also have a new address, etc.

@tajila
Copy link
Contributor

tajila commented Jul 8, 2024

WRT "method recompilation and its implications." I'll let @gacholio add anything I may have missed.

@zl-wang
Copy link
Contributor

zl-wang commented Jul 8, 2024

i.e. as part of the HCR hooked actions, every such-typed existing object is re-constructed?

@tajila
Copy link
Contributor

tajila commented Jul 8, 2024

i.e. as part of the HCR hooked actions, every such-typed existing object is re-constructed?

Yes, the J9Class in the header of affected object intances are updated to the new J9Class in the extended HCR case.

@zl-wang
Copy link
Contributor

zl-wang commented Jul 8, 2024

Yes, the J9Class in the header of affected object intances are updated to the new J9Class in the extended HCR case.

I meant more than that. since fields can be added or removed, every such object needs to be reconstructed literally. size can be different. it sounds like a newObject() being required. without new size, is the old-object allowed to continue running?

@tajila
Copy link
Contributor

tajila commented Jul 8, 2024

I meant more than that. since fields can be added or removed, every such object needs to be reconstructed literally. size can be different. it sounds like a newObject() being required. without new size, is the old-object allowed to continue running?

Sorry, you can add/remove static fields, but not instance fields. So only the instance header needs to be updated.

@tajila tajila added the comp:jit label Jul 9, 2024
@IBMJimmyk
Copy link
Contributor

I've been trying to take a look at this problem.

rc018 takes the class rc018_testRedefAndReflect_O1 and redefines it with rc018_testRedefAndReflect_O2. The two classes are the same except with different names. To be more specific, other than the class name the only difference in the java files are whitespace changes. The different names doesn't matter since the rc018 test will edit the class bytes of rc018_testRedefAndReflect_O2 to change the name to rc018_testRedefAndReflect_O1 before the redefinition. So the redefinition should be valid with and without extended HCR. But, it seems like something specific to the codepath to handle non-extended HCR is causing a problem.

Attempts to recreate the problem on a local system have been unsuccessful but it fails very regularly (at least 80+%) via Grinders. So I've been looking at the artifacts from one of my grinders (-XX:-EnableExtendedHCR is set):
https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder/42275/

In this, the problem shows up as a hang.

        Testing: rc018
        Test start time: 2024/07/22 12:50:59 Pacific Standard Time
        Running command: "/home/jenkins/workspace/Grinder/jdkbinary/j2sdk-image/bin/java"  -Xgcpolicy:optthruput -Xdebug -Xrunjdwp:transport=dt_socket,address=8888,server=y,onthrow=no.pkg.foo,launch=echo -Xjit:enableOSR,enableOSROnGuardFailure,count=1,disableAsyncCompilation -XX:-EnableExtendedHCR -Xdump    -agentlib:jvmtitest=test:rc018 -cp "/home/jenkins/workspace/Grinder/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/jvmtitests/jvmtitest.jar:/home/jenkins/workspace/Grinder/../../testDependency/lib/asm-all.jar" com.ibm.jvmti.tests.util.TestRunner
        Time spent starting: 4 milliseconds
        ***[TEST INFO 2024/07/22 12:54:59] ProcessKiller detected a timeout after 240000 milliseconds!***

So I tried looking at what threads were currently running and this one stuck out:

3XMTHREADINFO      "Populating java heap" J9VMThread:0x0000000000241500, omrthread_t:0x000076D7AC2669B0, java/lang/Thread:0x0000000080D9E6F8, state:R, prio=5
3XMJAVALTHREAD            (java/lang/Thread getId:0x19, isDaemon:false)
3XMJAVALTHRCCL            jdk/internal/loader/ClassLoaders$AppClassLoader(0x0000000080D1C860)
3XMTHREADINFO1            (native thread ID:0x19C56C, native priority:0x5, native policy:UNKNOWN, vmstate:CW, vm thread flags:0x00000081)
3XMTHREADINFO2            (native stack address range from:0x000076D789790000, to:0x000076D789810000, size:0x80000)
3XMCPUTIME               CPU usage total: 0.625400978 secs, current category="Application"
3XMHEAPALLOC             Heap bytes allocated since last GC cycle=41922488 (0x27FAFB8)
3XMTHREADINFO3           Java callstack:
4XESTACKTRACE                at com/ibm/jvmti/tests/redefineClasses/rc018$populate.run(rc018.java:59)
3XMTHREADINFO3           Native callstack:
4XENATIVESTACK                (0x000076D7B1E97564 [libj9prt29.so+0x37564])
4XENATIVESTACK                (0x000076D7B1E9E948 [libj9prt29.so+0x3e948])
4XENATIVESTACK                (0x000076D7B1E97B94 [libj9prt29.so+0x37b94])
4XENATIVESTACK                (0x000076D7B1E97184 [libj9prt29.so+0x37184])
4XENATIVESTACK                (0x000076D7B1E9E948 [libj9prt29.so+0x3e948])
4XENATIVESTACK                (0x000076D7B1E97298 [libj9prt29.so+0x37298])
4XENATIVESTACK                (0x000076D7B1E99098 [libj9prt29.so+0x39098])
4XENATIVESTACK               __kernel_sigtramp_rt64+0x0 (0x000076D7B2AC04C8 [linux-vdso64.so.1+0x4c8])
4XENATIVESTACK               pthread_cond_wait+0x19c (0x000076D7B2A22E0C [libpthread.so.0+0x12e0c])
4XENATIVESTACK               omrthread_monitor_wait+0x38c (0x000076D7B1E2B04C [libj9thr29.so+0xb04c])
4XENATIVESTACK                (0x000076D7B1FD7538 [libj9vm29.so+0xa7538])
4XENATIVESTACK                (0x000076D7B1FD7910 [libj9vm29.so+0xa7910])
4XENATIVESTACK                (0x000076D7B1F44344 [libj9vm29.so+0x14344])
4XENATIVESTACK                (0x000076D7B203AAA8 [libj9vm29.so+0x10aaa8])
4XENATIVESTACK                (0x000076D7B21001E0 [libj9vm29.so+0x1d01e0])
4XENATIVESTACK                (0x000076D7B1F4AEE0 [libj9vm29.so+0x1aee0])
4XENATIVESTACK                (0x000076D7B1FE8BF0 [libj9vm29.so+0xb8bf0])
4XENATIVESTACK                (0x000076D7B1E9E948 [libj9prt29.so+0x3e948])
4XENATIVESTACK                (0x000076D7B1FE3EE4 [libj9vm29.so+0xb3ee4])
4XENATIVESTACK                (0x000076D7B1E313E8 [libj9thr29.so+0x113e8])
4XENATIVESTACK                (0x000076D7B2A18838 [libpthread.so.0+0x8838])
4XENATIVESTACK               clone+0x74 (0x000076D7B28ABA44 [libc.so.6+0x14ba44])

What this thread does is create objects for 20 seconds then stops:

        class populate extends Thread {
                long patience = 20000000000L; /* 20 seconds in nanoseconds */

                populate() {
                        super("Populating java heap");
                }

                public void run() {
                        long startTime = System.nanoTime();
                        while (((System.nanoTime() - startTime) < patience)) { //<<-------rc018.java:59 is here
                                Object array[] = new Object[1000];
                        }

                }
        }

So whatever is causing the hang causes this thread to stop as well even though it isn't directly involved with the rc018_testRedefAndReflect_O1 class that is getting redefined. Even after 20 seconds have passed, this thread is still running.

@IBMJimmyk
Copy link
Contributor

I ended up jumping to this this test failure since it was a segfault instead of a crash:
https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder/42395/

https://github.com/eclipse/omr/blob/d18121d17c5750b62d057c61a84dd72dd9584e90/compiler/runtime/OMRCodeCache.cpp#L629-L638
The call to findResolvedMethod returns null. oldTrampoline is also null.
The crash is when it uses entry at entry->_info._resolved._currentTrampoline = trampoline;.

These symptoms look very similar to: #12120 (comment)
In that old issue, cpIndex is an 18-bit field. Some places sign extended it to 32bit and some places zero extended it. This caused mismatch where the cpIndex should have matched but didn't. The fix for that problem was checked in a long time ago but maybe I missed a spot or someone else introduced a new variation on the same problem.

Next step is to try adding some debug prints to the locations that are expected to update _resolvedMethodHT and see if that sheds any light on the issue. Based on my old notes, I've identified a bunch of places to added debug prints that can potentially help out.

This bug seems like it connects a bad cpIndex, non-extended HCR and class redefintion. So if anyone knows a place in the code that might be related to all three, it is likely a good place to look as well.

@IBMJimmyk
Copy link
Contributor

I'm still trying to figure out what is going on. One thing I discovered is when I run the test locally, the path through replaceTrampoline doesn't seem to be taken which is important because that's where the failure happens.

On failing grinder run with my new trace messages I see this:

#INFO:  zzz replaceTrampoline - method: 0000000000120220, entry: 0000000000000000

It interacts with this lookup:

entry = _resolvedMethodHT->findResolvedMethod(method);

This means at the start of replaceTrampoline, there is a lookup for method 0x120220 and it returns null for entry which leads to the crash.

In my local run, that line is never printed out so there are no calls to replaceTrampoline. Since it never goes there, it seems to avoid the failure condition and the test passes.

Another thing is, on the failing grinder runs, going to replaceTrampoline seems to guarantee failure. I checked 5 runs, in all 5 runs the call out to replaceTrampoline happens once each and it fails in that one try.

So my next step is to figure out why replaceTrampoline is being called. This may lead to discovering what the problem is or at least may make local reproduction possible if I can recreate the conditions that cause replaceTrampoline to be called.

@zl-wang
Copy link
Contributor

zl-wang commented Jul 25, 2024

method 0120220 is one of the methods in the replaced or replacing classes? running into replaceTrampoline is also by chance (it needs to be recompiled, called previous version to trigger). if it belongs to the replaced class, that might be the source reason (it should disappear altogether already).

@IBMJimmyk
Copy link
Contributor

(kca) j9m 0x120220
Method   {ClassPath/Name.MethodName}: {java/util/concurrent/ConcurrentHashMap.casTabAt}
                           Signature: ([Ljava/util/concurrent/ConcurrentHashMap$Node;ILjava/util/concurrent/ConcurrentHashMap$Node;Ljava/util/concurrent/ConcurrentHashMap$Node;)Z
                              Access: Static Final
                    J9Class/J9Method: 0x0000000000121000 / 0x0000000000120220
               Compiled Method Start: 0x00007fff720300f0 (127 instructions)
                      ByteCode Start: 0x00007fff8c2378d4 (21 bytes)
                   ROM Constant Pool: 0x00007fff8c236410 (307 entries)
                       Constant Pool: 0x000000000011ee30 (303 entries)

It doesn't seem to be related to the class being replaced at all.

The class being redefined, rc018_testRedefAndReflect_O1, does not use ConcurrentHashMap. It's a simple class that doesn't have any methods. It just initializes a bunch of fields.

The test rc018 doesn't even create objects of type rc018_testRedefAndReflect_O1.

        class ReflectThread extends Thread {
                long patience = 20000000000L; /* 20 seconds in nanoseconds */

                ReflectThread() {
                        super("Reflecting Thread");
                }

                public void run() {
                        long startTime = System.nanoTime();
                        try {
                                while (((System.nanoTime() - startTime) < patience)) {
                                        rc018_testRedefAndReflect_O1.class.getDeclaredFields();
                                }
                        } catch (SecurityException e) {
                                System.out.println("getDeclaredFields hit an security exception");
                        }
                }

        }

The test tries to read the class's fields over and over again and doesn't even check the result.

@IBMJimmyk
Copy link
Contributor

From the vlog:

! (warm) java/util/concurrent/ConcurrentHashMap.casTabAt([Ljava/util/concurrent/ConcurrentHashMap$Node;ILjava/util/concurrent/ConcurrentHashMap$Node;Ljava/util/concurrent/ConcurrentHashMap$Node;)Z Q_SZ=1 Q_SZI=1 QW=18 j9m=0000000000120220 time=5190us compilationRecoverableTrampolineFailure memLimit=262144 KB compThreadID=0
+ (warm) java/util/concurrent/ConcurrentHashMap.casTabAt([Ljava/util/concurrent/ConcurrentHashMap$Node;ILjava/util/concurrent/ConcurrentHashMap$Node;Ljava/util/concurrent/ConcurrentHashMap$Node;)Z @ 00007FFF720300F0-00007FFF72030240 OrdinaryMethod - Q_SZ=1 Q_SZI=1 QW=18 j9m=0000000000120220 bcsz=21 sync FSD OSR remote compThreadID=0 CpuLoad=321%(80%avg) JvmCpu=300%
#INFO:  zzz replaceTrampoline - method: 0000000000120220, entry: 0000000000000000

It looks like the ConcurrentHashMap.casTabAt compilation first fails due to compilationRecoverableTrampolineFailure and then is successful. It then calls replaceTrampoline to fix the trampolines and it doesn't work. casTabAt can not be found in the _resolvedMethodHT.

@zl-wang
Copy link
Contributor

zl-wang commented Jul 26, 2024

  1. find out how many versions this method has been compiled;
  2. from core file, find out the call-site where replaceTrampoline is triggered and crashed;
  3. which version is it calling right now and whether that is going through trampoline;

these may lead us to question why it is not in the calling codeCache's hashTable

@zl-wang
Copy link
Contributor

zl-wang commented Jul 26, 2024

HashTable entry (unresolved or resolved) should be established by caller(s) when the caller(s) are compiled, rather than by itself. Trampoline is created at compile-time or runtime, depending on the callee has been compiled at that time or not (of course, whether it is reachable by bl instruction).

@zl-wang
Copy link
Contributor

zl-wang commented Jul 26, 2024

not exactly sure what compilationRecoverableTrampolineFailure means. find that out, and see if we can figure something out.

@zl-wang
Copy link
Contributor

zl-wang commented Jul 26, 2024

HashTable entry could be created at runtime as well (from unresolved becoming resolved) ... just FYI.

@IBMJimmyk
Copy link
Contributor

I made a build with more debug info a got a new corefile. The failure is still similar.

#INFO:  zzz replaceTrampoline - _resolvedMethodHT: 00007FFFB80D8770, method: 0000000000120840, entry: 0000000000000000

It's still a null entry after lookup into _resolvedMethodHT while in replaceTrampoline.
Method 0x120840 is ConcurrentHashMap.helpTransfer. It has two failed attempts at compilation and one successful one.

! (warm) java/util/concurrent/ConcurrentHashMap.helpTransfer([Ljava/util/concurrent/ConcurrentHashMap$Node;Ljava/util/concurrent/ConcurrentHashMap$Node;)[Ljava/util/concurrent/ConcurrentHashMap$Node; Q_SZ=0 Q_SZI=0 QW=12 j9m=0000000000120840 time=8336us compilationInterrupted memLimit=262144 KB compThreadID=0
! (warm) java/util/concurrent/ConcurrentHashMap.helpTransfer([Ljava/util/concurrent/ConcurrentHashMap$Node;Ljava/util/concurrent/ConcurrentHashMap$Node;)[Ljava/util/concurrent/ConcurrentHashMap$Node; Q_SZ=0 Q_SZI=0 QW=12 j9m=0000000000120840 time=11933us compilationRecoverableTrampolineFailure memLimit=262144 KB compThreadID=0
+ (warm) java/util/concurrent/ConcurrentHashMap.helpTransfer([Ljava/util/concurrent/ConcurrentHashMap$Node;Ljava/util/concurrent/ConcurrentHashMap$Node;)[Ljava/util/concurrent/ConcurrentHashMap$Node; @ 00007FFF9DAB00F4-00007FFF9DAB040C OrdinaryMethod - Q_SZ=0 Q_SZI=0 QW=12 j9m=0000000000120840 bcsz=124 sync FSD OSR remote compThreadID=0 CpuLoad=350%(87%avg) JvmCpu=316%

method, callSite and newStartPC are passed into patchCallPoint:
https://github.com/eclipse/omr/blob/1a724b50baacb7ad7173c75fe9385e3439f5d4e7/compiler/runtime/OMRCodeCache.cpp#L731-L734
Their values are:

(kca) (0x2806e0)/llx
%8 = 0x00000000002806e0: 0000000000120840 //method
(kca) (0x2806e0+8)/llx
%9 = 0x00000000002806e8: 00007fff9b8b684c //callSite
(kca) (0x2806e0+0x10)/llx
%1 = 0x00000000002806f0: 00007fff9dab00f4 //newStartPC

method matches the 0x120840 from the vlog output.

callSite is a location inside ConcurrentHashMap.putVal. It's a call to ConcurrentHashMap.helpTransfer.

0x7fff9b8b6824 {java/.../ConcurrentHashMap.putVal} +861               00009e60 ori       r30, r4, 0 <<< +474
0x7fff9b8b6828 {java/.../ConcurrentHashMap.putVal} +862               0000c462 ori       r4, r22, 0
0x7fff9b8b682c {java/.../ConcurrentHashMap.putVal} +863               00007661 ori       r22, r11, 0
0x7fff9b8b6830 {java/.../ConcurrentHashMap.putVal} +864               0000b960 ori       r25, r5, 0
0x7fff9b8b6834 {java/.../ConcurrentHashMap.putVal} +865               c001cef8 std       r6, 0x1c0(r14)
0x7fff9b8b6838 {java/.../ConcurrentHashMap.putVal} +866               a801eef8 std       r7, 0x1a8(r14)
0x7fff9b8b683c {java/.../ConcurrentHashMap.putVal} +867               b0010ef9 std       r8, 0x1b0(r14)
0x7fff9b8b6840 {java/.../ConcurrentHashMap.putVal} +868               c8012ef9 std       r9, 0x1c8(r14)
0x7fff9b8b6844 {java/.../ConcurrentHashMap.putVal} +869               08026ee8 ld        r3, 0x208(r14)
0x7fff9b8b6848 {java/.../ConcurrentHashMap.putVal} +870               0000e562 ori       r5, r23, 0
0x7fff9b8b684c {java/.../ConcurrentHashMap.putVal} +871  -1:120       c90e0048 bl        0x7fff9b8b7714 U>> +1817 Snippet-> Trampoline {libj9jit29.so}{endproc._interpreterSyncGPR3StaticGlue} +0    // invokespecial 56 {java/.../ConcurrentHashMap.helpTransfer([Ljava/.../shMcurrentHashMap$Node;Ljava/.../ConcurrentHashMap$Node;)[Ljava/.../ConcurrentHashMap$Node;}

That trampoline goes to the interpreter:

(kca) (0x7fff9b8b7714)/5i
0x7fff9b8b7714 {java/.../ConcurrentHashMap.putVal} +1817               10006ef8 std       r3, 0x10(r14) <<< +871
0x7fff9b8b7718 {java/.../ConcurrentHashMap.putVal} +1818               08008ef8 std       r4, 8(r14)
0x7fff9b8b771c {java/.../ConcurrentHashMap.putVal} +1819               0000aef8 std       r5, 0(r14)
0x7fff9b8b7720 {java/.../ConcurrentHashMap.putVal} +1820               81811f48 bl        0x7fff9baaf8a0 Trampoline {libj9jit29.so}{endproc._interpreterSyncGPR3StaticGlue} +0
0x7fff9b8b7724 {java/.../ConcurrentHashMap.putVal} +1821               50688b9b stb       r28, 0x6850(r11) -> {java/.../ConcurrentHashMap.putVal} +872

That's because from the vlog, ConcurrentHashMap.putVal has one compilation and was compiled before ConcurrentHashMap.helpTransfer.

+ (warm) java/util/concurrent/ConcurrentHashMap.putVal(Ljava/lang/Object;Ljava/lang/Object;Z)Ljava/lang/Object; @ 00007FFF9B8B5AB0-00007FFF9B8B7908 OrdinaryMethod - Q_SZ=0 Q_SZI=0 QW=12 j9m=00000000001203C0 bcsz=432 sync FSD OSR remote compThreadID=0 CpuLoad=104%(26%avg) JvmCpu=60%
...
! (warm) java/util/concurrent/ConcurrentHashMap.helpTransfer([Ljava/util/concurrent/ConcurrentHashMap$Node;Ljava/util/concurrent/ConcurrentHashMap$Node;)[Ljava/util/concurrent/ConcurrentHashMap$Node; Q_SZ=0 Q_SZI=0 QW=12 j9m=0000000000120840 time=8336us compilationInterrupted memLimit=262144 KB compThreadID=0
! (warm) java/util/concurrent/ConcurrentHashMap.helpTransfer([Ljava/util/concurrent/ConcurrentHashMap$Node;Ljava/util/concurrent/ConcurrentHashMap$Node;)[Ljava/util/concurrent/ConcurrentHashMap$Node; Q_SZ=0 Q_SZI=0 QW=12 j9m=0000000000120840 time=11933us compilationRecoverableTrampolineFailure memLimit=262144 KB compThreadID=0
+ (warm) java/util/concurrent/ConcurrentHashMap.helpTransfer([Ljava/util/concurrent/ConcurrentHashMap$Node;Ljava/util/concurrent/ConcurrentHashMap$Node;)[Ljava/util/concurrent/ConcurrentHashMap$Node; @ 00007FFF9DAB00F4-00007FFF9DAB040C OrdinaryMethod - Q_SZ=0 Q_SZI=0 QW=12 j9m=0000000000120840 bcsz=124 sync FSD OSR remote compThreadID=0 CpuLoad=350%(87%avg) JvmCpu=316%

newStartPC is the start of the newly compiled ConcurrentHashMap.helpTransfer. It matches the start address in the vlog output as well.

(kca) (0x7fff9dab00f4)/8i
0x7fff9dab00f4 {java/.../ConcurrentHashMap.helpTransfer} +0                 10006ee8 ld        r3, 0x10(r14)
0x7fff9dab00f8 {java/.../ConcurrentHashMap.helpTransfer} +1                 08008ee8 ld        r4, 8(r14)
0x7fff9dab00fc {java/.../ConcurrentHashMap.helpTransfer} +2                 0000aee8 ld        r5, 0(r14)
0x7fff9dab0100 {java/.../ConcurrentHashMap.helpTransfer} +3                 10006ef8 std       r3, 0x10(r14)
0x7fff9dab0104 {java/.../ConcurrentHashMap.helpTransfer} +4                 08008ef8 std       r4, 8(r14)
0x7fff9dab0108 {java/.../ConcurrentHashMap.helpTransfer} +5                 0000aef8 std       r5, 0(r14)
0x7fff9dab010c {java/.../ConcurrentHashMap.helpTransfer} +6                 50006fe9 ld        r11, 0x50(r15) J9VMThread.stackOverflowMark // stack or async check
0x7fff9dab0110 {java/.../ConcurrentHashMap.helpTransfer} +7                 60ffce39 addi      r14, r14, -0xa0

@zl-wang
Copy link
Contributor

zl-wang commented Jul 28, 2024

Good info. Missing HT entry happened in the codeCache where putVal() was compiled into. From its current status, we are not sure yet whether the call (to helpTransfer) was originally unresolved (to begin with) or only interpreted. Look down the snippet whether there were cp/cpindex in it to tell. That in turn indicated whether the resolved HT entry (if it had existed) should be established (created) at compile-time (putVal) or runtime (during picBuilder resolution .. from unresolved HT to resolved HT), although I highly suspected it was resolved to begin with (from same class). If that is the case, i cannot think of a scenario why resolved HT entry is not created at compile-time (because we knew we are calling a static method), such that we cannot look it up (also making sure the current codeCache searched is the putVal one). There is no possibility of runtime error causing it. Then, the only weird thing to me is: both compilations were marked as sync FSD OSR remote ... what is the implication?

@zl-wang
Copy link
Contributor

zl-wang commented Jul 28, 2024

maybe that is the problem ... is remote meaning JIT-server? maybe HT entry missed in that case somehow.

@IBMJimmyk
Copy link
Contributor

From the same vlog:

#INFO:  zzz reserveResolvedTrampoline - _resolvedMethodHT: 00007FFFB80D8770, method: 0000000000120840, entry: 0000000000000000
#INFO:  zzz addResolvedMethod - _resolvedMethodHT: 00007FFFB80D8770, method: 0000000000120840, entry: 00007FFFB80D9A70
+ (warm) java/util/concurrent/ConcurrentHashMap.putVal(Ljava/lang/Object;Ljava/lang/Object;Z)Ljava/lang/Object; @ 00007FFF9B8B5AB0-00007FFF9B8B7908 OrdinaryMethod - Q_SZ=0 Q_SZI=0 QW=12 j9m=00000000001203C0 bcsz=432 sync FSD OSR remote compThreadID=0 CpuLoad=104%(26%avg) JvmCpu=60%
...
#INFO:  zzz resetTrampolines - _resolvedMethodHT: 00007FFFB80D8770
...
! (warm) java/util/concurrent/ConcurrentHashMap.helpTransfer([Ljava/util/concurrent/ConcurrentHashMap$Node;Ljava/util/concurrent/ConcurrentHashMap$Node;)[Ljava/util/concurrent/ConcurrentHashMap$Node; Q_SZ=0 Q_SZI=0 QW=12 j9m=0000000000120840 time=8336us compilationInterrupted memLimit=262144 KB compThreadID=0
! (warm) java/util/concurrent/ConcurrentHashMap.helpTransfer([Ljava/util/concurrent/ConcurrentHashMap$Node;Ljava/util/concurrent/ConcurrentHashMap$Node;)[Ljava/util/concurrent/ConcurrentHashMap$Node; Q_SZ=0 Q_SZI=0 QW=12 j9m=0000000000120840 time=11933us compilationRecoverableTrampolineFailure memLimit=262144 KB compThreadID=0
+ (warm) java/util/concurrent/ConcurrentHashMap.helpTransfer([Ljava/util/concurrent/ConcurrentHashMap$Node;Ljava/util/concurrent/ConcurrentHashMap$Node;)[Ljava/util/concurrent/ConcurrentHashMap$Node; @ 00007FFF9DAB00F4-00007FFF9DAB040C OrdinaryMethod - Q_SZ=0 Q_SZI=0 QW=12 j9m=0000000000120840 bcsz=124 sync FSD OSR remote compThreadID=0 CpuLoad=350%(87%avg) JvmCpu=316%
#INFO:  zzz patchCallPoint - _resolvedMethodHT: 00007FFFB80D8770, method: 0000000000120840, entry: 0000000000000000
#INFO:  zzz replaceTrampoline - _resolvedMethodHT: 00007FFFB80D8770, method: 0000000000120840, entry: 0000000000000000

https://github.com/IBMJimmyk/omr/blob/06aeef119cfbdcf7542d916f8935beed4dd50347/compiler/runtime/OMRCodeCache.cpp#L562-L577
reserveResolvedTrampoline is called with method 0x120840, the entry in _resolvedMethodHT for this method is null so addResolvedMethod is called to add the entry.
https://github.com/IBMJimmyk/omr/blob/06aeef119cfbdcf7542d916f8935beed4dd50347/compiler/runtime/OMRCodeCache.cpp#L958-L969
This happens around the time ConcurrentHashMap.putVal is compiled.

Sometime later, resetTrampolines is called on _resolvedMethodHT:
https://github.com/IBMJimmyk/openj9/blob/906b86e12a14cab6b697d9306bb8cb7c2facb3f7/runtime/compiler/runtime/J9CodeCache.cpp#L739-L742
I believe this clears out all of the entries in _resolvedMethodHT. I'm not sure why resetTrampolines is called.

A bit later on, ConcurrentHashMap.helpTransfer finishes compiling and needs to patch the trampoline.
patchCallPoint is called but the entry is null for method 0x120840 in _resolvedMethodHT. This leads to the crash.

So my question is, is something supposed to create a new _resolvedMethodHT entry for method 0x120840 or is the call to resetTrampolines unexpected?

As for the remote tag, I'm not sure what it means at this moment. I'll try asking people more familiar with JIT server if they have an idea.

@IBMJimmyk
Copy link
Contributor

I didn't realize that I had the JITAAS flag set on my grinders until now. It is supposed to fail even without that flag so I started a new grinder without it. The new grinder has a failed result already and looks to be a similar failure. I'll try looking at this one to see if anything is different.

@zl-wang
Copy link
Contributor

zl-wang commented Jul 28, 2024

JITAAS means JIT-Server?
anyway, you should look into when/under what conditions resetTrampolines is called. It sounds to me that is the root issue of missing HT entry. I am not familiar with it specifically, but I had impression: when the current compilation found out the designated codeCache had no capacity to house the JIT-code anymore, it needs to switch to another codeCache. Then, the current compilation's trampoline reservations (HT entries) are thrown away and re-reserve them against the new codeCache. Not sure if that amounts to resetTrampolines. I don't think resetTrampolines existed when I did these things. It must be added later for some reason.

@IBMJimmyk
Copy link
Contributor

Yes it does. The JITAAS option causes the -XX:+UseJITServer option to be set when running the test.

Looking into resetTrampolines sounds like a good idea.

@IBMJimmyk
Copy link
Contributor

The rc018 repeatedly defines classes. As a result it goes through jitClassesRedefined:

void jitClassesRedefined(J9VMThread * currentThread, UDATA classCount, J9JITRedefinedClass *classList, UDATA extensionsUsed)

jitClassesRedefined checks for FSD here:

if (!TR::Options::getCmdLineOptions()->getOption(TR_FullSpeedDebug))

FSD is enabled so all the trampolines gets deleted here:

TR::CodeCacheManager::instance()->onFSDDecompile();

I added a debug line to print the flag to the verbose log to confirm that the FSD flag is set.

If FSD was disabled, I think it would have gone here:

TR::CodeCacheManager::instance()->onClassRedefinition(reinterpret_cast<TR_OpaqueMethodBlock *>(staleMethod),
reinterpret_cast<TR_OpaqueMethodBlock *>(freshMethod));

onClassRedefinition removes the old method from _resolvedMethodHT and adds in the new one.

With all this, I am now able to reproduce the problem locally with a small test:

import java.io.*;
import java.lang.instrument.*;

public class HCRTrampoline {

    byte[] classByteArray;

    public static void main (String[] args) {
        HCRTrampoline t = new HCRTrampoline();
        try {
           t.runTest();
        } catch (Exception e) {
            System.out.println("***FAILED*** - Unable to find class");
        }
        System.out.println("Done");
    }

    HCRTrampoline() {
        try {
            File classFile = new File("InitObject.class");
            FileInputStream inputStream = new FileInputStream(classFile);

            classByteArray = new byte[(int)classFile.length()];
            inputStream.read(classByteArray);
        } catch (Exception e) {
            System.out.println("***FAILED*** - Unable to set up testing.");
            System.exit(0);
        }
    }

    private void runTest() throws Exception {
        for (int i = 0; i < 15; i++) {
            callA(i);
        }
        RedefAgent.instru.redefineClasses(new ClassDefinition(InitObject.class, classByteArray));
        for (int i = 0; i < 150; i++) {
            callA(i);
        }
    }

    private int callA(int a) {
        if (a >= 5) {
            a = callB(a);
        }
        return a;
    }

    private int callB(int b) {
        b = b + 1;
        return b;
    }
}

Run with:

java -javaagent:RedefAgent.jar -Xjit:"stressTrampolines,forceFullSpeedDebug,limit={*HCRTrampoline*},optlevel=warm,count=10,disableinlining,disableasynccompilation,disableDynamicLoopTransfer,compilationThreads=1,verbose,vlog=trace/trampoline.vlog" -Xshareclasses:none -Xnoaot HCRTrampoline

The test runs callA until a compilation is triggered for callA.
callB is intentionally called a fewer number of times since I don't want it to be compiled yet.
Inlining is disabled since I don't want callB to be inlined.
stressTrampolines forces the creation of a trampoline.
forceFullSpeedDebug forces FSD mode.

After callA is compiled, redefineClasses is called to redefine InitObject.class with itself. I just wanted to trigger a method recompilation.

callA is called a few more time since I now want callB to get compiled. This tries to patch the trampoline in callA and crashes.

The verbose log looks like this:

#INFO:  zzz reserveResolvedTrampoline - _resolvedMethodHT: 0000758AC82063C0, method: 00000000002A4848, entry: 0000000000000000
#INFO:  zzz addResolvedMethod - _resolvedMethodHT: 0000758AC82063C0, method: 00000000002A4848, entry: 0000758AC8206D10
#INFO:  zzz getResolvedTrampoline current - _resolvedMethodHT: 0000758AC82063C0, method: 00000000002A4848, compMethod: HCRTrampoline.callA(I)I
+ (warm) HCRTrampoline.callA(I)I @ 0000758A9C140078-0000758A9C14015C OrdinaryMethod - Q_SZ=1 Q_SZI=1 QW=12 j9m=00000000002A4828 bcsz=13 sync FSD OSR compThreadID=0 CpuLoad=7%(0%avg) JvmCpu=0%
#INFO:  zzz jitClassesRedefined call resetCodeCache - FSD flag: 1
#INFO:  zzz onFSDDecompile call resetCodeCache - _resolvedMethodHT: 0000758AC82063C0
#INFO:  zzz resetTrampolines - _resolvedMethodHT: 0000758AC82063C0
+ (warm) HCRTrampoline.callB(I)I @ 0000758A9C1401F8-0000758A9C140240 OrdinaryMethod - Q_SZ=0 Q_SZI=0 QW=2 j9m=00000000002A4848 bcsz=6 sync FSD OSR compThreadID=0 CpuLoad=7%(0%avg) JvmCpu=0%
#INFO:  zzz patchCallPoint - _resolvedMethodHT: 0000758AC82063C0, method: 00000000002A4848, entry: 0000000000000000
#INFO:  zzz replaceTrampoline - _resolvedMethodHT: 0000758AC82063C0, method: 00000000002A4848, entry: 0000000000000000

The compilation of callA gets method 2A4848(callB) added to _resolvedMethodHT.
The class redefinition under FSD causes the resetTrampolines call to empty out _resolvedMethodHT.
When replaceTrampoline tries to get entry from method 2A4848, null is returned.
This is a segfault inside replaceTrampoline.

FSD mode throwing out the trampolines seems to be intentional.
So the question is, how are things supposed to be handled from there? Is something else supposed to fill in _resolvedMethodHT so the trampoline can be fixed or is the existing compilation of callA supposed to be thrown out?

@zl-wang
Copy link
Contributor

zl-wang commented Jul 29, 2024

throwing out all trampoline reservations without throwing out all reservation creators as well will just not work. it is wrong to do that to begin with. my impression with FSD decompile might be out of dated, but i remembered when decompile is triggered, everything should be thrown out (essentially all codecache(s) are cleared ... starting as if the run is about to begin from JIT perspective). @tajila and @gacholio please chime in.

@gacholio
Copy link
Contributor

gacholio commented Jul 30, 2024

From the VM side, FSD HCR causes all running methods to be tagged for decompilation and all compiled methods are marked for retranslation. I assume the discard/reclaim of the code cache is done by jitClassesRedefined.

@IBMJimmyk
Copy link
Contributor

I was asking @jdmpapin about this and he pointed me towards these lines of code:

openj9/runtime/util/hshelp.c

Lines 4226 to 4229 in bd018c0

if (extensionsEnabled) {
/* Toss the whole code cache */
jitConfig->jitHotswapOccurred(currentThread);
}

extensionsEnabled used to always be true if FSD is enabled. There's even a comment above the method that mentions this. I believe this is no longer the case after the switch to disabling Extended HCR by default. Enabling Extended HCR (-XX:+EnableExtendedHCR) causes the problem to stop occurring which in line with what happened with the original test.

As a quick test, I tried adding a check for J9_FSD_ENABLED(vm) so the code cache is also now tossed if the FSD option is set and this also makes the problem stop.

In both cases, callB no longer attempts to patch the trampoline in callA.

@zl-wang
Copy link
Contributor

zl-wang commented Jul 30, 2024

Fixing this cross-platform issue is not a problem anymore ...
The remaining question beyond this issue itself is: should we try to clear/recycle the codecache(s)? currently, it was obviously intended to do, but it wasn't done (only dealing with trampolines now). sure, if we want to plug this hole, another issue is needed.

@IBMJimmyk
Copy link
Contributor

I opened a PR for the fix:
#19949

jitClassRedefineEvent now checks for both extensions or FSD and dump the code cache is both cases.

I ran a Grinder with my change and all 5 runs passes:
https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder/42515/

I was also asked why extended HCR is being disabled by default in the first place since that's what seems to have caused the bug. I believe it traces back to this: #17857

It was one step in the process of trying to remove extended HCR altogether.

@jdmpapin
Copy link
Contributor

jdmpapin commented Jul 31, 2024

There are maintenance benefits to removing extended HCR, (edit: some of) which I described in that issue, but I think the more proximal reason for the change is that with extended HCR the static fields will move on redefinition, which interferes with an ongoing effort to simplify unsafe accesses

@hzongaro
Copy link
Member

@IBMJimmyk wrote:

I opened a PR for the fix:
#19949

Does that resolve the problem? Can this issue be closed or is there still some problem remaining?

@IBMJimmyk
Copy link
Contributor

This can be closed. The PR fixes the problem. I was not able to reproduce the problem after the fix and have not heard of any other issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants