-
Notifications
You must be signed in to change notification settings - Fork 737
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cmdLineTester_jvmtitests_hcr_OSRG_nongold_0 rc018 ppc64le segfault in internal tests #19813
Comments
@pshipton @tajila The earliest I've been able to find these failures is the nightly test run of June 27, which tested openj9 commits from June 26 and earlier, so I think this may be related to #19710 and maybe #19778. I'm not sure if the test options need to be adjusted as in #19792 or if it's something else. |
The version info for the early failure in #19813 (comment) is:
|
Ill take a look |
Grinder with option added, https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder/41855/ - passed If it passes that will confirm its related to the HCR changes. However, the crash is unexpected regardless of options specified so that will need investigation. |
Related to eclipse-openj9#19710 Issue eclipse-openj9#19765 Issue eclipse-openj9#19813 Signed-off-by: Peter Shipton <[email protected]>
Related to eclipse-openj9#19710 Issue eclipse-openj9#19765 Issue eclipse-openj9#19813 Signed-off-by: Peter Shipton <[email protected]>
Related to eclipse-openj9#19710 Issue eclipse-openj9#19765 Issue eclipse-openj9#19813 Signed-off-by: Peter Shipton <[email protected]>
@cjjdespres I still think the crash needs to be investigated. The extended HCR options may have just exposed an issue. Disabling extended HCR prevents redefinition from changing the class shape, any attempt to do so should result in a JVMTI error. The fact that it is crashing means an assumption was broken. I looked at the stacktrace and I couldn't find anything suspicious, but thats probably because Im not very familiar with that code. Do you know why it is crashing? I think we still need to answer this question. |
I agree that this should be investigated. I think this option influences |
@tajila could you briefly describe the difference between previous HCR and extended HCR? in particular, method recompilation and its implications. from the stack back-trace, it crashed trying to replace a trampoline (could be many different reasons). presumably, this is across platform, but it happened to be on ppc64le due to certain recompilations by-chance. |
@IBMJimmyk i will ask jimmy to take a further look. |
The basic difference is that exended HCR allows one to change the class shape (ie. add/remove fields and methods). This means we dont redefine the J9Class in place, as we do with fast HCR. We create a new J9Class, then we need to do a heap walk to replace all object headers that are affected, update class heirarchies, array component type, etc. This also means that things like the vtable/jitvtable will have a new address, j9methods will also have a new address, etc. |
WRT |
i.e. as part of the HCR hooked actions, every such-typed existing object is re-constructed? |
Yes, the J9Class in the header of affected object intances are updated to the new J9Class in the extended HCR case. |
I meant more than that. since fields can be added or removed, every such object needs to be reconstructed literally. size can be different. it sounds like a newObject() being required. without new size, is the old-object allowed to continue running? |
Sorry, you can add/remove static fields, but not instance fields. So only the instance header needs to be updated. |
I've been trying to take a look at this problem. rc018 takes the class Attempts to recreate the problem on a local system have been unsuccessful but it fails very regularly (at least 80+%) via Grinders. So I've been looking at the artifacts from one of my grinders ( In this, the problem shows up as a hang.
So I tried looking at what threads were currently running and this one stuck out:
What this thread does is create objects for 20 seconds then stops:
So whatever is causing the hang causes this thread to stop as well even though it isn't directly involved with the |
I ended up jumping to this this test failure since it was a segfault instead of a crash: https://github.com/eclipse/omr/blob/d18121d17c5750b62d057c61a84dd72dd9584e90/compiler/runtime/OMRCodeCache.cpp#L629-L638 These symptoms look very similar to: #12120 (comment) Next step is to try adding some debug prints to the locations that are expected to update This bug seems like it connects a bad |
I'm still trying to figure out what is going on. One thing I discovered is when I run the test locally, the path through On failing grinder run with my new trace messages I see this:
It interacts with this lookup:
This means at the start of In my local run, that line is never printed out so there are no calls to Another thing is, on the failing grinder runs, going to So my next step is to figure out why |
method 0120220 is one of the methods in the replaced or replacing classes? running into |
It doesn't seem to be related to the class being replaced at all. The class being redefined, The test
The test tries to read the class's fields over and over again and doesn't even check the result. |
From the vlog:
It looks like the |
these may lead us to question why it is not in the calling codeCache's hashTable |
HashTable entry (unresolved or resolved) should be established by caller(s) when the caller(s) are compiled, rather than by itself. Trampoline is created at compile-time or runtime, depending on the callee has been compiled at that time or not (of course, whether it is reachable by bl instruction). |
not exactly sure what |
HashTable entry could be created at runtime as well (from unresolved becoming resolved) ... just FYI. |
I made a build with more debug info a got a new corefile. The failure is still similar.
It's still a null entry after lookup into
That trampoline goes to the interpreter:
That's because from the vlog,
|
Good info. Missing HT entry happened in the codeCache where |
maybe that is the problem ... is |
From the same vlog:
https://github.com/IBMJimmyk/omr/blob/06aeef119cfbdcf7542d916f8935beed4dd50347/compiler/runtime/OMRCodeCache.cpp#L562-L577 Sometime later, A bit later on, So my question is, is something supposed to create a new As for the |
I didn't realize that I had the |
JITAAS means JIT-Server? |
Yes it does. The Looking into |
The rc018 repeatedly defines classes. As a result it goes through
FSD is enabled so all the trampolines gets deleted here:
I added a debug line to print the flag to the verbose log to confirm that the FSD flag is set. If FSD was disabled, I think it would have gone here: openj9/runtime/compiler/control/HookedByTheJit.cpp Lines 2482 to 2483 in 48f557e
onClassRedefinition removes the old method from _resolvedMethodHT and adds in the new one.
With all this, I am now able to reproduce the problem locally with a small test:
Run with:
The test runs After
The verbose log looks like this:
The compilation of FSD mode throwing out the trampolines seems to be intentional. |
throwing out all trampoline reservations without throwing out all reservation creators as well will just not work. it is wrong to do that to begin with. my impression with FSD decompile might be out of dated, but i remembered when decompile is triggered, everything should be thrown out (essentially all codecache(s) are cleared ... starting as if the run is about to begin from JIT perspective). @tajila and @gacholio please chime in. |
From the VM side, FSD HCR causes all running methods to be tagged for decompilation and all compiled methods are marked for retranslation. I assume the discard/reclaim of the code cache is done by |
I was asking @jdmpapin about this and he pointed me towards these lines of code: Lines 4226 to 4229 in bd018c0
As a quick test, I tried adding a check for In both cases, |
Fixing this cross-platform issue is not a problem anymore ... |
I opened a PR for the fix:
I ran a Grinder with my change and all 5 runs passes: I was also asked why extended HCR is being disabled by default in the first place since that's what seems to have caused the bug. I believe it traces back to this: #17857 It was one step in the process of trying to remove extended HCR altogether. |
There are maintenance benefits to removing extended HCR, (edit: some of) which I described in that issue, but I think the more proximal reason for the change is that with extended HCR the static fields will move on redefinition, which interferes with an ongoing effort to simplify unsafe accesses |
@IBMJimmyk wrote:
Does that resolve the problem? Can this issue be closed or is there still some problem remaining? |
This can be closed. The PR fixes the problem. I was not able to reproduce the problem after the fix and have not heard of any other issues. |
Failure link
A sample, as it's been failing very frequently in the JITServer nightly tests:
https://hyc-runtimes-jenkins.swg-devops.com/job/Test_openjdk17_j9_extended.functional_ppc64le_linux_jit_Personal/991/
https://hyc-runtimes-jenkins.swg-devops.com/job/Test_openjdk11_j9_extended.functional_ppc64le_linux_jit_Personal/1078/
The failure in
cmdLineTester_jvmtitests_hcr_OSRG_nongold_SE80_0
on ppc64le JDK8 in this test from July 4 also fails in rc018 in the same way.Optional info
I initially noticed this in the JITServer nightly tests, but a 50x grinder without JITAAS had a 43/50 failure rate, so this is not JITServer-specific. I have only seen this failure on
ppc64le
so far.Failure output (captured from console output)
The text was updated successfully, but these errors were encountered: