Description
We have recently had multiple reports of crashes on Windows 11. It looks like a recent Windows update changed behaviour in a way that exposes the problem, but I think the fundamental issue is in our code.
The crashes occur in ntdll!_chkstk, which attempts to read from every 4 KB page, starting at the value held for the stack in the thread's TEB and finishing at the current value of rsp less a value passed in rax. From what I can gather, the function serves two purposes. If the new value of (rsp-rax) is still within the thread's permitted total stack size, a failed read (from hitting a guard page) triggers an increase in the currently allocated stack size. If (rsp-rax) is outside the permitted stack area, the program is terminated with error code c00000fd (STATUS_STACK_OVERFLOW), which can be seen either in the Windows event log or by loading the core dump into windbg. In our case the value in rsp is the Java stack pointer, which is typically very far from the C stack and is therefore guaranteed to cause a crash if tested by _chkstk.
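The probing behaviour described above can be illustrated with a small model. This is only a sketch of my understanding, not the actual ntdll implementation: `model_stack`, `model_chkstk`, and the result names are hypothetical, the "committed limit" stands in for the stack value held in the TEB, and real guard-page handling is reduced to simply lowering that limit one page at a time.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 0x1000u /* 4 KB pages, as described above */

/* Hypothetical outcome codes for the model; not Windows status values. */
typedef enum {
    PROBE_OK,             /* target page is already committed, nothing to do */
    PROBE_GREW_STACK,     /* probes hit guard pages and the stack was extended */
    PROBE_STACK_OVERFLOW  /* probes left the permitted stack area (c00000fd) */
} probe_result;

/* Simplified view of one thread's C stack (assumption: both page aligned). */
typedef struct {
    uint64_t committed_limit; /* lowest committed address, as held in the TEB */
    uint64_t reserve_base;    /* lowest address the stack is permitted to reach */
} model_stack;

/* Model of the probe loop: walk down one page at a time from the committed
 * limit towards the page containing (rsp - rax). Assumes rax <= rsp. */
probe_result model_chkstk(model_stack *s, uint64_t rsp, uint64_t rax)
{
    assert((s->committed_limit & (MODEL_PAGE_SIZE - 1)) == 0);
    assert((s->reserve_base & (MODEL_PAGE_SIZE - 1)) == 0);

    uint64_t target = (rsp - rax) & ~(uint64_t)(MODEL_PAGE_SIZE - 1);
    if (target >= s->committed_limit) {
        return PROBE_OK;
    }
    for (uint64_t page = s->committed_limit - MODEL_PAGE_SIZE; ; page -= MODEL_PAGE_SIZE) {
        if (page < s->reserve_base) {
            /* Probing outside the permitted stack area: the process dies. */
            return PROBE_STACK_OVERFLOW;
        }
        /* Touching this page hits the guard page and commits it. */
        s->committed_limit = page;
        if (page == target) {
            return PROBE_GREW_STACK;
        }
    }
}
```

In this model, an rsp that belongs to a Java stack located below the C stack's reserved range always ends in `PROBE_STACK_OVERFLOW`, because the page walk leaves the permitted area long before it reaches the target page. For example, with a C stack committed down to 0x00fa0000 and reserved down to 0x00f00000, calling `model_chkstk(&s, 0x0037ea00, 0x100)` reports an overflow.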
I'm currently working on three cases with this symptom; two are now well characterized and the third is waiting for data.
For the first case the call stack is:
# Child-SP RetAddr Call Site
00 00000000`00fac508 00007fff`84f162dd ntdll!_chkstk+0x37
01 00000000`00fac520 00007fff`84ec2eda ntdll!RtlpWalkFrameChain+0x13d
02 00000000`00fac670 00007fff`84ec2e52 ntdll!RtlWalkFrameChain+0x2a
03 00000000`00fac6a0 00007fff`84ee49ea ntdll!RtlCaptureStackBackTrace+0x42
04 00000000`00fac6d0 00007fff`84ed9c22 ntdll!RtlStdLogStackTrace+0x4a
05 00000000`00fac810 00007fff`84ed7c79 ntdll!RtlpAddDebugInfoToCriticalSection+0x132
06 00000000`00fac880 00007fff`1582485f ntdll!RtlInitializeCriticalSection+0x69
07 (Inline Function) --------`-------- j9thr29!monitor_allocate+0x6c [c:\temp\bld_100666\bld_win_x86-64_cmprssptrs\omr\thread\common\omrthread.c @ 3576]
08 00000000`00fac8f0 00007fff`15822528 j9thr29!monitor_alloc_and_init+0x9f [c:\temp\bld_100666\bld_win_x86-64_cmprssptrs\omr\thread\common\omrthread.c @ 3789]
09 00000000`00fac930 00007ffe`ee88b787 j9thr29!omrthread_monitor_init_with_name+0x18 [c:\temp\bld_100666\bld_win_x86-64_cmprssptrs\omr\thread\common\omrthread.c @ 3413]
0a 00000000`00fac970 00007ffe`ee8372e6 j9vm29!monitorTableAt+0x247 [c:\temp\bld_100666\bld_win_x86-64_cmprssptrs\vm\montable.c @ 289]
0b (Inline Function) --------`-------- j9vm29!VM_ObjectMonitor::inlineGetLockAddress+0x2a [c:\temp\bld_100666\bld_win_x86-64_cmprssptrs\oti\ObjectMonitor.hpp @ 82]
0c 00000000`00faca90 00007ffe`e560ec0d j9vm29!objectMonitorEnterNonBlocking+0x46 [c:\temp\bld_100666\bld_win_x86-64_cmprssptrs\vm\ObjectMonitor.cpp @ 334]
0d (Inline Function) --------`-------- j9jit29!fast_jitMonitorEnterImpl+0x13 [c:\temp\bld_100666\bld_win_x86-64_cmprssptrs\codert_vm\cnathelp.cpp @ 1696]
0e 00000000`00facb00 00007ffe`e55f063b j9jit29!fast_jitMonitorEntry+0x1d [c:\temp\bld_100666\bld_win_x86-64_cmprssptrs\codert_vm\cnathelp.cpp @ 3742]
0f 00000000`00facb30 00000000`ffe71178 j9jit29!jitMonitorEntry+0xb [c:\temp\bld_100666\bld_win_x86-64_cmprssptrs\codert_vm\xnathelp.asm @ 2209]
I confirmed that jitMonitorEntry was called from {java/util/jar/JarVerifier.processEntry} +4517 and that the entire call stack is running on the Java stack.
For the second case the call stack is:
# Child-SP RetAddr Call Site
00 00000000`004ebd28 00007ffc`3d61811d ntdll!_chkstk+0x37
01 00000000`004ebd40 00007ffc`3d6301ba ntdll!RtlpWalkFrameChain+0x13d
02 00000000`004ebe90 00007ffc`3a28fe87 ntdll!RtlWalkFrameChain+0x2a
03 00000000`004ebec0 00007ffc`3a2f4179 InProcessClient64+0x8fe87 C:\Program Files\SentinelOne...
04 00000000`004ec450 00007ffc`3a290cfc InProcessClient64+0xf4179
05 00000000`004ec4b0 00007ffc`3a290d39 InProcessClient64+0x90cfc
06 00000000`004ec550 00007ffc`3a290bc2 InProcessClient64+0x90d39
07 00000000`004ec580 00007ffc`3d61c2f4 InProcessClient64+0x90bc2
08 00000000`004ec670 00007ffc`03b353cc ntdll!RtlLeaveCriticalSection+0x2f4
09 00000000`004ec6e0 00007ffb`b7c6b569 j9thr29!monitor_exit+0x73c [c:\workspace\openjdk-build\workspace\build\src\omr\thread\common\omrthread.c @ 4366]
0a 00000000`004ec760 00007ffb`b787b8fc j9vm29!objectMonitorExit+0x819 [c:\workspace\openjdk-build\workspace\build\src\openj9\runtime\vm\monhelpers.c @ 241]
0b (Inline Function) --------`-------- j9jit29!fast_jitMonitorExitImpl+0x32 [c:\workspace\openjdk-build\workspace\build\src\openj9\runtime\codert_vm\cnathelp.cpp @ 1837]
0c 00000000`004ec7e0 00007ffb`b785c3fb j9jit29!fast_jitMethodMonitorExit+0x3c [c:\workspace\openjdk-build\workspace\build\src\openj9\runtime\codert_vm\cnathelp.cpp @ 3741]
0d 00000000`004ec810 00000000`0037ea00 j9jit29!jitMethodMonitorExit+0xb [c:\workspace\openjdk-build\workspace\build\src\build\windows-x86_64-server-release\vm\runtime\codert_vm\xnathelp.s @ 2270]
I confirmed that jitMethodMonitorExit was called from {org/eclipse/osgi/framework/eventmgr/EventManager$EventThread.getNextEvent} +840 and that the call stack is running on the Java stack. However, the stack region for this thread has been overrun during this call sequence, presumably risking data corruption even if we had not crashed in _chkstk.
Of note, the crashes in _chkstk seem to depend on an additional trigger. In the first case the problem was demonstrated by enabling user mode stack traces via gflags. In the second case the presence of the SentinelOne security software triggered the failure.
For the first case there is also a demonstrated interaction with Windows Control Flow Guard (CFG): the crashes can be prevented by explicitly forcing CFG off for the java executable. It is not clear why this should be the case, and it is possible that the underlying Windows change was not intentional.
These last points notwithstanding, it seems untenable to allow calls to Windows system functions on the Java stack when we can neither tell how much stack space they will require nor guarantee that _chkstk won't be called.
Update: I now have data for the third case. It is another crash triggered by SentinelOne during a call to jitMethodMonitorExit. The calling JIT-compiled method was different in this case: {com/ibm/mq/pcf/event/PCFMonitorAgent.refresh}. The call stack is:
# Child-SP RetAddr Call Site
00 00000000`01731788 00007ffc`ec5962ad ntdll!_chkstk+0x37
01 00000000`017317a0 00007ffc`ec542eda ntdll!RtlpWalkFrameChain+0x13d
02 00000000`017318f0 00007ffc`e91a39a6 ntdll!RtlWalkFrameChain+0x2a
03 00000000`01731920 00007ffc`e9215a49 InProcessClient64+0x939a6 from C:\Program Files\SentinelOne...
04 00000000`01731f20 00007ffc`e91a353b InProcessClient64+0x105a49
05 00000000`01731f80 00007ffc`e91a48a3 InProcessClient64+0x9353b
06 00000000`01731ff0 00007ffc`e91a48d8 InProcessClient64+0x948a3
07 00000000`01732030 00007ffc`e91a47c3 InProcessClient64+0x948d8
08 00000000`01732070 00007ffc`ec59a484 InProcessClient64+0x947c3
09 00000000`01732130 00007ffc`c00c5307 ntdll!RtlLeaveCriticalSection+0x2f4
0a 00000000`017321a0 00007ffc`3846a7d9 j9thr29!monitor_exit+0x707 [c:\workspace\openjdk-build\workspace\build\src\omr\thread\common\omrthread.c @ 4366]
0b 00000000`01732210 00007ffc`36ef4c88 j9vm29!objectMonitorExit+0x7b9 [c:\workspace\openjdk-build\workspace\build\src\openj9\runtime\vm\monhelpers.c @ 223]
0c (Inline Function) --------`-------- j9jit29!fast_jitMonitorExitImpl+0x2e [c:\workspace\openjdk-build\workspace\build\src\openj9\runtime\codert_vm\cnathelp.cpp @ 1817]
0d 00000000`01732290 00007ffc`36ed5e6b j9jit29!fast_jitMethodMonitorExit+0x38 [c:\workspace\openjdk-build\workspace\build\src\openj9\runtime\codert_vm\cnathelp.cpp @ 3711]
0e 00000000`017322c0 00000000`01239e00 j9jit29!jitMethodMonitorExit+0xb [c:\workspace\openjdk-build\workspace\build\src\build\windows-x86_64-server-release\vm\runtime\codert_vm\xnathelp.s @ 2270]