Commit 537d607

JIT: revise approach for x64 OSR epilogs (#65609)
Rework x64 OSR so OSR methods have standard epilogs. Details in attached doc.
1 parent c6829b9 commit 537d607

6 files changed: +652 -52 lines changed
@@ -0,0 +1,247 @@
## OSR x64 Epilog Redesign

### Problem

The current x64 OSR epilog generation creates "non-canonical"
epilogs. While the code sequences are correct, the Windows x64
unwinder depends on code generators to produce canonical epilogs, so
that the unwinder can reliably detect when an IP is within an epilog.

The Windows x64 unwind info has no data whatsoever on epilogs, so this
sort of implicit epilog detection is necessary. The unwinder
disassembles the code starting at the IP to deduce if the IP is
within an epilog. Only very specific sequences of instructions are
expected, and anything unexpected causes the unwinder to deduce
that the IP is not in an epilog.

The canonical epilog is a single RSP adjust followed by some number of
non-volatile integer register POPs, and then a RET or JMP. Non-volatile float
registers are restored outside the epilog via MOVs.

OSR methods currently generate the following kind of epilog. It is
non-canonical because of the second RSP adjustment, whose purpose is
to remove the Tier0 frame from the stack.

```asm
       add rsp, 120  ;; pop OSR contribution to frame
       pop rbx       ;; restore non-volatile regs (callee-saves)
       pop rsi
       pop rdi
       pop r12
       pop r13
       pop r14
       pop r15
       pop rbp
       add rsp, 472  ;; pop Tier0 contribution to frame
       pop rbp       ;; second RBP restore (see below)
       ret
```
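
To make the definition of a canonical epilog concrete, here is a minimal C++ sketch of the shape check described above. It is a simplified model, not the OS unwinder's actual logic: the `Op` enum and `isCanonicalEpilog` are hypothetical names, a real unwinder decodes instruction bytes rather than pre-classified operations, and the sketch also accepts an epilog with no RSP adjust at all, which is an assumption beyond the text.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, pre-classified instruction kinds (an assumption for this
// sketch; the real unwinder disassembles raw bytes starting at the IP).
enum class Op { AddRsp, PopNonvol, Ret, Jmp, Other };

// Returns true if `code` has the shape described in the text: at most one
// RSP adjust, then zero or more non-volatile POPs, then a RET or JMP.
bool isCanonicalEpilog(const std::vector<Op>& code)
{
    std::size_t i = 0;
    if (i < code.size() && code[i] == Op::AddRsp)
    {
        i++; // the single allowed RSP adjustment
    }
    while (i < code.size() && code[i] == Op::PopNonvol)
    {
        i++; // non-volatile integer register restores
    }
    // Exactly one instruction must remain, and it must leave the method.
    return (i + 1 == code.size()) && (code[i] == Op::Ret || code[i] == Op::Jmp);
}
```

Under this model the OSR epilog shown above (ADD, eight POPs, a second ADD, POP, RET) is rejected at the second ADD, while a sequence such as ADD, POP, POP, POP, RET is accepted.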
39+
40+
These non-canonical OSR epilogs break the x64 unwinder's "in epilog"
41+
detection and also break epilog unwind. This leads to assertions and
42+
bugs during thread suspension, when suspended threads are in the
43+
middle of OSR epilogs, and to broken stack traces when walking the
44+
stack for diagnostic purposes (debugging or sampling).
45+
46+
The CLR (mostly?) tries to avoid suspending threads in epilog, but it
47+
does this by suspending the thread and then calling into the os
48+
unwinder to determine if a thread is in an epilog. The non-canonical
49+
OSR epilogs break thread suspension.
50+
51+
So it is imperative that the x64 OSR epilog sequence be one that the
52+
OS unwinder can reliably recognize as an epilog. It is also beneficial
53+
(though perhaps not mandatory) to be able to unwind from such epilogs; this
54+
improves diagnostic stackwalking accuracy and allows hijacking to
55+
work normally during epilogs, if needed.
56+
57+
Arm64 unwind codes are more flexible and the OSR epilogs we generate
58+
today do not cause any known problems.
59+
60+
### Solution
61+
62+
If the OSR method is required to have a canonical epilog, a single
63+
RSP adjust must remove both the OSR and Tier0 frames. This implies any
64+
and all nonvolatile integer register saves must be stored at the root of the
65+
Tier0 frame so that they can be properly restored by the OSR epilog
66+
via POPs after the single RSP adjustment.
67+
68+
Generally speaking, the Tier0 and OSR methods will not save the same
69+
set of non-volatile registers, and there is no way for the Tier0
70+
method to know which registers the OSR methods might want to save.
71+
72+
Thus we will require that any Tier0 method with patchpoints must
73+
reserve the maximum sized area for integer registers (8 regs * 8 bytes
74+
on Windows, 64 bytes). The Tier0 method will only use the part it
75+
needs. The rest will be unused unless we end up creating an OSR
76+
method. OSR methods will save any additional nonvolatile registers
77+
they use in this area in their prologs.
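
As a rough illustration of the reservation rule above, the following C++ sketch computes the size of the maximal integer save area and how many of its bytes a Tier0 method leaves unused. The function and constant names are hypothetical; the count of 8 registers on Windows comes from the text, and the rest is arithmetic.

```cpp
#include <cassert>

constexpr int kRegSize              = 8; // bytes per integer register on x64
constexpr int kMaxIntCalleeSavesWin = 8; // from the text: 8 regs on Windows

// Size of the integer save area that every Tier0 method with patchpoints
// reserves, whether or not it uses all of it.
constexpr int maxIntSaveAreaSize()
{
    return kMaxIntCalleeSavesWin * kRegSize;
}

// Bytes of the reserved area left over for a later OSR method, given how
// many integer registers the Tier0 method itself saves.
constexpr int unusedSaveAreaBytes(int tier0SavedRegs)
{
    return maxIntSaveAreaSize() - tier0SavedRegs * kRegSize;
}

static_assert(maxIntSaveAreaSize() == 64, "8 regs * 8 bytes");
```

For a Tier0 method that saves two registers (as in the Example section, which pushes RBP and RSI), `unusedSaveAreaBytes(2)` is 48, which matches the growth of that method's `sub rsp` from 56 to 104.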
78+
79+
OSR method epilogs will then adjust the SP to remove both the OSR and
80+
Tier0 frames, setting RSP to the appropriate offset into the save
81+
area, so that the epilog can pop all the saved nonvolatile registers and
82+
return. This gives OSR methods a canonical epilog.
83+
84+
That fixes the epilogs. But we must now also ensure that all this can
85+
be handled properly in the OSR prolog, so that in-prolog and in-body
86+
unwind are still viable.
87+
88+
A typical prolog would PUSH the non-volatiles it wants to save, but
89+
on entry, the OSR method's RSP is pointing below the Tier0 frame,
90+
and so is located well below the save area. So PUSHing is not possible.
91+
92+
Instead, the OSR method will use MOVs to save nonvolatile
93+
registers. Luckily, the x64 unwind format has support describing saves
94+
done via MOVs instead of PUSHes via `UWOP_SAVE_NONVOL` (added for supporting
95+
shrink-wrapping). We will use these codes to describe the callee save actions
96+
in the OSR prolog.
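
The operand encodings for the unwind codes used in this design can be sketched as below. The operation numbers and scaling are the ones that appear in the unwind dumps in the Example section (`UWOP_ALLOC_SMALL (2)` with `OpInfo * 8 + 8`, `UWOP_SAVE_NONVOL (4)` with an offset scaled by 8). The helper names are hypothetical and are not JIT functions.

```cpp
#include <cassert>

// Operation numbers as printed in the unwind dumps in this document.
enum UnwindOp
{
    UWOP_PUSH_NONVOL = 0,
    UWOP_ALLOC_SMALL = 2,
    UWOP_SAVE_NONVOL = 4
};

// UWOP_ALLOC_SMALL encodes an allocation as OpInfo * 8 + 8. OpInfo is a
// 4-bit field, so only sizes from 8 to 128 bytes (multiples of 8) fit.
constexpr bool fitsAllocSmall(int size)
{
    return size >= 8 && size <= 128 && size % 8 == 0;
}

constexpr int allocSmallOpInfo(int size)
{
    return (size - 8) / 8;
}

// UWOP_SAVE_NONVOL records the save slot as an RSP-relative byte offset
// divided by 8 (the "Scaled Small Offset" in the dumps).
constexpr int saveNonvolScaledOffset(int byteOffset)
{
    return byteOffset / 8;
}

static_assert(allocSmallOpInfo(56) == 6, "6 * 8 + 8 = 56");
static_assert(saveNonvolScaledOffset(160) == 20, "20 * 8 = 160");
```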
97+
98+
This new unwind code uses the established frame pointer (for x64 OSR this
99+
is always RSP) and so integer callee saves must be saved only after any
100+
RSP adjustments are made. This means in an OSR frame prolog the SP adjustment
101+
happens first, then the (additional) callee saves are saved. We need
102+
to take some care to ensure no callee save is trashed during the SP
103+
adjustment (which may be more than just an add, say if stack probing is needed).
104+
105+
### Work Needed
106+
107+
* Update the Tier0 method to allocate a maximally sized integer save area.
108+
109+
* OSR method prolog and unwind fixes
110+
* To express the fact that some callee saves were saved by the Tier0
111+
method, the OSR method will first issue a phantom (unwind only, offset 0)
112+
series of pushes for those callee saves.
113+
* Next the OSR method will do a phantom SP adjust to account for the
114+
remainder of the Tier0 frame and any SP adjustment done by the patchpoint
115+
transition code.
116+
* Since the Tier0 method is always an RBP frame and always saves RBP at the
117+
top of the register save area, the OSR method does not need to save RBP, and
118+
RBP can be restored from the Tier0 save. But (for RBP OSR frames) the x64
119+
OSR prolog must still set up a proper frame chain. So it will load from RBP
120+
(into a scratch register) and push the result to establish a proper value
121+
for RBP-based frame chaining. The OSR method is invoked with the Tier0 RBP,
122+
so this load/push fetches the Tier0 caller RBP and stores it in a slot on
123+
the OSR frame. This sets up a redundant copy of the saved RBP that does not
124+
need to undone on method exit.
125+
* Next the OSR prolog will establish its final RSP.
126+
* Finally the OSR method will save any remaining callee saves, using MOV
127+
instructions and `UWOP_NONVOL_SAVE` unwind records.
128+
* Nonvolatile float (xmm) registers continue to be stored via MOVs
129+
done after the int callee saves and RSP adjust -- their save area can be
130+
disjoint from the integer save area. Thus XMM registers can be saved to and
131+
restored from space on the OSR frame (otherwise the Tier0 frame would
132+
need to reserve another 160 bytes (windows) to hold possible OSR XMM
133+
saves). We do not yet take advantage of the fact that Tier0 methods
134+
may have also saved XMMs so that the OSR method may only need to save
135+
a subset.
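
The ordering of the steps above can be sketched as a list of unwind records. This is an illustrative model, not the JIT's implementation: `UnwindRecord`, `OsrSave`, `buildOsrUnwind`, and their parameters are hypothetical, and the sketch leaves out the RBP frame-chain push and the XMM saves. Records are returned with the latest prolog action first, matching the order of the unwind dumps in the Example section.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class Kind { PushNonvol, Alloc, SaveNonvol };

struct UnwindRecord
{
    int         codeOffset; // 0 marks a "phantom" record for work done by Tier0
    Kind        kind;
    std::string reg;        // register name, empty for Alloc
    int         value;      // alloc size or RSP-relative save offset, else 0
};

struct OsrSave
{
    std::string reg;
    int         codeOffset; // prolog offset just past the MOV
    int         rspOffset;  // where the MOV stores the register
};

std::vector<UnwindRecord> buildOsrUnwind(const std::vector<std::string>& tier0Pushes, // in push order
                                         int                             phantomAlloc,
                                         int                             osrAlloc,
                                         int                             osrAllocCodeOffset,
                                         const std::vector<OsrSave>&     osrSaves)
{
    std::vector<UnwindRecord> prologOrder;

    // 1. Phantom pushes for callee saves already saved by the Tier0 method.
    for (const std::string& reg : tier0Pushes)
    {
        prologOrder.push_back({0, Kind::PushNonvol, reg, 0});
    }
    // 2. Phantom SP adjust: rest of the Tier0 frame plus the transition.
    prologOrder.push_back({0, Kind::Alloc, "", phantomAlloc});
    // 3. The OSR method establishes its final RSP.
    prologOrder.push_back({osrAllocCodeOffset, Kind::Alloc, "", osrAlloc});
    // 4. Remaining callee saves via MOV, only after RSP is final.
    for (const OsrSave& save : osrSaves)
    {
        prologOrder.push_back({save.codeOffset, Kind::SaveNonvol, save.reg, save.rspOffset});
    }

    // Unwind dumps list the latest prolog action first.
    return std::vector<UnwindRecord>(prologOrder.rbegin(), prologOrder.rend());
}
```

Fed the values from the new-approach example (Tier0 pushes of RBP and RSI, a phantom allocation of 112, an OSR allocation of 56 at offset 0x04, and an RDI save at `[rsp+A0H]` at offset 0x0C), this yields the same five records in the same order as that example's OSR unwind dump.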
136+
137+
### Example
138+
139+
Here is an example contrasting the new and old approaches on a test case.
140+
141+
#### Old Approach
142+
```asm
143+
;; Tier0 prolog
144+
145+
55 push rbp
146+
56 push rsi
147+
4883EC38 sub rsp, 56
148+
488D6C2440 lea rbp, [rsp+40H]
149+
150+
;; Tier0 epilog
151+
152+
4883C438 add rsp, 56
153+
5E pop rsi
154+
5D pop rbp
155+
C3 ret
156+
157+
;; Tier0 unwind
158+
159+
CodeOffset: 0x06 UnwindOp: UWOP_ALLOC_SMALL (2) OpInfo: 6 * 8 + 8 = 56 = 0x38
160+
CodeOffset: 0x02 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rsi (6)
161+
CodeOffset: 0x01 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rbp (5)
162+
163+
;; OSR prolog
164+
165+
57 push rdi
166+
56 push rsi // redundant
167+
4883EC28 sub rsp, 40
168+
169+
;; OSR epilog (non-standard)
170+
171+
4883C428 add rsp, 40
172+
5E pop rsi
173+
5F pop rdi
174+
4883C448 add rsp, 72
175+
5D pop rbp
176+
C3 ret
177+
178+
;; OSR unwind
179+
180+
CodeOffset: 0x06 UnwindOp: UWOP_ALLOC_SMALL (2) OpInfo: 4 * 8 + 8 = 40 = 0x28
181+
CodeOffset: 0x02 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rsi (6)
182+
CodeOffset: 0x01 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rdi (7)
183+
184+
;; "phantom unwind" records at offset 0 (Tier0 actions)
185+
186+
CodeOffset: 0x00 UnwindOp: UWOP_ALLOC_SMALL (2) OpInfo: 8 * 8 + 8 = 72 = 0x48
187+
CodeOffset: 0x00 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rbp (5)
188+
```
189+
190+
#### New Approach
191+
192+
Note how the OSR method only saves RDI in its prolog, as RSI was already saved.
193+
And this save happens *after* RSP is updated in the OSR frame.
194+
Restore of RDI in unwind uses `UWOP_SAVE_NONVOL`.
195+
```asm
196+
;; Tier0 prolog
197+
198+
55 push rbp
199+
56 push rsi
200+
4883EC68 sub rsp, 104 // leave room for OSR
201+
488D6C2470 lea rbp, [rsp+70H]
202+
203+
;; Tier0 epilog
204+
205+
4883C468 add rsp, 104
206+
5E pop rsi
207+
5D pop rbp
208+
C3 ret
209+
210+
;; Tier0 unwind
211+
212+
CodeOffset: 0x06 UnwindOp: UWOP_ALLOC_SMALL (2) OpInfo: 12 * 8 + 8 = 104 = 0x68
213+
CodeOffset: 0x02 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rsi (6)
214+
CodeOffset: 0x01 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rbp (5)
215+
216+
;; OSR prolog
217+
218+
4883EC38 sub rsp, 56
219+
4889BC24A0000000 mov qword ptr [rsp+A0H], rdi
220+
221+
;; OSR epilog (standard)
222+
223+
4881C4A0000000 add rsp, 160
224+
5F pop rdi
225+
5E pop rsi
226+
5D pop rbp
227+
C3 ret
228+
229+
;; OSR unwind
230+
231+
CodeOffset: 0x0C UnwindOp: UWOP_SAVE_NONVOL (4) OpInfo: rdi (7)
232+
Scaled Small Offset: 20 * 8 = 160 = 0x000A0
233+
CodeOffset: 0x04 UnwindOp: UWOP_ALLOC_SMALL (2) OpInfo: 6 * 8 + 8 = 56 = 0x38
234+
235+
;; "phantom unwind" records at offset 0 (Tier0 actions)
236+
237+
CodeOffset: 0x00 UnwindOp: UWOP_ALLOC_SMALL (2) OpInfo: 13 * 8 + 8 = 112 = 0x70
238+
CodeOffset: 0x00 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rsi (6)
239+
CodeOffset: 0x00 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rbp (5)
240+
```
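
The offsets in the new-approach listing can be cross-checked with a little arithmetic. The C++ sketch below is an inference from the numbers in the example, not a description of the JIT's code: the helper names are hypothetical, the 8-byte transition adjustment is deduced from the difference between the phantom allocation (112) and the Tier0 `sub rsp` (104), and the assumption that additional OSR saves sit immediately below the Tier0 pushes is drawn from this one example.

```cpp
#include <cassert>

constexpr int kSlot = 8; // bytes per saved integer register

// Distance from the OSR method's final RSP up to the lowest register slot
// pushed by the Tier0 prolog.
constexpr int distanceToTier0Pushes(int osrAlloc, int transitionAdjust, int tier0Alloc)
{
    return osrAlloc + transitionAdjust + tier0Alloc;
}

// RSP-relative offset of the saveIndex-th (0-based) additional OSR callee
// save, assuming OSR saves are placed directly below the Tier0 pushes.
constexpr int osrSaveOffset(int distance, int saveIndex)
{
    return distance - kSlot * (saveIndex + 1);
}

// The single RSP adjust in the OSR epilog: it must leave RSP pointing at
// the lowest saved register so that POPs can restore everything.
constexpr int osrEpilogAdjust(int distance, int numOsrSaves)
{
    return distance - kSlot * numOsrSaves;
}

// Values from the example: OSR sub rsp, 56; inferred transition of 8;
// Tier0 sub rsp, 104.
constexpr int kDistance = distanceToTier0Pushes(56, 8, 104);

static_assert(kDistance == 168, "56 + 8 + 104");
static_assert(osrSaveOffset(kDistance, 0) == 0xA0, "mov [rsp+A0H], rdi");
static_assert(osrEpilogAdjust(kDistance, 1) == 160, "add rsp, 160");
```

After `add rsp, 160`, RSP points at the RDI slot, so the three POPs restore RDI, RSI, and RBP in that order before `ret`.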
241+
242+
### Notes
243+
244+
* We are not changing arm64 OSR at this time, it still uses the "old plan". Non-standard epilogs are handled on arm64 via epilog unwind codes.
245+
246+
* The OSR frame still reserves space for callee saves on its frame, despite
247+
not saving them there.

src/coreclr/jit/codegen.h (+9)

@@ -327,6 +327,11 @@ class CodeGen final : public CodeGenInterface
     void genPushCalleeSavedRegisters();
 #endif
 
+#if defined(TARGET_AMD64)
+    void genOSRRecordTier0CalleeSavedRegistersAndFrame();
+    void genOSRSaveRemainingCalleeSavedRegisters();
+#endif // TARGET_AMD64
+
     void genAllocLclFrame(unsigned frameSize, regNumber initReg, bool* pInitRegZeroed, regMaskTP maskArgRegsLiveIn);
 
     void genPoisonFrame(regMaskTP bbRegLiveIn);
@@ -475,6 +480,10 @@ class CodeGen final : public CodeGenInterface
 
     void genPopCalleeSavedRegisters(bool jmpEpilog = false);
 
+#if defined(TARGET_XARCH)
+    unsigned genPopCalleeSavedRegistersFromMask(regMaskTP rsPopRegs);
+#endif // !defined(TARGET_XARCH)
+
 #endif // !defined(TARGET_ARM64)
 
     //
