|
| 1 | +## OSR x64 Epilog Redesign |
| 2 | + |
| 3 | +### Problem |
| 4 | + |
| 5 | +The current x64 OSR epilog generation creates "non-canonical" |
| 6 | +epilogs. While the code sequences are correct, the Windows x64 |
| 7 | +unwinder depends on code generators to produce canonical epilogs, so |
| 8 | +that the unwinder can reliably detect when an IP is within an epilog. |
| 9 | + |
| 10 | +The Windows x64 unwind info has no data whatsoever on epilogs, so this |
| 11 | +sort of implicit epilog detection is necessary. The unwinder |
| 12 | +disassembles the code starting at the IP to deduce if the IP is |
| 13 | +within an epilog. Only very specific sequences of instructions are |
| 14 | +expected, and anything unexpected causes the unwinder to deduce |
| 15 | +that the IP is not in an epilog. |
| 16 | + |
| 17 | +The canonical epilog is a single RSP adjust followed by some number of |
| 18 | +non-volatile integer register POPs, and then a RET or JMP. Non-volatile float |
| 19 | +registers are restored outside the epilog via MOVs. |
| 20 | + |
| 21 | +OSR methods currently generate the following kind of epilog. It is |
| 22 | +non-canonical because of the second RSP adjustment, whose purpose is |
| 23 | +to remove the Tier0 frame from the stack. |
| 24 | + |
| 25 | +```asm |
| 26 | + add rsp, 120 ;; pop OSR contribution to frame |
| 27 | + pop rbx ;; restore non-volatile regs (callee-saves) |
| 28 | + pop rsi |
| 29 | + pop rdi |
| 30 | + pop r12 |
| 31 | + pop r13 |
| 32 | + pop r14 |
| 33 | + pop r15 |
| 34 | + pop rbp |
| 35 | + add rsp, 472 ;; pop Tier0 contribution to frame |
| 36 | + pop rbp ;; second RBP restore (see below) |
| 37 | + ret |
| 38 | +``` |
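
The shape test described above can be sketched as a small checker. This is a simplified model for illustration only, not the OS unwinder's actual logic: the function name `is_canonical_epilog` is hypothetical, instructions are plain text strings, and only the `add rsp`/`lea rsp` adjust, nonvolatile integer `pop`s, and a final `ret` or `jmp` are modeled.

```python
import re

# Windows x64 nonvolatile integer registers that an epilog may POP.
NONVOLATILE = {"rbx", "rbp", "rdi", "rsi", "r12", "r13", "r14", "r15"}

def is_canonical_epilog(instructions):
    """Simplified model: [one RSP adjust] [pop nonvolatile]* (ret | jmp)."""
    # Drop ';' comments and blank entries, normalize case.
    ops = [ins.split(";")[0].strip().lower() for ins in instructions]
    ops = [op for op in ops if op]
    i = 0
    # At most one RSP adjustment, and only at the very start.
    if i < len(ops) and re.fullmatch(r"(add|lea)\s+rsp\s*,.*", ops[i]):
        i += 1
    # Any number of nonvolatile integer register pops.
    while i < len(ops):
        m = re.fullmatch(r"pop\s+(\w+)", ops[i])
        if not m or m.group(1) not in NONVOLATILE:
            break
        i += 1
    # Exactly one instruction must remain, and it must be ret or jmp.
    return i == len(ops) - 1 and re.fullmatch(r"(ret|jmp\b.*)", ops[i]) is not None

# The OSR epilog shown above: the second "add rsp" breaks the pattern.
OLD_OSR_EPILOG = [
    "add rsp, 120", "pop rbx", "pop rsi", "pop rdi", "pop r12", "pop r13",
    "pop r14", "pop r15", "pop rbp", "add rsp, 472", "pop rbp", "ret",
]

print(is_canonical_epilog(OLD_OSR_EPILOG))
print(is_canonical_epilog(["add rsp, 56", "pop rsi", "pop rbp", "ret"]))
```

Under this model the OSR sequence is rejected at the second `add rsp`, while a plain adjust/pop/ret sequence is accepted.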
| 39 | + |
| 40 | +These non-canonical OSR epilogs break the x64 unwinder's "in epilog" |
| 41 | +detection and also break epilog unwind. This leads to assertions and |
| 42 | +bugs during thread suspension, when suspended threads are in the |
| 43 | +middle of OSR epilogs, and to broken stack traces when walking the |
| 44 | +stack for diagnostic purposes (debugging or sampling). |
| 45 | + |
| 46 | +The CLR (mostly?) tries to avoid suspending threads in epilogs, but it |
| 47 | +does this by suspending the thread and then calling into the OS |
| 48 | +unwinder to determine if the thread is in an epilog. The unwinder does |
| 49 | +not recognize non-canonical OSR epilogs, so they break thread suspension. |
| 50 | + |
| 51 | +So it is imperative that the x64 OSR epilog sequence be one that the |
| 52 | +OS unwinder can reliably recognize as an epilog. It is also beneficial |
| 53 | +(though perhaps not mandatory) to be able to unwind from such epilogs; this |
| 54 | +improves diagnostic stackwalking accuracy and allows hijacking to |
| 55 | +work normally during epilogs, if needed. |
| 56 | + |
| 57 | +Arm64 unwind codes are more flexible and the OSR epilogs we generate |
| 58 | +today do not cause any known problems. |
| 59 | + |
| 60 | +### Solution |
| 61 | + |
| 62 | +If the OSR method is required to have a canonical epilog, a single |
| 63 | +RSP adjust must remove both the OSR and Tier0 frames. This implies any |
| 64 | +and all nonvolatile integer register saves must be stored at the root of the |
| 65 | +Tier0 frame so that they can be properly restored by the OSR epilog |
| 66 | +via POPs after the single RSP adjustment. |
| 67 | + |
| 68 | +Generally speaking, the Tier0 and OSR methods will not save the same |
| 69 | +set of non-volatile registers, and there is no way for the Tier0 |
| 70 | +method to know which registers the OSR methods might want to save. |
| 71 | + |
| 72 | +Thus we will require that any Tier0 method with patchpoints must |
| 73 | +reserve the maximum sized area for integer registers (8 regs * 8 bytes |
| 74 | +on Windows, 64 bytes). The Tier0 method will only use the part it |
| 75 | +needs. The rest will be unused unless we end up creating an OSR |
| 76 | +method. OSR methods will save any additional nonvolatile registers |
| 77 | +they use in this area in their prologs. |
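
A minimal sketch of the sizing arithmetic, assuming the Tier0 method pushes the registers it needs and reserves the rest of the 64-byte area as extra stack allocation. The helper name `extra_tier0_reservation` is hypothetical; the numbers are chosen to match the example later in this document, where the Tier0 `sub rsp` grows from 56 to 104.

```python
SLOT_SIZE = 8  # bytes per saved integer register

# The 8 nonvolatile integer registers on Windows x64 (RSP excluded).
WINDOWS_X64_INT_CALLEE_SAVES = ("rbx", "rbp", "rdi", "rsi", "r12", "r13", "r14", "r15")

def extra_tier0_reservation(pushed_by_tier0):
    """Bytes the Tier0 frame reserves beyond the registers it pushes itself."""
    max_area = len(WINDOWS_X64_INT_CALLEE_SAVES) * SLOT_SIZE  # 8 * 8 = 64
    used = len(set(pushed_by_tier0)) * SLOT_SIZE
    return max_area - used

# Tier0 in the example pushes rbp and rsi, leaving 6 unused slots.
extra = extra_tier0_reservation(["rbp", "rsi"])
print(extra)        # 48
print(56 + extra)   # 104, the new Tier0 "sub rsp" amount in the example
```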
| 78 | + |
| 79 | +OSR method epilogs will then adjust the SP to remove both the OSR and |
| 80 | +Tier0 frames, setting RSP to the appropriate offset into the save |
| 81 | +area, so that the epilog can pop all the saved nonvolatile registers and |
| 82 | +return. This gives OSR methods a canonical epilog. |
| 83 | + |
| 84 | +That fixes the epilogs. But we must now also ensure that all this can |
| 85 | +be handled properly in the OSR prolog, so that in-prolog and in-body |
| 86 | +unwind are still viable. |
| 87 | + |
| 88 | +A typical prolog would PUSH the non-volatiles it wants to save, but |
| 89 | +on entry, the OSR method's RSP is pointing below the Tier0 frame, |
| 90 | +and so is located well below the save area. So PUSHing is not possible. |
| 91 | + |
| 92 | +Instead, the OSR method will use MOVs to save nonvolatile |
| 93 | +registers. Luckily, the x64 unwind format supports describing saves |
| 94 | +done via MOVs instead of PUSHes via `UWOP_SAVE_NONVOL` (added for supporting |
| 95 | +shrink-wrapping). We will use these codes to describe the callee save actions |
| 96 | +in the OSR prolog. |
| 97 | + |
| 98 | +This new unwind code uses the established frame pointer (for x64 OSR this |
| 99 | +is always RSP) and so integer callee saves must be saved only after any |
| 100 | +RSP adjustments are made. This means in an OSR frame prolog the SP adjustment |
| 101 | +happens first, then the (additional) callee saves are saved. We need |
| 102 | +to take some care to ensure no callee save is trashed during the SP |
| 103 | +adjustment (which may be more than just an add, say if stack probing is needed). |
| 104 | + |
| 105 | +### Work Needed |
| 106 | + |
| 107 | +* Update the Tier0 method to allocate a maximally sized integer save area. |
| 108 | + |
| 109 | +* OSR method prolog and unwind fixes |
| 110 | + * To express the fact that some callee saves were saved by the Tier0 |
| 111 | +method, the OSR method will first issue a phantom (unwind only, offset 0) |
| 112 | +series of pushes for those callee saves. |
| 113 | + * Next the OSR method will do a phantom SP adjust to account for the |
| 114 | +remainder of the Tier0 frame and any SP adjustment done by the patchpoint |
| 115 | +transition code. |
| 116 | + * Since the Tier0 method is always an RBP frame and always saves RBP at the |
| 117 | + top of the register save area, the OSR method does not need to save RBP, and |
| 118 | + RBP can be restored from the Tier0 save. But (for RBP OSR frames) the x64 |
| 119 | + OSR prolog must still set up a proper frame chain. So it will load from RBP |
| 120 | + (into a scratch register) and push the result to establish a proper value |
| 121 | + for RBP-based frame chaining. The OSR method is invoked with the Tier0 RBP, |
| 122 | + so this load/push fetches the Tier0 caller RBP and stores it in a slot on |
| 123 | + the OSR frame. This sets up a redundant copy of the saved RBP that does not |
| 124 | + need to be undone on method exit. |
| 125 | + * Next the OSR prolog will establish its final RSP. |
| 126 | + * Finally the OSR method will save any remaining callee saves, using MOV |
| 127 | + instructions and `UWOP_SAVE_NONVOL` unwind records. |
| 128 | + * Nonvolatile float (xmm) registers continue to be stored via MOVs |
| 129 | + done after the int callee saves and RSP adjust -- their save area can be |
| 130 | + disjoint from the integer save area. Thus XMM registers can be saved to and |
| 131 | + restored from space on the OSR frame (otherwise the Tier0 frame would |
| 132 | + need to reserve another 160 bytes (Windows) to hold possible OSR XMM |
| 133 | + saves). We do not yet take advantage of the fact that Tier0 methods |
| 134 | + may have also saved XMMs so that the OSR method may only need to save |
| 135 | + a subset. |
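
The ordering of the integer-register steps above can be sketched as a list of unwind records. This is an illustration under assumptions, not the JIT's implementation: `osr_unwind_records` and its tuple layout are hypothetical, the RBP frame-chain step and XMM saves are omitted, and records are returned in reverse prolog order, which is how the x64 unwind code array is laid out.

```python
def alloc_op(size):
    # UWOP_ALLOC_SMALL encodes allocations of 8..128 bytes; larger ones
    # need UWOP_ALLOC_LARGE.
    return "UWOP_ALLOC_SMALL" if 8 <= size <= 128 else "UWOP_ALLOC_LARGE"

def osr_unwind_records(tier0_pushes, phantom_alloc, osr_alloc, osr_saves):
    """Build (code_offset, op, info) records for an OSR prolog.

    tier0_pushes:  registers Tier0 pushed, in push order (phantom, offset 0)
    phantom_alloc: bytes for the rest of the Tier0 frame plus transition
    osr_alloc:     (code_offset, size) of the real OSR "sub rsp"
    osr_saves:     [(code_offset, reg, rsp_offset)] for MOV-based saves
    """
    prolog_order = []
    for reg in tier0_pushes:
        prolog_order.append((0x00, "UWOP_PUSH_NONVOL", reg))
    prolog_order.append((0x00, alloc_op(phantom_alloc), phantom_alloc))
    offset, size = osr_alloc
    prolog_order.append((offset, alloc_op(size), size))
    for offset, reg, slot in osr_saves:
        prolog_order.append((offset, "UWOP_SAVE_NONVOL", (reg, slot)))
    # Unwind codes are listed last-prolog-action first.
    return list(reversed(prolog_order))

# Inputs taken from the "New Approach" example in this document.
for record in osr_unwind_records(["rbp", "rsi"], 112, (0x04, 56), [(0x0C, "rdi", 160)]):
    print(record)
```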
| 136 | + |
| 137 | +### Example |
| 138 | + |
| 139 | +Here is an example contrasting the new and old approaches on a test case. |
| 140 | + |
| 141 | +#### Old Approach |
| 142 | +```asm |
| 143 | +;; Tier0 prolog |
| 144 | +
|
| 145 | + 55 push rbp |
| 146 | + 56 push rsi |
| 147 | + 4883EC38 sub rsp, 56 |
| 148 | + 488D6C2440 lea rbp, [rsp+40H] |
| 149 | +
|
| 150 | +;; Tier0 epilog |
| 151 | +
|
| 152 | + 4883C438 add rsp, 56 |
| 153 | + 5E pop rsi |
| 154 | + 5D pop rbp |
| 155 | + C3 ret |
| 156 | +
|
| 157 | +;; Tier0 unwind |
| 158 | +
|
| 159 | + CodeOffset: 0x06 UnwindOp: UWOP_ALLOC_SMALL (2) OpInfo: 6 * 8 + 8 = 56 = 0x38 |
| 160 | + CodeOffset: 0x02 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rsi (6) |
| 161 | + CodeOffset: 0x01 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rbp (5) |
| 162 | +
|
| 163 | +;; OSR prolog |
| 164 | +
|
| 165 | + 57 push rdi |
| 166 | + 56 push rsi ;; redundant |
| 167 | + 4883EC28 sub rsp, 40 |
| 168 | +
|
| 169 | +;; OSR epilog (non-standard) |
| 170 | +
|
| 171 | + 4883C428 add rsp, 40 |
| 172 | + 5E pop rsi |
| 173 | + 5F pop rdi |
| 174 | + 4883C448 add rsp, 72 |
| 175 | + 5D pop rbp |
| 176 | + C3 ret |
| 177 | +
|
| 178 | +;; OSR unwind |
| 179 | +
|
| 180 | + CodeOffset: 0x06 UnwindOp: UWOP_ALLOC_SMALL (2) OpInfo: 4 * 8 + 8 = 40 = 0x28 |
| 181 | + CodeOffset: 0x02 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rsi (6) |
| 182 | + CodeOffset: 0x01 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rdi (7) |
| 183 | +
|
| 184 | + ;; "phantom unwind" records at offset 0 (Tier0 actions) |
| 185 | +
|
| 186 | + CodeOffset: 0x00 UnwindOp: UWOP_ALLOC_SMALL (2) OpInfo: 8 * 8 + 8 = 72 = 0x48 |
| 187 | + CodeOffset: 0x00 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rbp (5) |
| 188 | +``` |
| 189 | + |
| 190 | +#### New Approach |
| 191 | + |
| 192 | +Note how the OSR method only saves RDI in its prolog, as RSI was already saved. |
| 193 | +And this save happens *after* RSP is updated in the OSR frame. |
| 194 | +Restore of RDI in unwind uses `UWOP_SAVE_NONVOL`. |
| 195 | +```asm |
| 196 | +;; Tier0 prolog |
| 197 | +
|
| 198 | + 55 push rbp |
| 199 | + 56 push rsi |
| 200 | + 4883EC68 sub rsp, 104 ;; leave room for OSR |
| 201 | + 488D6C2470 lea rbp, [rsp+70H] |
| 202 | +
|
| 203 | +;; Tier0 epilog |
| 204 | +
|
| 205 | + 4883C468 add rsp, 104 |
| 206 | + 5E pop rsi |
| 207 | + 5D pop rbp |
| 208 | + C3 ret |
| 209 | +
|
| 210 | +;; Tier0 unwind |
| 211 | +
|
| 212 | + CodeOffset: 0x06 UnwindOp: UWOP_ALLOC_SMALL (2) OpInfo: 12 * 8 + 8 = 104 = 0x68 |
| 213 | + CodeOffset: 0x02 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rsi (6) |
| 214 | + CodeOffset: 0x01 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rbp (5) |
| 215 | +
|
| 216 | +;; OSR prolog |
| 217 | +
|
| 218 | + 4883EC38 sub rsp, 56 |
| 219 | + 4889BC24A0000000 mov qword ptr [rsp+A0H], rdi |
| 220 | +
|
| 221 | +;; OSR epilog (standard) |
| 222 | +
|
| 223 | + 4881C4A0000000 add rsp, 160 |
| 224 | + 5F pop rdi |
| 225 | + 5E pop rsi |
| 226 | + 5D pop rbp |
| 227 | + C3 ret |
| 228 | +
|
| 229 | +;; OSR unwind |
| 230 | +
|
| 231 | + CodeOffset: 0x0C UnwindOp: UWOP_SAVE_NONVOL (4) OpInfo: rdi (7) |
| 232 | + Scaled Small Offset: 20 * 8 = 160 = 0x000A0 |
| 233 | + CodeOffset: 0x04 UnwindOp: UWOP_ALLOC_SMALL (2) OpInfo: 6 * 8 + 8 = 56 = 0x38 |
| 234 | +
|
| 235 | + ;; "phantom unwind" records at offset 0 (Tier0 actions) |
| 236 | +
|
| 237 | + CodeOffset: 0x00 UnwindOp: UWOP_ALLOC_SMALL (2) OpInfo: 13 * 8 + 8 = 112 = 0x70 |
| 238 | + CodeOffset: 0x00 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rsi (6) |
| 239 | + CodeOffset: 0x00 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rbp (5) |
| 240 | +``` |
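
The offsets in the new-approach listing can be cross-checked with a short calculation. This is a sketch under assumptions: `osr_epilog_layout` is a hypothetical helper, and it assumes OSR saves are placed in the slots immediately below the registers Tier0 pushed, with the first OSR save in the highest of those slots.

```python
SLOT = 8  # bytes per saved integer register

def osr_epilog_layout(osr_alloc, phantom_alloc, tier0_pushed, osr_saved):
    """Return (epilog RSP adjust, OSR save offsets from final RSP, pop order)."""
    # Offset from the OSR method's final RSP to the lowest Tier0-pushed register.
    first_tier0_save = osr_alloc + phantom_alloc
    # OSR saves occupy the slots just below the Tier0 pushes.
    osr_slots = {reg: first_tier0_save - SLOT * (i + 1)
                 for i, reg in enumerate(osr_saved)}
    # The single epilog adjust leaves RSP at the lowest saved register.
    epilog_adjust = first_tier0_save - SLOT * len(osr_saved)
    # Pops walk upward in memory: OSR saves first, then Tier0 pushes in reverse.
    pop_order = list(reversed(osr_saved)) + list(reversed(tier0_pushed))
    return epilog_adjust, osr_slots, pop_order

adjust, slots, pops = osr_epilog_layout(
    osr_alloc=56, phantom_alloc=112, tier0_pushed=["rbp", "rsi"], osr_saved=["rdi"])
print(adjust)  # 160, matching "add rsp, 160"
print(slots)   # {'rdi': 160}, matching "mov qword ptr [rsp+A0H], rdi"
print(pops)    # ['rdi', 'rsi', 'rbp']
```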
| 241 | + |
| 242 | +### Notes |
| 243 | + |
| 244 | +* We are not changing arm64 OSR at this time; it still uses the "old plan". Non-standard epilogs are handled on arm64 via epilog unwind codes. |
| 245 | + |
| 246 | +* The OSR frame still reserves space for callee saves on its frame, despite |
| 247 | +not saving them there. |