buslock, irq_stack, tdp_page_fault: bulk of updates

freelancer-leon · freelancer-leon · commit 86696e615e0e · 2023-03-29T17:57:14.000+08:00
Signed-off-by: Leon <zhiquan.lee@gmail.com>
diff --git a/arch/intel/buslock.md b/arch/intel/buslock.md
@@ -0,0 +1,69 @@
+# Bus Lock
+
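+## Bus Lock Detection and Handling
+### Problems Caused by Split Locks
+* A **split lock** is any atomic operation whose operand crosses two cache lines. Because the operand spans two cache lines and the operation must be atomic, the system locks the bus while the CPU accesses the two cache lines
+* **Bus lock**: a bus lock is acquired through either a split locked access to writeback (WB) memory or any locked access to non-WB memory
+  * This is typically thousands of cycles slower than an atomic operation within a cache line
+  * It also disrupts performance on other cores and can severely degrade overall system performance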
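
The definition above can be checked mechanically. A minimal sketch, assuming a 64-byte cache line (typical for current x86 CPUs, but not stated in these notes); `crosses_cache_line()` is a hypothetical helper, not a kernel function. It reports whether an operand of a given size at a given address would straddle two cache lines, which is the condition that turns a locked access into a split lock:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. Adjust for other CPUs. */
#define CACHE_LINE_SIZE 64u

/* Hypothetical helper: true if [addr, addr + size) touches two cache lines. */
static bool crosses_cache_line(uintptr_t addr, size_t size)
{
    if (size == 0)
        return false;
    /* Compare the cache line index of the first and the last byte. */
    return (addr / CACHE_LINE_SIZE) != ((addr + size - 1) / CACHE_LINE_SIZE);
}
```

An 8-byte operand at offset 60 covers bytes 60..67 and therefore straddles two lines, while the same operand at offset 56 stays within one line.
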
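+### Detecting Split Locks
+* Intel processors may support either or both of the following hardware mechanisms to detect split locks and bus locks
+#### Split Lock Detection via `#AC`
+* Beginning with the Tremont Atom CPU, an attempted split lock operation may raise an Alignment Check (`#AC`) exception
+* `#AC` on split lock is enabled by bit `29` of the `TEST_CTRL` MSR, the *split lock detect bit*
+#### Bus Lock Detection via `#DB`
+* Some CPUs can notify the kernel through a `#DB` trap **after** a user instruction acquires a bus lock and is executed
+* This allows the kernel to terminate the application or to enforce throttling
+### Software Handling
+* The kernel `#AC` and `#DB` handlers handle bus locks based on the kernel parameter `split_lock_detect`. The options are summarized below: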
+
+`split_lock_detect=` | `#AC` for split lock | `#DB` for bus lock
+---------------------|------------------------|-----------------------
+`off`                | Do nothing             | Do nothing
+`warn` (default)     | Kernel OOPs; warn once per task and disable future checking; when both features are supported, warn in `#AC` | Warn once per task and continue to run
+`fatal`              | Kernel OOPs; send `SIGBUS` to user; when both features are supported, fatal in `#AC` | Send `SIGBUS` to user
+`ratelimit:N (0 < N <= 1000)` | Do nothing | Limit the bus lock rate to N bus locks per second system wide and warn on bus locks
+
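The option values in the table can be validated with a few string comparisons. This is a user-space sketch only: `parse_split_lock_detect()` and the `sld_mode` enum are hypothetical names chosen for illustration and do not reproduce the kernel's own parser. The only rule taken from the table is that `ratelimit:N` requires `0 < N <= 1000`:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical result type for this sketch. */
enum sld_mode { SLD_INVALID = -1, SLD_OFF, SLD_WARN, SLD_FATAL, SLD_RATELIMIT };

/* Parse the value that follows "split_lock_detect=".
 * On SLD_RATELIMIT, *ratelimit receives N; otherwise it is left untouched. */
static enum sld_mode parse_split_lock_detect(const char *arg, int *ratelimit)
{
    static const char prefix[] = "ratelimit:";
    char *end;
    long n;

    if (strcmp(arg, "off") == 0)
        return SLD_OFF;
    if (strcmp(arg, "warn") == 0)
        return SLD_WARN;
    if (strcmp(arg, "fatal") == 0)
        return SLD_FATAL;
    if (strncmp(arg, prefix, sizeof(prefix) - 1) != 0)
        return SLD_INVALID;

    arg += sizeof(prefix) - 1;
    n = strtol(arg, &end, 10);
    /* Reject empty numbers, trailing junk, and values outside (0, 1000]. */
    if (end == arg || *end != '\0' || n <= 0 || n > 1000)
        return SLD_INVALID;

    *ratelimit = (int)n;
    return SLD_RATELIMIT;
}
```

On a real system the value is passed on the kernel command line, for example `split_lock_detect=ratelimit:5`.
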
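+## Bus Lock Debug Exception
+* Bit `11` of the `DR6` register is cleared by the processor on a Bus Lock Trap exception
+* When Bus Lock Trap is enabled via bit `2` of `DEBUGCTL` (the `IA32_DEBUGCTL` MSR, `0x1D9`), any instruction that causes a bus lock (mainly instructions that use the `LOCK` prefix to perform an atomic memory operation on uncacheable memory) clears bit `11` of `DR6` and raises a trap-class `#DB` exception
+  * **Note**: the processor does not otherwise set or clear this bit
+  * To avoid confusion when identifying debug exceptions, the software debug exception handler should set this bit to `1` before returning to the interrupted task
+* On processors that do not support the Bus Lock Trap exception, bit `11` of `DR6` is a read-only bit that behaves like bits `10:4` and always reads as `1`
+* The CPU enumerates support for this feature via `CPUID.(EAX=7, ECX=0).ECX[24]`; a value of `1` indicates support
+* The hardware generates the `#DB` for bus lock detection only when `CPL > 0`, to avoid nested `#DB` from multiple bus locks while the first `#DB` is being handled
+* A *breakpoint* and a *bus lock* can both trigger a `#DB` trap on the same instruction; the order in which they are handled is chosen by the kernel `#DB` handler
+* The CPU feature flag in `/proc/cpuinfo` is `bus_lock_detect`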
+
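The bit positions listed above can be written down as plain bit tests. A minimal sketch, assuming the register values have already been read by some other means (no `CPUID` or debug-register access is performed here); the macro and function names are hypothetical and chosen for this example only. Note the inverted polarity of `DR6` bit `11`: a cleared bit means a bus lock was trapped.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions taken from the notes above; names are illustrative. */
#define CPUID7_ECX_BUS_LOCK_DETECT  (UINT32_C(1) << 24)
#define DEBUGCTL_BUS_LOCK_DETECT    (UINT64_C(1) << 2)
#define DR6_BUS_LOCK                (UINT64_C(1) << 11)

/* cpuid7_ecx: the ECX value returned by CPUID.(EAX=7, ECX=0). */
static bool cpu_has_bus_lock_detect(uint32_t cpuid7_ecx)
{
    return (cpuid7_ecx & CPUID7_ECX_BUS_LOCK_DETECT) != 0;
}

/* Hardware clears bit 11 on a bus lock trap, so 0 means "bus lock seen". */
static bool dr6_reports_bus_lock(uint64_t dr6)
{
    return (dr6 & DR6_BUS_LOCK) == 0;
}

/* Before returning to the interrupted task the handler sets the bit back to 1. */
static uint64_t dr6_rearm_bus_lock(uint64_t dr6)
{
    return dr6 | DR6_BUS_LOCK;
}
```
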
+* Call path that enables split lock detection
+```c
+early_identify_cpu()
+-> sld_setup()
+   -> split_lock_setup()
+      -> __split_lock_setup() //sets the split lock CPU feature
+   -> sld_state_setup() //reads the split_lock_detect= setting
+   -> sld_state_show() //prints split lock and bus lock support at boot
+```
+* Call path that enables bus lock detection
+```c
+identify_cpu()
+-> this_cpu->c_init(c)
+=> init_intel()
+   -> split_lock_init()
+   -> bus_lock_init()
+```
+
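+### Virtualization Support for the Bus Lock Debug Exception
+* A VM exit sets bit `11` of the pending debug exceptions field in the guest-state area of the VMCS to indicate that a bus lock debug exception is pending but has not been delivered
+* A VM exit that sets this bit also sets bit `12` of that field (VM exits also set bit `12` to indicate that at least one data or I/O breakpoint was met and was enabled in `DR7`, or that a debug exception related to advanced debugging of RTM transactional regions occurred)
+* Once the bus lock detection VM-execution control is enabled, if the processor detects that one or more bus locks were caused while executing in VMX non-root operation, it generates a VM exit with exit reason `74`
+  * Such a VM exit is trap-like and is delivered after the instruction that acquired the bus lock has executed
+  * If delivery of this VM exit is preempted by a higher-priority VM exit, bit `26` of the exit reason field in the VMCS is set to `1`
+* A VMM can set bit `30` of the secondary processor-based VM-execution controls to cause a VM exit when a bus lock is acquired in VMX non-root operation
+* The processor enumerates support for this control through bit `62` of the `IA32_VMX_PROCBASED_CTLS2` MSR; a value of `1` indicates support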
+
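The VMX bits above can be decoded with a few masks. A minimal sketch operating on raw values that a hypervisor would already have read from the MSR and the VMCS; the macro and function names are made up for this example. Two details are assumptions beyond the notes: the basic exit reason occupies bits `15:0` of the exit reason field, and bit `62` of the capability MSR is the "allowed-1" indication for control bit `30` (that is, `32 + 30`).

```c
#include <stdbool.h>
#include <stdint.h>

/* Values from the notes above; names are illustrative. */
#define BUS_LOCK_BASIC_EXIT_REASON   74u
#define EXIT_REASON_BASIC_MASK       0xFFFFu            /* assumed: bits 15:0 */
#define EXIT_REASON_BUS_LOCK_FLAG    (UINT32_C(1) << 26)
#define SECONDARY_CTL_BUS_LOCK_SHIFT 30

/* procbased_ctls2: raw value of the IA32_VMX_PROCBASED_CTLS2 MSR. */
static bool vmx_supports_bus_lock_exiting(uint64_t procbased_ctls2)
{
    /* The high 32 bits report which controls may be set to 1. */
    return ((procbased_ctls2 >> (32 + SECONDARY_CTL_BUS_LOCK_SHIFT)) & 1) != 0;
}

/* True if this VM exit was delivered for a bus lock (exit reason 74). */
static bool exit_is_bus_lock(uint32_t exit_reason)
{
    return (exit_reason & EXIT_REASON_BASIC_MASK) == BUS_LOCK_BASIC_EXIT_REASON;
}

/* True if a bus lock exit was preempted by the exit being reported. */
static bool exit_preempted_bus_lock(uint32_t exit_reason)
{
    return (exit_reason & EXIT_REASON_BUS_LOCK_FLAG) != 0;
}
```

A handler would typically treat both cases the same way: either `exit_is_bus_lock()` or `exit_preempted_bus_lock()` means the guest acquired a bus lock.
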
+## References
+- [Bus lock detection and handling — The Linux Kernel documentation](https://docs.kernel.org/x86/buslock.html)
+- [x86 debug register - Wikipedia](https://en.wikipedia.org/wiki/X86_debug_register)
+- [Intel Instruction Set Extension Chapter 9](https://software.intel.com/content/dam/develop/public/us/en/documents/architecture-instruction-set-extensions-programming-reference.pdf)
+- [[PATCH v6 0_3] x86_bus_lock Enable bus lock detection - Fenghua Yu](https://lore.kernel.org/all/20210322135325.682257-1-fenghua.yu@intel.com/#r)
diff --git a/kernel/iommu.md b/kernel/iommu.md
@@ -0,0 +1,7 @@
+# References
+- [Linux x86-64 IOMMU Explained (1): Introduction to IOMMU](https://blog.csdn.net/qq_34719392/article/details/114834467)
+- [Linux x86-64 IOMMU Explained (2): SWIOTLB (Software IOMMU)](https://blog.csdn.net/qq_34719392/article/details/114873284)
+- [Linux x86-64 IOMMU Explained (3): Functions and Basic Principles of Intel IOMMU (Hardware IOMMU)](https://blog.csdn.net/qq_34719392/article/details/115374606)
+- [Linux x86-64 IOMMU Explained (4): Configuration for Enabling Intel IOMMU](https://blog.csdn.net/qq_34719392/article/details/116153505)
+- [Linux x86-64 IOMMU Explained (5): Intel IOMMU Initialization Flow](https://blog.csdn.net/qq_34719392/article/details/117563480)
+- [Linux x86-64 IOMMU Explained (6): DMA Coherent Mapping Flow with Intel IOMMU](https://blog.csdn.net/qq_34719392/article/details/117699839)
diff --git a/kernel/irq_x86-64.md b/kernel/irq_x86-64.md
@@ -161,8 +161,8 @@ irq_entries_start
   * [[patch V2 00_13] x86_irq_64 Inline irq stack switching](https://lore.kernel.org/all/20210209234041.127454039@linutronix.de/)
   * [x86/entry: Convert system vectors to irq stack macro](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=569dd8b4eb7ef666b467c41b8e8e4f2820d07f67)
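   * When a user program was interrupted, or an interrupt is already being handled on the per-CPU interrupt stack, `__common_interrupt()` is called directly
-    * If a user program was interrupted, the trampoline stack should be in use
-    * If an interrupt is already being handled on the  per-CPU interrupt stack, the *current kernel stack* is used
+    * If a user program was interrupted, the first stop is the trampoline stack, but the kernel immediately switches to the task's kernel stack to handle the interrupt
+    * If an interrupt is already being handled on the per-CPU interrupt stack, the *current kernel stack* is used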
 
 ## Exception Stacks
 * When no privilege-level change is needed, for example when the CPU is running in kernel mode at the time of the exception
@@ -174,8 +174,147 @@ irq_entries_start
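 ## Stack Used on a Privilege-level Change
 * When the privilege level changes, for example when an exception or interrupt occurs while the CPU is running in user mode, the segment selector and stack pointer of the stack the handler will use are obtained from the TSS of the currently executing task.
   * On x86-64 Linux this stack is indicated by `cpu_tss_rw.x86_tss.sp0`, i.e. the CPU entry trampoline stack
-  * For interrupts, the kernel then switches to the interrupt stack to handle the interrupt
-  * For exceptions, handling continues on the trampoline stack
+  * For interrupts, the kernel then switches to the interrupt stack to handle the interrupt (this changed after commit 569dd8b4eb7e, see above: it now also switches to the task's kernel stack to handle the interrupt)
+  * For exceptions, the kernel immediately switches to the task's kernel stack to handle the exception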
+
+```cpp
+/* Device interrupts common/spurious */
+DECLARE_IDTENTRY_IRQ(X86_TRAP_OTHER,    common_interrupt);
+
+#ifndef __ASSEMBLY__ //this definition is used when included from C code
+
+#define DECLARE_IDTENTRY_IRQ(vector, func)              \
+    DECLARE_IDTENTRY_ERRORCODE(vector, func)
+
+#define DECLARE_IDTENTRY_ERRORCODE(vector, func)            \
+    asmlinkage void asm_##func(void);               \
+    asmlinkage void xen_asm_##func(void);               \
+    __visible void func(struct pt_regs *regs, unsigned long error_code)
+
+#else //this definition is used when included from assembly code
+/* Entries for common/spurious (device) interrupts */
+#define DECLARE_IDTENTRY_IRQ(vector, func)              \
+    idtentry_irq vector func
+#endif
+```
+* The assembly macro `idtentry_irq` for the entry of the interrupt handlers
+```c
+/*
+ * Interrupt entry/exit.
+ *
+ * The interrupt stubs push (vector) onto the stack, which is the error_code
+ * position of idtentry exceptions, and jump to one of the two idtentry points
+ * (common/spurious).
+ *
+ * common_interrupt is a hotpath, align it to a cache line
+ */
+.macro idtentry_irq vector cfunc
+    .p2align CONFIG_X86_L1_CACHE_SHIFT
+    idtentry \vector asm_\cfunc \cfunc has_error_code=1
+.endm
+
+.macro idtentry vector asmsym cfunc has_error_code:req
+SYM_CODE_START(\asmsym)
+
+    .if \vector == X86_TRAP_BP //vector = X86_TRAP_OTHER, so this branch is not taken
+        /* #BP advances %rip to the next instruction */
+        UNWIND_HINT_IRET_REGS offset=\has_error_code*8 signal=0
+    .else
+        UNWIND_HINT_IRET_REGS offset=\has_error_code*8
+    .endif
+
+    ENDBR
+    ASM_CLAC
+    cld
+
+    .if \has_error_code == 0
+        pushq   $-1         /* ORIG_RAX: no syscall to restart */
+    .endif
+
+    .if \vector == X86_TRAP_BP //vector = X86_TRAP_OTHER, so this branch is not taken
+        /*
+         * If coming from kernel space, create a 6-word gap to allow the
+         * int3 handler to emulate a call instruction.
+         */
+        testb   $3, CS-ORIG_RAX(%rsp)
+        jnz .Lfrom_usermode_no_gap_\@
+        .rept   6
+        pushq   5*8(%rsp)
+        .endr
+        UNWIND_HINT_IRET_REGS offset=8
+.Lfrom_usermode_no_gap_\@:
+    .endif
+
+    idtentry_body \cfunc \has_error_code
+
+_ASM_NOKPROBE(\asmsym)
+SYM_CODE_END(\asmsym)
+.endm
+
+/**
+ * idtentry_body - Macro to emit code calling the C function
+ * @cfunc:      C function to be called
+ * @has_error_code: Hardware pushed error code on stack
+ */
+.macro idtentry_body cfunc has_error_code:req
+
+    /*
+     * Call error_entry() and switch to the task stack if from userspace.
+     *
+     * When in XENPV, it is already in the task stack, and it can't fault
+     * for native_iret() nor native_load_gs_index() since XENPV uses its
+     * own pvops for IRET and load_gs_index().  And it doesn't need to
+     * switch the CR3.  So it can skip invoking error_entry().
+     */
+    ALTERNATIVE "call error_entry; movq %rax, %rsp", \
+            "call xen_error_entry", X86_FEATURE_XENPV
+
+    ENCODE_FRAME_POINTER
+    UNWIND_HINT_REGS
+
+    movq    %rsp, %rdi          /* pt_regs pointer into 1st argument*/
+
+    .if \has_error_code == 1
+        movq    ORIG_RAX(%rsp), %rsi    /* get error code into 2nd argument*/
+        movq    $-1, ORIG_RAX(%rsp) /* no syscall to restart */
+    .endif
+
+    call    \cfunc //calls common_interrupt
+
+    /* For some configurations \cfunc ends up being a noreturn. */
+    REACHABLE
+
+    jmp error_return
+.endm
+```
+* `sync_regs()` helps locate the task's kernel stack; the actual switch to the task's kernel stack happens in `call error_entry; movq %rax, %rsp`
+```cpp
+/*
+ * Help handler running on a per-cpu (IST or entry trampoline) stack
+ * to switch to the normal thread stack if the interrupted code was in
+ * user mode. The actual stack switch is done in entry_64.S
+ */
+asmlinkage __visible noinstr struct pt_regs *sync_regs(struct pt_regs *eregs)
+{
+    struct pt_regs *regs = (struct pt_regs *)this_cpu_read(pcpu_hot.top_of_stack) - 1;
+    if (regs != eregs)
+        *regs = *eregs;
+    return regs;
+}
+```
+* `idtentry_body` switches to the task's kernel stack; the return value of `sync_regs()` is the return value of `call error_entry` and is placed in `%rax`
+```cpp
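+call error_entry
+   PUSH_AND_CLEAR_REGS save_ret=1 //push the registers onto the stack
+   testb   $3, CS+8(%rsp) //did the interrupt/exception occur in kernel mode or user mode?
+   jz  .Lerror_kernelspace //kernel mode skips what follows
+   /* Put us onto the real thread stack. */
+   jmp sync_regs //helps locate the task's kernel stack; its return is the return of error_entry
+   ...
+.Lerror_kernelspace:
+   ...
+movq %rax, %rsp //the return value of sync_regs() is the return value of error_entry; this actually switches to the task's kernel stack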
+```
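
The stack switch above can be modelled in user space. A minimal sketch, assuming a cut-down `fake_pt_regs` and an explicit `top_of_task_stack` argument in place of the per-CPU `pcpu_hot.top_of_stack`; both are stand-ins for illustration, not kernel definitions. It mirrors the two steps from the excerpts: test the low two bits of the saved `CS` to decide whether the entry came from user mode, then copy the register frame to the top of the task stack and return its new address (the value the assembly loads into `%rsp`).

```c
#include <stdbool.h>

/* Cut-down stand-in for struct pt_regs (hypothetical, for illustration). */
struct fake_pt_regs {
    unsigned long ip, cs, flags, sp, ss;
};

/* Equivalent of "testb $3, CS(%rsp)": a non-zero RPL means user mode. */
static bool from_user_mode(const struct fake_pt_regs *regs)
{
    return (regs->cs & 3) != 0;
}

/* Model of sync_regs(): place the frame just below the top of the task stack. */
static struct fake_pt_regs *fake_sync_regs(struct fake_pt_regs *eregs,
                                           void *top_of_task_stack)
{
    struct fake_pt_regs *regs = (struct fake_pt_regs *)top_of_task_stack - 1;

    if (regs != eregs)
        *regs = *eregs;   /* copy the frame from the entry stack */
    return regs;          /* the caller would load this into %rsp */
}
```

If the frame is already at the top of the task stack, the copy is skipped and the same pointer is returned, matching the `regs != eregs` check in the kernel excerpt.
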
 
 > When the processor performs a call to the exception- or interrupt-handler procedure:
 > * If the handler procedure is going to be executed at a numerically lower privilege level, a stack switch occurs.
@@ -195,7 +334,7 @@ irq_entries_start
 >
 > -- SDM, Vol. 3A, 6.14.2 64-Bit Mode Stack Frame
 
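-* Why is the stack seen when an oops occurs during exception handling sometimes `0xfffffexxxxxxxxxx` and sometimes `0xffffc90000000000 ~ ffffe8ffffffffff` (for 5-level page tables, `0xffa0000000000000 ~ 0xffd1ffffffffffff`)?
+* Why is the stack seen when an OOPs occurs during exception handling sometimes `0xfffffexxxxxxxxxx` and sometimes `0xffffc90000000000 ~ ffffe8ffffffffff` (for 5-level page tables, `0xffa0000000000000 ~ 0xffd1ffffffffffff`)?
   * `0xfffffexxxxxxxxxx` lies within the cpu_entry_area mapping, which means the CPU entry trampoline stack is in use and the CPU was running a user-mode program when the exception occurred;
   * `0xffffc90000000000 ~ ffffe8ffffffffff` is the vmalloc/ioremap space, which means the kernel has `CONFIG_VMAP_STACK` enabled, task kernel stacks are allocated with `vmalloc`, and the exception occurred in kernel mode;
   * In some extreme failure cases, you may even see an exception using the interrupt stack or an IST stack.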
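
The address ranges quoted above lend themselves to a small classifier for stack pointers copied out of an oops log. A minimal sketch for 4-level paging only, using just the two ranges given in the text; `classify_stack_addr()` and the enum are hypothetical names, and anything outside the two ranges (interrupt stacks, IST stacks, direct-map addresses) is simply reported as "other".

```c
#include <stdint.h>

/* Hypothetical classification for this sketch (4-level page tables). */
enum stack_region { STACK_CPU_ENTRY_AREA, STACK_VMALLOC, STACK_OTHER };

static enum stack_region classify_stack_addr(uint64_t addr)
{
    /* 0xfffffexxxxxxxxxx: the top 24 bits equal 0xfffffe. */
    if ((addr >> 40) == UINT64_C(0xfffffe))
        return STACK_CPU_ENTRY_AREA;

    /* vmalloc/ioremap space, where CONFIG_VMAP_STACK task stacks live. */
    if (addr >= UINT64_C(0xffffc90000000000) &&
        addr <= UINT64_C(0xffffe8ffffffffff))
        return STACK_VMALLOC;

    return STACK_OTHER;
}
```
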
@@ -380,6 +519,17 @@ struct cea_exception_stacks {
 };
 ```
 
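+### Stack Used for `#MC` and `#DB` from User Mode
+* Although Linux defines `#MC` and `#DB` to use IST stacks, when they occur in user mode, software switches them to the task's kernel stack for handling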
+```c
+# define DECLARE_IDTENTRY_MCE(vector, func)             \
+    idtentry_mce_db vector asm_##func func
+
+# define DECLARE_IDTENTRY_DEBUG(vector, func)               \
+    idtentry_mce_db vector asm_##func func
+```
+* `idtentry_mce_db` checks whether the exception occurred in user mode; if so, it uses the assembly macro `idtentry_body`, whose stack switching was shown earlier
+
 ## CPU Entry Trampoline Stack
 * For security reasons, the trampoline stack was introduced to support x86 KAISER
   * https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7f2590a110b837af5679d08fc25c6227c5a8c497
diff --git a/kernel/sched/sched_linux.md b/kernel/sched/sched_linux.md
@@ -1016,3 +1016,5 @@ again:
 * http://linuxperf.com/?p=42
 * [TIF_NEED_RESCHED: Why Is It Needed](http://www.linuxinternals.org/blog/2016/03/20/tif-need-resched-why-is-it-needed/)
 * [What Does an Idle CPU Do?](http://duartes.org/gustavo/blog/post/what-does-an-idle-cpu-do/)
+- [LWN: Linux's new EEVDF scheduler!](https://mp.weixin.qq.com/s/MqAzzGU8JCV90wUUWUJbyQ)
+- [An EEVDF CPU scheduler for Linux](https://lwn.net/Articles/925371/)
diff --git a/kernel/trace/ftrace.md b/kernel/trace/ftrace.md
@@ -815,5 +815,7 @@ echo '-:dosysopen1' > /sys/kernel/debug/tracing/kprobe_events
 * [Debugging the kernel using Ftrace - part 1](https://lwn.net/Articles/365835/)
 * [Debugging the kernel using Ftrace - part 2](https://lwn.net/Articles/366796/)
 * [Secrets of the Ftrace function tracer](https://lwn.net/Articles/370423/)
-* [Kprobe-based Event Tracing — The Linux Kernel documentation](https://www.kernel.org/doc/html/latest/trace/kprobetrace.html)
+* [ftrace - Function Tracer — The Linux Kernel documentation](https://www.kernel.org/doc/html/latest/trace/ftrace.html)
+* [Event Tracing — The Linux Kernel documentation](https://www.kernel.org/doc/html/latest/trace/events.html)
 * [Kernel Probes (Kprobes) — The Linux Kernel documentation](https://www.kernel.org/doc/html/latest/trace/kprobes.html)
+* [Kprobe-based Event Tracing — The Linux Kernel documentation](https://www.kernel.org/doc/html/latest/trace/kprobetrace.html)
diff --git a/virtualization/tdp_page_fault.md b/virtualization/tdp_page_fault.md