[linux-6.6.y] kvm: x86: fix pauseopt soft lockup in VM-exit #1609
opsiff merged 1 commit into deepin-community:linux-6.6.y
zhaoxin inclusion
category: feature

--------------------

The original PAUSEOPT implementation called kvm_vcpu_read_guest() in is_vmexit_during_pauseopt() to detect PAUSEOPT state on every VM-exit. When multiple vCPUs run across NUMA nodes, this frequent guest memory access can trigger NUMA page migration, causing TLB flush IPI deadlock and soft lockup.

Fix by using vmcs_read64(PAUSEOPT_TARGET_TSC) to detect PAUSEOPT state instead of reading guest memory. A non-zero PAUSEOPT_TARGET_TSC value indicates the guest is in PAUSEOPT optimized state.

Also move pauseopt state fields from kvm_vcpu_arch to vcpu_vmx to avoid KABI compatibility issues.

Changes:
- Remove is_vmexit_during_pauseopt() function
- Use VMCS field PAUSEOPT_TARGET_TSC for state detection
- Move pauseopt_interrupted/pauseopt_rip to vcpu_vmx as private fields
- Initialize new fields in vmx_vcpu_reset()

Signed-off-by: leoliu-oc <leoliu-oc@zhaoxin.com>
Reviewer's Guide

Zhaoxin VMX PAUSEOPT handling is reworked to avoid guest memory reads on every VM-exit. The PAUSEOPT_TARGET_TSC VMCS field now tracks pauseopt state, pauseopt bookkeeping moves from the generic kvm_vcpu_arch into vcpu_vmx, and the Zhaoxin-specific MSR capability bits are renamed and used consistently.

Sequence diagram for new PAUSEOPT handling on VM-exit/entry

sequenceDiagram
participant VCPU as kvm_vcpu
participant VMX as vcpu_vmx
participant VMCS
%% zx_vmx_vcpu_run_post path
VCPU->>VMX: zx_vmx_vcpu_run_post
VMX->>VMX: cpu_has_vmx_pauseopt()
alt PAUSEOPT supported
VMX->>VMCS: vmcs_read64(PAUSEOPT_TARGET_TSC)
alt PAUSEOPT_TARGET_TSC != 0
VMX->>VMX: pauseopt_in_progress = true
VCPU->>VMX: kvm_rip_read(vcpu)
VMX->>VMX: pauseopt_rip = current_rip
end
end
%% later entry
VCPU->>VMX: zx_vmx_vcpu_run_pre
VMX->>VMX: check pauseopt_in_progress
alt pauseopt_in_progress == true
VCPU->>VMX: kvm_rip_read(vcpu)
VMX->>VMX: compare new_rip with pauseopt_rip
alt new_rip != pauseopt_rip
VMX->>VMCS: vmcs_write64(PAUSEOPT_TARGET_TSC, 0)
VMX->>VMX: pauseopt_in_progress = false
VMX->>VMX: pauseopt_rip = 0
end
end
Class diagram for updated pauseopt bookkeeping in vcpu_vmx and kvm_vcpu_arch

classDiagram
class kvm_vcpu_arch {
<<struct>>
// many existing fields elided
// pauseopt_interrupted removed
// pauseopt_rip removed
}
class vcpu_vmx {
<<struct>>
u64 spec_ctrl
u32 msr_ia32_umwait_control
u32 msr_pauseopt_control
bool pauseopt_in_progress
unsigned long pauseopt_rip
}
vcpu_vmx --> kvm_vcpu_arch : contains_arch
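The layout argument behind the class diagram can be sketched in userspace C. This is only an illustration under stated assumptions: `demo_vcpu_arch` and `demo_vcpu_vmx` are hypothetical, heavily reduced stand-ins for `kvm_vcpu_arch` and `vcpu_vmx`, and the direct embedding is a simplification of how the real structs are nested. It shows that placing the bookkeeping fields in the private container leaves the size and layout of the generic struct untouched.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for the generic arch struct. */
struct demo_vcpu_arch {
	unsigned long regs[16];
};

/* Hypothetical stand-in for the VMX-private per-vCPU container. */
struct demo_vcpu_vmx {
	struct demo_vcpu_arch arch;    /* generic part, layout unchanged */
	uint32_t msr_pauseopt_control;
	bool pauseopt_in_progress;     /* private PAUSEOPT bookkeeping */
	unsigned long pauseopt_rip;
};

/* The generic struct holds only its original members. */
static_assert(sizeof(struct demo_vcpu_arch) == 16 * sizeof(unsigned long),
	      "generic layout must not change");

/* The new fields are placed after the embedded generic part. */
static_assert(offsetof(struct demo_vcpu_vmx, pauseopt_rip) >=
	      sizeof(struct demo_vcpu_arch),
	      "private fields must not overlap the generic part");
```

Code that only knows the generic struct sees the same offsets before and after the change, which is the property the commit message refers to as avoiding KABI issues.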
Flow diagram for PAUSEOPT state tracking using PAUSEOPT_TARGET_TSC

flowchart TD
A["VM-exit occurs"] --> B["zx_vmx_vcpu_run_post"]
B --> C["cpu_has_vmx_pauseopt()?"]
C -->|no| Z["Return, no PAUSEOPT tracking"]
C -->|yes| D["Read PAUSEOPT_TARGET_TSC from VMCS"]
D --> E{"PAUSEOPT_TARGET_TSC != 0?"}
E -->|no| Z
E -->|yes| F["Set pauseopt_in_progress = true"]
F --> G["pauseopt_rip = kvm_rip_read(vcpu)"]
G --> Z
Z --> H["Next VM-entry: zx_vmx_vcpu_run_pre"]
H --> I{"pauseopt_in_progress?"}
I -->|no| Q["Enter guest normally"]
I -->|yes| J["new_rip = kvm_rip_read(vcpu)"]
J --> K{"new_rip != pauseopt_rip?"}
K -->|no| Q
K -->|yes| L["vmcs_write64(PAUSEOPT_TARGET_TSC, 0)"]
L --> M["pauseopt_in_progress = false"]
M --> N["pauseopt_rip = 0"]
N --> Q
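The flow above can be modeled in plain userspace C. This is a minimal sketch, not the kernel code: `struct model_vcpu`, `model_run_post()` and `model_run_pre()` are hypothetical names, the VMCS field is modeled as an ordinary struct member, and the `cpu_has_vmx_pauseopt()` check is assumed to have already passed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in; the real state lives in the VMCS and struct vcpu_vmx. */
struct model_vcpu {
	uint64_t vmcs_pauseopt_target_tsc; /* models VMCS PAUSEOPT_TARGET_TSC */
	unsigned long rip;                 /* models what kvm_rip_read() returns */
	bool pauseopt_in_progress;
	unsigned long pauseopt_rip;
};

/* Models the VM-exit side (zx_vmx_vcpu_run_post in the diagram). */
static void model_run_post(struct model_vcpu *v)
{
	/* A non-zero target TSC marks the guest as being in PAUSEOPT state. */
	if (v->vmcs_pauseopt_target_tsc != 0) {
		v->pauseopt_in_progress = true;
		v->pauseopt_rip = v->rip;
	}
}

/* Models the VM-entry side (zx_vmx_vcpu_run_pre in the diagram). */
static void model_run_pre(struct model_vcpu *v)
{
	if (!v->pauseopt_in_progress)
		return;

	/* RIP changed since the exit: clear the target and the tracking state. */
	if (v->rip != v->pauseopt_rip) {
		v->vmcs_pauseopt_target_tsc = 0;
		v->pauseopt_in_progress = false;
		v->pauseopt_rip = 0;
	}
}

/* Small walk-through: exit during PAUSEOPT, then re-entry at a new RIP. */
int model_demo(void)
{
	struct model_vcpu v = { .vmcs_pauseopt_target_tsc = 500, .rip = 0x1000 };

	model_run_post(&v);
	v.rip = 0x2000;
	model_run_pre(&v);
	assert(v.vmcs_pauseopt_target_tsc == 0);
	return 0;
}
```

Note that the model never touches guest memory: both decisions use only the modeled VMCS field and the saved RIP, which is the point of the change.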
Pull request overview
This PR updates Zhaoxin/Centaur VMX PAUSEOPT handling in x86 KVM to avoid frequent guest memory reads on VM-exit (which can trigger NUMA migration and soft lockups), by tracking PAUSEOPT state via the VMCS PAUSEOPT_TARGET_TSC field and moving state to VMX-private per-vCPU data.
Changes:
- Replace opcode-based PAUSEOPT detection (guest memory read at RIP) with vmcs_read64(PAUSEOPT_TARGET_TSC) and remove is_vmexit_during_pauseopt().
- Move PAUSEOPT tracking state from struct kvm_vcpu_arch to struct vcpu_vmx, and reset it in vmx_vcpu_reset().
- Rename the Zhaoxin extended-VMCS capability bit macro to MSR_ZX_VMCS_EXEC_CTL3_EN and update its callers.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| arch/x86/kvm/vmx/vmx.h | Add VMX-private PAUSEOPT tracking fields to struct vcpu_vmx. |
| arch/x86/kvm/vmx/vmx.c | Switch PAUSEOPT state detection to VMCS field; reset new state; update Zhaoxin VMCS capability bit usage. |
| arch/x86/kernel/cpu/feat_ctl.c | Use renamed Zhaoxin VMCS capability bit macro when initializing CPU capabilities. |
| arch/x86/include/asm/msr-index.h | Rename Zhaoxin extended VMCS capability bit macro. |
| arch/x86/include/asm/kvm_host.h | Remove Zhaoxin PAUSEOPT state from generic struct kvm_vcpu_arch. |
	u32 msr_pauseopt_control;
	bool pauseopt_in_progress;
	unsigned long pauseopt_rip;
In struct vcpu_vmx, the newly added fields are not aligned/indented consistently with the surrounding members (most fields in this block use tab-aligned spacing). Please adjust formatting to match the existing style in this struct to keep diffs/readability consistent.
Suggested change:
	u32 msr_pauseopt_control;
	bool pauseopt_in_progress;
	unsigned long pauseopt_rip;
	vmx->spec_ctrl = 0;
	vmx->msr_pauseopt_control = 0;
	vmx->pauseopt_in_progress = false;
	vmx->pauseopt_rip = 0;
vmx_vcpu_reset() now resets the software PAUSEOPT tracking fields, but it never clears the VMCS PAUSEOPT_TARGET_TSC field. Because __vmx_vcpu_reset()/init_vmcs() is skipped when init_event==true, a stale non-zero PAUSEOPT_TARGET_TSC value can persist across INIT/reset and be misinterpreted as “PAUSEOPT in progress” on subsequent VM-exits. Consider explicitly vmcs_write64(PAUSEOPT_TARGET_TSC, 0) here (guarded by cpu_has_vmx_pauseopt()/is_zhaoxin_cpu()) to guarantee a clean state after reset paths.
Suggested change:
	vmx->pauseopt_rip = 0;
	if (cpu_has_vmx_pauseopt() || is_zhaoxin_cpu())
		vmcs_write64(PAUSEOPT_TARGET_TSC, 0);
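The concern in this comment can be illustrated with a small userspace model. It is a sketch under assumptions, not kernel code: `struct reset_model` and the `model_*` helpers are hypothetical, the VMCS field is an ordinary struct member, and a single `has_pauseopt` flag stands in for the capability guard discussed above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-vCPU state involved in reset. */
struct reset_model {
	uint64_t vmcs_pauseopt_target_tsc; /* models VMCS PAUSEOPT_TARGET_TSC */
	bool has_pauseopt;                 /* models the capability guard */
	bool pauseopt_in_progress;
	unsigned long pauseopt_rip;
};

/* Reset that clears only the software tracking fields, as in the reviewed hunk. */
static void model_reset_sw_only(struct reset_model *m)
{
	m->pauseopt_in_progress = false;
	m->pauseopt_rip = 0;
}

/* Reset that also clears the modeled VMCS field, as the comment proposes. */
static void model_reset_full(struct reset_model *m)
{
	model_reset_sw_only(m);
	if (m->has_pauseopt)
		m->vmcs_pauseopt_target_tsc = 0;
}

/* Same detection rule as on VM-exit: a non-zero target reads as "in progress". */
static bool model_exit_sees_pauseopt(const struct reset_model *m)
{
	return m->has_pauseopt && m->vmcs_pauseopt_target_tsc != 0;
}

/* Walk-through: a stale target survives the software-only reset. */
int reset_model_demo(void)
{
	struct reset_model m = {
		.vmcs_pauseopt_target_tsc = 42,
		.has_pauseopt = true,
		.pauseopt_in_progress = true,
		.pauseopt_rip = 0x1000,
	};

	model_reset_sw_only(&m);
	assert(model_exit_sees_pauseopt(&m));  /* stale value still detected */
	model_reset_full(&m);
	assert(!model_exit_sees_pauseopt(&m)); /* clean state after full reset */
	return 0;
}
```

In the model, the software-only reset leaves the detection rule reporting PAUSEOPT as active, while the full reset does not. Whether the real field can actually remain non-zero across an INIT depends on hardware and VMCS behavior that this sketch does not capture.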
Summary by Sourcery
Prevent PAUSEOPT-related soft lockups on Zhaoxin x86 KVM by avoiding guest memory inspection on every VM-exit and tightening PAUSEOPT state tracking.