[linux-6.6.y] kvm: x86: fix pauseopt soft lockup in VM-exit #1609

Merged
opsiff merged 1 commit into deepin-community:linux-6.6.y from leoliu-oc:linux-6.6.y
Apr 9, 2026

@leoliu-oc leoliu-oc commented Apr 9, 2026

zhaoxin inclusion
category: feature


The original PAUSEOPT implementation called kvm_vcpu_read_guest() in is_vmexit_during_pauseopt() to detect PAUSEOPT state on every VM-exit. When multiple vCPUs run across NUMA nodes, this frequent guest memory access can trigger NUMA page migration, which in turn can cause a TLB flush IPI deadlock and a soft lockup.

Fix this by using vmcs_read64(PAUSEOPT_TARGET_TSC) to detect PAUSEOPT state instead of reading guest memory. A non-zero PAUSEOPT_TARGET_TSC value indicates that the guest is in the PAUSEOPT optimized state.

Also move the pauseopt state fields from kvm_vcpu_arch to vcpu_vmx to avoid KABI compatibility issues.

Changes:

  • Remove is_vmexit_during_pauseopt() function
  • Use VMCS field PAUSEOPT_TARGET_TSC for state detection
  • Move pauseopt_interrupted/pauseopt_rip to vcpu_vmx as private fields (pauseopt_interrupted becomes pauseopt_in_progress)
  • Initialize new fields in vmx_vcpu_reset()
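
The detection change can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct vmcs_model`, `model_vmcs_read64()` and the `FIELD_*` indices are hypothetical stand-ins for the kernel's VMCS accessors and encodings, not code from the patch. It shows that the check reduces to comparing one VMCS field against zero, with no guest memory access involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical field indices for this model; real VMCS encodings differ. */
enum model_vmcs_field {
	FIELD_GUEST_RIP,
	FIELD_PAUSEOPT_TARGET_TSC,
	FIELD_COUNT
};

/* Userspace stand-in for a VMCS: a flat array of 64-bit fields. */
struct vmcs_model {
	uint64_t fields[FIELD_COUNT];
};

/* Models vmcs_read64(): reads a field, never guest memory. */
static uint64_t model_vmcs_read64(const struct vmcs_model *vmcs,
				  enum model_vmcs_field field)
{
	assert(field < FIELD_COUNT);
	return vmcs->fields[field];
}

/* A non-zero target TSC means the exit happened during PAUSEOPT. */
static bool model_exit_during_pauseopt(const struct vmcs_model *vmcs)
{
	return model_vmcs_read64(vmcs, FIELD_PAUSEOPT_TARGET_TSC) != 0;
}
```

With a zeroed `struct vmcs_model`, `model_exit_during_pauseopt()` returns false; after storing any non-zero value in the `FIELD_PAUSEOPT_TARGET_TSC` slot it returns true.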

Summary by Sourcery

Prevent PAUSEOPT-related soft lockups on Zhaoxin x86 KVM by avoiding guest memory inspection on every VM-exit and tightening PAUSEOPT state tracking.

Bug Fixes:

  • Eliminate NUMA-induced TLB flush deadlocks and soft lockups by using the PAUSEOPT_TARGET_TSC VMCS field instead of reading guest memory to detect PAUSEOPT state.

Enhancements:

  • Track PAUSEOPT in-progress state and RIP as VMX-specific vcpu_vmx fields instead of generic kvm_vcpu_arch to avoid KABI issues and better encapsulate vendor-specific state.
  • Reset PAUSEOPT-related VMX state during vCPU reset to ensure a clean initial state.
  • Clarify and rename the Zhaoxin extended VMCS capability bit to MSR_ZX_VMCS_EXEC_CTL3_EN and use it consistently in VMCS setup and feature init code.
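
One way to see why the placement of the fields matters for KABI is a toy layout comparison. The structs below are hypothetical and far smaller than the real `struct kvm_vcpu_arch` and `struct vcpu_vmx`; the member after `arch` (`stat`) is an assumption made for illustration. The sketch shows that growing the shared arch struct shifts members behind it, while adding the same fields to a VMX-private wrapper leaves the shared layout untouched.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy generic arch state without and with the PAUSEOPT fields. */
struct toy_arch_plain {
	uint64_t regs[4];
};

struct toy_arch_with_pauseopt {
	uint64_t regs[4];
	bool pauseopt_interrupted;
	unsigned long pauseopt_rip;
};

/* Toy generic vCPU: a member placed after the embedded arch state. */
struct toy_vcpu_plain {
	struct toy_arch_plain arch;
	uint64_t stat;
};

struct toy_vcpu_grown {
	struct toy_arch_with_pauseopt arch;
	uint64_t stat;
};

/* Toy VMX-private wrapper: the new fields live outside the generic vCPU. */
struct toy_vcpu_vmx {
	struct toy_vcpu_plain vcpu;
	uint32_t msr_pauseopt_control;
	bool pauseopt_in_progress;
	unsigned long pauseopt_rip;
};

/* Bytes by which 'stat' moves when the fields are added to the arch struct. */
static size_t toy_shift_if_fields_in_arch(void)
{
	size_t before = offsetof(struct toy_vcpu_plain, stat);
	size_t after = offsetof(struct toy_vcpu_grown, stat);

	assert(after >= before);
	return after - before;
}

/* Bytes by which 'stat' moves when the fields live in the VMX wrapper. */
static size_t toy_shift_if_fields_in_vmx(void)
{
	return offsetof(struct toy_vcpu_vmx, vcpu.stat) -
	       offsetof(struct toy_vcpu_plain, stat);
}
```

In this model `toy_shift_if_fields_in_arch()` is non-zero and `toy_shift_if_fields_in_vmx()` is zero, which mirrors the motivation given for keeping vendor-specific state in `vcpu_vmx`.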

Signed-off-by: leoliu-oc <leoliu-oc@zhaoxin.com>
sourcery-ai bot commented Apr 9, 2026

Reviewer's Guide

Zhaoxin VMX PAUSEOPT handling is reworked to avoid guest memory reads on every VM-exit: pauseopt state is now tracked through the PAUSEOPT_TARGET_TSC VMCS field. Pauseopt bookkeeping moves from the generic kvm_vcpu_arch into vcpu_vmx, and the Zhaoxin-specific MSR capability bit is renamed to MSR_ZX_VMCS_EXEC_CTL3_EN and used consistently.

Sequence diagram for new PAUSEOPT handling on VM-exit/entry

sequenceDiagram
    participant VCPU as kvm_vcpu
    participant VMX as vcpu_vmx
    participant VMCS

    %% zx_vmx_vcpu_run_post path
    VCPU->>VMX: zx_vmx_vcpu_run_post
    VMX->>VMX: cpu_has_vmx_pauseopt()
    alt PAUSEOPT supported
        VMX->>VMCS: vmcs_read64(PAUSEOPT_TARGET_TSC)
        alt PAUSEOPT_TARGET_TSC != 0
            VMX->>VMX: pauseopt_in_progress = true
            VCPU->>VMX: kvm_rip_read(vcpu)
            VMX->>VMX: pauseopt_rip = current_rip
        end
    end

    %% later entry
    VCPU->>VMX: zx_vmx_vcpu_run_pre
    VMX->>VMX: check pauseopt_in_progress
    alt pauseopt_in_progress == true
        VCPU->>VMX: kvm_rip_read(vcpu)
        VMX->>VMX: compare new_rip with pauseopt_rip
        alt new_rip != pauseopt_rip
            VMX->>VMCS: vmcs_write64(PAUSEOPT_TARGET_TSC, 0)
            VMX->>VMX: pauseopt_in_progress = false
            VMX->>VMX: pauseopt_rip = 0
        end
    end

Class diagram for updated pauseopt bookkeeping in vcpu_vmx and kvm_vcpu_arch

classDiagram
    class kvm_vcpu_arch {
        <<struct>>
        // many existing fields elided
        // pauseopt_interrupted removed
        // pauseopt_rip removed
    }

    class vcpu_vmx {
        <<struct>>
        u64 spec_ctrl
        u32 msr_ia32_umwait_control
        u32 msr_pauseopt_control
        bool pauseopt_in_progress
        unsigned long pauseopt_rip
    }

    vcpu_vmx --> kvm_vcpu_arch : contains_arch

Flow diagram for PAUSEOPT state tracking using PAUSEOPT_TARGET_TSC

flowchart TD
    A["VM-exit occurs"] --> B["zx_vmx_vcpu_run_post"]
    B --> C["cpu_has_vmx_pauseopt()?"]
    C -->|no| Z["Return, no PAUSEOPT tracking"]
    C -->|yes| D["Read PAUSEOPT_TARGET_TSC from VMCS"]
    D --> E{"PAUSEOPT_TARGET_TSC != 0?"}
    E -->|no| Z
    E -->|yes| F["Set pauseopt_in_progress = true"]
    F --> G["pauseopt_rip = kvm_rip_read(vcpu)"]
    G --> Z

    Z --> H["Next VM-entry: zx_vmx_vcpu_run_pre"]
    H --> I{"pauseopt_in_progress?"}
    I -->|no| Q["Enter guest normally"]
    I -->|yes| J["new_rip = kvm_rip_read(vcpu)"]
    J --> K{"new_rip != pauseopt_rip?"}
    K -->|no| Q
    K -->|yes| L["vmcs_write64(PAUSEOPT_TARGET_TSC, 0)"]
    L --> M["pauseopt_in_progress = false"]
    M --> N["pauseopt_rip = 0"]
    N --> Q

File-Level Changes

Change: Avoid guest memory accesses on every VM-exit when checking PAUSEOPT state and instead use the VMCS PAUSEOPT_TARGET_TSC field.
Details:
  • Remove the is_vmexit_during_pauseopt() helper that decoded the PAUSEOPT opcode by reading guest memory at RIP.
  • In zx_vmx_vcpu_run_post(), detect pauseopt state by checking vmcs_read64(PAUSEOPT_TARGET_TSC) when cpu_has_vmx_pauseopt() is true.
  • In zx_vmx_vcpu_run_pre(), clear PAUSEOPT_TARGET_TSC and reset pauseopt tracking when RIP changes after an interrupted PAUSEOPT, preventing re-entering pauseopt optimized state.
Files:
  • arch/x86/kvm/vmx/vmx.c

Change: Move PAUSEOPT per-vCPU state from generic KVM arch struct into the Zhaoxin VMX-specific vcpu_vmx struct and initialize it properly.
Details:
  • Remove pauseopt_interrupted and pauseopt_rip fields and comments from struct kvm_vcpu_arch to avoid KABI exposure.
  • Add pauseopt_in_progress and pauseopt_rip fields to struct vcpu_vmx alongside msr_pauseopt_control.
  • Initialize msr_pauseopt_control, pauseopt_in_progress, and pauseopt_rip in vmx_vcpu_reset().
  • Update zx_vmx_vcpu_run_pre/post() to use the vcpu_vmx fields instead of vcpu->arch fields.
Files:
  • arch/x86/include/asm/kvm_host.h
  • arch/x86/kvm/vmx/vmx.h
  • arch/x86/kvm/vmx/vmx.c

Change: Align Zhaoxin extended VMCS capability bit naming and usage for the EXEC_CTL3 enable bit.
Details:
  • Rename MSR_ZX_VMCS_EXEC_CTL3 to MSR_ZX_VMCS_EXEC_CTL3_EN in the MSR index header to clarify it is an enable bit.
  • Update checks in setup_zhaoxin_vmcs_controls() and init_zhaoxin_ext_capabilities() to use the renamed MSR_ZX_VMCS_EXEC_CTL3_EN bit when gating use of PROCBASED_CTLS3.
Files:
  • arch/x86/include/asm/msr-index.h
  • arch/x86/kvm/vmx/vmx.c
  • arch/x86/kernel/cpu/feat_ctl.c
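
The post-exit and pre-entry tracking described in the diagrams and change list above can be modeled in plain C. This is a hedged userspace sketch, not the kernel code: `struct pauseopt_model_vcpu` and the `model_run_*()` helpers are hypothetical names, and plain struct members stand in for `vmcs_read64()`, `vmcs_write64()`, `kvm_rip_read()` and `cpu_has_vmx_pauseopt()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy vCPU holding only the state the tracking logic needs. */
struct pauseopt_model_vcpu {
	uint64_t pauseopt_target_tsc;	/* stands in for the VMCS field */
	unsigned long rip;		/* stands in for kvm_rip_read() */
	bool has_pauseopt;		/* stands in for cpu_has_vmx_pauseopt() */
	bool pauseopt_in_progress;	/* vcpu_vmx bookkeeping */
	unsigned long pauseopt_rip;	/* vcpu_vmx bookkeeping */
};

/* After VM-exit: record where PAUSEOPT was interrupted, if it was. */
static void model_run_post(struct pauseopt_model_vcpu *v)
{
	assert(v != NULL);
	if (!v->has_pauseopt)
		return;
	if (v->pauseopt_target_tsc != 0) {
		v->pauseopt_in_progress = true;
		v->pauseopt_rip = v->rip;
	}
}

/* Before VM-entry: if RIP moved since the exit, drop the PAUSEOPT state. */
static void model_run_pre(struct pauseopt_model_vcpu *v)
{
	assert(v != NULL);
	if (!v->pauseopt_in_progress)
		return;
	if (v->rip != v->pauseopt_rip) {
		v->pauseopt_target_tsc = 0;
		v->pauseopt_in_progress = false;
		v->pauseopt_rip = 0;
	}
}
```

In this model, re-entering at the recorded RIP leaves the target TSC in place so the guest can resume its wait, while re-entering at a different RIP clears the field and the bookkeeping.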

@deepin-ci-robot

Hi @leoliu-oc. Thanks for your PR.

I'm waiting for a deepin-community member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@deepin-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign avenger-285714 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sourcery-ai sourcery-ai bot left a comment

Hey - I've reviewed your changes and they look great!


Copilot AI left a comment

Pull request overview

This PR updates Zhaoxin/Centaur VMX PAUSEOPT handling in x86 KVM to avoid frequent guest memory reads on VM-exit (which can trigger NUMA migration and soft lockups), by tracking PAUSEOPT state via the VMCS PAUSEOPT_TARGET_TSC field and moving state to VMX-private per-vCPU data.

Changes:

  • Replace opcode-based PAUSEOPT detection (guest memory read at RIP) with vmcs_read64(PAUSEOPT_TARGET_TSC) and remove is_vmexit_during_pauseopt().
  • Move PAUSEOPT tracking state from struct kvm_vcpu_arch to struct vcpu_vmx, and reset it in vmx_vcpu_reset().
  • Rename the Zhaoxin extended-VMCS capability bit macro to MSR_ZX_VMCS_EXEC_CTL3_EN and update its callers.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Summary per file:

  • arch/x86/kvm/vmx/vmx.h: Add VMX-private PAUSEOPT tracking fields to struct vcpu_vmx.
  • arch/x86/kvm/vmx/vmx.c: Switch PAUSEOPT state detection to VMCS field; reset new state; update Zhaoxin VMCS capability bit usage.
  • arch/x86/kernel/cpu/feat_ctl.c: Use renamed Zhaoxin VMCS capability bit macro when initializing CPU capabilities.
  • arch/x86/include/asm/msr-index.h: Rename Zhaoxin extended VMCS capability bit macro.
  • arch/x86/include/asm/kvm_host.h: Remove Zhaoxin PAUSEOPT state from generic struct kvm_vcpu_arch.

Comment on lines +284 to +286
u32 msr_pauseopt_control;
bool pauseopt_in_progress;
unsigned long pauseopt_rip;

Copilot AI Apr 9, 2026


In struct vcpu_vmx, the newly added fields are not aligned/indented consistently with the surrounding members (most fields in this block use tab-aligned spacing). Please adjust formatting to match the existing style in this struct to keep diffs/readability consistent.

Suggested change
u32 msr_pauseopt_control;
bool pauseopt_in_progress;
unsigned long pauseopt_rip;
u32 msr_pauseopt_control;
bool pauseopt_in_progress;
unsigned long pauseopt_rip;

vmx->spec_ctrl = 0;
vmx->msr_pauseopt_control = 0;
vmx->pauseopt_in_progress = false;
vmx->pauseopt_rip = 0;

Copilot AI Apr 9, 2026


vmx_vcpu_reset() now resets the software PAUSEOPT tracking fields, but it never clears the VMCS PAUSEOPT_TARGET_TSC field. Because __vmx_vcpu_reset()/init_vmcs() is skipped when init_event==true, a stale non-zero PAUSEOPT_TARGET_TSC value can persist across INIT/reset and be misinterpreted as “PAUSEOPT in progress” on subsequent VM-exits. Consider explicitly vmcs_write64(PAUSEOPT_TARGET_TSC, 0) here (guarded by cpu_has_vmx_pauseopt()/is_zhaoxin_cpu()) to guarantee a clean state after reset paths.

Suggested change
vmx->pauseopt_rip = 0;
vmx->pauseopt_rip = 0;
if (cpu_has_vmx_pauseopt() || is_zhaoxin_cpu())
vmcs_write64(PAUSEOPT_TARGET_TSC, 0);
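
The concern in this comment can be reproduced in a toy model. The sketch below is an assumption-laden illustration, not kernel code: `struct reset_model_vcpu` and the `model_*()` helpers are hypothetical, and a struct member stands in for the VMCS PAUSEOPT_TARGET_TSC field on a reset path where the VMCS is not reinitialized. It contrasts a reset that clears only the software fields with one that also zeroes the modeled VMCS field.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy vCPU: one modeled VMCS field plus the software bookkeeping. */
struct reset_model_vcpu {
	uint64_t pauseopt_target_tsc;	/* stands in for the VMCS field */
	bool pauseopt_in_progress;
	unsigned long pauseopt_rip;
};

/* Reset that clears only the software tracking fields. */
static void model_reset_sw_only(struct reset_model_vcpu *v)
{
	assert(v != NULL);
	v->pauseopt_in_progress = false;
	v->pauseopt_rip = 0;
}

/* Reset that follows the suggestion and also zeroes the modeled field. */
static void model_reset_with_vmcs_clear(struct reset_model_vcpu *v)
{
	model_reset_sw_only(v);
	v->pauseopt_target_tsc = 0;
}

/* Detection as done after a VM-exit: non-zero target TSC means PAUSEOPT. */
static bool model_detects_pauseopt(const struct reset_model_vcpu *v)
{
	assert(v != NULL);
	return v->pauseopt_target_tsc != 0;
}
```

In the model, a stale non-zero value survives `model_reset_sw_only()` and is still reported by `model_detects_pauseopt()`, whereas `model_reset_with_vmcs_clear()` leaves nothing to misinterpret.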

@opsiff opsiff merged commit c0aab44 into deepin-community:linux-6.6.y Apr 9, 2026
17 checks passed