
[RFC] Add support for KVM_GET_XSAVE2 ioctls #261

Closed · wants to merge 1 commit

Conversation

@roypat (Collaborator) commented Apr 24, 2024

Since Linux 5.17, the kvm_xsave struct has a flexible array member (FAM) at the end, which can be retrieved using the KVM_GET_XSAVE2 ioctl [1]. What makes this FAM special is that its length is not stored in the header, but has to be retrieved via
KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2), which returns the total size of the kvm_xsave struct (i.e. the traditional 4096 byte header + the size of the FAM). Thus, to support KVM_GET_XSAVE2, we first need to check the capability on the VM fd, construct a FamStructWrapper of the correct size, and then call KVM_GET_XSAVE2.
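
Concretely, the flow looks roughly like this (a sketch of what this patch does internally; Cap::Xsave2, check_extension_int, and get_xsave2_into are illustrative names, not settled kvm-ioctls API):

use kvm_bindings::Xsave; // FamStructWrapper<kvm_xsave2>, from rust-vmm/kvm-bindings#104
use kvm_ioctls::{Cap, VcpuFd, VmFd};

fn get_xsave2_sketch(vm_fd: &VmFd, vcpu: &VcpuFd) -> kvm_ioctls::Result<Xsave> {
    // KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) on the VM fd reports the total size
    // of struct kvm_xsave in bytes (at least the traditional 4096).
    let total_bytes = vm_fd.check_extension_int(Cap::Xsave2).max(4096) as usize;
    // Everything past the first 4096 bytes is the FAM, measured in u32 entries.
    let fam_len = (total_bytes - 4096) / std::mem::size_of::<u32>();
    // Build a FamStructWrapper of exactly the right size...
    let mut xsave = Xsave::new(fam_len).expect("FAM length fits in the wrapper");
    // ...and have KVM_GET_XSAVE2 (a vCPU ioctl) fill it in.
    vcpu.get_xsave2_into(&mut xsave)?;
    Ok(xsave)
}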

I'm marking this as "RFC" because I'm not quite sure how best to deal with the combination of "check capability on VM FD" and "do ioctl on vcpu FD". In this patch, I simply add a VmFd parameter to Vcpu::get_xsave2, but that's kinda ugly. Alternatively, we could store a reference to the VM file descriptor inside each Vcpu structure, but that'd be a larger change. What do people think about this?

(also, this PR needs rust-vmm/kvm-bindings#104, because Rust apparently only type-checks type aliases at use sites instead of at the definition site, meaning I did not notice that I forgot to implement Default for kvm_xsave2 when working on the new kvm-bindings release. That's why CI is failing here for now)

Requirements

Before submitting your PR, please make sure you addressed the following
requirements:

  • All commits in this PR have Signed-Off-By trailers (with
    git commit -s), and the commit message has max 60 characters for the
    summary and max 75 characters for each description line.
  • All added/changed functionality has a corresponding unit/integration
    test.
  • All added/changed public-facing functionality has entries in the "Upcoming
    Release" section of CHANGELOG.md (if no such section exists, please create one).
  • Any newly added unsafe code is properly documented.

Since Linux 5.17, the `kvm_xsave` struct has a flexible array member
(FAM) at the end, which can be retrieved using the `KVM_GET_XSAVE2`
ioctl [1]. What makes this FAM special is that its length is not stored
in the header, but has to be retrieved via
`KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2)`, which returns the total size of
the `kvm_xsave` struct (i.e. the traditional 4096 byte header + the size
of the FAM). Thus, to support `KVM_GET_XSAVE2`, we first need to check
the capability on the VM fd, construct a FamStructWrapper of the correct
size, and then call `KVM_GET_XSAVE2`.

[1]: https://www.kernel.org/doc/html/latest/virt/kvm/api.html#kvm-get-xsave2

Signed-off-by: Patrick Roy <[email protected]>
@roypat (Collaborator, Author) commented Apr 24, 2024

cc @likebreath since iirc you were looking into a post-5.17 solution for kvm_xsave, too

@zulinx86 (Contributor) commented Mar 3, 2025

FWIW

the traditional 4096 byte header + the size of the FAM

to be precise, the traditional 4096 bytes is not just a header; rather, it was large enough to save all the states of the XSAVE feature components.

Intel AMX (Advanced Matrix Extensions) was introduced in Intel Sapphire Rapids, targeting AI / deep learning workloads. XTILEDATA, one of the Intel AMX state components, requires a large save area (8KB, as seen in the following log), so it's disabled by default. Instead, the kernel allows userspace to enable it dynamically via arch_prctl() (see the sketch after the log below).

[    0.395985] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[    0.395987] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[    0.395988] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[    0.395989] x86/fpu: Supporting XSAVE feature 0x020: 'AVX-512 opmask'
[    0.395989] x86/fpu: Supporting XSAVE feature 0x040: 'AVX-512 Hi256'
[    0.395990] x86/fpu: Supporting XSAVE feature 0x080: 'AVX-512 ZMM_Hi256'
[    0.395991] x86/fpu: Supporting XSAVE feature 0x200: 'Protection Keys User registers'
[    0.395992] x86/fpu: Supporting XSAVE feature 0x400: 'PASID state'
[    0.395993] x86/fpu: Supporting XSAVE feature 0x20000: 'AMX Tile config'
[    0.395994] x86/fpu: Supporting XSAVE feature 0x40000: 'AMX Tile data'
[    0.395995] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[    0.395996] x86/fpu: xstate_offset[5]:  832, xstate_sizes[5]:   64
[    0.395997] x86/fpu: xstate_offset[6]:  896, xstate_sizes[6]:  512
[    0.395998] x86/fpu: xstate_offset[7]: 1408, xstate_sizes[7]: 1024
[    0.395999] x86/fpu: xstate_offset[9]: 2432, xstate_sizes[9]:    8
[    0.396000] x86/fpu: xstate_offset[10]: 2440, xstate_sizes[10]:    8
[    0.396001] x86/fpu: xstate_offset[17]: 2496, xstate_sizes[17]:   64
[    0.396002] x86/fpu: xstate_offset[18]: 2560, xstate_sizes[18]: 8192
[    0.396003] x86/fpu: Enabled xstate features 0x606e7, context size is 10752 bytes, using 'compacted' format.

See also https://elixir.bootlin.com/linux/v6.13.5/source/arch/x86/include/uapi/asm/kvm.h#L381.
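
For illustration, the userspace side of that dynamic enablement looks roughly like this (a sketch; the constants come from the kernel's uapi prctl header and XSAVE feature numbering, and may need to be defined manually as here):

// From arch/x86/include/uapi/asm/prctl.h and the XSAVE feature numbering.
const ARCH_REQ_XCOMP_PERM: libc::c_ulong = 0x1023;
const XFEATURE_XTILEDATA: libc::c_ulong = 18;

/// Ask the kernel for permission to use AMX tile data in this process. Only
/// after this call does the reported XSAVE size include the 8KB XTILEDATA area.
fn request_amx_permission() -> std::io::Result<()> {
    // SAFETY: this arch_prctl variant takes a feature number, not a pointer,
    // and only affects the calling process's xstate permissions.
    let ret = unsafe {
        libc::syscall(libc::SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA)
    };
    if ret == 0 {
        Ok(())
    } else {
        Err(std::io::Error::last_os_error())
    }
}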

More specifically, kvm_vcpu_ioctl_x86_get_xsave2() calls fpu_copy_guest_fpstate_to_uabi(), which actually copies the XSTATE to the user buffer. As seen in the code, the type of the user buffer is union fpregs_state and, if XSAVE is supported, struct xregs_state is used. struct xregs_state starts with the legacy FPU state, then the XSTATE header, followed by the extended state area. Previously, the entire data (not only the XSTATE header but also the legacy FPU state and the extended area) fit in 4KB, as can be seen in the above kernel log, but now it can be larger than 4KB.

A minor comment: having the as_slice() and as_mut_slice() of struct kvm_xsave2 start from self.xsave.extra could look a bit weird (from a user's perspective), though it looks natural and easy to implement to people who understand the internal implementation. The structure was already released in 0.8.1 and such a change would be breaking, so I don't think it's worth doing. Feel free to ignore this minor comment.
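
For readers without the kvm-bindings source at hand, the wrapper's shape as discussed in this thread (doc comments here are mine, not from the crate):

#[repr(C)]
pub struct kvm_xsave2 {
    /// FamStruct bookkeeping: the number of u32 entries in `xsave.extra`.
    pub len: usize,
    /// The kernel's struct kvm_xsave: a 4096-byte `region`, then the `extra`
    /// flexible array member that KVM_GET_XSAVE2 writes the past-4096 state into.
    pub xsave: kvm_xsave,
}

// `Xsave = FamStructWrapper<kvm_xsave2>`; its as_slice()/as_mut_slice() hence
// start at `self.xsave.extra`, which is the behavior noted above.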

/// ```
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn get_xsave2(&self, vm_fd: &crate::VmFd) -> Result<Xsave> {
    let xsave_size = vm_fd.xsave_size();
@zulinx86 (Contributor) commented on this diff, Mar 3, 2025

I'm not sure we really should get the required size internally.

The VMM may check KVM caps beforehand (to fail faster). In that case, the VMM can cache the XSAVE size somewhere and reuse it without calling KVM_CHECK_EXTENSION again. For more flexibility in the VMM implementation, it could be better to keep xsave_size() out of get_xsave2().

Rather, not having xsave_size() in the first place might be better. At first look, handling both cases (KVM_GET_XSAVE2 supported and not supported) looks nice. But a VMM supporting multiple kernel versions across KVM_GET_XSAVE and KVM_GET_XSAVE2 (like Firecracker) should check the availability of the new API first anyway, and then, if it's not available, fall back to KVM_GET_XSAVE. If we really want to absorb the differences between these two APIs, we should implement the fallback internally (see the sketch below).
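
The internal-fallback alternative would look roughly like this (a sketch, imports as in the earlier one; Cap::Xsave2, check_extension_int, get_xsave_into, and get_xsave2_into are illustrative names):

fn get_xsave_any(vm_fd: &VmFd, vcpu: &VcpuFd) -> kvm_ioctls::Result<Xsave> {
    // KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) returns the required size in bytes
    // when supported, and 0 when not.
    let size = vm_fd.check_extension_int(Cap::Xsave2);
    if size >= 4096 {
        // KVM_GET_XSAVE2: allocate exactly the reported size.
        let fam_len = (size as usize - 4096) / std::mem::size_of::<u32>();
        let mut xsave = Xsave::new(fam_len).expect("valid FAM length");
        vcpu.get_xsave2_into(&mut xsave)?;
        Ok(xsave)
    } else {
        // Fall back to the pre-5.17, fixed-4096-byte KVM_GET_XSAVE.
        let mut xsave = Xsave::new(0).expect("valid FAM length");
        vcpu.get_xsave_into(&mut xsave)?;
        Ok(xsave)
    }
}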

@roypat (Collaborator, Author) replied

Mhh, yeah, I think I see what you mean. In the context of get_xsave2, having xsave_size() fall back to 4096 when KVM_CAP_XSAVE2 isn't supported is fairly useless, because the KVM_GET_XSAVE2 call is going to fail anyway. So it's kinda half-baked.

The problem, API-wise, is that we cannot let the user allocate the Xsave struct, because if they allocate one that is too small, then get_xsave2 can potentially result in undefined behavior (e.g. the kernel writing past the allocated buffer). We could mark it unsafe, but that's also not very nice imo.

And yeah, I guess if all Xsave structs are created from within kvm-ioctls, then it's not even useful to expose xsave_size() as an API at all.

@zulinx86 (Contributor) replied Mar 4, 2025

At least for me, marking it unsafe isn't so bad. unsafe functions always make me (and hopefully all Rustaceans) pay the highest attention. Of course, a safe wrapper is best, but as discussed in another thread, it may be impossible to wrap this safely due to the possible race condition.

So my current humble suggestion is (see the sketch after this list):

  • Mark get_xsave2() unsafe
  • Make get_xsave2() fail early if KVM_CAP_XSAVE2 isn't supported
  • Add lower-level wrappers, like we do for check_extension():
    • Add get_xsave2_xsave(), which lets users pass a &mut Xsave.
    • Add get_xsave2_raw(), which lets users pass a pointer to struct kvm_xsave.
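
In signature form, that suggestion would look something like this (an illustrative sketch only, bodies elided with todo!()):

impl VcpuFd {
    /// Fails early if KVM_CAP_XSAVE2 isn't supported.
    ///
    /// # Safety
    /// The required buffer size can grow concurrently via arch_prctl(); see
    /// the race discussion below.
    pub unsafe fn get_xsave2(&self, vm_fd: &VmFd) -> Result<Xsave> {
        todo!()
    }

    /// Lower-level wrapper: the caller passes a pre-sized buffer.
    pub unsafe fn get_xsave2_xsave(&self, xsave: &mut Xsave) -> Result<()> {
        todo!()
    }

    /// Lowest-level wrapper: the caller passes a raw struct kvm_xsave pointer.
    pub unsafe fn get_xsave2_raw(&self, xsave: *mut kvm_xsave) -> Result<()> {
        todo!()
    }
}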

@roypat (Collaborator, Author) commented Mar 4, 2025

...

More specifically, kvm_vcpu_ioctl_x86_get_xsave2() calls fpu_copy_guest_fpstate_to_uabi() [...] Previously, the entire data (not only the XSTATE header but also the legacy FPU state and the extended area) fit in 4KB, but now it can be larger than 4KB.

...

Mhhh, so are you saying that the size returned by checking KVM_CAP_XSAVE2 can change depending on whether userspace does specific prctls? That's... worrying, because it opens us up to race conditions and essentially makes creating a safe wrapper over kvm_xsave2 pretty much impossible :/

@zulinx86 (Contributor) commented Mar 4, 2025

Mhhh, so are you saying that the size returned by checking KVM_CAP_XSAVE2 can change depending on whether userspace does specific prctls? That's... worrying, because it opens us up to race conditions and essentially makes creating a safe wrapper over kvm_xsave2 pretty much impossible :/

Yes, the required size depends on the features enabled via arch_prctl().

KVM_CAP_XSAVE2 calls xstate_required_size() with kvm_get_filtered_xcr0() as the first argument. kvm_get_filtered_xcr0() returns the supported XCR0 (XSAVE feature bits), including dynamically enabled ones, and xstate_required_size() then calculates the required size for that XCR0.

The race condition you are worried about is that another thread may call arch_prctl() between KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) and KVM_GET_XSAVE2? Yeah, such a situation can happen... It's fine for it to be marked unsafe as long as why it's unsafe is explained in the docs, IMHO. wdyt?

@roypat (Collaborator, Author) commented Mar 4, 2025

The race condition you are worried about is that another thread may call arch_prctl() between KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) and KVM_GET_XSAVE2? Yeah, such a situation can happen... It's fine for it to be marked unsafe as long as why it's unsafe is explained in the docs, IMHO. wdyt?

Yeah, it's a similar situation to the Rust standard library functions for accessing environment variables, which were made unsafe in edition 2024 because of a similar scenario. I think it actually simplifies the whole design tbh, because if the function is unsafe anyway, we might as well have the caller allocate the Xsave2 struct - and then a whole lot of the problems we talked about disappear :) (see the sketch below)
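
A sketch of that caller-allocates shape and the safety contract it would need (illustrative signature only; the actual follow-up happened in #310):

/// Sketch: the caller allocates (and is responsible for sizing) the buffer.
///
/// # Safety
///
/// `xsave` must be at least as large as the size reported by
/// `KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2)` *at the time of this call*. Another
/// thread can grow that size via `arch_prctl(ARCH_REQ_XCOMP_PERM, ...)`, so a
/// previously queried size may be stale; an undersized buffer lets the kernel
/// write past the allocation (undefined behavior).
pub unsafe fn get_xsave2(&self, xsave: &mut Xsave) -> Result<()> {
    todo!("issue KVM_GET_XSAVE2 with xsave.as_mut_fam_struct_ptr()")
}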

@zulinx86 (Contributor) commented Mar 6, 2025

@roypat Could you close this in favor of #310?

@roypat (Collaborator, Author) commented Mar 6, 2025

@roypat Could you close this in favor of #310?

Yes, thank you so much for taking this over!

roypat closed this Mar 6, 2025