
[RFC] Add support for KVM_GET_XSAVE2 ioctls #261

Closed · wants to merge 1 commit

Conversation

@roypat (Collaborator) commented Apr 24, 2024

Since Linux 5.17, the kvm_xsave struct has a flexible array member (FAM) at the end, which can be retrieved using the KVM_GET_XSAVE2 ioctl [1]. What makes this FAM special is that its length is not stored in the header, but has to be retrieved via
KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2), which returns the total size of the kvm_xsave struct (i.e. the traditional 4096 byte header + the size of the FAM). Thus, to support KVM_GET_XSAVE2, we first need to check the capability on the VM fd, construct a FamStructWrapper of the correct size, and then call KVM_GET_XSAVE2.
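
Concretely, the flow looks roughly like this (a sketch of what this patch does internally; Cap::Xsave2, check_extension_int, and get_xsave2_into are illustrative names, not settled kvm-ioctls API):

use kvm_bindings::Xsave; // FamStructWrapper<kvm_xsave2>, from rust-vmm/kvm-bindings#104
use kvm_ioctls::{Cap, VcpuFd, VmFd};

fn get_xsave2_sketch(vm_fd: &VmFd, vcpu: &VcpuFd) -> kvm_ioctls::Result<Xsave> {
    // KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) on the VM fd reports the total size
    // of struct kvm_xsave in bytes (at least the traditional 4096).
    let total_bytes = vm_fd.check_extension_int(Cap::Xsave2).max(4096) as usize;
    // Everything past the first 4096 bytes is the FAM, measured in u32 entries.
    let fam_len = (total_bytes - 4096) / std::mem::size_of::<u32>();
    // Build a FamStructWrapper of exactly the right size...
    let mut xsave = Xsave::new(fam_len).expect("FAM length fits in the wrapper");
    // ...and have KVM_GET_XSAVE2 (a vCPU ioctl) fill it in.
    vcpu.get_xsave2_into(&mut xsave)?;
    Ok(xsave)
}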

I'm marking this as "RFC" because I'm not quite sure how best to deal with the combination of "check capability on VM FD" and "do ioctl on vcpu FD". In this patch, I simply add a VmFd parameter to Vcpu::get_xsave2, but that's kinda ugly. Alternatively, we could store a reference to the VM file descriptor inside each Vcpu structure, but that'd be a larger change. What do people think about this?

(also, this PR needs rust-vmm/kvm-bindings#104, because Rust apparently only type-checks type aliases at use sites instead of at the definition site, meaning I did not notice that I forgot to implement Default for kvm_xsave2 when working on the new kvm-bindings release. That's why CI is failing here for now)

Requirements

Before submitting your PR, please make sure you addressed the following
requirements:

  • All commits in this PR have Signed-Off-By trailers (with
    git commit -s), and the commit message has max 60 characters for the
    summary and max 75 characters for each description line.
  • All added/changed functionality has a corresponding unit/integration
    test.
  • All added/changed public-facing functionality has entries in the "Upcoming
    Release" section of CHANGELOG.md (if no such section exists, please create one).
  • Any newly added unsafe code is properly documented.

Since Linux 5.17, the `kvm_xsave` struct has a flexible array member
(FAM) at the end, which can be retrieved using the `KVM_GET_XSAVE2`
ioctl [1]. What makes this FAM special is that its length is not stored
in the header, but has to be retrieved via
`KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2)`, which returns the total size of
the `kvm_xsave` struct (i.e. the traditional 4096 byte header + the size
of the FAM). Thus, to support `KVM_GET_XSAVE2`, we first need to check
the capability on the VM fd, construct a FamStructWrapper of the correct
size, and then call `KVM_GET_XSAVE2`.

[1]: https://www.kernel.org/doc/html/latest/virt/kvm/api.html#kvm-get-xsave2

Signed-off-by: Patrick Roy <[email protected]>
@roypat (Collaborator, Author) commented Apr 24, 2024

cc @likebreath since iirc you were looking into a post-5.17 solution for kvm_xsave, too

@zulinx86 (Contributor) commented Mar 3, 2025

FWIW

the traditional 4096 byte header + the size of the FAM

to be precise, the traditional 4096 bytes is not just a header; rather, it was large enough to save all the states of the XSAVE feature components.

Intel AMX (Advanced Matrix Extensions) was introduced in Intel Sapphire Rapids, targeting AI / deep learning workloads. XTILEDATA, one of the Intel AMX state components, requires a large save area (8KB, as seen in the following log), so it's disabled by default. Instead, the kernel allows userspace to enable it dynamically via arch_prctl() (see the sketch after the log below).

[    0.395985] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[    0.395987] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[    0.395988] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[    0.395989] x86/fpu: Supporting XSAVE feature 0x020: 'AVX-512 opmask'
[    0.395989] x86/fpu: Supporting XSAVE feature 0x040: 'AVX-512 Hi256'
[    0.395990] x86/fpu: Supporting XSAVE feature 0x080: 'AVX-512 ZMM_Hi256'
[    0.395991] x86/fpu: Supporting XSAVE feature 0x200: 'Protection Keys User registers'
[    0.395992] x86/fpu: Supporting XSAVE feature 0x400: 'PASID state'
[    0.395993] x86/fpu: Supporting XSAVE feature 0x20000: 'AMX Tile config'
[    0.395994] x86/fpu: Supporting XSAVE feature 0x40000: 'AMX Tile data'
[    0.395995] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[    0.395996] x86/fpu: xstate_offset[5]:  832, xstate_sizes[5]:   64
[    0.395997] x86/fpu: xstate_offset[6]:  896, xstate_sizes[6]:  512
[    0.395998] x86/fpu: xstate_offset[7]: 1408, xstate_sizes[7]: 1024
[    0.395999] x86/fpu: xstate_offset[9]: 2432, xstate_sizes[9]:    8
[    0.396000] x86/fpu: xstate_offset[10]: 2440, xstate_sizes[10]:    8
[    0.396001] x86/fpu: xstate_offset[17]: 2496, xstate_sizes[17]:   64
[    0.396002] x86/fpu: xstate_offset[18]: 2560, xstate_sizes[18]: 8192
[    0.396003] x86/fpu: Enabled xstate features 0x606e7, context size is 10752 bytes, using 'compacted' format.

See also https://elixir.bootlin.com/linux/v6.13.5/source/arch/x86/include/uapi/asm/kvm.h#L381.
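
For illustration, the userspace side of that dynamic enablement looks roughly like this (a sketch; the constants come from the kernel's uapi prctl header and XSAVE feature numbering, and may need to be defined manually as here):

// From arch/x86/include/uapi/asm/prctl.h and the XSAVE feature numbering.
const ARCH_REQ_XCOMP_PERM: libc::c_ulong = 0x1023;
const XFEATURE_XTILEDATA: libc::c_ulong = 18;

/// Ask the kernel for permission to use AMX tile data in this process. Only
/// after this call does the reported XSAVE size include the 8KB XTILEDATA area.
fn request_amx_permission() -> std::io::Result<()> {
    // SAFETY: this arch_prctl variant takes a feature number, not a pointer,
    // and only affects the calling process's xstate permissions.
    let ret = unsafe {
        libc::syscall(libc::SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA)
    };
    if ret == 0 {
        Ok(())
    } else {
        Err(std::io::Error::last_os_error())
    }
}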

More specifically, kvm_vcpu_ioctl_x86_get_xsave2() calls fpu_copy_guest_fpstate_to_uabi(), which actually copies the XSTATE to the user buffer. As seen in the code, the type of the user buffer is union fpregs_state and, if XSAVE is supported, struct xregs_state is used. struct xregs_state starts with the legacy FPU state, then the XSTATE header, followed by the extended state area. Previously, the entire data (not only the XSTATE header but also the legacy FPU state and the extended area) fit in 4KB, as can be seen in the above kernel log, but now it can be larger than 4KB.

A minor comment: having the as_slice() and as_mut_slice() of struct kvm_xsave2 start from self.xsave.extra could look a bit weird (from a user's perspective), though it looks natural and easy to implement to people who understand the internal implementation. The structure was already released in 0.8.1 and such a change would be breaking, so I don't think it's worth doing. Feel free to ignore this minor comment.
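
For readers without the kvm-bindings source at hand, the wrapper's shape as discussed in this thread (doc comments here are mine, not from the crate):

#[repr(C)]
pub struct kvm_xsave2 {
    /// FamStruct bookkeeping: the number of u32 entries in `xsave.extra`.
    pub len: usize,
    /// The kernel's struct kvm_xsave: a 4096-byte `region`, then the `extra`
    /// flexible array member that KVM_GET_XSAVE2 writes the past-4096 state into.
    pub xsave: kvm_xsave,
}

// `Xsave = FamStructWrapper<kvm_xsave2>`; its as_slice()/as_mut_slice() hence
// start at `self.xsave.extra`, which is the behavior noted above.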

/// ```
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn get_xsave2(&self, vm_fd: &crate::VmFd) -> Result<Xsave> {
    let xsave_size = vm_fd.xsave_size();
@zulinx86 (Contributor) commented on this diff, Mar 3, 2025

I'm not sure we really should get the required size internally.

The VMM may check KVM caps beforehand (to fail faster). In that case, the VMM can cache the XSAVE size somewhere and reuse it without calling KVM_CHECK_EXTENSION again. For more flexibility in the VMM implementation, it could be better to keep xsave_size() out of get_xsave2().

Rather, not having xsave_size() in the first place might be better. At first look, handling both cases (KVM_GET_XSAVE2 supported and not supported) looks nice. But a VMM supporting multiple kernel versions across KVM_GET_XSAVE and KVM_GET_XSAVE2 (like Firecracker) should check the availability of the new API first anyway, and then, if it's not available, fall back to KVM_GET_XSAVE. If we really want to absorb the differences between these two APIs, we should implement the fallback internally (see the sketch below).
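
The internal-fallback alternative would look roughly like this (a sketch, imports as in the earlier one; Cap::Xsave2, check_extension_int, get_xsave_into, and get_xsave2_into are illustrative names):

fn get_xsave_any(vm_fd: &VmFd, vcpu: &VcpuFd) -> kvm_ioctls::Result<Xsave> {
    // KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) returns the required size in bytes
    // when supported, and 0 when not.
    let size = vm_fd.check_extension_int(Cap::Xsave2);
    if size >= 4096 {
        // KVM_GET_XSAVE2: allocate exactly the reported size.
        let fam_len = (size as usize - 4096) / std::mem::size_of::<u32>();
        let mut xsave = Xsave::new(fam_len).expect("valid FAM length");
        vcpu.get_xsave2_into(&mut xsave)?;
        Ok(xsave)
    } else {
        // Fall back to the pre-5.17, fixed-4096-byte KVM_GET_XSAVE.
        let mut xsave = Xsave::new(0).expect("valid FAM length");
        vcpu.get_xsave_into(&mut xsave)?;
        Ok(xsave)
    }
}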

@roypat (Collaborator, Author) replied

Mhh, yeah, I think I see what you mean. In the context of get_xsave2, having xsave_size() fall back to 4096 when KVM_CAP_XSAVE2 isn't supported is fairly useless, because the KVM_GET_XSAVE2 call is going to fail anyway. So it's kinda half-baked.

The problem, API-wise, is that we cannot let the user allocate the Xsave struct, because if they allocate one that is too small, then get_xsave2 can potentially result in undefined behavior (e.g. the kernel writing past the allocated buffer). We could mark it unsafe, but that's also not very nice imo.

And yeah, I guess if all Xsave structs are created from within kvm-ioctls, then it's not even useful to expose xsave_size() as an API at all.

@zulinx86 (Contributor) replied Mar 4, 2025

At least for me, marking it unsafe isn't so bad. unsafe functions always make me (and hopefully all Rustaceans) pay the highest attention. Of course, a safe wrapper is best, but as discussed in another thread, it may be impossible to wrap this safely due to the possible race condition.

So my current humble suggestion is (see the sketch after this list):

  • Mark get_xsave2() unsafe
  • Make get_xsave2() fail early if KVM_CAP_XSAVE2 isn't supported
  • Add lower-level wrappers, like we do for check_extension():
    • Add get_xsave2_xsave(), which lets users pass a &mut Xsave.
    • Add get_xsave2_raw(), which lets users pass a pointer to struct kvm_xsave.
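
In signature form, that suggestion would look something like this (an illustrative sketch only, bodies elided with todo!()):

impl VcpuFd {
    /// Fails early if KVM_CAP_XSAVE2 isn't supported.
    ///
    /// # Safety
    /// The required buffer size can grow concurrently via arch_prctl(); see
    /// the race discussion below.
    pub unsafe fn get_xsave2(&self, vm_fd: &VmFd) -> Result<Xsave> {
        todo!()
    }

    /// Lower-level wrapper: the caller passes a pre-sized buffer.
    pub unsafe fn get_xsave2_xsave(&self, xsave: &mut Xsave) -> Result<()> {
        todo!()
    }

    /// Lowest-level wrapper: the caller passes a raw struct kvm_xsave pointer.
    pub unsafe fn get_xsave2_raw(&self, xsave: *mut kvm_xsave) -> Result<()> {
        todo!()
    }
}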

@roypat (Collaborator, Author) commented Mar 4, 2025

...

More specifically, kvm_vcpu_ioctl_x86_get_xsave2() calls fpu_copy_guest_fpstate_to_uabi() [...] Previously, the entire data (not only the XSTATE header but also the legacy FPU state and the extended area) fit in 4KB, but now it can be larger than 4KB.

...

Mhhh, so are you saying that the size returned by checking KVM_CAP_XSAVE2 can change depending on whether userspace does specific prctls? That's... worrying, because it opens us up to race conditions and essentially makes creating a safe wrapper over kvm_xsave2 pretty much impossible :/

@zulinx86 (Contributor) commented Mar 4, 2025

Mhhh, so are you saying that the size returned by checking KVM_CAP_XSAVE2 can change depending on whether userspace does specific prctls? That's... worrying, because it opens us up to race conditions and essentially makes creating a safe wrapper over kvm_xsave2 pretty much impossible :/

Yes, the required size depends on the features enabled via arch_prctl().

KVM_CAP_XSAVE2 calls xstate_required_size() with kvm_get_filtered_xcr0() as the first argument. kvm_get_filtered_xcr0() returns the supported XCR0 (XSAVE feature bits), including dynamically enabled ones, and xstate_required_size() then calculates the required size for that XCR0.

The race condition you are worried about is that another thread may call arch_prctl() between KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) and KVM_GET_XSAVE2? Yeah, such a situation can happen... It's fine for it to be marked unsafe as long as why it's unsafe is explained in the docs, IMHO. wdyt?

@roypat (Collaborator, Author) commented Mar 4, 2025

The race condition you are worried about is that another thread may call arch_prctl() between KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) and KVM_GET_XSAVE2? Yeah, such a situation can happen... It's fine for it to be marked unsafe as long as why it's unsafe is explained in the docs, IMHO. wdyt?

Yeah, it's a similar situation to the Rust standard library functions for accessing environment variables, which were made unsafe in edition 2024 because of a similar scenario. I think it actually simplifies the whole design tbh, because if the function is unsafe anyway, we might as well have the caller allocate the Xsave2 struct - and then a whole lot of the problems we talked about disappear :) (see the sketch below)
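
A sketch of that caller-allocates shape and the safety contract it would need (illustrative signature only; the actual follow-up happened in #310):

/// Sketch: the caller allocates (and is responsible for sizing) the buffer.
///
/// # Safety
///
/// `xsave` must be at least as large as the size reported by
/// `KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2)` *at the time of this call*. Another
/// thread can grow that size via `arch_prctl(ARCH_REQ_XCOMP_PERM, ...)`, so a
/// previously queried size may be stale; an undersized buffer lets the kernel
/// write past the allocation (undefined behavior).
pub unsafe fn get_xsave2(&self, xsave: &mut Xsave) -> Result<()> {
    todo!("issue KVM_GET_XSAVE2 with xsave.as_mut_fam_struct_ptr()")
}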

@zulinx86 (Contributor) commented Mar 6, 2025

@roypat Could you close this in favor of #310?

@roypat (Collaborator, Author) commented Mar 6, 2025

@roypat Could you close this in favor of #310?

Yes, thank you so much for taking this over!

roypat closed this Mar 6, 2025