-
Notifications
You must be signed in to change notification settings - Fork 117
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] Add support for KVM_GET_XSAVE2
ioctls
#261
Conversation
1422be7
to
d9a5bf6
Compare
Since Linux 5.17, the `kvm_xsave` struct has a flexible array member (FAM) at the end, which can be retrieved using the `KVM_GET_XSAVE2` ioctl [1]. What makes this FAM special is that the length is not stored in the header, but has to be retrieved via `KVM_CHECK_CAPABILITY(KVM_CAP_XSAVE2)`, which returns the total size of the `kvm_xsave` struct (e.g. the traditional 4096 byte header + the size of the FAM). Thus, to support `KVM_GET_XSAVE2`, we first need to check the capability on the VM fd, construct a FamStructWrapper of the correct size, and then call `KVM_GET_XSAVE2`. [1]: https://www.kernel.org/doc/html/latest/virt/kvm/api.html#kvm-get-xsave2 Signed-off-by: Patrick Roy <[email protected]>
cc @likebreath since iirc you were looking into a post-5.17 solution for kvm_xsave, too |
FWIW
to be precise, the traditional 4096 bytes is not just a header, rather was large enough to save all the states of XSAVE feature components. Intel AMX (Advanced Matrix Extensions) was introduced in Intel Sapphire Rapids. Since the feature is for AI / deep learning workloads. Specifically, XTILEDATA, one of bits for Intel AMX, requires larger size to save (8KB as seen in the following log), so it's disabled by default. Instead, kernel allows userspace to enable it dynamically via arch_prctl().
See also https://elixir.bootlin.com/linux/v6.13.5/source/arch/x86/include/uapi/asm/kvm.h#L381. More specifically, kvm_vcpu_ioctl_x86_get_xsave2() calls
|
/// ``` | ||
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))] | ||
pub fn get_xsave2(&self, vm_fd: &crate::VmFd) -> Result<Xsave> { | ||
let xsave_size = vm_fd.xsave_size(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if we really should get the required size internally.
VMM may check KVM caps beforehand (to fail faster). In that case, VMM can cache the XSAVE size somewhere and reuse it without calling KVM_CHECK_EXTENSIONS. For more flexibility of the VMM implementation, it could be better to keep xsave_size()
out of get_xsave2()
.
Rather, not having xsave_size()
in the first place might be better. In the first look, handling both cases (KVM_GET_XSAVE2 supported and not supported) looks nice. But a VMM supporting multiple kernel versions supporting KVM_GET_XSAVE and KVM_GET_XSAVE2 (like firecracker) should check the availability of the new API first any way, and then, if not available, it should fall back to KVM_GET_XSAVE. If we really want to absorb the differences of these two APIs, we should implement the fallback part internally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mhh, yeah, I think I see what you mean. In the context of get_xsave2
, having the fallback of xsave_size()
to be 4096 when CAP_XSAVE2 isnt supported it fairly useless, because the KVM_GET_XSAVE2 call is gonna fail anyway. So it's kinda half-baked.
The problem API wise is that we cannot let the user allocate the Xsave struct, because if they allocate a too-small one, then get_xsave2 will result in undefined behavior potentially (e.g. the kernel writing past the allocated buffer). We could mark it unsafe, but that's also not very nice imo.
And yeah, I guess if all Xsave structs are created from within kvm-ioctls, then its not even useful to expose xsave_size() as an API altogether.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least for me, marking it unsafe
isn't so bad. unsafe
marked functions always make me (hopefully all Rustaceans) pay my highest attention to it. Of course, safe wrapper is the best, but as being discussed in another thread, it may be impossible to wrap it safely due to the possible race condition.
So my current humble suggestion is:
- Mark
get_xsave2()
unsafe - Make
get_xsave2()
fail early if KVM_CAP_XSAVE2 isn't supported - Add lower wrappers like we do for
check_extension()
- Add
get_xsave2_xsave()
which allows users to pass&mut Xsave
. - Add
get_xsave2_raw()
which allows users to pass a pointer tostruct kvm_xsave
.
- Add
Mhhh, so are you saying that the size as returned by checking KVM_CAP_XSAVE2 can change depending on whether userspace does specific prctls? That's... worrying, because it opens us up to race conditions and essentially makes creating a safe wrapper over |
yes, the required size depends on features enabled via arch_prctl(). KVM_CAP_XSAVE2 calls xstate_required_size() with kvm_get_filtered_xcr0() as the first arg. kvm_get_filtered_xcr0() returns supported XCR0 (XSAVE feature bits) including dynamically enabled ones. and then, xstate_required_size() calculates required size for a given supported XCR0. The race condition that you are worried is like another thread may call arch_prctl() between KVM_CHECK_EXTENSIONS(KVM_CAP_XSAVE2) and KVM_GET_XSAVE2()? Yeah, such a situation can happen... It's fine to be marked |
Yeah, it's a similar situation to the rust standard library functions for accessing environment variables, which were made unsafe in edition 2024 because of a similar scenario. I think it actually simplifies the entire scenario tbh, because if the function is unsafe anyway, we might as well have the caller allocate the |
Since Linux 5.17, the
kvm_xsave
struct has a flexible array member (FAM) at the end, which can be retrieved using theKVM_GET_XSAVE2
ioctl [1]. What makes this FAM special is that the length is not stored in the header, but has to be retrieved viaKVM_CHECK_CAPABILITY(KVM_CAP_XSAVE2)
, which returns the total size of thekvm_xsave
struct (e.g. the traditional 4096 byte header + the size of the FAM). Thus, to supportKVM_GET_XSAVE2
, we first need to check the capability on the VM fd, construct a FamStructWrapper of the correct size, and then callKVM_GET_XSAVE2
.I'm marking this as "RFC" because I'm not quite sure how to best deal the combination of "check capability on VM FD" and "do ioctl on vcpu FD". In this patch, I simply add a
VmFd
parameter toVcpu::get_xsave2
, but that's kinda ugly. Alternatively, we could store a reference to the VM file descriptor inside eachVcpu
structure, but that'd be a larger change. What do people think about this?(also, this PR needs rust-vmm/kvm-bindings#104, because Rust apparently only type checks type aliases at use-sites instead of definition-site, meaning I did not notice that I forgot to implementDefault
forkvm_xsave2
when working on the new kvm-bindings release. That's why the CI is failing here for now)Requirements
Before submitting your PR, please make sure you addressed the following
requirements:
git commit -s
), and the commit message has max 60 characters for thesummary and max 75 characters for each description line.
test.
Release" section of CHANGELOG.md (if no such section exists, please create one).
unsafe
code is properly documented.