diff --git a/doc/content/lib/_index.md b/doc/content/lib/_index.md index a0592427b0b..cc9f7eef275 100644 --- a/doc/content/lib/_index.md +++ b/doc/content/lib/_index.md @@ -1,5 +1,4 @@ --- title: Libraries -hidden: true --- {{% children description=true %}} \ No newline at end of file diff --git a/doc/content/lib/xen/_index.md b/doc/content/lib/xen/_index.md new file mode 100644 index 00000000000..cb63f39892b --- /dev/null +++ b/doc/content/lib/xen/_index.md @@ -0,0 +1,5 @@ +--- +title: Xen +description: Insights into Xen hypercall functions exposed to the toolstack +--- +{{% children description=true %}} \ No newline at end of file diff --git a/doc/content/lib/xen/get_free_buddy-flowchart.md b/doc/content/lib/xen/get_free_buddy-flowchart.md new file mode 100644 index 00000000000..9e609e87bfe --- /dev/null +++ b/doc/content/lib/xen/get_free_buddy-flowchart.md @@ -0,0 +1,68 @@ +--- +title: Flowchart of get_free_buddy() of the Xen Buddy allocator +hidden: true +--- +```mermaid +flowchart TD + +alloc_round_robin + --No free memory on the host--> + Failure + +node_affinity_exact + --No free memory
on the Domain's + node_affinity nodes:
Abort exact allocation--> + Failure + +get_free_buddy["get_free_buddy()"] + -->MEMF_node{memflags
&
MEMF_node?} + --Yes--> + try_MEMF_node{Alloc + from + node} + --Success: page--> + Success + try_MEMF_node + --No free memory on the node--> + MEMF_exact{memflags + & + MEMF_exact?} + --"No"--> + node_affinity_set{NUMA affinity set?} + -- Domain->node_affinity + is not set: Fall back to + round-robin allocation + --> alloc_round_robin + + MEMF_exact + --Yes: + As there is not enough + free memory on the + exact NUMA node(s): + Abort exact allocation + -->Failure + + MEMF_node + --No NUMA node in memflags--> + node_affinity_set{domain->
node_affinity
set?} + --Set--> + node_affinity{Alloc from
node_affinity
nodes} + --No free memory on + the node_affinity nodes + Check if exact request--> + node_affinity_exact{memflags
&
MEMF_exact?} + --Not exact: Fall back to
round-robin allocation-->
+ alloc_round_robin
+
+ node_affinity--Success: page-->Success
+
+ alloc_round_robin{" Fall back to
+ round-robin
+ allocation"}
+ --Success: page-->
+ Success(Success: Return the page)
+
+click get_free_buddy
+"https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L855-L1116
+" _blank
+```
diff --git a/doc/content/lib/xen/get_free_buddy.md b/doc/content/lib/xen/get_free_buddy.md new file mode 100644 index 00000000000..49c57b98078 --- /dev/null +++ b/doc/content/lib/xen/get_free_buddy.md @@ -0,0 +1,85 @@
+---
+title: get_free_buddy()
+description: Find free memory based on the given flags and, optionally, a domain
+mermaid:
+  force: true
+---
+## Overview
+
+[get_free_buddy()](https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L855-L1116) is
+[called](https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L1005)
+from [alloc_heap_pages()](xc_domain_populate_physmap#alloc_heap_pages)
+to find a page at the most suitable place for a memory allocation.
+
+It finds memory depending on the given flags and domain:
+
+- Optionally prefer allocating from a passed NUMA node
+- Optionally allocate from the domain's next affine NUMA node (round-robin)
+- Optionally return if the preferred NUMA allocation did not succeed
+- Optionally allocate from not-yet scrubbed memory
+- Optionally allocate from the given range of memory zones
+- Fall back to allocating from the next NUMA node on the system (round-robin)
+
+## Input parameters
+
+- `struct domain`
+- Zones to allocate from (`zone_hi` down to `zone_lo`)
+- Page order (size of the page)
+  - populate_physmap() starts with 1GB pages and falls back to 2MB and 4k pages.
+
+## Allocation strategy
+
+Its first attempt is to find a page of matching page order
+on the requested NUMA node(s).
+
+If this is not successful, it tries to break higher page orders,
+and if that fails too, it lowers the zone until it reaches `zone_lo`.
+
+It does not attempt to use unscrubbed pages, but when `memflags`
+contain `MEMF_no_scrub`, it uses `check_and_stop_scrub(pg)` on 4k
+pages to prevent breaking higher-order pages instead.
+
+If this fails, it checks whether other NUMA nodes should be tried.
+
+### Exact NUMA allocation (on request, e.g. for vNUMA)
+
+For example, for vNUMA domains, the calling functions pass one specific
+NUMA node and also set `MEMF_exact_node` to make sure that
+memory is allocated only from this NUMA node.
+
+If no NUMA node was passed or the allocation from it failed, and
+`MEMF_exact_node` was not set in `memflags`, the function falls
+back to the first fallback, NUMA-affine allocation.
+
+### NUMA-affine allocation
+
+For local NUMA memory allocation, the domain should have one or more NUMA nodes
+in its `struct domain->node_affinity` field when this function is called.
+
+This happens as part of
+[NUMA placement](../../../xenopsd/walkthroughs/VM.build/Domain.build/#numa-placement),
+which writes the planned vCPU affinity of the domain's vCPUs to the XenStore.
+[xenguest](../../../xenopsd/walkthroughs/VM.build/xenguest) reads it to
+update the vCPU affinities of the domain's vCPUs in Xen, which in turn, by
+default (when `domain->auto_node_affinity` is active), also updates the
+`struct domain->node_affinity` field.
+
+Note: In case it contains multiple
+NUMA nodes, this step allocates from the next NUMA node after the one
+the domain previously allocated from, in a round-robin way.
+ +Otherwise, the function falls back to host-wide round-robin allocation. + +### Host-wide round-robin allocation + +When the domain's `node_affinity` is not defined or did not succeed +and `MEMF_exact_node` was not passed in `memflags`, all remaining +NUMA nodes are attempted in a round-robin way: Each subsequent call +uses the next NUMA node after the previous node that the domain +allocated memory from. + +## Flowchart + +This flowchart shows an overview of the decision chain of `get_free_buddy()` + +{{% include "get_free_buddy-flowchart.md" %}} diff --git a/doc/content/lib/xen/populate_physmap-chart.md b/doc/content/lib/xen/populate_physmap-chart.md new file mode 100644 index 00000000000..41cea76c08c --- /dev/null +++ b/doc/content/lib/xen/populate_physmap-chart.md @@ -0,0 +1,53 @@ +--- +title: Simplified flowchart of populate_physmap() +hidden: true +--- + +```mermaid +flowchart LR + +subgraph hypercall handlers + populate_physmap("populate_physmap() + One call for each memory + range (extent)") +end + + +subgraph "Xen buddy allocator:" + + populate_physmap + --> alloc_domheap_pages("alloc_domheap_pages() + Assign allocated pages to + the domain") + + alloc_domheap_pages + --> alloc_heap_pages("alloc_heap_pages() + If needed: split high-order + pages into smaller buddies, + and scrub dirty pages") + --> get_free_buddy("get_free_buddy() + If reqested: Allocate from a + preferred/exact NUMA node + and/or from + unscrubbed memory + ") + +end + +click populate_physmap +"https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L159-L314 +" _blank + +click alloc_domheap_pages +"https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L2641-L2697 +" _blank + +click get_free_buddy +"https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L855-L958 +" _blank + +click alloc_heap_pages +"https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L967-L1116 +" _blank + +``` diff --git a/doc/content/lib/xen/populate_physmap-dataflow.md b/doc/content/lib/xen/populate_physmap-dataflow.md new file mode 100644 index 00000000000..2bff9071cf0 --- /dev/null +++ b/doc/content/lib/xen/populate_physmap-dataflow.md @@ -0,0 +1,124 @@ +--- +title: Flowchart for the populate_physmap hypercall +hidden: true +--- +```mermaid +flowchart TD + +subgraph XenCtrl +xc_domain_populate_physmap["xc_domain_populate_physmap()"] +xc_domain_populate_physmap_exact["xc_domain_populate_physmap_exact()"] +end + +subgraph Xen + +%% sub-subgraph from memory_op() to populate_node() and back + +xc_domain_populate_physmap & xc_domain_populate_physmap_exact +<--reservation,
and for preempt:
nr_start/nr_done--> +memory_op("memory_op(XENMEM_populate_physmap)") + +memory_op + --struct xen_memory_reservation--> + construct_memop_from_reservation("construct_memop_from_reservation()") + --struct
xen_memory_reservation->mem_flags--> + propagate_node("propagate_node()") + --_struct
memop_args->memflags_--> + construct_memop_from_reservation + --_struct memop_args_--> +memory_op<--struct memop_args *: + struct domain *, + List of extent base addrs, + Number of extents, + Size of each extent (extent_order), + Allocation flags(memflags)--> + populate_physmap[["populate_physmap()"]] + <-.domain, extent base addrs, extent size, memflags, nr_start and nr_done.-> + populate_physmap_loop--if memflags & MEMF_populate_on_demand -->guest_physmap_mark_populate_on_demand(" + guest_physmap_mark_populate_on_demand()") + populate_physmap_loop@{ label: "While extents to populate, + and not asked to preempt, + for each extent left to do:", shape: notch-pent } + --domain, order, memflags--> + alloc_domheap_pages("alloc_domheap_pages()") + --zone_lo, zone_hi, order, memflags, domain--> + alloc_heap_pages + --zone_lo, zone_hi, order, memflags, domain--> + get_free_buddy("get_free_buddy()") + --_page_info_ + -->alloc_heap_pages + --if no page--> + no_scrub("get_free_buddy(MEMF_no_scrub) + (honored only when order==0)") + --_dirty 4k page_ + -->alloc_heap_pages + <--_dirty 4k page_--> + scrub_one_page("scrub_one_page()") + alloc_heap_pages("alloc_heap_pages() + (also splits higher-order pages + into smaller buddies if needed)") + --_page_info_ + -->alloc_domheap_pages + --page_info, order, domain, memflags-->assign_page("assign_page()") + assign_page + --page_info, nr_mfns, domain, memflags--> + assign_pages("assign_pages()") + --domain, nr_mfns--> + domain_adjust_tot_pages("domain_adjust_tot_pages()") + alloc_domheap_pages + --_page_info_--> + populate_physmap_loop + --page(gpfn, mfn, extent_order)--> + guest_physmap_add_page("guest_physmap_add_page()") + +populate_physmap--nr_done, preempted-->memory_op +end + +click memory_op +"https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L1409-L1425 +" _blank + +click construct_memop_from_reservation +"https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L1022-L1071 +" _blank + +click propagate_node +"https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L524-L547 +" _blank + +click populate_physmap +"https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L159-L314 +" _blank + +click populate_physmap_loop +"https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L197-L304 +" _blank + +click guest_physmap_mark_populate_on_demand +"https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L210-220 +" _blank + +click guest_physmap_add_page +"https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L296 +" _blank + +click alloc_domheap_pages +"https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L2641-L2697 +" _blank + +click alloc_heap_pages +"https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L967-L1116 +" _blank + +click get_free_buddy +"https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L855-L958 +" _blank + +click assign_page +"https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L2540-L2633 +" _blank + +click assign_pages +"https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L2635-L2639 +" _blank +``` diff --git a/doc/content/lib/xenctrl/struct/xen_memory_reservation.md b/doc/content/lib/xenctrl/struct/xen_memory_reservation.md new file mode 100644 index 00000000000..b9772ed3520 --- /dev/null +++ b/doc/content/lib/xenctrl/struct/xen_memory_reservation.md @@ -0,0 +1,53 @@ +--- +title: xen_memory_reservation +description: 
xen_memory_reservation for memory-related hypercalls +hidden: true +--- +[struct xen_memory_reservation](https://github.com/xen-project/xen/blob/96970b46/xen/include/public/memory.h#L46-80) +is used by +[these XENMEM hypercall commands](https://github.com/xen-project/xen/blob/96970b46/xen/include/public/memory.h#L48-59): + +- `XENMEM_increase_reservation`: Returns the first MFN of the allocated extents +- `XENMEM_decrease_reservation`: To pass the first GPFN of extents to free +- [XENMEM_populate_physmap](../xc_domain_populate_physmap): + - In: To pass the first GPFN to populate with memory + - Out: Returns the first GMFN base of extents that were allocated + (NB. This command also updates the mach_to_phys translation table) +- `XENMEM_claim_pages`: Not used, must be passed as 0 + (This is explicitly checked: Otherwise, it returns `-EINVAL`) + +[struct xen_memory_reservation](https://github.com/xen-project/xen/blob/96970b46/xen/include/public/memory.h#L46-80) +is defined as: + +```c +struct xen_memory_reservation { + XEN_GUEST_HANDLE(xen_pfn_t) extent_start; /* PFN of the starting extent */ + xen_ulong_t nr_extents; /* number of extents of size extent_order */ + unsigned int extent_order; /* an order of 0 means: 4k pages, 1: 8k, etc. */ + unsigned int mem_flags; + domid_t domid; /* integer ID of the domain */ +}; +``` + +The `mem_flags` bit field is accessed using: + +```js +/* + * Maximum # bits addressable by the user of the allocated region (e.g., I/O + * devices often have a 32-bit limitation even in 64-bit systems). If zero + * then the user has no addressing restriction. This field is not used by + * XENMEM_decrease_reservation. + */ +#define XENMEMF_address_bits(x) (x) +#define XENMEMF_get_address_bits(x) ((x) & 0xffu) +/* NUMA node to allocate from. */ +#define XENMEMF_node(x) (((x) + 1) << 8) +#define XENMEMF_get_node(x) ((((x) >> 8) - 1) & 0xffu) +/* Flag to populate physmap with populate-on-demand entries */ +#define XENMEMF_populate_on_demand (1<<16) +/* Flag to request allocation only from the node specified */ +#define XENMEMF_exact_node_request (1<<17) +#define XENMEMF_exact_node(n) (XENMEMF_node(n) | XENMEMF_exact_node_request) +/* Flag to indicate the node specified is virtual node */ +#define XENMEMF_vnode (1<<18) +``` diff --git a/doc/content/lib/xenctrl/xc_domain_claim_pages.md b/doc/content/lib/xenctrl/xc_domain_claim_pages.md index 7f72f01342c..805d6c34248 100644 --- a/doc/content/lib/xenctrl/xc_domain_claim_pages.md +++ b/doc/content/lib/xenctrl/xc_domain_claim_pages.md @@ -13,11 +13,91 @@ The domain can still attempt to allocate beyond the claim, but those are not guaranteed to be successful and will fail if the domain's memory reaches it's `max_mem` value. -Each domain can only have one claim, and the domid is the key of the claim. -By killing the domain, the claim is also released. +The aim is to attempt to stake a claim for a domain on a quantity of pages +of system RAM, but _not_ assign specific page frames. +It performs only arithmetic so the hypercall is very fast and not be preempted. +Thereby, it sidesteps any time-of-check-time-of-use races for memory allocation. -Depending on the given size argument, the remaining stack of the domain -can be set initially, updated to the given amount, or reset to no claim (0). +## Usage notes + +`xc_domain_claim_pages()` returns 0 if the Xen page allocator has atomically +and successfully claimed the requested number of pages, else non-zero. 
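+
+For illustration, here is a minimal sketch of how a toolstack component could
+stake a claim through `libxenctrl` and react to a failed claim. Only the
+`xc_domain_claim_pages()` call itself is the API described here; the helper
+function, its policy, and the page count are hypothetical:
+
+```c
+#include <stdio.h>
+#include <xenctrl.h>
+
+/* Sketch: stake a claim for a domain's boot memory before building it.
+ * Returns 0 when the claim was staked; non-zero means the caller must
+ * back off (e.g. retry later or choose a different placement). */
+static int stake_boot_claim(xc_interface *xch, uint32_t domid,
+                            unsigned long nr_pages)
+{
+    int rc = xc_domain_claim_pages(xch, domid, nr_pages);
+
+    if (rc)
+        fprintf(stderr, "claim of %lu pages for dom%u failed: %d\n",
+                nr_pages, domid, rc);
+    return rc;
+}
+
+/* A claim of 0 pages resets any outstanding claim and always succeeds:
+ * xc_domain_claim_pages(xch, domid, 0); */
+```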
+ +> [!info] +> Errors returned by `xc_domain_claim_pages()` must be handled as they are a +normal result of the `xenopsd` thread-pool claiming and starting many VMs +in parallel during a boot storm scenario. + +> [!warning] +> This is especially important when staking claims on NUMA nodes using an updated +version of this function. In this case, the only options of the calling worker +thread would be to adapt to the NUMA boot storm: +Attempt to find a different NUMA node for claiming the memory and try again. + +## Design notes + +For reference, see the hypercall comment Xen hypervisor header: +[xen/include/public/memory.h](https://github.com/xen-project/xen/blob/e7e0d485/xen/include/public/memory.h#L552-L578) + +The design of the backing hypercall in Xen is as follows: + +- A domain can only have one claim. +- Subsequent calls to the backing hypercalls update claim. +- The domain ID is the key of the claim. +- By killing the domain, the claim is also released. +- When sufficient memory has been allocated to resolve the claim, + the claim silently expires. +- Depending on the given size argument, the remaining stack of the domain + can be set initially, updated to the given amount, or reset. +- Claiming zero pages effectively resets any outstanding claim and + is always successful. + +### Users + +To set up the boot memory of new domain, the [libxenguest](../xenguest) function +[xc_dom_boot_mem_init](../xenguest/xc_dom_boot_mem_init) relies on this design: + +The [memory setup](../../../xenopsd/walkthroughs/VM.build/xenguest/setup_mem) +of `xenguest` and by the `xl` CLI using `libxl` use it. +`libxl` actively uses it to stack an initial claim before allocating memory. + +> [!note] +> The functions +> [meminit_hvm()](https://github.com/xen-project/xen/blob/39c45c/tools/libs/guest/xg_dom_x86.c#L1348-L1648) +> and `meminit_pv` used by +> [xc_dom_boot_mem_init()](../xenguest/xc_dom_boot_mem_init) +> always destroy any remaining claim before they return. +> Thus, after [libxenguest](../xenguest) has completed allocating and populating +> the physical memory of the domain, no domain has a remaining claim! + +> [!warning] +> From this point onwards, no memory allocation is guaranteed! +> While swapping memory between domains can be expected to always succeed, +> Allocating a new page after freeing another can always fail unless a +> privileged domain stakes a claim for such allocations beforehand! + +### Implementation + +The Xen upstream memory claims code is implemented to work as follows: + +When allocating memory, while a domain has an outstanding claim, +Xen updates remaining claim accordingly. + +If a domain has a stake from claim, when memory is freed, +freed amount of memory increases the stake of memory. + +> [!note] Note: This detail may have to change for implementing NUMA claims +> Xen doesn't know if a page was allocated using a NUMA node claim. +> Therefore, it cannot know if it would be legal to increment a stake a NUMA +> node claim when freeing pages. + +> [!info] +> The smallest possible change to achieve a similar effect would be +> to add a field to the Xen hypervisor's domain struct. +> It would be used for remembering the last total NUMA node claim. +> With it, freeing memory from a NUMA node could attempt to increase +> the outstanding claim on the claimed NUMA node, but only given this +> amount is available and not claimed. ## Management of claims @@ -60,8 +140,7 @@ and is always successful. 
> If an allocation would cause the domain to exceed it's `max_mem` > value, it will always fail. - -## Implementation +## Hypercall API Function signature of the libXenCtrl function to call the Xen hypercall: @@ -80,76 +159,82 @@ struct xen_memory_reservation { }; ``` -### Concurrency - -Xen protects the consistency of the stake of the domain -using the domain's `page_alloc_lock` and the global `heap_lock` of Xen. -Thse spin-locks prevent any "time-of-check-time-of-use" races. -As the hypercall needs to take those spin-locks, it cannot be preempted. - -### Return value - -The call returns 0 if the hypercall successfully claimed the requested amount -of memory, else it returns non-zero. - ## Current users +It is used by [libxenguest](../xenguest), which is used at least by: +- [xenguest](../../../xenopsd/walkthroughs/VM.build/xenguest) +- `libxl`/the `xl` CLI. + ### libxl and the xl CLI -If the `struct xc_dom_image` passed by `libxl` to the -[libxenguest](https://github.com/xen-project/xen/tree/master/tools/libs/guest) -functions -[meminit_hvm()](https://github.com/xen-project/xen/blob/de0254b9/tools/libs/guest/xg_dom_x86.c#L1348-L1649) -and -[meminit_pv()](https://github.com/xen-project/xen/blob/de0254b9/tools/libs/guest/xg_dom_x86.c#L1183-L1333) -has it's `claim_enabled` field set, they, -before allocating the domain's system memory using the allocation function -[xc_populate_physmap()](https://github.com/xen-project/xen/blob/de0254b9/xen/common/memory.c#L159-L314) which calls the hypercall to allocate and populate -the domain's main system memory, will attempt to claim the to-be allocated -memory using a call to `xc_domain_claim_pages()`. -In case this fails, they do not attempt to continue and return the error code -of `xc_domain_claim_pages()`. - -Both functions also (unconditionally) reset the claim upon return. - -But, the `xl` CLI uses this functionality (unless disabled in `xl.conf`) -to make building the domains fail to prevent running out of memory inside -the `meminit_hvm` and `meminit_pv` calls. -Instead, they immediately return an error. +The `xl` cli uses claims actively. By default, it enables `libxl` to pass +the `struct xc_dom_image` with its `claim_enabled` field set. + +The functions dispatched by [xc_boot_mem_init()](../xenguest/xc_dom_boot_mem_init) +then attempt to claim the boot using `xc_domain_claim_pages()`. +They also (unconditionally) destroy any open claim upon return. This means that in case the claim fails, `xl` avoids: - The effort of allocating the memory, thereby not blocking it for other domains. - The effort of potentially needing to scrub the memory after the build failure. -### xenguest +## Updates: Improved NUMA memory allocation + +### Enablement of a NUMA node claim instead of a host-wide claim -While [xenguest](../../../xenopsd/walkthroughs/VM.build/xenguest) calls the -[libxenguest](https://github.com/xen-project/xen/tree/master/tools/libs/guest) -functions -[meminit_hvm()](https://github.com/xen-project/xen/blob/de0254b9/tools/libs/guest/xg_dom_x86.c#L1348-L1649) -and -[meminit_pv()](https://github.com/xen-project/xen/blob/de0254b9/tools/libs/guest/xg_dom_x86.c#L1183-L1333) -like `libxl` does, it does not set -[struct xc_dom_image.claim_enabled](https://github.com/xen-project/xen/blob/de0254b9/tools/include/xenguest.h#L186), -so it does not enable the first call to `xc_domain_claim_pages()` -which would claim the amount of memory that these functions will -attempt to allocate and populate for the domain. 
+With a +[proposed update](https://lore.kernel.org/xen-devel/20250314172502.53498-1-alejandro.vallejo@cloud.com/T/#mc2b0d3a8b994708da1ac199df3262c912c559efd) +of `xc_domain_claim_pages()` for NUMA node claims, a node argument is added. -#### Future design ideas for improved NUMA support +It can be `XC_NUMA_NO_NODE` for defining a host-wide claim or a NUMA node +for staking a claim on one NUMA node. + +This update does not change the foundational design of memory claims of +the Xen hypervisor where a claim is defined as a single claim for the domain. For improved support for [NUMA](../../../toolstack/features/NUMA/), `xenopsd` -may want to call an updated version of this function for the domain, so it has -a stake on the NUMA node's memory before `xenguest` will allocate for the domain -before assigning an NUMA node to a new domain. - -Further, as PV drivers `unmap` and `free` memory for grant tables to Xen and -then re-allocate memory for those grant tables, `xenopsd` may want to try to -stake a very small claim for the domain on the NUMA node of the domain so that -Xen can increase this claim when the PV drivers `free` this memory and re-use -the resulting claimed amount for allocating the grant tables. This would ensure -that the grant tables are then allocated on the local NUMA node of the domain, -avoiding remote memory accesses when accessing the grant tables from inside -the domain. +is updated to call an updated version of this function for the domain. + +It reserves NUMA node memory before `xenguest` is called, and a new `pnode` +argument is added to `xenguest`. + +It sets the NUMA node for memory allocations by the xenguest +[boot memory setup](../../../xenopsd/walkthroughs/VM.build/xenguest/setup_mem) +[xc_boot_mem_init()](../xenguest/xc_dom_boot_mem_init) which also +sets the `exact` flag. + +The `exact` flag forces [get_free_buddy()](../xen/get_free_buddy) to fail +if it could not find scrubbed memory for the give `pnode` which causes +[alloc_heap_pages()](./xc_domain_populate_physmap#alloc_heap_pages) +to re-run [get_free_buddy()](../xen/get_free_buddy) with the flag to +fall back to dirty, not yet scrubbed memory. + +[alloc_heap_pages()](./xc_domain_populate_physmap#alloc_heap_pages) then +checks each 4k page for the need to scrub and scrubs those before returning them. + +This is expected to improve the issue of potentially spreading memory over +all NUMA nodes in case of parallel boot by preventing one NUMA node to become +the target of parallel memory allocations which cannot fit on it and also the +issue of spreading memory over all NUMA nodes if case of a huge amount of +dirty memory that needs to be scrubbed for before a domain can start or restart. + +### NUMA grant tables and ballooning + +This is not seen as an issue as grant tables and I/O pages are usually not as +frequently used as regular system memory of domains, but this potential issue +remains: + +In discussions, it was said that Windows PV drivers `unmap` and `free` memory +for grant tables to Xen and then re-allocate memory for those grant tables. + +`xenopsd` may want to try to stake a very small claim for the domain on the +NUMA node of the domain so that Xen can increase this claim when the PV drivers +`free` this memory and re-use the resulting claimed amount for allocating +the grant tables. + +This would ensure that the grant tables are then allocated on the local NUMA +node of the domain, avoiding remote memory accesses when accessing the grant +tables from inside the domain. 
Note: In case the corresponding backend process in Dom0 is running on another NUMA node, it would access the domain's grant tables from a remote NUMA node, diff --git a/doc/content/lib/xenctrl/xc_domain_populate_physmap.md b/doc/content/lib/xenctrl/xc_domain_populate_physmap.md new file mode 100644 index 00000000000..5a795659c95 --- /dev/null +++ b/doc/content/lib/xenctrl/xc_domain_populate_physmap.md @@ -0,0 +1,104 @@ +--- +title: xc_domain_populate_physmap() +description: Populate a Xen domain's physical memory map +mermaid: + force: true +--- +## Overview + +[xenguest](../../../xenopsd/walkthroughs/VM.build/xenguest) uses +`xc_domain_populate_physmap()` to populate a Xen domain's physical memory map: +Both call the `XENMEM_populate_physmap` hypercall. + +`xc_domain_populate_physmap_exact()` also sets the "exact" flag +for allocating memory only on the given NUMA node. +This is a very simplified overview of the hypercall: + +{{% include "../xen/populate_physmap-chart.md" %}} + +### memory_op(XENMEM_populate_physmap) + +It calls +[construct_memop_from_reservation()](https://github.com/xen-project/xen/blob/39c45c/xen/common/memory.c#L1022-L1071) +to convert the arguments for allocating a page from +[struct xen_memory_reservation](https://github.com/xen-project/xen/blob/master/xen/include/public/memory.h#L46-L80) +to `struct memop_args` and +[passes](https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L1459) +it to +[populate_physmap()](https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L159-L314): + +### construct_memop_from_reservation() + +It populates `struct memop_args` using the +[hypercall arguments](struct/xen_memory_reservation). It: + +- Copies `extent_start`, `nr_extents`, and `extent_order`. +- Populates `memop_args->memflags` using the given `mem_flags`. + +#### Converting a vNODE to a pNODE for vNUMA + +When a vNUMA vnode is passed using `XENMEMF_vnode`, and `domain->vnuma` +and `domain->vnuma->nr_vnodes` are set, and the +`vnode` (virtual NUMA node) maps to a `pnode` (physical NUMA node), it also: + +- Populates the `pnode` in the `memflags` of the `struct memop_args` +- and sets a `XENMEMF_exact_node_request` in them as well. + +#### propagate_node() + +If no vNUMA node is passed, `construct_memop_from_reservation()` +[calls](https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L1067) +[propagate_node()](https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L524-L547) +to propagate the NUMA node and `XENMEMF_exact_node_request` for use in Xen. + +### populate_physmap() + +It handles hypercall preemption and resumption after a preemption, keeps track of +the already populated pages. + +For each range (extent), it runs an iteration of the +[allocation loop](https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L197). +It +[passes](https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L275) +the + +- `struct domain` +- page order +- count remaining pages populate +- and the converted `memflags` + +to +[alloc_domheap_pages()](https://github.com/xen-project/xen/blob/91772f84/xen/common/page_alloc.c#L2640-L2696): + +### alloc_domheap_pages() + +It +[calls](https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L2673) +[alloc_heap_pages()](https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L967-L1116) +and on success, assigns the allocated pages to the domain. 
+
+### alloc_heap_pages()
+
+It
+[calls](https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L1005)
+get_free_buddy() to allocate a page at the most suitable place.
+When no pages of the requested size are free,
+it splits larger superpages into pages of the requested size.
+
+### get_free_buddy()
+
+It finds memory based on the flags and domain and returns its `struct page_info`:
+
+- Optionally prefer allocating from a passed NUMA node
+- Optionally allocate from the domain's next affine NUMA node (round-robin)
+- Optionally return if the preferred NUMA allocation did not succeed
+- Optionally allocate from not-yet scrubbed memory
+- Optionally allocate from the given range of memory zones
+- Fall back to allocating from the next NUMA node on the system (round-robin)
+
+For details, see
+[get_free_buddy()](../xen/get_free_buddy), which finds memory based on the flags and domain.
+
+## Full flowchart
+
+{{% include "../xen/populate_physmap-dataflow.md" %}}
diff --git a/doc/content/lib/xenctrl/xc_vcpu_setaffinity.md b/doc/content/lib/xenctrl/xc_vcpu_setaffinity.md index 8586492d9cc..d389ccdc9da 100644 --- a/doc/content/lib/xenctrl/xc_vcpu_setaffinity.md +++ b/doc/content/lib/xenctrl/xc_vcpu_setaffinity.md @@ -25,9 +25,13 @@ In the Xen hypervisor, each vCPU has: Hard affinity is currently not used for NUMA placement, but can be configured manually for a given domain, either using `xe VCPUs-params:mask=` or the API. - For example, the vCPU’s pinning can be configured using a template with: + For example, the vCPU’s pinning can be configured for a VM with:[^1] + [^1]: The VM parameter + [VCPUs-params:mask](https://docs.xenserver.com/en-us/citrix-hypervisor/command-line-interface.html#vm-parameters) + is documented in the official XAPI user documentation. + ```py - xe template-param-set uuid= vCPUs-params:mask=1,2,3 + xe vm-param-set uuid= vCPUs-params:mask=1,2,3 ``` There are also host-level `guest_VCPUs_params` which are used by
diff --git a/doc/content/lib/xenguest/_index.md b/doc/content/lib/xenguest/_index.md new file mode 100644 index 00000000000..556a6e12948 --- /dev/null +++ b/doc/content/lib/xenguest/_index.md @@ -0,0 +1,32 @@
+---
+title: libxenguest
+description: Xen Guest library for building Xen Guest domains
+---
+## Introduction
+
+`libxenguest` is a C library provided with the Xen Hypervisor for use in Dom0.
+
+For example, it is used as the low-level interface for building Xen Guest domains.
+
+Its source is located in the folder
+[tools/libs/guest](https://github.com/xen-project/xen/tree/master/tools/libs/guest)
+of the Xen repository.
+
+## Responsibilities
+
+### Allocating the boot memory for new & migrated VMs
+
+One important responsibility of `libxenguest` is creating the memory layout
+of new and migrated VMs.
+
+The [boot memory setup](../../../xenopsd/walkthroughs/VM.build/xenguest/setup_mem)
+of `xenguest` and `libxl` (used by the `xl` CLI command) call
+[xc_dom_boot_mem_init()](xc_dom_boot_mem_init), which dispatches the
+call to
+[meminit_hvm()](https://github.com/xen-project/xen/blob/de0254b9/tools/libs/guest/xg_dom_x86.c#L1348-L1649)
+and
+[meminit_pv()](https://github.com/xen-project/xen/blob/de0254b9/tools/libs/guest/xg_dom_x86.c#L1183-L1333), which lay out, allocate, and populate the boot memory of domains.
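+
+In essence, `xc_dom_boot_mem_init()` is a thin dispatcher: it invokes the
+`meminit` hook of the domain image's architecture hooks, which resolves to
+`meminit_hvm()` or `meminit_pv()` depending on the guest type. A simplified
+sketch of that shape (error handling and tracing omitted; not the verbatim
+libxenguest code):
+
+```c
+/* Simplified sketch: dispatch the boot memory setup to the
+ * guest-type specific meminit hook of the domain image. */
+int xc_dom_boot_mem_init(struct xc_dom_image *dom)
+{
+    return dom->arch_hooks->meminit(dom); /* meminit_hvm() / meminit_pv() */
+}
+```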
+ +## Functions + +{{% children description=true %}} \ No newline at end of file diff --git a/doc/content/lib/xenguest/boot_mem_init-chart.md b/doc/content/lib/xenguest/boot_mem_init-chart.md new file mode 100644 index 00000000000..b49d7a55d3a --- /dev/null +++ b/doc/content/lib/xenguest/boot_mem_init-chart.md @@ -0,0 +1,44 @@ +--- +title: Simple Flowchart of xc_dom_boot_mem_init() +hidden: true +--- +```mermaid +flowchart LR + +subgraph libxl / xl CLI + libxl__build_dom("libxl__build_dom()") +end + +subgraph xenguest + hvm_build_setup_mem("hvm_build_setup_mem()") +end + +subgraph libxenctrl + xc_domain_populate_physmap("One call for each memory range (extent): + xc_domain_populate_physmap() + xc_domain_populate_physmap() + xc_domain_populate_physmap()") +end + +subgraph libxenguest + + hvm_build_setup_mem & libxl__build_dom + --> xc_dom_boot_mem_init("xc_dom_boot_mem_init()") + + xc_dom_boot_mem_init + --> meminit_hvm("meminit_hvm()") & meminit_pv("meminit_pv()") + --> xc_domain_populate_physmap +end + +click xc_dom_boot_mem_init +"https://github.com/xen-project/xen/blob/39c45c/tools/libs/guest/xg_dom_boot.c#L110-L126 +" _blank + +click meminit_hvm +"https://github.com/xen-project/xen/blob/39c45c/tools/libs/guest/xg_dom_x86.c#L1348-L1648 +" _blank + +click meminit_pv +"https://github.com/xen-project/xen/blob/de0254b9/tools/libs/guest/xg_dom_x86.c#L1183-L1333 +" _blank +``` diff --git a/doc/content/lib/xenguest/xc_dom_boot_mem_init.md b/doc/content/lib/xenguest/xc_dom_boot_mem_init.md new file mode 100644 index 00000000000..ddba647cb13 --- /dev/null +++ b/doc/content/lib/xenguest/xc_dom_boot_mem_init.md @@ -0,0 +1,40 @@ +--- +title: xc_dom_boot_mem_init() +description: VM boot memory setup by calling meminit_hvm() or meminit_pv() +mermaid: + force: true +--- +## VM boot memory setup + +[xenguest's](../../xenopsd/walkthroughs/VM.build/xenguest/_index.md) +`hvm_build_setup_mem()` and `libxl` and the `xl` CLI call +[xc_dom_boot_mem_init()](https://github.com/xen-project/xen/blob/39c45c/tools/libs/guest/xg_dom_boot.c#L110-L126) +to allocate and populate the domain's system memory for booting it: + +{{% include "boot_mem_init-chart.md" %}} + +The allocation strategies of them called functions are: + +### Strategy of the libxenguest meminit functions + +- Attempt to allocate 1GB superpages when possible +- Fall back to 2MB pages when 1GB allocation failed +- Fall back to 4k pages when both failed + +They use +[xc_domain_populate_physmap()](../xenctrl/xc_domain_populate_physmap.md) +to perform memory allocation and to map the allocated memory +to the system RAM ranges of the domain. + +### Strategy of xc_domain_populate_physmap() + +[xc_domain_populate_physmap()](../xenctrl/xc_domain_populate_physmap.md) +calls the `XENMEM_populate_physmap` command of the Xen memory hypercall. + + +For a more detailed walk-through of the inner workings of this hypercall, +see the reference on +[xc_domain_populate_physmap()](../xenctrl/xc_domain_populate_physmap). 
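+
+To make the page-order fallback concrete, the following sketch populates one
+guest memory range with decreasing page orders via `libxenctrl`. It is not the
+actual `meminit` code: the GFN layout, the divisibility shortcut, and the
+memory flags are illustrative only.
+
+```c
+#include <stdlib.h>
+#include <xenctrl.h>
+
+#define ORDER_1G 18 /* 2^18 4k pages = 1 GiB */
+#define ORDER_2M  9 /* 2^9  4k pages = 2 MiB */
+
+/* Populate [gfn, gfn + count) with the largest page order that succeeds. */
+static int populate_range(xc_interface *xch, uint32_t domid,
+                          xen_pfn_t gfn, unsigned long count)
+{
+    const unsigned int orders[] = { ORDER_1G, ORDER_2M, 0 };
+
+    for (unsigned int i = 0; i < 3; i++) {
+        unsigned long pages_per_extent = 1UL << orders[i];
+
+        if (count % pages_per_extent)
+            continue; /* shortcut: only use orders that divide the range */
+
+        unsigned long nr_extents = count / pages_per_extent;
+        xen_pfn_t *extents = malloc(nr_extents * sizeof(*extents));
+        int rc;
+
+        if (!extents)
+            return -1;
+        for (unsigned long j = 0; j < nr_extents; j++)
+            extents[j] = gfn + j * pages_per_extent; /* base GFN per extent */
+
+        rc = xc_domain_populate_physmap_exact(xch, domid, nr_extents,
+                                              orders[i], 0 /* mem_flags */,
+                                              extents);
+        free(extents);
+        if (rc == 0)
+            return 0; /* allocated with this page order */
+        /* otherwise fall back to the next smaller page order */
+    }
+    return -1;
+}
+```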
+ +For more details on the VM build step involving `xenguest` and Xen side see: +https://wiki.xenproject.org/wiki/Walkthrough:_VM_build_using_xenguest diff --git a/doc/content/xenopsd/walkthroughs/VM.build/VM_build-chart.md b/doc/content/xenopsd/walkthroughs/VM.build/VM_build-chart.md index eec1f05fc0e..9c7c0ee9184 100644 --- a/doc/content/xenopsd/walkthroughs/VM.build/VM_build-chart.md +++ b/doc/content/xenopsd/walkthroughs/VM.build/VM_build-chart.md @@ -6,13 +6,18 @@ weight: 10 --- ```mermaid -flowchart -subgraph xenopsd VM_build[xenopsd: VM_build micro#8209;op] -direction LR -VM_build --> VM.build -VM.build --> VM.build_domain -VM.build_domain --> VM.build_domain_exn -VM.build_domain_exn --> Domain.build +flowchart LR + +subgraph xenopsd: VM_build micro-op + direction LR + + VM_build(VM_build) + --> VM.build(VM.build) + --> VM.build_domain(VM.build_domain) + --> VM.build_domain_exn(VM.build_domain_exn) + --> Domain.build(Domain.build) +end + click VM_build " https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/lib/xenops_server.ml#L2255-L2271" _blank click VM.build " @@ -23,5 +28,4 @@ click VM.build_domain_exn " https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/xc/xenops_server_xen.ml#L2024-L2248" _blank click Domain.build " https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/xc/domain.ml#L1111-L1210" _blank -end ``` diff --git a/doc/content/xenopsd/walkthroughs/VM.build/xenguest.md b/doc/content/xenopsd/walkthroughs/VM.build/xenguest.md deleted file mode 100644 index 70908d556fb..00000000000 --- a/doc/content/xenopsd/walkthroughs/VM.build/xenguest.md +++ /dev/null @@ -1,185 +0,0 @@ ---- -title: xenguest -description: - "Perform building VMs: Allocate and populate the domain's system memory." ---- -As part of starting a new domain in VM_build, `xenopsd` calls `xenguest`. -When multiple domain build threads run in parallel, -also multiple instances of `xenguest` also run in parallel: - -```mermaid -flowchart -subgraph xenopsd VM_build[xenopsd VM_build micro#8209;ops] -direction LR -xenopsd1[Domain.build - Thread #1] --> xenguest1[xenguest #1] -xenopsd2[Domain.build - Thread #2] --> xenguest2[xenguest #2] -xenguest1 --> libxenguest -xenguest2 --> libxenguest2[libxenguest] -click xenopsd1 "../Domain.build/index.html" -click xenopsd2 "../Domain.build/index.html" -click xenguest1 "https://github.com/xenserver/xen.pg/blob/XS-8/patches/xenguest.patch" _blank -click xenguest2 "https://github.com/xenserver/xen.pg/blob/XS-8/patches/xenguest.patch" _blank -click libxenguest "https://github.com/xen-project/xen/tree/master/tools/libs/guest" _blank -click libxenguest2 "https://github.com/xen-project/xen/tree/master/tools/libs/guest" _blank -libxenguest --> Xen[Xen
Hypervisor] -libxenguest2 --> Xen -end -``` - -## About xenguest - -`xenguest` is called by the xenopsd [Domain.build](Domain.build) function -to perform the build phase for new VMs, which is part of the `xenopsd` -[VM.start operation](VM.start). - -[xenguest](https://github.com/xenserver/xen.pg/blob/XS-8/patches/xenguest.patch) -was created as a separate program due to issues with -`libxenguest`: - -- It wasn't threadsafe: fixed, but it still uses a per-call global struct -- It had an incompatible licence, but now licensed under the LGPL. - -Those were fixed, but we still shell out to `xenguest`, which is currently -carried in the patch queue for the Xen hypervisor packages, but could become -an individual package once planned changes to the Xen hypercalls are stabilised. - -Over time, `xenguest` has evolved to build more of the initial domain state. - -## Interface to xenguest - -```mermaid -flowchart -subgraph xenopsd VM_build[xenopsd VM_build micro#8209;op] -direction TB -mode -domid -memmax -Xenstore -end -mode[--mode build_hvm] --> xenguest -domid --> xenguest -memmax --> xenguest -Xenstore[Xenstore platform data] --> xenguest -``` - -`xenopsd` must pass this information to `xenguest` to build a VM: - -- The domain type to build for (HVM, PHV or PV). - - It is passed using the command line option `--mode hvm_build`. -- The `domid` of the created empty domain, -- The amount of system memory of the domain, -- A number of other parameters that are domain-specific. - -`xenopsd` uses the Xenstore to provide platform data: - -- the vCPU affinity -- the vCPU credit2 weight/cap parameters -- whether the NX bit is exposed -- whether the viridian CPUID leaf is exposed -- whether the system has PAE or not -- whether the system has ACPI or not -- whether the system has nested HVM or not -- whether the system has an HPET or not - -When called to build a domain, `xenguest` reads those and builds the VM accordingly. - -## Walkthrough of the xenguest build mode - -```mermaid -flowchart -subgraph xenguest[xenguest #8209;#8209;mode hvm_build domid] -direction LR -stub_xc_hvm_build[stub_xc_hvm_build#40;#41;] --> get_flags[ - get_flags#40;#41; <#8209; Xenstore platform data -] -stub_xc_hvm_build --> configure_vcpus[ - configure_vcpus#40;#41; #8209;> Xen hypercall -] -stub_xc_hvm_build --> setup_mem[ - setup_mem#40;#41; #8209;> Xen hypercalls to setup domain memory -] -end -``` - -Based on the given domain type, the `xenguest` program calls dedicated -functions for the build process of the given domain type. - -These are: - -- `stub_xc_hvm_build()` for HVM, -- `stub_xc_pvh_build()` for PVH, and -- `stub_xc_pv_build()` for PV domains. - -These domain build functions call these functions: - -1. `get_flags()` to get the platform data from the Xenstore -2. `configure_vcpus()` which uses the platform data from the Xenstore to configure vCPU affinity and the credit scheduler parameters vCPU weight and vCPU cap (max % pCPU time for throttling) -3. The `setup_mem` function for the given VM type. - -## The function hvm_build_setup_mem() - -For HVM domains, `hvm_build_setup_mem()` is responsible for deriving the memory -layout of the new domain, allocating the required memory and populating for the -new domain. It must: - -1. Derive the `e820` memory layout of the system memory of the domain - including memory holes depending on PCI passthrough and vGPU flags. -2. Load the BIOS/UEFI firmware images -3. Store the final MMIO hole parameters in the Xenstore -4. 
Call the `libxenguest` function `xc_dom_boot_mem_init()` (see below) -5. Call `construct_cpuid_policy()` to apply the CPUID `featureset` policy - -## The function xc_dom_boot_mem_init() - -```mermaid -flowchart LR -subgraph xenguest -hvm_build_setup_mem[hvm_build_setup_mem#40;#41;] -end -subgraph libxenguest -hvm_build_setup_mem --> xc_dom_boot_mem_init[xc_dom_boot_mem_init#40;#41;] -xc_dom_boot_mem_init -->|vmemranges| meminit_hvm[meninit_hvm#40;#41;] -click xc_dom_boot_mem_init "https://github.com/xen-project/xen/blob/39c45c/tools/libs/guest/xg_dom_boot.c#L110-L126" _blank -click meminit_hvm "https://github.com/xen-project/xen/blob/39c45c/tools/libs/guest/xg_dom_x86.c#L1348-L1648" _blank -end -``` - -`hvm_build_setup_mem()` calls -[xc_dom_boot_mem_init()](https://github.com/xen-project/xen/blob/39c45c/tools/libs/guest/xg_dom_boot.c#L110-L126) -to allocate and populate the domain's system memory. - -It calls -[meminit_hvm()](https://github.com/xen-project/xen/blob/39c45c/tools/libs/guest/xg_dom_x86.c#L1348-L1648) -to loop over the `vmemranges` of the domain for mapping the system RAM -of the guest from the Xen hypervisor heap. Its goals are: - -- Attempt to allocate 1GB superpages when possible -- Fall back to 2MB pages when 1GB allocation failed -- Fall back to 4k pages when both failed - -It uses the hypercall -[XENMEM_populate_physmap](https://github.com/xen-project/xen/blob/39c45c/xen/common/memory.c#L1408-L1477) -to perform memory allocation and to map the allocated memory -to the system RAM ranges of the domain. - -https://github.com/xen-project/xen/blob/39c45c/xen/common/memory.c#L1022-L1071 - -`XENMEM_populate_physmap`: - -1. Uses - [construct_memop_from_reservation](https://github.com/xen-project/xen/blob/39c45c/xen/common/memory.c#L1022-L1071) - to convert the arguments for allocating a page from - [struct xen_memory_reservation](https://github.com/xen-project/xen/blob/master/xen/include/public/memory.h#L46-L80) - to `struct memop_args`. -2. Sets flags and calls functions according to the arguments -3. Allocates the requested page at the most suitable place - - depending on passed flags, allocate on a specific NUMA node - - else, if the domain has node affinity, on the affine nodes - - also in the most suitable memory zone within the NUMA node -4. Falls back to less desirable places if this fails - - or fail for "exact" allocation requests -5. When no pages of the requested size are free, - it splits larger superpages into pages of the requested size. - -For more details on the VM build step involving `xenguest` and Xen side see: -https://wiki.xenproject.org/wiki/Walkthrough:_VM_build_using_xenguest diff --git a/doc/content/xenopsd/walkthroughs/VM.build/xenguest/_index.md b/doc/content/xenopsd/walkthroughs/VM.build/xenguest/_index.md new file mode 100644 index 00000000000..f94079628c3 --- /dev/null +++ b/doc/content/xenopsd/walkthroughs/VM.build/xenguest/_index.md @@ -0,0 +1,59 @@ +--- +title: xenguest +description: + "Perform building VMs: Allocate and populate the domain's system memory." 
+mermaid: + force: true +--- +## Introduction + +`xenguest` is called by the xenopsd [Domain.build](../Domain.build) function +to perform the build phase for new VMs, which is part of the `xenopsd` +[VM.build](../../VM.build) micro-op: + +{{% include "VM_build-chart.md" %}} + +[Domain.build](../Domain.build) calls `xenguest` (during boot storms, +many run in parallel to accelerate boot storm completion), and during +[migration](../../VM.migrate.md), `emu-manager` also calls `xenguest`: + +```mermaid +flowchart +subgraph "xenopsd & emu-manager call xenguest:" +direction LR +xenopsd1(Domain.build for VM #1) --> xenguest1(xenguest for #1) +xenopsd2(emu-manager for VM #2) --> xenguest2(xenguest for #2) +xenguest1 --> libxenguest(libxenguest) +xenguest2 --> libxenguest2(libxenguest) +click xenopsd1 "../Domain.build/index.html" +click xenopsd2 "../Domain.build/index.html" +click xenguest1 "https://github.com/xenserver/xen.pg/blob/XS-8/patches/xenguest.patch" _blank +click xenguest2 "https://github.com/xenserver/xen.pg/blob/XS-8/patches/xenguest.patch" _blank +click libxenguest "https://github.com/xen-project/xen/tree/master/tools/libs/guest" _blank +click libxenguest2 "https://github.com/xen-project/xen/tree/master/tools/libs/guest" _blank +libxenguest --> Xen(Xen
Hypercalls,

e.g.:

XENMEM

populate

physmap) +libxenguest2 --> Xen +end +``` + +## Historical heritage + +[xenguest](https://github.com/xenserver/xen.pg/blob/XS-8/patches/xenguest.patch) +was created as a separate program due to issues with +`libxenguest`: + +- It wasn't threadsafe: fixed, but it still uses a per-call global struct +- It had an incompatible licence, but now licensed under the LGPL. + +Those were fixed, but we still shell out to `xenguest`, which is currently +carried in the patch queue for the Xen hypervisor packages, but could become +an individual package once planned changes to the Xen hypercalls are stabilised. + +Over time, `xenguest` evolved to build more of the initial domain state. + +## Details + +The details the the invocation of xenguest, the build modes +and the VM memory setup are described in these child pages: + +{{% children description=true %}} \ No newline at end of file diff --git a/doc/content/xenopsd/walkthroughs/VM.build/xenguest/build_modes.md b/doc/content/xenopsd/walkthroughs/VM.build/xenguest/build_modes.md new file mode 100644 index 00000000000..1dd7ea9efd7 --- /dev/null +++ b/doc/content/xenopsd/walkthroughs/VM.build/xenguest/build_modes.md @@ -0,0 +1,101 @@ +--- +title: Build Modes +description: Description of the xenguest build modes (HVM, PVH, PV) with focus on HVM +weight: 20 +mermaid: + force: true +--- +## Invocation of the HVM build mode + +{{% include "mode_vm_build.md" %}} + +## Walk-through of the HVM build mode + +The domain build functions +[stub_xc_hvm_build()](https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L2329-L2436) +and stub_xc_pv_build() call these functions: + +- [get_flags()](https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L1164-L1288) + to get the platform data from the Xenstore + for filling out the fields of `struct flags` and `struct xc_dom_image`. +- [configure_vcpus()](https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L1297) + which uses the platform data from the Xenstore: + - When `platform/vcpu//affinity` is set: set the vCPU affinity. + By default, this sets the domain's `node_affinity` mask (NUMA nodes) as well. + This configures + [`get_free_buddy()`](https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L855-L958) + to prefer memory allocations from this NUMA node_affinity mask. + - If `platform/vcpu/weight` is set, the domain's scheduling weight + - If `platform/vcpu/cap` is set, the domain's scheduling cap (%cpu time) +- [xc_dom_boot_mem_init()](https://github.com/xen-project/xen/blob/39c45c/tools/libs/guest/xg_dom_boot.c#L110-L126) + to call `_build_setup_mem()`, + +Call graph of +[do_hvm_build()](https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L596-L615) +with emphasis on information flow: + +{{% include "do_hvm_build" %}} + +## The function hvm_build_setup_mem() + +For HVM domains, `hvm_build_setup_mem()` is responsible for deriving the memory +layout of the new domain, allocating the required memory and populating for the +new domain. It must: + +1. Derive the `e820` memory layout of the system memory of the domain + including memory holes depending on PCI passthrough and vGPU flags. +2. Load the BIOS/UEFI firmware images +3. Store the final MMIO hole parameters in the Xenstore +4. Call the `libxenguest` function `xc_dom_boot_mem_init()` (see below) +5. Call `construct_cpuid_policy()` to apply the CPUID `featureset` policy + +It starts this by: +- Getting `struct xc_dom_image`, `max_mem_mib`, and `max_start_mib`. 
+- Calculating start and size of lower ranges of the domain's memory maps + - taking memory holes for I/O into account, e.g. `mmio_size` and `mmio_start`. +- Calculating `lowmem_end` and `highmem_end`. + +It then calls `xc_dom_boot_mem_init()`: + +## The function xc_dom_boot_mem_init() + +`hvm_build_setup_mem()` calls +[xc_dom_boot_mem_init()](https://github.com/xen-project/xen/blob/39c45c/tools/libs/guest/xg_dom_boot.c#L110-L126) +to allocate and populate the domain's system memory: + +```mermaid +flowchart LR +subgraph xenguest +hvm_build_setup_mem[hvm_build_setup_mem#40;#41;] +end +subgraph libxenguest +hvm_build_setup_mem --vmemranges--> xc_dom_boot_mem_init[xc_dom_boot_mem_init#40;#41;] +xc_dom_boot_mem_init -->|vmemranges| meminit_hvm[meninit_hvm#40;#41;] +click xc_dom_boot_mem_init "https://github.com/xen-project/xen/blob/39c45c/tools/libs/guest/xg_dom_boot.c#L110-L126" _blank +click meminit_hvm "https://github.com/xen-project/xen/blob/39c45c/tools/libs/guest/xg_dom_x86.c#L1348-L1648" _blank +end +``` + +Except error handling and tracing, it only is a wrapper to call the +architecture-specific `meminit()` hook for the domain type: + +```c +rc = dom->arch_hooks->meminit(dom); +``` + +For HVM domains, it calls +[meminit_hvm()](https://github.com/xen-project/xen/blob/39c45c/tools/libs/guest/xg_dom_x86.c#L1348-L1648) +to loop over the `vmemranges` of the domain for mapping the system RAM +of the guest from the Xen hypervisor heap. Its goals are: + +- Attempt to allocate 1GB superpages when possible +- Fall back to 2MB pages when 1GB allocation failed +- Fall back to 4k pages when both failed + +It uses +[xc_domain_populate_physmap()](../../../../../lib/xenctrl/xc_domain_populate_physmap.md) +to perform memory allocation and to map the allocated memory +to the system RAM ranges of the domain. + +For more details on the VM build step involving `xenguest` and Xen side see: +https://wiki.xenproject.org/wiki/Walkthrough:_VM_build_using_xenguest diff --git a/doc/content/xenopsd/walkthroughs/VM.build/xenguest/do_hvm_build.md b/doc/content/xenopsd/walkthroughs/VM.build/xenguest/do_hvm_build.md new file mode 100644 index 00000000000..07bf9c0d067 --- /dev/null +++ b/doc/content/xenopsd/walkthroughs/VM.build/xenguest/do_hvm_build.md @@ -0,0 +1,78 @@ +--- +title: Call graph of xenguest/do_hvm_build() +description: Call graph of xenguest/do_hvm_build() with emphasis on information flow +hidden: true +--- +```mermaid +flowchart TD + +do_hvm_build("do_hvm_build() for HVM") + --> stub_xc_hvm_build("stub_xc_hvm_build()") + +get_flags("get_flags()") --"VM platform_data from XenStore" + --> stub_xc_hvm_build + +stub_xc_hvm_build + --> configure_vcpus("configure_vcpus()") + +configure_vcpus --"When
platform/ + vcpu/%d/affinity
is set" + --> xc_vcpu_setaffinity + +configure_vcpus --"When
platform/ + vcpu/cap
or + vcpu/weight
is set" + --> xc_sched_credit_domain_set + +stub_xc_hvm_build + --"struct xc_dom_image, mem_start_mib, mem_max_mib" + --> hvm_build_setup_mem("hvm_build_setup_mem()") + -- "struct xc_dom_image + with + optional vmemranges" + --> xc_dom_boot_mem_init + +subgraph libxenguest + xc_dom_boot_mem_init("xc_dom_boot_mem_init()") + -- "struct xc_dom_image + with + optional vmemranges" --> + meminit_hvm("meminit_hvm()") + -- page_size(1GB,2M,4k, memflags: e.g. exact) --> + xc_domain_populate_physmap("xc_domain_populate_physmap()") +end + +subgraph direct xenguest hypercalls + xc_vcpu_setaffinity("xc_vcpu_setaffinity()") + --> vcpu_set_affinity("vcpu_set_affinity()") + --> domain_update_node_aff("domain_update_node_aff()") + -- "if auto_node_affinity + is on (default)"--> auto_node_affinity(Update dom->node_affinity) + + xc_sched_credit_domain_set("xc_sched_credit_domain_set()") +end + +click do_hvm_build +"https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L596-L615" _blank +click xc_vcpu_setaffinity "../../../../../lib/xenctrl/xc_vcpu_setaffinity/index.html" _blank +click vcpu_set_affinity +"https://github.com/xen-project/xen/blob/e16acd806/xen/common/sched/core.c#L1353-L1393" _blank +click domain_update_node_aff +"https://github.com/xen-project/xen/blob/e16acd806/xen/common/sched/core.c#L1809-L1876" _blank +click stub_xc_hvm_build +"https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L2329-L2436" _blank +click hvm_build_setup_mem +"https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L2002-L2219" _blank +click get_flags +"https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L1164-L1288" _blank +click configure_vcpus +"https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L1297" _blank +click xc_dom_boot_mem_init +"https://github.com/xen-project/xen/blob/e16acd806/tools/libs/guest/xg_dom_boot.c#L110-L125" +click meminit_hvm +"https://github.com/xen-project/xen/blob/e16acd806/tools/libs/guest/xg_dom_x86.c#L1348-L1648" +click xc_domain_populate_physmap +"../../../../../lib/xenctrl/xc_domain_populate_physmap/index.html" _blank +click auto_node_affinity +"../../../../../lib/xenctrl/xc_domain_node_setaffinity/index.html#flowchart-in-relation-to-xc_set_vcpu_affinity" _blank +``` diff --git a/doc/content/xenopsd/walkthroughs/VM.build/xenguest/invoke.md b/doc/content/xenopsd/walkthroughs/VM.build/xenguest/invoke.md new file mode 100644 index 00000000000..88511eb022f --- /dev/null +++ b/doc/content/xenopsd/walkthroughs/VM.build/xenguest/invoke.md @@ -0,0 +1,34 @@ +--- +title: Invocation +description: Invocation of xenguest and the interfaces used for it +weight: 10 +mermaid: + force: true +--- +## Interface to xenguest + +[xenopsd](../../../) passes this information to [xenguest](index.html) +(for [migration](../../VM.migrate.md), using `emu-manager`): + +- The domain type using the command line option `--mode _build`. +- The `domid` of the created empty domain, +- The amount of system memory of the domain, +- A number of other parameters that are domain-specific. 
+ +`xenopsd` uses the Xenstore to provide platform data: + +- in case the domain has a [VCPUs-mask](../../../../lib/xenctrl/xc_vcpu_setaffinity.md), + the statically configured vCPU hard-affinity +- the vCPU credit2 weight/cap parameters +- whether the NX bit is exposed +- whether the viridian CPUID leaf is exposed +- whether the system has PAE or not +- whether the system has ACPI or not +- whether the system has nested HVM or not +- whether the system has an HPET or not + +When called to build a domain, `xenguest` reads those and builds the VM accordingly. + +## Parameters of the VM build modes + +{{% include "mode_vm_build.md" %}} diff --git a/doc/content/xenopsd/walkthroughs/VM.build/xenguest/mode_vm_build.md b/doc/content/xenopsd/walkthroughs/VM.build/xenguest/mode_vm_build.md new file mode 100644 index 00000000000..e8a659d56f3 --- /dev/null +++ b/doc/content/xenopsd/walkthroughs/VM.build/xenguest/mode_vm_build.md @@ -0,0 +1,40 @@ +--- +hidden: true +title: Call graph to the xenguest hvm/pvh/pv build functions +description: Call graph of xenguest for calling the hvm/pvh/pv build functions +--- +```mermaid +flowchart LR + +xenguest_main(" + xenguest + --mode hvm_build + / + --mode pvh_build + / + --mode pv_build ++ +domid +mem_max_mib +mem_start_mib +image +store_port +store_domid +console_port +console_domid") + --> do_hvm_build("do_hvm_build() for HVM + ") & do_pvh_build("do_pvh_build() for PVH") + --> stub_xc_hvm_build("stub_xc_hvm_build()") + +xenguest_main --> do_pv_build(do_pvh_build for PV) --> + stub_xc_pv_build("stub_xc_pv_build()") + +click do_pv_build +"https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L575-L594" _blank +click do_hvm_build +"https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L596-L615" _blank +click do_pvh_build +"https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L617-L640" _blank +click stub_xc_hvm_build +"https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L2329-L2436" _blank +``` diff --git a/doc/content/xenopsd/walkthroughs/VM.build/xenguest/setup_mem.md b/doc/content/xenopsd/walkthroughs/VM.build/xenguest/setup_mem.md new file mode 100644 index 00000000000..81cbe41d968 --- /dev/null +++ b/doc/content/xenopsd/walkthroughs/VM.build/xenguest/setup_mem.md @@ -0,0 +1,40 @@ +--- +title: Memory Setup +description: Creation and allocation of the boot memory layout of VMs +weight: 30 +mermaid: + force: true +--- +## HVM boot memory setup + +For HVM domains, `hvm_build_setup_mem()` is responsible for deriving the memory +layout of the new domain, allocating the required memory and populating for the +new domain. It must: + +1. Derive the `e820` memory layout of the system memory of the domain + including memory holes depending on PCI passthrough and vGPU flags. +2. Load the BIOS/UEFI firmware images +3. Store the final MMIO hole parameters in the Xenstore +4. Call the `libxenguest` function `xc_dom_boot_mem_init()` (see below) +5. Call `construct_cpuid_policy()` to apply the CPUID `featureset` policy + +It starts this by: +- Getting `struct xc_dom_image`, `max_mem_mib`, and `max_start_mib`. +- Calculating start and size of lower ranges of the domain's memory maps + - taking memory holes for I/O into account, e.g. `mmio_size` and `mmio_start`. +- Calculating `lowmem_end` and `highmem_end`. 
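+
+The core of this calculation is the split of guest RAM around the MMIO hole
+below 4 GiB, as mentioned in the list above. The sketch below illustrates the
+idea with simplified types and without the PCI passthrough / vGPU adjustments;
+it is not the actual xenguest code:
+
+```c
+#include <stdint.h>
+
+#define GIB (1ULL << 30)
+
+/* Illustrative split of a domain's RAM into a low range below the MMIO hole
+ * and a high range starting at 4 GiB (all values in bytes). */
+static void split_memory_map(uint64_t ram_size, uint64_t mmio_start,
+                             uint64_t *lowmem_end, uint64_t *highmem_end)
+{
+    if (ram_size <= mmio_start) {
+        /* Everything fits below the MMIO hole: no high memory range. */
+        *lowmem_end  = ram_size;
+        *highmem_end = 0;
+    } else {
+        /* RAM that would overlap the hole is relocated above 4 GiB. */
+        *lowmem_end  = mmio_start;
+        *highmem_end = 4 * GIB + (ram_size - mmio_start);
+    }
+}
+```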
+
+## Calling into libxenguest for the bootmem setup
+
+`hvm_build_setup_mem()` then calls the [libxenguest](../../../../lib/xenguest/)
+function
+[xc_dom_boot_mem_init()](../../../../lib/xenguest/xc_dom_boot_mem_init.md)
+to set up the boot memory of domains.
+
+The `xl` CLI also uses it to set up the boot memory of domains.
+It constructs the memory layout of the domain and allocates and populates
+the main system memory of the domain using calls to
+[xc_domain_populate_physmap()](../../../../lib/xenctrl/xc_domain_populate_physmap.md).
+
+For more details on the VM build step involving `xenguest` and the Xen side, see:
+https://wiki.xenproject.org/wiki/Walkthrough:_VM_build_using_xenguest