Hello, it’s been a while since I last blogged: I’ve had a busy few years, took a sabbatical, and there was nothing to blog about without breaching an NDA. Anyway, I’m back with a new issue which may help those of you who are virtualizing latency-sensitive workloads, a niche world, I know…
Back to the issue at hand. One of the things a VM latency sensitivity setting of High does is grant each vCPU exclusive affinity (ExAff) to a physical core. In the past this only worked when the physical cores were actually available to be fully reserved, and it failed silently when they weren’t. KB339990 mentions this was fixed by explicitly requiring a 100% CPU reservation for VMs with LS=HIGH since VM HW version 14, I believe. However, this doesn’t always work, and I’ll try to explain when.
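For reference, this is roughly how LS=HIGH plus a full CPU reservation is applied through the vSphere API. A hedged pyVmomi sketch only: it assumes you already hold a vim.VirtualMachine object, and make_latency_sensitive and reservation_mhz are names I made up for illustration.

from pyVmomi import vim

# Sketch: reconfigure an existing VM (a vim.VirtualMachine object you
# have already looked up) for LS=HIGH with a full CPU reservation.
# 'reservation_mhz' is a placeholder: a 100% reservation means
# (number of vCPUs) x (physical core frequency in MHz).
def make_latency_sensitive(vm, reservation_mhz):
    spec = vim.vm.ConfigSpec()
    spec.latencySensitivity = vim.LatencySensitivity(level='high')
    spec.cpuAllocation = vim.ResourceAllocationInfo(
        reservation=reservation_mhz)
    return vm.ReconfigVM_Task(spec)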
It is important to note that LS=HIGH attempts to collocate all of a VM’s vCPUs and memory on a single NUMA node.
And this is the problem: the CPU reservation check works at a global level in ESXi and does not take an individual NUMA node’s capacity into account for VMs with NUMA affinity.
Therefore you can end up in a situation where a VM with LS=HIGH and a 100% CPU reservation is scheduled over-provisioned, meaning several of its vCPUs share a single pCPU. That’s right! The scheduler won’t create a wide VM (a VM stretched across multiple NUMA nodes) with full cores even if they are available, but rather over-provisions the VM. Both scenarios are bad, but over-provisioning is worse in my opinion.
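To make the mismatch concrete, here is a minimal Python sketch of the two checks. This is purely a toy model of the behavior I observe, not ESXi code, and both function names are mine:

def passes_global_admission(total_pcpus, reserved_pcpus, vcpus):
    # What ESXi appears to enforce for LS=HIGH: enough unreserved
    # CPU on the host as a whole.
    return total_pcpus - reserved_pcpus >= vcpus

def fits_on_single_numa_node(free_cores_per_node, vcpus):
    # What ExAff placement actually needs: one NUMA node with enough
    # free cores to grant every vCPU its own physical core.
    return any(free >= vcpus for free in free_cores_per_node)

# A VM can pass the first check and still fail the second, and that is
# exactly the case where it gets over-provisioned instead of rejected.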
Ideally, I would prefer such a VM to fail to power on, or for the user to get a choice of what to do in such a scenario: for example, a warning indicating that there are no suitable resources left on the host, and DRS being able to schedule the VM on another host within the cluster. The current situation forces customers to manage LS=HIGH workloads manually, or to use complex scripts for placement (see the sketch below), as DRS is not good at handling this on its own.
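As an example of the kind of scripting this pushes onto the user, below is a rough pyVmomi sketch that compares a VM’s vCPU count against the NUMA node sizes of its host before power-on. The vCenter address, credentials and find_vm helper are placeholders of mine, and note it only checks raw node size: which cores are already exclusively reserved is not, as far as I know, exposed through this API, so treat it as a starting point, not a complete placement check.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

def find_vm(content, name):
    # Walk the inventory for a VM by name.
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    try:
        return next(v for v in view.view if v.name == name)
    finally:
        view.DestroyView()

si = SmartConnect(host='vcenter.example.com',              # placeholder
                  user='administrator@vsphere.local',      # placeholder
                  pwd='***',
                  sslContext=ssl._create_unverified_context())  # lab only
try:
    content = si.RetrieveContent()
    vm = find_vm(content, 'brick3')
    host = vm.runtime.host
    numa = host.hardware.numaInfo   # per-node topology, incl. cpuID lists
    vcpus = vm.config.hardware.numCPU
    fits = any(len(node.cpuID) >= vcpus for node in (numa.numaNode or []))
    print(f'{vm.name}: {vcpus} vCPUs, single-NUMA-node fit possible: {fits}')
finally:
    Disconnect(si)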
Here is my proof:
Test setup:
- ESXi 8.0U3 (also tested on ESXi 7.0U3)
- 2x CPUs, 16 cores each
- 2 NUMA nodes
- 3x VMs, 9 vCPUs each, LS=HIGH and 100% CPU reservation
Below you can see the output of vcpu_affinity_info.sh, a handy script by Valentin Bondzio:
[root@localhost:~] ./vcpu_affinity_info.sh
CID=2100566 GID=21068 LWID=2100656 Name=brick1
Group CPU Affinity: guest worlds:0-31 non-guest worlds:0-31
Latency Sensitivity: -1
NUMA client 0: affinity: 0x00000003 home: 0x00000001
vcpuId   vcpu#  pcpu#  affinityMode  softAffinity  Affinity  ExAff
2100656  0      16     2 -> sched    16            0-31      yes
2100658  1      26     2 -> sched    26            0-31      yes
2100659  2      22     2 -> sched    22            0-31      yes
2100660  3      18     2 -> sched    18            0-31      yes
2100661  4      23     2 -> sched    23            0-31      yes
2100662  5      21     2 -> sched    21            0-31      yes
2100663  6      19     2 -> sched    19            0-31      yes
2100664  7      20     2 -> sched    20            0-31      yes
2100665  8      30     2 -> sched    30            0-31      yes

CID=2100679 GID=21728 LWID=2100768 Name=brick2
Group CPU Affinity: guest worlds:0-31 non-guest worlds:0-31
Latency Sensitivity: -1
NUMA client 0: affinity: 0x00000003 home: 0x00000000
vcpuId   vcpu#  pcpu#  affinityMode  softAffinity  Affinity  ExAff
2100768  0      4      2 -> sched    4             0-31      yes
2100770  1      1      2 -> sched    1             0-31      yes
2100771  2      13     2 -> sched    13            0-31      yes
2100772  3      3      2 -> sched    3             0-31      yes
2100773  4      9      2 -> sched    9             0-31      yes
2100774  5      7      2 -> sched    7             0-31      yes
2100775  6      8      2 -> sched    8             0-31      yes
2100776  7      11     2 -> sched    11            0-31      yes
2100777  8      6      2 -> sched    6             0-31      yes

CID=2101305 GID=27125 LWID=2101394 Name=brick3
Group CPU Affinity: guest worlds:0-31 non-guest worlds:0-31
Latency Sensitivity: -1
NUMA client 0: affinity: 0x00000003 home: 0x00000000
vcpuId   vcpu#  pcpu#  affinityMode  softAffinity  Affinity  ExAff
2101394  0      2      2 -> sched    0-15          0-31      no
2101396  1      12     2 -> sched    0-15          0-31      no
2101397  2      10     2 -> sched    0-15          0-31      no
2101398  3      14     2 -> sched    0-15          0-31      no
2101399  4      0      2 -> sched    0-15          0-31      no
2101400  5      10     2 -> sched    0-15          0-31      no
2101401  6      14     2 -> sched    0-15          0-31      no
2101402  7      0      2 -> sched    0-15          0-31      no
2101403  8      15     2 -> sched    0-15          0-31      no
Here you can see brick1 and brick2 are both scheduled with ExAff as expected, but brick3 has no ExAff. Also, vCPUs 2 and 5 are scheduled on pCPU 10, vCPUs 3 and 6 on pCPU 14, and vCPUs 4 and 7 on pCPU 0, so the VM is over-provisioned. The reason is obvious: there is enough CPU left to reserve globally on the host, but not enough on a single NUMA node. The math is quite simple: 32 - 9 - 9 = 14 pCPUs are still unreserved on the host, more than the 9 that brick3 needs, but each NUMA node only has 16 - 9 = 7 free cores, fewer than 9.
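Plugging the test numbers into the toy checks from the earlier sketch (repeated inline here so it stands on its own):

# Numbers from the test setup above; brick1 and brick2 already hold
# 9 exclusive cores each, one VM per NUMA node.
total_pcpus, vcpus = 32, 9        # host size, brick3's vCPU count
reserved = 9 + 9                  # brick1 + brick2
free_per_node = [16 - 9, 16 - 9]  # 7 cores free on each node

print(total_pcpus - reserved >= vcpus)         # True: global check passes (14 >= 9)
print(any(f >= vcpus for f in free_per_node))  # False: no node has 9 free cores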
Now, to be fair, a small warning is issued in the host’s event log:
Unable to apply latency-sensitivity setting to virtual machine brick3. No valid placement on the host.
And also in vmkernel.log:
2024-06-27T09:12:02.773Z Wa(180) vmkwarning: cpu1:2101317)WARNING: CpuSched: 1400: Unable to apply latency-sensitivity setting to virtual machine brick3. No valid placement on the host.
But DRS has no issue powering on such a VM either, which complicates the management of these VMs.
I have an SR open for this, as I don’t think it should work the way it does right now; I’ll try to keep you posted.