VM Latency Sensitivity set to High still fails with no (proper) warning


Hello, it’s been a while since I last blogged: I had a busy few years, took a sabbatical, and for a while there was nothing I could blog about without breaching an NDA. Anyway, I’m back with a new issue which may help those of you who are virtualizing latency-sensitive workloads, a niche world, I know…

Back to the issue at hand. One of the features of VM Latency Sensitivity set to High is granting each vCPU exclusive access (ExAff) to a physical core. In the past this only worked when a physical core was available to be fully reserved, and it failed silently when it wasn’t. KB339990 mentions this was fixed by explicitly requiring a 100% CPU reservation for VMs with LS=HIGH since VM HW version 14, I believe. However, this doesn’t always work, and I will try to explain when.
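For reference, here is a minimal sketch (pyVmomi, Python) of how both properties would typically be set through the vSphere API; the helper name, the MHz value and the way you obtain the vm object are placeholders of my own, not the exact commands used in this test:

from pyVmomi import vim

def make_latency_sensitive(vm, per_core_mhz, num_vcpus):
    # Hypothetical helper: set Latency Sensitivity to High and reserve 100% CPU.
    # vm is an already-resolved vim.VirtualMachine object.
    spec = vim.vm.ConfigSpec()
    spec.latencySensitivity = vim.LatencySensitivity(level='high')
    # 100% CPU reservation = nominal core frequency (MHz) * number of vCPUs
    spec.cpuAllocation = vim.ResourceAllocationInfo(reservation=per_core_mhz * num_vcpus)
    return vm.ReconfigVM_Task(spec)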

It is important to note that LS=HIGH attempts to collocate all vCPUs and memory on a single NUMA node.

And this is the problem: CPU reservation works at a global level in ESXi and does not take an individual NUMA node’s capacity into account for VMs with NUMA affinity.

Therefore you can end up in a situation where a VM with LS=HIGH and a 100% CPU reservation is scheduled over-provisioned, meaning several of its vCPUs will be scheduled on the same pCPU. That’s right! It won’t create a wide VM (a VM stretched across multiple NUMA nodes) with full cores even if they are available, but will rather over-provision the VM. Both scenarios are bad, but over-provisioning is the worse of the two in my opinion.
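To make the gap concrete, here is a rough sketch in plain Python of the check ESXi effectively performs versus the check an LS=HIGH VM actually needs; this is my own simplification, not the actual scheduler code, and the numbers match the test setup further down (2 nodes x 16 cores, three 9-vCPU VMs):

NODES, CORES_PER_NODE = 2, 16

def global_check(reserved_cores, requested):
    # what the 100% reservation requirement validates: total free pCPUs on the host
    return NODES * CORES_PER_NODE - reserved_cores >= requested

def per_node_check(free_per_node, requested):
    # what exclusive affinity on a single NUMA node actually needs
    return any(free >= requested for free in free_per_node)

# brick1 and brick2 already hold 9 exclusive cores each, one VM per NUMA node
print(global_check(reserved_cores=18, requested=9))       # True  -> power-on succeeds
print(per_node_check(free_per_node=[7, 7], requested=9))  # False -> brick3 cannot get ExAff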

Ideally, I would prefer such a VM to fail to power on, or for the user to be given a choice of what to do in such a scenario: for example, a warning indicating that there are no suitable resources left on the host, so that DRS could schedule the VM on another host within the cluster. The current situation forces customers to manage LS=HIGH workloads manually, or to use complex placement scripts, as DRS is not good at handling this on its own.

Here is my proof

Test setup:

  • ESXi 8.0U3 (also tested on ESXi 7.0U3)
  • 2x CPUs, 16 cores each
  • 2 NUMA nodes
  • 3x VMs with 9 vCPUs each, LS=HIGH and 100% CPU reservation

Below you can see the output of the handy vcpu_affinity_info.sh script by Valentin Bondzio:

[root@localhost:~] ./vcpu_affinity_info.sh
CID=2100566     GID=21068       LWID=2100656    Name=brick1

Group CPU Affinity:
   guest worlds:0-31
   non-guest worlds:0-31

Latency Sensitivity:
   -1

NUMA client 0:
   affinity: 0x00000003
   home: 0x00000001

      vcpuId  vcpu#  pcpu#  affinityMode  softAffinity   Affinity  ExAff
     2100656      0    16     2 -> sched            16       0-31    yes
     2100658      1    26     2 -> sched            26       0-31    yes
     2100659      2    22     2 -> sched            22       0-31    yes
     2100660      3    18     2 -> sched            18       0-31    yes
     2100661      4    23     2 -> sched            23       0-31    yes
     2100662      5    21     2 -> sched            21       0-31    yes
     2100663      6    19     2 -> sched            19       0-31    yes
     2100664      7    20     2 -> sched            20       0-31    yes
     2100665      8    30     2 -> sched            30       0-31    yes


CID=2100679     GID=21728       LWID=2100768    Name=brick2

Group CPU Affinity:
   guest worlds:0-31
   non-guest worlds:0-31

Latency Sensitivity:
   -1

NUMA client 0:
   affinity: 0x00000003
   home: 0x00000000

      vcpuId  vcpu#  pcpu#  affinityMode  softAffinity   Affinity  ExAff
     2100768      0     4     2 -> sched             4       0-31    yes
     2100770      1     1     2 -> sched             1       0-31    yes
     2100771      2    13     2 -> sched            13       0-31    yes
     2100772      3     3     2 -> sched             3       0-31    yes
     2100773      4     9     2 -> sched             9       0-31    yes
     2100774      5     7     2 -> sched             7       0-31    yes
     2100775      6     8     2 -> sched             8       0-31    yes
     2100776      7    11     2 -> sched            11       0-31    yes
     2100777      8     6     2 -> sched             6       0-31    yes


CID=2101305     GID=27125       LWID=2101394    Name=brick3

Group CPU Affinity:
   guest worlds:0-31
   non-guest worlds:0-31

Latency Sensitivity:
   -1

NUMA client 0:
   affinity: 0x00000003
   home: 0x00000000

      vcpuId  vcpu#  pcpu#  affinityMode  softAffinity   Affinity  ExAff
     2101394      0     2     2 -> sched          0-15       0-31     no
     2101396      1    12     2 -> sched          0-15       0-31     no
     2101397      2    10     2 -> sched          0-15       0-31     no
     2101398      3    14     2 -> sched          0-15       0-31     no
     2101399      4     0     2 -> sched          0-15       0-31     no
     2101400      5    10     2 -> sched          0-15       0-31     no
     2101401      6    14     2 -> sched          0-15       0-31     no
     2101402      7     0     2 -> sched          0-15       0-31     no
     2101403      8    15     2 -> sched          0-15       0-31     no

Here you can see that brick1 and brick2 are both scheduled with ExAff as expected, but brick3 has no ExAff. Also, vCPUs 2 and 5 are scheduled on pCPU 10, vCPUs 3 and 6 on pCPU 14 (and vCPUs 4 and 7 on pCPU 0), so the VM is over-provisioned. The reason is obvious: there is enough CPU to be reserved globally on the host, but not enough on a single NUMA node. The math is quite simple: 32 - 9 - 9 = 14 pCPUs free on the host (enough for another 9 vCPUs), but only 16 - 9 = 7 pCPUs free per NUMA node (not enough for 9).
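As a side note, the shared pCPUs are easy to spot programmatically as well; here is a small Python sketch of my own (assuming the exact column layout printed above, with pcpu# as the third field) that counts vCPUs sharing a pCPU:

from collections import Counter

def shared_pcpus(table_text):
    # keep the pcpu# column from rows that start with a numeric vcpuId
    pcpus = [line.split()[2] for line in table_text.splitlines()
             if line.split() and line.split()[0].isdigit()]
    return {pcpu: count for pcpu, count in Counter(pcpus).items() if count > 1}

# fed with brick3's table above this returns {'10': 2, '14': 2, '0': 2}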

Now, to be fair, there is a small warning issued in the host’s event log:

Unable to apply latency-sensitivity setting to virtual machine brick3. No valid placement on the host.

And also vmkernel.log:

2024-06-27T09:12:02.773Z Wa(180) vmkwarning: cpu1:2101317)WARNING: CpuSched: 1400: Unable to apply latency-sensitivity setting to virtual machine brick3. No valid placement on the host.

But DRS has no issue powering on such a VM either, which further complicates the management of these VMs.

I have an SR open for this, as I don’t think it should work the way it does right now. I will try to keep you posted.


About Dusan Tekeljak

With over 12 years of experience in the Virtualization field, currently working as a Senior Consultant for Evoila, contracted to VMware PSO, helping customers with Telco Cloud Platform bundle. Previous roles include VMware Architect for Public Cloud services at Etisalat and Senior Architect for the VMware platform at the largest retail bank in Slovakia. Background in closely related technologies includes server operating systems, networking, and storage. A former member of the VMware Center of Excellence at IBM and co-author of several Redpapers. The main scope of work involves designing and optimizing the performance of business-critical virtualized solutions on vSphere, including, but not limited to, Oracle WebLogic, MSSQL, and others. Holding several industry-leading IT certifications such as VCAP-DCD, VCAP-DCA, VCAP-NV, and MCITP. Honored with #vExpert2015-2019 awards by VMware for contributions to the community. Opinions are my own!
