IBM has released a flash alert regarding a new behavior introduced in vSphere 5.5 Update 2 and vSphere 6.0, where VAAI ATS (Atomic Test and Set, in other words hardware-accelerated locking) is used for heartbeat I/O.
http://www-01.ibm.com/support/docview.wss?uid=ssg1S1005201
According to IBM:
Due to the low timeout value for heartbeat I/O using ATS, this can lead to host disconnects and application outages if delays of 8 seconds or longer are experienced in completing individual heartbeat I/Os on backend storage systems or the SAN infrastructure.
All versions of the Storwize family (and SVC) since 6.4 are affected.
SYMPTOMS:
In vCenter’s events you may observe:
Lost access to volume <uuid><volume name> due to connectivity issues. Recovery attempt is in progress and the outcome will be reported shortly
and in ESXi’s vmkernel.log:
ATS Miscompare detected beween test and set HB images at offset XXX on vol YYY
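If you want to see how often a host logged this message and for which volumes, a small script can tally it. The following is only a minimal sketch: it assumes you feed it lines from a copy of vmkernel.log, that the volume identifier contains no spaces, and the function name `count_miscompares` is made up for this illustration.

```python
import re
from collections import Counter

# Pattern for the vmkernel.log message quoted above. The log text is spelled
# "beween"; the optional "t" also accepts the spelling "between".
ATS_MISCOMPARE = re.compile(
    r"ATS Miscompare detected bet?ween test and set HB images "
    r"at offset (?P<offset>\S+) on vol (?P<vol>\S+)"
)


def count_miscompares(lines):
    """Return a Counter mapping volume identifier to number of matching lines."""
    counts = Counter()
    for line in lines:
        match = ATS_MISCOMPARE.search(line)
        if match:
            # Drop a trailing period in case the message ends the sentence with one.
            counts[match.group("vol").rstrip(".")] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical file name: a copy of the host's vmkernel.log.
    with open("vmkernel.log", encoding="utf-8", errors="replace") as log:
        for vol, hits in count_miscompares(log).most_common():
            print(vol, hits)
```

A volume that shows up repeatedly is a good candidate to cross-check against the "Lost access to volume" events in vCenter.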
FIX:
For now, the workaround is to disable ATS for heartbeats only. You don’t need to disable ATS globally, as you would lose the benefits of ATS. The change is non-disruptive, so it can be performed online without a host reboot (IBM states otherwise in its KB, but we have successfully tested it without an outage).
VMware has also released KB 2113956, where you can find more details.
Disable ATS for heartbeats:
esxcli system settings advanced set -i 0 -o /VMFS3/useATSForHBOnVMFS5
and verify with:
esxcli system settings advanced list -o /VMFS3/useATSForHBOnVMFS5
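If you need to check the setting across many hosts, the output of the list command can be parsed by a script. The sketch below assumes the command prints indented `Key: Value` lines including an `Int Value` field; the sample text in the usage section is illustrative rather than captured from a real host, and both function names are made up here.

```python
def parse_advanced_option(output):
    """Parse 'Key: Value' lines into a dict (assumed esxcli output layout)."""
    fields = {}
    for line in output.splitlines():
        # Split on the first colon only, so values containing ':' stay intact.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def ats_heartbeat_disabled(output):
    """True if the current integer value of the option is 0."""
    return parse_advanced_option(output).get("Int Value") == "0"


if __name__ == "__main__":
    # Illustrative sample only; paste the real command output here.
    sample = """
       Path: /VMFS3/useATSForHBOnVMFS5
       Type: integer
       Int Value: 0
       Default Int Value: 1
    """
    print("ATS heartbeat disabled:", ats_heartbeat_disabled(sample))
```

Comparing `Int Value` with `Default Int Value` also shows at a glance which hosts have been changed from the default.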
You can read more about disabling VAAI and ATS in VMware KB 1033665.
When this article was first written there was no VMware KB for this issue, and I have no information on whether other storage vendors are affected too. If you are not sure, I suggest verifying with your storage vendor or VMware.
I will update this article once I have more information. Please leave a comment if your storage type is affected as well.
Update 16th April 2015: Added symptoms, VMware KB, changed headline
Update 18th May 2015: It looks like EMC VMAX is affected by this as well. You can read more at http://timsvirtualworld.com/2015/04/ats-miscompare-issue-with-emc-vmax/
Author: Dusan Tekeljak
This looks like an issue I have at one of my customers… Thanks a lot for sharing! If this resolves my problem, I owe you a beer!
Let us know!
Had a major downtime in a fresh 6.0 cluster with IBM SVC and V7000, >350 VMs were down because of this. Thanks VMware.
Shit happens, but believe me Ralf you are not the only one in this :/
It happened to me too, luckily in a very small greenfield pre-production environment used as a PoC at the beginning of the migration. But I have also heard about the big ones, and I suppose there will be more in the upcoming months as companies continue with vSphere 6 adoption.
Hi, is this issue also relevant to the IBM XIV storage system?
We have a big issue with the same error message, but without the messages in vmkernel.log.
I have been talking with multiple storage experts, and they say that an 8-second timeout is too low in the SAN world and can affect other storage vendors as well, though mostly virtualized storage such as SVC, VMAX, etc. I cannot speak for XIV, but if you don’t see the message in your vmkernel.log, it is most probably something else.
EMC has now also published an article for the VPLEX: https://support.emc.com/kb/207382
We ran into this issue on the VPLEX, but it was made infinitely worse by the ESX QLogic HBA driver not handling the ATS Miscompare SCSI sense code. The end result was that the driver would occasionally flip out, and an entire HBA would lose -all- its paths. When this happened to 2 HBAs simultaneously, we lost all storage on that host, and thus all the VMs.
Thankfully, this issue is fixed in driver version 1.1.58.0:
Here are the details of what was wrong and what was fixed (from the release notes):
——————————————————
Between versions 1.1.54.0 and 1.1.55.0:
* Problem Description: ATS miscompare check conditions were not being reported to upper layer. <- Here is where we see messages not being reported to NMP. This is the first attempt to fix the issue.
* Solution: Driver was interpreting the error condition as a "dropped frame" scenario and the issue was never rectified. Fix was to check for this miscompare condition before determining if it was indeed a "dropped frame".
Between versions 1.1.56.0 and 1.1.57.0:
* Problem Description: Omit SCSI opcode check in ATS miscompare check condition. <- Here we see the incorrect response to the ATS miscompare. This is the correct fix for the issue.
* Solution: Omit SCSI opcode check and rely on sense key and ASC to determine if this scenario is encountered.
——————————————————
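To illustrate what "rely on sense key and ASC" means in the release notes above, here is a rough sketch and not the driver's actual code. It assumes fixed-format SCSI sense data and uses the standard values for the MISCOMPARE sense key (0x0E) and the "MISCOMPARE DURING VERIFY OPERATION" additional sense code (0x1D). The function name is made up, and the SCSI opcode is deliberately not inspected.

```python
MISCOMPARE_SENSE_KEY = 0x0E  # SCSI sense key MISCOMPARE
MISCOMPARE_ASC = 0x1D        # ASC: MISCOMPARE DURING VERIFY OPERATION


def is_ats_miscompare(sense):
    """Classify fixed-format sense data by sense key and ASC only.

    `sense` is a bytes-like object holding the sense buffer. The opcode of
    the failed command is intentionally ignored, mirroring the described fix.
    """
    if len(sense) < 14:
        return False  # too short to contain the ASC at byte 12
    response_code = sense[0] & 0x7F
    if response_code not in (0x70, 0x71):
        return False  # descriptor-format sense data is not handled in this sketch
    sense_key = sense[2] & 0x0F
    asc = sense[12]
    return sense_key == MISCOMPARE_SENSE_KEY and asc == MISCOMPARE_ASC


if __name__ == "__main__":
    # Hypothetical sense buffer for illustration.
    buf = bytearray(18)
    buf[0] = 0x70   # current error, fixed format
    buf[2] = 0x0E   # sense key
    buf[12] = 0x1D  # ASC
    print(is_ats_miscompare(buf))
```

A check like this would run before any "dropped frame" handling, so that a miscompare is reported to the upper layer instead of being misread as a transport error.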
We have now fixed this issue by both updating the HBA driver in ESX and turning off ATS heartbeating as described in the workaround.
Thanks a lot for the valuable update! By the way, would you mind pasting the EMC KB here? I’m curious and unfortunately don’t have access to their portal.
EMC KB for the vplex is here: http://pastebin.com/T1QX9mU6
Thanks a lot, definitely better explanation than by IBM or VMware
This post is genius. Have been losing and gaining connectivity on my whitebox ESXi 6.5 with just local SSD storage every 15 seconds for days. Been causing no end of issues. The workaround here has solved it and my box is now flying like Concorde.
Thanks OP
YW. Strange, I would not expect this to be a problem with local controllers. I didn’t know they had started VAAI support there.
I experienced that message.
Is disabling only the ATS heartbeat the best option?
VMware support told me in my SR to disable both ATS and the ATS heartbeat.
What is the difference between ‘Disable ATS’ and ‘Disable ATS Heartbeat’?
I know that disabling ATS affects storage performance.
Hi, ATS helps a lot if you have multiple VMs per datastore, and even more so with thin-provisioned disks. Some storage arrays also had problems with ATS itself, but I would start by disabling the heartbeat only and see how it behaves. You can also ask support to explain why they recommend disabling ATS as well. Unfortunately, the quality of first-line support has decreased a lot.
ATS is used for virtual machine file locks instead of SCSI reservations. The ATS heartbeat just leverages ATS for datastore heartbeats.
By the way, which ESXi version are you running? I read somewhere that ATS miscompare handling was already improved in 6.5 and should no longer cause disconnects.
See https://cormachogan.com/2017/08/24/ats-miscompare-revisited-vsphere-6-5/ for details; however, some people still complain about HP arrays. I have had the ATS heartbeat enabled ever since I updated to ESXi 6.5 nine months ago and haven’t had an issue. But I am running IBM SVC, and IBM made some improvements in the v7.6 firmware on their side as well.
We are using HPE Nimble HF20(H) storage with HPE B-series FC switches and ESXi 6.7 U2 (PSP 10.09.2019, ESXi patched to the end of September), and we see the same errors.
At two other locations we use the same HPE DL325 Gen10 hosts, but direct-attached via FC to Nimble HF20H arrays, and there we don’t see these errors.