Updated: Storage disconnects when using VAAI ATS on vSphere 5.5 Update2 and vSphere 6.0

By Dusan Tekeljak April 16, 2015 May 18, 2015 Cloud and Virtualization, Hardware, IBM, Other, VMware, vSphere

SYMPTOMS:

In vCenter’s events you may observe:

Lost access to volume <uuid><volume name> due to connectivity issues. Recovery attempt is in progress and the outcome will be reported shortly

and in ESXi’s vmkernel.log:

ATS Miscompare detected beween test and set HB images at offset XXX on vol YYY
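
To see how often a host logs this message and which volumes are involved, a small helper along the following lines can be run against the log. This is only a sketch, not something taken from the VMware or IBM KBs: the function names are made up for illustration, /var/log/vmkernel.log is assumed to be the log location, and the volume is assumed to be the token that follows "on vol", as in the symptom line above.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not part of any VMware tooling.

# Print the number of ATS miscompare heartbeat messages in a log file.
# $1: log file (assumed default: /var/log/vmkernel.log)
count_ats_miscompare() {
    grep -c 'ATS Miscompare detected' "${1:-/var/log/vmkernel.log}"
}

# Print "<count> <volume>" for every volume named in such messages.
# Assumption: the volume is the word right after "on vol".
ats_miscompare_by_volume() {
    awk '/ATS Miscompare detected/ {
             for (i = 1; i < NF - 1; i++)
                 if ($i == "on" && $(i + 1) == "vol") seen[$(i + 2)]++
         }
         END { for (v in seen) print seen[v], v }' "${1:-/var/log/vmkernel.log}"
}
```

Called without an argument, both functions read the assumed default log path; pass a copy of the log as the first argument to analyse it elsewhere.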

FIX:

For now, the workaround is to disable ATS for heartbeats only. You don’t need to disable ATS globally, which would cost you its benefits. The change is non-disruptive and can be made online, without a host reboot (IBM states otherwise in its KB, but we have tested it successfully without an outage).

VMware has also released KB2113956, where you can find more details.

Disable ATS for heartbeats:

esxcli system settings advanced set -i 0 -o /VMFS3/useATSForHBOnVMFS5

and to verify with:

esxcli system settings advanced list -o /VMFS3/useATSForHBOnVMFS5
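
If you want to check the result from a script rather than by eye, the output of the list command can be parsed. The sketch below is an illustration under assumptions: the helper names are hypothetical, and it assumes the output contains a line of the form "Int Value: <n>" holding the current setting.

```shell
#!/bin/sh
# Hypothetical helpers; they only parse text and do not call esxcli themselves.

# Read "esxcli system settings advanced list" output on stdin and print the
# current integer value. Assumption: a line "Int Value: <n>" is present.
ats_hb_value() {
    awk -F': *' '/^ *Int Value:/ { print $2; exit }'
}

# Succeed (exit status 0) when the parsed value is 0, i.e. ATS for
# heartbeats is reported as disabled.
ats_hb_disabled() {
    [ "$(ats_hb_value)" = "0" ]
}

# Usage on a host (not executed here):
#   esxcli system settings advanced list -o /VMFS3/useATSForHBOnVMFS5 | ats_hb_disabled \
#       && echo "ATS heartbeat disabled" || echo "ATS heartbeat still enabled"
```

The pattern is anchored to the start of the line so that a "Default Int Value" line is not mistaken for the current value.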

You can read more about disabling VAAI and ATS in VMware KB1033665.

~~There’s no KB from VMware out yet regarding this issue~~ I have no information on whether other storage vendors are affected by this issue as well. If you are not sure, I suggest verifying with your storage vendor or VMware.

I will update this article once I have more information. Please leave a comment if your storage type is affected as well.

Update 16th April 2015: Added symptoms, VMware KB, changed headline

Update 18th May 2015: It looks like EMC VMAX is affected by this as well. You can read more about it at http://timsvirtualworld.com/2015/04/ats-miscompare-issue-with-emc-vmax/

Dusan Tekeljak

Experienced infrastructure architect and consultant with more than a decade of hands-on expertise in designing, deploying, and optimizing secure, high-performance cloud solutions across Europe and the Middle East. My focus is on VMware technologies, where I’ve led major implementations, architected mission-critical systems for telecom and finance clients, and contributed to industry knowledge as an IBM Redbooks co-author. With a collection of advanced certifications, including VCAP-DCD, VCAP-DCA, VCAP-NV, and multiple VMware expert credentials, I combine technical leadership with practical delivery, consistently driving successful infrastructure transformations, operational excellence, and digital innovation for enterprise clients. Opinions are my own!

18 Comments

Nikolay Nikolov
April 16, 2015 at 1:36 pm

This looks like an issue I have at one of my customers…. Thanks a lot for sharing! If this resolves my problem, I owe you a beer!

Dusan Tekeljak
April 16, 2015 at 1:43 pm

let us know!

Ralf
August 29, 2015 at 10:49 am

Had a major downtime in a fresh 6.0 cluster with IBM SVC and V7000, >350 VMs were down because of this. Thanks VMware.

- Dusan Tekeljak
  September 1, 2015 at 9:10 pm
  
  Shit happens, but believe me Ralf you are not the only one in this :/
  It happened to me too, luckily in a very small greenfield pre-prod environment used as a PoC at the beginning of the migration. But I have also heard about the big ones. And I suppose there will be more in the upcoming months as companies continue with vSphere 6 adoption.
  
Yaron Cohen
October 7, 2015 at 6:49 pm

Hi, is this issue also relevant to the IBM XIV storage system?
We have a big issue with the same error message, but without the messages on the vmkernel.log.

- Dusan Tekeljak
  October 7, 2015 at 7:00 pm
  
  I have talked with several storage experts, and they said that an 8 s timeout is too low in the SAN world and that it can affect other storage vendors as well, though mostly virtualized storage such as SVC, VMAX, etc. I cannot say for XIV, but if you don’t see the message in your vmkernel.log, it is most probably something else.
  
TheFluffySysOp
November 4, 2015 at 1:36 pm

EMC has now also published an article for the VPLEX: https://support.emc.com/kb/207382

We ran into this issue on the VPLEX, but it was made infinitely worse by the ESX QLogic HBA driver not handling the ATS Miscompare SCSI sense code. The end result was that the driver would occasionally flip out, and an entire HBA would lose -all- its paths. When this happened to 2 HBAs simultaneously, we lost all storage on that host, and thus all the VMs.

Thankfully, this issue is fixed in driver version 1.1.58.0:
Here are the details of what was wrong and what was fixed (from the release notes):
——————————————————
Between versions 1.1.54.0 and 1.1.55.0:

* Problem Description: ATS miscompare check conditions were not being reported to upper layer. <- Here is where we see messages not being reported to NMP. This is the first attempt to fix the issue.

* Solution: Driver was interpreting the error condition as a "dropped frame" scenario and the issue was never rectified. Fix was to check for this miscompare condition before determining if it was indeed a "dropped frame".

Between versions 1.1.56.0 and 1.1.57.0:

* Problem Description: Omit SCSI opcode check in ATS miscompare check condition. <- Here we see the incorrect response to the ATS miscompare. This is the correct fix for the issue.

* Solution: Omit SCSI opcode check and rely on sense key and ASC to determine if this scenario is encountered.
——————————————————
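
The second fix in the release notes can be pictured as a check on the sense data alone. The sketch below is illustrative only and is not the driver's code: the function name is hypothetical, and it assumes the standard SCSI values for this condition, sense key 0xE (MISCOMPARE) and ASC 0x1D (miscompare during verify operation).

```shell
#!/bin/sh
# Illustrative only; not taken from the QLogic driver.
# Decide "ATS miscompare" from the sense key and ASC, without looking at
# the SCSI opcode of the command that failed.
# $1: sense key, $2: additional sense code (decimal or 0x-prefixed hex)
is_ats_miscompare() {
    [ "$(( $1 ))" -eq 14 ] && [ "$(( $2 ))" -eq 29 ]   # 0xE and 0x1D
}

# Example: is_ats_miscompare 0xe 0x1d && echo "miscompare, pass it to the upper layer"
```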

We have now fixed this issue by both updating the HBA driver in ESX and turning off ATS heartbeating as per the described workaround.

- Dusan Tekeljak
  November 5, 2015 at 5:54 pm
  
  Thanks a lot for the valuable update! By the way, would you mind pasting the EMC KB here? I’m curious and unfortunately don’t have access to their portal.
  
TheFluffySysOp
November 6, 2015 at 11:46 am

EMC KB for the vplex is here: http://pastebin.com/T1QX9mU6

- Dusan Tekeljak
  November 6, 2015 at 3:37 pm
  
  Thanks a lot, definitely a better explanation than the ones from IBM or VMware.
  
Carl Clements
February 13, 2017 at 7:16 am

This post is genius. Have been losing and gaining connectivity on my whitebox ESXi 6.5 with just local SSD storage every 15 seconds for days. Been causing no end of issues. The workaround here has solved it and my box is now flying like Concorde.
Thanks OP

- Dusan Tekeljak
  February 13, 2017 at 6:09 pm
  
  YW. Strange, I would not expect this to be a problem with local controllers. Didn’t know they had started VAAI support there.
  
seungwan kang
March 6, 2019 at 8:09 am

I experienced that message.
Is disabling only the ATS heartbeat the best option?
VMware support replied to my SR with a recommendation to disable both ATS and the ATS heartbeat.
What is the difference between ‘Disable ATS’ and ‘Disable ATS Heartbeat’?
I know that disabling ATS affects storage performance.

- Dusan Tekeljak
  March 6, 2019 at 8:29 am
  
  Hi, ATS helps a lot if you have multiple VMs per datastore, even more so with thin-provisioned disks. Some storage arrays also had problems with ATS itself, but I would start by disabling the heartbeat only; it depends on the behavior. You can also ask support to explain why they recommend disabling ATS as well. Unfortunately, the quality of first-line support has decreased a lot.
  ATS is used for locking virtual machine files instead of SCSI reservations. ATS heartbeat just leverages ATS for datastore heartbeats.
  
  By the way, which ESXi version are you running? I read somewhere that ATS miscompare handling was already improved in 6.5 and should not cause disconnects anymore.
  
  - Dusan Tekeljak
    March 6, 2019 at 8:38 am
    
    Check here: https://cormachogan.com/2017/08/24/ats-miscompare-revisited-vsphere-6-5/ However, some people still complain about HP arrays. I have had ATS heartbeat enabled myself ever since I updated to ESXi 6.5 nine months ago and haven’t had an issue. But I am running IBM SVC, and they made some improvements in the v7.6 firmware on their side as well.
    
J.Leeflang
October 16, 2019 at 1:17 pm

We are using HPE Nimble HF20(H) storage with HPE B-series FC switches and ESXi 6.7 U2 (PSP 10.09.2019, ESXi patched to the end of September), and we see the same errors.
At two other locations we use the same HPE DL325 Gen10 hosts, but direct-attached via FC to Nimble HF20H arrays, and there we don’t see these errors.
