Just Another ESXi 6.0 Storage APD Handling Bug

It’s been a while since we experienced a partial outage on our storage infrastructure running on IBM SVC. A few LUNs (vdisks in the IBM SVC world) presented from the storage went offline. I’m not going into the details of why it happened, as hardware problems just happen sometimes. The important thing is how well you are prepared to mitigate the impact of such failures.

To me, the surprising side effect of this storage outage was that all hosts running ESXi 6.0 U2 got disconnected from vCenter, while the ESXi 5.5 hosts survived. I was curious, as I knew this behavior should have been fixed a long time ago by the APD (All-Paths-Down) Handling feature.

As you may remember, ESXi versions prior to 5.0 didn’t have the APD Handling feature, which effectively meant your host went into a not responding (disconnected) state in vCenter every time you experienced a storage outage. APD Handling was introduced in ESXi 5.0 and can be controlled via the Misc.APDHandlingEnable setting.
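A quick way to confirm the setting on a host is to read it with `esxcli system settings advanced list -o /Misc/APDHandlingEnable` and check the reported integer value. The Python sketch below only parses text that has already been captured from that command. The `SAMPLE` text is an illustrative assumption about the output layout rather than a capture from our hosts, and the function names are hypothetical.

```python
def parse_advanced_option(output: str) -> dict:
    """Turn the 'Key: Value' lines of the captured output into a dict."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:  # skip lines that contain no colon at all
            fields[key.strip()] = value.strip()
    return fields


def apd_handling_enabled(output: str) -> bool:
    """True when the option's current integer value is reported as 1."""
    return parse_advanced_option(output).get("Int Value") == "1"


# Illustrative text only (assumed layout); paste your own host's output here.
SAMPLE = """\
   Path: /Misc/APDHandlingEnable
   Type: integer
   Int Value: 1
   Default Int Value: 1
   Min Value: 0
   Max Value: 1
   Description: Enable Storage APD Handling.
"""

if __name__ == "__main__":
    print("APD handling enabled:", apd_handling_enabled(SAMPLE))
```

Keep in mind that in our case the setting was enabled and the feature still did not trigger, so this check only rules out a misconfiguration.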

So I started digging a bit.

vmkernel.log was flooded with the following messages:

cpu16:711239)lpfc: lpfc_scsi_cmd_iocb_cmpl:5108: 1:(0):3271: FCP cmd xa3 failed <6/112> sid xd50d00, did xd35100, oxid x6b9 iotag x9bf SCSI Chk Cond - Not Ready: Data(x2:x2:x4:xc)
cpu4:33112)WARNING: VMW_SATP_ALUA: satp_alua_issueCommandOnPath:703: Path "vmhba2:C0:T6:L112" determined to be in unexpected NOT READY state when probed (0x2/0x4/0xc).
cpu1:879475)lpfc: lpfc_scsi_cmd_iocb_cmpl:5108: 0:(0):3271: FCP cmd xa3 failed <5/112> sid xcb0d00, did xc95100, oxid xe45 iotag x4cb SCSI Chk Cond - Not Ready: Data(x2:x2:x4:xc)
cpu14:33112)WARNING: VMW_SATP_ALUA: satp_alua_issueCommandOnPath:703: Path "vmhba1:C0:T5:L112" determined to be in unexpected NOT READY state when probed (0x2/0x4/0xc).
cpu9:33112)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate:1099: Could not select path for device "naa.600507680c80805ca8000000000009f6".
cpu0:33459)WARNING: NMP: nmpDeviceAttemptFailover:603: Retry world failover device "naa.600507680c80805ca8000000000009f6" - issuing command 0x439eb946ed00
cpu0:33459)WARNING: vmw_psp_rr: psp_rrSelectPath:1315: Could not select path for device "naa.600507680c80805ca8000000000009f6".
cpu0:33459)WARNING: NMP: nmpDeviceAttemptFailover:678: Retry world failover device "naa.600507680c80805ca8000000000009f6" - failed to issue command due to Not found (APD), try again...
cpu0:33459)WARNING: NMP: nmpDeviceAttemptFailover:728: Logical device "naa.600507680c80805ca8000000000009f6": awaiting fast path state update...
cpu20:33491)lpfc: lpfc_scsi_cmd_iocb_cmpl:5108: 1:(0):3271: FCP cmd xa3 failed <6/112> sid xd50d00, did xd35100, oxid x721 iotag xa27 SCSI Chk Cond - Not Ready: Data(x2:x2:x4:xc)
cpu20:33111)WARNING: VMW_SATP_ALUA: satp_alua_issueCommandOnPath:703: Path "vmhba2:C0:T6:L112" determined to be in unexpected NOT READY state when probed (0x2/0x4/0xc).

As you can see, the paths were in a NOT READY state. This was reported by the storage using SCSI sense codes (0x2/0x4/0xc in the messages above). The reason is that the storage controllers (in our case the IBM SVC nodes) were still online but had lost their underlying storage. This is the standard response, as the controllers cannot tell whether the condition will be permanent or temporary. You can also see that ESXi correctly detected the APD situation: “failed to issue command due to Not found (APD), try again…”
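To make the sense data easier to read, here is a minimal Python sketch that pulls the path name and the sense key/ASC/ASCQ triplet out of the VMW_SATP_ALUA warning lines and translates them. The regular expression is written against the log lines in this post only, and the lookup tables contain just the codes seen here (taken from the standard SCSI code assignments), so treat it as an illustration rather than a complete decoder. The function and table names are hypothetical.

```python
import re

# Only the codes that appear in this post; extend as needed.
SENSE_KEYS = {0x2: "NOT READY"}
ASC_ASCQ = {
    (0x04, 0x0C): "LOGICAL UNIT NOT ACCESSIBLE, TARGET PORT IN UNAVAILABLE STATE",
}

# Matches: Path "vmhba2:C0:T6:L112" ... when probed (0x2/0x4/0xc).
PROBE_RE = re.compile(
    r'Path "([^"]+)".*?probed '
    r"\((0x[0-9a-fA-F]+)/(0x[0-9a-fA-F]+)/(0x[0-9a-fA-F]+)\)"
)


def decode_probe(line: str):
    """Return (path, sense key name, ASC/ASCQ text), or None if no match."""
    match = PROBE_RE.search(line)
    if match is None:
        return None
    path = match.group(1)
    key, asc, ascq = (int(match.group(i), 16) for i in (2, 3, 4))
    return (
        path,
        SENSE_KEYS.get(key, f"unknown sense key {key:#x}"),
        ASC_ASCQ.get((asc, ascq), f"unknown ASC/ASCQ {asc:#x}/{ascq:#x}"),
    )


LINE = (
    'cpu4:33112)WARNING: VMW_SATP_ALUA: satp_alua_issueCommandOnPath:703: '
    'Path "vmhba2:C0:T6:L112" determined to be in unexpected NOT READY state '
    "when probed (0x2/0x4/0xc)."
)

if __name__ == "__main__":
    print(decode_probe(LINE))
```

In other words, the target ports answered but declared themselves unavailable, which is different from the device being reported as permanently gone.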

However, the biggest issue here was that the APD Handling feature wasn’t triggered; there is no log event about it anywhere. So we were back in the same situation as in pre-ESXi 5.0 times.

If you are not sure whether APD Handling kicked in on your hosts, look for log events containing:

esx.problem.storage.apd.start
esx.problem.storage.apd.stop
apdcorrelator
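If you have several hosts to check, the search can be scripted. The Python sketch below counts how often each of the three strings above appears in a set of log lines, matching case-insensitively. The file path in the usage note is an assumption; point it at whichever log files you have collected from the host.

```python
APD_MARKERS = (
    "esx.problem.storage.apd.start",
    "esx.problem.storage.apd.stop",
    "apdcorrelator",
)


def count_apd_markers(lines):
    """Count occurrences of each APD handling marker (case-insensitive)."""
    counts = {marker: 0 for marker in APD_MARKERS}
    for line in lines:
        lowered = line.lower()
        for marker in APD_MARKERS:
            if marker in lowered:
                counts[marker] += 1
    return counts


def apd_handling_triggered(lines) -> bool:
    """True if at least one APD handling marker shows up in the lines."""
    return any(count_apd_markers(lines).values())


# One of the messages from this outage: APD is mentioned by NMP,
# but none of the APD handling markers is present.
OUTAGE_LINES = [
    'cpu0:33459)WARNING: NMP: nmpDeviceAttemptFailover:678: Retry world '
    'failover device "naa.600507680c80805ca8000000000009f6" - failed to '
    "issue command due to Not found (APD), try again...",
]

if __name__ == "__main__":
    print("APD handling triggered:", apd_handling_triggered(OUTAGE_LINES))
```

To run it against a real file, pass the open file object instead of the list, for example `with open("vobd.log") as f: print(count_apd_markers(f))`. A result of all zeros during a known outage is the symptom described in this post.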

All hosts came back online right after we unmapped those volumes from the ESXi hosts on the storage side. This means the storage sent a SCSI sense code announcing PDL (Permanent Device Loss), which was (luckily) handled correctly on the ESXi side.

VM Component Protection (VMCP)

This issue effectively means that VM Component Protection for APD will not work in this case either, as it depends directly on APD handling.

VMware actions

I contacted VMware support asking for an explanation. After a couple of weeks of investigation on their side, they accepted it as a valid bug and opened a PR with engineering to fix it. Let’s hope a fix will be available soon, as this is the second issue affecting APD scenarios in vSphere 6.0; the first one is described here.

Please note this is not limited to IBM storage. Other vendors, for example NetApp and EMC, use NOT READY codes as well (although I’m not sure whether this happens only in NOT READY scenarios).

For more information about APD handling and VM Component Protection, I also recommend the following blog posts:

https://blogs.vmware.com/vsphere/2011/08/all-path-down-apd-handling-in-50.html

https://blogs.vmware.com/vsphere/2015/06/vm-component-protection-vmcp.html


Update November 30, 2018: It was finally fixed. See: ESXi 6.7 U1 fixes: APD and VMCP is not triggered even when no paths can service I/Os

About Dusan Tekeljak

With over 12 years of experience in the Virtualization field, currently working as a Senior Consultant for Evoila, contracted to VMware PSO, helping customers with Telco Cloud Platform bundle. Previous roles include VMware Architect for Public Cloud services at Etisalat and Senior Architect for the VMware platform at the largest retail bank in Slovakia. Background in closely related technologies includes server operating systems, networking, and storage. A former member of the VMware Center of Excellence at IBM and co-author of several Redpapers. The main scope of work involves designing and optimizing the performance of business-critical virtualized solutions on vSphere, including, but not limited to, Oracle WebLogic, MSSQL, and others. Holding several industry-leading IT certifications such as VCAP-DCD, VCAP-DCA, VCAP-NV, and MCITP. Honored with #vExpert2015-2019 awards by VMware for contributions to the community. Opinions are my own!

6 Comments

  1. Have they supplied any SR number, or KB article, to describe this new problem? I had this the other day! I would like to know so I can track the outcome with engineering. Please comment back.
