It’s been a while since we experienced a partial outage on our storage infrastructure running behind IBM SVC. A few LUNs (vdisks in the IBM SVC world) presented from the storage went offline. I’m not going into detail about why it happened, as hardware problems just happen sometimes. The important thing is how well you are prepared to mitigate the impact of possible failures.
To me, the surprising side effect of this storage outage was that all hosts running ESXi 6.0 U2 got disconnected from vCenter, while the ESXi 5.5 hosts survived. I was curious, as I knew this behavior was supposed to have been fixed long ago by the APD (All-Paths-Down) Handling feature.
As you may remember, ESXi versions prior to 5.0 didn’t have the APD Handling feature, which effectively meant your host went into a not responding (disconnected) state in vCenter every time you experienced a storage outage. APD Handling was introduced in ESXi 5.0 and can be controlled via the Misc.APDHandlingEnable setting.
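Before blaming a bug, it is worth confirming that APD Handling is actually enabled on the host. The setting can be read with `esxcli system settings advanced list -o /Misc/APDHandlingEnable`. Below is a minimal Python sketch that interprets the text this command prints. The function name `apd_handling_enabled` is my own, and the sample text is an abbreviated illustration of the output shape, not a capture from a real host, so treat the field layout as an assumption and check it against your own output.

```python
import re


def apd_handling_enabled(esxcli_output: str) -> bool:
    """Return True if the 'Int Value' field of the advanced setting is 1.

    Assumes the text comes from:
        esxcli system settings advanced list -o /Misc/APDHandlingEnable
    """
    # Anchor at the line start so 'Default Int Value' is not matched.
    match = re.search(r"^[ \t]*Int Value:[ \t]*(-?\d+)[ \t]*$",
                      esxcli_output, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Int Value' field found in esxcli output")
    return int(match.group(1)) == 1


# Abbreviated, illustrative sample of the output shape (an assumption).
SAMPLE_OUTPUT = """\
   Path: /Misc/APDHandlingEnable
   Type: integer
   Int Value: 1
   Default Int Value: 1
"""

if __name__ == "__main__":
    print("APD Handling enabled:", apd_handling_enabled(SAMPLE_OUTPUT))
```

In our case the setting was enabled on the affected hosts, which is why the missing APD events were a surprise.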
So I started digging a bit.
The vmkernel.log was flooded with the following messages:
cpu16:711239)lpfc: lpfc_scsi_cmd_iocb_cmpl:5108: 1:(0):3271: FCP cmd xa3 failed <6/112> sid xd50d00, did xd35100, oxid x6b9 iotag x9bf SCSI Chk Cond - Not Ready: Data(x2:x2:x4:xc)
cpu4:33112)WARNING: VMW_SATP_ALUA: satp_alua_issueCommandOnPath:703: Path "vmhba2:C0:T6:L112" determined to be in unexpected NOT READY state when probed (0x2/0x4/0xc).
cpu1:879475)lpfc: lpfc_scsi_cmd_iocb_cmpl:5108: 0:(0):3271: FCP cmd xa3 failed <5/112> sid xcb0d00, did xc95100, oxid xe45 iotag x4cb SCSI Chk Cond - Not Ready: Data(x2:x2:x4:xc)
cpu14:33112)WARNING: VMW_SATP_ALUA: satp_alua_issueCommandOnPath:703: Path "vmhba1:C0:T5:L112" determined to be in unexpected NOT READY state when probed (0x2/0x4/0xc).
cpu9:33112)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate:1099: Could not select path for device "naa.600507680c80805ca8000000000009f6".
cpu0:33459)WARNING: NMP: nmpDeviceAttemptFailover:603: Retry world failover device "naa.600507680c80805ca8000000000009f6" - issuing command 0x439eb946ed00
cpu0:33459)WARNING: vmw_psp_rr: psp_rrSelectPath:1315: Could not select path for device "naa.600507680c80805ca8000000000009f6".
cpu0:33459)WARNING: NMP: nmpDeviceAttemptFailover:678: Retry world failover device "naa.600507680c80805ca8000000000009f6" - failed to issue command due to Not found (APD), try again...
cpu0:33459)WARNING: NMP: nmpDeviceAttemptFailover:728: Logical device "naa.600507680c80805ca8000000000009f6": awaiting fast path state update...
cpu20:33491)lpfc: lpfc_scsi_cmd_iocb_cmpl:5108: 1:(0):3271: FCP cmd xa3 failed <6/112> sid xd50d00, did xd35100, oxid x721 iotag xa27 SCSI Chk Cond - Not Ready: Data(x2:x2:x4:xc)
cpu20:33111)WARNING: VMW_SATP_ALUA: satp_alua_issueCommandOnPath:703: Path "vmhba2:C0:T6:L112" determined to be in unexpected NOT READY state when probed (0x2/0x4/0xc).
As you can see, the paths were in a NOT READY state. This was reported by the storage using SCSI sense codes. The reason is that the storage controllers (in our case the IBM SVC nodes) were still online but had lost their underlying storage. This is a standard response, as the controllers cannot tell whether the condition will be permanent or temporary. You can also see that ESXi correctly detected the APD situation: “failed to issue command due to Not found (APD), try again...”
However, the biggest issue here was that the APD Handling feature wasn’t triggered; there was no log event about it anywhere. So we were back in the same situation as in the pre-ESXi 5.0 days.
If you are not sure whether APD Handling was triggered in your environment, look for log events containing:
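To get a quick overview of which paths reported NOT READY and with which sense data, the VMW_SATP_ALUA warnings from the excerpt above can be summarized with a short script. This is only a sketch: the function names are hypothetical, the regular expression is written against the message format shown above, and the ASC/ASCQ description reflects my reading of the SCSI ASC/ASCQ assignments, so verify it against the T10 tables before relying on it.

```python
import re
from collections import Counter

# Written against the VMW_SATP_ALUA warning format seen in the log excerpt.
PATH_SENSE_RE = re.compile(
    r'Path "(?P<path>[^"]+)" determined to be in unexpected NOT READY state '
    r'when probed \((?P<key>0x[0-9a-f]+)/(?P<asc>0x[0-9a-f]+)/(?P<ascq>0x[0-9a-f]+)\)',
    re.IGNORECASE,
)

# Deliberately tiny lookup tables: only the codes seen in this outage.
SENSE_KEYS = {0x2: "NOT READY"}
ASC_ASCQ = {
    # Assumption based on my reading of the ASC/ASCQ assignments.
    (0x04, 0x0C): "LOGICAL UNIT NOT ACCESSIBLE, TARGET PORT IN UNAVAILABLE STATE",
}


def summarize_not_ready_paths(log_text):
    """Count NOT READY probe warnings per (path, sense triple)."""
    counts = Counter()
    for m in PATH_SENSE_RE.finditer(log_text):
        sense = tuple(int(m.group(name), 16) for name in ("key", "asc", "ascq"))
        counts[(m.group("path"), sense)] += 1
    return counts


def describe_sense(sense):
    """Render a sense triple with whatever description the tables hold."""
    key, asc, ascq = sense
    key_text = SENSE_KEYS.get(key, "unknown sense key")
    detail = ASC_ASCQ.get((asc, ascq), "unknown ASC/ASCQ")
    return f"{key:#x}/{asc:#x}/{ascq:#x}: {key_text}, {detail}"


# Three warnings copied from the log excerpt above.
SAMPLE = '''
cpu4:33112)WARNING: VMW_SATP_ALUA: satp_alua_issueCommandOnPath:703: Path "vmhba2:C0:T6:L112" determined to be in unexpected NOT READY state when probed (0x2/0x4/0xc).
cpu14:33112)WARNING: VMW_SATP_ALUA: satp_alua_issueCommandOnPath:703: Path "vmhba1:C0:T5:L112" determined to be in unexpected NOT READY state when probed (0x2/0x4/0xc).
cpu20:33111)WARNING: VMW_SATP_ALUA: satp_alua_issueCommandOnPath:703: Path "vmhba2:C0:T6:L112" determined to be in unexpected NOT READY state when probed (0x2/0x4/0xc).
'''

if __name__ == "__main__":
    for (path, sense), count in summarize_not_ready_paths(SAMPLE).items():
        print(f"{path}: {count}x {describe_sense(sense)}")
```

If every path to a device shows up in this summary, the host has no usable path left, which is exactly the situation APD Handling is meant to cover.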
esx.problem.storage.apd.start
esx.problem.storage.apd.stop
apdcorrelator
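Searching for these three strings is easy to script. Here is a minimal sketch, assuming you have already copied the relevant host logs somewhere readable; the function names are hypothetical and the matching is a plain case-insensitive substring test, nothing more.

```python
APD_MARKERS = (
    "esx.problem.storage.apd.start",
    "esx.problem.storage.apd.stop",
    "apdcorrelator",
)


def find_apd_events(lines):
    """Return a dict mapping each marker to the log lines that contain it."""
    hits = {marker: [] for marker in APD_MARKERS}
    for line in lines:
        lowered = line.lower()
        for marker in APD_MARKERS:
            if marker in lowered:
                hits[marker].append(line.rstrip("\n"))
    return hits


def apd_handling_triggered(lines):
    """True if at least one APD Handling marker appears in the lines."""
    return any(find_apd_events(lines).values())


if __name__ == "__main__":
    # A line from the outage log: APD is detected by NMP, but none of the
    # APD Handling markers is present.
    outage_lines = [
        'cpu0:33459)WARNING: NMP: nmpDeviceAttemptFailover:678: Retry world '
        'failover device "naa.600507680c80805ca8000000000009f6" - failed to '
        'issue command due to Not found (APD), try again...',
    ]
    print("APD Handling triggered:", apd_handling_triggered(outage_lines))
```

To check real logs, pass an open file object instead of the list, for example `apd_handling_triggered(open(path))` for each log file you collected from the host.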
All hosts came back online right after we unmapped those volumes from the ESXi hosts on the storage side. This means the storage then sent a SCSI sense code announcing PDL (Permanent Device Loss), which was (luckily) handled correctly on the ESXi side.
VM Component Protection (VMCP)
This issue effectively means that VM Component Protection for APD will not work in this case either, as it depends directly on APD handling.
VMware actions
I contacted VMware support asking for an explanation. After a couple of weeks of investigation on their side, they accepted it as a valid bug and opened a PR with engineering to fix it. Let’s hope a fix will be available soon, as this is the second issue affecting APD scenarios in vSphere 6.0; the first one is described here.
Please note this is not limited to IBM storage. Other vendors, for example NetApp and EMC, use NOT READY codes as well (although I’m not sure whether this happens only in NOT READY scenarios).
For more information about APD handling and VM Component Protection, I also recommend the following blog posts:
https://blogs.vmware.com/vsphere/2011/08/all-path-down-apd-handling-in-50.html
https://blogs.vmware.com/vsphere/2015/06/vm-component-protection-vmcp.html
Update November 30, 2018: It was finally fixed. See: ESXi 6.7 U1 fixes: APD and VMCP is not triggered even when no paths can service I/Os
Have they supplied any SR number or KB article describing this new problem? I had this the other day! I would like to know so I can track the outcome with engineering. Please comment back.
Hi,
Sorry to hear that. I sent you an email with the SR number so you can try to reference it.
Sorry Dusan, I did not receive it.
Please write me to dusan@thevirtualist.org
Hi, it was fixed in ESXi 6.7