Just Another ESXi 6.0 Storage APD Handling Bug

So I started digging a bit

vmkernel.log was flooded with the following messages:

cpu16:711239)lpfc: lpfc_scsi_cmd_iocb_cmpl:5108: 1:(0):3271: FCP cmd xa3 failed <6/112> sid xd50d00, did xd35100, oxid x6b9 iotag x9bf SCSI Chk Cond - Not Ready: Data(x2:x2:x4:xc)
cpu4:33112)WARNING: VMW_SATP_ALUA: satp_alua_issueCommandOnPath:703: Path "vmhba2:C0:T6:L112" determined to be in unexpected NOT READY state when probed (0x2/0x4/0xc).
cpu1:879475)lpfc: lpfc_scsi_cmd_iocb_cmpl:5108: 0:(0):3271: FCP cmd xa3 failed <5/112> sid xcb0d00, did xc95100, oxid xe45 iotag x4cb SCSI Chk Cond - Not Ready: Data(x2:x2:x4:xc)
cpu14:33112)WARNING: VMW_SATP_ALUA: satp_alua_issueCommandOnPath:703: Path "vmhba1:C0:T5:L112" determined to be in unexpected NOT READY state when probed (0x2/0x4/0xc).
cpu9:33112)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate:1099: Could not select path for device "naa.600507680c80805ca8000000000009f6".
cpu0:33459)WARNING: NMP: nmpDeviceAttemptFailover:603: Retry world failover device "naa.600507680c80805ca8000000000009f6" - issuing command 0x439eb946ed00
cpu0:33459)WARNING: vmw_psp_rr: psp_rrSelectPath:1315: Could not select path for device "naa.600507680c80805ca8000000000009f6".
cpu0:33459)WARNING: NMP: nmpDeviceAttemptFailover:678: Retry world failover device "naa.600507680c80805ca8000000000009f6" - failed to issue command due to Not found (APD), try again...
cpu0:33459)WARNING: NMP: nmpDeviceAttemptFailover:728: Logical device "naa.600507680c80805ca8000000000009f6": awaiting fast path state update...
cpu20:33491)lpfc: lpfc_scsi_cmd_iocb_cmpl:5108: 1:(0):3271: FCP cmd xa3 failed <6/112> sid xd50d00, did xd35100, oxid x721 iotag xa27 SCSI Chk Cond - Not Ready: Data(x2:x2:x4:xc)
cpu20:33111)WARNING: VMW_SATP_ALUA: satp_alua_issueCommandOnPath:703: Path "vmhba2:C0:T6:L112" determined to be in unexpected NOT READY state when probed (0x2/0x4/0xc).

As you can see, the paths were in a NOT READY state, which the storage reported using SCSI sense codes. The reason is that the storage controller nodes (in our case IBM SVC) were still online but had lost their underlying storage. This is a standard response, because the controllers cannot tell whether the condition will be permanent or temporary. You can also see that ESXi correctly detected the APD situation: “failed to issue command due to Not found (APD), try again…”
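To make the sense data in those warnings easier to read, here is a minimal Python sketch that pulls the path name and the sense key/ASC/ASCQ triple out of a VMW_SATP_ALUA warning line and translates it. The function names, the regular expression and the small lookup tables are only an illustration and are not part of ESXi; the tables cover just a few values from the SCSI standard and are far from complete.

```python
import re

# Deliberately small lookup tables with values from the SCSI standard (SPC).
# Anything not listed here is reported as unknown.
SENSE_KEYS = {
    0x2: "NOT READY",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
}
ASC_ASCQ = {
    (0x04, 0x0A): "LOGICAL UNIT NOT ACCESSIBLE, ASYMMETRIC ACCESS STATE TRANSITION",
    (0x04, 0x0B): "LOGICAL UNIT NOT ACCESSIBLE, TARGET PORT IN STANDBY STATE",
    (0x04, 0x0C): "LOGICAL UNIT NOT ACCESSIBLE, TARGET PORT IN UNAVAILABLE STATE",
}

# Hypothetical pattern written against the warning lines quoted above.
SATP_RE = re.compile(
    r'Path "(?P<path>[^"]+)" determined to be in unexpected NOT READY state '
    r"when probed \((?P<key>0x[0-9a-fA-F]+)/(?P<asc>0x[0-9a-fA-F]+)/"
    r"(?P<ascq>0x[0-9a-fA-F]+)\)"
)


def decode_sense(key, asc, ascq):
    """Translate a sense key / ASC / ASCQ triple into readable text."""
    key_text = SENSE_KEYS.get(key, "unknown sense key")
    detail = ASC_ASCQ.get((asc, ascq), "unknown ASC/ASCQ")
    return f"{key_text}: {detail}"


def parse_satp_warning(line):
    """Return (path, decoded sense) for a SATP NOT READY warning, else None."""
    match = SATP_RE.search(line)
    if match is None:
        return None
    key, asc, ascq = (int(match.group(name), 16) for name in ("key", "asc", "ascq"))
    return match.group("path"), decode_sense(key, asc, ascq)
```

For the warnings quoted above, 0x2/0x4/0xc decodes to NOT READY with “logical unit not accessible, target port in unavailable state”, which fits controllers that are still reachable but cannot serve the LUN.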

However, the biggest issue was that the APD Handling feature was never triggered: there is no log event about it anywhere. We were therefore in the same situation as in pre-ESXi 5.0 times.

If you are not sure whether APD handling was triggered, look for log events containing any of the following:

esx.problem.storage.apd.start
esx.problem.storage.apd.stop
apdcorrelator
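The search for those three strings can be scripted. The following is a minimal Python sketch, assuming you have copied the relevant host log files somewhere you can read them; the function name and the case-insensitive matching are assumptions made for illustration, and which log file to scan is left to you.

```python
# The three markers to look for, lower-cased for case-insensitive matching.
APD_MARKERS = (
    "esx.problem.storage.apd.start",
    "esx.problem.storage.apd.stop",
    "apdcorrelator",
)


def find_apd_events(lines):
    """Return (line_number, marker, text) for every line mentioning a marker."""
    hits = []
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()
        for marker in APD_MARKERS:
            if marker in lowered:
                hits.append((number, marker, line.rstrip("\n")))
                break  # one hit per line is enough
    return hits
```

Called as `find_apd_events(open(path))` on a copied log file, an empty result means none of the markers appear in that file, which is what we saw during this incident.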

All hosts came back online right after we unmapped those volumes from the ESXi hosts on the storage side. That means the storage then sent a SCSI sense code announcing PDL (Permanent Device Loss), which was (luckily) handled correctly on the ESXi side.

VM Component Protection (VMCP)

This issue effectively means that VM Component Protection for APD will not work in this case either, as it depends directly on APD handling.

VMware actions

I contacted VMware support and asked for an explanation. After a couple of weeks of investigation on their side, they accepted it as a valid bug and opened a PR with engineering to fix it. Let’s hope a fix will be available soon, as this is the second issue affecting APD scenarios in vSphere 6.0; the first one is described here.

Please note that this is not limited to IBM storage. Other vendors, for example NetApp and EMC, use NOT READY codes as well (although I’m not sure whether this happens only in NOT READY scenarios).

To get more information about APD handling and VM Component Protection, I also recommend the following blog posts:

https://blogs.vmware.com/vsphere/2011/08/all-path-down-apd-handling-in-50.html

https://blogs.vmware.com/vsphere/2015/06/vm-component-protection-vmcp.html

Update November 30, 2018: It was finally fixed. See: ESXi 6.7 U1 fixes: APD and VMCP is not triggered even when no paths can service I/Os

Dusan Tekeljak

Experienced infrastructure architect and consultant with more than a decade of hands-on expertise in designing, deploying, and optimizing secure, high-performance cloud solutions across Europe and the Middle East. My focus is on VMware technologies, where I’ve led major implementations, architected mission-critical systems for telecom and finance clients, and contributed to industry knowledge as an IBM Redbooks co-author. With a collection of advanced certifications, including VCAP-DCD, VCAP-DCA, VCAP-NV, and multiple VMware expert credentials, I combine technical leadership with practical delivery, consistently driving successful infrastructure transformations, operational excellence, and digital innovation for enterprise clients. Opinions are my own!

6 Comments

ash wallace
November 22, 2016 at 4:45 am

Have they supplied an SR number or KB article describing this new problem? I had this the other day! I would like to know so I can track the outcome with engineering. Please comment back.

- Dusan Tekeljak
  November 22, 2016 at 9:41 am
  
  Hi,
  
  Sorry to hear that. I sent you an email with the SR number so you can try to reference it.
  
  - ash wallace
    November 23, 2016 at 12:11 am
    
    Sorry Dusan I did not receive it
    
    - Dusan Tekeljak
      November 23, 2016 at 2:35 am
      
      Please write to me at dusan@thevirtualist.org
      
- Dusan Tekeljak
  November 30, 2018 at 3:21 pm
  
  Hi, it was fixed in ESXi 6.7 U1.
  
Pingback: ESXi 6.7 U1 fixes: APD and VMCP is not triggered even when no paths can service I/Os - The Virtualist
