ESXi 6.7 U1 fixes: APD and VMCP is not triggered even when no paths can service I/Os

Share this:

It has been over two years since I blogged about APD handling bug which I discovered, during partial storage outage in my environment.

All path down (APD) handling and VMware Component Protection (VMCP) is not working when you have paths physically connected, but they are not able to serve any I/Os. It can even leads into unresponsive ESXi host.
For example specifically IBM SVC reports SCSI NOT READY codes if you run out of space on thin provisioned volume. But there are other cases as well.
I also heard about NetApp and EMC storage which can behave like this in some stretch cluster configurations.

I have fought with VMware engineering quite a lot about this at time. Took me 9 months, had to escalate several times, but they eventually accepted it as valid bug and promised to fix it. Its shame it took them 2 years, since I reported it, but it is finally here!

Even my friend at VMware was telling me it is a lost battle with them, saying it is like it is, others asked for it and failed… I haven’t gave up!

You can find mention about it in the vSphere 6.7 U1 release notes (there is no other mention or KB yet):

PR 2167098: All Paths Down (APD) is not triggered for LUNs behind IBM SAN Volume Controller (SVC) target even when no paths can service I/Os
In an ESXi configuration with multiple paths leading to LUNs behind IBM SVC targets, in case of connection loss on active paths and if at the same time the other connected paths are not in a state to service I/Os, the ESXi host might not detect this condition as APD even as no paths are actually available to service I/Os. As a result, I/Os to the device are not fast failed.
This issue is resolved in this release. The fix is disabled by default. To enable the fix, set the ESXi config option /Scsi/ExtendAPDCondition to esxcfg-advcfg -s 1 /Scsi/ExtendAPDCondition.

I believe it is not IBM SVC issue fix only, so others may try as well.

Since there are no more details I wasn’t 100% sure it is fixing my issue. I tested it and it indeed fixes my bug!

As expected in default configuration with /Scsi/ExtendAPDCondition disabled, APD handling is not working for SCSI NOT READY

vmkernel.log flood messages like:

2018-11-23T11:50:15.495Z cpu26:2097964)lpfc: lpfc_handle_status:5066: 1:(0):3271: FCP cmd x89 failed <14/135> sid xd80f00, did x733600, oxid x105 iotag x40b SCSI Chk Cond - Not Ready: Data(x2:x2:x4:xc)
2018-11-23T11:50:15.495Z cpu27:2098141)NMP: nmp_ThrottleLogForDevice:3781: Cmd 0x89 (0x45a37928ec00, 2097213) to dev "naa.600507680c8000689000000000002f49" on path "vmhba6:C0:T14:L135" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x2 0x4 0xc. Act:FAILOVER
2018-11-23T11:50:15.495Z cpu27:2098141)WARNING: NMP: nmp_DeviceRetryCommand:133: Device "naa.600507680c8000689000000000002f49": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.
2018-11-23T11:50:16.454Z cpu26:2097689)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate:1844: Could not select path for device "naa.600507680c8000689000000000002f49".
2018-11-23T11:50:16.454Z cpu42:2098022)WARNING: NMP: nmpDeviceAttemptFailover:640: Retry world failover device "naa.600507680c8000689000000000002f49" - issuing command 0x45a37928ec00
2018-11-23T11:50:16.454Z cpu42:2098022)WARNING: vmw_psp_rr: psp_rrSelectPath:2177: Could not select path for device "naa.600507680c8000689000000000002f49".
2018-11-23T11:50:16.454Z cpu42:2098022)WARNING: NMP: nmpDeviceAttemptFailover:715: Retry world failover device "naa.600507680c8000689000000000002f49" - failed to issue command due to Not found (APD), try again...
2018-11-23T11:50:16.454Z cpu42:2098022)WARNING: NMP: nmpDeviceAttemptFailover:765: Logical device "naa.600507680c8000689000000000002f49": awaiting fast path state update...
2018-11-23T11:50:17.454Z cpu26:2097689)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate:1844: Could not select path for device "naa.600507680c8000689000000000002f49".
2018-11-23T11:50:17.454Z cpu42:2098022)WARNING: NMP: nmpDeviceAttemptFailover:640: Retry world failover device "naa.600507680c8000689000000000002f49" - issuing command 0x45a37928ec00
2018-11-23T11:50:17.454Z cpu42:2098022)WARNING: vmw_psp_rr: psp_rrSelectPath:2177: Could not select path for device "naa.600507680c8000689000000000002f49".
2018-11-23T11:50:17.454Z cpu42:2098022)WARNING: NMP: nmpDeviceAttemptFailover:715: Retry world failover device "naa.600507680c8000689000000000002f49" - failed to issue command due to Not found (APD), try again...
2018-11-23T11:50:17.454Z cpu42:2098022)WARNING: NMP: nmpDeviceAttemptFailover:765: Logical device "naa.600507680c8000689000000000002f49": awaiting fast path state update...

However once I enabled /Scsi/ExtendAPDCondition

vmkernel log showed APD handling messages immediately:

2018-11-23T11:54:36.470Z cpu36:2100104 opID=135c47ca)Config: 703: "ExtendAPDCondition" = 1, Old Value: 0, (Status: 0x0)

2018-11-23T11:56:06.495Z cpu26:2098516)lpfc: lpfc_handle_status:5066: 1:(0):3271: FCP cmd x89 failed <14/135> sid xd80f00, did x733600, oxid x100 iotag x406 SCSI Chk Cond - Not Ready: Data(x2:x2:x4:xc)
2018-11-23T11:56:06.495Z cpu32:2098141)NMP: nmp_ThrottleLogForDevice:3781: Cmd 0x89 (0x45a3793678c0, 2097213) to dev "naa.600507680c8000689000000000002f49" on path "vmhba6:C0:T14:L135" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x2 0x4 0xc. Act:FAILOVER
2018-11-23T11:56:06.495Z cpu32:2098141)WARNING: NMP: nmp_DeviceRetryCommand:133: Device "naa.600507680c8000689000000000002f49": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.
2018-11-23T11:56:07.453Z cpu29:2097689)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate:1844: Could not select path for device "naa.600507680c8000689000000000002f49".
2018-11-23T11:56:07.453Z cpu29:2097689)LVM: 5748: Received APD EventType: APD_START (3) for device <naa.600507680c8000689000000000002f49:1> (gen 1)
2018-11-23T11:56:07.453Z cpu29:2097689)LVM: 5356: Handling APD EventType: APD_START (3) for device <naa.600507680c8000689000000000002f49:1> (unlocked, gen 1, cur apd state UNKNOWN, cur dev state 1)
2018-11-23T11:56:07.453Z cpu29:2097689)HBX: 6287: APD EventType: APD_START (3) for vol 'APDTEST'
2018-11-23T11:56:07.453Z cpu29:2097689)ScsiDevice: 5978: Device state of naa.600507680c8000689000000000002f49 set to APD_START; token num:1
2018-11-23T11:56:07.453Z cpu29:2097689)StorageApdHandler: 1203: APD start for 0x4307894417b0 [naa.600507680c8000689000000000002f49]
2018-11-23T11:56:07.453Z cpu42:2098022)WARNING: NMP: nmpDeviceAttemptFailover:640: Retry world failover device "naa.600507680c8000689000000000002f49" - issuing command 0x45a3793678c0
2018-11-23T11:56:07.453Z cpu10:2097674)StorageApdHandler: 419: APD start event for 0x4307894417b0 [naa.600507680c8000689000000000002f49]
2018-11-23T11:56:07.453Z cpu42:2098022)WARNING: vmw_psp_rr: psp_rrSelectPath:2177: Could not select path for device "naa.600507680c8000689000000000002f49".
2018-11-23T11:56:07.453Z cpu10:2097674)StorageApdHandlerEv: 110: Device or filesystem with identifier [naa.600507680c8000689000000000002f49] has entered the All Paths Down state.
2018-11-23T11:56:07.453Z cpu42:2098022)WARNING: NMP: nmpDeviceAttemptFailover:715: Retry world failover device "naa.600507680c8000689000000000002f49" - failed to issue command due to Not found (APD), try again...
2018-11-23T11:56:07.453Z cpu42:2098022)WARNING: NMP: nmpDeviceAttemptFailover:765: Logical device "naa.600507680c8000689000000000002f49": awaiting fast path state update...

And exactly after 140 seconds which is default APD handling timeout value:

2018-11-23T11:58:27.455Z cpu10:2097674)LVM: 5748: Received APD EventType: APD_TIMEOUT (5) for device <naa.600507680c8000689000000000002f49:1> (gen 2)
2018-11-23T11:58:27.455Z cpu10:2097674)LVM: 5356: Handling APD EventType: APD_TIMEOUT (5) for device <naa.600507680c8000689000000000002f49:1> (unlocked, gen 2, cur apd state APD_START, cur dev state 1)
2018-11-23T11:58:27.455Z cpu10:2097674)HBX: 6287: APD EventType: APD_TIMEOUT (5) for vol 'APDTEST'
2018-11-23T11:58:27.455Z cpu10:2097674)StorageApdHandler: 609: APD timeout event for 0x4307894417b0 [naa.600507680c8000689000000000002f49]
2018-11-23T11:58:27.455Z cpu10:2097674)StorageApdHandlerEv: 126: Device or filesystem with identifier [naa.600507680c8000689000000000002f49] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will
You may also observe similar backtraces:
2018-11-23T11:58:30.457Z cpu0:2097622)BC: 410: write to vmware.log (f532 28 3 5bf7e56e ee182ee d37c63c1 83aed0a 2000c44 e 0 0 0 0 0) 1054 bytes failed: No connection
2018-11-23T11:58:30.457Z cpu0:2097622)Log: 1437: Generating backtrace for 2097622: bcflushd
2018-11-23T11:58:30.457Z cpu0:2097622)Backtrace for current CPU #0, worldID=2097622, fp=0x2001d6
2018-11-23T11:58:30.457Z cpu0:2097622)0x451b0eb1ba40:[0x4180160f72c2]Log_Backtrace@vmkernel#nover+0x10e stack: 0xf532, 0x60050, 0xbad0040, 0x418016056bd9, 0x3
2018-11-23T11:58:30.457Z cpu0:2097622)0x451b0eb1ba70:[0x418016056bd8]BCFileIO@vmkernel#nover+0x1f9 stack: 0x451a0ee182ee, 0x4180d37c63c1, 0x451b083aed0a, 0x451a02000c44, 0x262b60000000e
2018-11-23T11:58:30.457Z cpu0:2097622)0x451b0eb1bd50:[0x418016058047]BCWriteBuffer@vmkernel#nover+0x200 stack: 0x1f2bf6, 0xffffffffffffffff, 0x418000000001, 0x37000, 0x809fdb7000
2018-11-23T11:58:30.457Z cpu0:2097622)0x451b0eb1bf60:[0x4180160581b3]BCFlushFile@vmkernel#nover+0x44 stack: 0x417fd6133d40, 0x4180160581e6, 0x4180160581e6, 0x418016058295, 0x451b0eb1bfaf
2018-11-23T11:58:30.457Z cpu0:2097622)0x451b0eb1bf90:[0x418016058294]BCFlushAll@vmkernel#nover+0xc5 stack: 0x451b0e4a3100, 0x451b0eb23100, 0x0, 0x0, 0x0
2018-11-23T11:58:30.457Z cpu0:2097622)0x451b0eb1bfd0:[0x4180160582c9]BCFlusherWorld@vmkernel#nover+0xe stack: 0x0, 0x0, 0x0, 0x0, 0x0
2018-11-23T11:58:30.457Z cpu0:2097622)0x451b0eb1bfe0:[0x418016309112]CpuSched_StartWorld@vmkernel#nover+0x77 stack: 0x0, 0x0, 0x0, 0x0, 0x0

I was a bit worried they may event lead into PSoD, so I checked with VMware and it looks like they are perfectly normal and safe during the APD condition.

And here we have recovery from the APD

2018-11-23T12:11:28.453Z cpu31:2097689)ScsiDevice: 6000: Setting Device naa.600507680c8000689000000000002f49 state back to 0x2
2018-11-23T12:11:28.453Z cpu31:2097689)LVM: 5748: Received APD EventType: APD_EXIT (4) for device <naa.600507680c8000689000000000002f49:1> (gen 3)
2018-11-23T12:11:28.453Z cpu31:2097689)LVM: 5356: Handling APD EventType: APD_EXIT (4) for device <naa.600507680c8000689000000000002f49:1> (unlocked, gen 3, cur apd state APD_TIMEOUT, cur dev state 1)
2018-11-23T12:11:28.453Z cpu31:2097689)ScsiDevice: 6021: Device naa.600507680c8000689000000000002f49 is Out of APD; token num:1
2018-11-23T12:11:28.453Z cpu0:2097339)LVM: 5356: Handling APD EventType: APD_EXIT (4) for device <naa.600507680c8000689000000000002f49:1> (locked, gen 3, cur apd state APD_TIMEOUT, cur dev state 1)
2018-11-23T12:11:28.453Z cpu0:2097339)WARNING: vmw_psp_rr: psp_rrSelectPath:2177: Could not select path for device "naa.600507680c8000689000000000002f49".
2018-11-23T12:11:28.453Z cpu0:2097339)WARNING: NMP: nmp_IssueCommandToDevice:5082: I/O could not be issued to device "naa.600507680c8000689000000000002f49" due to Not found
2018-11-23T12:11:28.453Z cpu0:2097339)WARNING: NMP: nmp_DeviceRetryCommand:133: Device "naa.600507680c8000689000000000002f49": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.
2018-11-23T12:11:28.453Z cpu0:2097339)WARNING: NMP: nmp_DeviceStartLoop:729: NMP Device "naa.600507680c8000689000000000002f49" is blocked. Not starting I/O from device.
2018-11-23T12:11:28.457Z cpu42:2098022)WARNING: NMP: nmpDeviceAttemptFailover:640: Retry world failover device "naa.600507680c8000689000000000002f49" - issuing command 0x459b410438c0
2018-11-23T12:11:28.459Z cpu2:2098139)NMP: nmpCompleteRetryForPath:327: Retry world recovered device "naa.600507680c8000689000000000002f49"
2018-11-23T12:11:28.463Z cpu36:2098134)ScsiDeviceIO: 3082: Cmd(0x459b41003cc0) 0x35, CmdSN 0x144fc from world 0 to dev "eui.0050430000000000" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2018-11-23T12:11:28.465Z cpu0:2097339)HBX: 6287: APD EventType: APD_EXIT (4) for vol 'APDTEST'
2018-11-23T12:11:28.474Z cpu7:2098358)HBX: 312: 'APDTEST': HB at offset 3178496 - Reclaimed heartbeat [Timeout]:
2018-11-23T12:11:28.474Z cpu7:2098358) [HB state abcdef02 offset 3178496 gen 7 stampUS 259800031085 uuid 5bf3f7c7-bb9c97bc-97d5-7cd30aed3a00 jrnl <FB 9> drv 24.82 lockImpl 4 ip 10.212.5.127]
2018-11-23T12:11:28.474Z cpu7:2098358)HBX: 5509: Completed APD/PDL event processing on vol 'APDTEST'.
2018-11-23T12:11:28.474Z cpu10:2097674)StorageApdHandler: 507: APD exit event for 0x4307894417b0 [naa.600507680c8000689000000000002f49]
2018-11-23T12:11:28.474Z cpu10:2097674)StorageApdHandlerEv: 117: Device or filesystem with identifier [naa.600507680c8000689000000000002f49] has exited the All Paths Down state.

Even VMware Component Protection works now (Response recovery set to Restart VM)

2018-11-23T12:11:29.502Z verbose fdm[2177615] [Originator@6876 sub=Policy opID=SWI-3cfbb7a2] [MonitorComponentState::ChangeStateIfTimersExpired] Go to reset state for VM /vmfs/volumes/5bf7e56e-0ee182ee-63c1-7cd30aed3a08/_apdtest/_apdtest.vmx because APD recovered before VMCP reacts
2018-11-23T12:11:29.502Z info fdm[2177615] [Originator@6876 sub=Policy opID=SWI-3cfbb7a2] [VmOperationsManager::PerformResets] Resetting a list of 1 VMs.

Its shame it is vSphere 6.7 feature only! my friend at VMware confirmed, it will not be backported to the older versions.

Those who were waiting for this, even those who didn’t know they need it, you are welcome 🙂


Update December 11th 2018: It was backported to the vSphere 6.5 as well 🙂 Although under different PR number, that’s why we thought it won’t be. esxi650-201811002

The following two tabs change content below.
Dusan has over 8 years experience in the Virtualization field. Currently working as Senior VMware plarform Architect at one of the biggest retail bank in Slovakia. He has background in closely related technologies including server operating systems, networking and storage. Used to be a member of VMware Center of Excellence at IBM, co-author of several Redpapers. His main scope of work consists from designing and performance optimization of business critical virtualized solutions on vSphere, including, but not limited to Oracle WebLogic, MSSQL and others. He holds several IT industry leading certifications like VCAP-DCD, VCAP-DCA, MCITP and the others. Honored with #vExpert2015-2018 awards by VMware for his contribution to the community. Opinions are my own!

About Dusan Tekeljak

Dusan has over 8 years experience in the Virtualization field. Currently working as Senior VMware plarform Architect at one of the biggest retail bank in Slovakia. He has background in closely related technologies including server operating systems, networking and storage. Used to be a member of VMware Center of Excellence at IBM, co-author of several Redpapers. His main scope of work consists from designing and performance optimization of business critical virtualized solutions on vSphere, including, but not limited to Oracle WebLogic, MSSQL and others. He holds several IT industry leading certifications like VCAP-DCD, VCAP-DCA, MCITP and the others. Honored with #vExpert2015-2018 awards by VMware for his contribution to the community. Opinions are my own!
Bookmark the permalink.

One Comment

  1. Pingback: Just Another ESXi 6.0 Storage APD Handling Bug - The Virtualist

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.