ESXi 6.7 U1 fixes: APD and VMCP is not triggered even when no paths can service I/Os

Since there are no more details I wasn’t 100% sure it is fixing my issue. I tested it and it indeed fixes my bug!

As expected in default configuration with /Scsi/ExtendAPDCondition disabled, APD handling is not working for SCSI NOT READY

vmkernel.log flood messages like:

2018-11-23T11:50:15.495Z cpu26:2097964)lpfc: lpfc_handle_status:5066: 1:(0):3271: FCP cmd x89 failed <14/135> sid xd80f00, did x733600, oxid x105 iotag x40b SCSI Chk Cond - Not Ready: Data(x2:x2:x4:xc)
2018-11-23T11:50:15.495Z cpu27:2098141)NMP: nmp_ThrottleLogForDevice:3781: Cmd 0x89 (0x45a37928ec00, 2097213) to dev "naa.600507680c8000689000000000002f49" on path "vmhba6:C0:T14:L135" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x2 0x4 0xc. Act:FAILOVER
2018-11-23T11:50:15.495Z cpu27:2098141)WARNING: NMP: nmp_DeviceRetryCommand:133: Device "naa.600507680c8000689000000000002f49": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.
2018-11-23T11:50:16.454Z cpu26:2097689)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate:1844: Could not select path for device "naa.600507680c8000689000000000002f49".
2018-11-23T11:50:16.454Z cpu42:2098022)WARNING: NMP: nmpDeviceAttemptFailover:640: Retry world failover device "naa.600507680c8000689000000000002f49" - issuing command 0x45a37928ec00
2018-11-23T11:50:16.454Z cpu42:2098022)WARNING: vmw_psp_rr: psp_rrSelectPath:2177: Could not select path for device "naa.600507680c8000689000000000002f49".
2018-11-23T11:50:16.454Z cpu42:2098022)WARNING: NMP: nmpDeviceAttemptFailover:715: Retry world failover device "naa.600507680c8000689000000000002f49" - failed to issue command due to Not found (APD), try again...
2018-11-23T11:50:16.454Z cpu42:2098022)WARNING: NMP: nmpDeviceAttemptFailover:765: Logical device "naa.600507680c8000689000000000002f49": awaiting fast path state update...
2018-11-23T11:50:17.454Z cpu26:2097689)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate:1844: Could not select path for device "naa.600507680c8000689000000000002f49".
2018-11-23T11:50:17.454Z cpu42:2098022)WARNING: NMP: nmpDeviceAttemptFailover:640: Retry world failover device "naa.600507680c8000689000000000002f49" - issuing command 0x45a37928ec00
2018-11-23T11:50:17.454Z cpu42:2098022)WARNING: vmw_psp_rr: psp_rrSelectPath:2177: Could not select path for device "naa.600507680c8000689000000000002f49".
2018-11-23T11:50:17.454Z cpu42:2098022)WARNING: NMP: nmpDeviceAttemptFailover:715: Retry world failover device "naa.600507680c8000689000000000002f49" - failed to issue command due to Not found (APD), try again...
2018-11-23T11:50:17.454Z cpu42:2098022)WARNING: NMP: nmpDeviceAttemptFailover:765: Logical device "naa.600507680c8000689000000000002f49": awaiting fast path state update...

However once I enabled /Scsi/ExtendAPDCondition

vmkernel log showed APD handling messages immediately:

2018-11-23T11:54:36.470Z cpu36:2100104 opID=135c47ca)Config: 703: "ExtendAPDCondition" = 1, Old Value: 0, (Status: 0x0)

2018-11-23T11:56:06.495Z cpu26:2098516)lpfc: lpfc_handle_status:5066: 1:(0):3271: FCP cmd x89 failed <14/135> sid xd80f00, did x733600, oxid x100 iotag x406 SCSI Chk Cond - Not Ready: Data(x2:x2:x4:xc)
2018-11-23T11:56:06.495Z cpu32:2098141)NMP: nmp_ThrottleLogForDevice:3781: Cmd 0x89 (0x45a3793678c0, 2097213) to dev "naa.600507680c8000689000000000002f49" on path "vmhba6:C0:T14:L135" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x2 0x4 0xc. Act:FAILOVER
2018-11-23T11:56:06.495Z cpu32:2098141)WARNING: NMP: nmp_DeviceRetryCommand:133: Device "naa.600507680c8000689000000000002f49": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.
2018-11-23T11:56:07.453Z cpu29:2097689)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate:1844: Could not select path for device "naa.600507680c8000689000000000002f49".
2018-11-23T11:56:07.453Z cpu29:2097689)LVM: 5748: Received APD EventType: APD_START (3) for device <naa.600507680c8000689000000000002f49:1> (gen 1)
2018-11-23T11:56:07.453Z cpu29:2097689)LVM: 5356: Handling APD EventType: APD_START (3) for device <naa.600507680c8000689000000000002f49:1> (unlocked, gen 1, cur apd state UNKNOWN, cur dev state 1)
2018-11-23T11:56:07.453Z cpu29:2097689)HBX: 6287: APD EventType: APD_START (3) for vol 'APDTEST'
2018-11-23T11:56:07.453Z cpu29:2097689)ScsiDevice: 5978: Device state of naa.600507680c8000689000000000002f49 set to APD_START; token num:1
2018-11-23T11:56:07.453Z cpu29:2097689)StorageApdHandler: 1203: APD start for 0x4307894417b0 [naa.600507680c8000689000000000002f49]
2018-11-23T11:56:07.453Z cpu42:2098022)WARNING: NMP: nmpDeviceAttemptFailover:640: Retry world failover device "naa.600507680c8000689000000000002f49" - issuing command 0x45a3793678c0
2018-11-23T11:56:07.453Z cpu10:2097674)StorageApdHandler: 419: APD start event for 0x4307894417b0 [naa.600507680c8000689000000000002f49]
2018-11-23T11:56:07.453Z cpu42:2098022)WARNING: vmw_psp_rr: psp_rrSelectPath:2177: Could not select path for device "naa.600507680c8000689000000000002f49".
2018-11-23T11:56:07.453Z cpu10:2097674)StorageApdHandlerEv: 110: Device or filesystem with identifier [naa.600507680c8000689000000000002f49] has entered the All Paths Down state.
2018-11-23T11:56:07.453Z cpu42:2098022)WARNING: NMP: nmpDeviceAttemptFailover:715: Retry world failover device "naa.600507680c8000689000000000002f49" - failed to issue command due to Not found (APD), try again...
2018-11-23T11:56:07.453Z cpu42:2098022)WARNING: NMP: nmpDeviceAttemptFailover:765: Logical device "naa.600507680c8000689000000000002f49": awaiting fast path state update...

And exactly after 140 seconds which is default APD handling timeout value:

2018-11-23T11:58:27.455Z cpu10:2097674)LVM: 5748: Received APD EventType: APD_TIMEOUT (5) for device <naa.600507680c8000689000000000002f49:1> (gen 2)
2018-11-23T11:58:27.455Z cpu10:2097674)LVM: 5356: Handling APD EventType: APD_TIMEOUT (5) for device <naa.600507680c8000689000000000002f49:1> (unlocked, gen 2, cur apd state APD_START, cur dev state 1)
2018-11-23T11:58:27.455Z cpu10:2097674)HBX: 6287: APD EventType: APD_TIMEOUT (5) for vol 'APDTEST'
2018-11-23T11:58:27.455Z cpu10:2097674)StorageApdHandler: 609: APD timeout event for 0x4307894417b0 [naa.600507680c8000689000000000002f49]
2018-11-23T11:58:27.455Z cpu10:2097674)StorageApdHandlerEv: 126: Device or filesystem with identifier [naa.600507680c8000689000000000002f49] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will

You may also observe similar backtraces:

2018-11-23T11:58:30.457Z cpu0:2097622)BC: 410: write to vmware.log (f532 28 3 5bf7e56e ee182ee d37c63c1 83aed0a 2000c44 e 0 0 0 0 0) 1054 bytes failed: No connection
2018-11-23T11:58:30.457Z cpu0:2097622)Log: 1437: Generating backtrace for 2097622: bcflushd
2018-11-23T11:58:30.457Z cpu0:2097622)Backtrace for current CPU #0, worldID=2097622, fp=0x2001d6
2018-11-23T11:58:30.457Z cpu0:2097622)0x451b0eb1ba40:[0x4180160f72c2]Log_Backtrace@vmkernel#nover+0x10e stack: 0xf532, 0x60050, 0xbad0040, 0x418016056bd9, 0x3
2018-11-23T11:58:30.457Z cpu0:2097622)0x451b0eb1ba70:[0x418016056bd8]BCFileIO@vmkernel#nover+0x1f9 stack: 0x451a0ee182ee, 0x4180d37c63c1, 0x451b083aed0a, 0x451a02000c44, 0x262b60000000e
2018-11-23T11:58:30.457Z cpu0:2097622)0x451b0eb1bd50:[0x418016058047]BCWriteBuffer@vmkernel#nover+0x200 stack: 0x1f2bf6, 0xffffffffffffffff, 0x418000000001, 0x37000, 0x809fdb7000
2018-11-23T11:58:30.457Z cpu0:2097622)0x451b0eb1bf60:[0x4180160581b3]BCFlushFile@vmkernel#nover+0x44 stack: 0x417fd6133d40, 0x4180160581e6, 0x4180160581e6, 0x418016058295, 0x451b0eb1bfaf
2018-11-23T11:58:30.457Z cpu0:2097622)0x451b0eb1bf90:[0x418016058294]BCFlushAll@vmkernel#nover+0xc5 stack: 0x451b0e4a3100, 0x451b0eb23100, 0x0, 0x0, 0x0
2018-11-23T11:58:30.457Z cpu0:2097622)0x451b0eb1bfd0:[0x4180160582c9]BCFlusherWorld@vmkernel#nover+0xe stack: 0x0, 0x0, 0x0, 0x0, 0x0
2018-11-23T11:58:30.457Z cpu0:2097622)0x451b0eb1bfe0:[0x418016309112]CpuSched_StartWorld@vmkernel#nover+0x77 stack: 0x0, 0x0, 0x0, 0x0, 0x0

I was a bit worried they may event lead into PSoD, so I checked with VMware and it looks like they are perfectly normal and safe during the APD condition.

And here we have recovery from the APD

2018-11-23T12:11:28.453Z cpu31:2097689)ScsiDevice: 6000: Setting Device naa.600507680c8000689000000000002f49 state back to 0x2
2018-11-23T12:11:28.453Z cpu31:2097689)LVM: 5748: Received APD EventType: APD_EXIT (4) for device <naa.600507680c8000689000000000002f49:1> (gen 3)
2018-11-23T12:11:28.453Z cpu31:2097689)LVM: 5356: Handling APD EventType: APD_EXIT (4) for device <naa.600507680c8000689000000000002f49:1> (unlocked, gen 3, cur apd state APD_TIMEOUT, cur dev state 1)
2018-11-23T12:11:28.453Z cpu31:2097689)ScsiDevice: 6021: Device naa.600507680c8000689000000000002f49 is Out of APD; token num:1
2018-11-23T12:11:28.453Z cpu0:2097339)LVM: 5356: Handling APD EventType: APD_EXIT (4) for device <naa.600507680c8000689000000000002f49:1> (locked, gen 3, cur apd state APD_TIMEOUT, cur dev state 1)
2018-11-23T12:11:28.453Z cpu0:2097339)WARNING: vmw_psp_rr: psp_rrSelectPath:2177: Could not select path for device "naa.600507680c8000689000000000002f49".
2018-11-23T12:11:28.453Z cpu0:2097339)WARNING: NMP: nmp_IssueCommandToDevice:5082: I/O could not be issued to device "naa.600507680c8000689000000000002f49" due to Not found
2018-11-23T12:11:28.453Z cpu0:2097339)WARNING: NMP: nmp_DeviceRetryCommand:133: Device "naa.600507680c8000689000000000002f49": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.
2018-11-23T12:11:28.453Z cpu0:2097339)WARNING: NMP: nmp_DeviceStartLoop:729: NMP Device "naa.600507680c8000689000000000002f49" is blocked. Not starting I/O from device.
2018-11-23T12:11:28.457Z cpu42:2098022)WARNING: NMP: nmpDeviceAttemptFailover:640: Retry world failover device "naa.600507680c8000689000000000002f49" - issuing command 0x459b410438c0
2018-11-23T12:11:28.459Z cpu2:2098139)NMP: nmpCompleteRetryForPath:327: Retry world recovered device "naa.600507680c8000689000000000002f49"
2018-11-23T12:11:28.463Z cpu36:2098134)ScsiDeviceIO: 3082: Cmd(0x459b41003cc0) 0x35, CmdSN 0x144fc from world 0 to dev "eui.0050430000000000" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2018-11-23T12:11:28.465Z cpu0:2097339)HBX: 6287: APD EventType: APD_EXIT (4) for vol 'APDTEST'
2018-11-23T12:11:28.474Z cpu7:2098358)HBX: 312: 'APDTEST': HB at offset 3178496 - Reclaimed heartbeat [Timeout]:
2018-11-23T12:11:28.474Z cpu7:2098358) [HB state abcdef02 offset 3178496 gen 7 stampUS 259800031085 uuid 5bf3f7c7-bb9c97bc-97d5-7cd30aed3a00 jrnl <FB 9> drv 24.82 lockImpl 4 ip 10.212.5.127]
2018-11-23T12:11:28.474Z cpu7:2098358)HBX: 5509: Completed APD/PDL event processing on vol 'APDTEST'.
2018-11-23T12:11:28.474Z cpu10:2097674)StorageApdHandler: 507: APD exit event for 0x4307894417b0 [naa.600507680c8000689000000000002f49]
2018-11-23T12:11:28.474Z cpu10:2097674)StorageApdHandlerEv: 117: Device or filesystem with identifier [naa.600507680c8000689000000000002f49] has exited the All Paths Down state.

Even VMware Component Protection works now (Response recovery set to Restart VM)

2018-11-23T12:11:29.502Z verbose fdm[2177615] [Originator@6876 sub=Policy opID=SWI-3cfbb7a2] [MonitorComponentState::ChangeStateIfTimersExpired] Go to reset state for VM /vmfs/volumes/5bf7e56e-0ee182ee-63c1-7cd30aed3a08/_apdtest/_apdtest.vmx because APD recovered before VMCP reacts
2018-11-23T12:11:29.502Z info fdm[2177615] [Originator@6876 sub=Policy opID=SWI-3cfbb7a2] [VmOperationsManager::PerformResets] Resetting a list of 1 VMs.

~~Its shame it is vSphere 6.7 feature only! my friend at VMware confirmed, it will not be backported to the older versions.~~

Those who were waiting for this, even those who didn’t know they need it, you are welcome 🙂

Update December 11th 2018: It was backported to the vSphere 6.5 as well 🙂 Although under different PR number, that’s why we thought it won’t be. esxi650-201811002

Bio
Latest Posts

Dusan Tekeljak

With over 12 years of experience in the Virtualization field, currently working as a Senior Consultant for Evoila, contracted to VMware PSO, helping customers with Telco Cloud Platform bundle. Previous roles include VMware Architect for Public Cloud services at Etisalat and Senior Architect for the VMware platform at the largest retail bank in Slovakia. Background in closely related technologies includes server operating systems, networking, and storage. A former member of the VMware Center of Excellence at IBM and co-author of several Redpapers. The main scope of work involves designing and optimizing the performance of business-critical virtualized solutions on vSphere, including, but not limited to, Oracle WebLogic, MSSQL, and others. Holding several industry-leading IT certifications such as VCAP-DCD, VCAP-DCA, VCAP-NV, and MCITP. Honored with #vExpert2015-2019 awards by VMware for contributions to the community. Opinions are my own!

Latest posts by Dusan Tekeljak (see all)

VM Latency Sensitivity set to High still fails with no (proper) warning - June 27, 2024
ESXi 6.7 U1 fixes: APD and VMCP is not triggered even when no paths can service I/Os - November 30, 2018
Update manager error: hosts could not enter maintenance mode - November 19, 2018

ESXi 6.7 U1 fixes: APD and VMCP is not triggered even when no paths can service I/Os

Since there are no more details I wasn’t 100% sure it is fixing my issue. I tested it and it indeed fixes my bug!

However once I enabled /Scsi/ExtendAPDCondition

You may also observe similar backtraces:

And here we have recovery from the APD

Even VMware Component Protection works now (Response recovery set to Restart VM)

Dusan Tekeljak

Latest posts by Dusan Tekeljak (see all)

About Dusan Tekeljak

One Comment

Leave a ReplyCancel reply

Last Posts

Top Posts & Pages

Categories

ESXi 6.7 U1 fixes: APD and VMCP is not triggered even when no paths can service I/Os

Since there are no more details I wasn’t 100% sure it is fixing my issue. I tested it and it indeed fixes my bug!

However once I enabled /Scsi/ExtendAPDCondition

You may also observe similar backtraces:

And here we have recovery from the APD

Even VMware Component Protection works now (Response recovery set to Restart VM)

Dusan Tekeljak

Latest posts by Dusan Tekeljak (see all)

Share this:

About Dusan Tekeljak

One Comment

Leave a ReplyCancel reply

Last Posts

Top Posts & Pages

Categories

Tags