HDD failed in vSAN cluster (Softlayer) – how to replace it

In this article I would like to show you how to replace a failed HDD (or SSD) in a VMware vSAN cluster running on Softlayer Cloud.

1. Open the vSphere Web Client and navigate to: Hosts and Clusters -> choose Cluster -> Manage -> Settings -> Virtual SAN -> Disk Management

You will see all ESXi hosts in the vSAN-enabled cluster and all vSAN disk groups.

2. Select the disk group and identify the failed HDD.

3. Remove the disk from the disk group.

Most probably the migration mode “Full data migration” will not work, because the disk has failed and its data can’t be read, so you need to choose “No data migration”.

Wait until the failed disk disappears from the list in the disk group.

4. Log in to the ESXi host over SSH.

Enter the command: /opt/lsi/storcli/storcli /c0 show

(StorCLI must be installed on the ESXi host. Download it from here: find “Latest MegaRAID Storcli” and install StorCLI.)

The output looks like the listing below. Unimportant parts are skipped here; the sections you need are “VD LIST” and “PD LIST”:

[root@hostname:~] /opt/lsi/storcli/storcli /c0 show
Generating detailed summary of the adapter, it may take a while to complete.

[...]

VD LIST :
=======

-------------------------------------------------------------------
DG/VD TYPE  State Access Consist Cache Cac sCC       Size Name
-------------------------------------------------------------------
0/0   RAID1 Optl  RW     Yes     RWBD  -   ON    931.0 GB RAID1-A
1/1   RAID0 Optl  RW     Yes     NRWTD -   ON  744.687 GB VSAN-SSD
2/2   RAID0 Optl  RW     Yes     RWTD  -   ON    1.818 TB VSAN
3/3   RAID0 Optl  RW     Yes     RWTD  -   ON    1.818 TB VSAN
4/4   RAID0 OfLn  RW     No      RWTD  -   ON    1.818 TB VSAN
5/5   RAID0 Optl  RW     Yes     RWTD  -   ON    1.818 TB VSAN
-------------------------------------------------------------------

[...]

PD LIST :
=======

-------------------------------------------------------------------------------------
EID:Slt DID State  DG       Size Intf Med SED PI SeSz Model                  Sp Type
-------------------------------------------------------------------------------------
8:0       9 Onln    0   931.0 GB SATA HDD N   N  512B ST1000NM0033-9ZM173    U  -
8:1      20 Onln    0   931.0 GB SATA HDD N   N  512B ST1000NM0033-9ZM173    U  -
8:2      11 Onln    1 744.687 GB SATA SSD N   N  512B INTEL SSDSC2BA800G4    U  -
8:3      16 Onln    2   1.818 TB SATA HDD N   N  512B WDC WD2000FYYZ-01UL1B2 U  -
8:4      10 Onln    3   1.818 TB SATA HDD N   N  512B WDC WD2000FYYZ-01UL1B2 U  -
8:5      18 Failed  4   1.818 TB SATA HDD N   N  512B -                      U  -
8:6      12 Onln    5   1.818 TB SATA HDD N   N  512B WDC WD2000FYYZ-01UL1B2 U  -
-------------------------------------------------------------------------------------

[...]
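On a host with many drives, the unhealthy rows can be filtered out of a listing like the one above instead of being read by eye. This is only a minimal sketch, assuming the output keeps the “VD LIST” / “PD LIST” layout shown here; `find_unhealthy` is a hypothetical helper name, not part of StorCLI.

```shell
# Read `storcli /c0 show` output on stdin and print every virtual drive
# that is not Optl and every physical drive that is not Onln.
find_unhealthy() {
    awk '
        /^VD LIST/ { section = "VD"; next }
        /^PD LIST/ { section = "PD"; next }
        # VD rows start with DG/VD (for example "4/4"); the state is field 3.
        section == "VD" && $1 ~ /^[0-9]+\/[0-9]+$/ && $3 != "Optl" { print "VD", $1, $3 }
        # PD rows start with EID:Slt (for example "8:5"); the state is field 3.
        section == "PD" && $1 ~ /^[0-9]+:[0-9]+$/ && $3 != "Onln" { print "PD", $1, $3 }
    '
}
```

Usage on the ESXi host: `/opt/lsi/storcli/storcli /c0 show | find_unhealthy`. For the listing above this prints `VD 4/4 OfLn` and `PD 8:5 Failed`.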

5. Identify the failed RAID group in VD LIST. In this case it is 4/4 in VD LIST, and the failed drive is 8:5 in PD LIST.
6. Verify once more that you have chosen the correct VD:

/opt/lsi/storcli/storcli /c0/v4 show

[root@hostname:~] /opt/lsi/storcli/storcli /c0/v4 show
Virtual Drives :
-------------------------------------------------------------
DG/VD TYPE  State Access Consist Cache Cac sCC     Size Name
-------------------------------------------------------------
4/4   RAID0 OfLn  RW     No      RWTD  -   ON  1.818 TB VSAN
-------------------------------------------------------------

7. Now delete the failed RAID virtual drive:

/opt/lsi/storcli/storcli /c0/v4 del

[root@hostname:~] /opt/lsi/storcli/storcli /c0/v4 del
Controller = 0
Status = Success
Description = Delete VD succeeded
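Since `del` is destructive, the check and the delete can be combined into one guarded call. The sketch below is an illustration only: `delete_if_offline` and the `STORCLI` variable are hypothetical names, and it assumes the `show` output has the row layout printed above (state in the third column).

```shell
STORCLI=${STORCLI:-/opt/lsi/storcli/storcli}

# Delete virtual drive $1 on controller 0, but only if `show` reports it as OfLn.
delete_if_offline() {
    vd="$1"
    # Pick the state column from the row whose DG/VD field ends in "/<vd>".
    state=$("$STORCLI" "/c0/v$vd" show |
        awk -v id="$vd" '$1 ~ ("^[0-9]+/" id "$") { print $3; exit }')
    if [ "$state" = "OfLn" ]; then
        "$STORCLI" "/c0/v$vd" del
    else
        echo "VD $vd is in state '${state:-unknown}', not deleting" >&2
        return 1
    fi
}
```

For example, `delete_if_offline 4` would delete VD 4 from the listing above and refuse to touch a VD that is still Optl.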

8. Open a ticket with Softlayer support for the hardware failure and paste the evidence:

PD LIST :
=======

-------------------------------------------------------------------------------------
EID:Slt DID State  DG       Size Intf Med SED PI SeSz Model                  Sp Type
-------------------------------------------------------------------------------------
8:0       9 Onln    0   931.0 GB SATA HDD N   N  512B ST1000NM0033-9ZM173    U  -
8:1      20 Onln    0   931.0 GB SATA HDD N   N  512B ST1000NM0033-9ZM173    U  -
8:2      11 Onln    1 744.687 GB SATA SSD N   N  512B INTEL SSDSC2BA800G4    U  -
8:3      16 Onln    2   1.818 TB SATA HDD N   N  512B WDC WD2000FYYZ-01UL1B2 U  -
8:4      10 Onln    3   1.818 TB SATA HDD N   N  512B WDC WD2000FYYZ-01UL1B2 U  -
8:5      18 UBad    -   1.818 TB SATA HDD N   N  512B -                      U  -
8:6      12 Onln    4   1.818 TB SATA HDD N   N  512B WDC WD2000FYYZ-01UL1B2 U  -
-------------------------------------------------------------------------------------
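To paste only the relevant part into the ticket, the PD LIST table can be cut out of the full output. This is a rough sketch, assuming the table is framed by three dashed rules as in the listings in this article; `pd_list_section` is a hypothetical helper name.

```shell
# Read `storcli /c0 show` output on stdin and print only the PD LIST table.
pd_list_section() {
    awk '
        /^PD LIST/ { printing = 1 }
        printing { print }
        # The table has three dashed rules: top, below the header, and bottom.
        printing && /^-+ *$/ { if (++rules == 3) exit }
    '
}
```

Usage on the ESXi host: `/opt/lsi/storcli/storcli /c0 show | pd_list_section`.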

9. Wait for confirmation that the disk was replaced.
10. Rescan storage and refresh. Sometimes an ESXi host reboot is required to get correct information about the replaced HDD.
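If you are already in the SSH session, the rescan can also be triggered from the command line. A minimal sketch: `esxcli storage core adapter rescan --all` is a standard ESXi command, while the `rescan_storage` wrapper and the `ESXCLI` variable are hypothetical names used only for illustration.

```shell
# Command-line counterpart of the storage rescan, run on the ESXi host.
ESXCLI=${ESXCLI:-esxcli}   # overridable so the function can be dry-run elsewhere

rescan_storage() {
    # Rescan every storage adapter so the host notices the replaced drive.
    "$ESXCLI" storage core adapter rescan --all
}
```

Setting `ESXCLI=echo` before calling `rescan_storage` prints the command instead of running it.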

11. If the VD is recreated automatically during the ESXi host reboot, remove it using the commands from steps 5-7 (check which VD number you are going to remove and adjust the commands accordingly).

12. Now run: /opt/lsi/storcli/storcli /c0 show
The PD LIST should finally look like this (the replaced drive is 8:5):

PD LIST :
=======

-------------------------------------------------------------------------------------
EID:Slt DID State  DG       Size Intf Med SED PI SeSz Model                  Sp Type
-------------------------------------------------------------------------------------
8:0       9 Onln    0   931.0 GB SATA HDD N   N  512B ST1000NM0033-9ZM173    U  -
8:1      20 Onln    0   931.0 GB SATA HDD N   N  512B ST1000NM0033-9ZM173    U  -
8:2      11 Onln    1 744.687 GB SATA SSD N   N  512B INTEL SSDSC2BA800G4    U  -
8:3      16 Onln    2   1.818 TB SATA HDD N   N  512B WDC WD2000FYYZ-01UL1B2 U  -
8:4      10 Onln    3   1.818 TB SATA HDD N   N  512B WDC WD2000FYYZ-01UL1B2 U  -
8:5      18 UGood   -   1.818 TB SATA HDD N   N  512B WDC WD2000FYYZ-01UL1B2 U  -
8:6      12 Onln    4   1.818 TB SATA HDD N   N  512B WDC WD2000FYYZ-01UL1B2 U  -
-------------------------------------------------------------------------------------
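Before creating the new VD it is worth confirming that the replaced slot really shows up as UGood with no drive group assigned. The check below is a sketch under the assumption that PD LIST rows look like the ones in this article (state in column 3, DG in column 4); `ready_for_new_vd` is a hypothetical helper name.

```shell
# Read `storcli /c0 show` output on stdin. Succeed only if the drive in
# slot $1 (EID:Slt, for example 8:5) is UGood and belongs to no drive group.
ready_for_new_vd() {
    awk -v slot="$1" '
        /^PD LIST/ { in_pd = 1; next }
        in_pd && $1 == slot { found = 1; ok = ($3 == "UGood" && $4 == "-"); exit }
        # Exit status 0 only when the slot was found and passed both checks.
        END { exit !(found && ok) }
    '
}
```

Usage on the ESXi host: `/opt/lsi/storcli/storcli /c0 show | ready_for_new_vd 8:5 && echo "8:5 is ready"`.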

13. Create a new VD.

For an HDD (cache mode RWTD):

/opt/lsi/storcli/storcli /c0 add vd type=raid0 name=VSAN drive=8:5 ra wt direct strip=256

For an SSD (cache mode NRWTD):

/opt/lsi/storcli/storcli /c0 add vd type=raid0 name=VSAN drive=8:5 nora wt direct strip=256
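The two commands differ only in the read-ahead option, so the choice can be captured in a small helper that prints the command for review instead of running it. This is a sketch only; `add_vd_command` is a hypothetical name, and the options are exactly the ones used above.

```shell
# Print the storcli command that recreates a single-drive RAID0 VD for vSAN.
# $1 = EID:Slt of the new drive (for example 8:5), $2 = media type (HDD or SSD).
add_vd_command() {
    case "$2" in
        HDD) readahead=ra ;;    # gives cache mode RWTD
        SSD) readahead=nora ;;  # gives cache mode NRWTD
        *) echo "unknown media type: $2" >&2; return 1 ;;
    esac
    echo "/opt/lsi/storcli/storcli /c0 add vd type=raid0 name=VSAN drive=$1 $readahead wt direct strip=256"
}
```

For example, `add_vd_command 8:5 HDD` prints the HDD command shown above; check it and then run it by hand.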

14. If everything was done correctly, you should get a similar result (VD 5/4 and PD 8:5).

/opt/lsi/storcli/storcli /c0 show

VD LIST :
=======

-------------------------------------------------------------------
DG/VD TYPE  State Access Consist Cache Cac sCC       Size Name
-------------------------------------------------------------------
0/0   RAID1 Optl  RW     Yes     RWBD  -   ON    931.0 GB RAID1-A
1/1   RAID0 Optl  RW     Yes     NRWTD -   ON  744.687 GB VSAN-SSD
2/2   RAID0 Optl  RW     Yes     RWTD  -   ON    1.818 TB VSAN
3/3   RAID0 Optl  RW     Yes     RWTD  -   ON    1.818 TB VSAN
4/5   RAID0 Optl  RW     Yes     RWTD  -   ON    1.818 TB VSAN
5/4   RAID0 Optl  RW     Yes     RWTD  -   ON    1.818 TB VSAN
-------------------------------------------------------------------

PD LIST :
=======

------------------------------------------------------------------------------------
EID:Slt DID State DG       Size Intf Med SED PI SeSz Model                  Sp Type
------------------------------------------------------------------------------------
8:0       9 Onln   0   931.0 GB SATA HDD N   N  512B ST1000NM0033-9ZM173    U  -
8:1      20 Onln   0   931.0 GB SATA HDD N   N  512B ST1000NM0033-9ZM173    U  -
8:2      11 Onln   1 744.687 GB SATA SSD N   N  512B INTEL SSDSC2BA800G4    U  -
8:3      16 Onln   2   1.818 TB SATA HDD N   N  512B WDC WD2000FYYZ-01UL1B2 U  -
8:4      10 Onln   3   1.818 TB SATA HDD N   N  512B WDC WD2000FYYZ-01UL1B2 U  -
8:5      18 Onln   5   1.818 TB SATA HDD N   N  512B WDC WD2000FYYZ-01UL1B2 U  -
8:6      12 Onln   4   1.818 TB SATA HDD N   N  512B WDC WD2000FYYZ-01UL1B2 U  -
------------------------------------------------------------------------------------
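The link between the two tables is the DG column: the drive group listed for a slot in PD LIST is the first number of the DG/VD pair in VD LIST. The sketch below follows that link for one slot. It assumes VD LIST is printed before PD LIST, as in the output above, and `vd_for_slot` is a hypothetical helper name.

```shell
# Read `storcli /c0 show` output on stdin and print the DG/VD pair and state
# of the virtual drive that contains the physical drive in slot $1.
vd_for_slot() {
    awk -v slot="$1" '
        /^VD LIST/ { section = "VD"; next }
        /^PD LIST/ { section = "PD"; next }
        # Remember each VD row under its drive group number.
        section == "VD" && $1 ~ /^[0-9]+\/[0-9]+$/ { split($1, p, "/"); vd[p[1]] = $1 " " $3 }
        # Column 4 of a PD row is the drive group of that slot.
        section == "PD" && $1 == slot { print vd[$4]; exit }
    '
}
```

Usage on the ESXi host: `/opt/lsi/storcli/storcli /c0 show | vd_for_slot 8:5`. For the listing above this prints `5/4 Optl`.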

 

15. Now you can add the new HDD/SSD to the vSAN disk group.

Yevgeniy Steblyanko

Yevgeniy Steblyanko is an Infrastructure Architect/SME with more than 15 years of experience in virtualization. His areas of interest are VMware vSphere, vSAN, NSX, and automation with PowerCLI/PowerNSX. He holds the VMware certifications VCIX-DCV and VCIX-NV.
