NSX-V Manager Cross Site Failover



Special thanks to Jack Cherkas (@jackcherkas) for helping with this blog post.

The story

So let me cover the scope here:
1. The customer has a stretched cross-site environment, with dedicated Management and Edge clusters in each site and a stretched vSAN cluster for Compute (customer workloads).
2. There is also a third site, which hosts the witness components.
3. It may be obvious, but it is worth mentioning: the whole environment is managed by a single vCenter Server and a single NSX-V Manager.

Here is a very high level diagram of the environment.

NSX-V Manager is hosted on the Management Cluster in Site 1, so how do we ensure that NSX Manager stays available and manageable if a site fails? If Site 2 fails, it’s pretty straightforward, since the manager keeps running in Site 1. But what if Site 1 fails?

We all know that one of the main features still missing in NSX-V Manager is the ability to create some sort of HA, and I am not even talking about cross-site HA. In fact, there is no real way to protect NSX-V Manager at all, except taking a backup and then restoring it.

Wait, backup and restore... Actually, that’s the key. Starting with NSX 6.4, NSX-V Manager can be restored with a different IP from the one it was configured with during installation. Link

So why not use that option?

Solution

The approach I will describe is very straightforward.
Here is what you need to do:
1. Deploy and configure a blank NSX-V Manager appliance of the same version in the second site.
2. Configure it to use the same hostname but a different IP.
3. Configure your primary NSX Manager to perform scheduled backups to a certain backup location.
4. Replicate your backup files to a backup server in the second site.
5. Configure your secondary NSX Manager to read backup files from the backup server in the second site. Do not schedule any backups here.
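
The prerequisites above boil down to three comparisons, which can be sanity-checked before you rely on the standby appliance. Here is a minimal sketch, assuming you collect the version, hostname, and IP of each manager yourself (for example from each appliance's summary page). The function name `check_standby` and all sample values are hypothetical.

```shell
#!/bin/bash
# Hypothetical pre-flight check: compares values you have collected by hand
# from the primary and standby NSX-V Manager appliances.
# Usage: check_standby P_VERSION P_HOST P_IP S_VERSION S_HOST S_IP
check_standby() {
    local p_ver=$1 p_host=$2 p_ip=$3 s_ver=$4 s_host=$5 s_ip=$6
    local rc=0
    if [ "$p_ver" != "$s_ver" ]; then
        echo "FAIL: versions differ ($p_ver vs $s_ver)"; rc=1
    fi
    if [ "$p_host" != "$s_host" ]; then
        echo "FAIL: hostnames differ ($p_host vs $s_host)"; rc=1
    fi
    if [ "$p_ip" = "$s_ip" ]; then
        echo "FAIL: both appliances use the same IP ($p_ip)"; rc=1
    fi
    if [ "$rc" -eq 0 ]; then
        echo "OK: standby matches the primary"
    fi
    return "$rc"
}

# Example with made-up values
check_standby 6.4.6 nsxmgr.example.local 10.1.0.10 \
              6.4.6 nsxmgr.example.local 10.2.0.10
```

The function returns a non-zero status on any mismatch, so it can gate a scheduled job or a monitoring check.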

Here is a simple picture to make it more visible:

So how do we restore NSX Manager once all that is in place? You only need a couple of steps, and these can be either automated or performed manually.

So let’s imagine Site 1 has failed.
1. Rewrite the DNS record for your NSX-V Manager to resolve to the IP of the NSX-V Manager in the secondary site.
2. Log in to the admin web interface of the NSX-V Manager in the secondary site and restore the last known good backup.
3. You might need to restart the vCenter appliance so it picks up the changed IP of NSX-V Manager.
4. Go to the Host Preparation tab and click Fix on each cluster so the agents on the ESXi hosts pick up the NSX-V Manager IP change.
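
Step 1 is the easiest one to automate. The sketch below only generates an `nsupdate` batch that repoints the A record. It assumes your DNS server accepts RFC 2136 dynamic updates and that the BIND `nsupdate` client is available, which may not match your environment (AD-integrated DNS, for instance, is usually handled differently). The function name, server, hostname, IP, and key path are all placeholders.

```shell
#!/bin/bash
# Print an nsupdate batch that repoints the manager's A record to the standby IP.
# Usage: dns_failover_batch DNS_SERVER FQDN NEW_IP [TTL]
dns_failover_batch() {
    local server=$1 fqdn=$2 new_ip=$3 ttl=${4:-300}
    printf 'server %s\n' "$server"
    printf 'update delete %s. A\n' "$fqdn"
    printf 'update add %s. %s A %s\n' "$fqdn" "$ttl" "$new_ip"
    printf 'send\n'
}

# Review the batch first; all names and addresses here are placeholders.
dns_failover_batch ns1.example.local nsxmgr.example.local 10.2.0.10

# To apply it, pipe the same output into nsupdate with your own TSIG key, e.g.:
#   dns_failover_batch ns1.example.local nsxmgr.example.local 10.2.0.10 | nsupdate -k /path/to/key
```

Keeping the batch generation separate from the actual `nsupdate` call makes it possible to review what will change before anything is sent to the DNS server.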

How to restore vCenter in this case is a different story, which I might cover in a different article. Another thing to worry about is the controllers: you need to make sure at least 2 of them stay up during a site failure, otherwise you might get in trouble, especially if you use a lot of dynamic routing. But assuming everything else is covered, that’s all you need to do to recover NSX-V Manager.
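
The "at least 2 controllers" rule is simply a majority of a three-node controller cluster. As a rough illustration, here is a small check that takes one status per controller and succeeds only when more than half are running. The function name and the `RUNNING`/`DOWN` strings are placeholders for whatever your own inventory or monitoring reports; nothing here queries NSX itself.

```shell
#!/bin/bash
# Succeeds when a majority of the controller cluster is reported as running.
# Each argument is the status of one controller; "RUNNING" is a placeholder
# for whatever your inventory or monitoring reports for a healthy node.
controllers_have_majority() {
    local total=$# running=0 status
    for status in "$@"; do
        if [ "$status" = "RUNNING" ]; then
            running=$((running + 1))
        fi
    done
    echo "$running of $total controllers running"
    # Majority: more than half of the nodes (2 of 3 in a three-node cluster).
    [ "$total" -gt 0 ] && [ $((running * 2)) -gt "$total" ]
}

# One controller lost with Site 1, two still up (made-up statuses)
if controllers_have_majority DOWN RUNNING RUNNING; then
    echo "controller majority is intact"
fi
```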

Some Details

To give a bit more technical detail, here are some screenshots from the running config.

Active NSX-V Manager
Stand-by NSX-V Manager

Pay close attention to the highlighted areas: same version, same hostname, but different IPs and different backup servers. One has a scheduled backup, the other does not.

When it comes to replicating the backup files, there are many ways to achieve that. I wrote a really simple rsync script and scheduled it to run hourly.

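#!/bin/bash
# Mirror the local NSX-V backup directory to the backup server in the second site.
LOG=/tmp/rsync.log
echo '#######################################################' >> "$LOG"
date >> "$LOG"
# -a already implies -r; --delete removes remote files that no longer exist locally.
# 2>&1 sends rsync errors to the same log as its normal output.
rsync -av -e ssh --delete /vbackup root@10.45.143.201:/ >> "$LOG" 2>&1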
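
Since a restore is only as good as the newest replicated file, it may be worth checking on the second-site backup server that the copy is recent. Here is a minimal sketch, assuming GNU `stat` and `date`, and assuming the replicated files sit directly under `/vbackup` on that server. The function names and the two-hour threshold are illustrative choices, not part of the original setup.

```shell
#!/bin/bash
# Print the most recently modified regular file in a directory.
newest_file() {
    local dir=$1 newest="" f
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
            newest=$f
        fi
    done
    [ -n "$newest" ] || return 1
    printf '%s\n' "$newest"
}

# Succeed only if the newest file is at most MAX_AGE seconds old.
# Usage: backup_is_fresh DIR MAX_AGE_SECONDS
backup_is_fresh() {
    local dir=$1 max_age=$2 newest age
    newest=$(newest_file "$dir") || return 1
    age=$(( $(date +%s) - $(stat -c %Y "$newest") ))   # GNU stat
    [ "$age" -le "$max_age" ]
}

# Example: warn if nothing newer than two hours exists in the replica.
if backup_is_fresh /vbackup 7200; then
    echo "replica is fresh: $(newest_file /vbackup)"
else
    echo "WARNING: no recent backup found in /vbackup"
fi
```

With an hourly rsync, a two-hour threshold tolerates one missed run before raising a warning; adjust it to your own backup schedule.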

In conclusion, I want to repeat that recovering NSX-V Manager on its own makes no sense. You also need to make sure your vCenter is recovered and at least 2 of your controllers are available. vCenter recovery can be achieved using vCenter HA, although doing that over L3 networks, where the primary and secondary vCenter appliances are on different subnets, is quite challenging. That solution is somewhat documented, but the documentation is far from perfect. I will try to find some time and document the exact steps I had to perform to make vCenter HA over L3 work.

Aram Avetisyan is an IT specialist with more than 18 years of experience. He has a rich background in various IT-related fields such as Cloud, Virtualization and SDN. He holds several industry-level certifications, including but not limited to VCIX-DCV and VCIX-NV. He is also a vExpert (2014-2021).


3 Comments

  1. Pingback: vCenter HA over Layer 3 Network: Step-by-Step Guide - The Virtualist

  2. Aram, you say that it is very important to have at least 2 controllers up. But if Site 1 has failed completely, the controllers are presumably down as well. Couldn’t you restore the NSX Manager in Site 2 without controllers, in order to redeploy the controllers in Site 2 too?
    I’m wondering how to do a disaster recovery procedure if you need the controllers to restore the NSX Manager in Site 2, but you need the NSX Manager to redeploy the controllers in that same Site 2!
    Thanks.
    Guido.

    • Hello Guido, thanks for the comment.
      My note about “having 2 controllers” is not directly related to the NSX Manager restore.
      As a matter of fact, if we revisit that paragraph you will see that I am talking about general site failure, about restoring vCenter, and about dynamic routing. Here is the quote: “you need to make sure at least 2 of your controller are up during SITE failure otherwise you might get in trouble especially if you use a lot of dynamic routing”
      Remember, this article talks about a stretched environment. In a stretched environment you need to make sure that the VMs restarted by vSphere HA on another site have a proper routing table to communicate, hence you need to make sure dynamic routing works, hence you need at least 2 controllers up.
      If your use case is different, if you have a DR case rather than an active/active stretched cluster case, your option is to restore NSX Manager and then redeploy the 2 missing controllers. But keep in mind that VMs which you fail over from site to site will not be able to communicate until you do that. All routing table updates require the controllers to be up. Hope this helps.
