If you’re running a particularly important application that simply must be available at all times, VMware fault tolerance enhances the protection of a Virtual Machine and provides a higher level of business continuity than failover alternatives. In this article, I’ll explain how to get started with VMware fault tolerance in vSphere and what you need to know to ensure you get set up properly.
I’ve had this one in the works for a while and finally found the time to finish it up. Fault tolerance is one of those features with some great use cases that many people simply aren’t aware of. In my recent experience, admins have either used it a ton or hardly at all. To be honest, it was always a neat but not particularly practical tool, that is, until vSphere 6 came out.
Before we go any further, let’s discuss what fault tolerance in VMware actually is, for anyone wondering. VMware vSphere fault tolerance works by continuously replicating an entire running VM from one physical server to another. A fault tolerance enabled VM has two instances:
- Primary VM
- Secondary VM

Use cases for FT
- Applications that need to be available at all times, especially those that have long-lasting client connections that users want to maintain during hardware failure.
- Clustering applications can be complex and take time and additional funding to set up; VMware fault tolerance configuration is just a few clicks.
- Legacy applications might not support native clustering.
- Cases where high availability might be provided through custom clustering solutions, which are too complicated to configure and maintain.
During certain critical periods in the life cycle of a VM, you might want to enhance the protection of the virtual machine. For example, you might be executing a script that, if interrupted, might delay the availability of mission-critical information. With VMware fault tolerance, you can protect the virtual machine before running that script and then turn off or disable fault tolerance after the report has been produced. You can also leave VMware vSphere fault tolerance running all the time, but it’s worth noting that you can use it on demand if you want.
Each VM is running on a different ESXi host. The machines are logically identical; they represent a single VM state and a single network identity, but they are physically separate instances. This helps in the event of a host failure. If you’ve designed it right, your VMs won’t have any downtime. Think of VMware vSphere fault tolerance as a supercharged High Availability!
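To make the primary/secondary relationship concrete, here is a toy Python model (an illustration only, not VMware code; the class and host names are mine). It shows the key idea: two physical instances on different hosts presenting one VM identity, with the secondary promoted if the primary's host dies.

```python
# Toy model of an FT pair: two instances, one identity (illustration only).

class FTPair:
    def __init__(self, name, primary_host, secondary_host):
        if primary_host == secondary_host:
            raise ValueError("primary and secondary must run on different hosts")
        self.name = name                      # single network identity
        self.primary_host = primary_host
        self.secondary_host = secondary_host

    def host_failed(self, host):
        """If the primary's host fails, the secondary is promoted with no
        downtime; a replacement secondary is then needed on a surviving host."""
        if host == self.primary_host:
            self.primary_host, self.secondary_host = self.secondary_host, None
        elif host == self.secondary_host:
            self.secondary_host = None
        return self.primary_host  # the VM keeps running under the same name

pair = FTPair("db01", "esxi-01", "esxi-02")
print(pair.host_failed("esxi-01"))  # db01 is now served from esxi-02
```

Note that in the real product the promotion is transparent to clients precisely because both instances share a single VM state and network identity.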
Back in what now seems like an eternity ago, VMware fault tolerance in vSphere 4 had a ton of limitations. It only allowed single-core virtual machines and required thick-provisioned disks on shared storage, which wasn’t ideal. Most of the people I worked with, both as a consultant and a VMware Trainer, did not find it particularly useful, mostly because of that single-core limitation. The production workloads they ran generally required more than a single core, and they wanted redundant copies of the data. Because the older version of Fault Tolerance required the VMs to sit on shared disks, redundant copies simply weren’t possible.
So when VMware overhauled it in vSphere 6.x, both of those negatives were addressed. vSphere 6.0 brought not only the possibility to configure up to four vCPUs for FT-enabled VMs (raised further in later releases), but also completely changed the technology (and marketing lingo!) under the covers and improved the network latency. Previous releases of vSphere used vLockstep technology to keep the primary and secondary VMs in sync. vSphere now uses a technology called “Fast Checkpointing.”
Each machine has its own virtual machine files (.vmx) and virtual machine disk files (.vmdk). After you turn on VMware fault tolerance, the first synchronization of the virtual machine disk files happens using vSphere Storage vMotion. Subsequently, VMware vSphere fault tolerance mirrors vmdk writes between the primary and secondary VM over the fault tolerance logging network. This FT network needs to be a 10-Gigabit network if you’re going to run multi-vCPU machines. I think of VMware fault tolerance now as a vMotion that never ends: once you kick it off, it constantly sends changes and does not stop until you turn it off. That’s why you need to be fairly picky about which machine(s) you enable it on.
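The “vMotion that never ends” idea can be pictured with a short Python sketch. This is illustration only, not VMware’s actual protocol: a one-time full sync (the Storage vMotion), followed by every subsequent write being applied to both disk copies.

```python
# Toy sketch of FT disk-write mirroring (illustration only, not VMware code).

primary_vmdk = {}    # block number -> data
secondary_vmdk = {}

def initial_sync():
    # Storage vMotion copies the full disk once, when FT is turned on.
    secondary_vmdk.clear()
    secondary_vmdk.update(primary_vmdk)

def mirrored_write(block, data):
    # Every later write is applied to both copies over the FT logging
    # network, until FT is turned off.
    primary_vmdk[block] = data
    secondary_vmdk[block] = data

primary_vmdk[0] = "boot sector"
initial_sync()
mirrored_write(1, "app data")
assert primary_vmdk == secondary_vmdk  # the two copies never diverge
```

The constant mirroring is why the FT logging network’s bandwidth and latency matter so much: it carries every write for as long as protection is enabled.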
The official VMware fault tolerance requirements are listed below:
- CPUs that are used in host machines for fault-tolerant VMs must be compatible with vSphere vMotion. Also, CPUs that support Hardware MMU virtualization (Intel EPT or AMD RVI) are required. The following CPUs are supported.
- Intel Sandy Bridge or later. Avoton is not supported.
- AMD Bulldozer or later.
- Use a 10-Gbit VMware fault tolerance logging network for FT and verify that the network is low latency. A dedicated FT network is highly recommended. (Don’t use VLANs on a shared network uplink!)
Now that we’ve covered the requirements and why you’d want to use it, let’s cover some of the limitations. I’ll go ahead and put this out there now since you’re probably thinking WOW, this tool is amazing! I’ll just turn this option on for all of my virtual machines and NEVER have downtime. Unfortunately, it doesn’t work that way. You’ll have to be selective on which VMs you’ll want to enable it on.
- The maximum number of fault-tolerant VMs allowed on a host in the cluster is 4. Both Primary VMs and Secondary VMs count toward this limit.
- The maximum number of vCPUs aggregated across all fault-tolerant VMs on a host is 8. vCPUs from both Primary VMs and Secondary VMs count toward this limit.
- The number of vCPUs supported by a single fault-tolerant VM is limited by the level of fault tolerance VMware licensing that you have purchased for vSphere. VMware fault tolerance is supported as follows:
- vSphere Standard and Enterprise. Allows up to 2 vCPUs
- vSphere Enterprise Plus. Allows up to 8 vCPUs
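The limits above lend themselves to a simple pre-flight check. Below is a hedged sketch (the function and its parameters are mine, not a vSphere API; the numbers are the ones quoted above, so verify them against the release notes for your version) of deciding whether one more FT VM fits on a host.

```python
# Sketch of an FT placement check using the limits quoted in the article.
# Not a vSphere API; numbers should be verified against your release notes.

LICENSE_VCPU_CAP = {"standard": 2, "enterprise": 2, "enterprise_plus": 8}
MAX_FT_VMS_PER_HOST = 4      # primaries and secondaries both count
MAX_FT_VCPUS_PER_HOST = 8    # aggregate across all FT VMs on the host

def can_enable_ft(vm_vcpus, license_edition, host_ft_vms, host_ft_vcpus):
    """Return True if one more FT VM with vm_vcpus fits on this host."""
    if vm_vcpus > LICENSE_VCPU_CAP[license_edition]:
        return False                      # per-VM cap set by licensing
    if host_ft_vms + 1 > MAX_FT_VMS_PER_HOST:
        return False                      # too many FT VMs on the host
    if host_ft_vcpus + vm_vcpus > MAX_FT_VCPUS_PER_HOST:
        return False                      # aggregate vCPU cap exceeded
    return True

print(can_enable_ft(4, "enterprise_plus", host_ft_vms=1, host_ft_vcpus=2))  # True
print(can_enable_ft(4, "standard", host_ft_vms=0, host_ft_vcpus=0))         # False
```

The second call fails purely on licensing: Standard caps FT VMs at 2 vCPUs, regardless of how much room the host has.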
You will first need to create a VMkernel port. A VMkernel port is used for back-end host traffic such as management, vMotion, vSAN, and VMware fault tolerance logging. If a host has to talk to another host, chances are a VMkernel port is involved.
VMware fault tolerance enabled VMkernel
You can create one using the following steps in the vSphere Web Client.
- In the vSphere Web Client, navigate to the Host.
- Under Manage, select Networking and then select VMkernel adapters.
- Click Add host networking.
- On the Select connection type page, select VMkernel Network Adapter and click Next.
- On the Select target device page, select either an existing standard switch or a New vSphere standard switch.
- On the Port properties, enable VMware fault tolerance and select Next.
- Configure the network settings for the fault tolerance logging VMkernel interface and click Next.
- Review the settings and click Finish.
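The wizard steps above boil down to a small piece of configuration: a VMkernel adapter on a standard switch with the Fault Tolerance logging service enabled and an IP address on the FT network. As a rough illustration, here it is modeled as a plain Python dict (the field names are mine, not an actual vSphere API or config file format).

```python
# Illustration only: the settings the Add Networking wizard collects,
# modeled as a plain dict. Field names are mine, not a vSphere API.

def build_ft_vmkernel_config(switch, ip, netmask):
    """Summarize the wizard's output for an FT logging VMkernel adapter."""
    return {
        "adapter_type": "VMkernel Network Adapter",
        "switch": switch,   # an existing or new vSphere standard switch
        "enabled_services": ["Fault Tolerance logging"],
        "ipv4": {"address": ip, "netmask": netmask},
    }

cfg = build_ft_vmkernel_config("vSwitch1", "10.0.50.11", "255.255.255.0")
print(cfg["enabled_services"])
```

Keeping FT logging as the only service on this adapter matches the earlier recommendation of a dedicated, low-latency 10-Gbit network.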
After you have your VMkernel port created, you turn on VMware Fault Tolerance by simply right-clicking the virtual machine and selecting Fault Tolerance > Turn On Fault Tolerance from the menu.
Enable Fault tolerance on a VM
You then need to select a datastore on which to place the secondary VM’s disks and configuration files. It is recommended to select a different datastore than the one used by the primary VM, even a different backing device if at all possible.
Select a datastore for the secondary VM
Then you are asked to specify a host on which to create the secondary VM.
Select a host for the secondary VM
Finally, review the settings and click Finish if you are happy with them.
Review the VMware fault tolerance changes
When VMware fault tolerance is running, you’ll notice the icon of the VM becomes dark blue as opposed to the regular grey icon. If you go to the VM Summary tab, it will indicate what host the primary and secondary virtual machines are running on.
The status of the VMware fault tolerance enabled VM is displayed in the summary tab.
Once VMware vSphere Fault Tolerance is enabled on a VM, you can manage the feature to ensure that it is working properly or for maintenance purposes.
Right-click on the VM and display the choices.
- Suspend Fault Tolerance: temporarily pauses FT protection and its logging traffic without removing the configuration.
- Migrate Secondary: moves the secondary VM to another vSphere host.
- Test Failover: verifies that your secondary can correctly take on the role of primary.
- Test Restart Secondary: verifies that the secondary VM can be restarted correctly.
VMware Fault Tolerance offers options to manage the feature on a VM
To properly protect your VMware environment, use Altaro VM Backup to securely back up and replicate your virtual machines. We work continually to give our customers confidence in their VMware backup strategy.
To keep up to date with the latest VMware best practices, become a member of the VMware DOJO now (it’s free).
As an added bonus, vSphere 6.7 added storage failure protection to VMware fault tolerance. Should there be an APD (All Paths Down) event to storage, this will now trigger a failover of VMware fault tolerance protected VMs.
Hopefully, this post has helped clear up any confusion you might have had over VMware fault tolerance. It really is a great tool if used in the right case!