First, let’s get one tiny bit of information out of the way. vSphere Proactive HA is not actually part of vSphere HA; it’s a feature of vSphere DRS. Ironically, though, you configure it in the availability settings, right alongside vSphere HA! Hey, that’s how we like it in IT, right? A tad confusing? Now that we have that out of the way, let’s dive in.
This is actually one of my favorite features of vSphere. Everyone was so excited about the new HTML5-based web client that this feature has flown completely under the radar! It’s not nearly as glamorous or pretty, but it can and will save you someday! We all know that host hardware can begin to fail without the administrator’s knowledge. It’s 2 hours from the weekend and suddenly your weekend plans change.
Enter vSphere Proactive HA… let’s keep your weekend plans intact, right?
vSphere Proactive HA detects degraded hardware conditions on a host and lets you evacuate its VMs before the issue triggers a vSphere HA failover. Failure happens at the most inopportune times. Hardware can run degraded for minutes, hours, or even days, and when it eventually fails, workloads need to be restarted by vSphere HA, causing a service outage. If only vCenter or the administrator had known, the workloads could have been moved before they went down.
vSphere Proactive HA can respond to different types of failure events reported by the hardware provider it uses, such as:
- Power Supply.
In your typical server failure, the server just goes down and HA restarts the virtual machines. vSphere Proactive HA, however, lets you configure actions for events that MAY lead to server failure.
For instance, let’s say a power supply has gone down. Your server has redundant power supplies, so it is still up, but it now has a single point of failure and is in a degraded state. When this occurs, vSphere Proactive HA is triggered and the VMs on that host are migrated to healthy hosts in the cluster. The degraded host is then put in either maintenance mode or quarantine mode, depending on the remediation you configured (a mixed option is available as well). Great idea, isn’t it?!
vSphere Proactive HA lets you configure various automation and remediation settings.
How do we know the host is degraded, you ask? This is where new items called Health Providers come into play. As of this writing, the Health Providers come from the main server vendors, so check your hardware’s documentation for its VMware integration. I am sure more will be added in the future: VMware always seems to start with the biggest partners, and things generally filter down from there. If you don’t see your hardware vendor just now, check back later.
The health provider reads all the sensor data from the server, analyzes the results, and sends the resulting state of the host to vCenter Server.
vSphere Proactive HA providers appear once their corresponding vSphere Client plugin has been installed, and they monitor every host in the cluster.
The possible states are Healthy, Moderate Degradation, Severe Degradation and Unknown. The first three are also known as Green, Yellow and Red! Each provider is different depending on which server vendor distributes it, and may offer features or functionality that its competitors don’t, so be aware of that. Note that some hardware vendors let you override the default criticality of an event; this is the case with Dell OpenManage Integration for VMware vCenter, for instance.
Dell OpenManage Integration for VMware vCenter lets you change the criticality of an alert.
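To make the health states a bit more concrete, here is a minimal Python sketch of how a provider might roll per-component readings up into a single host state. This is not any vendor’s actual logic: `HealthState`, `SEVERITY` and `host_state` are hypothetical names made up for illustration, the gray color for Unknown is my own choice, and the “worst component wins” rule is an assumption.

```python
from enum import Enum


class HealthState(Enum):
    """The four host states named above, each with a display color."""
    HEALTHY = "green"
    MODERATE = "yellow"   # moderate degradation
    SEVERE = "red"        # severe degradation
    UNKNOWN = "gray"      # assumption: color picked for illustration only


# Assumption: ranking used to pick the "worst" known state.
SEVERITY = {
    HealthState.HEALTHY: 0,
    HealthState.MODERATE: 1,
    HealthState.SEVERE: 2,
}


def host_state(component_states):
    """Roll component states up into one host state (worst known state wins)."""
    known = [s for s in component_states if s in SEVERITY]
    if not known:
        # No usable readings: the provider cannot vouch for the host.
        return HealthState.UNKNOWN
    return max(known, key=SEVERITY.get)


# Example: one degraded power supply on an otherwise healthy host.
# Treating that as "moderate" is an assumption; real providers decide this.
readings = {
    "psu1": HealthState.MODERATE,
    "psu2": HealthState.HEALTHY,
    "fan1": HealthState.HEALTHY,
}
print(host_state(readings.values()).name)  # MODERATE
```

The point of the sketch is simply that vCenter sees one state per host, whatever mix of sensors sits behind it.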
Once vCenter is in the loop and aware of the degraded host, DRS can now take action based on the state of the hosts in a cluster. As with traditional DRS, it evaluates where VMs can go and migrates them to their new hosts.
There are three options for partially failed hosts:
- Quarantine mode for all failures – Do not add new VMs to the host. Balances performance and availability, by avoiding the usage of partially degraded hosts provided that virtual machine performance is unaffected.
- Maintenance mode for all failures – Migrate all VMs off the host and place the host in maintenance mode. This ensures that virtual machines do not run on partially failed hosts.
- Quarantine mode for moderate and Maintenance mode for severe failure (Mixed) – For a moderate failure, the host is quarantined and its VMs can keep running; for a severe failure, the VMs are migrated off and the host is placed in maintenance mode. Balances performance and availability by avoiding the usage of moderately degraded hosts, provided that virtual machine performance is unaffected, and ensures that virtual machines do not run on severely failed hosts.
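The three options above boil down to a small lookup table: each one decides what happens for a moderate failure and what happens for a severe one. Here is a rough Python sketch of that mapping. The policy keys, the `Remediation` enum and the `remediation_for` function are names I made up for illustration; they are not vSphere API identifiers.

```python
from enum import Enum


class Remediation(Enum):
    QUARANTINE = "quarantine mode"
    MAINTENANCE = "maintenance mode"


# Hypothetical policy keys; each maps a failure severity to a remediation.
POLICIES = {
    "quarantine_all": {
        "moderate": Remediation.QUARANTINE,
        "severe": Remediation.QUARANTINE,
    },
    "maintenance_all": {
        "moderate": Remediation.MAINTENANCE,
        "severe": Remediation.MAINTENANCE,
    },
    "mixed": {
        "moderate": Remediation.QUARANTINE,
        "severe": Remediation.MAINTENANCE,
    },
}


def remediation_for(policy, severity):
    """Return what a policy does for a 'moderate' or 'severe' failure."""
    try:
        return POLICIES[policy][severity]
    except KeyError:
        raise ValueError(
            f"unknown policy or severity: {policy!r}, {severity!r}"
        ) from None


# Print the behavior of every policy side by side.
for name in POLICIES:
    moderate = remediation_for(name, "moderate").value
    severe = remediation_for(name, "severe").value
    print(f"{name}: moderate -> {moderate}, severe -> {severe}")
```

Seen this way, mixed mode is just the only row where the two columns differ.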
Let’s talk about Quarantine mode first. In quarantine mode, DRS will vMotion VMs off the degraded host to other hosts in the cluster as long as:
- There is no performance impact on any other VMs in the cluster.
- None of the DRS rules are compromised.
Quarantine mode also makes sure that no newly built VMs in the cluster are placed on that host: when you build a new machine, placement takes the quarantine into consideration and picks another host. What quarantine mode does not do is force every VM off the host; that is what Maintenance Mode is for.
Now that we’ve covered quarantine mode, let’s cover maintenance mode in a bit more detail. Maintenance Mode will evacuate all the VMs off of the host, no questions asked. You might be familiar with this mode already, as it’s been around for a while and is often used for patching hosts. A host in maintenance mode does not run any VMs whatsoever.
With Quarantine Mode a full evacuation is not guaranteed. Quarantine Mode is considered the new middle ground. An ESXi host in quarantine can and will be used to satisfy VM demand where needed as opposed to Maintenance Mode.
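If it helps, here is the difference between the two modes as a small Python sketch. It only models the rules described in this article (a quarantined host gives up a VM when the move doesn’t hurt performance and doesn’t break a DRS rule; a host in maintenance mode gives up everything). The `Move` class and `vms_to_migrate` function are hypothetical and bear no relation to how DRS is implemented internally.

```python
from dataclasses import dataclass


@dataclass
class Move:
    """A candidate migration of one VM off the degraded host (hypothetical model)."""
    vm: str
    hurts_performance: bool  # would the move degrade other VMs in the cluster?
    breaks_drs_rule: bool    # would the move violate a DRS rule?


def vms_to_migrate(mode, moves):
    """Return the names of the VMs that leave the host under the given mode."""
    if mode == "maintenance":
        # Maintenance mode: every VM leaves, no questions asked.
        return [m.vm for m in moves]
    if mode == "quarantine":
        # Quarantine mode: a VM moves only when the move costs nothing.
        return [
            m.vm
            for m in moves
            if not (m.hurts_performance or m.breaks_drs_rule)
        ]
    raise ValueError(f"unknown mode: {mode!r}")


moves = [
    Move("web01", hurts_performance=False, breaks_drs_rule=False),
    Move("db01", hurts_performance=True, breaks_drs_rule=False),
    Move("app01", hurts_performance=False, breaks_drs_rule=True),
]
print(vms_to_migrate("quarantine", moves))   # ['web01']
print(vms_to_migrate("maintenance", moves))  # ['web01', 'db01', 'app01']
```

In the quarantine case, `db01` and `app01` stay put on the degraded host, which is exactly the trade-off you accept with that mode.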
Now that we’ve discussed your options, I’m not sure there is a 100% concrete best practice here. What you choose will probably depend on your environment and its tolerance for risk, but vSphere Proactive HA gives you extra options for dealing with host hardware failures in vSphere 6.5.