Common Issues and Pitfalls | Hyper-V Clusters Part 4
If you’ve followed all the articles in the series leading up to this point, you should have a happily functioning cluster. Now we’ll look at some of the common problems and concerns you’ll face.
Fortunately, Microsoft Failover Clusters are very solid, mature technology and you won’t need to do a lot with them. Still, there are mistakes to avoid and issues to be aware of.
Care and Feeding of your Cluster
One of the benefits of having a cluster is LiveMigration. Using this technology, you can empty a cluster node of all virtual machines. That gives you the ability to perform maintenance on that node without interrupting service on the virtual machines. While convenient, this is not without risks. Unless your cluster has sufficient capacity to handle the loss of two nodes, taking one down for maintenance means that you have no failover capability until it returns.
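The capacity question above can be checked with simple arithmetic before you drain a node. The following is a minimal sketch, not a sizing tool: the host names, memory figures, and the function name are hypothetical, and it only compares total memory, ignoring CPU, storage, and whether individual virtual machines actually fit on individual hosts.

```python
def survives_node_loss(node_capacity_gb, vm_demand_gb, nodes_lost):
    """Return True if the surviving nodes have enough total memory
    for every virtual machine, assuming the largest nodes are lost."""
    capacities = sorted(node_capacity_gb.values())
    # Worst case: the nodes with the most memory are the ones that go away.
    keep = max(len(capacities) - nodes_lost, 0)
    surviving = capacities[:keep]
    return sum(surviving) >= sum(vm_demand_gb.values())


# Hypothetical three-node cluster; all figures are illustrative only.
nodes = {"hv1": 64, "hv2": 64, "hv3": 64}
vms = {"sql01": 32, "web01": 16, "web02": 16, "file01": 36}

# One node drained for maintenance: the other two still hold everything.
print(survives_node_loss(nodes, vms, 1))
# One node drained and a second one fails: not enough memory remains.
print(survives_node_loss(nodes, vms, 2))
```

If the second check fails for your own numbers, any maintenance window on one node is also a window with no failover protection.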
The desired configuration of any cluster is to have all hardware and software be identical. That includes BIOS, driver, hypervisor, and parent partition patch levels. This is not required, however. If you don’t have sufficient resources to properly test patches before deployment, then it is permissible to run one host out of step with the others for a short time before bringing the remaining hosts up to the same level. Whatever approach you take, make sure that you have a rollback plan. A sensible approach is to apply only critical security updates within the same week of release and to let the others “age” so that bugs can be worked out of them.
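If you keep even a simple inventory of patch levels per host, spotting a host that has been left out of step for too long is easy to automate. The sketch below assumes a hand-maintained inventory; the host names, component names, version strings, and the function name are all hypothetical, and how you collect the data is up to your own tooling.

```python
def find_patch_drift(inventory):
    """Return {component: {host: version}} for every component
    whose version is not identical across all hosts."""
    components = set()
    for versions in inventory.values():
        components.update(versions)

    drift = {}
    for component in sorted(components):
        # A host that lacks the component entirely shows up as None.
        per_host = {host: versions.get(component)
                    for host, versions in inventory.items()}
        if len(set(per_host.values())) > 1:
            drift[component] = per_host
    return drift


# Hypothetical inventory; the version strings are made up for illustration.
inventory = {
    "hv1": {"bios": "2.1", "nic_driver": "11.5", "hypervisor": "A"},
    "hv2": {"bios": "2.1", "nic_driver": "11.7", "hypervisor": "A"},
}

for component, per_host in find_patch_drift(inventory).items():
    print(component, per_host)
```

An empty result means every host reports the same level for every tracked component.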
Between update cycles, the cluster will generally care for itself. You can quickly check on it by accessing Failover Cluster Manager, which has a “Cluster Events” node. Of course, you can also use any event monitoring software that you might have available to you.
The easiest way to eliminate configuration errors is to use Failover Cluster Manager to validate your cluster prior to final deployment and after any configuration changes.
The most common way to misconfigure the cluster involves the networking setup. Refer to parts 2 and 3 of this series for guidance. The different cluster-related networks (management, cluster communications, LiveMigration, and Cluster Shared Volumes) must each have their own subnet, and each host must have one NIC in each of them. If these aren’t configured correctly to begin with, cluster validation will fail. Virtual networks are another story. Because they do not get IPs of their own, they cannot become cluster-controlled resources. This also means that the cluster validator will not catch any problems with them. Hyper-V tracks virtual networks by name. When a virtual machine is moved from one host to another, the destination Hyper-V installation will try to connect the virtual machine’s virtual network cards to a virtual network with a particular name. If that network name doesn’t exist on that host, the virtual machine will be disconnected from the network. System Center Virtual Machine Manager will catch this problem and warn you about it when you attempt a migration.
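Because the validator does not cover virtual networks, a small consistency check of your own can stand in for it. The sketch below assumes you have already gathered the virtual network names from each host by whatever means you prefer; the host names, network names, and function name are hypothetical. It compares names as exact strings, which is an assumption on the strict side, so a difference in spelling or spacing is reported as a missing network.

```python
def missing_virtual_networks(host_networks):
    """Return {host: [names]} listing the virtual network names that
    exist on some other host but not on this one."""
    all_names = set().union(*host_networks.values())
    missing = {}
    for host, names in host_networks.items():
        absent = all_names - set(names)
        if absent:
            missing[host] = sorted(absent)
    return missing


# Hypothetical cluster: hv2 was built without the "DMZ" virtual network.
host_networks = {
    "hv1": ["Production", "DMZ"],
    "hv2": ["Production"],
    "hv3": ["Production", "DMZ"],
}

for host, names in missing_virtual_networks(host_networks).items():
    print(host, "is missing:", ", ".join(names))
```

Any host that appears in the result is one where a migrated virtual machine using those networks would arrive disconnected.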
In general, anything that goes wrong in your cluster will be hardware-related. Component failure is quite rare, so driver issues are the most likely. The preferred approach to driver handling is to only update when a release solves a specific issue that you are experiencing.
One issue that can be difficult to troubleshoot arises when a host loses its connection to a LUN that contains a CSV. CSVs work by using SCSI-3 persistent reservations, but other factors may cause a disconnect. If all software and hardware components are functioning, this will never happen to a connection hosting a live virtual machine. It may happen on a connection that has been dormant for a long period of time. You are unlikely to know that there is a problem until you try to LiveMigrate a system on that LUN to the node whose connection has broken. It is also possible that a hypervisor-level backup will throw VSS errors as it encounters problems trying to set Redirected Access mode. One certain way to tell is that you won’t be able to use Failover Cluster Manager to transfer ownership of the CSV to that node. Because this shouldn’t happen at all, most systems have no traps for it. Therefore, there’s no guaranteed way to watch for it, detect it, or predict it. The most likely place you’ll find any indication is in your shared storage device’s logs, if it tracks disconnects. A reboot of the affected host will generally clear it up.
For Further Reading
This concludes our series on Hyper-V R2 clusters. Feel free to use the comments section for questions.