It’s simple to add and remove nodes in a Microsoft Failover Cluster. That ease can hide a number of problems that can spell doom for your virtual machines, though. I’ve put together a quick checklist to work through if your last node addition or replacement didn’t go as smoothly as expected.

1. Did You Validate Your Cluster?

Cluster validation can be annoying and it can suck up time that you don’t feel that you have to spare, but it must be done.

Cluster validation runs far more tests in far less time than you could ever hope to do on your own. It’s not perfect by any means, and it can seem obnoxiously overbearing in some regards, but it can also point out issues before they break your cluster.

Cluster validation can be performed while the cluster is in production, with one exception: storage. None of the storage tests can be safely run against online virtual machines unless they’re on SMB storage. If you can move them to a suitable file share, do so. If you can’t, then schedule a time when you can save all of the virtual machines for a few minutes. If you can’t do that either, then run validation without the storage tests. It’s better to have a partial test than none at all.
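Validation can also be started from PowerShell with Test-Cluster. The sketch below shows a full run and a run that skips the storage category; the node names are hypothetical placeholders, and it assumes the FailoverClusters module is installed on the machine where you run it.

```powershell
# Hypothetical node names; replace with your own.
$nodes = 'HV-NODE1', 'HV-NODE2'

# Full validation, including storage. Only appropriate when no online
# virtual machines sit on the cluster disks being tested.
Test-Cluster -Node $nodes

# Partial validation that skips the entire storage category.
Test-Cluster -Node $nodes -Ignore 'Storage'
```

Each run writes a validation report that you can review afterward.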

2. Do You Need to Clear the Node?

Sometimes, you can’t even re-add a node. You get a message that the computer is already in a cluster, and the wizard blocks you from proceeding! From any node that’s still in the cluster, run Clear-ClusterNode against the blocked node from an elevated PowerShell prompt.
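A minimal sketch of that cleanup, assuming the blocked node is named HV-NODE2 (a hypothetical name):

```powershell
# 'HV-NODE2' is a hypothetical name for the node that refuses to join.
# -Force suppresses the confirmation prompt.
Clear-ClusterNode -Name 'HV-NODE2' -Force
```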

I’ve had to take that step every time, regardless of what else I did in advance.

3. Did You Fix DNS?

Hyper-V cluster nodes typically use at least two IP addresses: Management and Live Migration. You might well be using at least one other for cluster communications. If you’re connected via iSCSI, there will be at least one more IP address there. Many of those IPs may reside on isolated IP networks that don’t use a router. That makes those IPs unreachable from other IP networks. If those IPs are being registered in DNS, then it’s only a matter of time before they cause problems.

I typically design a custom script to assign IPs to a host and ensure that it only registers the management address in DNS. You can make those changes manually, if you prefer. Just remember to get it done.
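The DNS registration part can be sketched with the built-in DnsClient cmdlets. This is not the custom script mentioned above, only an illustration; it assumes the management adapter is named "Management", which is a hypothetical alias.

```powershell
# 'Management' is a hypothetical adapter alias; use the name shown by Get-NetAdapter.
$managementAlias = 'Management'

# Show which adapters currently register their addresses in DNS.
Get-DnsClient | Select-Object InterfaceAlias, RegisterThisConnectionsAddress

# Turn off registration on every adapter except the management one.
Get-DnsClient |
    Where-Object { $_.InterfaceAlias -ne $managementAlias } |
    Set-DnsClient -RegisterThisConnectionsAddress $false
```

Records that were already registered stay in the DNS zone until they are removed or scavenged, so check the zone afterward.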

4. Did You Match All Windows Roles and Features?

On a recent rebuild, I moved quickly due to a great deal of urgency. All seemed well, but then I tested my first Live Migration to the rebuilt node, and it crashed the VM! The error code was: 0x80070780 (The file cannot be accessed by the system). It didn’t say which file (because, you know, why would that be useful information?), so I began by verifying that all of the VM’s files were in the same highly available location.

I’ll spare you the details of my fairly frantic searching, but it turned out that I had neglected to update my deployment script and had missed one very critical component: this cluster hosted virtual desktops, so each node ran the Data Deduplication role… except the one that I had newly rebuilt. I quickly whipped out a role/feature deployment script that I keep on hand, and all was well.
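One way to catch a gap like this before the first Live Migration is to compare the installed roles and features of the rebuilt node against a known-good node. The following is a sketch rather than the deployment script described above, and the node names are hypothetical.

```powershell
# Hypothetical node names: a known-good node and the rebuilt one.
$reference = 'HV-NODE1'
$rebuilt   = 'HV-NODE2'

# Names of everything installed on each node.
$wanted = (Get-WindowsFeature -ComputerName $reference | Where-Object Installed).Name
$have   = (Get-WindowsFeature -ComputerName $rebuilt   | Where-Object Installed).Name

# Features present on the reference node but missing from the rebuilt node.
$missing = $wanted | Where-Object { $_ -notin $have }
$missing

# Install whatever is missing (for example, FS-Data-Deduplication).
if ($missing) {
    Install-WindowsFeature -Name $missing -ComputerName $rebuilt
}
```

Some features need a restart to finish installing; this sketch does not restart the node for you.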

5. Do You Need to Fix Permissions?

Always be on the lookout for the lovely 0x80070005 — otherwise known as “access denied”. When you rebuild a cluster node using the same name, it should slide right back into Active Directory without any fuss. Deleting the Active Directory object before re-adding the node doesn’t really help things, so I’d avoid that. Either way, you might need to rebuild permissions. I would pay special attention to delegation. I wouldn’t spend a great deal of time guessing at it. If you think delegation might be an issue, then apply the fix and test.
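If you suspect delegation, a read-only comparison of the rebuilt node against a working node can confirm it before you change anything. This sketch assumes that Kerberos constrained delegation is in use, that the ActiveDirectory module is available, and that the node names (hypothetical here) match the computer accounts.

```powershell
# Requires the ActiveDirectory module (RSAT). Node names are hypothetical.
Import-Module ActiveDirectory

foreach ($node in 'HV-NODE1', 'HV-NODE2') {
    $computer = Get-ADComputer -Identity $node -Properties 'msDS-AllowedToDelegateTo'
    "--- $node ---"
    # Lists the services this node may delegate to (constrained delegation).
    $computer.'msDS-AllowedToDelegateTo' | Sort-Object
}
```

If the rebuilt node’s list is empty or shorter than the working node’s, re-applying delegation is a reasonable next step.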

Usually, you do not need to re-apply file level permissions after a node add/rebuild. If you feel that it’s necessary, I would work at the containing folder level as much as possible. It can be maddening trying to set ACLs on individual virtual machine locations.

6. Are You Having an SPN Issue?

Look in the Event Viewer on other nodes for event ID 4 from Security-Kerberos regarding failures around Kerberos tickets and SPNs (service principal names). This can happen whether or not you deleted the Active Directory object beforehand, although it seems to sort itself out more easily when you re-use the existing object.
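The same search can be done from PowerShell instead of Event Viewer. The sketch below pulls recent Security-Kerberos event ID 4 entries from the other nodes and then lists the SPNs on the rebuilt node’s computer account; all node names are hypothetical.

```powershell
# Hypothetical node names.
$otherNodes  = 'HV-NODE1', 'HV-NODE3'
$rebuiltNode = 'HV-NODE2'

# Security-Kerberos event ID 4 is written to the System log.
$filter = @{
    LogName      = 'System'
    ProviderName = 'Microsoft-Windows-Security-Kerberos'
    Id           = 4
}

foreach ($node in $otherNodes) {
    # -ErrorAction SilentlyContinue hides the error raised when no events match.
    Get-WinEvent -ComputerName $node -FilterHashtable $filter -MaxEvents 5 -ErrorAction SilentlyContinue |
        Select-Object MachineName, TimeCreated, Message
}

# List the SPNs registered on the rebuilt node's computer account,
# then search the domain for duplicate SPNs.
setspn -L $rebuiltNode
setspn -X
```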

If you continue having trouble with this message, you’ll find many references and fix suggestions by searching on the event ID and text. Everywhere that I went, I saw different answers. No one seemed to have gathered a nice little list of things to try.

7. Did You Set Your PowerShell Execution and Remoting Policies?

I have a long list of “issues” that I solve by group policy. If you’re not doing that, then you could miss a number of small things. For instance, if you have built up a decent repertoire of PowerShell scripts to handle automation, you might suddenly find that they don’t work after a node replacement. Setting the execution policy and enabling remoting from an elevated PowerShell prompt should help.
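A minimal sketch of those two settings, assuming you want the “RemoteSigned” policy and that no group policy already enforces something different:

```powershell
# Allow local scripts; require a signature on scripts downloaded from elsewhere.
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Force

# Turn on WinRM-based remoting and its firewall rules.
Enable-PSRemoting -Force

# Confirm the effective policy at each scope.
Get-ExecutionPolicy -List
```

A policy set through group policy takes precedence over one set this way, so the last command is worth checking.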

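Of course, the battle over what constitutes the “best” PowerShell execution policy continues to rage on, and will likely do so for as long as people like to argue. “RemoteSigned” has served me well. Use what you like. Just remember that “AllSigned” is not the default: Windows Server 2012 R2 and later default to “RemoteSigned”, while older servers and client editions default to “Restricted”, which blocks scripts entirely.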

8. Do You Just Need To Rebuild the Cluster?

I have never once needed to completely destroy and rebuild a cluster outside of my test lab. I wouldn’t take such a “nuclear” option off of the table, however. Each cluster node maintains its own small database about the cluster configuration. If you’re using a quorum mode that includes disk storage, a copy of the database exists there as well. As with any other database, I’m forced to accept the possibility that it could become damaged beyond repair. If that happens, a complete rebuild might be your best bet.

Try all other options first.

If you must destroy a cluster to rebuild it, remember this:

  • The contents of CSVs and cluster disks will not be altered, but you won’t be able to keep them online.
  • If a cluster’s virtual machines are kept on SMB shares, they can remain online during the rebuild through careful adding and removing. You can add and remove HA features to/from a virtual machine without affecting its running state.
  • You must run the Clear-ClusterNode command against each node.
  • Delete the HKLM\Cluster key before re-adding a node.
  • Format any witness disks before re-adding to a cluster.
  • Delete any cluster-related data from witness shares before re-adding to a cluster.
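The per-node cleanup in the list above can be sketched as follows. It assumes the cluster has already been destroyed, that PowerShell remoting is enabled on each node, and that the node names (hypothetical here) are correct. It is destructive, so try it in a lab first.

```powershell
# Hypothetical list of former cluster nodes.
$nodes = 'HV-NODE1', 'HV-NODE2'

foreach ($node in $nodes) {
    # Strip leftover cluster configuration from the node.
    Clear-ClusterNode -Name $node -Force

    # Remove the HKLM\Cluster key if it survived the cleanup.
    Invoke-Command -ComputerName $node -ScriptBlock {
        if (Test-Path -Path 'HKLM:\Cluster') {
            Remove-Item -Path 'HKLM:\Cluster' -Recurse -Force
        }
    }
}
```

Witness disks and witness shares still need to be cleaned up separately, as the list describes.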

Fortunately, Microsoft made their clustering technology simple enough that such drastic measures should never be necessary.