Windows Server Failover Clustering (ClusDb) Backup and Recovery


Understanding the Server Clustering Database (ClusDB)
How to Back Up a Server Clustering Database
How to Restore a Server Clustering Database

Today, businesses must offer their services twenty-four hours a day to remain competitive, which means that their applications need to be highly available. If an organization runs its workloads in a public cloud, its services usually stay online because the cloud provider manages the highly available infrastructure. Enterprises that run their own servers must provide high availability (HA) for their applications and virtual machines (VMs) themselves. This is usually done by grouping physical servers (nodes) together, a practice called server clustering. The clustering service monitors the health of each node and automatically restarts or moves workloads within the group of servers to keep them running.

Failover Clustering is the built-in HA solution for Windows Server and Hyper-V. To create this distributed system, administrators need to deploy and manage networks, shared storage, servers, operating systems, virtual machines, and applications (see Altaro’s How to set up and manage a Hyper-V Failover Cluster Step by Step).

Since clusters are business-critical and fairly complex, it is important to back up not only the applications and VMs but also the configuration of the cluster so that it can be quickly redeployed in the event of a disaster. This blog post will review the best practices for a failover cluster configuration database backup and recovery, which I learned from spending four years designing clusters while on the product team at Microsoft.

If you’re simply looking for further information on doing backups, guidance on backing up data, operating systems, virtual machines, and applications is available from Altaro.

Understanding the Server Clustering Database (ClusDB)

Although a cluster is a collection of distributed services, it needs to function as a unified system. It needs to understand the state and properties of every workload on every node, such as which host is managing a particular VM and whether that VM is online. To accomplish this, Windows Server Failover Clustering keeps a database containing this information on every node, known as ClusDB. This database is stored in each node's registry to remove other dependencies and ensure that its operating system access is prioritized. ClusDB is updated whenever there is a change to any component that the cluster manages, for example when a new service is deployed, a property changes, or an application fails over to another node.

A key trait of this database is that it must be identical on every host so that each node has a consistent view of the state of every clustered object. This ensures that each workload has a single owner even when nodes cannot communicate with each other.

This is important because if two or more hosts in a SQL Server cluster started writing to a single database at the same time in an uncoordinated fashion, the data could be corrupted. The cluster database is critical to ensuring that every service operates correctly, so it must be synchronized across every cluster node. Any time a workload is added or removed, brought online or taken offline, or any of its dozens of properties changes, the cluster immediately updates ClusDB on every node.
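To make the "identical on every node" requirement concrete, here is a minimal Python sketch. It is a toy model only: the cluster service handles this synchronization itself, and the node names, workload names, and dictionary layout below are hypothetical. The sketch fingerprints each node's copy of a configuration and reports any node whose copy differs from the majority.

```python
import hashlib
import json


def fingerprint(config: dict) -> str:
    """Return a stable SHA-256 digest of one node's configuration copy."""
    # sort_keys makes the serialization independent of key insertion order.
    canonical = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def out_of_sync_nodes(copies: dict) -> list:
    """List the nodes whose copy differs from the most common copy."""
    digests = {node: fingerprint(cfg) for node, cfg in copies.items()}
    values = list(digests.values())
    majority = max(set(values), key=values.count)
    return sorted(node for node, d in digests.items() if d != majority)


# Hypothetical three-node cluster; names and keys are illustrative only.
copies = {
    "node1": {"vm-sql": {"owner": "node1", "state": "Online"}},
    "node2": {"vm-sql": {"owner": "node1", "state": "Online"}},
    "node3": {"vm-sql": {"owner": "node3", "state": "Online"}},  # stale owner
}
print(out_of_sync_nodes(copies))
```

In this model, node3 believes it owns the workload while the other two nodes agree that node1 does, which is exactly the kind of split view that a synchronized database is meant to prevent.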

How to Back Up a Server Clustering Database

It is a best practice to regularly back up the cluster's database, in addition to the host and application data for each clustered workload or VM. Take a backup both before and after changing the cluster's configuration, applications, or properties.

Since clusters operate differently from standalone servers, it is important to ensure that your backup provider is "cluster-aware" so that it follows the proper steps. This includes validating that the cluster is active, healthy, and can offer a complete and current copy of the cluster database. The backup provider should also identify the best node from which to create the backup, so that service availability is maintained and disruption to other workloads is minimized. The built-in Windows Server Backup is cluster-aware, along with Altaro's offerings.

Once the backup provider has determined the optimal cluster node, it will call the Volume Shadow Copy Service (VSS), which is the built-in backup framework for Windows Server. The Cluster Service VSS Writer then performs a series of tasks to ensure that the ClusDB backup is complete and consistent. Since this database is stored in the registry, VSS takes extra steps while creating the backup so that the data can be injected back into the registry when an image is restored.
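If you use the built-in Windows Server Backup, one way to trigger a backup from a script is the wbadmin command-line tool. The sketch below only assembles a system state backup command and prints it. Two assumptions are made here: the target volume E: is a placeholder, and a system state backup taken on a cluster node is expected to include the cluster database through the Cluster Service VSS Writer, which you should verify in your own environment.

```python
def build_backup_command(target_volume: str) -> list:
    """Build a wbadmin system state backup command as an argument list."""
    # -quiet suppresses the interactive confirmation prompt.
    return [
        "wbadmin",
        "start",
        "systemstatebackup",
        f"-backupTarget:{target_volume}",
        "-quiet",
    ]


# "E:" is a placeholder for a dedicated backup volume.
command = build_backup_command("E:")
print(" ".join(command))
```

On a cluster node, the resulting list could be passed to subprocess.run from an elevated session. The sketch stops at printing so that it is safe to run anywhere.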

How to Restore a Server Clustering Database

If you are just repairing a single node within a cluster, and the rest of the cluster is still operational, then you do not actually need to restore the cluster database. Simply repair the operating system on the faulty node, restart it, and make sure that it rejoins the cluster. The node will then synchronize with the rest of the cluster, receive a current version of ClusDB, and load it into its own registry.

If you need to restore the entire cluster, first ensure that all the host operating systems are repaired and functioning correctly. Next, you must stop the cluster service on every node, which means that all of your clustered workloads will incur downtime. Using your cluster-aware backup provider, you then restore ClusDB by forcing the cluster to use the older, backed-up version of the database through an "authoritative restore." The node that was selected for the authoritative restore receives the restored ClusDB and injects it into its registry, and the cluster service starts on that node. The other nodes then come online and synchronize their cluster databases with this restored version. Finally, the clustered workloads apply the states and properties defined in ClusDB, which usually means they come online. After the recovery, validate that all the nodes and services are operational and accessible to your customers.
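The two recovery paths described above can be sketched as a small decision function. This is an illustration of the ordering only, not a tool that touches a real cluster: the node names are hypothetical, and "the rest of the cluster is still operational" is simplified to "at least one node is healthy", whereas a real cluster also has to maintain quorum.

```python
def restore_plan(node_ok: dict, restore_node: str = "") -> list:
    """Return ordered recovery steps for a map of node name -> healthy flag."""
    failed = sorted(name for name, ok in node_ok.items() if not ok)
    if not failed:
        return ["nothing to restore"]

    if len(failed) < len(node_ok):
        # Part of the cluster is still running: repair and rejoin, no restore.
        steps = []
        for name in failed:
            steps.append(f"repair OS on {name}")
            steps.append(f"restart {name}")
            steps.append(f"confirm {name} rejoins and syncs ClusDB")
        return steps

    # Every node is down: authoritative restore on one chosen node.
    target = restore_node or failed[0]
    others = [name for name in failed if name != target]
    steps = [f"repair OS on {name}" for name in failed]
    steps += [f"stop cluster service on {name}" for name in failed]
    steps.append(f"authoritative restore of ClusDB on {target}")
    steps.append(f"start cluster service on {target}")
    steps += [f"start cluster service on {name}" for name in others]
    steps.append("validate nodes and workloads")
    return steps


# Hypothetical two-node cluster with one failed node, then with both failed.
for step in restore_plan({"node1": True, "node2": False}):
    print(step)
for step in restore_plan({"node1": False, "node2": False}):
    print(step)
```

The key difference between the two paths is that only the full-cluster case stops the cluster service everywhere and rolls the configuration back to the backed-up version.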

Here is a quick workflow that lays out the two options. Sometimes it’s better to see it than just read about it:

Backing up and restoring the cluster database is not too challenging if the backup provider is cluster-aware, even though the underlying process of server clustering has some complexities. Keep in mind that you are only saving the state and properties of the cluster itself. The data for the operating systems, applications, and VMs must also be backed up via some other method, such as Altaro VM Backup.
