How to Ensure Your Disaster Recovery Strategy Will Actually Work When You Need It to


Systems fail. That's an unfortunate fact of computing. If things were otherwise, this book would be much shorter. However, what happens when the systems that we build as a failsafe against failure suffer failures of their own?

Just as you watch over your production systems, you must also put forth the effort to ensure that you always have at least one complete, undamaged, and retrievable copy of your data. You cannot stow away your backup data and hope for the best. 

I took a phone call from a customer whose sole line-of-business application server lost both of its mirrored drives before they knew that they had a problem. 

Because this business closed at 5 PM and didn't re-open until 8 AM, we had set them up to perform full daily backups that automatically ejected the tape upon completion. While trying to help them, I learned that they had hired an onsite administrator for a short time.

During his stay, he switched them over to weekly full backups with daily incremental jobs. He also disabled the automatic eject feature.

When he left, he neglected to train anyone on the backup system. No one knew that they were supposed to change the backup tapes daily. Every night for over a year, their backups overwrote the previous night’s data, usually only with a tiny subset of changes. 

So, when they needed it most, the backup data was not there. 

To safeguard yourself against problems such as these (and the attendant horror stories), build testing schedules into your disaster recovery plan. Assign staff to perform those tests. At the scheduled meetings to update the disaster recovery documentation, require that testers provide a synopsis of their test activities.

This gives your organization external accountability for the backup process. You will have a chance to discover that no one has performed a backup without needing an emergency to reveal it.

Testing Backup Data with Restore Operations

You have a very straightforward way to uncover problems in backup: try to restore something. Most modern backup software has some built-in way to help. 

Exact steps depend upon your software. Follow these guidelines:

  • Redirect to an alternative, non-production, “sandbox” location. If your backup somehow has corrupted data, you don’t want to find out with a “test” overwrite of valid production data. If you’re ensuring that you can retrieve a virtual machine, you don’t want it to collide with the “real” system. 
  • Test restoring multiple types of data. Bring back individual files, entire SQL databases, domain controllers, virtual machines, and any other type of logical unit that you rely on. 
  • Rotate the items that you check at each testing interval. 
  • Test from more than one repository. 
  • Verify the restored information by accessing it yourself. Do not interpret a successful restore operation as proof that the data survived (a minimal verification sketch follows this list).
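To illustrate that last point, here is a minimal sketch that walks a restored sandbox folder and flags files that came back empty or unreadable. The D:\RestoreTest path is an assumption; substitute wherever your backup product places redirected restores.

# Hypothetical sandbox location for redirected test restores.
$sandbox = 'D:\RestoreTest'

Get-ChildItem -Path $sandbox -Recurse -File | ForEach-Object {
    if ($_.Length -eq 0) {
        Write-Warning "Zero-byte file in restore: $($_.FullName)"
        return
    }
    try {
        # Open read-only to prove the restored copy can at least be read back.
        $stream = [System.IO.File]::OpenRead($_.FullName)
        $stream.Dispose()
    }
    catch {
        Write-Warning "Unreadable restored file: $($_.FullName)"
    }
}

A check like this only proves readability; you still need to open representative files and confirm that their contents make sense.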

The major problem with this type of testing is its scope. You will always work with a sample of the backed-up data, not the entire set. You will likely be instructed to test the most important business components at every step. Make sure to test representative items from lower priority systems as well. 

Data corruption is sneaky. Unless your equipment suffered a major failure or someone accidentally degaussed the wrong drive pack, the odds are that you will never uncover any degradation or errors.  

Take heart; since you probably won’t find any corruption, you probably won’t ever need to restore anything that happened to become corrupted. However, do not take this condition as a reason to skip test restores.

We insist on multiple full copies of backup data as the primary way to protect against small-scale corruption. Unless the production data is corrupted, there is almost no chance that two distinct copies will have problems in the same place. 

The purpose of a test restore is not to try to find these minor errors. We are looking for big problems. A routine test would have caught the problem in the anecdote that opened this section.  

If someone accidentally (or maliciously) unchecked the option to back up your customer database, you will notice when you attempt a test restore. If a backup drive has a mechanical failure, you will either get nothing or blatantly corrupt data from it. 

Use manual test restores to spot-check for corruption, verify that your backups cover the data that you need, and confirm that your media contains the information that you expect.

To shield against the ever-present threat of ransomware, only use operations that can read from the backup (not write to it) and work within an isolated sandbox. These types of tests are the only true way to verify the validity of offline media.

Testing Backup Data with Automated Operations 

Manual tests leave you with the problem of minor data corruption. Backup data sets have only increased in size through the years, compounding the problem. With the fortuitous gradual deprecation of tape, backup application vendors have seized on opportunities to add health-check routines to their software. 

They can scan through data to ensure that the bit signatures in storage match the bit signatures that they initially recorded.

Features like these call out the importance of a specific distinction in the usage of the word “automation”. It certainly applies to a process the computer performs in order to remove the burden on a human. 

 It does not necessarily mean “happens automatically”. For that connotation, stick to the word “scheduling”. In this context, do not assume that any mention of automated testing in your backup program’s interface means that it will handle everything without your help. Some programs have that capability, but this will never be a “set it and forget it” activity.

End-to-end data validation is time-consuming and resource-intensive; otherwise, we would happily do it ourselves and not need the backup program's help. Some products also block other backup operations while validation runs. So, such processes need three things from you (a scheduling sketch follows the list):

  1. A specific start time 
  2. Sufficient time to complete 
  3. A human-led procedure for verifying and recording the results  
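As one way to cover the first two requirements, the sketch below registers a weekly scheduled task that starts a validation script during a quiet window. The script path, day, and start time are assumptions; point the action at whatever health-check or verification command your backup product provides.

# Hypothetical validation script and timing; adjust to your backup product.
$action  = New-ScheduledTaskAction -Execute 'powershell.exe' `
    -Argument '-NoProfile -File C:\Scripts\Invoke-BackupValidation.ps1'
$trigger = New-ScheduledTaskTrigger -Weekly -DaysOfWeek Saturday -At 10pm

Register-ScheduledTask -TaskName 'Backup Validation' -Action $action -Trigger $trigger `
    -Description 'Runs the backup health check; a human reviews the log and records the result.'

The third requirement remains yours: someone must read the job's output and record the outcome at the next disaster recovery review.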

In a few cases, especially at smaller organizations, you may have no strong reason to avoid simply scheduling the start of a validation job, because it will not meaningfully impact any overlapping backup jobs. If you have a slow backup system that needs multiple days to process everything, then automated validation is probably not feasible.

It is better to capture as many backups as possible and rely on manual spot-checking than to allow an automated verification process to disrupt jobs. At best, these automatic checks can add some peace of mind. But they will never replace manual work.

You also have the option to create custom checks. You can use scripts or software tools to scan through temporarily restored data, looking for problems or confirming that the expected data is present. You can potentially interface them with your backup software.

For instance, you can restore data to an alternative location and have the backup application create another copy near it. A comparison tool can then show where the data differs. Always keep ransomware top-of-mind: if you set up something like this, no process involved should have write access to the production data or to the backup location.
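A minimal sketch of such a comparison, assuming the restored data landed in D:\RestoreTest and the second copy in E:\BackupCopy (both placeholder, read-only locations): it hashes every file on each side and reports any relative path whose hash appears on only one side.

# Placeholder read-only locations; nothing here writes to production or backup storage.
$restored = 'D:\RestoreTest'
$copy     = 'E:\BackupCopy'

$left  = Get-ChildItem -Path $restored -Recurse -File | Get-FileHash -Algorithm SHA256 |
    Select-Object @{n='Rel';e={$_.Path.Substring($restored.Length)}}, Hash
$right = Get-ChildItem -Path $copy -Recurse -File | Get-FileHash -Algorithm SHA256 |
    Select-Object @{n='Rel';e={$_.Path.Substring($copy.Length)}}, Hash

# Anything reported here exists on only one side or has a different hash.
Compare-Object -ReferenceObject $left -DifferenceObject $right -Property Rel, Hash |
    Format-Table Rel, SideIndicator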

Systems administrators tend to be a clever, intelligent group. When we read guides like this, many of us think to ourselves, “I can script that!” That's great; don't let anything here discourage you. A virtualized domain controller that runs dcdiag on itself after a restore operation?

A SQL Server that runs through DBCC on restored databases? Your own system for creating and validating checksums on your most important file repository? Things like that are awesome!
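To show what such checks might look like, the sketch below runs dcdiag against a restored domain controller and DBCC CHECKDB against a restored database, keeping the output for your test log. The sandbox server names, the CustomerDB database, and the use of the SqlServer module's Invoke-Sqlcmd are all assumptions about your environment.

# Hypothetical sandbox systems on an isolated test network.
$restoredDc  = 'SANDBOX-DC1'
$restoredSql = 'SANDBOX-SQL1'

# Health-check the restored domain controller and keep the report.
dcdiag /s:$restoredDc /v | Out-File -FilePath 'C:\TestLogs\dcdiag-restore.txt'

# Integrity-check the restored database (requires the SqlServer module).
Invoke-Sqlcmd -ServerInstance $restoredSql -Query "DBCC CHECKDB('CustomerDB') WITH NO_INFOMSGS;" |
    Out-File -FilePath 'C:\TestLogs\dbcc-restore.txt'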

You can never have too many helping checks. However, you can never rely solely on them, either. In the event of any kind of failure that backup does not recover, management will ask, “Did you verify that yourself?” 

They will not recognize the value of your scripts. Your skills will not impress them. A solid track record of twenty years without a failure will not make any difference. Worse, if a data loss exposes your company to litigation, judges, attorneys, and jurors will care even less.

You must employ the sort of manual processes that non-technical people understand. An answer of, “That particular data set rotates to human validation every three months and the disaster hit at just the wrong time,” would help to pull attention away from you. 

At the same time, doing the work to prepare against such conditions properly can help to ensure that you never have to face them.

Remember that automated routines can only supplement manual, personal operations. They will never stand on their own. 

Geographically Distributed Clusters

Due to the logistics involved, few organizations will utilize geographically distributed clusters (sometimes called stretched clusters). Combined with synchronously replicated storage and a very high-speed site interconnect, they offer a high degree of automated protection.

Properly configuring one requires many architectural decisions and an intimate understanding of the necessary hardware and software components. This book will not dive that far into the topic. 

The basic concepts of a geographically distributed cluster:

  • These clusters are built specifically for business continuity. They are not an efficient solution for making resources available in two sites at once. 
  • Geographically stretched clusters must use synchronously replicated storage for effectiveness 
  • Administrators often set resources so that they operate only in one location except in the event of a failover. Individual resources are configured to run in the location closest to their users, if possible. 
  • Each location should be able to run all cluster resources 
  • Each location should have sufficient capacity to allow for local failures. As an example, if the resources on your cluster require five nodes to operate and you want N+1 protection, then each site requires six nodes. 
  • Resources must be prioritized so that, if the cluster does not have enough nodes, the most important resources remain online (a small PowerShell sketch follows this list)
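A minimal sketch of the placement and prioritization ideas above, using the FailoverClusters PowerShell module; the cluster group and node names are placeholders for whatever your cluster actually hosts.

# Prefer the nodes at the site closest to this resource group's users.
Set-ClusterOwnerNode -Group 'SQL-Finance' -Owners 'SiteA-Node1','SiteA-Node2'

# Raise the priority so this group comes online first when node capacity runs short.
(Get-ClusterGroup -Name 'SQL-Finance').Priority = 3000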

If you have created such a cluster, you must periodically test it to ensure that it can meet your criteria. Because the sudden loss of an inter-site link or storage device will almost certainly trigger a resource crash, it would be best to perform these tests with only non-production resources. 

The easiest way to accomplish this goal is to schedule downtime for the entire system, take the production resources offline gracefully, and do all of your work with test resources. 

If your protected resources do not allow that much downtime, then you can use cross-cluster migration tools to evacuate the resources to other clusters during the test.

In many cases, you will not have any good options available. Some alternatives:

  • Use test systems with the same fundamental configuration as your production systems and test with those 
  • Remove a node or two from each site, create a secondary cluster from them, perform your testing, then rejoin the nodes to the production cluster

These alternatives have problems and risks. Test systems let you know how a site failure would theoretically work, but they do not prove that your production cluster will survive.

Individual nodes could have undiscovered problems, recently added resources might take you slightly over your minimum required node count, and the cluster configuration may not function as expected in a site failure.

A common problem that's not immediately obvious without testing is that your cluster configuration might take everything offline in all sites because it can't establish a quorum. Worse, it might keep disconnected sites online simultaneously, running on storage units that can no longer synchronize, leading to a split-brain situation. 

You will catch those problems in pre-production testing, but changing conditions can affect them (adding nodes, unplanned outages in multiple sites, etc.).  

Testing geographically distributed clusters 

When establishing the tests to run, start with probable events. Look specifically at the resources that the cluster operates and poke at their weaknesses. A few ideas:

  • Take a storage location offline without notifying the cluster 
  • Unplug network cables 
  • Disable the LAN for the cluster nodes in one site (a minimal sketch follows this list) 
  • Reboot the device that connects the sites. Assuming redundant links, try them separately, then all in one site at the same time. 
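As one way to exercise that third idea, the sketch below disables a cluster-facing network adapter on every node at one site. The node names and adapter name are assumptions, it presumes that management traffic rides a separate adapter, and it belongs only inside a planned test window.

# Hypothetical node names for the site being "failed"; run only during a planned test.
$siteBNodes = 'SiteB-Node1','SiteB-Node2'

Invoke-Command -ComputerName $siteBNodes -ScriptBlock {
    # Assumed adapter name; this cuts the node off from the cluster network.
    Disable-NetAdapter -Name 'Cluster' -Confirm:$false
}

Re-enable the adapters with Enable-NetAdapter when the test concludes, then confirm that the cluster recovered the way your documentation says it should.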

 

Use your imagination. Also, don’t forget to perform the same sorts of tests that you would for a single site cluster (node removal, etc.). 

Coping with the challenges of geographically distributed clusters

In reality, most organizations cannot adequately test production clusters of any type. Do not use that as an excuse to do nothing. You always have some things to try. For instance, if you skip the storage tests, you can perform validation on a Microsoft failover cluster almost any time without impact. Research the clustering technologies that you use. Look to user forums or support groups for ideas. 
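For example, the built-in validation tests for a Microsoft failover cluster can run from PowerShell, and skipping the storage category keeps them non-disruptive. The cluster name below is a placeholder.

# Validate the cluster while skipping the disruptive storage tests.
Test-Cluster -Cluster 'GeoCluster01' -Ignore 'Storage'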

Make sure not to over-promise on the capabilities of geographically distributed clusters. Take time to understand how to deal with conditions such as the aforementioned quorum outage. 

Above all else, take special care to understand how your storage and clustering technologies react to major failures. Do not rely on past experiences or general knowledge. Use strict change tracking and review the build at each disaster recovery update cycle. 

Backup will always be your best protection against anything that threatens your cluster. Make certain that clusters have adequate and operational backup.

Testing replication

Replication technologies are built specifically to deal with failovers, so they are not as difficult to test as geo-clusters. 

Testing almost always involves downtime, but usually has a manageable impact. Unlike geographically distributed clustering, testing failover of a resource that shares a common platform with others usually tells you enough that you don’t have to test everything.

When you first build a replication system, work through a complete failover scenario. This exercise helps you and your staff more than the technology does on its own. Replication and failover do not always work the way that administrators assume. 

If you see the entire procedure in action, then you will have a much better understanding of what would happen to you in a real-world situation. 

Document anything that seemed surprising. If you find a blocking condition, shift the parameters. Continue testing until failover works as seamlessly as possible before going into production.

Many times, small organizations set up Hyper-V Replica without a complete understanding of the technology. They follow the instructions and the prompts, and everything appears to work perfectly. Then, they try to fail over to their secondary site, and nothing works. 

On investigating these problems, we discovered that many of them were replicating a domain controller virtual machine and had no other DCs online at the secondary site during the failover. When the domain controller went offline, the secondary site could no longer authenticate anything, including the Hyper-V Replica operations. 

This is why the earlier section on configuring replica called out the importance of using application-specific replication where possible. It also points out the importance of testing; sites that tried to fail over their replicas as a test fared much better than sites that didn’t try until catastrophe struck. 

For a clear example, consider virtual machines protected by Hyper-V Replica. If you have a test virtual machine that spans the same hosts and same storage locations as your production virtual machines, then start with it. 

Provided that all conditions match, it will give you a good idea of what would happen to the production virtual machines. If you have any low-priority production virtual machines, or some that you can take offline for a while, test with those next. 
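A minimal sketch of that first pass, assuming Hyper-V Replica and a test virtual machine named TestVM: on the replica host, Start-VMFailover with the -AsTest switch brings up an isolated copy without disturbing replication, and Stop-VMFailover removes it when you finish.

# On the replica host: create a test copy of the VM from its replica data.
Start-VMFailover -VMName 'TestVM' -AsTest

# ...verify that the test copy boots and behaves as expected, then clean up.
Stop-VMFailover -VMName 'TestVM'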

When possible, test all resources. Failing over a sample does not uncover problems like data corruption in the targets. 

Unfortunately, testing with the real systems might not catch it either, or failing back to the main site could very well ship corrupted data back to the source. Use your monitoring tools and capture a good backup before testing any production systems. 

If possible, take the source resource offline, wait for synchronization to complete, and capture a hash of the files at both locations. For file-based resources, you can use PowerShell:  

Get-FileHash -Path C:\Source\File.txt -Algorithm MD5 

Use the same command on the destination file; if the results don’t match, something is not right. Do not perform a failover until you have found and eliminated the problem.
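A short sketch of that comparison, with the replicated copy assumed to be reachable over an administrative share (both paths are placeholders):

$sourceHash  = (Get-FileHash -Path 'C:\Source\File.txt' -Algorithm MD5).Hash
$replicaHash = (Get-FileHash -Path '\\ReplicaHost\D$\Replica\File.txt' -Algorithm MD5).Hash

if ($sourceHash -ne $replicaHash) {
    Write-Warning 'Hashes differ; investigate before attempting a failover.'
}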

Note: the MD5 algorithm no longer has value in security applications, but still works well for file comparisons due to its speed and the near zero likelihood of a hash collision on two slightly different copies of the same file.

Once you have successfully failed a resource to your alternative site, bring it online and make certain that it works as expected. Some configurations require advanced settings and comparable testing. 

To return to our Hyper-V Replica example, you can set up the replica virtual machines to use a different IP address than the source VMs. If you have done that, ensure that the replicas have the expected connectivity.
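As a hedged illustration, the failover TCP/IP settings for a replica virtual machine can be staged ahead of time; the VM name and addresses below are placeholders for your secondary site.

# Assign the address the replica should use if it comes online at the secondary site.
Get-VMNetworkAdapter -VMName 'AppVM' |
    Set-VMNetworkAdapterFailoverConfiguration -IPv4Address '10.20.0.50' `
        -IPv4SubnetMask '255.255.255.0' -IPv4DefaultGateway '10.20.0.1'

After a test failover, confirm that the replica actually answers on the expected address before declaring success.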

After testing at the remote site, fail the resource back to the primary. Depending on the extent of your testing, it may take some time for all changes to cross. Return it to service and ensure that it works as expected. Keep your backup data ready, just in case.

Do Not Neglect Testing

We all have so much work to do that testing often feels like low-priority busy work. We have lots of monitoring systems to tell us if something failed. If nothing has changed since the last time that we tested, would another test tell us anything? 

Regardless of our workload and the tedium of testing, we cannot afford to skip it. 

We humans tend to predict the future based on the past, which means that we naturally expect functioning systems to continue to function. However, the odds of system failure increase over time. The only sure way to find problems is through testing. 

 Conclusion 

In conclusion, the inevitability of system failures underscores the critical importance of proactive measures. The opening anecdote highlights the consequences of neglecting backup procedures.

Testing, both manual and automated, emerges as a vital aspect of disaster recovery plans, ensuring the efficacy of backup data. 

Geographically distributed clusters and replication technologies offer additional layers of protection, necessitating thorough testing to uncover potential pitfalls. 
