How to use Replication to Easily Achieve Business Continuity

As costs for high-speed networking technology decline, we gain more ways to maintain operations through a catastrophe. Replication has changed disaster recovery more than anything else since the backup tape was first introduced.

Tapes once granted us the power to conveniently move data to a safe distance from its origin. Now, we can instantly transmit changes offsite as they occur or after a short delay.

A Short Introduction to Replication

Replication was discussed back in an earlier article as part of a backup strategy, but in terms of disaster recovery it requires a bit more exploration. The name says most of it: replication makes a “replica”, or “copy”. “Copy” invokes the idea of backup, but they have differences.

On the one hand, replication makes a unique, independent copy of data, just like backup. However, replicas do not have much of a historical record, nor do they have a long useful life.

Replication involves some sort of software running within the operating system or on a smart storage platform. You start by making an initial copy, called a “seed”. The replication software then watches the original for changes and transmits them to another instance of the same software, which incorporates the changes into the replica.
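The seed-then-track-changes cycle described above can be sketched in a few lines of Python. This is only a toy model, assuming the data fits in an in-memory dictionary; the function names (`seed`, `detect_changes`, `apply_changes`) and the sample items are hypothetical and do not come from any replication product. Real replication engines usually intercept writes as they happen rather than comparing hashes after the fact.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return a content hash used to spot changed items."""
    return hashlib.sha256(data).hexdigest()


def seed(source: dict) -> dict:
    """Build the initial full copy (the "seed") of every item."""
    return dict(source)


def detect_changes(source: dict, replica: dict):
    """Compare origin and replica; return (changed_or_new, deleted_names)."""
    changed = {
        name: data
        for name, data in source.items()
        if name not in replica or digest(replica[name]) != digest(data)
    }
    deleted = [name for name in replica if name not in source]
    return changed, deleted


def apply_changes(replica: dict, changed: dict, deleted: list) -> None:
    """Incorporate the transmitted changes into the replica."""
    replica.update(changed)
    for name in deleted:
        replica.pop(name, None)


# Hypothetical data: names and contents are made up for illustration.
source = {"orders.db": b"v1", "notes.txt": b"hello"}
replica = seed(source)

source["orders.db"] = b"v2"      # modified at the origin
source["new.log"] = b"started"   # created at the origin
del source["notes.txt"]          # deleted at the origin

changed, deleted = detect_changes(source, replica)
apply_changes(replica, changed, deleted)
```

After the final call, the replica again matches the origin. Only the changed items crossed the "wire", which is the property that makes replication cheaper than repeated full copies.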

Features of typical replication software

Runs continuously or on a short interval schedule
Functions one way at a time
May act as a component of another piece of software

Creates a genuine duplicate of the original, not wrapped in a format proprietary to the replication engine
Replicates without human intervention; failover to replica requires intervention

You will encounter occasional exceptions, primarily with replication systems such as Active Directory that do not treat any replica as the original. However, even in those systems, a change always occurs in one replica first, then the software transmits it to the others.

Also, the replica itself might be stored in a proprietary format, but typically only when the replication mechanism belongs to a larger program. For example, some SQL server software has built-in replication mechanisms, and some backup applications, like Hornetsecurity’s VM Backup, include a replication component.

In those cases, the format belongs to the program, not its replication engine.

Synchronous replication

High-end storage systems and some software offer synchronous replication. Details vary between implementations, but all end up transmitting changes from the origin to the replica in real time.

Synchronous replication processes have significant monetary, processing, and transmission costs. They allow for hot sites to pick up right from a failure point. Some synchronous replication systems allow for geographically distributed or “stretched” clusters. With these, you can reliably operate resources within the same cluster across distant datacenters.

For clustered roles like databases, you can have almost zero-downtime failovers. For items such as virtual machines, a failed datacenter will cause its virtual machines to crash, but remote nodes with synchronous storage can bring the VMs back online almost immediately.

Such protections allow you to architect an active/active design that keeps resources close to their users when all is well but to continue running in an alternative location when all is not.

Asynchronous replication

You will find a broader offering of asynchronous replication solutions. As the name implies, they operate with delay. The replication mechanism accumulates changes at the origin for a period ranging from a few seconds to a few minutes.

When it reaches a specified volume or time threshold, the system packages and transmits the changes to the remote point. The receiving replication system unpacks the changes and applies them to the replica.
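As a rough illustration of the volume-or-time threshold, the sketch below accumulates changes and reports when a package is due. The class name `ChangeBuffer`, its default thresholds, and the injectable clock are assumptions made for this example, not the behavior of any specific product.

```python
import time


class ChangeBuffer:
    """Accumulates changes and releases them as one package at a threshold."""

    def __init__(self, max_bytes=1_000_000, max_age_seconds=30.0,
                 clock=time.monotonic):
        self.max_bytes = max_bytes              # volume threshold (assumed value)
        self.max_age_seconds = max_age_seconds  # time threshold (assumed value)
        self._clock = clock                     # injectable so it can be tested
        self._pending = []
        self._pending_bytes = 0
        self._oldest = None                     # arrival time of first pending change

    def record(self, change: bytes) -> None:
        """Store one change made at the origin."""
        if self._oldest is None:
            self._oldest = self._clock()
        self._pending.append(change)
        self._pending_bytes += len(change)

    def due(self) -> bool:
        """True when either the volume or the time threshold has been reached."""
        if not self._pending:
            return False
        too_big = self._pending_bytes >= self.max_bytes
        too_old = self._clock() - self._oldest >= self.max_age_seconds
        return too_big or too_old

    def package(self) -> list:
        """Hand over all pending changes for transmission and reset the buffer."""
        batch = self._pending
        self._pending, self._pending_bytes, self._oldest = [], 0, None
        return batch
```

A sender loop would call `record()` for each change, poll `due()`, and transmit whatever `package()` returns. The receiving side would apply the items in order.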

Asynchronous replication’s primary advantage over synchronous is cost. It can transfer, test, and acknowledge large chunks of data at a measured pace instead of a rapid series of small blocks, so it reduces network load. Also, because of the convenient packaging system, some replication software will save a bit of history.

In case the system detects a corrupted data block, it might be able to walk back the recent changes to a good point.

Asynchronous replication can only function in active/passive mode. It does not mix with stretched clusters, although it can create a replica of a cluster at the origin site.

Choosing synchronous or asynchronous replication

Sometimes, you will have only one viable choice with replication. Software with a high IOPS profile may not function correctly with synchronous replication. You may uncover instances in which a software-agnostic synchronous replication does not work as well as a software package’s built-in asynchronous mechanism.

A line-of-business vendor may prohibit supporting any installation that sits atop a synchronous replication system. In cases such as those, conditions make the decision for you.

In other cases, you have three primary factors:

Price differences
Recovery point objectives (RPO)
Data value

Synchronous replication usually costs substantially more than asynchronous replication when you only compare the mechanisms. Synchronous replication also demands more from hardware, within the compute layer, the storage subsystems, and the network stack.

Cost often sets the parameters before you even consider the other factors. Recall the discussion on RPOs from an earlier article. If a system or data set has a large RPO tolerance, then do not rush to put synchronous replication on it without some other driving force, such as stretched clusters.

The shorter the RPO, the more you can justify synchronous replication. Asynchronous replication typically allows for very short delays, down to a few minutes or even a few seconds. If that satisfies your RPO, then prefer an asynchronous solution.

Even with a relatively short desired RPO, low-value data won’t justify the higher cost of synchronous replication. As an example, think of a freezer unit at a food distribution company. The historical record of its temperatures has value, especially if you have outstanding litigation over food storage.

However, the temperature of the freezer in the last five minutes before the facility collapsed in an earthquake probably does not matter to anyone. The current information matters only operationally, so it has value only while operations continue. Asynchronous replication can adequately protect this type of data.
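The factors discussed in this section can be condensed into a small decision helper. The sketch below is one possible reading of that guidance, not a rule from any vendor; the function name, the parameters, and the order in which the checks run are all assumptions.

```python
def choose_replication_mode(rpo_seconds, async_interval_seconds, *,
                            needs_stretched_cluster=False,
                            vendor_prohibits_sync=False,
                            high_value_data=True):
    """Suggest "synchronous" or "asynchronous" for one system or data set."""
    # A support restriction makes the decision for you.
    if vendor_prohibits_sync:
        return "asynchronous"
    # Stretched clusters are a driving force that requires synchronous replication.
    if needs_stretched_cluster:
        return "synchronous"
    # If the asynchronous delay already satisfies the RPO, prefer the cheaper option.
    if async_interval_seconds <= rpo_seconds:
        return "asynchronous"
    # Low-value data does not justify the higher cost, even with a short RPO.
    if not high_value_data:
        return "asynchronous"
    return "synchronous"


# Hypothetical example: a five-minute RPO with 30-second asynchronous shipping.
suggestion = choose_replication_mode(rpo_seconds=300, async_interval_seconds=30)
```

Cost is left out on purpose. In practice the budget often sets the boundaries before any of these checks apply.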

Avoid mixing synchronous and asynchronous replication for the same data. It might work without error, but nothing comes without cost. Replication can place a high toll on system resources. Layering replication makes it all worse and may not have any positives.

Choosing Replication Solutions

You will almost certainly use a mixture of replication technologies in order to achieve the best balance of support, functionality, protection levels, and resource usage. Even before looking at dedicated replication hardware or software, you have access to some replication technologies. A few that you might have right now:

Microsoft Active Directory
Microsoft SQL Server
Microsoft Exchange Server
Microsoft Hyper-V Server
A backup application, such as Hornetsecurity’s Total Backup
Some SAN and NAS devices
Storage Replica in Windows Server Datacenter Edition
A limited implementation of Storage Replica in Windows Server 2019/2022 Standard Edition

Look at your major software servers and packages to see if any of them have replication capabilities. Prefer the most specific replication technology that satisfies your requirements. Follow this decision process:

If you have a virtual or physical machine running software that has its own replication mechanism (like Active Directory), then use the application’s mechanism only.
If your hypervisor has a replica function and the software in the virtual machine cannot replicate itself, use the hypervisor’s replication tool.
If the machine is physical or you can’t use the hypervisor’s replication (perhaps because you do not have a target system running the same hypervisor), then use operating system replication (like Storage Replica).
If you cannot use replication in the operating system, use NAS or SAN replication.
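The decision process above reads naturally as an ordered series of checks. The following sketch encodes that order. The function and parameter names are hypothetical, and each flag stands in for a judgment you make about a workload.

```python
def choose_replication_layer(app_has_replication, is_virtual,
                             hypervisor_replica_usable, os_replication_usable):
    """Return the most specific replication layer that applies to a workload."""
    if app_has_replication:
        return "application"        # e.g. Active Directory's own replication
    if is_virtual and hypervisor_replica_usable:
        return "hypervisor"         # e.g. the hypervisor's replica function
    if os_replication_usable:
        return "operating system"   # e.g. Storage Replica
    return "NAS/SAN"                # fall back to storage-level replication


# Hypothetical workloads evaluated one at a time, as the article recommends.
domain_controller = choose_replication_layer(True, True, True, True)
file_server_vm = choose_replication_layer(False, True, True, True)
physical_server = choose_replication_layer(False, False, False, True)
```

Running the check per workload rather than once for the whole environment reflects the advice to prefer the most specific technology for each system.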

If, like most organizations, you have many virtual machines running a range of server applications (like AD, Exchange, SQL, etc.), then you should decide on replication separately for each of them. You will not get the best results by trying to force everything into the same solution. Some things will not work under some replication configurations that other programs can use without trouble.

The decision factors do not directly include backup replication. Most of the replication features in backup applications only make additional copies of backed-up data, not general-purpose data replicas. In that case, they only count as applications themselves (step 1) and only for the archives that they create.

If your backup program has a general data replication feature, then you can prioritize it before or after step 4.

This order of preference exists for several reasons:

If the software manufacturer went to the trouble of building a replication mechanism into their software, then it’s probably the best. Many of Microsoft’s technologies have been developed over decades. External replication cannot know the inner workings of these programs, so it will not work as effectively.
Vendors will not always support their software in conjunction with certain replication technologies. For example, Microsoft does not support using Hyper-V Replica on Exchange.

Replication functions provided by SANs, NAS devices, and hypervisors require the target to run the same or very similar system. If you decide to switch vendors, you’ll have to start replication processes created with their functions all over.

If you cannot get a sufficient budget to maintain the same services or equipment at all locations, you may run into some last-minute or mid-stream problems. Synchronous replication might present an overriding decision point.

You must remain wary of support concerns and other problems. In the absence of such barriers, synchronous replication moves up to the second preference. You will also need it anytime you intend to use fully functional stretched clusters.

Do Not Replace Backup with Replication

Backup and replication have similar features, but you cannot use them interchangeably. If you must choose between them, always choose backup. Replication exists to enable rapid failover. Replication characteristics that preclude its use as a backup tool:

Little historical information
Usually only one complete copy
Limited testing ability
No capability for regular offline copies
Replication does not always utilize quiescing technology such as VSS

If undesirable data, such as encrypting ransomware, travels to the replica, then it will probably invalidate the entire thing. You will then need to use your standard backup restoration process. Data deleted more than a short time ago will not exist anywhere in the replica’s files.

You will always need the long-term and offline protection features of a true backup.

Considering Replication Licensing Implications

Since replication is different from backup, its use may impose some licensing considerations. Microsoft does not consider a replica virtual machine as an “offline” or “cold” copy since the replication mechanism constantly updates it and the replica is a fully functional entity distinct from the original.

For that reason, hosts that maintain a replica of a Windows Server virtual machine require a separate license from the source host’s license that covers the original.

Above, we mentioned that Active Directory replication serves it better than other replication types, such as Hyper-V Replica. If you use Hyper-V Replica to protect a domain controller, you must still license the replica host as though the virtual machine were online.

So, running one distinct domain controller in each site gives you the best replication technology and makes no difference to your licensing. Note: this rule applies to any Windows Server instance in a virtual machine, not specifically to Active Directory.

You will need to investigate the licensing rules of your software and consider them in the context of replication. This can become complicated quickly, as it can also depend on the type of replication in use (application, hypervisor, operating system, dedicated software, or hardware) and other factors.

For instance, if you add Software Assurance to a Windows Server host license, you can replicate its virtual machines to other systems without additional licensing costs. For the most comprehensive answers, work with trained licensing specialists at authorized resellers or contact software vendors directly.

Hornetsecurity provides full 24/7 support as part of its services to help users achieve the perfect configuration. Make use of services like this to ensure your licensing matches your requirements.

Configuring Replication

Despite the plethora of replication solutions, they share common configuration points. The exact steps will depend on your hardware or software, so we will give a generic overview of the process.

Establishing replication sources and targets

Replication requires at least two endpoints capable of acting as replication partners. That requires a mirroring configuration of hardware and possibly software on each end. To begin, install and configure the hardware and, if you use a software-based mechanism, configure that as well. Necessary steps depend on your replication solution. A few examples:

For generic data replication, configure the hardware or software as an endpoint using the system’s directions.
For Active Directory replication, install the Active Directory Domain Services role on a system in each location. Follow the necessary steps to add them to the same domain and ensure that they have IP connectivity over your inter-site link. Use Active Directory Sites and Services or PowerShell to logically separate the sites. Active Directory will automatically set up its own replication, applying special rules for traffic that crosses sites in the expectation that they have less than gigabit speeds, that the links might have high contention, and that sites may periodically lose connection.
For Microsoft SQL replication, you first need to fully install SQL on each endpoint. You will then select one of SQL Server’s many data synchronization options and configure it accordingly.
Hyper-V Replica requires you to first configure each participating server to receive replica. For clustered Hyper-V hosts, you must create the Hyper-V Replica Broker role and configure it instead of working on any individual node. Once you complete that step on all relevant systems, you can then configure individual virtual machines to replicate to specific target replica partners.

If your backup application includes a replication function, follow its directions for setup and configuration. As mentioned previously, you will likely have a mixture of replication configurations. Create a checklist of the items to cover in order to ensure that you configure them all.

Creating an initial seed

After enabling replication, the first thing that must happen is a complete build-up of the starting replica. That will probably amount to a very significant chunk of data. Perform some rough calculations based on the data size and the speed of your intersite connection.
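A rough calculation of this kind might look like the sketch below. The `usable_fraction` parameter is an assumption that accounts for protocol overhead and other traffic sharing the link, and the example figures are invented.

```python
def seed_transfer_hours(data_gib, link_mbps, usable_fraction=0.7):
    """Estimate the hours needed to send an initial seed over an inter-site link.

    data_gib        -- size of the data to replicate, in GiB
    link_mbps       -- nominal link speed in megabits per second
    usable_fraction -- share of the link the seed can realistically use (assumed)
    """
    if link_mbps <= 0 or not 0 < usable_fraction <= 1:
        raise ValueError("link speed and usable fraction must be positive")
    total_bits = data_gib * 1024**3 * 8
    usable_bits_per_second = link_mbps * 1_000_000 * usable_fraction
    return total_bits / usable_bits_per_second / 3600


# Hypothetical example: 2 TiB of data over a 100 Mbps link.
hours = seed_transfer_hours(data_gib=2048, link_mbps=100)
days = hours / 24
```

With these made-up numbers the estimate lands at a little under three days, which is the kind of result that makes an offline seed worth considering.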

If you discover that it might take several days to finish transmission of the beginning replicas, you can create an offline initial seed. The process works like this:

Establish the replication partners
Define the data or objects that will replicate from the current site and configure replication
Use the application or device’s process to create an initial seed on a transportable device (such as a USB hard drive)
Physically transfer the seed to the target system
Establish the replica from the seed data
Start the replication process

Because operations will continue while the seed is in transit and building the replica, the replication system will need some time to catch up. You should not need to do anything else manually.

Most dedicated replication software uses the term “initial seed” or something recognizably similar. Software with built-in replication typically uses other wording. For example, follow the “Install From Media” (IFM) procedure when setting up an Active Directory Domain Controller.

Maintaining Replica

Most replication technologies work unsupervised after setup. Regardless of your confidence in the tools that you use, you should set up automated monitoring to keep an eye on them. You could also rely on some sort of daily manual verification process. However, your organization probably would not want a 24-hour or weekend period to pass without viable synchronization.

Also, the more a process tends to succeed, the less inclination tech staff will have to check on it. Your monitoring method depends on the architecture of the replication system. Set up alerts for:

Windows event logs
Linux error logs
Unexpected service halts
Inter-site connection breaks
Storage capacity

Some systems may have an option to send notification e-mails. While you should take advantage of those, do not rely on them. If the service fails completely, it will not send anything.

It’s easier to forget about a message that you never received than to ignore one that you did. Use active external monitoring.
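One way to make monitoring active is to have an outside system check how long it has been since each replica last synchronized, instead of waiting for the replication service to report a failure. The sketch below assumes you can obtain a last-success timestamp per replica from your tool. The function name, the replica names, and the 30-minute limit are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def stale_replicas(last_success_by_name, now, max_age):
    """Return the names of replicas with no recent successful synchronization.

    A missing timestamp (None) counts as stale, because silence from a failed
    service must not be mistaken for health.
    """
    return sorted(
        name
        for name, last_success in last_success_by_name.items()
        if last_success is None or now - last_success > max_age
    )


# Hypothetical status data gathered by an external monitoring job.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
status = {
    "vm-a": now - timedelta(minutes=5),
    "vm-b": now - timedelta(hours=3),
    "vm-c": None,  # never reported
}
alerts = stale_replicas(status, now, max_age=timedelta(minutes=30))
```

Because the check runs outside the replication service, a completely dead service still produces an alert.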

If you don’t have much experience with a particular tool, it may take some trial and error to properly balance its monitoring. As with all other things, you want it to tell you what you need to know without overwhelming you. The technology may introduce some concepts that are new to you.

For instance, most asynchronous replication systems involve some sort of “logging” and “playback” technique. Hyper-V Replica (HVR), for example, builds changes at the source into a log file. At the designated time interval, it ships the log file to the target replica host. Once received, the replica host “replays” the contents of the log file into the replica.

If something goes wrong with HVR, you will see symptoms in the directory that contains the replica and its log files. HVR keeps the log files for a time, but eventually cleans them up. If you’re accumulating log files, that signals a problem. If you have zero log files, you will want to investigate. In the case of HVR, you should have accompanying event log entries that provide detail.

However, monitoring the storage location in addition to the logs gives you additional opportunities to detect a problem before it has a permanent effect. It will also set up a best practice for you in case you have a different tool that does not write to the event log or some environmental problem that causes logs to roll over more quickly than you can process them.
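A storage-location check of this kind can be very small. The sketch below counts log files in a directory and flags both extremes. The `*.hrl` pattern and the threshold of five files are assumptions for illustration, so verify the file extension and a sensible limit for your own tool. The demonstration uses a temporary directory with made-up file names.

```python
import tempfile
from pathlib import Path


def check_log_backlog(directory, pattern="*.hrl", max_files=5):
    """Classify a replica's log directory by the number of pending log files."""
    count = sum(1 for p in Path(directory).glob(pattern) if p.is_file())
    if count == 0:
        return "investigate: no log files found"
    if count > max_files:
        return f"warning: {count} log files are accumulating"
    return "ok"


# Demonstration against a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as folder:
    empty_result = check_log_backlog(folder)
    for i in range(3):
        (Path(folder) / f"change{i}.hrl").touch()
    normal_result = check_log_backlog(folder)
    for i in range(3, 9):
        (Path(folder) / f"change{i}.hrl").touch()
    backlog_result = check_log_backlog(folder)
```

Scheduling a check like this alongside event log alerts gives a second, independent signal.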

Create a plan to accommodate resource discontinuation. Most replication systems will not automatically perform cleanup when you stop replicating. Whatever procedures you have developed for decommissioning applications and systems, append a process that describes how to stop and clean up replication.

As part of discovery and initial testing, find out how to handle these situations in your replication tools. Take time to learn if and how you can safely move replicas. If replication is a subsystem of another application, then it typically follows the application’s resource moving rules.

For example, you can use Hyper-V’s Storage Migration to move a replica virtual machine and HVR will automatically deliver replica files to the new location.

If you follow the supported steps to move the NTDS.DIT file on any domain controller, it will not break Active Directory replication. For application-agnostic replication technologies, you may have more work. Research in advance so that you never need to try to figure out how to move items under pressure.

Correcting Problems with Replication

You will encounter three main problem categories with replication:

Broken connections

Overwhelmed destination

Synchronization collisions

Most mature replication technologies deal with broken connections gracefully. They wait until they can reach the destination again and pick up where they left off. Test new systems before deployment and learn how they cope with and report these events. Use this information to shape your monitoring plan and responses.

With some technologies, the replica system can fall so far behind the primary that it simply gives up and breaks out of the partnership. The exact technique to recover depends entirely on the product. Check its literature for information.

Usually, the fix involves a resynchronization, which you can target for a quieter period. Discovering the root cause is as important as correcting the condition.

If it resulted from a broken inter-site link, then you know why and probably have no other recourse than to fix it and move on. However, if the link stayed active, then simple corrective action may only set it up to fail again. Some ways to address repeatedly overwhelmed replication:

Adjust the delay between transmissions. A natural instinct is to increase the delay to give the target system more time to process log files. However, it sometimes helps to reduce the interval so that the link and secondary system work with smaller files.
Reduce the load on the inter-site link.
Increase the speed of the inter-site link.
Upgrade the target hardware.

The first option is the easiest but involves potentially frustrating trial and error. The last two items will likely involve capital expenditures and contractors. To discover where to focus your efforts, set up monitoring on the resources.

Learn if the target becomes overwhelmed because it doesn’t receive the data in time to process it ahead of the next package, or if it receives it quickly enough but doesn’t have sufficient speed to handle it before another arrives.
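Once you have measurements, the distinction comes down to comparing two durations against the replication interval. The sketch below assumes you can measure the average time to transfer a package and the average time to apply it. The function name and the simple comparisons are illustrative only, and real systems may overlap the two phases.

```python
def classify_bottleneck(interval_s, transfer_s, apply_s):
    """Suggest where an overwhelmed replica is falling behind.

    interval_s -- seconds between replication packages
    transfer_s -- measured average seconds to transfer one package
    apply_s    -- measured average seconds for the target to apply one package
    """
    if transfer_s >= interval_s:
        return "link"      # data does not arrive before the next package is due
    if apply_s >= interval_s:
        return "target"    # data arrives in time, but processing is too slow
    return "keeping up"


# Hypothetical measurements for a five-minute replication interval.
slow_link = classify_bottleneck(interval_s=300, transfer_s=420, apply_s=60)
slow_target = classify_bottleneck(interval_s=300, transfer_s=90, apply_s=360)
healthy = classify_bottleneck(interval_s=300, transfer_s=90, apply_s=60)
```

A "link" result points toward reducing load or increasing link speed, while a "target" result points toward upgrading the target hardware.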

You need to find the bottlenecks before you start trying to fix them. A common technique for load reduction is removal of non-essential resources from the replication chain. For virtual machines, you can relocate swap data to separate virtual disks and exclude those disks from replication.

Preventing “split brain” and synchronization collisions

Cluster technologies use some form of external arbiter to prevent access to the same object from multiple locations with the expectation that a completely isolated member will not come online without extraordinary steps.

Controls might be complex, like Microsoft’s dynamic quorum, or simple, like a lock file. In contrast, replication works with linked but unique objects. Any replication partner must have the freedom to operate on its own replica even when completely isolated. The only arbiters are human operators.

Replication functions properly when one partner processes a change to its object and transmits that change to the other partner(s). When two or more partners in replication receive changes to their local copy of the same item, you have the potential for a collision.
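The collision condition can be shown with a toy comparison against the last synchronized state. In the sketch below, each site maps item names to a version value and `baseline` holds the state both sites agreed on at the last successful synchronization. All names and values are hypothetical, and real products use their own tracking mechanisms.

```python
def find_collisions(baseline, site_a, site_b):
    """Return items that both sites changed, in different ways, since the baseline."""
    collisions = []
    for name in sorted(set(baseline) | set(site_a) | set(site_b)):
        base = baseline.get(name)
        a_changed = site_a.get(name) != base
        b_changed = site_b.get(name) != base
        # A change on one side only can simply be replicated to the other.
        if a_changed and b_changed and site_a.get(name) != site_b.get(name):
            collisions.append(name)
    return collisions


# Hypothetical state after the sites operated independently for a while.
baseline = {"customers": 1, "orders": 1, "prices": 1}
site_a = {"customers": 2, "orders": 1, "prices": 2}
site_b = {"customers": 3, "orders": 2, "prices": 2}
collisions = find_collisions(baseline, site_a, site_b)
```

Here only `customers` collides: `orders` changed at one site only, and `prices` happens to hold the same value at both. Deciding which version of a colliding item survives is the arbitration problem described in this section.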

Active/active replication systems have some capability to minimize these problems. Active Directory uses timestamps and other arbitration techniques to choose the one change that it will keep and records the others as historical changes. Active/passive replication typically does not have such robust protection.

Consider a situation in which Site A replicates to Site B. The inter-site link drops. Site A continues operating as normal because it was the original. An operator at Site B has built a script that automatically fails over to the local replica when the link drops, on the assumption that such a drop means that Site A has gone down.

Unfortunately, that assumption was incorrect. The script runs, resulting in both sites actively making changes to their local replica. We call this condition “split brain”. When the link is restored, Site A will try to resume synchronization. If B’s replica is not in the condition that Site A expects, synchronization will fail with no automatic way to recover.

Depending on the replication technology in play, you may have a great deal of clean-up work to look forward to. Complete recovery may not be possible. In the case of Hyper-V Replica, you will need to choose one replica as the origin and resynchronize to the other as if it were new.

You can copy any data that you want to save out of the replica first, then put it back into the origin. File-by-file replication systems will only have troubles with competing file changes. More complex systems with no viable repair path may suffer permanent data loss.

Even active/active mechanisms like Active Directory have some risks. It should have no problems surviving the above scenario because it was designed with those types of failures in mind. However, you can cause permanent damage to Active Directory in other ways.

In the past, rolling a virtualized domain controller back to a previous state could cause irreparable damage to the directory. Research “USN rollback” for more information on that problem. For the purposes of this discussion, understand that you can break any kind of replication technology by using it in an unsupported fashion.

Most such breakdowns require restoring to an earlier backup. A few best practices can keep you out of split-brain conditions:

Do not automate failover for replication systems that have no automated arbitration
Create a defined process for initiating failover (see the upcoming section on Business Process for Disaster Recovery for more information)
Do not mix virtual machine snapshot/checkpoint technologies with replication technologies

As a note on the last bullet point, Hyper-V uses its own checkpoint technology internally to facilitate backup operations and Hyper-V Replica.

These special-purpose checkpoints pose no risk to replication. Many synchronization collisions occur because a change was made, duplicated to a replica, rolled back at the source, and then the source changed again prior to the next replication interval.

The new changes appear to conflict with an earlier change, which throws the replica into an unknown state. Because Hyper-V’s backup and replication checkpoint functions never revert, they do not cause collisions.

Fundamentally, replication exists to enable rapid failover to an alternative site. When used correctly, it can allow nearly uninterrupted data services even in a major catastrophe. When used incorrectly, it adds a lot of overhead at best and causes a great deal of damage at worst.

Leveraging Replication in Disaster Recovery

While replication can address the offsite requirements of backup, it does not replace any of its other components. You cannot maintain a series of offline replicas, nor will replication software have a simple way to retrieve historical data (like an e-mail or a single file).

Replication software will overwrite good data with corrupted data without hesitation and then delete its previous state. Replica supplements backup well, but it will never replace it. If you have sufficient funding and at least one viable alternative site, replication enhances your business continuity solution.

Conclusion

Replication transforms disaster recovery by swiftly creating independent data copies for seamless offsite updates. Whether synchronous or asynchronous, it enables rapid failover, minimizing downtime during crises.

Replication is a valuable supplement to backup, not a replacement for it. It excels at maintaining nearly uninterrupted data services during catastrophic events, which makes it a crucial element of a robust disaster recovery strategy.

About the Author

Paul Schnackenburg

Paul Schnackenburg started in IT when DOS and 286 processors were the cutting edge. He runs Expert I..
