IBM SmartCloud Enterprise+ disaster recovery considerations for DB2

Disaster recovery on IBM SmartCloud Enterprise+ (SCE+) usually refers to infrastructure-based disaster recovery. Disaster recovery (DR) solutions at the infrastructure level replicate whole virtual machines (VMs), including all data, from the main production site to the DR site. The advantage of this approach is that if a disaster occurs, an exact copy of the production environment, including all OS settings and patches, is available on the DR site. The VMs on the DR site can then be started and take over the load quite seamlessly (apart from the nasty reconfiguration of site-specific network settings such as IP ranges).

An infrastructure as a service (IaaS) DR solution is planned as part of the SmartCloud Enterprise+ offering in a later release.

Although IaaS DR solutions do their job well, they are rather expensive and complex. Mirroring complete virtual machine images not only takes a lot of storage space but also consumes network bandwidth. So the question solution architects should ask is whether IaaS-based DR is really required.

In many scenarios, a more cost-effective and less complex solution is application-level disaster recovery. Let’s take DB2 as an example of the many middleware products and applications that can either be clustered at the application level or keep a cold standby aside. DB2 lets us use its HADR (High Availability Disaster Recovery) function to collect all database update operations and queue them for distribution to other nodes.

Those collected updates can be sent over the network, using a variety of technologies or protocols. The interval between send operations depends on the recovery point objective (RPO) target. A shorter interval between send operations provides a better RPO but might generate more data traffic.
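The trade-off between send interval, RPO and traffic can be illustrated with a little arithmetic. The following is only a rough sketch under simplifying assumptions: the function names, the fixed per-send overhead and the sample figures are hypothetical and are not taken from DB2 or SCE+. It assumes that an update made right after a send waits one full interval plus the transfer time, and that every send adds a constant protocol overhead on top of the update data itself.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def worst_case_rpo_seconds(send_interval_s, transfer_time_s):
    """Upper bound on the data-loss window in this simplified model.

    An update made just after a send has to wait one full interval
    and then needs the transfer time to reach the DR site.
    """
    return send_interval_s + transfer_time_s


def daily_traffic_bytes(send_interval_s, update_bytes_per_day, overhead_bytes_per_send):
    """Update data plus an assumed fixed overhead for every send operation."""
    sends_per_day = SECONDS_PER_DAY / send_interval_s
    return update_bytes_per_day + sends_per_day * overhead_bytes_per_send


if __name__ == "__main__":
    # Hypothetical workload: 2 GB of updates per day, 50 KB overhead per send,
    # 5 seconds to transfer one batch.
    for interval in (60, 300, 3600):
        rpo = worst_case_rpo_seconds(interval, transfer_time_s=5)
        traffic = daily_traffic_bytes(interval, 2 * 10**9, 50 * 10**3)
        print(f"interval={interval}s  worst-case RPO={rpo}s  traffic={traffic / 10**6:.1f} MB/day")
```

In this model the update volume stays the same whatever the interval; only the overhead grows as the interval shrinks, which is why the text says a shorter interval "might" generate more traffic rather than that it always does.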

DB2 HADR setup

Another advantage of such a setup is that failover tests are easy to run. Because the DR systems are up and running all the time, their proper function can be tested at any time simply by accessing them.

However, the drawback of such an application-level DR solution is that the standby system must be up and running all the time, and every update to the primary system itself (such as configuration changes and software patches) must also be applied to the standby servers.
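Keeping primary and standby in step is easier if drift is detected automatically. As a minimal sketch, assuming the relevant settings of each server have already been collected into a simple key/value inventory (how that is done is outside the scope of this post, and the setting names below are made up for illustration), the two inventories can be compared like this:

```python
def find_drift(primary, standby):
    """Return settings that differ between primary and standby.

    Both arguments are plain dicts of setting name -> value.
    A setting missing on one side is reported with None for that side.
    """
    drift = {}
    for key in sorted(primary.keys() | standby.keys()):
        primary_value = primary.get(key)
        standby_value = standby.get(key)
        if primary_value != standby_value:
            drift[key] = (primary_value, standby_value)
    return drift


if __name__ == "__main__":
    # Hypothetical inventories; names and values are illustrative only.
    primary = {"os_patch_level": "2013-04", "db_fixpack": "FP7", "max_connections": 200}
    standby = {"os_patch_level": "2013-02", "db_fixpack": "FP7"}
    for name, (p, s) in find_drift(primary, standby).items():
        print(f"{name}: primary={p} standby={s}")
```

A check like this could run on a schedule and flag the standby for patching whenever the result is not empty.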

Summary

Application-level disaster recovery is not the solution for every scenario, but it can be a valid, cost-effective and less complex alternative. Sometimes a combination of infrastructure-level and application-level DR might be the best solution for an environment.