What are the differences among the different disaster recovery options for databases?


In the context of AWS databases, how do the following disaster recovery strategies differ from one another:

  • point-in-time recovery
  • backup
  • snapshot
  • Aurora backtrack

When should we choose one over the others?

Why do we need so many different options when one will suffice?

Should we try to use all of them?

3

There are 3 best solutions below

0
On BEST ANSWER

One key difference between a manual snapshot and an automated backup is that a snapshot doesn't expire, whereas automated backups are stored for a maximum of 35 days.
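To make the retention difference concrete, here is a minimal sketch. The function name `expiry_time` and the `kind` labels are made up for illustration and are not part of any AWS API; the only figure taken from the answer is the 35-day ceiling.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Upper limit for automated backup retention, as cited in the answer above.
MAX_AUTOMATED_RETENTION_DAYS = 35


def expiry_time(created_at: datetime, kind: str,
                retention_days: int = MAX_AUTOMATED_RETENTION_DAYS) -> Optional[datetime]:
    """Return when a backup would be deleted, or None if it never expires.

    `kind` is a hypothetical label: "manual_snapshot" or "automated_backup".
    """
    if kind == "manual_snapshot":
        # Manual snapshots stay until someone deletes them explicitly.
        return None
    if kind == "automated_backup":
        if not 1 <= retention_days <= MAX_AUTOMATED_RETENTION_DAYS:
            raise ValueError("retention must be between 1 and 35 days")
        return created_at + timedelta(days=retention_days)
    raise ValueError(f"unknown backup kind: {kind}")


created = datetime(2024, 1, 1, tzinfo=timezone.utc)
snapshot_expiry = expiry_time(created, "manual_snapshot")
backup_expiry = expiry_time(created, "automated_backup", retention_days=35)
```

In practice this is why a manual snapshot is usually taken before a risky change or for long-term retention, while automated backups cover day-to-day recovery.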

When you enable automated backups for your AWS database, AWS takes periodic backups of your database and stores them in Amazon S3. These backups serve as the starting point for PITR. AWS also keeps transaction logs in S3 for the backup retention period you configure (up to 35 days), allowing you to perform point-in-time recovery (PITR) to any point within that period.

When you initiate a PITR restore operation, AWS uses the selected backup and the transaction log to restore your database to the desired point in time. AWS first restores the backup and then applies the relevant transactions from the transaction log to the restored backup. This process brings the database to the desired point in time, allowing you to recover your data as it existed at that time.
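The restore-then-replay idea can be sketched in a few lines. This is only a toy model of the process described above, not how AWS implements it: the `Backup` and `LogEntry` types, the integer timestamps, and the key/value "database" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Backup:
    taken_at: int          # time the backup was taken (arbitrary integer clock)
    data: Dict[str, str]   # full copy of the data at that moment


@dataclass(frozen=True)
class LogEntry:
    at: int                # time the change was committed
    key: str
    value: Optional[str]   # None means the key was deleted


def restore_to_point_in_time(backups: List[Backup], log: List[LogEntry],
                             target: int) -> Dict[str, str]:
    """Restore the latest backup at or before `target`, then replay the log."""
    candidates = [b for b in backups if b.taken_at <= target]
    if not candidates:
        raise ValueError("no backup exists at or before the target time")
    base = max(candidates, key=lambda b: b.taken_at)

    state = dict(base.data)  # step 1: restore the backup into a fresh copy
    for entry in sorted(log, key=lambda e: e.at):
        # step 2: apply only changes made after the backup, up to the target
        if base.taken_at < entry.at <= target:
            if entry.value is None:
                state.pop(entry.key, None)
            else:
                state[entry.key] = entry.value
    return state


backups = [Backup(0, {"a": "1"}), Backup(100, {"a": "1", "b": "2"})]
log = [LogEntry(50, "b", "2"), LogEntry(120, "a", "9"), LogEntry(150, "b", None)]

restored = restore_to_point_in_time(backups, log, target=130)
# restored == {"a": "9", "b": "2"}: the deletion at t=150 is not applied
```

Note that the result is built in a fresh copy, which mirrors the fact that a PITR restore produces a new database rather than modifying the existing one.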

Aurora Backtrack (available for Aurora MySQL-compatible clusters) lets you undo unintended or incorrect changes by rewinding the existing DB cluster to a specific point in time, without restoring from a backup. Because no new DB cluster is created, rollbacks are fast. However, the backtrack window is limited to a maximum of 72 hours, so you can only rewind to a point within the last 72 hours. This is because Backtrack relies on change records that Aurora retains only for the configured backtrack window.
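As a rough sketch of the 72-hour limit, the helper below checks a target time against the window and builds the arguments for a backtrack request. The helper name `backtrack_request` and the cluster name are made up; the parameter names `DBClusterIdentifier` and `BacktrackTo` are the ones I believe boto3's `backtrack_db_cluster` accepts, so check them against the current boto3 documentation before relying on them.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict

# Maximum backtrack window mentioned in the answer above.
MAX_BACKTRACK_WINDOW = timedelta(hours=72)


def backtrack_request(cluster_id: str, target: datetime, now: datetime,
                      window: timedelta = MAX_BACKTRACK_WINDOW) -> Dict[str, Any]:
    """Build backtrack arguments, or raise if the target is out of range."""
    if window > MAX_BACKTRACK_WINDOW:
        raise ValueError("backtrack window cannot exceed 72 hours")
    if target > now:
        raise ValueError("cannot backtrack to a time in the future")
    if now - target > window:
        # Too old for Backtrack: fall back to PITR or a snapshot restore.
        raise ValueError("target is outside the backtrack window")
    return {"DBClusterIdentifier": cluster_id, "BacktrackTo": target}


now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
request = backtrack_request("my-aurora-cluster", now - timedelta(hours=2), now)

# Assumed usage with boto3 (not executed here):
#   import boto3
#   boto3.client("rds").backtrack_db_cluster(**request)
```

If the target falls outside the window, the usual fallback is a point-in-time restore or a snapshot restore, both of which create a new cluster.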

0
On

'Disaster Recovery' is very old-world. It implies having to fail-over when a problem happens. In the cloud, however, you can focus on High Availability so that systems can recover automatically when there is a failure, without the need to 'fail-back' to the original system.

Therefore, the best option is not to do disaster recovery at all.

Instead, take advantage of the cloud-first design of Amazon Aurora, which automatically replicates data between multiple Availability Zones (each consisting of one or more physically separate data centers).

From High availability for Amazon Aurora - Amazon Aurora:

Aurora stores copies of the data in a DB cluster across multiple Availability Zones in a single AWS Region. Aurora stores these copies regardless of whether the instances in the DB cluster span multiple Availability Zones.

When data is written to the primary DB instance, Aurora synchronously replicates the data across Availability Zones to six storage nodes associated with your cluster volume. Doing so provides data redundancy, eliminates I/O freezes, and minimizes latency spikes during system backups. Running a DB instance with high availability can enhance availability during planned system maintenance, and help protect your databases against failure and Availability Zone disruption.

If you want to use a traditional database instead (e.g. SQL Server), you can use Amazon RDS to run a Multi-AZ database. This consists of two database servers in the same Region but in different Availability Zones (which means different data centers):

  • A Primary server in one AZ that is serving traffic
  • A Secondary server in a different AZ (in the same Region) that is being continuously updated by the Primary server

If a failure happens with the Primary server, the Secondary server becomes the new Primary server. There is a brief outage, but no data is lost. The RDS service will then launch a new Secondary server.
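A toy model may help show why no data is lost on failover. This is just an illustration of the primary/secondary behaviour described above, with a made-up `MultiAZDatabase` class and plain dictionaries standing in for the two servers; it is not how RDS is implemented.

```python
from typing import Dict


class MultiAZDatabase:
    """Toy primary/secondary pair with synchronous replication."""

    def __init__(self) -> None:
        self.primary: Dict[str, str] = {}
        self.secondary: Dict[str, str] = {}
        self.failovers = 0

    def write(self, key: str, value: str) -> None:
        # A write is only considered done once both copies hold it,
        # so the secondary never lags behind the primary.
        self.primary[key] = value
        self.secondary[key] = value

    def read(self, key: str) -> str:
        return self.primary[key]

    def fail_primary(self) -> None:
        # The secondary is promoted to primary, then a replacement
        # secondary is created as a copy of the new primary.
        self.primary = self.secondary
        self.secondary = dict(self.primary)
        self.failovers += 1


db = MultiAZDatabase()
db.write("order-1", "paid")
db.fail_primary()
value_after_failover = db.read("order-1")
```

Because every acknowledged write is already on the secondary, promoting it loses nothing; the only cost is the brief outage while the switch happens.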

Failure recovery vs Data recovery

The other options you mention (point-in-time recovery, snapshots) are focussed on recovering data that was in the database at a particular time. This is normally because somebody/something accidentally deleted or changed data and you wish to recover the data as it was at a previous time. It is good to combine both High Availability and Snapshots, although Amazon Aurora almost makes Snapshots irrelevant due to its ability to go back to a previous point in time.

Bottom line: Instead of Disaster Recovery, think High Availability.

0
On

First of all, you need to identify the Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for your workload. RTO is the amount of time from a disaster event to when your system must be fully operational again. RPO is the maximum amount of data loss that you can tolerate after a disaster event. These objectives help you determine the appropriate level of risk and cost for your disaster recovery (DR) plan.
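As a quick worked example of how these two objectives are checked, here is a small sketch. The function names and all the durations are made-up illustrations: worst-case data loss is taken to be the gap between recovery points, and worst-case recovery time is detection time plus restore time.

```python
from datetime import timedelta


def worst_case_data_loss(recovery_point_interval: timedelta) -> timedelta:
    """A failure just before the next recovery point loses one full interval."""
    return recovery_point_interval


def worst_case_recovery_time(detection: timedelta, restore: timedelta) -> timedelta:
    """Downtime runs from the failure until the restored system is serving again."""
    return detection + restore


def meets_objectives(recovery_point_interval: timedelta, detection: timedelta,
                     restore: timedelta, rpo: timedelta, rto: timedelta) -> bool:
    return (worst_case_data_loss(recovery_point_interval) <= rpo
            and worst_case_recovery_time(detection, restore) <= rto)


rpo = timedelta(hours=1)
rto = timedelta(hours=4)

# Daily backups only: up to 24 hours of data could be lost, so the RPO is missed.
daily_only = meets_objectives(timedelta(hours=24), timedelta(minutes=15),
                              timedelta(hours=2), rpo, rto)

# Recovery points every 5 minutes (e.g. via transaction logs): both targets met.
with_logs = meets_objectives(timedelta(minutes=5), timedelta(minutes=15),
                             timedelta(hours=2), rpo, rto)
```

With these example numbers the daily-backup setup fails only on RPO, which is exactly the gap that point-in-time recovery is meant to close.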

According to AWS documentation, there are four main DR strategies that you can use on AWS:

  1. Backup and restore – back up your systems and restore them from backup if disaster strikes. This is low-cost but high-risk, as it has a high RTO and RPO.
  2. Pilot light – replicate your data and core elements to another Region and scale up when needed. This reduces the RTO and RPO but requires some manual intervention.
  3. Warm standby – run a scaled-down version of your system in another Region that can handle minimal traffic. This allows you to switch over quickly with minimal downtime. This further reduces the RTO and RPO but increases the cost and complexity.
  4. Multi-site active/active – run your system across multiple Regions with load balancing and synchronization. This provides the highest availability and resilience, as well as the lowest RTO and RPO possible. However, this also requires the most cost and complexity.
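One way to use this list is to walk it from cheapest to most expensive and stop at the first strategy that meets your objectives. The sketch below assumes exactly that ordering; the `Strategy` type is made up, and the RTO/RPO estimates are placeholder numbers, not AWS figures. You would substitute your own measurements.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List, Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    est_rto: timedelta   # your own estimate of achievable recovery time
    est_rpo: timedelta   # your own estimate of achievable data loss


def cheapest_strategy(strategies: List[Strategy], rto: timedelta,
                      rpo: timedelta) -> Optional[Strategy]:
    """Return the first (cheapest) strategy meeting both objectives, if any."""
    for strategy in strategies:  # assumed to be ordered cheapest first
        if strategy.est_rto <= rto and strategy.est_rpo <= rpo:
            return strategy
    return None


# Placeholder estimates for illustration only, in the order of the list above.
STRATEGIES = [
    Strategy("Backup and restore", timedelta(hours=24), timedelta(hours=24)),
    Strategy("Pilot light", timedelta(hours=4), timedelta(minutes=15)),
    Strategy("Warm standby", timedelta(minutes=30), timedelta(minutes=1)),
    Strategy("Multi-site active/active", timedelta(minutes=1), timedelta(seconds=5)),
]

choice = cheapest_strategy(STRATEGIES, rto=timedelta(hours=6), rpo=timedelta(hours=1))
```

With these placeholder numbers, a 6-hour RTO and 1-hour RPO rule out backup and restore and land on pilot light; tighter objectives push you further down the list and up in cost.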

Your question only focuses on different backup and restore strategies. They are all different ways of restoring your database state from a specific point in time using AWS services such as Amazon Relational Database Service (RDS), Amazon Aurora, or Amazon DynamoDB.

However, these options do not cover other aspects of DR such as scaling up resources, switching over traffic, or synchronizing data across Regions. Some services, such as Amazon Aurora with Aurora Global Database, natively support cross-Region replication (with writes handled in a single primary Region), while others need more setup work on your side. Therefore, you need to first focus on the RTO and RPO objectives for your workload before choosing a DR strategy. Also, please refer to Disaster Recovery on AWS.