Cold Disaster Recovery for Kubernetes applications

July 24, 2023

Introduction

Since my last exploration of the topic in mid-2021, cold disaster recovery for applications on Kubernetes has advanced significantly, both technically and in terms of user experience. Notably, the Kubernetes operator that manages the technical infrastructure for cold disaster recovery has progressed from version 0.2.1 to 1.2.0. Among other benefits, this has streamlined the initial setup and reduced the steps necessary to initiate backup and recovery. Additionally, Red Hat customers now have a supported option, which should give them the confidence to meet the data protection requirements of their critical cloud-native applications.

It is also worth noting that the Red Hat portfolio covering the whole spectrum of disaster recovery scenarios has evolved as well. While this blog post concentrates on cold disaster recovery, there are also options for warm and hot disaster recovery. Acknowledging the abundance of definitions online, Table 1 encapsulates Kubernetes-specific definitions of these terms and how they relate to Red Hat solutions.

Cold Disaster Recovery
  Description: Typically refers to scenarios where the infrastructure components are in place, but manual steps are required to restore the service, including moving data and deploying applications.
  Subscriptions required: Red Hat OpenShift Kubernetes Engine, Red Hat OpenShift Container Platform, or Red Hat OpenShift Platform Plus, and some form of compatible storage, for example delivered via Red Hat OpenShift Platform Plus, which includes OpenShift Data Foundation Essentials.
  Further reading: https://docs.openshift.com/container-platform/4.12/backup_and_restore/index.html

Warm Disaster Recovery
  Description: Sometimes referred to as Regional Disaster Recovery. Implies that a redundant set of infrastructure components is deployed and that application data is replicated asynchronously. Manual steps are required to trigger a failover.
  Subscriptions required: Red Hat OpenShift Kubernetes Engine, Red Hat OpenShift Container Platform, or Red Hat OpenShift Platform Plus, and Red Hat OpenShift Data Foundation Advanced; or Red Hat OpenShift Container Platform, Red Hat Advanced Cluster Management for Kubernetes, and Red Hat OpenShift Data Foundation Advanced.
  Further reading: https://red-hat-storage.github.io/ocs-training/training/ocs4/odf411-multisite-ramen.html and https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.12/html/configuring_openshift_data_foundation_disaster_recovery_for_openshift_workloads/rdr-solution

Hot Disaster Recovery
  Description: Sometimes referred to as Metro Disaster Recovery. Effectively means a fully functional, redundant copy of the production environment elsewhere, with synchronous application data replication.
  Subscriptions required: Red Hat OpenShift Platform Plus and OpenShift Data Foundation Advanced; or Red Hat OpenShift Container Platform, Red Hat Advanced Cluster Management for Kubernetes, and Red Hat OpenShift Data Foundation Advanced.
  Further reading: https://red-hat-storage.github.io/ocs-training/training/ocs4/odf411-metro-ramen.html and https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.12/html/configuring_openshift_data_foundation_disaster_recovery_for_openshift_workloads/metro-dr-solution

Table 1: Options for Kubernetes-based Disaster Recovery

The objective of this blog post is to provide a simple, easy-to-understand starting point for anyone seeking guidance on backing up their Kubernetes applications, while acknowledging that it is not a complete guide to all things data protection.

Scenario

In this blog post, I will reference various Red Hat-specific technologies such as OpenShift and the OpenShift API for Data Protection (OADP). Since these technologies are built on open source, the approach extends to other Kubernetes distributions where the underlying APIs, specifically Velero, are readily accessible.
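OADP is effectively a downstream packaging of Velero, so on a non-OpenShift cluster a comparable setup can be sketched with the upstream velero CLI. The following is a minimal sketch rather than a tested recipe; it assumes Velero 1.10 or newer (for --use-node-agent) and the credentials file that is created further below:

# Hypothetical equivalent for a vanilla Kubernetes cluster: install the
# Velero server with the AWS object storage plugin, pointing it at the
# same MinIO endpoint and bucket used throughout this post.
$ velero install \
    --provider aws \
    --plugins velero/velero-plugin-for-aws:v1.7.0 \
    --bucket oadp-backup \
    --secret-file ./credentials-velero \
    --use-node-agent \
    --backup-location-config region=minio,s3ForcePathStyle="true",s3Url=http://minio.example.com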

The steps provided later on assume the presence of certain components. Rather than providing detailed setup instructions, the focus will be on utilizing them:

  1. Two OpenShift Container Platform 4.12 clusters with some form of local storage solution
  2. A central and highly available container registry, in this case, Red Hat Quay, which is connected to both OpenShift clusters
  3. A central MinIO object storage cluster with an endpoint accessible to both OpenShift clusters
  4. One bucket along with the necessary credentials to read from and write data to the storage
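For the last prerequisite, the bucket can be prepared with any S3-compatible client, for instance with the AWS CLI against the MinIO endpoint used throughout this post (the bucket name oadp-backup matches the later examples):

# create the backup bucket on the MinIO endpoint
$ aws --endpoint-url=http://minio.example.com s3 mb s3://oadp-backup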

In this scenario, we will utilize Velero to back up OpenShift and Kubernetes resources. These backups will be stored as archives in the MinIO object storage bucket. Additionally, since we are leveraging NFS for storage in this example, backups of Persistent Volumes (PVs) will also be stored in the object storage using Restic.

Side note: Another approach to backing up PVs is through snapshots, managed either via the native snapshot APIs of cloud providers (if available) or via Container Storage Interface (CSI) snapshots; a sketch of this alternative follows Figure 1. With these considerations, the high-level architecture depicted in Figure 1 emerges.

Figure 1: Conceptual view of the architecture
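To make the side note concrete, here is a hedged sketch of the snapshot-based alternative, not used further in this post: assuming the csi default plugin were enabled in the DataProtectionApplication shown later and a VolumeSnapshotClass compatible with the storage class existed, a Backup could request snapshots instead of file system copies.

$ cat backup-csi.yaml   # hypothetical; shown for illustration only

apiVersion: velero.io/v1
kind: Backup
metadata:
  name: backup-persistent-database-prod-csi
  namespace: openshift-adp
spec:
  includedNamespaces:
    - persistent-database-prod
  # take CSI snapshots of the PVs instead of Restic file system copies
  snapshotVolumes: true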

To conveniently install and configure various components in two Kubernetes environments from a client computer, we will utilize kubectl contexts. The primary site will be referred to as “primary” and the failover site as “failover,” although these names are arbitrary.

$ oc login --token=sha256~d2VyIGRhcyBsaWVzdCBpc3QgZG9vZgo --server=https://api.primary.example.com:6443

$ oc config rename-context $(oc config current-context) primary

$ oc --context primary get nodes
NAME        STATUS   ROLES           AGE      VERSION
compute-0   Ready    worker          2y333d   v1.25.8+37a9a08
compute-1   Ready    worker          2y333d   v1.25.8+37a9a08
compute-2   Ready    worker          2y333d   v1.25.8+37a9a08
compute-3   Ready    worker          2y333d   v1.25.8+37a9a08
master-0    Ready    master,worker   2y333d   v1.25.8+37a9a08
master-1    Ready    master,worker   2y333d   v1.25.8+37a9a08
master-2    Ready    master,worker   2y333d   v1.25.8+37a9a08

$ oc login --token=sha256~Zm9sbG93IHRoZSB3aGl0ZSByYWJiaXQK --server=https://api.failover.example.com:6443

$ oc config rename-context $(oc config current-context) failover

$ oc --context failover get nodes
NAME                                         STATUS   ROLES                         AGE   VERSION
ip-10-0-166-125.eu-west-1.compute.internal   Ready    control-plane,master,worker   30m   v1.25.7+eab9cc9

Installing the OADP Operator

To back up and restore applications running on OpenShift Container Platform, one option is to use the OADP operator. OADP handles the deployment and management of the components necessary for implementing disaster recovery. To deploy the operator, refer to the documentation for detailed instructions, and remember to perform these steps on both the primary and the failover site.
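For orientation, an OLM Subscription along the following lines installs the operator. This is a minimal sketch: it assumes the openshift-adp namespace and a matching OperatorGroup already exist, and the channel name is an assumption based on the 1.2 release stream, so it may differ in your catalog.

$ cat oadp-subscription.yaml

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: redhat-oadp-operator
  namespace: openshift-adp
spec:
  # channel name is an assumption based on the 1.2 release stream
  channel: stable-1.2
  name: redhat-oadp-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace

Once the operator is running in both clusters, obtain the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY that are used to access the MinIO object storage bucket and store them in a Kubernetes secret object: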

$ cat << EOF > ./credentials-velero
[default]
aws_access_key_id=<AWS_ACCESS_KEY_ID>
aws_secret_access_key=<AWS_SECRET_ACCESS_KEY>
EOF

For simplicity’s sake, we will create the same secret in both clusters. However, in a real-world scenario, it would be optimal to use different credentials to access the object storage.

$ for i in primary failover; \
  do oc --context ${i} create secret generic cloud-credentials \
  -n openshift-adp --from-file cloud=credentials-velero; \
done

secret/cloud-credentials created
secret/cloud-credentials created

Configuring the Data Protection Application

The next step is to instruct the OADP operator to deploy the necessary components for backup and restore operations in both sites. This can be achieved by using the DataProtectionApplication API resource. In this specific example, a single Velero instance and a DaemonSet for the Velero Node Agent will be deployed. The Node Agent hosts the Restic library, which is responsible for conducting the file system backups. Note that the following configuration example is designed to use MinIO as the backup location; refer to the documentation for instructions on adjusting the DataProtectionApplication resource to work with alternative S3-compatible backup storage providers.

$ cat DataProtectionApplication.yaml

apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
  name: velero-sample
  namespace: openshift-adp
spec:
  backupLocations:
  - velero:
      config:
        insecureSkipTLSVerify: "true"
        profile: default
        region: minio
        s3ForcePathStyle: "true"
        s3Url: http://minio.example.com
      credential:
        key: cloud
        name: cloud-credentials
      default: true
      objectStorage:
        bucket: oadp-backup
        prefix: velero
      provider: aws
  configuration:
    restic:
      enable: true
    velero:
      defaultPlugins:
      - openshift
      - aws
      podConfig:
        resourceAllocations:
          requests:
            cpu: 500m
            memory: 256Mi

As mentioned earlier, OADP is responsible for both backup and restore operations. Therefore, we will deploy the DataProtectionApplication in both environments.

$ for i in primary failover; \
  do oc --context ${i} create -f DataProtectionApplication.yaml; \
done

dataprotectionapplication.oadp.openshift.io/velero-sample created
dataprotectionapplication.oadp.openshift.io/velero-sample created

If everything goes smoothly, we should expect to see something similar to the following.

$ for i in primary failover; \
  do oc --context ${i} -n openshift-adp get pods; \
echo; done
NAME                                                READY   STATUS    RESTARTS   AGE
node-agent-6kbp7                                    1/1     Running   0          2m54s
node-agent-bptcc                                    1/1     Running   0          2m55s
node-agent-cwvgc                                    1/1     Running   0          2m54s
node-agent-jglq9                                    1/1     Running   0          2m54s
node-agent-lsr9g                                    1/1     Running   0          2m54s
node-agent-th4jk                                    1/1     Running   0          2m54s
node-agent-zcvbc                                    1/1     Running   0          2m55s
openshift-adp-controller-manager-66cf6958d5-l5s68   1/1     Running   0          169m
velero-647b46bb9b-2c7lb                             1/1     Running   0          2m55s

NAME                                                READY   STATUS    RESTARTS   AGE
node-agent-4tj5d                                    1/1     Running   0          2m55s
openshift-adp-controller-manager-5d6d56f89f-bvqn4   1/1     Running   0          3m38s
velero-647b46bb9b-gkww6                             1/1     Running   0          2m55s

To ensure that Velero has access to the S3 repository, we will check the BackupStorageLocation API resource.

$ for i in primary failover; \
  do oc --context ${i} -n openshift-adp get backupstoragelocation; \
echo; done

NAME              PHASE       LAST VALIDATED   AGE     DEFAULT
velero-sample-1   Available   14s              4m47s   true

NAME              PHASE       LAST VALIDATED   AGE     DEFAULT
velero-sample-1   Available   8s               4m47s   true

In real-world scenarios, it is common to use multiple backup locations, granular credentials, and different Velero plugins depending on the environment. Fortunately, all of these requirements can be accommodated and configured in a straightforward way via the previously mentioned APIs.
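As an illustration, a second, non-default location with its own secret could be appended to the backupLocations list of the DataProtectionApplication shown earlier; the names and bucket below are hypothetical:

# hypothetical second entry appended under spec.backupLocations
- velero:
    config:
      profile: default
      region: eu-west-1
    credential:
      key: cloud
      name: cloud-credentials-dr   # a separate secret with its own keys
    default: false
    objectStorage:
      bucket: oadp-backup-dr       # hypothetical second bucket
    provider: aws

An individual Backup can then target the non-default location explicitly via its spec.storageLocation field.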

Application deployment

To demonstrate the disaster recovery capabilities of OADP, we need two essential elements: a stateful application and a catastrophic event that renders the service unrecoverable without a backup.

Here is the most basic example of a stateful application I can think of: a single pod that writes data to a file residing on a Persistent Volume (PV). In the event of a disaster or data loss, having a backup of the PV becomes crucial for successful recovery and restoration of the service.

$ oc --context primary new-project persistent-database-prod
$ oc --context primary -n persistent-database-prod create deployment \
  persistent-database-prod --replicas 1 \
  --image=registry.access.redhat.com/ubi8/ubi-minimal:8.8-1014 \
  -- /bin/bash -c "sleep infinity"

deployment.apps/persistent-database-prod created
$ oc --context primary -n persistent-database-prod set volumes \
  deploy/persistent-database-prod --add -t pvc --claim-size 1G --mount-path=/data

info: Generated volume name: volume-524mq
deployment.apps/persistent-database-prod volume updated
$ oc --context primary -n persistent-database-prod get po,pvc

NAME                                            READY   STATUS    RESTARTS   AGE
pod/persistent-database-prod-6d58c68dbc-pjtn9   1/1     Running   0          36s

NAME                              STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS          AGE
persistentvolumeclaim/pvc-ctcdk   Bound    pvc-a8b3c1e3-4bfd-466d-8bca-b2061b49eb52   1G         RWO            managed-nfs-storage   36s

To enable easy verification of a successful restore later on, writing a timestamp to a file on the PV should suffice.

$ oc --context primary -n persistent-database-prod exec -it persistent-database-prod-6d58c68dbc-pjtn9 -- /bin/bash -c "date > /data/criticaldata.txt"
$ oc --context primary -n persistent-database-prod exec -it persistent-database-prod-6d58c68dbc-pjtn9 -- /bin/bash -c "cat /data/criticaldata.txt"

Thu Jul  6 14:41:04 UTC 2023

Application backup

To initiate an application backup using OADP, we utilize the Backup API resource. As the application is currently running on the primary site, the Backup resource will be deployed there.

$ cat backup.yaml

apiVersion: velero.io/v1
kind: Backup
metadata:
  name: backup-persistent-database-prod
  namespace: openshift-adp
spec:
  includedNamespaces:
    - persistent-database-prod
  defaultVolumesToFsBackup: true
  ttl: 720h0m0s
$ oc --context primary create -f backup.yaml

backup.velero.io/backup-persistent-database-prod created

After a short while, the backup should transition from an “InProgress” to a “Completed” state.

$ oc --context primary -n openshift-adp get backup \
  backup-persistent-database-prod -o jsonpath='{.status.phase}'; echo

Completed
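Instead of polling manually, oc wait can block until the phase flips; a small sketch, assuming a client recent enough to support jsonpath conditions:

# block until the backup reaches the Completed phase (or time out)
$ oc --context primary -n openshift-adp wait backup.velero.io/backup-persistent-database-prod \
    --for=jsonpath='{.status.phase}'=Completed --timeout=10m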

Alternatively, one can monitor the progress in the logs of the Velero pod.

$ oc --context primary -n openshift-adp logs \
  $(oc -n openshift-adp get pods -l app.kubernetes.io/component=server -o name) \
  | grep "Backup completed"

time="2023-07-06T14:41:40Z" level=info msg="Backup completed" backup=openshift-adp/backup-persistent-database-prod logSource="/remote-source/velero/app/pkg/controller/backup_controller.go:780"

Yet another way to verify a successful backup is to inspect the object storage itself. It should now contain static files such as serialized API resources and logs, as well as the contents of the Persistent Volume.

$ aws --endpoint-url=http://minio.example.com s3 ls \
  s3://oadp-backup/velero/backups/backup-persistent-database-prod/

2023-07-06 16:41:40         29 backup-persistent-database-prod-csi-volumesnapshotclasses.json.gz
2023-07-06 16:41:40         29 backup-persistent-database-prod-csi-volumesnapshotcontents.json.gz
2023-07-06 16:41:40         29 backup-persistent-database-prod-csi-volumesnapshots.json.gz
2023-07-06 16:41:40         27 backup-persistent-database-prod-itemoperations.json.gz
2023-07-06 16:41:40      11365 backup-persistent-database-prod-logs.gz
2023-07-06 16:41:40        940 backup-persistent-database-prod-podvolumebackups.json.gz
2023-07-06 16:41:40        604 backup-persistent-database-prod-resource-list.json.gz
2023-07-06 16:41:40         49 backup-persistent-database-prod-results.gz
2023-07-06 16:41:40         29 backup-persistent-database-prod-volumesnapshots.json.gz
2023-07-06 16:41:40      83097 backup-persistent-database-prod.tar.gz
2023-07-06 16:41:40       2707 velero-backup.json
$ aws --endpoint-url=http://minio.example.com s3 ls \
  s3://oadp-backup/velero/restic/persistent-database-prod/

                           PRE data/
                           PRE index/
                           PRE keys/
                           PRE snapshots/
2023-07-06 16:41:37        155 config

Disaster simulation

After confirming that the application is still functioning on the primary site, deliberately deleting the namespace will serve as a simulated disaster. It is worth noting that, since the backup is stored outside the cluster, recovering from a complete cluster outage would be feasible as well.

$ oc --context primary -n persistent-database-prod get po,pvc

NAME                                            READY   STATUS    RESTARTS   AGE
pod/persistent-database-prod-6d58c68dbc-pjtn9   1/1     Running   0          4m55s

NAME                              STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS          AGE
persistentvolumeclaim/pvc-ctcdk   Bound    pvc-a8b3c1e3-4bfd-466d-8bca-b2061b49eb52   1G         RWO            managed-nfs-storage   4m55s
$ oc --context primary delete project persistent-database-prod

project.project.openshift.io "persistent-database-prod" deleted
$ oc --context primary -n persistent-database-prod get all

No resources found in persistent-database-prod namespace.

Application restore

After confirming the outage and deciding to fail over, the Restore API resource is used to instruct OADP to restore the application on the failover site.

$ cat restore.yaml 

apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-persistent-database-prod
  namespace: openshift-adp
spec:
  backupName: backup-persistent-database-prod
  restorePVs: true
$ oc --context failover create -f restore.yaml

restore.velero.io/restore-persistent-database-prod created

Once again, querying the API can help retrieve the status of the restore process.

$ oc --context failover -n openshift-adp get restore \
  restore-persistent-database-prod -o jsonpath='{.status.phase}'; echo

Completed
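For more detail than the phase alone, the velero CLI can be invoked inside the Velero pod; this sketch assumes the binary location used by the OADP image and may need adjusting:

# run the velero CLI bundled in the Velero pod to describe the restore
$ oc --context failover -n openshift-adp exec deployment/velero -c velero -it -- \
    ./velero restore describe restore-persistent-database-prod --details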

Similar to the backup procedure, the logs of the Velero pod can likewise provide valuable insight into the status of the restore operation.

$ oc --context failover -n openshift-adp logs \
  $(oc --context failover -n openshift-adp get pods \
    -l app.kubernetes.io/component=server -o name) | grep "restore completed"

time="2023-07-06T14:45:54Z" level=info msg="restore completed" logSource="/remote-source/velero/app/pkg/controller/restore_controller.go:513" restore=openshift-adp/restore-persistent-database-prod

In addition, the log files associated with restore operations are persisted in S3, allowing them to be accessed even beyond the lifecycle of the Velero pod.

$ aws --endpoint-url=http://minio.example.com s3 ls \
  s3://oadp-backup/velero/restores/restore-persistent-database-prod/

2023-07-06 16:45:55         27 restore-restore-persistent-database-prod-itemoperations.json.gz
2023-07-06 16:45:54      11369 restore-restore-persistent-database-prod-logs.gz
2023-07-06 16:45:54        449 restore-restore-persistent-database-prod-resource-list.json.gz
2023-07-06 16:45:54        255 restore-restore-persistent-database-prod-results.gz

Upon reviewing the output, one would expect the application to be operational, and fortunately, that is indeed the case.

$ oc --context failover -n persistent-database-prod get po,pv

NAME                                            READY   STATUS    RESTARTS   AGE
pod/persistent-database-prod-6d58c68dbc-pjtn9   1/1     Running   0          2m26s

NAME                                                        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                STORAGECLASS          REASON   AGE
persistentvolume/pvc-a8b32d75-6c18-4c57-aa34-7f10b64c4fa7   1Gi        RWO            Delete           Bound    persistent-database-prod/pvc-ctcdk   managed-nfs-storage            2m23s

The same applies to the data that was written to the application at the time of the backup.

$ oc --context failover -n persistent-database-prod exec -it persistent-database-prod-6d58c68dbc-pjtn9 -- /bin/bash -c "cat /data/criticaldata.txt"

Thu Jul  6 14:41:04 UTC 2023

Regular application backups

Manually triggering application backups is time-consuming, susceptible to human error, and prone to inconsistencies, thereby increasing the risk of data loss. This problem becomes more pronounced as the number of applications grows. To tackle this issue, one can leverage the Schedule API resource, which ensures the regular, automated creation of backups.

$ cat backup-schedule.yaml

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: backup-persistent-database-prod-schedule
  namespace: openshift-adp
spec:
  schedule: '*/10 * * * *'
  template:
    includedNamespaces:
    - persistent-database-prod
    defaultVolumesToFsBackup: true
    ttl: 1h0m0s
$ oc --context primary create -f backup-schedule.yaml

schedule.velero.io/backup-persistent-database-prod-schedule created
$ oc --context primary get schedule -n openshift-adp backup-persistent-database-prod-schedule

NAME                                       STATUS    SCHEDULE       LASTBACKUP   AGE   PAUSED
backup-persistent-database-prod-schedule   Enabled   */10 * * * *                12s

When using schedules, each backup is assigned a unique, timestamped name, which can be referenced in the Restore API resource when needed; alternatively, a schedule can be referenced directly, as sketched after the following listing. The Time-to-Live (TTL) specified in the Schedule API resource determines how long backups are retained. Velero automatically cleans up older backups, ensuring efficient management of backup storage.

$ oc --context primary get backup -n openshift-adp

NAME                                                      AGE
backup-persistent-database-prod                           13m
backup-persistent-database-prod-schedule-20230706145009   4m37s
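When a restore from a scheduled backup is eventually needed, there is no need to look up the generated name by hand: if backupName is omitted, the Restore API accepts a scheduleName, and Velero picks the most recent successful backup created from that schedule (the backups are synced to the failover cluster via the shared backup storage location). A sketch:

$ cat restore-from-schedule.yaml

apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-persistent-database-prod-latest
  namespace: openshift-adp
spec:
  # restores the most recent successful backup produced by this schedule
  scheduleName: backup-persistent-database-prod-schedule
  restorePVs: true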

Summary and Outlook

This blog post hopefully delivered on its promise by offering a straightforward starting point for cold disaster recovery on Kubernetes. It showcased a backup procedure, a simulated disaster scenario, and a successful recovery.

I acknowledge that this post may not cover all possible scenarios or address every question that could arise. There are some unanswered questions, including:

  • Restoring applications deployed and managed by a Kubernetes Operator, which may not itself be included in the backup.
  • Integrating the above concepts with the principles of GitOps.
  • Handling large-scale application restoration and testing in a real disaster scenario.
  • Other aspects of disaster recovery, such as processes, documentation, recovery time objective (RTO) setting, and capacity planning.

Furthermore, it may be useful to note that, while we focused on backing up and restoring applications, OADP can also be utilized to back up and restore entire clusters. For a comprehensive understanding of backing up and restoring hub clusters, you can refer to the following blog post.

Finally, OADP is not the only option for Kubernetes backups. One of Red Hat’s strengths lies in engaging with partners, and there are multiple alternatives available for disaster recovery on OpenShift. Kasten K10 by Veeam, for example, is a leading provider in this field and well-integrated with OpenShift. Additionally, vendors like Portworx and Trilio offer their own solutions, but it’s worth exploring all options available through the Red Hat software catalog, as there are numerous alternatives to choose from.
