Disaster Recovery

Disaster Recovery Workflows

Once the DellCSIReplicationGroup & PersistentVolume objects have been replicated across clusters (or within the same cluster), users can exercise the general Disaster Recovery workflows.
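
Before running the workflows below, you can verify that the replicated objects are present on the target cluster. A minimal check, assuming the DellCSIReplicationGroup CRD exposes the plural resource name dellcsireplicationgroups (adjust if your installation differs):

kubectl get dellcsireplicationgroups
kubectl get persistentvolumes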

Ensure automatic remapping of PVCs is enabled by setting “disablePVCRemap” to “false” in the driver manifest (Refer: pvc-remap).
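
For illustration only, a hypothetical driver manifest fragment; the exact key path for this setting depends on the driver and installation method, so verify it against the pvc-remap documentation:

# hypothetical structure - confirm the real key path for your driver
replication:
  disablePVCRemap: "false"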

Planned Migration to the target cluster/array

This scenario is the typical choice when you want to test your disaster recovery plan or need to switch activities from one site to another:

a. Execute the “failover” action on the selected ReplicationGroup. Use the per-site replication group names when replicating within a single cluster, or the target cluster name when replicating across clusters:

./repctl failover --rg rg-id-site-1 --target rg-id-site-2
./repctl --rg rg-id failover --target target-cluster-name

b. Execute the “reprotect” action on the selected ReplicationGroup, which resumes replication from the new “source”:

./repctl reprotect --rg rg-id-site-2
./repctl --rg rg-id reprotect --at new-source-cluster-name
(Diagram: ReplicationGroup state changes during planned migration)
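
These state transitions can also be observed from the command line, using the same resource name assumed above; the replication group name is a placeholder:

kubectl get dellcsireplicationgroups rg-id-site-2 -w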

Unplanned Migration to the target cluster/array

This scenario is the typical choice when a site goes down:

a. Scale down the application pods to zero replicas by editing the application manifest YAML file and setting the replicas count to 0 (or by scaling the workload directly, as shown below).
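
For example, for a Deployment the same result can be achieved with kubectl scale; the deployment and namespace names are placeholders:

kubectl scale deployment <app-deployment> -n <namespace> --replicas=0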

b. The application pods might go into an Error state. Force delete them by running:

kubectl delete pods <pod-name> -n <namespace> --force

c. Execute the “failover” action on the selected ReplicationGroup with the --unplanned flag, again using either the per-site replication group names or the target cluster name:

./repctl failover --rg rg-id-site-1 --target rg-id-site-2 --unplanned
./repctl --rg rg-id failover --target target-cluster-name --unplanned

d. (PowerMax driver only) Execute the “swap” action on the selected ReplicationGroup, which swaps the R1 and R2 personalities:

./repctl --rg rg-id-site-2 swap
./repctl --rg rg-id swap --at target-cluster-name

Note: Unplanned migration usually happens when the original “source” cluster is unavailable. Perform the following actions only once that cluster is back online.

e. (PowerScale arrays only) Before initiating a failback or reprotect operation, storage administrators must manually disable the SyncIQ policy when bringing the failed-over source array back online. Failing to do so may result in unexpected behavior.
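
As an illustration only, the SyncIQ policy can typically be disabled from the OneFS web UI or CLI; the policy name is a placeholder and the option syntax below is an assumption that should be verified against your OneFS version:

isi sync policies modify <policy-name> --enabled false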

f. Execute the “reprotect” action on the selected ReplicationGroup, which resumes replication:

./repctl --rg rg-id-site-1 reprotect
./repctl --rg rg-id reprotect --at new-source-cluster-name
(Diagram: ReplicationGroup state changes during unplanned migration)

g. Scale the application pods back up to the desired count by editing the application manifest YAML file and setting the replicas count to the desired value (or by scaling the workload directly, as shown below).
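
For example, for a Deployment (placeholder names):

kubectl scale deployment <app-deployment> -n <namespace> --replicas=<desired-count>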

NOTE: When users perform Failover and Failback, the test pods on the source cluster may enter a “CrashLoopBackOff” state because they try to remount a volume that is already mounted. To work around this, scale the number of replicas down to 0 and, once that completes, scale it back up to 1.