Design
CSM for Resiliency Design
Container Storage Modules (CSM) for Resiliency is part of the open-source suite of Kubernetes storage enablers for Dell products.
User applications that need their Pods to be resilient to node failure can run into problems, especially applications deployed with StatefulSets that use PersistentVolumeClaims. Kubernetes guarantees that there will never be two copies of the same StatefulSet Pod running at the same time and accessing storage. Consequently, it does not clean up StatefulSet Pods if the node executing them fails.
For the complete discussion and rationale, you can read the pod-safety design proposal.
For more background on the forced deletion of Pods in a StatefulSet, please visit Force Delete StatefulSet Pods.
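For illustration, without CSM for Resiliency an operator must force delete a StatefulSet Pod stranded on a failed node before Kubernetes will reschedule it (the pod name and namespace below are illustrative):

```bash
# Manual recovery of a StatefulSet pod on a failed node: force delete it so the
# StatefulSet controller can recreate it on a healthy node.
kubectl delete pod podmontest-0 -n pmtu1 --grace-period=0 --force
```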
CSM for Resiliency and Kubernetes Non-Graceful Node Shutdown are mutually exclusive: use either CSM for Resiliency or the Non-Graceful Node Shutdown feature provided by Kubernetes, but not both.
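For context, the Kubernetes Non-Graceful Node Shutdown feature relies on the operator manually tainting a node that is confirmed to be shut down. A cluster using CSM for Resiliency should not apply this taint to nodes whose pods podmon is protecting:

```bash
# Kubernetes Non-Graceful Node Shutdown: the operator applies the out-of-service
# taint to a node that is known to be shut down. Do NOT combine this with
# CSM for Resiliency; the two mechanisms are mutually exclusive.
kubectl taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
```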
CSM for Resiliency is designed to make Kubernetes Applications, including those that utilize persistent storage, more resilient to various failures. The first component of the Resiliency module is a pod monitor that is specifically designed to protect stateful applications from various failures. It is not a standalone application, but rather is deployed as a sidecar to CSI (Container Storage Interface) drivers, in both the driver’s controller pods and the driver’s node pods. Deploying CSM for Resiliency as a sidecar allows it to make direct requests to the driver through the Unix domain socket that Kubernetes sidecars use to make CSI requests.
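As an illustration, the sidecar is typically enabled through the CSI driver's Helm values. The snippet below is a minimal sketch modeled on the csi-powerflex chart's podmon section; the exact flags, socket paths, and defaults vary by driver and CSM release, so treat it as an assumption and consult the deployment documentation for your driver:

```yaml
# Illustrative podmon sidecar settings (flag values and socket paths are
# assumptions; check the Helm chart for your specific CSI driver and release).
podmon:
  enabled: true
  controller:
    args:
      - "--csisock=unix:/var/run/csi/csi.sock"    # CSI socket shared with the controller plugin
      - "--labelvalue=csi-vxflexos"               # pods carrying this label value are protected
      - "--mode=controller"
  node:
    args:
      - "--csisock=unix:/var/lib/kubelet/plugins/vxflexos.emc.dell.com/csi_sock"
      - "--labelvalue=csi-vxflexos"
      - "--mode=node"
```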
Some of the methods CSM for Resiliency invokes in the driver are standard CSI methods, such as NodeUnpublishVolume, NodeUnstageVolume, and ControllerUnpublishVolume. CSM for Resiliency also uses proprietary calls that are not part of the standard CSI specification. Currently, there is only one, ValidateVolumeHostConnectivity, which returns information on whether a host is connected to the storage system and/or whether any I/O activity has occurred in the recent past on a list of specified volumes. This allows CSM for Resiliency to make more accurate determinations about the state of the system and its persistent volumes. CSM for Resiliency is also designed to adhere to the pod affinity settings of pods.
Accordingly, CSM for Resiliency is adapted to and qualified with each CSI driver it is to be used with. Different storage systems have different nuances and characteristics that CSM for Resiliency must take into account.
CSM for Resiliency provides the following capabilities:
Capability | PowerScale | Unity XT | PowerStore | PowerFlex | PowerMax |
---|---|---|---|---|---|
Detect pod failures when: Node failure, K8S Control Plane Network failure, K8S Control Plane failure, Array I/O Network failure | yes | yes | yes | yes | no |
Cleanup pod artifacts from failed nodes | yes | yes | yes | yes | no |
Revoke PV access from failed nodes | yes | yes | yes | yes | no |
CSM for Resiliency is supported on the following container orchestration platforms (COP) and operating systems (OS):

COP/OS | Supported Versions |
---|---|
Kubernetes | 1.26, 1.27, 1.28 |
Red Hat OpenShift | 4.13, 4.14 |
CSM for Resiliency supports the following storage arrays:

Storage Array | Supported Versions |
---|---|
PowerFlex | 3.6.x, 4.0.x, 4.5 |
Unity XT | 5.1.x, 5.2.x, 5.3.0 |
PowerScale | OneFS 9.3, 9.4, 9.5.0.x (x >= 5) |
PowerStore | 3.0, 3.2, 3.5, 3.6 |
CSM for Resiliency supports the following CSI drivers and versions.
Storage Array | CSI Driver | Supported Versions |
---|---|---|
CSI Driver for Dell PowerFlex | csi-powerflex | v2.0.0 + |
CSI Driver for Dell Unity XT | csi-unity | v2.0.0 + |
CSI Driver for Dell PowerScale | csi-powerscale | v2.3.0 + |
CSI Driver for Dell PowerStore | csi-powerstore | v2.6.0 + |
PowerFlex is a highly scalable array that is very well suited to Kubernetes deployments. The CSM for Resiliency support for PowerFlex leverages these PowerFlex features:
Dell Unity XT is targeted at midsized deployments, remote or branch offices, and cost-sensitive mixed workloads. Unity XT systems are designed to deliver the best value in the market. They support all-flash configurations and are available as purpose-built systems (all-flash or hybrid flash), converged deployments (through VxBlock), and a software-defined virtual edition.
All three deployment options, Unity XT, UnityVSA, and Unity-based VxBlock, share one architecture and one interface with consistent features and rich data services.
PowerScale is a highly scalable NFS array that is very well suited to Kubernetes deployments. The CSM for Resiliency support for PowerScale leverages the following PowerScale features:
PowerStore is a highly scalable array that is very well suited to Kubernetes deployments. The CSM for Resiliency support for PowerStore leverages the following PowerStore features:
This section describes limitations and exclusions that users should be aware of. Additionally, there are driver-specific limitations and exclusions that may be called out on the Deploying CSM for Resiliency page.
The following provisioning types are supported and have been tested: dynamically provisioned PersistentVolumeClaims with accessMode ReadWriteOnce, used by Pods managed by StatefulSets.

The following provisioning types are not supported:

- Pods that use persistent volumes from multiple CSI drivers. This cannot be supported because multiple controller-podmons (one for each driver type) would try to manage the failover with conflicting actions.
- ReadWriteMany volumes. These may have issues if a node has multiple pods accessing the same volumes. In any case, once pod cleanup fences the volumes on a node, they are no longer available to any pods using those volumes on that node. We will endeavor to support this in the future.
- Multiple instances of the same driver type (for example, two CSI Driver for Dell PowerFlex deployments).
- PowerFlex with the NFS protocol. CSM for Resiliency does not support NFS volumes on PowerFlex.
The first thing to remember about CSM for Resiliency is that it only takes action on pods configured with the designated label. Both the key and the value have to match what is in the podmon helm configuration. CSM for Resiliency emits a log message at startup with the label key and value it is using to monitor pods:
labelSelector: {map[podmon.dellemc.com/driver:csi-vxflexos]
The message above indicates the label key is podmon.dellemc.com/driver and the label value is csi-vxflexos. To search for the pods that would be monitored, try this:
kubectl get pods -A -l podmon.dellemc.com/driver=csi-vxflexos
NAMESPACE NAME READY STATUS RESTARTS AGE
pmtu1 podmontest-0 1/1 Running 0 3m7s
pmtu2 podmontest-0 1/1 Running 0 3m8s
pmtu3 podmontest-0 1/1 Running 0 3m6s
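For reference, here is a minimal sketch of how the monitoring label might be applied in a StatefulSet's pod template. The StatefulSet name, namespace, image, and storage class are illustrative; only the podmon.dellemc.com/driver label key and value come from the configuration shown above:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: podmontest              # illustrative name
  namespace: pmtu1              # illustrative namespace
spec:
  serviceName: podmontest
  replicas: 1
  selector:
    matchLabels:
      app: podmontest
  template:
    metadata:
      labels:
        app: podmontest
        podmon.dellemc.com/driver: csi-vxflexos   # label key/value that podmon watches
    spec:
      containers:
        - name: podmontest
          image: busybox:1.36                     # illustrative image
          command: ["sleep", "infinity"]
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: vxflexos                # illustrative storage class
        resources:
          requests:
            storage: 8Gi
```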
If CSM for Resiliency detects a problem with a pod caused by a node or other failure that it can initiate remediation for, it will add an event to that pod’s events:
kubectl get events -n pmtu1
...
61s Warning NodeFailure pod/podmontest-0 podmon cleaning pod [7520ba2a-cec5-4dff-8537-20c9bdafbe26 node.example.com] with force delete
...
CSM for Resiliency may also generate events if it is unable to clean up a pod for some reason. For example, it may not clean up a pod because the pod is still doing I/O to the array.
Similarly, the label selectors for csi-powerscale and csi-unity are shown below, respectively.
labelSelector: {map[podmon.dellemc.com/driver:csi-isilon]
labelSelector: {map[podmon.dellemc.com/driver:csi-unity]
Before putting an application that relies on CSM for Resiliency monitoring into production, it is important to do a few test failovers first. To do this, take the node that is running the pod offline for at least 2-3 minutes. Verify that an event message similar to the one above is logged, and that the pod recovers and restarts normally with no loss of data. (Note that if the node is running many CSM for Resiliency protected pods, the node may need to be down longer for CSM for Resiliency to have time to evacuate all the protected pods.) One way to run such a test is sketched below.
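A minimal sketch of a manual failover test, assuming a protected pod labeled as in the example above; stopping kubelet is just one way to take the node offline (powering the node off also works):

```bash
# On the worker node hosting the protected pod: take the node offline for a few
# minutes. Stopping kubelet is one simple way to simulate a node failure.
sudo systemctl stop kubelet

# From a workstation with cluster access (in separate terminals): watch the
# protected pod get cleaned up and rescheduled onto a healthy node.
kubectl get pods -A -l podmon.dellemc.com/driver=csi-vxflexos -o wide -w
kubectl get events -n pmtu1 --watch

# After 2-3 minutes (longer if many protected pods run on the node), bring the node back.
sudo systemctl start kubelet
```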
It is recommended that pods to be monitored by CSM for Resiliency be configured to exit if they receive any I/O errors. That helps recovery happen as quickly as possible.
CSM for Resiliency does not directly monitor application health. However, if standard Kubernetes health checks are configured, that may help reduce pod recovery time in the event of node failure, as CSM for Resiliency should receive an event that the application is Not Ready. Note that a Not Ready pod is not sufficient to trigger CSM for Resiliency action unless there is also some condition indicating a Node failure or problem, such as the Node is tainted, or the array has lost connectivity to the node.
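As an illustration of standard Kubernetes health checks, probes such as the following can be added to the protected container's spec; the probe command, path, and timings are assumptions and should be adapted to the application:

```yaml
# Illustrative health checks for a protected container (probe command and
# timings are assumptions; adapt them to the application).
readinessProbe:
  exec:
    command: ["cat", "/data/healthy"]   # intended to fail if the data volume is inaccessible
  periodSeconds: 10
  failureThreshold: 3
livenessProbe:
  exec:
    command: ["cat", "/data/healthy"]
  periodSeconds: 30
  failureThreshold: 3
```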
As noted previously in the Limitations and Exclusions section, CSM for Resiliency has not yet been verified to work with ReadWriteMany or ReadOnlyMany volumes. Also, it has not been verified to work with pod controllers other than StatefulSet.
Normally CSM for Resiliency should be able to move pods that have been impacted by Node Failures to a healthy node. After the failed nodes have come back online, CSM for Resiliency cleans them up (especially any potential zombie pods) and then automatically removes the CSM for Resiliency node taint that prevents pods from being scheduled to the failed node(s). There are a few cases where this cannot be fully automated and operator intervention is required, including:
CSM for Resiliency expects that when a node failure occurs, all CSM for Resiliency labeled pods are evacuated and rescheduled on other nodes. This process may not complete, however, if the node comes back online before CSM for Resiliency has had time to evacuate all the labeled pods. The remaining pods may not restart correctly, going to “Error” or “CrashLoopBackOff”. We are considering possible remediations for this condition but have not implemented them yet.
If this happens, try deleting the pod with “kubectl delete pod …”. In our experience this normally will cause the pod to be restarted and transition to the “Running” state.
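For example (the pod name and namespace are illustrative, taken from the earlier output):

```bash
# Delete the stuck pod; its StatefulSet controller recreates it, and in our
# experience the new pod transitions to the Running state normally.
kubectl delete pod podmontest-0 -n pmtu1
```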
Podmon-node is responsible for cleaning up failed nodes after the nodes’ communication has been restored. The algorithm checks that all the monitored pods have terminated and that their volumes and mounts have been cleaned up.
If some of the monitored pods are still executing, podmon-node will emit the following log message at the end of a cleanup cycle (and retry the cleanup after a delay):
pods skipped for cleanup because still present: <pod-list>
If this happens, DO NOT manually remove the CSM for Resiliency node taint. Doing so could possibly cause data corruption if volumes were not cleaned up, and a pod using those volumes was subsequently scheduled to that node.
The correct course of action in this case is to reboot the failed node(s) whose taints have not been removed in a reasonable time (5-10 minutes after the node is online again). The operator can delay this reboot until it is convenient, but new pods will not be scheduled to the node in the interim. The reboot terminates any potential zombie pods. After the reboot, podmon-node should automatically remove the node taint after a short time.
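To check whether the CSM for Resiliency taint is still present on a recovered node, something like the following can be used; the taint key is assumed to match the podmon label key shown earlier:

```bash
# Inspect taints on the recovered node. The CSM for Resiliency taint key is
# assumed to be podmon.dellemc.com/driver. Do NOT remove it manually; reboot
# the node instead if the taint persists.
kubectl describe node <node-name> | grep -A 3 Taints
```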
A three-tier testing methodology is used for CSM for Resiliency: