Troubleshooting
- Can Container Storage Module Operator manage existing drivers installed using Helm charts or the CSI Operator?
- Why do some of the Custom Resource fields show up as invalid or unsupported in the OperatorHub GUI?
- How can I view detailed logs for the Container Storage Module Operator?
- My Dell CSI Driver install failed. How do I fix it?
- My CSM Replication install fails to validate replication prechecks with 'no such host'.
- How to update resource limits for the CSM Operator when it is deployed using OperatorHub
Can Container Storage Module Operator manage existing drivers installed using Helm charts or the CSI Operator?
The Container Storage Module Operator cannot manage any existing driver installed using Helm charts or the CSI Operator. If you have already installed one of the Dell CSI drivers in your cluster and want to use the CSM Operator based deployment, uninstall the driver and then redeploy it via the CSM Operator.
Why do some of the Custom Resource fields show up as invalid or unsupported in the OperatorHub GUI?
The Container Storage Module Operator is not fully compliant with the OperatorHub React UI elements. Because of this, some of the Custom Resource fields may show up as invalid or unsupported in the OperatorHub GUI. To work around this problem, use kubectl/oc commands to get details about the Custom Resource (CR). This issue will be fixed in the upcoming releases of the Container Storage Module Operator.
How can I view detailed logs for the Container Storage Module Operator?
Detailed logs of the Container Storage Module Operator can be displayed using the following command:
kubectl logs <csm-operator-controller-podname> -n <namespace>
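If the controller pod name is not known, it can be looked up first. The following is a minimal sketch, not an official tool: the function names `csm_operator_pod` and `csm_operator_logs` are hypothetical, and it assumes that the operator pod name contains `controller-manager` and that the operator runs in the `dell-csm-operator` namespace unless another namespace is passed.

```shell
#!/bin/sh
# Sketch: locate the CSM Operator controller pod and print its logs.
# Assumption: the pod name contains "controller-manager".

# Print the name (as pod/<name>) of the first matching pod in the namespace.
csm_operator_pod() {
    ns="${1:-dell-csm-operator}"
    kubectl get pods -n "$ns" -o name | grep 'controller-manager' | head -n 1
}

# Print the logs of that pod, or fail if no matching pod was found.
csm_operator_logs() {
    ns="${1:-dell-csm-operator}"
    pod="$(csm_operator_pod "$ns")"
    if [ -z "$pod" ]; then
        echo "no controller-manager pod found in namespace $ns" >&2
        return 1
    fi
    kubectl logs "$pod" -n "$ns"
}
```

After sourcing the functions, `csm_operator_logs <namespace>` prints the logs of the first matching pod.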
My Dell CSI Driver install failed. How do I fix it?
Describe the current state by issuing:
kubectl describe csm <custom-resource-name> -n <namespace>
In the output, refer to the Status and Events sections. If the status shows pods in a failed state, refer to the CSI Driver Troubleshooting guide.
Example:
Status:
  Controller Status:
    Available: 0
    Desired: 2
    Failed: 2
  Node Status:
    Available: 0
    Desired: 2
    Failed: 2
  State: Failed
Events:
  Warning Updated 67s (x15 over 2m4s) csm (combined from similar events): at 1646848059520359167 Pod error details ControllerError: ErrImagePull= pull access denied for dellem/csi-isilon, repository does not exist or may require 'docker login': denied: requested access to the resource is denied, Daemonseterror: ErrImagePull= pull access denied for dellem/csi-isilon, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
The event above shows that the image dellem/csi-isilon does not exist. To resolve this, run kubectl edit on the csm resource and update it to the correct image.
For details of the driver installation, run kubectl logs <dell-csm-operator-controller-manager-pod> -n dell-csm-operator.
Typical reasons for errors:
- Incorrect driver version
- Incorrect driver type
- Incorrect env or args for containers in the driver spec
- Incorrect RBAC permissions
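For image pull failures like the one in the example event above, the offending image names can be extracted from the describe output. This is a small sketch under assumptions: the function name `failed_pull_images` is hypothetical, it relies on the `pull access denied for <image>,` wording shown in the example event, and it uses `grep -o`, which is available in GNU and BSD grep but is not required by POSIX.

```shell
#!/bin/sh
# Sketch: list images that failed to pull, based on
# "pull access denied for <image>," messages read from standard input.
failed_pull_images() {
    # Keep only the matching fragments, take the image (5th word),
    # and collapse repeated entries.
    grep -o 'pull access denied for [^,]*' | awk '{print $5}' | sort -u
}
```

A typical use would be `kubectl describe csm <custom-resource-name> -n <namespace> | failed_pull_images`; the function prints nothing when no such event is present.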
My CSM Replication install fails to validate replication prechecks with 'no such host'.
In replication environments that span more than one cluster and use FQDNs to reference API endpoints, it is highly recommended to configure DNS so that requests involving an FQDN resolve to the appropriate cluster.
If DNS cannot be configured, update the /etc/hosts file to map the FQDN to the appropriate IP. This change must be made to the /etc/hosts file on:
- The bastion node(s) (or wherever repctl is used).
- Either the CSM Operator Deployment or ClusterServiceVersion custom resource if using an Operator Lifecycle Manager (such as with an OperatorHub install).
- Both dell-replication-controller-manager deployments.
To update the ClusterServiceVersion, execute the command below, replacing the fields for the remote cluster’s FQDN and IP.
kubectl patch clusterserviceversions.operators.coreos.com -n <operator-namespace> dell-csm-operator-certified.v1.3.0 \
--type=json -p='[{"op": "add", "path": "/spec/install/spec/deployments/0/spec/template/spec/hostAliases", "value": [{"ip":"<remote-IP>","hostnames":["<remote-FQDN>"]}]}]'
To update the dell-replication-controller-manager deployment, execute the command below, replacing the fields for the remote cluster’s FQDN and IP. Make sure to update the deployment on both the primary and disaster recovery clusters.
kubectl patch deployment -n dell-replication-controller dell-replication-controller-manager \
-p '{"spec":{"template":{"spec":{"hostAliases":[{"hostnames":["<remote-FQDN>"],"ip":"<remote-IP>"}]}}}}'
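Because the deployment patch has to be applied on both clusters, it can help to generate the patch body from the IP and FQDN rather than editing the JSON by hand. The sketch below is illustrative only: the function names `host_alias_patch` and `patch_replication_controller` are hypothetical, and the namespace and deployment name are taken from the command above.

```shell
#!/bin/sh
# Sketch: build the hostAliases merge patch for one remote cluster.
# $1 = remote cluster IP, $2 = remote cluster FQDN (both supplied by the caller).
host_alias_patch() {
    printf '{"spec":{"template":{"spec":{"hostAliases":[{"hostnames":["%s"],"ip":"%s"}]}}}}\n' "$2" "$1"
}

# Apply the patch to the replication controller deployment
# in whichever cluster the current kubectl context points at.
patch_replication_controller() {
    kubectl patch deployment -n dell-replication-controller dell-replication-controller-manager \
        -p "$(host_alias_patch "$1" "$2")"
}
```

Run `patch_replication_controller <remote-IP> <remote-FQDN>` once per cluster, switching the kubectl context in between. The result can be checked with `kubectl get deployment dell-replication-controller-manager -n dell-replication-controller -o jsonpath='{.spec.template.spec.hostAliases}'`.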
How to update resource limits for the CSM Operator when it is deployed using OperatorHub
In some environments where the CSM Operator was deployed using OperatorHub, users have seen the CSM Operator pods report 'OOMKilled'. This happens when the default resource requests and limits configured in the CSM Operator do not meet the resource requirements of the environment. In this case, update the resource limits from the OpenShift web console by following the steps below:
- Log in to the OpenShift web console.
- Navigate to the Operators section in the left pane, expand it, and click 'Installed Operators'.
- Select the Dell Container Storage Modules operator.
- Click the YAML tab under the operator; the ClusterServiceVersion (CSV) file opens in a YAML editor.
- Update the resource limits in the opened YAML under the section spec.install.spec.deployments.spec.template.spec.containers.resources.
- Save the CSV; your changes should then be applied.
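The same change can in principle be made from the command line with a JSON patch against the CSV. The sketch below is an assumption-laden illustration rather than a documented procedure: the function names are hypothetical, it assumes the operator is the first deployment in the CSV (index 0), that the container to change is at the given index (0 by default), and that a `resources.limits` object already exists on that container. Check the CSV YAML before applying it.

```shell
#!/bin/sh
# Sketch: build a JSON patch that sets the memory limit of one operator container.
# $1 = new memory limit (for example 512Mi), $2 = container index (assumed 0 by default).
memory_limit_patch() {
    idx="${2:-0}"
    printf '[{"op":"add","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/%s/resources/limits/memory","value":"%s"}]\n' "$idx" "$1"
}

# Apply the patch to the operator CSV.
# $1 = operator namespace, $2 = CSV name, $3 = new memory limit.
patch_csv_memory_limit() {
    kubectl patch clusterserviceversions.operators.coreos.com -n "$1" "$2" \
        --type=json -p "$(memory_limit_patch "$3")"
}
```

The CSV name can be listed with `kubectl get csv -n <operator-namespace>`; the memory value is an example and should be sized for the environment.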
Symptoms | Prevention, Resolution or Workaround |
---|---|
After installation vxflexos-node pods are in an Init:CrashLoopBackOff state in OpenShift 4.16 with error message: Back-off restarting failed container sdc in pod vxflexos-node on non-supported kernel versions. | Use SDC version 4.5.2.1 in OpenShift 4.16. |
The installation fails with the following error message: Node xxx does not have the SDC installed | Install the PowerFlex SDC on listed nodes. The SDC must be installed on all the nodes that need to pull an image of the driver. |
When you run the command kubectl describe pods vxflexos-controller-* -n vxflexos, the system indicates that the driver image could not be loaded. | On Kubernetes: edit the daemon.json file found in the registry location and add { "insecure-registries" :[ "hostname.cloudapp.net:5000" ] }. On OpenShift: run the command oc edit image.config.openshift.io/cluster and add the registries to the YAML file that is displayed. |
The kubectl logs -n vxflexos vxflexos-controller-* driver logs show that the driver is not authenticated. | Check the username, password, and the gateway IP address for the PowerFlex system. |
The kubectl logs vxflexos-controller-* -n vxflexos driver logs show that the system ID is incorrect. | Use the get_vxflexos_info.sh script to find the correct system ID. Add the system ID to the myvalues.yaml file. |
CreateVolume error System | If the PowerFlex name is used for systemID in the StorageClass, ensure that the same name is also used for systemID in the array config. |
Defcontext mount option seems to be ignored, volumes still are not being labeled correctly. | Ensure SELinux is enabled on the worker node, and ensure your container runtime manager is properly configured to be used with SELinux. |
Mount options that interact with SELinux are not working (like defcontext). | Check that your container orchestrator is properly configured to work with SELinux. |
Installation of the driver on Kubernetes v1.25/v1.26/v1.27 fails with the following error: Error: unable to build kubernetes objects from release manifest: unable to recognize "": no matches for kind "VolumeSnapshotClass" in version "snapshot.storage.k8s.io/v1" | These Kubernetes versions require the v1 version of the snapshot CRDs to be created in the cluster. See the Volume Snapshot Requirements. |
The kubectl logs -n vxflexos vxflexos-controller-* driver logs show x509: certificate signed by unknown authority | A self-signed certificate is used for the PowerFlex array. See certificate validation for PowerFlex Gateway. |
When you run the command kubectl apply -f snapclass-v1.yaml, you get the following error: error: unable to recognize "snapclass-v1.yaml": no matches for kind "VolumeSnapshotClass" in version "snapshot.storage.k8s.io/v1" | Check to make sure that the v1 snapshotter CRDs are installed, and not the v1beta1 CRDs, which are no longer supported. |
The controller pod is stuck and producing errors such as: Failed to watch *v1.VolumeSnapshotContent: failed to list *v1.VolumeSnapshotContent: the server could not find the requested resource (get volumesnapshotcontents.snapshot.storage.k8s.io) | Make sure that the v1 snapshotter CRDs and v1 snapclass are installed, and not v1beta1, which is no longer supported. |
Driver install or upgrade fails because of an incompatible Kubernetes version, even though the version seems to be within the range of compatibility. For example: Error: UPGRADE FAILED: chart requires kubeVersion: >= 1.21.0 <= 1.28.0 which is incompatible with Kubernetes V1.21.11-mirantis-1 | If you are using an extended Kubernetes version, see the Helm chart at helm/csi-vxflexos/Chart.yaml and use the alternate kubeVersion check that is provided in the comments. Note: this is not meant to be used to enable the use of pre-release alpha and beta versions, which is not supported. |
Volume metrics are missing | Enable Volume Health Monitoring |
When a node goes down, the block volumes attached to the node cannot be attached to another node | This is a known issue and has been reported at https://github.com/kubernetes-csi/external-attacher/issues/215. Workaround: 1. Force delete the pod running on the node that went down 2. Delete the volumeattachment to the node that went down. Now the volume can be attached to the new node. |
CSI-PowerFlex volumes cannot mount; are being recognized as multipath devices | CSI-PowerFlex does not support multipath; to fix: 1. Remove any multipath mapping involving a powerflex volume with multipath -f <powerflex volume> 2. Blacklist CSI-PowerFlex volumes in multipath config file |
When attempting a driver upgrade, you see: spec.fsGroupPolicy: Invalid value: "xxx": field is immutable | You cannot upgrade between drivers with different fsGroupPolicies. See upgrade documentation for more details. |
When accessing a ROX mode PVC in OpenShift where the worker nodes run as a non-root user, you see: Permission denied while accessing the PVC mount location from the pod. | Set the securityContext for the ROX mode PVC pod as below, as it defines privileges for the pods or containers. securityContext: runAsUser: 0 runAsGroup: 0 |
After installing version v2.6.0 of the driver using the default powerflexSdc image, sdc:3.6.0.6, the vxflexos-node pods are in an Init:CrashLoopBackOff state. This issue can happen on hosts that require the SDC to be installed manually. Automatic SDC is only supported on Red Hat CoreOS (RHCOS), RHEL 7.9, RHEL 8.4, RHEL 8.6. | The SDC is already installed. Change the images.powerflexSdc value to an empty value in the values and re-install. |
After installing version v2.8.0 of the driver using the default powerflexSdc image, sdc:3.6.1, the vxflexos-node pods are in an Init:CrashLoopBackOff state. This issue can happen on hosts that require the SDC to be installed manually. Automatic SDC is only supported on Red Hat CoreOS (RHCOS), RHEL 7.9, RHEL 8.4, RHEL 8.6. | The SDC is already installed. Change the images.powerflexSdc value to an empty value in the values and re-install. |
In version v2.6.0, the driver is crashing because the External Health Monitor sidecar crashes when a persistent volume is not found. | This is a known issue reported at kubernetes-csi/external-health-monitor#100. |
In version v2.6.0, when a cluster node goes down, the block volumes attached to the node cannot be attached to another node. | This is a known issue reported at kubernetes-csi/external-attacher#215. Workaround: 1. Force delete the pod running on the node that went down. 2. Delete the pod’s persistent volume attachment on the node that went down. Now the volume can be attached to the new node. |
A CSI ephemeral pod may not get created in OpenShift 4.13 and fail with the error "error when creating pod: the pod uses an inline volume provided by CSIDriver csi-vxflexos.dellemc.com, and the namespace has a pod security enforcement level that is lower than privileged." | This issue occurs because OpenShift 4.13 introduced the CSI Volume Admission plugin to restrict the use of a CSI driver capable of provisioning CSI ephemeral volumes during pod admission. Therefore, an additional label security.openshift.io/csi-ephemeral-volume-profile with the required security profile value should be provided in the csidriver.yaml file. Follow the OpenShift 4.13 documentation for CSI Ephemeral Volumes for more information. |
Standby controller pod is in CrashLoopBackOff state | Scale down the replica count of the controller pod's deployment to 1 using kubectl scale deployment <deployment_name> --replicas=1 -n <driver_namespace> |
CSM object vxflexos is in failed state and the CSI-PowerFlex driver is not in running state | Verify the secret name with kubectl get secret -n <namespace_name>; it should be in <CR-name>-config format. If it is not: 1. Retrieve the existing secret: kubectl get secret old-secret-name -n <namespace_name> -o yaml > secret.yaml 2. Edit the secret.yaml file: change metadata.name to match that format. 3. Apply the new secret: kubectl apply -f secret.yaml 4. Delete the old secret: kubectl delete secret old-secret-name |
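The secret rename in the last row can also be done as a small filter instead of a manual edit. This is a rough sketch under assumptions, not part of the product tooling: the function name `rename_secret_yaml` is hypothetical, it assumes the two-space indentation that `kubectl get -o yaml` normally emits under `metadata:`, and it also drops the server-populated `resourceVersion`, `uid`, and `creationTimestamp` fields, which are not meant to be carried over to a newly created object. A secret with a data key named `uid`, `resourceVersion`, or `creationTimestamp` would be affected as well, so review the output before applying it.

```shell
#!/bin/sh
# Sketch: rewrite a Secret manifest read from standard input.
# $1 = current secret name, $2 = new secret name (chosen by the caller).
rename_secret_yaml() {
    sed -e "s/^  name: $1\$/  name: $2/" \
        -e '/^  resourceVersion:/d' \
        -e '/^  uid:/d' \
        -e '/^  creationTimestamp:/d'
}
```

For example, `kubectl get secret old-secret-name -n <namespace_name> -o yaml | rename_secret_yaml old-secret-name <new-name> > secret.yaml` produces a file that can then be applied with `kubectl apply -f secret.yaml`, after which the old secret can be deleted.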