Troubleshooting
- Can Container Storage Module Operator manage existing drivers installed using Helm charts or the CSI Operator?
- Why do some of the Custom Resource fields show up as invalid or unsupported in the OperatorHub GUI?
- How can I view detailed logs for the Container Storage Module Operator?
- My Dell CSI Driver install failed. How do I fix it?
- My CSM Replication install fails to validate replication prechecks with 'no such host'.
- How to update resource limits for Container Storage Module Operator when it is deployed using Operator Hub
Can Container Storage Module Operator manage existing drivers installed using Helm charts or the CSI Operator?
The Container Storage Module Operator is unable to manage any existing driver installed using Helm charts or the CSI Operator. If you have already installed one of the Dell CSI drivers in your cluster and want to use the operator-based deployment, uninstall the driver and then redeploy it via the Container Storage Module Operator.
Why do some of the Custom Resource fields show up as invalid or unsupported in the OperatorHub GUI?
The Container Storage Module Operator is not fully compliant with the OperatorHub React UI elements. Because of this, some Custom Resource fields may show up as invalid or unsupported in the OperatorHub GUI. To work around this, use kubectl/oc commands to get details about the Custom Resource (CR). This issue will be fixed in upcoming releases of the Container Storage Module Operator.
How can I view detailed logs for the Container Storage Module Operator?
Detailed logs of the Container Storage Module Operator can be displayed using the following command:
kubectl logs <csm-operator-controller-podname> -n <namespace>
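As a minimal sketch using the same placeholders as above, the controller pod name can be discovered first and the logs streamed:

```bash
# List the pods in the operator namespace to find the controller pod name
kubectl get pods -n <namespace>

# Stream (follow) the operator logs from that pod
kubectl logs -f <csm-operator-controller-podname> -n <namespace>
```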
My Dell CSI Driver install failed. How do I fix it?
Describe the current state by issuing:
kubectl describe csm <custom-resource-name> -n <namespace>
In the output, refer to the Status and Events sections. If the status shows pods in the Failed state, refer to the CSI Driver Troubleshooting guide.
Example:
Status:
  Controller Status:
    Available: 0
    Desired: 2
    Failed: 2
  Node Status:
    Available: 0
    Desired: 2
    Failed: 2
  State: Failed
Events:
Warning Updated 67s (x15 over 2m4s) csm (combined from similar events): at 1646848059520359167 Pod error details ControllerError: ErrImagePull= pull access denied for dellem/csi-isilon, repository does not exist or may require 'docker login': denied: requested access to the resource is denied, Daemonseterror: ErrImagePull= pull access denied for dellem/csi-isilon, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
The above event shows that the image dellem/csi-isilon does not exist. To resolve this, the user can kubectl edit the csm custom resource and update it to the correct image.
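As a sketch, and assuming the driver image is set under spec.driver.common.image as in the sample CSM custom resources (verify the path against your own CR; the image tag is a placeholder), the image can be corrected interactively or with a patch:

```bash
# Open the CSM custom resource and correct the image name in place
kubectl edit csm <custom-resource-name> -n <namespace>

# Or patch the image directly (path assumes spec.driver.common.image; tag is a placeholder)
kubectl patch csm <custom-resource-name> -n <namespace> --type=json \
  -p='[{"op": "replace", "path": "/spec/driver/common/image", "value": "dellemc/csi-isilon:<tag>"}]'
```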
To get details of the driver installation, check the operator logs: kubectl logs <dell-csm-operator-controller-manager-pod> -n dell-csm-operator
Typical reasons for errors:
- Incorrect driver version
- Incorrect driver type
- Incorrect driver spec (env, args) for containers
- Incorrect RBAC permissions
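A quick way to check these fields is to inspect the deployed custom resource. The jsonpath below assumes the standard CSM CR layout (spec.driver.csiDriverType and spec.driver.configVersion); adjust it if your CR differs:

```bash
# Dump the full CR spec for review
kubectl get csm <custom-resource-name> -n <namespace> -o yaml

# Or print just the driver type and config version (paths assume the standard CR layout)
kubectl get csm <custom-resource-name> -n <namespace> \
  -o jsonpath='{.spec.driver.csiDriverType}{" "}{.spec.driver.configVersion}{"\n"}'
```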
My CSM Replication install fails to validate replication prechecks with 'no such host'.
In replication environments that span more than one cluster and use FQDNs to reference API endpoints, it is highly recommended that DNS be configured to resolve requests involving those FQDNs to the appropriate cluster.
If it is not possible to configure DNS, the /etc/hosts file should be updated to map the FQDN to the appropriate IP (an example entry is shown after the list below). This change needs to be made to the /etc/hosts file on:
- The bastion node(s) (or wherever repctl is used).
- Either the CSM Operator Deployment or ClusterServiceVersion custom resource if using an Operator Lifecycle Manager (such as with an OperatorHub install).
- Both dell-replication-controller-manager deployments.
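For example, a minimal /etc/hosts entry on the bastion node would look like the following, using the same placeholders as the commands below:

```text
# /etc/hosts (placeholder values; use the remote cluster's API endpoint FQDN and IP)
<remote-IP>   <remote-FQDN>
```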
To update the ClusterServiceVersion, execute the command below, replacing the fields for the remote cluster’s FQDN and IP.
kubectl patch clusterserviceversions.operators.coreos.com -n <operator-namespace> dell-csm-operator-certified.v1.3.0 \
--type=json -p='[{"op": "add", "path": "/spec/install/spec/deployments/0/spec/template/spec/hostAliases", "value": [{"ip":"<remote-IP>","hostnames":["<remote-FQDN>"]}]}]'
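To confirm the patch took effect, the hostAliases can be read back from the same CSV (same placeholders as above):

```bash
# Verify the hostAliases entry on the operator deployment embedded in the CSV
kubectl get clusterserviceversions.operators.coreos.com dell-csm-operator-certified.v1.3.0 \
  -n <operator-namespace> \
  -o jsonpath='{.spec.install.spec.deployments[0].spec.template.spec.hostAliases}'
```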
To update the dell-replication-controller-manager deployment, execute the command below, replacing the fields for the remote cluster’s FQDN and IP. Make sure to update the deployment on both the primary and disaster recovery clusters.
kubectl patch deployment -n dell-replication-controller dell-replication-controller-manager \
-p '{"spec":{"template":{"spec":{"hostAliases":[{"hostnames":["<remote-FQDN>"],"ip":"<remote-IP>"}]}}}}'
How to update resource limits for Container Storage Module Operator when it is deployed using Operator Hub
In certain environments where users have deployed the Container Storage Module Operator using Operator Hub, they have encountered operator pods reporting 'OOM Killed'. This happens because the default resource requests and limits configured in the operator do not meet the resource requirements of those environments. In this case, users can update the resource limits from the OpenShift web console by following the steps below:
- Log in to the OpenShift web console.
- Navigate to the Operators section in the left pane, expand it, and click on Installed Operators.
- Select the Dell Container Storage Modules operator.
- Click on the YAML tab under the operator; the ClusterServiceVersion (CSV) file opens in a YAML editor.
- Update the resource limits in the opened YAML under the section spec.install.spec.deployments.spec.template.spec.containers.resources (a sketch of this section follows the list).
- Save the CSV and your changes should be applied.
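As a sketch, the relevant portion of the CSV looks roughly like the following; the CPU and memory values are illustrative placeholders, not recommended settings:

```yaml
# Excerpt from the CSV under spec.install.spec.deployments[].spec.template.spec.containers[]
resources:
  limits:
    cpu: 200m        # illustrative value; size to your environment
    memory: 512Mi    # raise this if the operator pod is OOM killed
  requests:
    cpu: 100m
    memory: 192Mi
```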
| Symptoms | Prevention, Resolution or Workaround |
| --- | --- |
| kubectl describe pod powermax-controller-<xyz> -n <namespace> indicates that the driver image could not be loaded | You may need to put an insecure-registries entry in /etc/docker/daemon.json or log in to the docker registry |
| kubectl logs powermax-controller-<xyz> -n <namespace> driver logs show that the driver cannot authenticate | Check your secret's username and password |
| kubectl logs powermax-controller-<xyz> -n <namespace> driver logs show that the driver failed to connect to the U4P because it could not verify the certificates | Check the powermax-certs secret and ensure it is not empty and contains valid certificates |
| Driver install or upgrade fails because of an incompatible Kubernetes version, even though the version seems to be within the range of compatibility. For example: Error: UPGRADE FAILED: chart requires kubeVersion: >= 1.23.0 < 1.27.0 which is incompatible with Kubernetes V1.23.11-mirantis-1 | If you are using an extended Kubernetes version, see the Helm chart and use the alternate kubeVersion check that is provided in the comments. Note that this is not meant to enable the use of pre-release alpha and beta versions, which are not supported. |
| When a node goes down, the block volumes attached to the node cannot be attached to another node | 1. Force delete the pod running on the node that went down. 2. Delete the VolumeAttachment to the node that went down. The volume can now be attached to the new node. |
| When attempting a driver upgrade, you see: spec.fsGroupPolicy: Invalid value: "xxx": field is immutable | You cannot upgrade between drivers with different fsGroupPolicies. See the upgrade documentation for more details |
| After the migration group reaches the "migrated" state, it is unable to move to the "commit ready" state because the new paths are not being discovered on the cluster nodes | Run the following commands manually on the cluster nodes: rescan-scsi-bus.sh -i and rescan-scsi-bus.sh -a |
| Failed to fetch details for array: 000000000000. [Unauthorized] | Make sure that the correct encrypted username and password are used in the secret files, and ensure that RBAC is enabled for the user |
| Error looking up volume for idempotence check: Not Found or Get Volume step fails for: (000000000000) symID with error (Invalid Response from API) | Make sure that the Unisphere endpoint does not end with a trailing slash |
| FailedPrecondition desc = no topology keys could be generated | Make sure that FC or iSCSI connectivity to the arrays is working properly |
| CreateHost failed with error: initiator is already part of a different host | Update modifyHostName to true in values.yaml, or remove the initiator from the existing host |
| kubectl logs powermax-controller-<xyz> -n <namespace> driver logs say connection refused and the reverseproxy logs say "Failed to setup server.(secrets "secret-name" not found)" | Make sure the given secret exists |
| NodeStage is failing with error: Error invalid IQN Target iqn.EMC.0648.SE1F | 1. Update the initiator name to the full default name, for example iqn.1993-08.org.debian:01:e9afae962192. 2. Ensure that iSCSI initiators are available on all the nodes where the driver node plugin will be installed and that they use the full default name. |
| Volume mount is failing on some OSes (for example VMware Virtual Platform) during node publish with error: wrong fs type, bad option, bad superblock | 1. Check the multipath configuration (if enabled). 2. Edit the VM Advanced settings -> hardware, add the parameter disk.enableUUID=true, and reboot the node. |
| Standby controller pod is in CrashLoopBackOff state | Scale down the replica count of the controller pod's deployment to 1 using kubectl scale deployment <deployment_name> --replicas=1 -n <driver_namespace> |
| When running CSI-PowerMax with Replication in a multi-cluster configuration, the driver on the target cluster fails and the following error is seen in logs: error="CSI reverseproxy service host or port not found, CSI reverseproxy not installed properly" | The reverseproxy service needs to be created manually on the target cluster. Follow the instructions here to create it. |
| PVC creation is failing with error: A problem occurred modifying the storage group resource: Failed to create batch task(s): The maximum allowed devices for a storage group has been exceeded. This is because of a hardware limit of 4k devices in a storage group. | Create a separate Storage Class with a new unique ApplicationPrefix parameter (such as ApplicationPrefix: OCPX) or add a new unique StorageGroup parameter (such as StorageGroup: "custom_SG_1") to place the provisioned volumes in a new Storage Group (an illustrative sketch follows this table). |
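As an illustrative sketch only (the class name, array ID, SRP, and service level below are placeholders, not values from this guide), a separate storage class carrying a new ApplicationPrefix might look like this:

```yaml
# Hypothetical StorageClass that places new volumes in a separate storage group via ApplicationPrefix
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: powermax-ocpx              # placeholder name
provisioner: csi-powermax.dellemc.com
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
parameters:
  SYMID: "000000000000"            # placeholder array ID
  SRP: "SRP_1"                     # placeholder storage resource pool
  ServiceLevel: "Bronze"           # placeholder service level
  ApplicationPrefix: "OCPX"        # new unique prefix so volumes land in a new storage group
```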