Storage is a critical component in Kubernetes for stateful applications, but it can be challenging to troubleshoot when things go wrong. This guide will help you diagnose and resolve common Kubernetes storage issues.
Understanding Kubernetes Storage Components
Before diving into troubleshooting, let’s understand the key storage components in Kubernetes:
- PersistentVolume (PV): A cluster resource representing storage in the cluster
- PersistentVolumeClaim (PVC): A request for storage by a user
- StorageClass: Defines the provisioner and parameters for dynamically provisioned PVs
- Volume: A directory accessible to containers in a pod
- CSI (Container Storage Interface): Standard for exposing storage systems to containers
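To see how these pieces fit together, here is a minimal PVC sketch requesting dynamically provisioned storage; the class name standard is a placeholder for whatever StorageClass exists in your cluster:

```yaml
# A minimal PVC: requests 5Gi from an assumed "standard" StorageClass.
# When a provisioner for that class is running, Kubernetes creates a
# matching PV and binds it to this claim automatically.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-claim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
  storageClassName: standard
```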
Step-by-Step Troubleshooting Process
1. Identify the Problem PVC and Its Status
# List all PVCs in the namespace
kubectl get pvc -n <namespace>
# Get details about a specific PVC
kubectl describe pvc <pvc-name> -n <namespace>
Check for:
- PVC status (Pending, Bound, Lost)
- Events section for error messages
- The PV that the PVC is bound to (if any)
- Storage class being used
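The checks above can be condensed into a small helper; phase_hint is a hypothetical function name, and the advice strings summarize the sections that follow:

```shell
# Sketch: map a PVC phase (the STATUS column of `kubectl get pvc`) to the
# next troubleshooting step. `phase_hint` is a hypothetical helper name.
phase_hint() {
  case "$1" in
    Pending) echo "Pending: check events; likely no matching PV or a StorageClass/provisioner problem" ;;
    Bound)   echo "Bound: the claim is fine; investigate the pod's volume mount instead" ;;
    Lost)    echo "Lost: the backing PV is gone; check the PV and the storage backend" ;;
    *)       echo "Unknown phase: $1" ;;
  esac
}

# Example: feed it the phase extracted from a live cluster, e.g.
#   phase_hint "$(kubectl get pvc <pvc-name> -n <namespace> -o jsonpath='{.status.phase}')"
phase_hint Pending
```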
2. Check the Associated PV
# List all PVs in the cluster
kubectl get pv
# Get details about a specific PV
kubectl describe pv <pv-name>
Look for:
- PV status (Available, Bound, Released, Failed)
- Reclaim policy
- Storage class
- Access modes
- Mount options
- Node affinity
3. Verify StorageClass Configuration
# List all storage classes
kubectl get storageclass
# Get details about a specific storage class
kubectl describe storageclass <storageclass-name>
Check for:
- Provisioner (must be running in the cluster)
- Parameters specific to the provisioner
- ReclaimPolicy
- VolumeBindingMode
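VolumeBindingMode matters more than it looks: with Immediate, the volume is provisioned before the pod is scheduled, which can place it in a zone the pod cannot reach. A topology-safe sketch (the class name is illustrative, and the provisioner shown is the AWS EBS CSI driver as an example):

```yaml
# WaitForFirstConsumer delays provisioning until a pod using the claim is
# scheduled, so the volume is created in the same zone as that pod.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: topology-aware       # illustrative name
provisioner: ebs.csi.aws.com # example CSI provisioner; use your own
volumeBindingMode: WaitForFirstConsumer
```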
4. Check for CSI Driver Issues
If your cluster uses CSI drivers:
# List CSI drivers
kubectl get csidrivers
# Check CSI driver pods
kubectl get pods -n <csi-namespace> -l app=<csi-driver-name>
# Check CSI driver logs
kubectl logs -n <csi-namespace> <csi-driver-pod> -c <container-name>
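CSI logs can be long; a grep for common failure signatures narrows them down quickly. The sample log lines below are illustrative; in practice, pipe the real driver logs in:

```shell
# Sketch: filter CSI driver logs for common failure signatures. Pipe real
# logs in with:
#   kubectl logs -n <csi-namespace> <csi-driver-pod> -c <container-name> | scan_csi_logs
scan_csi_logs() {
  grep -E -i 'rpc error|failed to provision|timed out|permission denied' || true
}

# Demonstrated here on sample (made-up) log lines:
scan_csi_logs <<'EOF'
I0101 10:00:00 controller.go:100] provisioning volume for claim "default/app-data"
E0101 10:00:05 controller.go:200] failed to provision volume: rpc error: code = Internal
EOF
```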
5. Investigate Pod Volume Mount Issues
If the pod can’t mount the volume:
# Check pod status and events
kubectl describe pod <pod-name> -n <namespace>
# Check pod logs
kubectl logs <pod-name> -n <namespace>
# Check kubelet logs on the node
kubectl get pod <pod-name> -n <namespace> -o wide
# Note the node name, then check kubelet logs on that node
ssh <node>
sudo journalctl -u kubelet | grep <pv-name>
Common Storage Issues and Solutions
1. PVC Stuck in Pending State
Issue: PVC remains in Pending state and doesn’t get bound to a PV.
Diagnosis:
kubectl describe pvc <pvc-name> -n <namespace>
# Look for events that explain why it's pending
Common causes and solutions:
- No matching PV available:
- For static provisioning: Create a PV with matching capacity and access modes
- For dynamic provisioning: Ensure the specified storage class exists and its provisioner is working
apiVersion: v1
kind: PersistentVolume
metadata:
  name: manual-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: manual
  hostPath:
    path: "/mnt/data"
- StorageClass doesn’t exist or has issues:
- Verify the storage class exists
- Check the provisioner is deployed correctly
# Create a standard storage class if needed
kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
provisioner: kubernetes.io/aws-ebs # Change as per your environment
parameters:
  type: gp2
reclaimPolicy: Delete
volumeBindingMode: Immediate
EOF
- CSI driver or external provisioner issues:
- Check CSI driver logs
- Ensure cloud provider credentials are correct
2. Volume Mount Failures
Issue: Pod can’t mount volumes even though PVC is bound.
Diagnosis:
kubectl describe pod <pod-name> -n <namespace>
# Look for mount failure events
Common causes and solutions:
- Filesystem issues:
- Check if the filesystem is corrupted
- Use a debug pod to mount the volume and check the filesystem:
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: volume-debug
  namespace: <namespace>
spec:
  containers:
  - name: debug
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: problematic-volume
      mountPath: /data
  volumes:
  - name: problematic-volume
    persistentVolumeClaim:
      claimName: <pvc-name>
EOF
Then check the filesystem:
kubectl exec -it volume-debug -n <namespace> -- sh
# Inside the container
ls -la /data
df -h
- Permission issues:
- Check file ownership and permissions
- Adjust SecurityContext for the pod:
securityContext:
  runAsUser: 1000
  fsGroup: 1000
- Node issues:
- Check if the node has access to the storage backend
- For zone-specific volumes, ensure pods are scheduled in the correct zone:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values:
          - us-east-1a
3. Volume Expansion Issues
Issue: PVC resize requests not being fulfilled.
Diagnosis:
kubectl describe pvc <pvc-name> -n <namespace>
# Look for resize-related events
Common causes and solutions:
- StorageClass doesn’t support volume expansion:
- Check if the storage class has allowVolumeExpansion: true set
- Create a new storage class with volume expansion enabled:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: expandable-sc
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
allowVolumeExpansion: true
- CSI driver doesn’t support expansion:
- Upgrade the CSI driver to a version that supports expansion
- Check CSI driver documentation for expansion support
- Filesystem expansion needed:
- For some volume types, the filesystem must be expanded after the underlying volume itself has been resized
- Restart the pod to trigger filesystem expansion if the driver does not support online expansion
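Assuming the class allows expansion, the resize itself is just an edit to the claim's requested size; the numbers below are illustrative:

```yaml
# Edit the claim (kubectl edit pvc <pvc-name> -n <namespace>) and raise the
# request; shrinking a PVC is not supported.
spec:
  resources:
    requests:
      storage: 20Gi   # was 10Gi
```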
4. Performance Issues
Issue: Storage performance is slower than expected.
Diagnosis:
# Deploy a benchmark pod
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: io-test
  namespace: <namespace>
spec:
  containers:
  - name: io-test
    image: nixery.dev/shell/fio/ioping
    command: ["sleep", "3600"]
    volumeMounts:
    - name: test-volume
      mountPath: /test-data
  volumes:
  - name: test-volume
    persistentVolumeClaim:
      claimName: <pvc-name>
EOF
# Run IO tests
kubectl exec -it io-test -n <namespace> -- fio --name=test --filename=/test-data/test --direct=1 --rw=randread --bs=4k --size=1G --numjobs=1 --time_based --runtime=60 --group_reporting
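To compare runs or alert on regressions, it helps to pull the headline number out of fio's summary. A sketch, demonstrated on a sample line mirroring fio's "read: IOPS=..." output (pipe real fio output in instead):

```shell
# Sketch: extract the read IOPS figure from fio's summary output.
extract_iops() {
  awk -F'[=,]' '/read ?: *IOPS/ {gsub(/ /, "", $2); print $2; exit}'
}

# Sample fio summary line; replace with real output in practice.
extract_iops <<'EOF'
  read: IOPS=2310, BW=9240KiB/s (9462kB/s)(541MiB/60001msec)
EOF
```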
Common causes and solutions:
- Incorrect storage class or parameters:
- Use storage classes optimized for your workload:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: high-performance
provisioner: kubernetes.io/aws-ebs
parameters:
  type: io1
  iopsPerGB: "50"
- Resource contention:
- Check for noisy neighbors
- Consider using local volumes for performance-sensitive workloads:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
- Network bottlenecks:
- For network-attached storage, check network throughput
- Consider colocation of pods with their volumes
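Note that kubernetes.io/no-provisioner means no dynamic provisioning: each local PV must be created by hand and pinned to its node with node affinity. A sketch, where the path and node name are placeholders:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-1            # illustrative name
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /mnt/disks/ssd1     # must already exist on the node
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - worker-node-1     # placeholder node name
```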
Practical Example: Resolving a PVC in Pending State
Let’s work through a practical example where a PVC is stuck in Pending state:
# Check PVC status
kubectl get pvc app-data -n production
# NAME       STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
# app-data   Pending                                      fast-storage   1h
# Get more details
kubectl describe pvc app-data -n production
# Events:
# Warning ProvisioningFailed 5m (x12) persistentvolume-controller Failed to provision volume with StorageClass "fast-storage": StorageClass "fast-storage" not found
# Check available storage classes
kubectl get storageclass
# NAME       PROVISIONER            AGE
# standard   kubernetes.io/gce-pd   30d
# ssd        kubernetes.io/gce-pd   30d
# Create the missing storage class
kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-storage
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
reclaimPolicy: Delete
volumeBindingMode: Immediate
EOF
# Check PVC status again (should transition to Bound)
kubectl get pvc app-data -n production
# NAME       STATUS   VOLUME                  CAPACITY   ACCESS MODES   STORAGECLASS   AGE
# app-data   Bound    pvc-12345678-1234-...   10Gi       RWO            fast-storage   1h30m
Preventive Measures
- Create Storage Class Templates: Maintain documented templates for commonly used storage requirements.
- Use Storage Class Validation: Validate PVC and storage class compatibility before deployment:
# Simple validation script
kubectl get pvc <pvc-name> -n <namespace> -o jsonpath='{.spec.storageClassName}' | xargs kubectl get storageclass
- Monitor Storage Usage: Set up alerts for PVCs approaching capacity:
# Using Prometheus query
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.8
- Test Volume Resizing: Regularly test volume expansion capabilities.
- Document Storage Requirements: Maintain documentation about storage requirements for each application.
- Setup Regular Backups: Implement regular backups of persistent data:
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: app-data-snapshot
spec:
  volumeSnapshotClassName: csi-snapshot-class
  source:
    persistentVolumeClaimName: app-data
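A snapshot is only useful if you can restore it: a new PVC can name the snapshot as its data source. A sketch, where the names match the snapshot example above and the storage class is a placeholder (it must use the same CSI driver that took the snapshot):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data-restored
spec:
  storageClassName: csi-storage-class   # placeholder
  dataSource:
    name: app-data-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```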
By understanding the common storage issues and implementing these preventive measures, you can maintain reliable storage operations in your Kubernetes environment.