Kubernetes provisioning GCE Persistent disk sometimes fails


I'm currently using a standard GCE container cluster with a lot of success and pleasure, but I have a question about the provisioning of GCE persistent disks.

As described in this document from Kubernetes, I created two YAML files:

kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
  annotations:
    storageclass.beta.kubernetes.io/is-default-class: "true"
  name: slow
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-standard

and

kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
  name: fast
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
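
To apply them I just use kubectl; this is roughly what I run (the file names are simply what I use locally, so treat them as placeholders):

# Create both storage classes (slow.yaml and fast.yaml are my local file names)
kubectl create -f slow.yaml
kubectl create -f fast.yaml

# Check that both classes exist and that "slow" is marked as the default
kubectl get storageclass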

If I now create the following volume claim:

{
  "kind": "PersistentVolumeClaim",
  "apiVersion": "v1",
  "metadata": {
    "name": "claim-test",
    "annotations": {
        "volume.beta.kubernetes.io/storage-class": "hdd"
    }
  },
  "spec": {
    "accessModes": [
      "ReadWriteOnce"
    ],
    "resources": {
      "requests": {
        "storage": "3Gi"
      }
    }
  }
}
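
To check whether the claim actually binds and a PD gets provisioned, I run something like this (nothing special, just standard kubectl):

# The claim should go to STATUS "Bound" once the GCE PD has been provisioned
kubectl get pvc claim-test

# "describe" shows the bound PV and, in the events, the provisioned disk
kubectl describe pvc claim-test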

The disk gets created perfectly. And if I now start the following unit:

apiVersion: v1
kind: ReplicationController
metadata:
  name: nfs-server
spec:
  replicas: 1
  selector:
    role: nfs-server
  template:
    metadata:
      labels:
        role: nfs-server
    spec:
      containers:
      - name: nfs-server
        image: gcr.io/google_containers/volume-nfs
        ports:
          - name: nfs
            containerPort: 2049
          - name: mountd
            containerPort: 20048
          - name: rpcbind
            containerPort: 111
        securityContext:
          privileged: true
        volumeMounts:
          - mountPath: /exports
            name: mypvc
      volumes:
        - name: mypvc
          persistentVolumeClaim:
            claimName: claim-test

The disk gets mounted perfectly, but many times I stumble upon the following error (nothing more can be found in the kubelet.log file):

Failed to attach volume "claim-test" on node "...." with: GCE persistent disk not found: diskName="....." zone="europe-west1-b"

Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "....". list of unattached/unmounted volumes=[....]

Sometimes the pod starts perfectly, but sometimes it crashes. The only workaround I could find is leaving enough time between creating the PVC and creating the RC, but even then the results are inconsistent.
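
When it hangs, this is how I have been trying to see what is actually going on; the pod name below is just a placeholder for whatever name the RC generated:

# Look at the pod events for the attach/mount timeout details
kubectl describe pod <nfs-server-pod>

# Cluster-wide events sometimes show the attach failure as well
kubectl get events

# On the GCE side, check whether the PD named in the error actually exists
gcloud compute disks list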

I hope someone can give me some kind of suggestion or help.

Thanks in advance! Best regards,

Hacor

Best answer:

Thanks for your comments! After a few days of searching I was finally able to determine what the problem was. I'm posting it here because it may be useful to other users.

I was using the Kubernetes NFS example as a replication controller to provide my apps with NFS storage. It seems that when the NFS server and its PV/PVC get deleted, the NFS share sometimes gets stuck on the node itself. I think this happens because I didn't delete these elements in a particular order, so the node was left with a stale share and became unable to mount new shares for itself or for its pods.

I noticed that the problem always occurred after I deleted an app (NFS server, PV, PVC and other components) from the cluster. On a freshly created GCE cluster, creating apps works perfectly, until I delete one and things start going wrong...

I don't know the correct deletion order for sure, but I think it is:

  • Pods using the NFS share
  • PV, PVC of the NFS share
  • NFS server

If the pod takes longer to delete and isn't completely gone before the PV is deleted, the node hangs on a mount it can't release because it's still in use, and that's where the problems start.
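
In kubectl terms that comes down to something like the sketch below; the PV/PVC names are placeholders, because only the NFS server RC is named in my question:

# 1. Delete the RCs/pods that mount the NFS share and wait until they are really gone
kubectl delete rc <app-using-nfs>
kubectl get pods --watch

# 2. Only then remove the claim and the volume that back the NFS share
kubectl delete pvc <nfs-pvc>
kubectl delete pv <nfs-pv>

# 3. Finally delete the NFS server itself
kubectl delete rc nfs-server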

I must honestly say that I'm now moving to an externally provisioned GlusterFS cluster. I hope this helps someone!

Regards,

Hacor