During a TKGi cluster upgrade, the kubelet post-start script fails and BOSH reports:
Instance update failed: There was a problem completing your request. Please contact your operations team providing the following information: service: p.pks, service-instance-guid: ########-####-####-####-#########, broker-request-id: ########-####-####-####-##########, task-id: <TASK-ID>, operation: update, error-message: Action Failed get_task: Task c20d69cb-e730-47fd-7a3c-e35a182793ab result: 1 of 6 post-start scripts failed. Failed Jobs: kubelet. Successful Jobs: bosh-dns, telemetry-agent-image, csi-images, load-images, sink-resources-images
Kubelet is in a running state, but the Kubernetes node is NotReady.
Kubelet logs report an issue with containerd:
remote_runtime.go:633] "Status from runtime service failed" err="rpc error: code = Unknown desc = invalid UUID length: 0: unknown"
Containerd is in a running state, but its logs report an issue with the UUID:
level=error msg="Status failed" error="invalid UUID length: 0: unknown"
TKGi v1.20
The /var/vcap/store/containerd/io.containerd.grpc.v1.introspection/uuid file is empty.
This can happen if storage problems occur while the node is being created and containerd cannot write to the uuid file.
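To confirm this cause before changing anything, the state of the uuid file can be checked on the affected node. The following is a minimal sketch: the function name uuid_file_state is hypothetical, and only the default path comes from this article. The -s test checks only that the file size is greater than zero.

```shell
#!/usr/bin/env bash
# Hypothetical helper; not part of TKGi, BOSH, or containerd.
# Default path is the one named in this article.
UUID_FILE_DEFAULT="/var/vcap/store/containerd/io.containerd.grpc.v1.introspection/uuid"

# Prints "missing", "empty", or "populated" for the given file.
# Usage: uuid_file_state [path]
uuid_file_state() {
  f="${1:-$UUID_FILE_DEFAULT}"
  if [ ! -e "$f" ]; then
    echo "missing"
  elif [ ! -s "$f" ]; then
    # File exists but has zero length: the condition described above.
    echo "empty"
  else
    echo "populated"
  fi
}
```

Running uuid_file_state with no argument as root on the node checks the default path; a result of "empty" matches the condition described here.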
Stop containerd, remove the empty uuid file, and start containerd again. Containerd recreates the uuid file on start. Then verify the runtime status with crictl info:
monit stop containerd
rm /var/vcap/store/containerd/io.containerd.grpc.v1.introspection/uuid
monit start containerd
crictl info
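The commands above can be wrapped with two small guards: delete the uuid file only when it is actually empty, and retry crictl info until the runtime answers, since monit starts the process asynchronously. This is a sketch under stated assumptions: remove_if_empty and wait_for are hypothetical helper names, the retry counts are arbitrary, and it assumes crictl info exits nonzero while the runtime status call is failing.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; not part of TKGi, BOSH, or containerd.

# Remove the file only if it exists and has zero length.
# Returns 0 if the file was removed, 1 if it was left untouched.
remove_if_empty() {
  f="$1"
  if [ -e "$f" ] && [ ! -s "$f" ]; then
    rm -- "$f"
    return 0
  fi
  return 1
}

# Retry a command until it succeeds or the attempts run out.
# Usage: wait_for <attempts> <sleep_seconds> <command> [args...]
wait_for() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Example sequence on the affected node (run as root):
#   monit stop containerd
#   remove_if_empty /var/vcap/store/containerd/io.containerd.grpc.v1.introspection/uuid
#   monit start containerd
#   wait_for 30 2 crictl info && echo "containerd runtime is responding"
```

The guard in remove_if_empty leaves a populated uuid file in place, so the sequence is safe to run on a node that does not have this problem.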
Restart the cluster upgrade:
tkgi upgrade-cluster <Cluster name>
https://github.com/containerd/containerd/issues/10491
Fixed in containerd v1.7.21, which is bundled in TKGi v1.21 and v1.22.