While testing the resiliency of our staging environment, I decided it would be a good idea to see how Kubernetes responds to the loss of a minion/kubelet node. Our cluster has two minions in total, and each replication controller is configured with a replica count of two.
[Image: kube cluster diagram]
My expectations were that Kubernetes would:

  1. kill the docker containers on that node
  2. rebalance and spin up the second replica on the second node

On minion-01 I ran

systemctl stop kubelet

followed by

docker ps

but to my surprise all the pods were still up.

I went to minion-02 and saw that a second replica had been created there. So now we had one minion and three instances of the container, where there should only be two!
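A quick way to confirm where the replicas have landed is to ask the master and then cross-check each minion (the service name here is illustrative):

```shell
# On the master: list pods and the minion each one is scheduled on.
kubectl get pods

# On each minion: see which containers are actually running locally.
docker ps --filter "name=example-service"
```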

My initial thought was that maybe our containers (which run a Java webapp managed by Supervisor) were not responding to the SIGTERM sent by Kubernetes. To check this independently, I used kubectl to stop a replication controller directly

kubectl stop replicationcontrollers example-service-release-33

From this I could see that the containers were being stopped successfully, so I ruled that out.
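The same signal handling can also be checked without Kubernetes at all, by stopping a container directly. `docker stop` sends SIGTERM and only escalates to SIGKILL after a grace period, so a container that exits well within that window is handling SIGTERM (container names here are illustrative):

```shell
# Give the container up to 30 seconds to shut down cleanly after SIGTERM.
docker stop --time=30 example-service-container

# Confirm it exited; an exit code of 0 suggests a clean SIGTERM shutdown,
# while 137 (128 + 9) means it had to be SIGKILLed.
docker ps -a --filter "name=example-service"
```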

My next thought was that perhaps the kubelet service unit file needed to be amended to kill all the containers when a stop command was issued. But before diving into this, I thankfully decided to do a few more tests...

I did a simple restart of the kube minion service. Again, none of the containers were stopped, but interestingly I could see from the logs that Kubernetes was recovering the orphaned containers. This was very cool.
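For reference, the restart-and-watch sequence looked roughly like this, assuming the kubelet runs as a systemd unit named `kubelet`:

```shell
# Restart the kubelet; the containers it was managing keep running.
systemctl restart kubelet

# Follow the kubelet logs to watch it re-adopt the orphaned containers.
journalctl -u kubelet -f
```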

I confirmed via kubectl on the master that there were just two instances running, one on each machine. With a quick restart, the timeout after which Kubernetes spins up another instance had not been reached, so the orphaned container was simply picked up again.

Then I ran another test: I stopped minion-01 and waited until the second instance had started on the still-running minion-02; once that happened, I started minion-01 again. Once more I saw the ‘Recovery completed’ message in the logs, and I wondered whether it had picked up the orphan and somehow rebalanced (or, worse, was now running three replicas). But on running docker ps I could see that the orphaned containers had been removed by Kubernetes, as they were no longer needed. Pretty neat. What actually happens here is that the NodeController on the master waits for a timeout and then marks the pods on the down node as ‘Failed’. When the minion comes back up, it checks etcd for the status of its containers and, seeing they have failed, deletes them.
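That timeout is configurable on the controller manager. As a sketch (flag names and defaults vary across Kubernetes versions, so treat these as illustrative rather than definitive):

```shell
# How long before an unresponsive node is marked down, and how long after
# that before its pods are marked failed and rescheduled elsewhere.
kube-controller-manager \
  --node-monitor-grace-period=40s \
  --pod-eviction-timeout=5m0s
```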

With our kubelet services set to restart on failure, quick intermittent failures mean the orphaned containers are seamlessly picked up again, and we don’t incur the overhead of starting them on the other nodes. In the case of a longer failure, if and when the node does eventually restart, Kubernetes will take care of cleaning up the redundant containers.
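The restart-on-failure behaviour itself lives in the systemd unit. A minimal drop-in sketch (the path is illustrative; our actual unit file may differ):

```ini
# /etc/systemd/system/kubelet.service.d/restart.conf (illustrative path)
[Service]
Restart=on-failure
RestartSec=5
```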

In our small two-node cluster, should a kubelet node fail, we would be left with all the containers running on one node; Kubernetes does not rebalance them automatically, so for now rebalancing will be a manual step for us to run every so often.
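The manual step amounts to deleting one of the doubled-up pods so the replication controller recreates it and the scheduler gets a fresh placement decision. A sketch (the pod name is illustrative, and note the scheduler may still pick the same node):

```shell
# Find the pods that are doubled up on one minion.
kubectl get pods

# Delete one of them; the replication controller will recreate it, and the
# scheduler's spreading behaviour will usually place it on the emptier node.
kubectl delete pod example-service-release-33-abcde
```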

For further reading on the life of a pod check out this link.