How to bring down a Kafka cluster through no fault of your own.

I'll keep this short, since sharing screenshots or details would be a security breach 😄:


1 Google decides to migrate two of your VMs within the space of 5 minutes (lucky us)


2 The cluster becomes unbalanced, and of course the Kafka Manager you have installed locally doesn't make your life any easier (thanks, Yahoo)


3 A couple of hours later, monitoring pings you to say that you have offline partitions (but why?); a quick way to spot these is sketched after this list


4 You find out that one of your nodes is no longer responding, so you restart it


5 Once replication kicks in, you see that your other node has increased IOWait and load, so you decide to restart it as well


6 Newsflash: your partitions are not replicating anymore. But why? You take a look at the topics and see that the ISR contains only the node you restarted first, which is also the leader of all partitions…..wow, strange….but it isn't, because you changed a very important parameter, unclean.leader.election.enable=false, and now, yes, comes the AHA moment (the second sketch after this list shows how to check that setting)


7 What the hell do you do now, because that parameter prevents Kafka from moving the leader to a replica that is not in the ISR


8 You decide that the actual issue is with the first node you restarted, which is also the controller, so you restart it to force the cluster to restart replication, even though that means downtime and offline partitions


9 The cluster restarts the replication process and the other nodes are brought back into the ISR


10 You manually rebalance the cluster partitions so that the first node does not get overwhelmed (the last sketch below shows what a reassignment plan can look like)
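
A few rough sketches of what the checks above can look like in practice. None of this is from the actual incident; it assumes the confluent-kafka Python client, a placeholder bootstrap address (broker-1:9092) and a made-up topic name ("events").

First, spotting offline and under-replicated partitions (steps 3 and 6):

```python
# Sketch: list partitions with no leader or with a shrunken ISR.
# Assumes the confluent-kafka Python client and a placeholder broker address.
from confluent_kafka.admin import AdminClient

admin = AdminClient({"bootstrap.servers": "broker-1:9092"})  # placeholder
metadata = admin.list_topics(timeout=10)

for topic_name, topic in metadata.topics.items():
    for p in topic.partitions.values():
        if p.leader == -1:
            # no leader at all: the partition is offline
            print(f"OFFLINE          {topic_name}[{p.id}]")
        elif len(p.isrs) < len(p.replicas):
            # fewer in-sync replicas than replicas: replication is lagging or stopped
            print(f"UNDER-REPLICATED {topic_name}[{p.id}] "
                  f"leader={p.leader} replicas={p.replicas} isr={p.isrs}")
```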
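Second, checking whether unclean leader election is allowed on a topic (the parameter from steps 6 and 7). The topic name here is hypothetical:

```python
# Sketch: read the effective unclean.leader.election.enable value for a topic.
from confluent_kafka.admin import AdminClient, ConfigResource

admin = AdminClient({"bootstrap.servers": "broker-1:9092"})  # placeholder
resource = ConfigResource(ConfigResource.Type.TOPIC, "events")  # hypothetical topic

configs = admin.describe_configs([resource])[resource].result()
entry = configs["unclean.leader.election.enable"]
# "false" means an out-of-ISR replica can never be elected leader,
# which is exactly why the leadership was stuck on one node
print(entry.name, "=", entry.value)
```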
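And third, one way to do the manual rebalancing from step 10: generate a reassignment plan and feed it to the stock kafka-reassign-partitions.sh tool. The broker ids and partition layout below are made up; the first broker in each replica list is the preferred leader, which is what actually spreads the load away from the overwhelmed node:

```python
# Sketch: build the JSON plan that kafka-reassign-partitions.sh consumes.
# Topic, partition count and broker ids are illustrative only.
import json

plan = {
    "version": 1,
    "partitions": [
        # first replica in each list is the preferred leader
        {"topic": "events", "partition": 0, "replicas": [2, 3, 1]},
        {"topic": "events", "partition": 1, "replicas": [3, 1, 2]},
        {"topic": "events", "partition": 2, "replicas": [1, 2, 3]},
    ],
}

with open("reassignment.json", "w") as f:
    json.dump(plan, f, indent=2)

# Then apply it with the stock tool, e.g.:
#   kafka-reassign-partitions.sh --bootstrap-server broker-1:9092 \
#       --reassignment-json-file reassignment.json --execute
```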