Fix under-replicated partitions with a controller restart

Hi,

If you have a Kafka cluster in which only one broker reports zero under-replicated partitions while all the others report a non-zero number, then be aware that this broker is not properly registered in the cluster.
Looking at the state-change log on the instance that is the cluster controller will not show anything unusual (all brokers will be reported as registered).
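
The symptom described above can be checked mechanically. The following is only a minimal sketch: it assumes you have already collected one "broker_id count" line per broker (for example from the UnderReplicatedPartitions JMX metric of each broker), and the function name find_suspect_broker is made up for this illustration.

```shell
# find_suspect_broker (hypothetical helper): reads "broker_id urp_count"
# lines on stdin and prints the broker id only when exactly one broker
# reports 0 while at least one other broker reports a value above 0.
find_suspect_broker() {
    awk '
        NF >= 2 && $2 == 0 { zero++; id = $1 }   # broker reporting zero
        NF >= 2 && $2 > 0  { nonzero++ }         # broker reporting URPs
        END { if (zero == 1 && nonzero >= 1) print id }
    '
}

# Example with made-up numbers:
# printf '1 12\n2 9\n3 0\n' | find_suspect_broker
```

If nothing is printed, the cluster does not match the pattern described here and this fix probably does not apply.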

That broker cannot take on partitions because the controller, in its current epoch, does not consider it functional (which is different from reachable).

The easiest way to fix this is a normal restart of the cluster controller. Please do not use kill or anything similar; it will make things worse.
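
Before restarting, you need to know which broker currently is the controller. On a ZooKeeper-based cluster the /controller znode holds a small JSON document with a "brokerid" field. The sketch below only extracts that id; the function name controller_id, the ZooKeeper address localhost:2181 and the /opt/kafka/bin location are assumptions for illustration.

```shell
# controller_id (hypothetical helper): reads the content of the
# /controller znode on stdin and prints the numeric broker id.
# Lines that do not contain a "brokerid" field are ignored.
controller_id() {
    sed -n 's/.*"brokerid"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Example (assumed ZooKeeper address and install path):
# /opt/kafka/bin/zookeeper-shell.sh localhost:2181 get /controller | controller_id
```

The broker with that id is the one to stop and start again through your usual service mechanism.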

Now, if you run tail -50f /opt/kafka/server.log, you will see that the process is trying to stop and that all of its components are going down in a controlled fashion.
You will also see errors regarding the partitions hosted on the “problematic” broker, and these errors repeat in a cycle. Please be patient with this process and wait for it to stop on its own.
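
If you prefer not to watch the log by hand, the waiting can be scripted. This is a rough sketch and not part of Kafka: wait_for_exit is a made-up helper, and the 600-second default timeout is an arbitrary assumption. It uses kill -0, which only checks whether the process still exists and does not send any signal that would stop it.

```shell
# wait_for_exit PID [TIMEOUT_SECONDS] (hypothetical helper)
# Polls once per second until the process is gone. Returns 0 once the
# process no longer exists and 1 if it is still running when the timeout
# is reached. It never terminates the process itself.
wait_for_exit() {
    pid=$1
    timeout=${2:-600}
    waited=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Example (assumes PID holds the broker's process id):
# wait_for_exit "$PID" 900 && echo "broker stopped cleanly"
```

If the function times out, keep waiting rather than forcing the process down.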

After the restart, if you look at the state-change log of the new cluster controller, you will see from time to time that the broker you restarted is reported as unreachable. This is not a real problem, and you can ignore those errors.

After a while, you will see the number of under-replicated partitions begin to decrease, hopefully down to zero on all brokers.
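
One way to follow the recovery is to count the partitions that kafka-topics.sh lists with its --under-replicated-partitions option. The sketch below assumes each listed partition appears on its own line containing "Partition:"; the function name count_urp, the install path and the ZooKeeper address are assumptions, and newer Kafka versions take --bootstrap-server instead of --zookeeper.

```shell
# count_urp (hypothetical helper): reads the output of
# "kafka-topics.sh --describe --under-replicated-partitions" on stdin
# and prints how many partition lines it contains.
count_urp() {
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so "|| true" keeps the function's exit status at 0.
    grep -c 'Partition:' || true
}

# Example (assumed install path and ZooKeeper address):
# /opt/kafka/bin/kafka-topics.sh --describe --under-replicated-partitions \
#     --zookeeper localhost:2181 | count_urp
```

Running this every few minutes should show the number going down over time.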

Cheers