Morning,
If you ever use Puppet or another automation tool to deploy a server that also has ZooKeeper installed, and you already have a working cluster, please be aware of this.
Yesterday I rebuilt a node multiple times (there were some errors to fix), and after finally getting it right, the ZooKeeper instance did not behave as expected.
When I looked in the /var/lib/zookeeper directory, the correct myid file was there (matching the ID in the config file), along with the version-2 directory.
Normally version-2 should hold all the data stored by ZooKeeper, but there was only a currentEpoch file with 0 in it.
Multiple restarts, no result. The log didn't contain anything relevant. Since the server was not live yet, I rebuilt it one more time, but it behaved the same way. It looked like the node was completely out of sync, and that was the truth 😀
I eventually figured out, by accident, that the new ZooKeeper node was not yet registered with the cluster (I had tried changing the ID of that node and restarting the whole cluster).
To register it, well, you need to restart the leader. How do you find it? There are multiple methods, I guess; here are two that work. Either run the following command on each node
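A quick way to spot this state is to check whether version-2 contains any snapshot.* or log.* files, which is how ZooKeeper names its snapshots and transaction logs. This is only a rough sketch: the function name zk_data_missing is made up for illustration, and the default path is the one from this server.

```shell
# Succeeds when the given version-2 directory holds no snapshot.* or log.* files,
# i.e. the node has apparently not received any data from the cluster yet.
# zk_data_missing is a hypothetical helper name; the default path is this server's.
zk_data_missing() {
    dir=${1:-/var/lib/zookeeper/version-2}
    for f in "$dir"/snapshot.* "$dir"/log.*; do
        # An unmatched glob stays literal, so -e is false for it.
        [ -e "$f" ] && return 1
    done
    return 0
}
```

Something like `zk_data_missing && echo "no data yet"` on the rebuilt node would have flagged the problem right away.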
echo stat | nc localhost 2181 | grep Mode
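If you don't feel like logging in to every node, the same check can be wrapped in a small loop. This is just a sketch: the function names are invented, it assumes the client port is 2181 on every node, and it assumes the stat command is allowed (newer ZooKeeper versions restrict four-letter commands through a whitelist setting, so it may need enabling there).

```shell
# Extract the role ("leader", "follower", ...) from "stat" output read on stdin.
parse_mode() {
    awk -F': ' '/^Mode:/ { print $2 }'
}

# Ask one node for its role. The 2181 default is an assumption about your setup.
zk_mode() {
    host=$1
    port=${2:-2181}
    echo stat | nc -w 2 "$host" "$port" 2>/dev/null | parse_mode
}

# Print the role of every host passed as an argument.
zk_roles() {
    for host in "$@"; do
        mode=$(zk_mode "$host")
        echo "$host: ${mode:-unreachable}"
    done
}
```

Calling `zk_roles server1 server2 server3` should then print one line per node, with the leader marked as such.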
Or by checking the exposed ports
zookeeper start/running, process 15259
root@server1:/var/lib/zookeeper/version-2# netstat -tulpen | grep 15259
tcp6 0 0 :::42844 :::* LISTEN 107 1104606 15259/java
tcp6 0 0 :::2181 :::* LISTEN 107 1114708 15259/java
tcp6 0 0 :::2183 :::* LISTEN 107 1104609 15259/java
tcp6 0 0 :::9998 :::* LISTEN 107 1104607 15259/java

root@server2:/var/lib/zookeeper/version-2# netstat -tulpen | grep 28068
tcp6 0 0 :::48577 :::* LISTEN 107 3182780 28068/java
tcp6 0 0 :::2181 :::* LISTEN 107 3185668 28068/java
tcp6 0 0 :::2183 :::* LISTEN 107 3184651 28068/java
tcp6 0 0 :::9998 :::* LISTEN 107 3182781 28068/java

root@server3:/var/lib/zookeeper/version-2# netstat -tulpen | grep 20719
tcp6 0 0 :::2181 :::* LISTEN 107 5365296 20719/java
tcp6 0 0 :::2182 :::* LISTEN 107 5382604 20719/java
tcp6 0 0 :::2183 :::* LISTEN 107 5374105 20719/java
tcp6 0 0 :::36008 :::* LISTEN 107 5371417 20719/java
tcp6 0 0 :::9998 :::* LISTEN 107 5371418 20719/java
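The leader always listens on port 2182 (the follower port in this setup) so that the followers can connect and grab the updates. In the output above, only server3 has it open, so server3 is the leader.
After a quick restart of the leader, everything worked as expected!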
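The port check can be scripted too. A minimal sketch, assuming netstat output in the column layout shown above; listens_on is a hypothetical helper name, and 2182 is simply the follower port configured on this cluster, so yours may differ.

```shell
# Succeeds if the netstat output on stdin shows a LISTEN socket on the given port.
# Assumes the column layout above: field 4 = local address, field 6 = state.
listens_on() {
    port=$1
    awk -v p="$port" '
        $6 == "LISTEN" && $4 ~ (":" p "$") { found = 1 }
        END { exit !found }
    '
}
```

Running `netstat -tln | listens_on 2182 && echo "this node is the leader"` on each node should print the message only on the leader.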
Cheers