Morning,
It’s important to share this situation with you. This morning I came to the office to find that a cluster that had been upgraded and restarted had an issue with its Zookeeper instances.
The symptoms were clear: the instances would not start completely. But why?
After a little investigation, I checked /var/log/syslog (/var/log/zookeeper did not contain any information at all) and saw that the kernel was reporting a bad page table for the JVM process.
Java version is:
java version "1.8.0_111"
Java(TM) SE Runtime Environment (build 1.8.0_111-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.111-b14, mixed mode)
The log showed the following lines:
Aug 16 07:16:04 kafka0 kernel: [ 742.349010] init: zookeeper main process ended, respawning
Aug 16 07:16:04 kafka0 kernel: [ 742.925427] java: Corrupted page table at address 7f6a81e5d100
Aug 16 07:16:05 kafka0 kernel: [ 742.926589] PGD 80000000373f4067 PUD b7852067 PMD b1c08067 PTE 80003ffffe17c225
Aug 16 07:16:05 kafka0 kernel: [ 742.928011] Bad pagetable: 000d [#1643] SMP
Aug 16 07:16:05 kafka0 kernel: [ 742.928011] Modules linked in: dm_crypt serio_raw isofs crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd psmouse floppy
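To check quickly whether other hosts show the same symptom, the syslog can be scanned for the two kernel messages seen above. This is only a minimal sketch: the function names are made up for illustration, and the pattern assumes the same syslog line layout as in the excerpt.

```python
import re

# Matches kernel lines like the ones in the excerpt above.
# The two phrases come from that log; everything else here is illustrative.
PAGE_TABLE_ERROR = re.compile(
    r"kernel: \[\s*(?P<uptime>[\d.]+)\]\s+"
    r"(?P<msg>.*(?:Corrupted page table|Bad pagetable).*)"
)


def find_page_table_errors(lines):
    """Return (uptime_seconds, message) for every matching kernel line."""
    hits = []
    for line in lines:
        match = PAGE_TABLE_ERROR.search(line)
        if match:
            hits.append((float(match.group("uptime")), match.group("msg").strip()))
    return hits


def scan_file(path="/var/log/syslog"):
    """Scan a syslog file; the default path is the one used in this post."""
    with open(path, errors="replace") as handle:
        return find_page_table_errors(handle)
```

Calling scan_file() on an affected host should list the page table errors together with the kernel uptime at which they occurred; an empty list suggests the host is not hitting this particular problem.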
Why would the kernel report a corrupted page table for a JVM process? The most likely reason is an incompatibility with the running kernel version.
Let’s take a look at the GRUB config file.
It looks like we are booting with:
menuentry 'Ubuntu' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-simple-baf292e5-0bb6-4e58-8a71-5b912e0f09b6' {
    recordfail
    load_video
    gfxmode $linux_gfx_mode
    insmod gzio
    insmod part_msdos
    insmod ext2
    if [ x$feature_platform_search_hint = xy ]; then
      search --no-floppy --fs-uuid --set=root baf292e5-0bb6-4e58-8a71-5b912e0f09b6
    else
      search --no-floppy --fs-uuid --set=root baf292e5-0bb6-4e58-8a71-5b912e0f09b6
    fi
    linux /boot/vmlinuz-3.13.0-155-generic root=UUID=baf292e5-0bb6-4e58-8a71-5b912e0f09b6 ro console=tty1 console=ttyS0
    initrd /boot/initrd.img-3.13.0-155-generic
}
An older kernel image, 3.13.0-153, was also available on the server.
The short-term fix is to update the grub.cfg file so that it boots the older kernel version, and then reboot the server.
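Before touching grub.cfg it helps to confirm which older kernel is actually installed. The sketch below is an assumption-laden helper, not part of the original fix: the function names are hypothetical, and it assumes kernel images are stored as /boot/vmlinuz-<version>, as in the menu entry above. It only reports a candidate and does not modify any boot configuration.

```python
import re
from pathlib import Path


def version_key(version):
    """Turn '3.13.0-155-generic' into (3, 13, 0, 155) for numeric comparison."""
    return tuple(int(part) for part in re.findall(r"\d+", version))


def previous_kernel(installed, running):
    """Return the newest installed kernel older than `running`, or None."""
    older = [v for v in installed if version_key(v) < version_key(running)]
    return max(older, key=version_key) if older else None


def installed_kernels(boot_dir="/boot"):
    """List kernel versions based on vmlinuz-* image names in boot_dir."""
    prefix = "vmlinuz-"
    return [p.name[len(prefix):] for p in Path(boot_dir).glob(prefix + "*")]
```

On the affected host, previous_kernel(installed_kernels(), platform.release()) (after import platform) should point at the 3.13.0-153 image, provided it is still installed. Keep in mind that grub.cfg is a generated file, so a manual edit there is likely to be overwritten the next time the GRUB configuration is regenerated.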
A proper fix is still in progress. I will post it as soon as I have it.
P.S: I forgot to mention the Zookeeper version:
Zookeeper version: 3.4.5--1, built on 06/10/2013 17:26 GMT
P.S 2: It seems that the issue affects Java processes in general, not only Zookeeper.
Cheers