Wrong Kafka configuration and deployment using Puppet


I just want to share with you a case that we had last week, involving a wrong Kafka deployment from Puppet that filled the filesystems and affected our colleagues who were using it as a transport layer for ELK.

Let's start at the beginning: we have some Puppet code to deploy and configure Kafka machines. To keep it simple, the broker config block from Puppet looks like this:

$broker_config = {
    'broker.id'                     => '-1', # always set to -1 so the broker id is auto-generated
    # broker specific config
    'zookeeper.connect'             => hiera('::kafka::zookeeper_connect', $zookeeper_connect),
    'inter.broker.protocol.version' => hiera('::kafka::inter_broker_protocol_version', $kafka_version),
    'log.dir'                       => hiera('::kafka::log_dir', '/srv/kafka-logs'),
    'log.dirs'                      => hiera('::kafka::log_dirs', '/srv/kafka-logs'),
    'log.retention.hours'           => hiera('::kafka::log_retention_hours', $days7),
    'log.retention.bytes'           => hiera('::kafka::log_retention_bytes', '-1'),
    # configure availability
    'num.partitions'                => hiera('::kafka::num_partitions', 256),
    'default.replication.factor'    => hiera('::kafka::default_replication_factor', $default_replication_factor),
    # configure administratability (this is a word now)
    'delete.topic.enable'           => hiera('::kafka::delete_topic_enable', 'true'),
}

As you can see, there are two fields that need to be carefully configured: one is log.retention.bytes and the other is num.partitions. I am underlining this for a very simple reason: if we take a look at the server.properties file, which the engine uses to start the broker, we will see the values that actually got applied.
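The relevant lines of our server.properties looked roughly like the following (reconstructed from the figures discussed below; 100GB expressed in bytes is 107374182400):

```properties
# maximum retained size per partition, ~100GB
log.retention.bytes=107374182400
# default partition count for auto-created topics
num.partitions=256
```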
Now, the first one is the default value (and it means the maximum size per partition), and if you go to an online converter you will see that it means 100GB. This shouldn't be a problem if you override it per topic, and even less so if you manually create a topic with one or two partitions and you have the required space. But our case was different: they were using it for Logstash, which created a topic with the default configuration of 256 partitions, and if the maximum size per partition was 100GB then Kafka thought it had 25600GB to work with, which, if I am not mistaken, translates to 25TB (that is usually more than enough for an ELK instance).

Since we were running only three nodes with 100GB filesystem each, you can imagine that it filled up and the servers crashed.
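A quick back-of-the-envelope calculation (in Python, just to illustrate the arithmetic) shows why three 100GB filesystems never stood a chance:

```python
# Worst-case storage Kafka could try to use with the defaults from our config.
GB = 1024 ** 3
TB = 1024 ** 4

num_partitions = 256                 # logstash topic created with the default partition count
retention_per_partition = 100 * GB   # log.retention.bytes left at ~100GB

worst_case = num_partitions * retention_per_partition
print(worst_case // GB)  # 25600 (GB)
print(worst_case // TB)  # 25 (TB)
```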

Now there are two ways to avoid this, depending on the number of consumers in the consumer group and on the retention period you need. If you have only one consumer and you want to keep data for a longer period of time, you can leave the retention at 100GB (if you have that much storage available) and manually create a topic with one partition and a replication factor equal to the number of nodes you want to use for high availability. Alternatively, you can leave the partition number at 256 and significantly decrease the retention bytes value. The second option gives you an equal distribution of data across multiple consumers, but it comes with a shorter storage period (which, of course, also depends on the amount and type of data you are transferring).
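For the first option, the manual topic creation can be done with the kafka-topics tool shipped with Kafka. A sketch, where the topic name and the ZooKeeper address are placeholders for your environment:

```shell
# Create a single-partition topic for Logstash, replicated across all three brokers.
# "logstash" and "zk1:2181" are placeholders; adjust them to your setup.
kafka-topics.sh --create \
    --zookeeper zk1:2181 \
    --topic logstash \
    --partitions 1 \
    --replication-factor 3
```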

The solution that we adopted was to leave the partition number unchanged and decrease the maximum partition size to 200MB. I will keep you informed on how it works. 🙂
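Expressed through the hiera key from the Puppet code above, the fix is a single override (200MB in bytes is 209715200; where the yaml file lives depends on your hiera hierarchy):

```yaml
# hiera override: cap each partition at ~200MB instead of 100GB
::kafka::log_retention_bytes: '209715200'
```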



Install RancherOS on VirtualBox and configure it for ssh access


If you are not familiar with RancherOS, you can learn more from this link: Rancher docu. It's basically a very small Linux distro that runs all its processes as Docker containers (including the system processes).

So, starting from here, we will need a RancherOS image, which you can download from the following location: Rancher git. After that you will need a VirtualBox machine with a minimum of 1GB of RAM (the reason for this is that Rancher will initially run entirely from memory). The size of the root partition can be as big as you like, and no extra video configuration is required since it will run in CLI mode.

You also need to know that an extra jump server (or any server that is accessible over the ssh protocol) is required in order to successfully configure your single or multiple running instances of Rancher, and that is for a simple reason. As far as I managed to test, mounting an external USB storage device does not work (please be aware that we are talking about an isolated environment), and copy/paste does not work by default without the VirtualBox Guest Tools installed (unfortunately installing them is also not possible, because we will not have a GUI and these kinds of releases are not supported; I think this is also the case for CoreOS). Please make sure that the servers can reach each other and have sshd installed and configured.

Since Rancher allows only ssh key login, for security reasons, you will need to add your public key to the cloud-config.yml file before the install.

On the jump server you need to generate an RSA key pair with the ssh-keygen command; it will create the following pair of files in the .ssh directory (this is a listing from my test machine):

-rw-r--r-- 1 sorin sorin  394 Mar 21 08:09 id_rsa.pub
-rw------- 1 sorin sorin 1675 Mar 21 08:09 id_rsa

The next step is to build the minimal cloud-config file in order to get access to the machine, and for that purpose we can run the following command:

echo -e "#cloud-config\nssh_authorized_keys:\n  - $(cat $HOME/.ssh/id_rsa.pub)" > $HOME/cloud-config.yml

This will create the only file you need in order to install your “server”.

Ok, it's time to start our Rancher machine; please make sure that you have the Rancher image mounted in order to boot from it. After the boot process is done, you will need to connect to the jump server in order to grab the file created above.
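The exact command from my session is not reproduced here, but a plausible way to fetch the file over ssh from the Rancher console looks like this (the user name and IP are placeholders for your jump server):

```shell
# Copy the cloud-config built on the jump server to the running Rancher live system.
# "sorin" and "192.168.56.10" are placeholders; use your jump-server user and address.
scp sorin@192.168.56.10:cloud-config.yml .
```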

After this is done, we can install it on the local drive.
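The install itself is done with the ros CLI that ships with RancherOS; a sketch, assuming the target disk is /dev/sda (check your actual disk layout first):

```shell
# Install RancherOS to disk using the cloud-config we just fetched.
# /dev/sda is an assumption; verify the target device before running this.
sudo ros install -c cloud-config.yml -d /dev/sda
```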

Ok, this being done, you will be prompted to restart the machine, but before that please make sure that you have unmounted the Rancher image from the virtual drive; otherwise it will boot from the image and not from the actual install.

You are almost done. After the restart you can access the server via ssh rancher@[rancher server ip] if you used the default id_rsa key from the .ssh directory; if not, use ssh -i [private key file location] rancher@[rancher server ip]

More articles to come on this topic.