Category: python

  • Loading unique IPs in MongoDB

    Hi,

    So today I played a little bit with the possibility of storing the unique IP addresses in a separate table.

    Since I will use a subscription from ip-api.com, there is an option to query info in batches, with a limit of 100 IPs per payload.

    So, at first glance there are 227,200 unique IPs in my dataset. That accounts for 2,272 payloads to be queried.

    The code looks more or less like this:

    unique_ip = temp['SourceIP'].unique()
    # split the unique IPs into chunks of at most 100 (the API batch limit)
    unique_list = [unique_ip[i:i + 100] for i in range(0, len(unique_ip), 100)]
    data = [{'id': i + 1, 'payload': chunk.tolist()}
            for i, chunk in enumerate(unique_list)]

    Once this is constructed, you only need to insert the list into MongoDB:

    import pymongo

    myclient = pymongo.MongoClient("mongodb://localhost:27017/")
    mydb = myclient["mydatabase"]
    mycol = mydb["unique_ip"]
    # insert_many is one round trip instead of one insert per document
    mycol.insert_many(data)

    The next step will involve taking the documents from the collection one by one and serving them to the API endpoint.
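
    That querying step could start from something like the sketch below. `fetch_batch` is a hypothetical helper of mine, not code from this post; the batch endpoint URL and the 100-IP limit come from the ip-api.com documentation, and the `post` parameter is injectable only so the function can be exercised without network access.

```python
import requests

def fetch_batch(ips, post=requests.post):
    """Query ip-api.com for up to 100 IPs in a single POST."""
    if len(ips) > 100:
        raise ValueError("ip-api.com accepts at most 100 IPs per batch")
    r = post("http://ip-api.com/batch", json=ips)
    return r.json()  # one result dict per IP (country, city, isp, ...)
```

    In the real flow, `ips` would be the `payload` field of one document read back from the `unique_ip` collection.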

    Tnx,

    Sorin

  • Loading data to a Mongo database for further processing

    Hi,

    Since I needed my data to be available for further processing in a centralized manner, I have decided to store the first draft, as well as the further queries to the location API, in a Mongo database.

    Here is the short code snippet that was used for this task:

    import pandas as pd
    import pymongo

    df = pd.read_csv(r'C://Users//Sorin//Downloads//filter.concat')

    test = df[df['pppoe0'] == 'pppoe0']
    temp = test.iloc[:, [0, 1, 2, 21, 22, 23, 24]]

    # avoid naming this 'dict', which shadows the builtin
    column_names = {'Feb': 'Month',
                    '18': 'Day',
                    '09:16:00': 'Hour',
                    '184.105.247.254': 'SourceIP',
                    '86.123.204.222': 'DestinationIP',
                    '48307': 'SourcePort',
                    '447': 'DestinationPort'}

    temp.rename(columns=column_names, inplace=True)

    myclient = pymongo.MongoClient("mongodb://localhost:27017/")
    mydb = myclient["mydatabase"]
    mycol = mydb["traffic_init_load"]
    temp.reset_index(inplace=True)
    data_dict = temp.to_dict("records")
    # insert all rows as documents in one call
    mycol.insert_many(data_dict)

    From the concatenated file, my interest is strictly the traffic on pppoe.

    We keep only the columns related to the date, source and destination, and after that the documents are written to MongoDB.
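
    The key step here is DataFrame.to_dict("records"), which turns every row into its own dict, i.e. one MongoDB document per row. A tiny sketch with made-up data (the two columns stand in for the renamed traffic fields):

```python
import pandas as pd

# made-up miniature of the renamed traffic frame
temp = pd.DataFrame({'SourceIP': ['1.2.3.4'], 'DestinationIP': ['5.6.7.8']})
temp.reset_index(inplace=True)     # keeps the original index as a field
data_dict = temp.to_dict("records")
# data_dict == [{'index': 0, 'SourceIP': '1.2.3.4', 'DestinationIP': '5.6.7.8'}]
```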

    That is all.

  • Start of the traffic project

    So, I managed to gather about 1 GB of records from the pfSense installation and grab them from the box (the filter.log files that you can find under /var/log).

    And I have a list of 16 logs that I need to concatenate.

    I had a lot of trouble concatenating them, since I tried multiple times to use the writelines() method of the file object.

    The code that worked for me:

    # rewrite the syslog prefix (month, day, time) as comma-separated fields
    # and keep the already comma-separated remainder of the line
    with open('//Users//tudorsorin//Downloads//var//log//firewallrepo//filter.concat', 'r') as f, \
         open('//Users//tudorsorin//Downloads//var//log//firewallrepo//filter.csv', 'w') as outputcsv:
        for line in f:
            fields = line.split(' ')
            outputcsv.write(','.join(fields[0:3]) + ',' + fields[-1])

    The idea is that it’s already in CSV format, and all you need to do is modify the “header” that normally looks like Feb 20 07:58:18 soaretudorhome filterlog[41546]: into something like Feb,20,07:58:18, and the rest remains the same.

    Surprisingly, if you want to load it directly into a dataframe using pd.read_csv and you don’t force a header, it works, and all the data is there with NaN in the fields that are not filled.

    After this is done, we can keep only the traffic that goes over pppoe0, which is the WAN interface, and you can easily do that using temp = df[df['pppoe0'] == 'pppoe0']

    So far so good. I also took a look at a generic pppoe0 line and came to the conclusion that the columns that interest me are [0,1,2,21,22,23,24], which represent the date, the source IP, the destination IP and the ports (source and destination). You can then filter the dataframe with temp = temp.iloc[:, [0,1,2,21, 22, 23, 24]]
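
    Putting the two steps together on a made-up miniature of the rewritten log (the sample lines are invented; the first data row becomes the header because none is forced, hence the odd column names like 'pppoe0'):

```python
import io
import pandas as pd

# three invented lines in the rewritten comma-separated format; the first
# is promoted to header, so the interface column is literally named 'pppoe0'
sample = ("Feb,18,09:16:00,184.105.247.254,pppoe0,48307\n"
          "Feb,18,09:16:01,184.105.247.254,pppoe0,48308\n"
          "Feb,18,09:16:02,10.0.0.1,em0,22\n")
df = pd.read_csv(io.StringIO(sample))
temp = df[df['pppoe0'] == 'pppoe0']   # keep only WAN traffic
```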

    So finally we have a dataframe that we can work with. Now what remains is to change the table header and try to enhance it with extra info.

    Cheers

    Sorin

  • Python Kata on Codewars

    Hi,

    Since I pretty much broke the internet trying to solve the following “kata” with pieces of code, let’s paste it here as well, because it makes me proud.

    Here is the link to the kata: https://www.codewars.com/kata/5977ef1f945d45158d00011f

    And also here is my “solution”, which took quite a long time to get right:

    def sep_str(st):
        # split into words, then each word into a list of letters
        test_list = [list(element) for element in st.split()]
        if not test_list:
            return []
        # pad every word with empty strings up to the longest word
        max_len = max(map(len, test_list))
        for a in test_list:
            a.extend([""] * (max_len - len(a)))
        # transpose: group the i-th letter of every word
        return [[word[i] for word in test_list] for i in range(max_len)]
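
    A quick sanity check of the behavior (the expected shape is inferred from the solution itself, not quoted from the kata statement):

```python
def sep_str(st):
    # split into words, then each word into a list of letters
    test_list = [list(element) for element in st.split()]
    if not test_list:
        return []
    max_len = max(map(len, test_list))
    for a in test_list:
        a.extend([""] * (max_len - len(a)))
    # the i-th inner list holds the i-th letter of each word, padded with ""
    return [[word[i] for word in test_list] for i in range(max_len)]

print(sep_str("the quick"))
# → [['t', 'q'], ['h', 'u'], ['e', 'i'], ['', 'c'], ['', 'k']]
```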

    That is all.

    Cheers!

  • Unique value on columns – pandas

    Hi,

    Today, a short example for the case of columns whose names contain spaces.

    For example, I have a dataframe whose columns include a multi-word name such as Page ID.

    I have read in some sources that you can use the construct wine_new.column_name.unique() to list the unique values.

    If you have a one-word column name it will work, but if the name consists of multiple words you cannot use a construct like wine_new.’Page ID’.unique(), because it gives a syntax error.

    Good, so you try to rename it. Why Page ID and not pageid? OK, that should be easy:

    wine_new = wine_new.rename(columns={"Page ID": "pageid"}, errors="raise")

    And it now looks “better”.

    But if you need to keep the column name, you can just as easily use wine_new[‘Page ID’].unique(). (If you want to count the number of unique values, you can also use wine_new[‘Page ID’].nunique().)
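
    Both access styles side by side, on a made-up frame (wine_new here is a stand-in, not the post’s actual data):

```python
import pandas as pd

wine_new = pd.DataFrame({'Page ID': [1, 2, 2], 'country': ['a', 'b', 'b']})

wine_new.country.unique()       # attribute access: one-word names only
wine_new['Page ID'].unique()    # bracket access: works for any column name
wine_new['Page ID'].nunique()   # counts the distinct values
```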

    There are multiple resources on this topic, but the majority of them do not explain both of these approaches.

    Cheers

  • Prometheus metrics to Pandas data frame

    Hi,

    We are trying to implement a decision tree algorithm in order to see if our resource usage can classify our servers in different categories.

    The first step in that process is querying Prometheus from Python and creating some data frames with basic information, in order to aggregate them.

    To that purpose, you can use the following lines of code:

    import requests
    import pandas as pd

    URL = "http://[node_hostname]:9090/api/v1/query?query=metric_to_be_queried[1d]"

    r = requests.get(url=URL)
    data = r.json()

    metric_list = []
    for i in data['data']['result']:
        for j in i['values']:
            # build a fresh dict per sample; mutating and re-appending a
            # single dict would leave every row pointing at the same object
            row = dict(i['metric'])
            row['time'] = j[0]
            row['value'] = j[1]
            metric_list.append(row)

    df_metric = pd.DataFrame(metric_list)
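
    Once the frame exists, the aggregation could start along these lines. The sample rows are made up; note that Prometheus returns sample values as strings, so they need a numeric conversion first:

```python
import pandas as pd

# made-up samples in the shape produced by the loop above
df_metric = pd.DataFrame([
    {'instance': 'a', 'time': 0,  'value': '1'},
    {'instance': 'a', 'time': 60, 'value': '3'},
    {'instance': 'b', 'time': 0,  'value': '2'},
])
df_metric['value'] = pd.to_numeric(df_metric['value'])
df_metric['time'] = pd.to_datetime(df_metric['time'], unit='s')
mean_per_instance = df_metric.groupby('instance')['value'].mean()
```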

    Other pieces will follow.

    Cheers

  • Strange problem in puppet run for Ubuntu

    Hi,

    Short sharing of a strange case.

    We’ve written a small manifest in order to distribute some python scripts. You can find the reference here: https://medium.com/metrosystemsro/new-ground-automatic-increase-of-kafka-lvm-on-gcp-311633b0816c

    When you try to run it on Ubuntu 14.04, there is this very strange error:

    Error: Failed to apply catalog: [nil, nil, nil, nil, nil, nil]

    The cause for this is as follows:

    Python 3.4.3 (default, Nov 12 2018, 22:25:49)
    [GCC 4.8.4] on linux

    (and I believe this is the maximum version available by default on trusty)

    In order to install the dependencies you need python3-pip, so a short search returns the following options:

    apt search python3-pip
    Sorting... Done
    Full Text Search... Done
    python3-pip/trusty-updates,now 1.5.4-1ubuntu4 all [installed]
      alternative Python package installer - Python 3 version of the package
    
    python3-pipeline/trusty 0.1.3-3 all
      iterator pipelines for Python 3

    If we want to list all the installed modules with pip3 list, guess what, it’s not working:

    Traceback (most recent call last):
      File "/usr/bin/pip3", line 5, in <module>
        from pkg_resources import load_entry_point
      File "/usr/local/lib/python3.4/dist-packages/pkg_resources/__init__.py", line 93, in <module>
        raise RuntimeError("Python 3.5 or later is required")
    RuntimeError: Python 3.5 or later is required

    So, the main conclusion is that it’s not related to Puppet; it’s just a version incompatibility on this old distribution.

    Cheers

  • Small addition for ‘cat’ in Python

    Hi,

    There was an issue with options that aggregate other ones, like -A from my previous post.

    In my view the easiest way to solve it is by storing the options in a tuple.

    Here is the snippet:

    run_options = []
    try:
        opts, args = getopt.gnu_getopt(sys.argv[1:-1], 'AbeEnstTv',
                                       ['show-all', 'number-nonblank', 'show-ends',
                                        'number', 'show-blank', 'squeeze-blank',
                                        'show-tabs', 'show-nonprinting', 'help', 'version'])
    except getopt.GetoptError:
        print("Something went wrong")
        sys.exit(2)
    for opt, arg in opts:
        if opt in ('-A', '--show-all'):
            run_options.append('E')
            run_options.append('T')
        elif opt in ('-b', '--number-nonblank'):
            run_options.append('b')
        elif opt in ('-n', '--number'):
            run_options.append('n')
        elif opt in ('-E', '--show-ends'):
            run_options.append('E')
        elif opt in ('-s', '--squeeze-blank'):
            run_options.append('s')
        elif opt in ('-T', '--show-tabs'):
            run_options.append('T')

    # dict.fromkeys removes duplicates while preserving order; a plain
    # tuple(run_options) would keep them
    final_run_options = tuple(dict.fromkeys(run_options))
    for element in final_run_options:
        if element == 'b':
            content_list = number_nonempty_lines(content_list)
        elif element == 'n':
            content_list = number_all_lines(content_list)
        elif element == 'E':
            content_list = display_endline(content_list)
        elif element == 's':
            content_list = squeeze_blanks(content_list)
        elif element == 'T':
            content_list = show_tabs(content_list)

    So basically, you store the selected options in a list and deduplicate them in an order-preserving way (a plain tuple conversion would keep the duplicates). Once you have the final set, you parse it and change the actual content option by option.

    I didn’t have the time to test it, but there is no big reason why it shouldn’t work.

    Cheers

  • Linux ‘cat’ in Python – almost complete

    Morning,

    Since I am striving to find useful content to post more often, I gave myself the homework of writing ‘cat’ in Python.

    It’s not elegant, and it’s not the best version but it works.

    # -*- coding: utf-8 -*-
    """
    Created on Wed Dec 25 10:28:39 2019
    @author: Sorin
    """
    import sys, getopt, os

    if os.path.isabs(sys.argv[-1]):
        FILENAME = sys.argv[-1]
    else:
        # build the path portably instead of hard-coding a separator
        FILENAME = os.path.join(os.getcwd(), sys.argv[-1])
    
    def read_content(filename):
        try:
            # "r" is enough, since the file is only read
            f = open(filename, "r")
            content = f.read()
            f.close()
        except IOError as e:
            print("File could not be opened:", e)
            sys.exit(3)
        return content
        
    def transform_content():
        content = read_content(FILENAME)
        content_list = content.split('\n')
        return content_list

    def number_nonempty_lines(content_list):
        # cat -b: number only the non-blank lines, starting from 1
        n = 0
        for i, line in enumerate(content_list):
            if line != '':
                n += 1
                content_list[i] = str(n) + " " + line
        return content_list
    def squeeze_blanks(content_list):
        i = 0
        duplicate_index = []
        for line in content_list:
            if (line == "" or line == "$") or (str.isdigit(line.split(' ')[0]) and (line.split(' ')[-1] == "" or line.split(' ')[-1] == "$")):
                duplicate_index.append(i + 1)
            i = i + 1
        delete_index = []
        for j in range(len(duplicate_index) - 1):
            if duplicate_index[j] + 1 == duplicate_index[j + 1]:
                delete_index.append(duplicate_index[j])
        # pop from the end first, so the earlier indexes stay valid
        for element in reversed(delete_index):
            content_list.pop(element)
        return content_list
            
    def number_all_lines(content_list):
        # cat -n: number every line, starting from 1
        for i, line in enumerate(content_list):
            content_list[i] = str(i + 1) + " " + line
        return content_list
    
    def display_endline(content_list):
       return [line + "$" for line in content_list]
    
    def show_tabs(content_list):
        # display TAB characters as ^I, like cat -T
        content_list = [line.replace('\t', '^I') for line in content_list]
        return content_list
    
    content_list = transform_content()
    try:
        opts, args = getopt.gnu_getopt(sys.argv[1:-1], 'AbeEnstTv',
                                       ['show-all', 'number-nonblank', 'show-ends',
                                        'number', 'show-blank', 'squeeze-blank',
                                        'show-tabs', 'show-nonprinting', 'help', 'version'])
    except getopt.GetoptError:
        print("Something went wrong")
        sys.exit(2)
    for opt, arg in opts:
        if opt in ('-A','--show-all'):
            content_list = display_endline(content_list)
            content_list = show_tabs(content_list)
        elif opt in ('-b', '--number-nonblank'):
            content_list = number_nonempty_lines(content_list)
        elif opt in ('-n', '--number'):
            content_list = number_all_lines(content_list)
        elif opt in ('-E', '--show-ends'):
            content_list = display_endline(content_list)
        elif opt in ('-s', '--squeeze-blank'):
            content_list = squeeze_blanks(content_list)
        elif opt in ('-T', '--show-tabs'):
            content_list = show_tabs(content_list)
    print('\n'.join(content_list))

    Further improvements will also be posted. I must confess that there are still a couple of things to be fixed, like not running the same option twice and making it work on very large files, but it will do in this form for now.

    Cheers