• Adding schema to BigQuery table

    I am currently working on a bigger article for Medium, but until I can put it into words so that it’s really worth sharing, I just wanted to add one of the steps that I learned while working.

    I wanted to load some info that was stored in a CSV file into BigQuery for analysis and filtering.

    I thought that you could just add the header to the CSV file and it would automatically be recognized and loaded.

    Turns out that it’s a little bit more complicated.

    Normally it should work, and since it did with my first CSV, I couldn’t really understand what was wrong.

    Now, there are two parts to this story:

    • How you can add the table schema manually, which will reveal the actual issue
    • What you need to be aware of and why this happens

    How can you add the table schema manually

    The data that is written to the CSV actually comes from a DataFrame, so you have info about the column types directly from code:

    import json
    
    dtype_mapping = {
        'object': 'STRING',
        'int64': 'FLOAT',
        'float64': 'FLOAT'
    }
    
    schema = []
    for column, dtype in df.dtypes.items():
        schema.append({
            'name': column,
            'type': dtype_mapping.get(str(dtype), 'STRING')  # Default to STRING if type is not in map
        })
    
    print(json.dumps(schema, indent=2))

    Yes, I know, int64 should be mapped to INTEGER, but it turns out that in my case some columns, even if they are marked as int64 in Python, need to be FLOAT in BigQuery. I know there is more memory allocated, but the dataset is quite small.

    You can then write the data to the CSV while excluding the header:

    df.to_csv("df.csv", index=False, header=False, mode='a')

    The first piece of code generates a schema from the DataFrame’s column types, and this one appends the data without a header line.
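
    To put it together, here is a minimal sketch of loading that header-less CSV with the generated schema, assuming the google-cloud-bigquery client library and a hypothetical destination table my_dataset.my_table:

    from google.cloud import bigquery
    
    # Hypothetical destination; replace with your own dataset and table
    client = bigquery.Client()
    table_id = "my_dataset.my_table"
    
    job_config = bigquery.LoadJobConfig(
        schema=[bigquery.SchemaField(field['name'], field['type']) for field in schema],
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=0,  # the CSV was written without a header
    )
    
    with open("df.csv", "rb") as f:
        load_job = client.load_table_from_file(f, table_id, job_config=job_config)
    load_job.result()  # wait for the load job to finish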

    What do you need to be aware of and why this happens

    The actual reason why this happened is that I was not aware that somewhere in my CSV file a line with the header definition still remained (yes, I actually wrote multiple dataframes with headers and filtered with a Linux command, and it did not work).

    And when the file loaded, I actually saw this:

    Normally, if that line had been missing and there was no header, it should have looked like this:

    Things to be learned from this exercise:

    • If you have to write multiple dataframes to a CSV file, don’t add the header and use the code above to generate a specific schema definition
    • Properly check that the CSV doesn’t have rogue lines that don’t match the rest of the structure of the data; otherwise you will find out that everything is converted to STRING and you won’t understand why (a small check is sketched below)
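
    A minimal version of such a check, assuming the header-less df.csv and the schema list generated earlier:

    # Rebuild the header line from the generated schema and flag leftovers
    header_line = ",".join(field['name'] for field in schema)
    
    with open("df.csv") as f:
        for line_no, line in enumerate(f, start=1):
            if line.strip() == header_line:
                print(f"Rogue header found at line {line_no}")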

    And now you know.

    Sorin

  • Data analysis in investing

    I recently wrote a piece on Medium with a quick analysis of the correlation between multiple indices in China, Korea and Japan. I really hope you enjoy it:

    https://medium.com/@sorin.tudor/data-science-in-investing-101-51a40281f28a

    Cheers,

    Sorin

  • How to bring down a Kafka cluster through no fault of your own.

    I will keep this short, since sharing screens or details would be a security breach 😄:


    1. Google decides to migrate 2 of your VMs within the space of 5 minutes (lucky us).
    2. The cluster becomes unbalanced, and of course the Kafka Manager you have installed locally doesn’t make your life any easier (thanks, Yahoo).
    3. You receive a message from monitoring a couple of hours later that you have offline partitions (but why?).
    4. You find out that one of your nodes is not responding anymore, so you restart it.
    5. After replication you see that your other node has increased IOWait and load, so you take action and restart it as well.
    6. News flash: your partitions are not replicating anymore. But why? You take a look at the topics and see that the ISR contains only the node you restarted the first time, which is also the leader of all partitions… wow, strange… but it isn’t, because you changed a very important parameter, unclean.leader.election.enable=false, and now, yes, comes the AHA moment.
    7. What the hell do you do, because that parameter prevents Kafka from moving the leader to a node that is not in the ISR.
    8. You decide that the actual issue is with the first node you restarted, which is also the controller, so you restart it to force the cluster to resume replication, even if that means downtime and offline partitions.
    9. The cluster restarts the replication process and the other nodes are brought back into the ISR.
    10. You manually rebalance the cluster partitions so that the first node does not get overwhelmed.
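
    As a side note, here is a minimal sketch of how you could spot partitions with a shrunken ISR yourself, assuming the confluent-kafka Python package and a hypothetical broker address:

    from confluent_kafka.admin import AdminClient
    
    # Hypothetical bootstrap server; point it at one of your brokers
    admin = AdminClient({"bootstrap.servers": "broker1:9092"})
    
    metadata = admin.list_topics(timeout=10)
    for topic_name, topic in metadata.topics.items():
        for partition_id, partition in topic.partitions.items():
            # Flag partitions whose ISR is smaller than the replica set
            if len(partition.isrs) < len(partition.replicas):
                print(f"{topic_name}[{partition_id}]: leader={partition.leader} "
                      f"isr={partition.isrs} replicas={partition.replicas}")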

  • Small script to retrieve the OS version from GCP compute engine

    I am adding here a small script that was drafted with AI but tested and modified by me, so that it returns the OS version for all of the VM instances under a GCP project.

    #!/bin/bash
    
    # Prompt for the project ID
    read -p "Enter your Google Cloud Project ID: " PROJECT_ID
    
    if [ -z "$PROJECT_ID" ]; then
      echo "Project ID cannot be empty."
      exit 1
    fi
    
    echo "Fetching VM instances and their OS versions for project: $PROJECT_ID"
    echo "--------------------------------------------------------------------"
    echo "Instance Name        | Zone                 | OS (from inventory)"
    echo "---------------------|----------------------|---------------------"
    
    # Get all instances (name and zone) in the project
    # The `instances list` command can list instances across all zones if no --zones flag is specified.
    # However, `instances describe` requires a specific zone.
    # So, we list name and zone, then describe each.
    gcloud compute instances list --project="$PROJECT_ID" --format="value(name,zone)" | while read -r INSTANCE_NAME ZONE; do
      # Get the OS long name from the instance's OS inventory
      # (requires OS inventory / the OS Config agent to be enabled on the VM)
      OS_INFO=$(gcloud compute instances os-inventory describe "$INSTANCE_NAME" \
        --zone="$ZONE" \
        --project="$PROJECT_ID" \
        --format="value(SystemInformation.LongName)" 2>/dev/null)
    
      # If no OS inventory info was found, display a placeholder
      if [ -z "$OS_INFO" ] || [ "$OS_INFO" = "None" ]; then
        OS_VERSION="N/A or Custom"
      else
        # Keep the long name returned by the inventory as-is
        OS_VERSION="$OS_INFO"
      fi
    
      # Print the instance name, zone, and OS version in a formatted way
      printf "%-20s | %-20s | %s\n" "$INSTANCE_NAME" "$ZONE" "$OS_VERSION"
    done
    
    echo "--------------------------------------------------------------------"

    I don’t think any other details are needed in regard to this.

    Cheers,

    Sorin

  • Setting an environment variable in Jupyter

    Short one. I wanted to set up an API key environment variable in such a way that it’s easily available in Jupyter.

    It turns out that the easiest way is by adding it to IPython startup directly.

    You can find the path under ~/.ipython/profile_default/startup/00-startup.py, and if the file is not present, create it.

    The code that you add in this file is Python and can look like the following:

    import os
    os.environ['MY_VAR'] = 'my_value'

    After you save the edited file, don’t forget to restart the kernel.
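
    A quick way to verify, in a notebook cell after the restart, that the variable is visible:

    import os
    
    # Should print 'my_value' if the startup file was picked up
    print(os.environ.get('MY_VAR'))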

    Cheers

  • Transforming from dict to separate columns

    I queried, some time ago, traceroute info for the IPs in a list and took the wrong decision to save it in the following form.

    At first sight it’s not really an issue, but if you want to retrieve it and create a dataframe, it will not look as expected.
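
    For illustration, a hypothetical dataframe in that shape (the whole traceroute result stored as a dict in a single payload column) could look like this:

    import pandas as pd
    
    # Hypothetical sample data in the shape described above
    trace_rt_df = pd.DataFrame({
        'source_ip': ['1.1.1.1', '8.8.8.8'],
        'payload': [
            {'1': '192.168.0.1', '2': '10.10.0.1'},
            {'1': '192.168.0.1', '2': '10.10.0.2'},
        ],
    })
    print(trace_rt_df)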

    Here is the actual code to separate the payload column into columns defined by the keys of the dictionary:

    import pandas as pd
    
    # 'trace_rt_df' is the DataFrame and 'payload' is the column containing dictionaries
    df_expanded = pd.json_normalize(trace_rt_df['payload'])
    
    # Rename columns to match original keys (optional)
    df_expanded.columns = df_expanded.columns.map(lambda x: x.split('.')[-1])
    
    # Concatenate the expanded columns with the original DataFrame
    df_final = pd.concat([trace_rt_df, df_expanded], axis=1)
    
    # Drop the original 'payload' column (optional)
    df_final.drop('payload', axis=1, inplace=True)

    After the processing, the dataframe will look like this:

    That’s all.

    Sorin

  • Powerlifting play ground

    It’s true that I switched a little bit from this blog to Medium in the hope of a larger audience, but I will leave the link here as well:

    https://medium.com/@sorin.tudor/powerlifting-play-ground-582c7e4da17d

    Hopefully even more analysis will come.

    Cheers!

  • Plotting unique values in a dataframe column

    Today, I began experimenting with a Kaggle dataset for the first time. Here is the link to it: https://www.kaggle.com/datasets/open-powerlifting/powerlifting-database

    To make a long story short, if you want to plot the count of the unique values in a particular column of that dataframe, you can start with the following commands:

    mycategory = powerlifting[powerlifting['Best3DeadliftKg'] == 140.0]
    mycategory.dropna(subset=['Country'], inplace=True)

    This allows you to filter the records that you are interested in, as well as remove the rows that have NaN in the Country column.

    The last step involves counting the values in the Country column and creating a bar plot:

    country_series = mycategory['Country'].value_counts()
    country_series.head(10).plot(kind='bar')

    We will select only the first 10 countries in the series, since the actual number of unique countries in the dataset is 118.

    And the plot looks like this.

    I wanted to document it because it’s an operation that I have done before, and the actual feedback from the chatbot, when asked what the solution was, was quite a bit more complex.

    First of all, it consisted of code to filter the values and save the definition of the bar plot to a JSON file using the vega-altair library (a subject for exploration at some moment in time – https://altair-viz.github.io/), and then of code that parses that data to create the plot.

    The first part worked without any issues, with the following piece of code:

    import altair as alt
    import pandas as pd
    
    # Assuming df is your DataFrame (from the previous example)
    
    # 1. Filter to column 'Country' and calculate unique value counts
    value_counts = mycategory['Country'].value_counts()
    
    # 2. Convert to DataFrame for plotting
    value_counts_df = pd.DataFrame({'Value': value_counts.index, 'Count': value_counts.values})
    
    # 3. Create bar plot
    chart = alt.Chart(value_counts_df).mark_bar().encode(
        x=alt.X('Value:N', axis=alt.Axis(title='Value')),
        y=alt.Y('Count:Q', axis=alt.Axis(title='Count')),
        tooltip=['Value', 'Count']
    ).properties(
        title='Value Counts for Column Country'
    )
    
    # Save the chart
    chart.save('value_counts_column_a_bar_chart.json')

    But the second part was a little bit tricky, since the chatbot didn’t have a real idea of how the structure of the file was formed, suggesting:

    import plotly.graph_objects as go
    import json
    
    # Load JSON data
    with open('value_counts_column_a_bar_chart.json', 'r') as f:
        chart_data = json.load(f)
    
    # Extract data for plotting correctly, handling both nominal and quantitative data
    if 'Column Name' in chart_data['datasets'][0]['data'][0]:
        # Nominal data (e.g., categories like 'A', 'B', 'C')
        x = [d['Column Name'] for d in chart_data['datasets'][0]['data']]
        y = [d['Unique Value Count'] for d in chart_data['datasets'][0]['data']]
    else:
        # Quantitative data (e.g., numeric values)
        x = [d['Value'] for d in chart_data['datasets'][0]['data']]
        y = [d['Count'] for d in chart_data['datasets'][0]['data']]
    
    # Create Plotly bar chart
    fig = go.Figure([go.Bar(x=x, y=y)])
    
    # Customize layout (optional)
    fig.update_layout(
        title='Value Counts for Column A',
        xaxis_title='Value',
        yaxis_title='Count'
    )
    
    # Show the chart
    fig.show()
    

    If you try to run that, it will return a KeyError: 0, which is caused by the chart_data['datasets'][0] index, which is not correct.
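
    A quick way to see what keys are actually present is to print the top level of the 'datasets' object:

    import json
    
    with open('value_counts_column_a_bar_chart.json', 'r') as f:
        chart_data = json.load(f)
    
    # Altair stores the data under an auto-generated key inside 'datasets'
    print(list(chart_data['datasets'].keys()))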

    Taking a look at the actual JSON structure, you quickly find that the right key is data-85d48ef46f547bd16ab0f88b32c209fb, which brings us to the correct version:

    import plotly.graph_objects as go
    import json
    
    # Load JSON data
    with open('value_counts_column_a_bar_chart.json', 'r') as f:
        chart_data = json.load(f)
    
    # Extract data for plotting correctly, handling both nominal and quantitative data
    if 'Column Name' in chart_data['datasets']['data-85d48ef46f547bd16ab0f88b32c209fb']:
        # Nominal data (e.g., categories like 'A', 'B', 'C')
        x = [d['Column Name'] for d in chart_data['datasets']['data-85d48ef46f547bd16ab0f88b32c209fb']]
        y = [d['Unique Value Count'] for d in chart_data['datasets']['data-85d48ef46f547bd16ab0f88b32c209fb']]
    else:
        # Quantitative data (e.g., numeric values)
        x = [d['Value'] for d in chart_data['datasets']['data-85d48ef46f547bd16ab0f88b32c209fb']]
        y = [d['Count'] for d in chart_data['datasets']['data-85d48ef46f547bd16ab0f88b32c209fb']]
    
    # Create Plotly bar chart
    fig = go.Figure([go.Bar(x=x, y=y)])
    
    # Customize layout (optional)
    fig.update_layout(
        title='Value Counts for Country',
        xaxis_title='Value',
        yaxis_title='Count'
    )
    
    # Show the chart
    fig.show()

    Or an even more elegant one suggested by the LLM:

    import plotly.graph_objects as go
    import json
    
    # Load JSON data
    with open('value_counts_column_a_bar_chart.json', 'r') as f:
        chart_data = json.load(f)
    
    # Find the correct data key within 'datasets'
    data_key = list(chart_data['datasets'].keys())[0]
    
    # Extract data for plotting correctly, handling both nominal and quantitative data
    if 'Column Name' in chart_data['datasets'][data_key][0]:
        # Nominal data (e.g., categories like 'A', 'B', 'C')
        x = [d['Column Name'] for d in chart_data['datasets'][data_key][:10]]
        y = [d['Unique Value Count'] for d in chart_data['datasets'][data_key][:10]]
    else:
        # Quantitative data (e.g., numeric values)
        x = [d['Value'] for d in chart_data['datasets'][data_key][:10]]
        y = [d['Count'] for d in chart_data['datasets'][data_key][:10]]
    
    # Create Plotly bar chart
    fig = go.Figure([go.Bar(x=x, y=y)])
    
    # Customize layout (optional)
    fig.update_layout(
        title='Value Counts for Country',
        xaxis_title='Value',
        yaxis_title='Count'
    )
    
    # Show the chart
    fig.show()

    Somewhere in the code there is also a trim to the first 10 values, in order to have similar outcomes. But the final plot looks like it should.

    And that, my friends, is a small “pill” of working with data analysis, assisted by an LLM.

    Cheers

  • Getting unique IPs from MongoDB and using them for traceroute queries

    Hi,

    Some time ago, I tried to write some Python code to grab each unique IP from my traffic logs and trace it, so that we could find similar nodes that were used for the jumps.

    This is also a good exercise for improving basic dataframe knowledge and a good baseline for some explorations.

    I will put the code here so that it is available for me as a reference, and also maybe for you if you want to take pieces of it.

    I know that it is not optimised, but you can modify it or maybe use a chatbot to improve it.

    import pymongo
    import scapy.all as scapy
    
    myclient = pymongo.MongoClient("mongodb://localhost:27017/")
    db = myclient["mydatabase"]
    read_col = db["unique_ip"]             # documents with an 'id' and a 'payload' list of IPs
    write_col = db["unique_ip_trace_tcp"]  # traceroute results get written here
    
    # Find the last traced IP so the script can resume where it left off
    lastid = 0
    index_last_ip = write_col.find().sort([('_id', -1)]).limit(1)
    for doc in index_last_ip:
        index_id = read_col.find({"payload": doc['source_ip']})
        for elem in index_id:
            lastid = elem['id']
    print(lastid)
    
    # Trace every IP that has not been processed yet
    for i in range(lastid, read_col.count_documents({})):
        mydoc = read_col.find({"id": i})
        for x in mydoc:
            try:
                for element in x['payload']:
                    response = {}
                    # traceroute returns answered and unanswered probes
                    ans, unans = scapy.traceroute(element)
                    response['source_ip'] = element
                    payload = {}
                    for sdr, rcv in ans:
                        # key: probe TTL, value: IP of the responding hop
                        payload[str(sdr.ttl)] = rcv.src
                    response['payload'] = payload
                    write_col.insert_one(response)
            except Exception as e:
                print(e)
                continue

    That would be all.

    Cheers!

  • Fixing “local issuer certificate” in Python

    I am putting this here more as a reference, since this error appeared a couple of times when using the urllib library.

    For example, there is the classical case of:

    from urllib.request import urlopen
    
    from bs4 import BeautifulSoup 
    
    
    html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')
    bs = BeautifulSoup(html, 'html.parser')
    for link in bs.find_all('a'):
        if 'href' in link.attrs:
            print(link.attrs['href'])

    And if you run it for the first time in your Jupyter kernel, it will return:

    URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)>

    The easiest way to fix it is by adding two extra lines to the code:

    from urllib.request import urlopen
    import ssl
    from bs4 import BeautifulSoup 
    
    ssl._create_default_https_context = ssl._create_unverified_context
    
    html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')
    bs = BeautifulSoup(html, 'html.parser')
    for link in bs.find_all('a'):
        if 'href' in link.attrs:
            print(link.attrs['href'])

    The first one imports the ssl library and the second one actually creates an unverified context (which means certificate verification is skipped for HTTPS requests).
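
    If you prefer not to patch the module-wide default context, a variant of the same workaround is to pass an unverified context only to the call that needs it:

    import ssl
    from urllib.request import urlopen
    
    # Scoped to this single request instead of changing the global default
    ctx = ssl._create_unverified_context()
    html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon', context=ctx)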

    The interesting fact is that once this is loaded into the kernel, it will keep working even if you comment out the lines and re-execute the cell.

    That would be all.

    Cheers!