• Claude MCP config with OpenBB and Wolfram

    Putting this as reference. An article will follow in the next days:

    {
      "mcpServers": {
        "openbb": {
          "command": "/Users/tudorsorin/OpenBB/conda/envs/openbb/bin/openbb-mcp",
          "args": [
            "--transport",
            "stdio",
            "--default-categories",
            "equity,economy"
          ]
        },
        "Wolfram": {
          "command": "/Applications/Wolfram Engine.app/Contents/Resources/Wolfram Player.app/Contents/MacOS/wolfram",
          "args": [
            "-run",
            "PacletSymbol[\"Wolfram/MCPServer\",\"Wolfram`MCPServer`StartMCPServer\"][]",
            "-noinit",
            "-noprompt"
          ],
          "env": {
            "MCP_SERVER_NAME": "Wolfram"
          }
        }
      },
      "preferences": {
        "sidebarMode": "chat"
      }
    }

  • Data science in investing 102

    Since mostly I play with financial data and data science concepts, I think it’s a good idea to create a series called Data science in investing in which I share the small things I learn.

    For today we have Data science in investing 102 at this link, enjoy: https://medium.com/@sorin.tudor/data-science-in-investing-102-54dd5b0550e3

  • Guide: How to Run Gemini in OpenBB Workspace (Local Setup)

    Prerequisites

    • Python 3.10 or 3.11 (Strictly required; 3.12+ is incompatible).
    • Google AI Studio API Key (Free tier works).
    • Conda or a virtual environment manager.

    1. Architecture Overview

    Because OpenBB Agents are hardcoded for OpenAI’s SDK, we cannot connect Gemini directly. We use a “Bridge” architecture:

    OpenBB Workspace → Local Agent Server (Port 8095) → LiteLLM Bridge (Port 4000) → Google Gemini API

    2. Environment Setup (The “Bridge”)

    First, we set up LiteLLM, a tool that translates OpenAI HTTP requests into Google Gemini requests.

    1. Install LiteLLM:
      Bash
      pip install litellm
    2. Run the Bridge:
      Open a dedicated terminal window (keep this running) and start the proxy. We use the Flash model because it supports the large context windows required for analyzing financial datasets.
      Bash
      # Replace with your actual model name if different
      export GEMINI_API_KEY=your_google_key_here
      litellm –model gemini/gemini-latest-flash –port 4000

      Result: You now have an “OpenAI-compatible” server running at http://127.0.0.1:4000.

    3. Agent Installation (The “Brain”)

    The official openbb-agents package on PyPI is often outdated. We must install the agent server from the source to avoid dependency conflicts.

    1. Create a Clean Environment:
      Bash
      conda create -n bb-agents python=3.11 -y
      conda activate bb-agents
    2. Install the Core Library (Editable Mode):
      Clone the library repository and install it so Python can find the base modules.
      Bash
      git clone https://github.com/OpenBB-finance/openbb-agents.git
      cd openbb-agents
      pip install -e . (there are some dependencies to be fixed, it will want pydantic = “==2.7.1” and you need to update it under [home_dir]/OpenBB/openbb-agents/pyproject.toml to pydantic = “>=2.7.1”)
    3. Get the Runnable Examples:
      Clone the secondary repository that contains the actual server implementations.
      Bash
      cd .. # Go back to your root folder
      git clone https://github.com/OpenBB-finance/agents-for-openbb.git
      cd agents-for-openbb

    4. Configuration & Launch

    We will run the “Vanilla Agent” (Agent #30), which can read widgets from your dashboard. We must configure it to talk to our local Bridge (Port 4000) instead of OpenAI servers.

    1. Navigate to the Agent:
      Bash
      cd 30-vanilla-agent-raw-widget-data
    2. Install Dependencies:
      Bash
      pip install langchain-openai fastapi uvicorn python-dotenv
    3. Launch with Redirection:
      Run this command block to inject the configuration and start the server.
    • OPENAI_BASE_URL: Points the agent to our local LiteLLM bridge.
    • OPENAI_API_KEY: A dummy key (required to pass validation checks).

    Bash
    export OPENAI_BASE_URL=”http://127.0.0.1:4000″
    export OPENAI_API_KEY=”sk-fake-key”

    uvicorn main:app –host 127.0.0.1 –port 8095 –reload
    Success Indicator: You will see Uvicorn running on http://127.0.0.1:8095.


    5. Connecting to OpenBB

    1. Open OpenBB Workspace (Web or Desktop).
    2. Open the Copilot Menu (Sidebar).
    3. Click (+) to Add Copilot.
    4. Enter the URL: http://127.0.0.1:8095.
    5. Click Add.

    Verification: You should see a green indicator next to “Vanilla Agent.”

    6. How to Use It

    The Vanilla Agent does not have internet access; it relies on context.

    1. Add a widget to your dashboard (e.g., Stock Price for AAPL).
    2. Click the @ (Add to Chat) button on the widget header.
    3. Ask the chat: “Summarize the price trend in this widget.”

    Result: The agent retrieves the JSON data from the widget, sends it to Gemini (via LiteLLM), and returns the analysis.

  • WordPress Migration Guide: From Old Host to RackNerd

    This guide outlines the professional process of moving two WordPress installations to RackNerd. By using a “Backdoor Subdomain,” we ensure the old site remains accessible for the migration “pull” even after the primary domain is pointed to the new server.


    Phase 1: Preparation

    1. Backup Everything: Perform a full backup of your files and database on the old host.
    2. Gather Credentials:
      • Old Host: cPanel login or FTP/IP details.
      • New Host (RackNerd): Shared IP address, cPanel username, and password (found in your “Welcome Email”).
    3. Network Solutions Access: Ensure you have login access to your domain registrar to manage DNS/Nameservers. Setup the RackNerd nameservers to be used for the domain

    Phase 2: Setting Up RackNerd (Destination)

    1. Add the Domains:
      • Login to RackNerd cPanel.
      • Go to Domains > Addon Domains.
      • Add your domain.
    2. Install Fresh WordPress (Optional but helpful): Use the Softaculous installer in RackNerd to install a blank WordPress site on both domains. This ensures the database and folders are ready for Migrate Guru to overwrite.

    Phase 3: The “Backdoor Subdomain” (Source Setup)

    This step allows you to access the old site dashboard while the main domain is live on RackNerd.

    1. Create the DNS Record:
      • In your active DNS manager (Network Solutions or RackNerd, wherever the nameservers point), add an A Record.
      • Host: old
      • Points to: The IP Address of your OLD host (you can take it from the cPanel interface).
    2. Configure Old cPanel:
      • On the old host, create the subdomain old.yourdomain.com.
      • Crucial: Set the Document Root to point to the existing site folder (usually /public_html/).
    3. Override WordPress URL:
      • On the old host, open wp-config.php via File Manager.
      • Add these lines at the very top (after <?php):PHPdefine( 'WP_HOME', 'http://old.yourdomain.com' ); define( 'WP_SITEURL', 'http://old.yourdomain.com' );
      • Troubleshooting: If it still redirects, rename your .htaccess file to .htaccess_bak on the old host.

    Phase 4: Migration with Migrate Guru

    1. Access Old Site: Log in to http://old.yourdomain.com/wp-admin.
    2. Install Migrate Guru: Search for and activate the “Migrate Guru” plugin.
    3. Configure Migration:
      • Click that it is installed on both env copy key from the new site on the old one.
    4. Initiate: Click “Migrate.” The plugin will clone the site to RackNerd in the background.

    Phase 5: Testing & Going Live

    1. Verify that everything functions correctly
    2. SSL Installation: In RackNerd cPanel, go to SSL/TLS Status and run “AutoSSL” to ensure your site has a valid HTTPS certificate.

    Phase 6: Post-Migration Cleanup

    1. New Server Config: On the RackNerd server, open wp-config.php and ensure those “old” subdomain lines are not present. If they were copied over, delete them.
    2. Permalinks: In the new WordPress dashboard, go to Settings > Permalinks and click “Save Changes” twice to refresh the URL structure.
    3. Decommission: Once you are 100% sure the migration is successful, you can cancel your old hosting account.

  • Calculate total bucket usage by project

    Here is another script that can be used to calculate the total bucket usage per project from the cloud monitoring metrics.

    import time
    from google.cloud import monitoring_v3
    from google.protobuf.duration_pb2 import Duration
    
    def format_bytes(bytes_value):
        """Converts bytes to a human-readable format."""
        if bytes_value is None:
            return "N/A"
        
        bytes_value = float(bytes_value)
        units = ['Bytes', 'KiB', 'MiB', 'GiB', 'TiB', 'PiB']
        i = 0
        size = bytes_value
        while size >= 1024 and i < len(units) - 1: 
            size /= 1024
            i += 1
        
        return f"{size:.2f} {units[i]}"
    
    def get_total_gcs_storage_for_project(project_id):
        """
        Fetches the total_bytes metric, summed across ALL GCS buckets in a project.
        
        Args:
            project_id (str): Your Google Cloud project ID.
        
        Returns:
            float: The total size of all buckets in bytes, or None if no data is found.
        """
        client = monitoring_v3.MetricServiceClient()
        
        # FIX 1: Use manual path construction for older library versions
        project_name = f"projects/{project_id}"
        
        # 1. Define the Time Interval (last 2 days)
        now = int(time.time()) # Get current time as integer seconds
        
        # FIX 2: Instantiate TimeInterval and TimeStamp components directly
        interval = monitoring_v3.TimeInterval(
            end_time={'seconds': now},  # TimeStamp using dict for seconds
            start_time={'seconds': now - 2 * 86400} 
        )
    
        # 2. Define the Metric Filter
        metric_filter = (
            'metric.type = "storage.googleapis.com/storage/total_bytes" '
            'AND resource.type = "gcs_bucket"'
        )
        
        # 3. Define Aggregation for Summing All Buckets
        aggregation = monitoring_v3.Aggregation(
            # FIX 3: Instantiate Duration object manually
            alignment_period=Duration(seconds=86400), # 1 day
            
            per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_MAX,
            cross_series_reducer=monitoring_v3.Aggregation.Reducer.REDUCE_SUM,
            group_by_fields=[],
        )
    
        # 4. List Time Series
        try:
            results = client.list_time_series(
                request={
                    "name": project_name,
                    "filter": metric_filter,
                    "interval": interval,
                    "aggregation": aggregation,
                    "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
                }
            )
        except Exception as e:
            print(f"Error querying Cloud Monitoring: {e}")
            return None
    
        # Process the result (should contain only one time series with one point)
        for result in results:
            if result.points:
                total_bytes = result.points[0].value.double_value
                return total_bytes
                
        return None
    
    # --- Configuration ---
    GCP_PROJECT_ID = "PROJECT_ID"  # <-- CHANGE THIS
    # ---------------------
    
    if __name__ == "__main__":
        total_size_bytes = get_total_gcs_storage_for_project(GCP_PROJECT_ID)
        
        print(f"Project ID: {GCP_PROJECT_ID}")
        print("---")
        
        if total_size_bytes is not None and total_size_bytes >= 0:
            print(f"Total GCS Storage Used (Raw Bytes): {int(total_size_bytes)}")
            print(f"Total GCS Storage Used (Human-readable): {format_bytes(total_size_bytes)}")
        else:
            print("Could not retrieve total storage metric. Check the project ID, permissions, or if Cloud Monitoring is enabled.")

    Tnx

  • Checking if STOXX 50 is historically overvalued

    This is not my script, but I needed to share this because it’s impressive.

    Here is the generated script by Gemini to check if the index is overvalued

    import requests
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    import os
    
    # ==========================================
    # 1. CONFIGURATION
    # ==========================================
    API_KEY = os.environ.get('FMP_KEY')  # <--- REPLACE THIS
    BASE_URL = 'https://financialmodelingprep.com/api/v3'
    
    # Top STOXX 50 Heavyweights (approx 60-70% of index weight)
    # We use a representative subset to keep the script fast.
    # For production, use the full 50 list.
    stoxx_tickers = [
        # --- FRANCE (approx. 18) ---
        "MC.PA",     # LVMH Moet Hennessy Louis Vuitton
        "OR.PA",     # L'Oreal
        "RMS.PA",    # Hermes International
        "TTE.PA",    # TotalEnergies
        "SAN.PA",    # Sanofi (Note: Do not confuse with SAN.MC)
        "AIR.PA",    # Airbus
        "SU.PA",     # Schneider Electric
        "AI.PA",     # Air Liquide
        "BNP.PA",    # BNP Paribas
        "CS.PA",     # AXA
        "DG.PA",     # Vinci
        "EL.PA",     # EssilorLuxottica
        "SAF.PA",    # Safran
        "KER.PA",    # Kering
        "RI.PA",     # Pernod Ricard
        "BN.PA",     # Danone
        "SGO.PA",    # Saint-Gobain
        "GLE.PA",    # Societe Generale (Often on the edge of inclusion)
    
        # --- GERMANY (approx. 15) ---
        "SAP.DE",    # SAP SE
        "SIE.DE",    # Siemens AG
        "ALV.DE",    # Allianz SE
        "DTE.DE",    # Deutsche Telekom
        "MBG.DE",    # Mercedes-Benz Group
        "BMW.DE",    # BMW
        "VOW3.DE",   # Volkswagen (Preferred)
        "MUV2.DE",   # Munich Re
        "BAS.DE",    # BASF
        "BAYN.DE",   # Bayer
        "ADS.DE",    # Adidas
        "DHL.DE",    # DHL Group (formerly Deutsche Post DPW.DE)
        "DB1.DE",    # Deutsche Boerse
        "IFX.DE",    # Infineon Technologies
        "ENR.DE",    # Siemens Energy (Recent addition candidate)
        # "DBK.DE",  # Deutsche Bank (Check if re-added in Sep 2025 rebalance)
        
        # --- NETHERLANDS (approx. 6) ---
        "ASML.AS",   # ASML Holding
        "ADYEN.AS",  # Adyen
        "INGA.AS",   # ING Groep
        "AD.AS",     # Ahold Delhaize
        "PHIA.AS",   # Philips
        "PRX.AS",    # Prosus
    
        # --- SPAIN (approx. 4) ---
        "ITX.MC",    # Inditex
        "IBE.MC",    # Iberdrola
        "SAN.MC",    # Banco Santander (Note: Suffix .MC)
        "BBVA.MC",   # BBVA
    
        # --- ITALY (approx. 4) ---
        "ISP.MI",    # Intesa Sanpaolo
        "ENEL.MI",   # Enel
        "ENI.MI",    # Eni
        "STLAM.MI",  # Stellantis (Listed in Milan and Paris, .MI often used for index)
        "RACE.MI",   # Ferrari (Recent addition)
        # "UCG.MI",  # UniCredit (Often rotates in/out)
    
        # --- BELGIUM (1) ---
        "ABI.BR",    # Anheuser-Busch InBev
    
        # --- FINLAND (2) ---
        "KNEBV.HE",  # Kone
        "NOKIA.HE"   # Nokia
    ]
    
    # ==========================================
    # 2. HELPER FUNCTIONS
    # ==========================================
    def get_json(endpoint):
        """Helper to handle FMP API requests"""
        url = f"{BASE_URL}/{endpoint}"
        if '?' in endpoint:
            url += f"&apikey={API_KEY}"
        else:
            url += f"?apikey={API_KEY}"
        
        try:
            response = requests.get(url)
            response.raise_for_status()
            return response.json()
        except Exception as e:
            print(f"Error fetching {endpoint}: {e}")
            return []
    
    def get_aggregated_pe(tickers, lookback_years=5):
        """
        Calculates the aggregate P/E of a list of tickers over time.
        (Sum of Market Caps) / (Sum of Net Income)
        """
        print(f"--- Building Index P/E for {len(tickers)} companies ---")
        
        df_earnings = pd.DataFrame()
        df_mcap = pd.DataFrame()
        
        # Limit data to keep script snappy
        limit_q = lookback_years * 4 + 4 
        limit_d = lookback_years * 365
        
        for ticker in tickers:
            # A. Fetch Earnings (Quarterly)
            inc_stmt = get_json(f"income-statement/{ticker}?period=quarter&limit={limit_q}")
            if inc_stmt:
                df = pd.DataFrame(inc_stmt)
                df['date'] = pd.to_datetime(df['date'])
                df.set_index('date', inplace=True)
                df = df.sort_index()
                # Calculate TTM Net Income (Sum last 4 quarters)
                ttm = df['netIncome'].rolling(4).sum()
                # Resample to daily (forward fill)
                df_earnings[ticker] = ttm.resample('D').ffill()
    
            # B. Fetch Market Cap (Daily)
            mcap = get_json(f"historical-market-capitalization/{ticker}?limit={limit_d}")
            if mcap:
                df = pd.DataFrame(mcap)
                df['date'] = pd.to_datetime(df['date'])
                df.set_index('date', inplace=True)
                df_mcap[ticker] = df['marketCap']
    
        # C. Aggregate
        # Align dates
        common_idx = df_earnings.index.intersection(df_mcap.index)
        df_earnings = df_earnings.loc[common_idx]
        df_mcap = df_mcap.loc[common_idx]
        
        # Sum components
        total_earnings = df_earnings.sum(axis=1)
        total_mcap = df_mcap.sum(axis=1)
        
        # Calculate P/E
        # Filter out potential data glitches (negative or zero sums)
        idx_pe = total_mcap / total_earnings
        idx_pe = idx_pe[idx_pe > 0] 
        
        return idx_pe
    
    
    # ==========================================
    # 3. MAIN EXECUTION
    # ==========================================
    
    # A. Compute STOXX 50 P/E
    stoxx_pe = get_aggregated_pe(stoxx_tickers)
    
    # B. (Optional) Get Benchmark. 
    # Since S&P 500 P/E is hard to get via free/basic calls, 
    # let's generate a synthetic line for comparison based on market averages if API fails,
    # or plot just the STOXX history.
    # For this script, we will focus on the STOXX Mean Reversion.
    
    # C. Calculate Statistics
    current_pe = stoxx_pe.iloc[-1]
    avg_pe_3y = stoxx_pe.rolling(window=365*3).mean().iloc[-1]
    min_pe = stoxx_pe.min()
    max_pe = stoxx_pe.max()
    
    print(f"\nRESULTS:")
    print(f"Current STOXX 50 P/E: {current_pe:.2f}")
    print(f"3-Year Average P/E:   {avg_pe_3y:.2f}")
    print(f"Min P/E (Period):     {min_pe:.2f}")
    print(f"Max P/E (Period):     {max_pe:.2f}")
    
    if current_pe < avg_pe_3y:
        print("CONCLUSION: UNDERVALUED relative to recent history.")
    else:
        print("CONCLUSION: OVERVALUED relative to recent history.")
    
    # ==========================================
    # 4. PLOTTING
    # ==========================================
    sns.set_style("darkgrid")
    plt.figure(figsize=(12, 6))
    
    # Plot Historical P/E
    plt.plot(stoxx_pe.index, stoxx_pe, label='STOXX 50 P/E (Calculated)', color='#003399', linewidth=2)
    
    # Plot Average Line
    plt.axhline(y=stoxx_pe.mean(), color='r', linestyle='--', label=f'Historical Mean ({stoxx_pe.mean():.1f}x)')
    
    # Add formatting
    plt.title('STOXX 50 Historical Valuation (P/E Ratio)', fontsize=16)
    plt.ylabel('Price / Earnings Ratio')
    plt.xlabel('Year')
    plt.legend()
    
    # Highlight Current Valuation
    plt.annotate(f'Current: {current_pe:.1f}x', 
                 xy=(stoxx_pe.index[-1], current_pe), 
                 xytext=(stoxx_pe.index[-1] - pd.Timedelta(weeks=20), current_pe + 2),
                 arrowprops=dict(facecolor='black', shrink=0.05))
    
    plt.tight_layout()
    plt.show()

    And the fact is that it works.

    This is impressive to say the least.

  • Update RECORD struct in BigQuery with data from other table

    I saw this use case in one of the challenge labs from Google skillboost, and it kind of intrigued me, so I decided to reproduce it.

    It’s quite easy to export historic data and ratings from Financial Modeling Prep for a certain stock. In my case that was Merck, a drug producer.

    So, I quickly drafted the following script:

    import requests
    import os
    import json
    import time
    
    FMP_KEY = os.getenv("FMP_KEY")
    symbol = input("Enter stock symbol: ").upper()
    start_date = input("Enter start date (YYYY-MM-DD): ")
    end_date = input("Enter end date (YYYY-MM-DD): ")
    url = f"https://financialmodelingprep.com/stable/historical-price-eod/full?symbol={symbol}&from={start_date}&to={end_date}&apikey={FMP_KEY}"
    def fetch_json(url):
        try:
            response = requests.get(url)
            response.raise_for_status()  # Raise an error for bad status codes
            return response.json()
        except requests.exceptions.RequestException as e:
            print(f"An error occurred: {e}")
            return None
        
    data = fetch_json(url)
    with open(f"{symbol}_historical_data.json", "w") as f:
       for item in data:
            f.write(json.dumps(item) + "\n")
    
    date_format = "%Y-%m-%d"
    t1 = time.mktime(time.strptime(start_date, date_format))
    t2 = time.mktime(time.strptime(end_date, date_format))
    days = int((t2 - t1) / 86400)
    
    
    url_rating=f"https://financialmodelingprep.com/stable/ratings-historical?symbol={symbol}&limit={days}&apikey={FMP_KEY}"
    data_rating = fetch_json(url_rating)
    with open(f"{symbol}_historical_ratings.json", "w") as f:
       for item in data_rating:
            f.write(json.dumps(item) + "\n")

    This script creates two New Line Delimiter JSON (which actually means that each JSON entry is a completely independent JSON on a line, and this is the BigQuery format supported for Schema Recognition)

    After the files are exported, I manually created two tables from these files called mkr_ratings and mrk_historical

    Good, now we have to denormalize some of the data and add our RECORD struct between the tables.

    Not wanting to waste time, I added the schema from GUI like

    [
        {
            "name": "price",
            "type": "RECORD",
            "mode": "NULLABLE",
            "fields": [
                {
                    "name": "close",
                    "type": "FLOAT",
                    "mode": "NULLABLE"
                },
                {
                    "name": "spread",
                    "type": "FLOAT",
                    "mode": "NULLABLE"
                },
                {
                    "name": "change",
                    "type": "FLOAT",
                    "mode": "NULLABLE"
                }
            ]
        }
    ]

    In the end the schemas looked like this:

    mkr_historical

    mkr_ratings

    And now we need to map data from historical to ratings in the following relation.

    mrk_historical.close AS price.close

    mrk_historical.spread AS price.spread

    mrk_historical.change AS price.change

    We also need a anchor field and that is in our case date, which is present in both tables.

    After trying to draft it by myself and review it and improve it using a little bit of LLM help, I reached the following form, which is annoyingly simple.

    UPDATE `[GCP_PROJECT]`.smallcaps.mrk_ratings AS mrk_ratings
    SET
      price = STRUCT(mrk_historical.close AS close, mrk_historical.spread AS spread, mrk_historical.change AS change)
    FROM
      `[GCP_PROJECT]`.smallcaps.mrk_historical AS mrk_historical
    WHERE
      mrk_ratings.date = mrk_historical.date;
    

    So, you just update the first table and give it an alias, set the actual RECORD as a STRUCT of the required fields from the second table, and at the end match the anchor column of both aliases. Sweet!

  • Adding schema to BigQuery table

    I am currently working on a bigger article for Medium but until I can put it words so that it’s really something really worth sharing, just wanted to add on one of the steps that I learned while working.

    Normally I wanted to load some info that was stored in CSV to BigQuery for analysis and filtering.

    I though that you can just add the header to CSV file and it will automatically recognize it and load it.

    Turns out that it’s a little bit more complicated.

    Normally it should work, and since it did with my first CSV, couldn’t really understand what was wrong.

    Now, there are two parts to this story:

    • How can you add the Table schema manually, and it will reveal the actual issue.
    • What do you need to be aware of and why this happens

    How can you add the table schema manually

    The data that is written to CSV is actually a Dataframe, so you have info about the types of the columns directly from code

    dtype_mapping = {
        'object': 'STRING',
        'int64': 'FLOAT',
        'float64': 'FLOAT'
    }
    
    schema = []
    for column, dtype in df.dtypes.items():
        schema.append({
            'name': column,
            'type': dtype_mapping.get(str(dtype), 'STRING') # Default to STRING if type is not in map
        })
    
    import json
    print(json.dumps(schema, indent=2))

    Yes, I know, int64 should be mapped to INTEGER, but it turns out that for my case some columns even if in Python are market as int64, in BigQuery they need to be FLOAT. I know there is more memory allocated but the dataset is quite small.

    So you can easily use so that you can exclude the header.

    df.to_csv(f"df.csv", index=False, mode='a')

    The above piece of code will help you create a SCHEMA from a Dataframe header

    What do you need to be aware of and why this happens

    The actually reason why this happened is because I was not aware that somewhere in my csv file a line with the header definition still remained (yes, I actually wrote multiple dataframes with header and filtered with a Linux command, and it did not work)

    And when the file loaded, I actually saw this:

    Normally if that line was missing and no header, that it should had looked like

    Things to be learned from this exercise:

    • If you have to write multiple dataframes in a CSV file, don’t add the header and use the above code to generate a specific definition of the schema
    • Properly check the CSV not to have rogue lines that don’t match the rest of the structure of the data, otherwise you will find out that everything is converted to string and you don’t understand why.

    And now you know.

    Sorin

  • Data science in investing 101

    I recently wrote a piece on Medium about a quick analysis of the correlation of multiple indexes in China, Korea and Japan, really hope you enjoy it

    https://medium.com/@sorin.tudor/data-science-in-investing-101-51a40281f28a

    Cheers,

    Sorin

  • How to bring down a Kafka cluster without your fault.

    I will keep it short as a message since sharing screens or details is a security breach 😄:


    1 Google decides to migrate 2 of you VMs in a space of 5 min (lucky us)


    2 The cluster becomes unbalanced, surely you also have Kafka Manager installed locally that doesn’t make your life any easier (Tnx Yahoo)


    3 You receive a message from monitoring a couple of hours later that you have offline partitions (but why?)


    4 You find out that one of you node is not responding anymore so you restart it


    5 After replication you see that your other node has increased IOWait and load so you take action to restart it as well


    6 News flash, your partitions are not replicating anymore, but why? So you take a look on the topics to see that ISR is only on the node you restarted the first time and that is also the leader of all partitions…..wow, strange….but it isn’t, because you changed a very important parameter…..unclean.leader.election.enabled=false , and now, yes, comes the AHA moment


    7 What the hell do you do because that parameters constrains Kafka from moving the leader to a node that it’s not in ISR


    8 You decide that the actual issue is with the first node you restarted which is also the controller, so you restart it to force the cluster to restart replication even if you will have downtime and offline partitions


    9 The cluster restarts the replication process and the other nodes are brought in ISR


    10 You manually rebalance the cluster partitions so that the first node does not get overwhelmed