Tag: scraping

  • Using Google Gemini to optimise my scraping code

    I wanted Gemini to take a look at the code that I posted here.

    Here is its answer:

    Absolutely! Here’s a breakdown of improvements and considerations for refining the provided web scraping code:

    Enhancements:

    • Robust Error Handling: Incorporate try-except blocks around the scraping logic and network calls. This will prevent the script from crashing due to unexpected issues like elements not being found or the website being temporarily down.
    • Optimized Waits: Replace the fixed time.sleep(2) with more precise WebDriverWait conditions. For instance, wait for specific elements that indicate successful loading. This improves efficiency.
    • Reusable Function: Convert the alcopascrap function to accept additional parameters (like wait times) for greater flexibility in various scraping scenarios.
    • Data Cleaning: Within the loop, sanitize the extracted data, removing unnecessary whitespace or non-standard characters to ensure data consistency.
    • Comments: Add clear comments to explain the purpose of different code sections, which is crucial for understanding and maintenance.

    Optimization Considerations:

    • Scrolling Logic: Investigate if the website uses infinite scroll or lazy loading. If it does, adjust the scrolling logic to trigger loading of new content only when necessary.
    • Parallelism (Advanced): Consider parallelizing requests for different brands using threading or multiprocessing, especially if the scraping scope is large.

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.common.exceptions import TimeoutException, NoSuchElementException
    import time
    
    def alcopascrap(brand, max_wait_time=10):
        driver = webdriver.Chrome()  
    
        try:
            driver.get(f"https://www.alcopa-auction.fr/ro/cautare?brand={brand}")
            WebDriverWait(driver, max_wait_time).until(
                EC.element_to_be_clickable((By.ID, "CybotCookiebotDialogBodyButtonAccept"))
            ).click()
    
            # Optimized scrolling (Implementation depends on the site's scrolling mechanism)
    
            list_of_cars = []
    
            ratings = soup.find_all('div',{'data-brand':brand})
            for rating in ratings:
                data = {}
                try: 
                    data['title'] = rating.find('li',{'class':'mb-2 list-group-item border-0 p-0'}).get_text().strip()
                    data['price'] = rating['data-prices']
                    data['details'] = [element.get_text().strip() for element in rating.find_all('li', {'class':'list-group-item border-0 p-0'})]
                    data['link'] = rating.find('a', {'class':'no-decoration'}).get('href')
                    list_of_cars.append(data)
                except (AttributeError, KeyError, NoSuchElementException):
                    # BeautifulSoup's find() returns None for missing elements (and missing attributes raise KeyError), so catch those too
                    print(f"Error processing car data in rating block: {rating}")
    
            # ... (rest of your code)
    
        except TimeoutException:
            print(f"Timeout waiting for elements on page for brand: {brand}")
        finally:
            driver.quit()
    
    alcopascrap('nissan') 
    

    It lacks a couple of lines like:

            html_source = driver.page_source
            soup = BeautifulSoup(html_source, 'html.parser')

    And

            for car in list_of_cars:
                print(car, end='\n')

    But otherwise it works like a charm, and it's faster than my first implementation.
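
    Gemini's "Parallelism (Advanced)" suggestion is also easy to sketch with the standard library. Here is a minimal, untested example of the idea, assuming alcopascrap is adjusted to return its list_of_cars instead of only printing it (the extra brand names are just placeholders):

    from concurrent.futures import ThreadPoolExecutor

    brands = ['nissan', 'renault', 'peugeot']  # placeholder brands

    # One scrape per brand, each worker gets its own Chrome driver created inside alcopascrap
    with ThreadPoolExecutor(max_workers=3) as executor:
        results = dict(zip(brands, executor.map(alcopascrap, brands)))

    for brand, cars in results.items():
        print(brand, len(cars) if cars else 0)

    Keep the worker count small, since every worker opens a full Chrome instance.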

    Use chatbots; they know better most of the time.

    Cheers!

  • Scraping Alcopa auction site using python

    Just for reference, a small example of scraping https://www.alcopa-auction.fr/ for new cars of a specific brand.

    It can be improved a lot, but it serves as a reference for the code.

    from bs4 import BeautifulSoup
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium import webdriver
    import time
    
    # Set up a Chrome browser (not actually headless as written; see the note after the description below)
    driver = webdriver.Chrome()


    # Load the website with JavaScript
    def alcopascrap(brand):

        driver.get(f"https://www.alcopa-auction.fr/ro/cautare?brand={brand}")

        # Implicit wait: element lookups retry for up to 180 s before raising
        driver.implicitly_wait(180)

        # Accept the cookie consent dialog
        WebDriverWait(driver, 1).until(
            EC.element_to_be_clickable((By.ID, "CybotCookiebotDialogBodyButtonAccept"))
        ).click()

        # Get initial scroll height
        last_height = driver.execute_script("return document.body.scrollHeight")

        while True:
            # Scroll down
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

            # Wait for loading
            time.sleep(2)  # Adjust wait time as needed

            # Calculate new scroll height and compare with previous
            new_height = driver.execute_script("return document.body.scrollHeight")
            if new_height == last_height:
                break
            last_height = new_height

        # Get the HTML source with JavaScript modifications
        html_source = driver.page_source

        # Parse the HTML with BeautifulSoup
        soup = BeautifulSoup(html_source, 'html.parser')

        # Create a list to store the data
        list_of_cars = []

        # Use BeautifulSoup to extract the data you need
        ratings = soup.find_all('div', {'data-brand': brand})
        for rating in ratings:
            data = {}
            data['title'] = rating.find('li', {'class': 'mb-2 list-group-item border-0 p-0'}).get_text()
            data['price'] = rating['data-prices']
            data['details'] = [element.get_text().strip() for element in rating.find_all('li', {'class': 'list-group-item border-0 p-0'})]
            data['link'] = rating.find('a', {'class': 'no-decoration'}).get('href')
            list_of_cars.append(data)

        # Print the data
        for car in list_of_cars:
            print(car, end='\n')

        # Close browser
        driver.quit()
        
    
    alcopascrap('nissan')
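
    One refinement, echoing the "Optimized Waits" advice from the Gemini post above: instead of sleeping a fixed 2 seconds on every pass, wait only until the page height actually changes, and stop when it no longer does. A rough sketch, not tested against the live site (the scroll_to_end helper name is mine):

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.common.exceptions import TimeoutException

    def scroll_to_end(driver, timeout=5):
        """Scroll until the page height stops growing."""
        last_height = driver.execute_script("return document.body.scrollHeight")
        while True:
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            try:
                # Wait until the document grows taller, or give up after `timeout` seconds
                WebDriverWait(driver, timeout).until(
                    lambda d: d.execute_script("return document.body.scrollHeight") > last_height
                )
            except TimeoutException:
                break  # height stayed the same, assume everything is loaded
            last_height = driver.execute_script("return document.body.scrollHeight")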

    As for the code description, here is the one from GitHub Copilot; judge for yourself whether it's good or not.

    This Python script uses Selenium and BeautifulSoup to scrape car data from the Alcopa Auction website. Here’s a step-by-step breakdown:

    1. Import necessary libraries: BeautifulSoup for parsing HTML, Selenium for browser automation, and time for controlling the script’s timing.
    2. Set up a headless browser using Selenium’s webdriver. This script uses Chrome as the browser.
    3. Define a function alcopascrap(brand), which takes a car brand as an argument and scrapes data for that brand from the website.
    4. The function first navigates to the Alcopa Auction website’s search page for the specified brand.
    5. It waits for the page to load, then clicks the accept button on the cookie consent dialog.
    6. It then enters a loop where it continually scrolls to the bottom of the page, waits for more content to load, and repeats until no more new content is loaded. This is done to ensure all the dynamic content loaded by JavaScript is captured.
    7. After all content is loaded, it gets the page’s HTML source.
    8. It then parses the HTML using BeautifulSoup.
    9. It creates an empty list list_of_cars to store the car data.
    10. It finds all div elements with a data-brand attribute matching the specified brand. For each of these elements, it extracts the car title, price, details, and link, and stores them in a dictionary. Each dictionary is then appended to list_of_cars.
    11. It prints out the data for each car.
    12. Finally, it closes the browser.
    13. The function is then called with ‘nissan’ as the argument, so it scrapes data for Nissan cars.

    This script is a good example of how to use Selenium and BeautifulSoup together to scrape data from a website that uses JavaScript to load content.
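
    One nitpick on step 2: the script as posted is not actually headless, since webdriver.Chrome() opens a normal browser window. If you want true headless runs, something along these lines should work (the exact flag depends on your Chrome and Selenium versions):

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")  # older Chrome builds use plain "--headless"
    options.add_argument("--window-size=1920,1080")  # give lazy-loading pages a realistic viewport
    driver = webdriver.Chrome(options=options)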