Web scraping for startups with Python - Part 1

Web scraping is a great tool for startups to generate leads and improve SEO. This guide will show you how to scrape websites using Python, Requests, and Selenium.

What is web scraping?

Web scraping is the automated process of extracting data from websites. It involves programmatically retrieving and parsing HTML or other structured data from the web, allowing you to gather large amounts of information that can be used for various purposes, such as market research, price comparison, or content aggregation.
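
In a nutshell, the whole process boils down to fetch, parse, extract. Here is a minimal sketch using the requests and beautifulsoup4 packages we install below (the URL is just a placeholder):

import requests
from bs4 import BeautifulSoup

# fetch the raw HTML of the page
html = requests.get('https://example.com').text
# parse it so we can pull out structured data
soup = BeautifulSoup(html, 'html.parser')

print(soup.title.text)  # extract a single piece of data: the page title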

Why do we need web scraping?

Generate leads

You can scrape email addresses and contact details of potential customers into a database.
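
As a rough sketch of what that can look like (the URL is hypothetical, the regex is deliberately simple, and you should always check a site's terms before collecting contact data):

import re
import sqlite3

import requests

# hypothetical page listing potential customers - replace with your own target
url = 'https://example.com/our-partners'

html = requests.get(url, timeout=30).text

# a simple (not RFC-perfect) regex is usually enough to pull emails out of raw HTML
emails = set(re.findall(r'[\w.+-]+@[\w-]+\.[\w.]+', html))

# store the leads in a local SQLite database
conn = sqlite3.connect('leads.db')
conn.execute('CREATE TABLE IF NOT EXISTS leads (email TEXT UNIQUE)')
for email in emails:
    conn.execute('INSERT OR IGNORE INTO leads (email) VALUES (?)', (email,))
conn.commit()
conn.close()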

Enhance SEO

You can find backlink opportunities by connecting with relevant websites in your niche. Moreover, one thing people often overlook is that you can scrape data related to your business niche, consolidate it, and render it on your own website. This lets you create more webpages for Google to index and increases your chances of ranking for the keywords you care about.
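
As a minimal sketch of that last idea, you can render records you have already scraped into a static HTML page that search engines can index (the data here is made up for illustration):

# made-up records standing in for data you scraped earlier
scraped_items = [
    {'title': 'Example product A', 'price': '$99.99'},
    {'title': 'Example product B', 'price': '$149.99'},
]

rows = '\n'.join(
    f'<li>{item["title"]} - {item["price"]}</li>' for item in scraped_items
)

page = f'<html><body><h1>Product roundup</h1><ul>{rows}</ul></body></html>'

with open('roundup.html', 'w') as f:
    f.write(page)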

Python for web scraping

Python's dynamic typing, lack of compilation requirements, and beginner-friendly nature make it an excellent choice for web scraping, allowing startups to quickly extract and analyze data from the web without a complex setup.

Tools we need

  1. Install the latest Python version here

  2. Get VS Code here (you can use whatever editor you like)

  3. Make sure you have Chrome installed

  4. Create a new project

  5. Install the required Python web scraping packages from the pip command line:

pip install requests
pip install beautifulsoup4
pip install selenium
pip install webdriver-manager

Steps before writing code for web scraping

  1. Identify the type of website (whether it is static or constantly updating); a quick way to check is sketched after this list

  2. Inspect the elements you want to extract in Chrome DevTools (press F12 on the webpage)

  3. Identify the target HTML elements you want to extract

  4. Then you can start coding 😄 
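
For step 1, here is a quick heuristic sketch: fetch the raw HTML with requests and check whether the content you care about is already in it. If it is not, the page is probably rendered by JavaScript and you will need Selenium (the URL and expected text are placeholders):

import requests

url = 'https://example.com/some-page'  # placeholder target
expected_text = 'text you can see in the browser'  # placeholder content

html = requests.get(url, timeout=30).text

if expected_text in html:
    print('content is in the raw HTML: a static scrape with requests should work')
else:
    print('content missing from the raw HTML: likely JavaScript-rendered, use Selenium')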

Real-world example 1 (Static website)

Suppose we want to get the e-commerce items on the webpage: https://webscraper.io/test-sites/e-commerce/allinone.

  1. Import required modules

  2. Define the data model into which we want to map the data

    1. item image url

    2. item title

    3. item description

    4. number of stars in the item

    5. number of reviews in the item

    6. item price

import requests
from bs4 import BeautifulSoup

# Initialize a class to store an e-commerce item
class Item:
    def __init__(self, imgSrc: str, title: str, description: str, num_of_stars: int, num_of_reviews: int, price: str):
        self.imgSrc = imgSrc
        self.title = title
        self.description = description
        self.num_of_stars = num_of_stars
        self.num_of_reviews = num_of_reviews
        self.price = price

items = []
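
A side note on the model: Python's standard-library @dataclass decorator generates the __init__ boilerplate for us, so an equivalent and slightly more idiomatic version would be the sketch below. The rest of the code works unchanged either way.

from dataclasses import dataclass

@dataclass
class Item:
    imgSrc: str
    title: str
    description: str
    num_of_stars: int
    num_of_reviews: int
    price: str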
  3. Set the required configurations to get the HTML source of the webpage

# Add user_agent to mimic the behavior of a web browser
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'

url = 'https://webscraper.io/test-sites/e-commerce/allinone'

response = requests.get(url, headers={'User-Agent': user_agent})

# Pass the html to beautiful soup for easier extraction
soup = BeautifulSoup(response.content, 'html.parser')
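
Just before handing the HTML to BeautifulSoup, it is also worth checking that the request actually succeeded; requests' raise_for_status() raises an exception on 4xx/5xx responses:

# stop early if the server returned an error page (e.g. 403 or 429)
response.raise_for_status()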
  4. Identify the element we want to extract. Simply right-click the element in Chrome and click Inspect.

(Screenshot: inspecting an element with Chrome DevTools)
  5. The target item will be highlighted when you hover over the HTML code.

  6. Identify the tags, classes, and attributes needed to get the required content

    1. All the items are located in the "card" class. We can get all the card elements and loop over them one by one to extract the data

    2. The image URL is inside the <img src=""/> tag

    3. The price and title are inside <h4></h4> tags

    4. The description is in the <p></p> of the <div class="caption"></div>

    5. The number of reviews and stars are in the <p></p> of the <div class="ratings"></div>

  7. The BeautifulSoup library is used to extract the data according to the classes, tags, and attributes. Read the docs here

# get all the card elements so we can loop over them one by one
cards = soup.find_all("div", {"class": "card"})

for card in cards:
    imgSrc = 'https://webscraper.io' + card.find("img").get("src")

    caption = card.find("div", {"class": "caption"})
    h4Text = card.find_all("h4")
    price = h4Text[0].text
    title = h4Text[1].text
    description = caption.find('p').text

    ratings = card.find("div", {"class": "ratings"})
    pText = ratings.find_all("p")

    # e.g. "14 reviews" -> split the text on the space and take the first element
    num_of_reviews = int(pText[0].text.split(' ')[0])
    # the star count is stored in the data-rating attribute of the tag
    num_of_stars = int(pText[1].get('data-rating'))
  8. Map the extracted fields to our data model (this line is still inside the for loop):

    items.append(Item(imgSrc=imgSrc, title=title, description=description, num_of_stars=num_of_stars, num_of_reviews=num_of_reviews, price=price))
  9. We're done!! We can set up a timer to run this job and get the data on a schedule; a sketch follows.
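
A minimal sketch of such a timer, assuming the scraping code above is wrapped in a function (scrape_items is a name invented here for illustration; a cron job or task scheduler works just as well):

import time

def scrape_items():
    # ... the requests + BeautifulSoup code from above goes here ...
    pass

# run the job once an hour
while True:
    scrape_items()
    time.sleep(60 * 60)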

Real-world example 2 (website with real-time data)

Get the Bitcoin price and the last updated time at https://hk.investing.com/crypto/bitcoin

  1. Set up the required configuration for Selenium and Chrome

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager


bitcoin_url = "https://hk.investing.com/crypto/bitcoin"

options = webdriver.ChromeOptions()
# run Chrome in the new headless mode
options.add_argument("--headless=new")

# pass these parameters to simulate a real browser
options.add_argument("--dns-prefetch-disable")
options.add_argument("--start-maximized")
options.add_argument("--window-size=1920,1080")
options.add_argument("--no-sandbox")

# disable useless stuff to load faster
options.add_argument("--disable-dev-shm-usage")
options.add_argument("disable-infobars")
options.add_argument("blink-settings=imagesEnabled=false")

# mimic the behavior of a web browser
options.add_argument("user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36")

# webdriver-manager downloads a matching ChromeDriver for Selenium automatically

driver = webdriver.Chrome(service=Service(
    ChromeDriverManager().install()), options=options)

# raise a timeout error if the page takes more than 60 seconds to load
driver.set_page_load_timeout(60)
driver.set_script_timeout(60)

driver.get(bitcoin_url)
# implicitly wait up to 5 seconds when locating elements
driver.implicitly_wait(5)
  2. Identify the Bitcoin price and the last updated time by right-clicking and inspecting the elements

  3. We can get the texts once we identify the elements with the "data-test" attribute

(Screenshot: inspecting the price element with Chrome DevTools)
  4. Keep the code running with a while loop to keep getting updated data as the website refreshes. Don't let the browser window close!

while True:
    soup = BeautifulSoup(driver.page_source, "html.parser")
    # using XPath to get the element would be the fastest way, but it is not sustainable;
    # it is better to match on the data-test attribute instead
    labels = soup.find_all(attrs={"data-test": True})
    for label in labels:
        if label.get("data-test") == 'instrument-price-last':
            bitcoin_price = label.text
            print(f'bitcoin price (usd): {bitcoin_price}')
        if label.get("data-test") == 'trading-time-label':
            last_updated_time = label.text
            print(f'last updated time: {last_updated_time}')
    # pause between polls so we don't hammer the site
    time.sleep(2)
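
One hedged improvement to the loop above: it never releases the browser, so stopping the script can leave headless Chrome processes behind. Wrapping the loop so the driver quits cleanly (for example on Ctrl+C) avoids that:

try:
    while True:
        soup = BeautifulSoup(driver.page_source, "html.parser")
        # ... same extraction logic as above ...
        time.sleep(2)
except KeyboardInterrupt:
    print('stopping...')
finally:
    # always release the browser process
    driver.quit()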

Reminders

  1. Most websites will rate-limit your IP address if you send requests too frequently. Remember not to scrape too often, and take a look at the site's data usage policy (see the sketch after this list)

  2. Sometimes a website will take longer to load. Make sure you set a generous timeout (also covered in the sketch below)

  3. Constantly updating your script is necessary, as the HTML structure of websites can change, causing your web scraping script to break.
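
A small sketch covering reminders 1 and 2: add a generous timeout, a simple retry with backoff, and a randomized polite delay between scrapes (the URL is a placeholder):

import random
import time

import requests

url = 'https://example.com/page-to-scrape'  # placeholder target

response = None
for attempt in range(3):
    try:
        # generous timeout for slow pages
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        break
    except requests.RequestException as e:
        print(f'attempt {attempt + 1} failed: {e}')
        # back off a little more on each retry
        time.sleep(5 * (attempt + 1))

# be polite between scrapes: wait a couple of seconds, plus jitter
time.sleep(2 + random.uniform(0, 3))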

Summary

Web scraping can be both fun and challenging, but it is well worth the effort to write your own Python scripts and automate the process. By doing so, you will be able to work much more efficiently and gain valuable insights from the data you collect.

We will explore this topic more deeply in Part 2. Enjoy!!!

Download my apps if you are interested 😍