Web scraping for startups with Python - Part 2 - Rate Limit Bypass

Introduction

In Part 1 we explored the basics of web scraping using Selenium and BeautifulSoup - two powerful tools that are efficient and easy to use. However, when it comes to large-scale web scraping projects, we recommend considering Scrapy, a more robust and scalable web scraping framework.

While Scrapy offers advanced features and capabilities, it does come with a steeper learning curve, so I don't recommend it for beginners.

In Part 2, we will focus on tackling the obstacles to running scrapers sustainably. Modern websites pose several challenges that are easy to overlook.

Common Challenges

One common challenge faced in web scraping is dealing with rate-limit rules implemented by modern websites. These rules are designed to prevent users from making too frequent requests, which can potentially overload the website's servers. So, what should you do in such cases?

We will explore a few practical scenarios and provide strategies and examples to handle rate-limit restrictions effectively and keep your scrapers running smoothly.

Scenario 1: IP Blocking Due to Excessive Requests

Solution 1: Implement Rotating Proxies

Using rotating proxies can be an effective solution in this scenario. Rotating proxies allow you to distribute your requests across a pool of different IP addresses, helping you avoid being blocked by the target website.

How do I get free proxy servers?

How do I implement proxy rotation in Python?

Example: Using Selenium

import random
from selenium import webdriver

# Example proxy pool - replace these placeholder addresses with your own proxies
proxy_server_pools = ["211.123.252.110:8080", "211.123.252.111:8080", "211.123.252.112:8080"]
current_proxy = random.choice(proxy_server_pools)

while True:
	try:
		chrome_options = webdriver.ChromeOptions()
		chrome_options.add_argument('--proxy-server=%s' % current_proxy)
		chrome = webdriver.Chrome(options=chrome_options)
		# Logic for scraping ...
		break  # Exit the loop once scraping succeeds
	except Exception as e:
		print(str(e))
		# Rotate to a different proxy before retrying
		current_proxy = random.choice([x for x in proxy_server_pools if x != current_proxy])

# Remember to close the driver connection at the end
chrome.quit()

Example: Using the Requests module in Python

import random
import requests

# Example proxy pools - replace these placeholder addresses with your own proxies
http_proxy = ["http://211.123.252.110:8080", "http://211.123.252.111:8080", "http://211.123.252.112:8080"]
https_proxy = ["https://211.123.252.114:8080", "https://211.123.252.115:8080", "https://211.123.252.116:8080"]
current_http_proxy = random.choice(http_proxy)
current_https_proxy = random.choice(https_proxy)

while True:
	try:
		proxies = {
			"http": current_http_proxy,
			"https": current_https_proxy,
		}
		# Logic for scraping, for example:
		# response = requests.get("https://example.com", proxies=proxies, timeout=10)
		break  # Exit the loop once scraping succeeds
	except Exception as e:
		print(str(e))
		# Rotate to different proxies before retrying
		current_http_proxy = random.choice([x for x in http_proxy if x != current_http_proxy])
		current_https_proxy = random.choice([x for x in https_proxy if x != current_https_proxy])

Scenario 2: Rate Limit Restrictions Without IP Blocking

In some cases, the target website may not outright block your IP address, but instead, impose rate limit restrictions. This means you can still access the website, but only after a certain time period has elapsed since your last request.

Solution 1: Implement Delay and Randomization

We can simply add a randomized delay after each request.

Example: Selenium

import random
import time
from selenium import webdriver

driver = webdriver.Chrome()
delay_range = (2, 5)  # Delay range in seconds

# Pages to scrape - replace with your own targets
urls = ["https://example.com/page1", "https://example.com/page2"]

for url in urls:
	driver.get(url)

	# Perform your Selenium actions here
	# ...

	# Sleep for a random duration before the next request
	random_delay = random.uniform(*delay_range)
	time.sleep(random_delay)

driver.quit()
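
If the website signals its limits explicitly, you can also react to them instead of guessing. Below is a minimal sketch, assuming the server answers rate-limited requests with HTTP 429 and an optional Retry-After header (a common but not universal convention); the helper name and URL are placeholders.

import time
import requests

def fetch_with_backoff(url, max_retries=5):
	# Hypothetical helper: retry when the server responds with 429 (Too Many Requests)
	delay = 2  # initial back-off in seconds
	for attempt in range(max_retries):
		response = requests.get(url, timeout=10)
		if response.status_code != 429:
			return response
		# Honour the Retry-After header when the server provides one (value in seconds)
		retry_after = response.headers.get("Retry-After")
		wait = int(retry_after) if retry_after and retry_after.isdigit() else delay
		time.sleep(wait)
		delay *= 2  # exponential back-off for the next attempt
	raise RuntimeError("Gave up after %d retries: %s" % (max_retries, url))

# response = fetch_with_backoff("https://example.com")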

Scenario 3: Respecting Website Guidelines and Restrictions

When embarking on a web scraping project, it's crucial to understand and respect the guidelines and restrictions set by the target website. One effective way to do this is by consulting the website's robots.txt file, which can be found under the host domain (e.g., https://www.google.com/robots.txt).

Solution: Analyze the robots.txt file

The robots.txt file is a standard protocol used by website owners to communicate their preferences and guidelines for web crawlers and scrapers. By carefully examining the robots.txt file, you can gain valuable insights that can help you avoid triggering rate limits or other defensive measures implemented by the website.

Allow indexing of everything

User-agent: *
Allow: /

Disallow indexing of everything

User-agent: *
Disallow: /

Disallow indexing of a specific folder

User-agent: *
Disallow: /folder/

Disallow Googlebot from indexing of a folder, except for allowing the indexing of one file in that folder

User-agent: Googlebot
Disallow: /folder1/
Allow: /folder1/myfile.html

The crawler should wait at least 5 seconds between successive requests to the website

User-agent: *
Crawl-Delay: 5

Here's how you can leverage the information in the robots.txt file:

Identify Crawlable Areas: The robots.txt file will typically specify the areas of the website that are allowed to be crawled and indexed by search engines and other web scrapers. By adhering to these guidelines, you can ensure that your scraping activities do not violate the website's terms of service.

Detect Restricted Sections: Conversely, the robots.txt file may also list areas of the website that are off-limits or restricted for scraping. Respecting these restrictions can help you avoid potential legal issues or retaliation from the website owners.

Adjust Scraping Patterns: The information in the robots.txt file can also guide you in adjusting your scraping patterns and strategies to align with the website's preferences. This can include adjusting request frequencies, avoiding certain pages or sections, or employing other techniques to ensure your scraping activities are within the website's acceptable limits.
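
Example: Checking robots.txt programmatically

You can apply these rules in code before each request. The following is a minimal sketch using urllib.robotparser from Python's standard library; the site URL, user agent, and path are placeholders.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # Placeholder - point this at the real site
rp.read()  # Downloads and parses the robots.txt file

user_agent = "MyScraperBot"  # Placeholder user agent
url = "https://example.com/folder/page.html"

if rp.can_fetch(user_agent, url):
	delay = rp.crawl_delay(user_agent)  # None if the site declares no crawl delay
	print("Allowed to fetch %s (crawl delay: %s)" % (url, delay))
else:
	print("robots.txt disallows fetching %s" % url)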

Scraping Plan - Before You Write Scripts

1. Respect the Website's robots.txt Guidelines

  • Carefully review the website's robots.txt file to understand the crawlable and restricted areas.

  • Align your scraping activities with the website's guidelines to demonstrate respect and avoid potential issues.

2. Prepare the Technical Infrastructure

  • Set up a pool of rotating proxy servers to distribute requests and avoid IP-based blocking.

  • Implement randomization and delay mechanisms to comply with rate limit restrictions.

  • Ensure robust error handling to gracefully handle any issues that may arise during crawling (see the sketch below).
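
As a sketch of the delay and error-handling points above, the Requests library can delegate retries and exponential back-off to urllib3's Retry helper; the retry counts and status codes below are illustrative assumptions, not values taken from any particular site.

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry transient failures and rate-limit responses with exponential back-off
retry_strategy = Retry(
	total=5,
	backoff_factor=1,
	status_forcelist=[429, 500, 502, 503, 504],
)
adapter = HTTPAdapter(max_retries=retry_strategy)

session = requests.Session()
session.mount("https://", adapter)
session.mount("http://", adapter)

# The session can be combined with the proxy rotation shown earlier, for example:
# response = session.get("https://example.com", proxies=proxies, timeout=10)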

3. Leverage Familiar Tools and Technologies

  • Choose tools and frameworks you are most comfortable with, such as Selenium with BeautifulSoup or pure Selenium.

  • This can help you develop the crawler more efficiently and reduce development time.

4. Understand the Website's HTML Architecture

  • Carefully analyze the website's HTML structure to identify the target data you want to crawl.

  • This will guide you in crafting the appropriate scraping logic and selectors, as shown in the sketch below.
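
For instance, here is a minimal BeautifulSoup sketch for a hypothetical page that lists products in div.product blocks; the tag and class names are invented for illustration and must be replaced with those of the real page.

from bs4 import BeautifulSoup

# Hypothetical HTML - inspect the real page to find its actual structure and class names
html = """
<div class="product">
	<h2 class="product-title">Example item</h2>
	<span class="product-price">$9.99</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
for product in soup.select("div.product"):
	title = product.select_one("h2.product-title").get_text(strip=True)
	price = product.select_one("span.product-price").get_text(strip=True)
	print(title, price)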

5. Inspect the Network Traffic

  • Use browser developer tools to inspect the network requests and responses.

  • Identify if the data is fetched from a separate API endpoint, which can simplify the scraping process (see the sketch below).
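
If such an endpoint exists, calling it directly is often simpler than parsing rendered HTML. A minimal sketch with the Requests module, where the endpoint path and the JSON field names are hypothetical:

import requests

# Hypothetical endpoint spotted in the browser's Network tab - the real path will differ
api_url = "https://example.com/api/products?page=1"

response = requests.get(api_url, timeout=10)
response.raise_for_status()
data = response.json()  # JSON responses are usually easier to work with than rendered HTML

for item in data.get("products", []):  # "products", "name" and "price" are assumed field names
	print(item.get("name"), item.get("price"))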

6. Prioritize Simplicity

  • Aim for the simplest and most straightforward approach to extract the desired information.

  • Avoid over-complicating the crawler, as simpler solutions are often more robust and maintainable.

7. Debug, Test, Deploy, and Monitor

  • Thoroughly debug and test the crawler to ensure it functions as expected.

  • Deploy the crawler and monitor its performance over time.

  • Be prepared to make adjustments as needed to address any issues that may arise.