- Tony Tang's Newsletter
Web scraping for startup with Python - Part 2 - Rate Limit Bypass
Introduction
In Part 1 we explored the basics of web scraping using Selenium and BeautifulSoup - two powerful tools that are efficient and easy to use. However, when it comes to large-scale web scraping projects, we recommend considering Scrapy, a more robust and scalable web scraping framework.
While Scrapy offers advanced features and capabilities, it comes with a steeper learning curve, so I don't recommend it for beginners.
In Part 2, we will focus on the obstacles to running scrapers sustainably. Modern websites pose some challenges that are often overlooked.
Common Challenges
One common challenge in web scraping is dealing with the rate-limit rules implemented by modern websites. These rules are designed to prevent users from sending requests too frequently, which can overload the website's servers. So, what should you do in such cases?
We will explore a few practical scenarios and provide strategies and examples to handle rate-limit restrictions effectively and keep your scrapers running smoothly.
Scenario 1: IP Blocking Due to Excessive Requests
Solution 1: Implement Rotating Proxies
Using rotating proxies can be an effective solution in this scenario. Rotating proxies allow you to distribute your requests across a pool of different IP addresses, helping you avoid being blocked by the target website.
How do I get proxy servers?
Free proxies are available online, but most of them do not work, and they may pose security risks.
Many proxy SaaS providers sell proxies online.
You can also rent virtual machines from a cloud provider (AWS, Azure, Linode). The cheapest ones cost about 5 USD per month.
The simplest approach is to set up a Squid proxy server on each machine and route your scraping traffic through them.
How do I implement proxy rotation in Python?
Example: Using Selenium
import random
from selenium import webdriver

# Placeholder proxy addresses; replace them with your own pool
proxy_server_pools = ["211.123.252.110:8080", "211.123.252.111:8080", "211.123.252.112:8080"]
current_proxy = random.choice(proxy_server_pools)

while True:
    chrome = None
    try:
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument('--proxy-server=%s' % current_proxy)
        chrome = webdriver.Chrome(options=chrome_options)
        # Logic for scraping ...
        break  # Stop retrying once scraping succeeds
    except Exception as e:
        print(str(e))
        # Switch to a different proxy before the next attempt
        current_proxy = random.choice([x for x in proxy_server_pools if x != current_proxy])
    finally:
        # Remember to close the driver connection at the end
        if chrome is not None:
            chrome.quit()
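Example: Using the Requests module in Python
import random
import requests

# Placeholder proxy addresses; replace them with your own pool.
# The dictionary key ("http" / "https") is the scheme of the target URL;
# the scheme inside each proxy URL is how the client connects to the proxy.
http_proxy = ["http://211.123.252.110:8080", "http://211.123.252.111:8080", "http://211.123.252.112:8080"]
https_proxy = ["http://211.123.252.114:8080", "http://211.123.252.115:8080", "http://211.123.252.116:8080"]
current_http_proxy = random.choice(http_proxy)
current_https_proxy = random.choice(https_proxy)

while True:
    try:
        proxies = {
            "http": current_http_proxy,
            "https": current_https_proxy,
        }
        response = requests.get("https://example.com", proxies=proxies, timeout=10)
        response.raise_for_status()
        # Logic for scraping ...
        break  # Stop retrying once the request succeeds
    except Exception as e:
        print(str(e))
        # Pick new proxies before the next attempt
        current_http_proxy = random.choice(http_proxy)
        current_https_proxy = random.choice(https_proxy)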
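Both examples above pick a new proxy at random whenever something fails. As a rough sketch of how that logic could be factored out, the helper below keeps a failure count per proxy and stops handing out proxies that keep failing. The class name ProxyRotator, its methods, and the max_failures threshold are illustrative assumptions, not part of Selenium or Requests.

```python
import random


class ProxyRotator:
    """Hypothetical helper: hand out proxies and retire the ones that keep failing."""

    def __init__(self, proxies, max_failures=3):
        if not proxies:
            raise ValueError("proxy pool must not be empty")
        # dict.fromkeys removes duplicate entries while keeping the original order
        self._failures = dict.fromkeys(proxies, 0)
        self._max_failures = max_failures
        self._current = None

    def available(self):
        """Return the proxies that have not reached the failure threshold."""
        return [p for p, n in self._failures.items() if n < self._max_failures]

    def next_proxy(self):
        """Return a random healthy proxy, avoiding the current one when possible."""
        healthy = self.available()
        if not healthy:
            raise RuntimeError("all proxies in the pool have failed")
        # Fall back to the full healthy list if the current proxy is the only one left
        candidates = [p for p in healthy if p != self._current] or healthy
        self._current = random.choice(candidates)
        return self._current

    def report_failure(self, proxy):
        """Record one failed attempt for the given proxy."""
        self._failures[proxy] += 1

    def report_success(self, proxy):
        """Reset the failure count after a successful request."""
        self._failures[proxy] = 0
```

In the loops above, you would call next_proxy() before each attempt, report_failure() in the except branch, and report_success() after a successful scrape. Retiring proxies after repeated failures keeps the scraper from cycling through dead addresses, which matters most with free proxy lists.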
Scenario 2: Rate Limit Restrictions Without IP Blocking
In some cases, the target website may not outright block your IP address, but instead, impose rate limit restrictions. This means you can still access the website, but only after a certain time period has elapsed since your last request.
Solution 1: Implement Delay and Randomization
We can simply add a randomized delay after each request.
Example: Selenium
import random
import time
from selenium import webdriver

driver = webdriver.Chrome()
delay_range = (2, 5)  # Delay range in seconds

try:
    while True:
        driver.get("https://example.com")
        # Perform your Selenium actions here
        # ...
        # Sleep for a random interval before the next request
        random_delay = random.uniform(*delay_range)
        time.sleep(random_delay)
finally:
    # Runs when the loop is interrupted (e.g., Ctrl+C) or an error occurs
    driver.quit()
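The same idea can be wrapped in a small reusable object so that the delay logic is not tied to Selenium. The sketch below is one possible shape, assuming a hypothetical class named RandomizedThrottle. It subtracts the time already spent processing a page from the random gap, so the scraper does not wait longer than necessary. The sleep and clock parameters exist only to make the class easy to test.

```python
import random
import time


class RandomizedThrottle:
    """Hypothetical helper: enforce a random gap between consecutive requests."""

    def __init__(self, min_delay=2.0, max_delay=5.0, sleep=time.sleep, clock=time.monotonic):
        if min_delay < 0 or max_delay < min_delay:
            raise ValueError("expected 0 <= min_delay <= max_delay")
        self.min_delay = min_delay
        self.max_delay = max_delay
        self._sleep = sleep
        self._clock = clock
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        """Pause until a random gap has passed since the previous call.

        Returns the number of seconds actually slept.
        """
        slept = 0.0
        if self._last is not None:
            target_gap = random.uniform(self.min_delay, self.max_delay)
            elapsed = self._clock() - self._last
            slept = max(0.0, target_gap - elapsed)
            if slept > 0:
                self._sleep(slept)
        self._last = self._clock()
        return slept
```

Calling throttle.wait() immediately before each driver.get(...) or requests.get(...) call is enough. The first call returns without sleeping, and later calls sleep only for the part of the gap that has not already elapsed.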
Scenario 3: Respecting Website Guidelines and Restrictions
When embarking on a web scraping project, it's crucial to understand and respect the guidelines and restrictions set by the target website. One effective way to do this is by consulting the website's robots.txt file, which can be found under the host domain (e.g., https://www.google.com/robots.txt).
Solution: Analyze the robots.txt file
The robots.txt file is a standard protocol used by website owners to communicate their preferences and guidelines for web crawlers and scrapers. By carefully examining the robots.txt file, you can gain valuable insights that can help you avoid triggering rate limits or other defensive measures implemented by the website.
Allow indexing of everything
User-agent: *
Allow: /
Disallow indexing of everything
User-agent: *
Disallow: /
Disallow indexing of a specific folder
User-agent: *
Disallow: /folder/
Disallow Googlebot from indexing a folder, except for one file in that folder
User-agent: Googlebot
Disallow: /folder1/
Allow: /folder1/myfile.html
The crawler should wait at least 5 seconds between successive requests to the website
User-agent: *
Crawl-Delay: 5
Here's how you can leverage the information in the robots.txt file:
Identify Crawlable Areas: The robots.txt file will typically specify the areas of the website that are allowed to be crawled and indexed by search engines and other web scrapers. Adhering to these guidelines keeps your scraper within the areas the site owner has opened to crawlers, although the website's terms of service may impose further limits.
Detect Restricted Sections: Conversely, the robots.txt file may also list areas of the website that are off-limits or restricted for scraping. Respecting these restrictions can help you avoid potential legal issues or retaliation from the website owners.
Adjust Scraping Patterns: The information in the robots.txt file can also guide you in adjusting your scraping patterns and strategies to align with the website's preferences. This can include adjusting request frequencies, avoiding certain pages or sections, or employing other techniques to ensure your scraping activities are within the website's acceptable limits.
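Python's standard library can read these rules for you through urllib.robotparser. The sketch below parses a made-up robots.txt held in memory, so it runs without network access. The file content, the "my-scraper" user-agent name, and the helper functions is_allowed and delay_seconds are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed in memory for the sake of the example
ROBOTS_TXT = """\
User-agent: *
Disallow: /folder/
Crawl-delay: 5
"""

USER_AGENT = "my-scraper"  # hypothetical user-agent name for your scraper

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url):
    """Return True if the parsed rules allow USER_AGENT to fetch the URL."""
    return parser.can_fetch(USER_AGENT, url)


def delay_seconds(default=1.0):
    """Return the Crawl-delay for USER_AGENT, or a default if none is set."""
    delay = parser.crawl_delay(USER_AGENT)
    return default if delay is None else float(delay)
```

With these rules, is_allowed("https://example.com/folder/page.html") is False, is_allowed("https://example.com/index.html") is True, and delay_seconds() is 5.0. For a live website, you would call parser.set_url("https://example.com/robots.txt") followed by parser.read() instead of parser.parse(...). Note that this parser applies the first matching rule in a group, so its result for mixed Allow and Disallow lines may differ from how a given search engine resolves them.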
Scraping Plan - Before You Write Scripts
1. Respect the Website's robots.txt Guidelines
Carefully review the website's robots.txt file to understand the crawlable and restricted areas.
Align your scraping activities with the website's guidelines to demonstrate respect and avoid potential issues.
2. Prepare the Technical Infrastructure
Set up a pool of rotating proxy servers to distribute requests and avoid IP-based blocking.
Implement randomization and delay mechanisms to comply with rate limit restrictions.
Ensure robust error handling to gracefully handle any issues that may arise during crawling.
3. Leverage Familiar Tools and Technologies
Choose tools and frameworks you are most comfortable with, such as Selenium with BeautifulSoup or pure Selenium.
This can help you develop the crawler more efficiently and reduce development time.
4. Understand the Website's HTML Architecture
Carefully analyze the website's HTML structure to identify the target data you want to crawl.
This will guide you in crafting the appropriate scraping logic and selectors.
5. Inspect the Network Traffic
Use browser developer tools to inspect the network requests and responses.
Identify if the data is fetched from a separate API endpoint, which can simplify the scraping process.
6. Prioritize Simplicity
Aim for the simplest and most straightforward approach to extract the desired information.
Avoid over-complicating the crawler, as simpler solutions are often more robust and maintainable.
7. Debug, Test, Deploy, and Monitor
Thoroughly debug and test the crawler to ensure it functions as expected.
Deploy the crawler and monitor its performance over time.
Be prepared to make adjustments as needed to address any issues that may arise.