How to implement request retries in Python

Web scraping is an effective method for extracting data from the web. Many developers prefer the Python requests library for web scraping projects because it is simple and effective. However, the requests library has its limitations. One typical problem in web scraping is failed requests, which often lead to unstable data extraction. In this article, we will go through the process of implementing request retries in Python so you can handle HTTP errors and keep your web scraping scripts stable and reliable.

Getting started with the requests library

Let’s set up our environment first. Make sure you have Python installed and any IDE of your choice. Then install the requests library if you don’t have it already.

pip install requests

Once installed, let's send a request to example.com using Python's requests module. Here's a simple function that does just that:

import requests

def send_request(url):
    """
    Sends an HTTP GET request to the specified URL and prints the response status code.
    
    Parameters:
        url (str): The URL to send the request to.
    """
    response = requests.get(url)
    print('Response Status Code: ', response.status_code)

send_request('https://example.com')

Running the code prints the response status code, which is 200 in this example.

Let's take a closer look at HTTP status codes to understand them better.

Understanding HTTP status codes

The server responds to an HTTP request with a status code indicating the request's outcome. Here's a quick rundown:

  1. 1xx (Informational): The request was received and continues to be processed.
  2. 2xx (Success): The request was received, understood, and accepted.
    • 200 OK: The request was successful. This is the green light of HTTP status codes.
  3. 3xx (Redirection): Further action is needed to complete the request.
  4. 4xx (Client Error): There was an error with the request, often due to something on the client-side.
  5. 5xx (Server Error): The server failed to fulfill a valid request due to an error on its end.
    • 500 Internal Server Error: The server encountered an unexpected condition that prevented it from fulfilling the request. This is the red traffic light of HTTP status codes.
    • 504 Gateway Timeout: A server acting as a gateway or proxy didn’t receive a response from the upstream server in time.

In our example, the status code 200 means the request to https://example.com was completed. It's the server's way of saying, "Everything's good here, your request was a success".

Status codes can also be a sign of bot detection, indicating that access is restricted due to bot-like behavior.

Below is a quick rundown of HTTP error codes that mainly occur due to bot detection and authentication issues.

  1. 429 Too Many Requests: the client has sent too many requests in a given time (“rate limiting”). It’s a common response when bots exceed predefined request limits.
  2. 403 Forbidden: the server understood the request but refuses to fulfill it. This can occur if the server suspects the request is coming from a bot, based on the User-Agent header or other criteria.
  3. 401 Unauthorized: the request lacks the authentication credentials that the resource requires.
  4. 503 Service Unavailable: the server is temporarily unable to handle the request, which might happen during automated traffic spikes.
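
The status codes above can be grouped by whether repeating the same request is likely to help. The following is a minimal sketch rather than a definitive rule set: the name should_retry and the chosen set of codes are assumptions made for illustration, and the right choice depends on the site being scraped.

```python
# Status codes that usually signal a temporary condition.
# This set is an assumption for illustration; adjust it per target site.
RETRYABLE_STATUS_CODES = {429, 500, 502, 503, 504}


def should_retry(status_code):
    """Return True if a response with this status code is worth retrying."""
    # Codes such as 401, 403, and 404 are treated as non-retryable here:
    # sending the identical request again is unlikely to change the outcome.
    return status_code in RETRYABLE_STATUS_CODES


for code in (200, 404, 429, 503):
    print(code, should_retry(code))
```

A 403 caused by bot detection may still succeed after something about the request changes, for example the proxy, so treat the set as a starting point.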

Implementing a retry mechanism in Python

Let’s now write a simple retry mechanism in Python for HTTP GET requests made with the requests library. Network requests sometimes fail because of a network problem or server overload, so when a request fails, we should retry it.

Basic retry mechanism

The function send_request_with_basic_retry_mechanism makes an HTTP GET request to a given URL with a basic retry mechanism that retries only when a network or request exception, such as a connection error, is raised. It makes at most max_retries attempts, pausing before each retry. If every attempt fails with such an exception, it re-raises the last one.

import requests
import time

def send_request_with_basic_retry_mechanism(url, max_retries=2, delay=1):
    """
    Sends an HTTP GET request to a URL with a basic retry mechanism.
    
    Parameters:
        url (str): The URL to send the request to.
        max_retries (int): The maximum number of attempts (the initial request plus retries). Default is 2.
        delay (int): The delay (in seconds) between attempts. Default is 1.

    Raises:
        requests.RequestException: Raises the last exception if all attempts fail.

    """
    for attempt in range(max_retries):
        try:
            response = requests.get(url)
            print('Response status: ', response.status_code)
            break  # Exit loop if request successful
        except requests.RequestException as error:
            print(f"Attempt {attempt+1} failed:", error)
            if attempt < max_retries - 1:
                print("Retrying...")
                time.sleep(delay)  # Wait before retrying
            else:
                print("Max retries exceeded.")
                # Re-raise the last exception if max retries reached
                raise

# Example usage
send_request_with_basic_retry_mechanism('https://example.com')

Advanced retry mechanism

Let’s now adapt the basic retry mechanism to handle websites that implement bot detection mechanisms, which may result in blocking. In such scenarios, a failed request may be caused by a bot detection block, but also by network or server problems, so we retry the request several times.

The function below, send_request_with_advance_retry_mechanism, sends an HTTP GET request to the provided URL with a configurable number of attempts and retry delay. It prints the response status code if the request succeeds. If the request raises an error, including a 4xx or 5xx status code, the function prints the error message, waits for the specified delay, and tries again. If the request still fails after the specified number of attempts, it re-raises the last exception.

The delay parameter is important because it avoids bombarding the server with multiple requests in quick succession. Waiting gives the server time to recover and makes the traffic look less like an automated burst, which might otherwise overload the server or trigger anti-bot mechanisms.

import requests
import time

def send_request_with_advance_retry_mechanism(url, max_retries=3, delay=1):
    """
    Sends an HTTP GET request to the specified URL with an advanced retry mechanism.
    
    Parameters:
        url (str): The URL to send the request to.
        max_retries (int): The maximum number of times to retry the request. Default is 3.
        delay (int): The delay (in seconds) between retries. Default is 1.

    Raises:
        requests.RequestException: Raises the last exception if all retries fail.
    """
    for attempt in range(max_retries):
        try:
            response = requests.get(url)
            # Raise an exception for 4xx or 5xx status codes
            response.raise_for_status()
            print('Response Status Code:', response.status_code)
            break  # Exit the loop once the request succeeds
        except requests.RequestException as e:
            # Print error message and attempt number if the request fails
            print(f"Attempt {attempt+1} failed:", e)
            if attempt < max_retries - 1:
                # Print the retry message and wait before retrying
                print(f"Retrying in {delay} seconds...")
                time.sleep(delay)
            else:
                # If max retries exceeded, print message and re-raise exception
                print("Max retries exceeded.")
                raise

# Example usage
send_request_with_advance_retry_mechanism('https://httpbin.org/status/404')

Here are the drawbacks of this implementation:

  • All status codes belonging to the 4xx and 5xx ranges are retried. However, requests resulting in a 404 (Not Found) status code do not need to be retried.
  • Some bot detection services may respond with a status code of 200 (OK), but the response content may differ. This situation is not handled in the current implementation. Implementing content length validation could address this issue.

Here's the corrected code along with comments addressing the drawbacks:

import requests
import time

def send_request_with_advance_retry_mechanism(url, max_retries=3, delay=1, min_content_length=10):
    """
    Sends an HTTP GET request to the specified URL with an advanced retry mechanism.

    Parameters:
        url (str): The URL to send the request to.
        max_retries (int): The maximum number of times to retry the request. The default is 3.
        delay (int): The delay (in seconds) between retries. Default is 1.
        min_content_length (int): The minimum length of response content to consider valid. The default is 10.

    Raises:
        requests.RequestException: Raises the last exception if all retries fail.
    """
    for attempt in range(max_retries):
        try:
            response = requests.get(url)

            # A 404 will not succeed on retry, so check it before raise_for_status()
            if response.status_code == 404:
                print("404 Error: Not Found")
                break  # Exit loop for 404 errors

            # Raise an exception for any other 4xx or 5xx status code
            response.raise_for_status()
            
            # Check if length of the response text is less than the specified minimum content length
            if len(response.text) < min_content_length:
                print("Response text length is less than specified minimum. Retrying...")
                time.sleep(delay)
                continue  # Retry the request
            
            print('Response Status Code:', response.status_code)
            # If conditions are met, break out of the loop
            break
            
        except requests.RequestException as e:
            print(f"Attempt {attempt+1} failed:", e)
            if attempt < max_retries - 1:
                print(f"Retrying in {delay} seconds...")
                time.sleep(delay)
            else:
                print("Max retries exceeded.")
                # Re-raise the last exception if max retries reached
                raise

# Example usage
send_request_with_advance_retry_mechanism('https://httpbin.org/status/404')
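
The fixed delay used above is the simplest option. A common alternative is exponential backoff, where the wait grows after each failed attempt. The sketch below only computes the delays and is an illustration under stated assumptions: backoff_delay, base_delay, factor, max_delay, and jitter are hypothetical names, not part of the requests library.

```python
import random


def backoff_delay(attempt, base_delay=1.0, factor=2.0, max_delay=30.0, jitter=False):
    """
    Return the number of seconds to wait after a failed attempt.

    attempt is zero-based: 0 for the first failure, 1 for the second, and so on.
    """
    # Grow the delay geometrically, but never beyond max_delay
    delay = min(base_delay * (factor ** attempt), max_delay)
    if jitter:
        # Randomize the wait so that many clients do not retry at the same instant
        delay = random.uniform(0, delay)
    return delay


# Delays for five consecutive failures without jitter
print([backoff_delay(attempt) for attempt in range(5)])
# prints [1.0, 2.0, 4.0, 8.0, 16.0]
```

In the retry loop, time.sleep(delay) would become time.sleep(backoff_delay(attempt)).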

Handling specific HTTP errors with proxies

For certain errors like 429 Too Many Requests, using rotating proxies can help distribute your requests and avoid rate limiting.

The code below combines the advanced retry strategy with proxies. Using high-quality web scraping proxies is also important: they should come with a reliable pool and a good algorithm for proxy rotation.

import requests
import time

def send_request_with_advance_retry_mechanism(url, max_retries=3, delay=1, min_content_length=10):
    """
    Sends an HTTP GET request through a proxy to the specified URL with an advanced retry mechanism.

    Parameters:
        url (str): The URL to send the request to.
        max_retries (int): The maximum number of times to retry the request. Default is 3.
        delay (int): The delay (in seconds) between retries. The default is 1.
        min_content_length (int): The minimum length of response content to consider valid. The default is 10.

    Raises:
        requests.RequestException: Raises the last exception if all retries fail.
    """

    # Placeholder values: replace USER, PASS, HOST, and PORT with your proxy details
    proxies = {
        "http": "http://USER:PASS@HOST:PORT",
        "https": "https://USER:PASS@HOST:PORT"
    }

    for attempt in range(max_retries):
        try:
            # Note: verify=False disables TLS certificate verification
            response = requests.get(url, proxies=proxies, verify=False)

            # A 404 will not succeed on retry, so check it before raise_for_status()
            if response.status_code == 404:
                print("404 Error: Not Found")
                break  # Exit loop for 404 errors

            # Raise an exception for any other 4xx or 5xx status code
            response.raise_for_status()

            # Check if the length of the response text is less than the specified minimum
            if len(response.text) < min_content_length:
                print("Response text length is less than specified minimum. Retrying...")
                time.sleep(delay)
                continue  # Retry the request
            
            print('Response Status Code:', response.status_code)
            # If conditions are met, break out of the loop
            break
            
        except requests.RequestException as e:
            print(f"Attempt {attempt+1} failed:", e)
            if attempt < max_retries - 1:
                print(f"Retrying in {delay} seconds...")
                time.sleep(delay)
            else:
                print("Max retries exceeded.")
                # Re-raise the last exception if max retries reached
                raise

send_request_with_advance_retry_mechanism('https://httpbin.org/status/404')
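
The function above sends every attempt through a single proxy endpoint and relies on the provider to rotate IP addresses behind it. If you manage a list of proxies yourself, one way to rotate them is to move to the next proxy on each attempt. The following is a minimal sketch: the proxy addresses and the helper name proxy_rotation are placeholders for illustration, not real endpoints or library functions.

```python
from itertools import cycle

# Hypothetical proxy endpoints; replace them with the ones from your provider
PROXY_URLS = [
    "http://USER:PASS@HOST1:PORT",
    "http://USER:PASS@HOST2:PORT",
    "http://USER:PASS@HOST3:PORT",
]


def proxy_rotation(proxy_urls):
    """Yield a requests-style proxies dict for each proxy URL, cycling endlessly."""
    for proxy_url in cycle(proxy_urls):
        # Route both HTTP and HTTPS traffic through the same proxy
        yield {"http": proxy_url, "https": proxy_url}


rotation = proxy_rotation(PROXY_URLS)
for attempt in range(4):
    proxies = next(rotation)
    # In the retry loop, this dict would be passed as requests.get(url, proxies=proxies)
    print(f"Attempt {attempt + 1} uses {proxies['https']}")
```

With three proxies and four attempts, the fourth attempt wraps around to the first proxy again.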

Request retries in Python are crucial for effective web scraping. The methods we've discussed to manage retries can help prevent blocking and enhance the efficiency and reliability of data collection. Implementing these techniques will make your web scraping scripts more robust and less susceptible to detection by bot protection systems.
