How to rotate proxies while scraping web data


As useful as web scraping is for gathering data, it is frowned upon by many websites, and there can be consequences for scraping them, such as a ban on our IP address.

On a positive note, proxy services help us avoid this consequence. They allow us to take on a different IP address while gathering data online, and as effective as a single proxy is, using multiple proxies is better: rotating them while scraping makes our interaction with the website look less predictable and reduces the chance of being blocked.

The target website (source) for this guide is an online bookstore, books.toscrape.com, which imitates an e-commerce website for books. Each book on it has a name, a price and an availability status. Since this guide focuses on rotating proxies rather than on organizing the extracted data, the results will simply be printed to the console.

Preparing the work environment and integrating proxies

Before we can begin coding the functions that rotate the proxies and scrape the website, we need to install and import a few Python modules.

pip install requests beautifulsoup4 lxml

The three external Python modules needed for this scraping script can be installed with the command above. Requests lets us send HTTP requests to the website, BeautifulSoup (installed as beautifulsoup4) extracts the information we need from the HTML that requests returns, and lxml is the HTML parser BeautifulSoup uses under the hood.

In addition, we need the built-in threading module to test several proxies at the same time, json to read the proxy list from a JSON file, and time to pause while the proxy checks run.

import requests
import threading
from requests.auth import HTTPProxyAuth
import json
from bs4 import BeautifulSoup
import lxml
import time

url_to_scrape = "https://books.toscrape.com"
valid_proxies = []
book_names = []
book_price = []
book_availability = []
next_button_link = ""

Step 1: Verifying the proxies in our list

Building a scraping script that rotates proxies means we need a list of proxies to choose from during rotation. Some proxies require authentication, and others do not. We must create a list of dictionaries with proxy details, including the proxy username and password if authentication is needed.

The best approach to this is to put our proxy information in a separate JSON file organized like the one below:

[
  {
    "proxy_address": "XX.X.XX.X:XX",
    "proxy_username": "",
    "proxy_password": ""
  },
  {
    "proxy_address": "XX.X.XX.X:XX",
    "proxy_username": "",
    "proxy_password": ""
  },
  {
    "proxy_address": "XX.X.XX.X:XX",
    "proxy_username": "",
    "proxy_password": ""
  },
  {
    "proxy_address": "XX.X.XX.X:XX",
    "proxy_username": "",
    "proxy_password": ""
  }
]

In the “proxy_address” field, enter the IP address and port, separated by a colon. In the “proxy_username” and “proxy_password” fields, provide the username and password for authorization.

Above is the content of a JSON file with 4 proxies for the script to choose from. The username and password can be empty, indicating a proxy that requires no authentication.
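For clarity, here is how one of these dictionaries maps onto the proxies argument that requests expects. This is a minimal sketch (the helper name build_proxies is our own and is not part of the script below); it embeds the credentials directly in the proxy URL, which requests accepts for both HTTP and HTTPS traffic:

def build_proxies(proxy: dict) -> dict:
    # Hypothetical helper: converts one proxy entry into a requests-style mapping
    if proxy["proxy_username"] != "" and proxy["proxy_password"] != "":
        # Embed the credentials directly in the proxy URL
        address = f"http://{proxy['proxy_username']}:{proxy['proxy_password']}@{proxy['proxy_address']}"
    else:
        address = f"http://{proxy['proxy_address']}"
    # Use the same proxy for both plain HTTP and HTTPS requests
    return {"http": address, "https": address}

With the proxy list in place, the first function checks which of these proxies actually work: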

def verify_proxies(proxy: dict):
    try:
        if proxy['proxy_username'] != "" and proxy['proxy_password'] != "":
            proxy_auth = HTTPProxyAuth(proxy['proxy_username'], proxy['proxy_password'])
            res = requests.get(
                url_to_scrape,
                auth=proxy_auth,
                proxies={
                    # route both HTTP and HTTPS traffic through the proxy
                    "http": proxy['proxy_address'],
                    "https": proxy['proxy_address']
                }
            )
        else:
            res = requests.get(url_to_scrape, proxies={
                "http": proxy['proxy_address'],
                "https": proxy['proxy_address']
            })

        if res.status_code == 200:
            valid_proxies.append(proxy)
            print(f"Proxy Validated: {proxy['proxy_address']}")

    except Exception:
        print("Proxy Invalidated, Moving on")

As a precaution, this function ensures that the proxies provided are active and working. We achieve this by looping through each dictionary in the JSON file and sending a GET request to the website through it; if a status code of 200 comes back, we add that proxy to valid_proxies, the list we created earlier to hold the working proxies from the file. If the request fails, the proxy is skipped and execution continues.
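To see the function in action on its own, we can call it with a single entry. The address below is a placeholder from the documentation IP range, so with a real proxy the output would differ:

# Quick manual check of one entry (placeholder values, replace with a real proxy)
sample = {"proxy_address": "203.0.113.10:8080", "proxy_username": "", "proxy_password": ""}
verify_proxies(sample)
# A working proxy prints "Proxy Validated: ..." and lands in valid_proxies;
# a dead one prints "Proxy Invalidated, Moving on".
print(valid_proxies)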

Step 2: Sending the web scraping request

Since BeautifulSoup needs the website's HTML to extract the data we want, we create request_function(), which takes a URL and a proxy of choice and returns the page's HTML as text. The proxy parameter lets us route each request through a different proxy, which is what rotating proxies means in practice.

def request_function(url, proxy):
    try:
        if proxy['proxy_username'] != "" and proxy['proxy_password'] != "":
            proxy_auth = HTTPProxyAuth(proxy['proxy_username'], proxy['proxy_password'])
            response = requests.get(
                url,
                auth=proxy_auth,
                proxies={
                    "http": proxy['proxy_address'],
                    "https": proxy['proxy_address']
                }
            )
        else:
            response = requests.get(url, proxies={
                "http": proxy['proxy_address'],
                "https": proxy['proxy_address']
            })

        if response.status_code == 200:
            return response.text

    except Exception as err:
        print(f"Switching Proxies, URL access was unsuccessful: {err}")
        return None
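
As a quick usage example (assuming valid_proxies already holds at least one working proxy from Step 1), the homepage can be fetched like this:

# Fetch the homepage through the first verified proxy
html = request_function(url_to_scrape, valid_proxies[0])
if html is not None:
    print(html[:200])  # print the first 200 characters of the page's HTML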

Step 3: Extracting data from the target website

data_extract() pulls the data we need out of the HTML it is given. It first collects the elements that hold each book's information (name, price and availability) and then extracts the link to the next page.

This part is slightly tricky because the next button's href changes between pages: on the first page it already contains the catalogue/ prefix, while on later pages it does not, so the function adds the prefix only when it is missing. Finally, it loops through the books, stores each name, price and availability, and returns the absolute link of the next page, which we will use to retrieve that page's HTML.

def data_extract(response):
    soup = BeautifulSoup(response, "lxml")
    books = soup.find_all("li", class_="col-xs-6 col-sm-4 col-md-3 col-lg-3")
    next_button_link = soup.find("li", class_="next").find('a').get('href')
    next_button_link=f"{url_to_scrape}/{next_button_link}" if "catalogue" in next_button_link else f"{url_to_scrape}/catalogue/{next_button_link}"

    for each in books:
        book_names.append(each.find("img").get("alt"))
        book_price.append(each.find("p", class_="price_color").text)
        book_availability.append(each.find("p", class_="instock availability").text.strip())

    return next_button_link
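
For example, feeding it the homepage's HTML fills the book lists and returns the absolute URL of the next page. This assumes html holds the text of the first catalogue page, e.g. from the usage example in Step 2; the values in the comments are illustrative:

next_page = data_extract(html)
print(next_page)       # e.g. https://books.toscrape.com/catalogue/page-2.html
print(book_names[:3])  # the first few titles scraped from that page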

Step 4: Linking everything together

To link everything together, we have to:

  1. Load the proxy details from the JSON file and start a thread for each proxy with threading.Thread(), which lets us test several proxies at once; every working proxy ends up in valid_proxies. We then pause briefly so the verification threads have time to finish (a join()-based alternative is sketched after the snippet below).
  2. Load the homepage of the source through one of the valid proxies. If a proxy doesn't work, we try the next one, so that execution only continues once the homepage has loaded and the response is not None.
  3. Cycle through the valid proxies, using request_function() to send a GET request for the next page through each of them. Whenever a response comes back, we extract the data from that page.
  4. Finally, we print the gathered data to the console.
with open("proxy-list.json") as json_file:
    proxies = json.load(json_file)
    for each in proxies:
        threading.Thread(target=verify_proxies, args=(each,)).start()

time.sleep(4)

for i in range(len(valid_proxies)):
    response = request_function(url_to_scrape, valid_proxies[i])
    if response is not None:
        next_button_link = data_extract(response)
        break
    else:
        continue

for proxy in valid_proxies:
    print(f"Using Proxy: {proxy['proxy_address']}")
    response = request_function(next_button_link, proxy)
    if response is not None:
        next_button_link = data_extract(response)
    else:
        continue

for each in range(len(book_names)):
    print(f"No {each+1}: Book Name: {book_names[each]} Book Price: {book_price[each]} and Availability {book_availability[each]}")
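
The time.sleep(4) call above is a simple way of giving the verification threads a moment to finish. If a deterministic wait is preferred, the threads can be kept in a list and joined instead; this is a small variation on the script, not part of the original:

threads = []
with open("proxy-list.json") as json_file:
    proxies = json.load(json_file)
    for each in proxies:
        t = threading.Thread(target=verify_proxies, args=(each,))
        t.start()
        threads.append(t)

# Block until every verification thread has finished, however long it takes
for t in threads:
    t.join()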

Full code

import requests
import threading
from requests.auth import HTTPProxyAuth
import json
from bs4 import BeautifulSoup
import time

url_to_scrape = "https://books.toscrape.com"
valid_proxies = []
book_names = []
book_price = []
book_availability = []
next_button_link = ""


def verify_proxies(proxy: dict):
    try:
        if proxy['proxy_username'] != "" and proxy['proxy_password'] != "":
            proxy_auth = HTTPProxyAuth(proxy['proxy_username'], proxy['proxy_password'])
            res = requests.get(
                url_to_scrape,
                auth=proxy_auth,
                proxies={
                    "http": proxy['proxy_address'],
                    "https": proxy['proxy_address'],
                }
            )
        else:
            res = requests.get(url_to_scrape, proxies={
                "http": proxy['proxy_address'],
                "https": proxy['proxy_address'],
            })

        if res.status_code == 200:
            valid_proxies.append(proxy)
            print(f"Proxy Validated: {proxy['proxy_address']}")

    except Exception:
        print("Proxy Invalidated, Moving on")


# Retrieves the HTML of a page through the given proxy
def request_function(url, proxy):
    try:
        if proxy['proxy_username'] != "" and proxy['proxy_password'] != "":
            proxy_auth = HTTPProxyAuth(proxy['proxy_username'], proxy['proxy_password'])
            response = requests.get(
                url,
                auth=proxy_auth,
                proxies={
                    "http": proxy['proxy_address'],
                    "https": proxy['proxy_address'],
                }
            )
        else:
            response = requests.get(url, proxies={
                "http": proxy['proxy_address'],
                "https": proxy['proxy_address'],
            })

        if response.status_code == 200:
            return response.text

    except Exception as err:
        print(f"Switching Proxies, URL access was unsuccessful: {err}")
        return None


# Scraping: collects book data and returns the link to the next page
def data_extract(response):
    soup = BeautifulSoup(response, "lxml")
    books = soup.find_all("li", class_="col-xs-6 col-sm-4 col-md-3 col-lg-3")
    next_button_link = soup.find("li", class_="next").find('a').get('href')
    next_button_link = f"{url_to_scrape}/{next_button_link}" if "catalogue" in next_button_link else f"{url_to_scrape}/catalogue/{next_button_link}"

    for each in books:
        book_names.append(each.find("img").get("alt"))
        book_price.append(each.find("p", class_="price_color").text)
        book_availability.append(each.find("p", class_="instock availability").text.strip())

    return next_button_link


# Get the proxies from the JSON file and verify them in parallel
with open("proxy-list.json") as json_file:
    proxies = json.load(json_file)
    for each in proxies:
        threading.Thread(target=verify_proxies, args=(each,)).start()

time.sleep(4)

# Load the homepage through the first proxy that works
for i in range(len(valid_proxies)):
    response = request_function(url_to_scrape, valid_proxies[i])
    if response is not None:
        next_button_link = data_extract(response)
        break
    else:
        continue

# Rotate through the valid proxies, scraping one page per proxy
for proxy in valid_proxies:
    print(f"Using Proxy: {proxy['proxy_address']}")
    response = request_function(next_button_link, proxy)
    if response is not None:
        next_button_link = data_extract(response)
    else:
        continue

# Print the gathered data to the console
for each in range(len(book_names)):
    print(
        f"No {each + 1}: Book Name: {book_names[each]} Book Price: {book_price[each]} and Availability {book_availability[each]}")

Final result

After a successful execution, the results look like the console output captured below. The script goes on to extract information on over 100 books using the proxies provided.

[Screenshots: console output showing the validated proxies and the scraped book names, prices and availability]

Using multiple proxies for web scraping lets us increase the number of requests to the target resource and helps avoid blocks. To maintain the stability of the scraping process, it is advisable to use IP addresses that offer high speed and a strong trust factor, such as static ISP and dynamic residential proxies. Additionally, the functionality of the provided script can readily be expanded to accommodate various data scraping requirements; one possible extension is sketched below.
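
For instance, a crawl that keeps rotating through the validated proxies until the catalogue runs out of pages could look like the following sketch. It reuses the functions defined above and assumes at least one proxy in the pool keeps working; itertools.cycle simply repeats the proxy list endlessly, and the try/except stops the crawl on the last page, where the site has no next button:

import itertools

proxy_pool = itertools.cycle(valid_proxies)  # endlessly repeats the validated proxies
page_url = url_to_scrape

while page_url:
    proxy = next(proxy_pool)  # rotate to the next proxy for every request
    print(f"Using Proxy: {proxy['proxy_address']}")
    html = request_function(page_url, proxy)
    if html is None:
        continue  # retry the same page with the next proxy
    try:
        page_url = data_extract(html)
    except AttributeError:
        page_url = None  # no "next" button on the last page, so stop crawling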
