As useful as this approach to data gathering is, many websites frown upon it, and scraping can carry consequences such as a ban on our IP address.
Proxy services help avoid this. They let us take on a different IP address while gathering data online, and using multiple proxies is better still: rotating between them makes our interaction with the website look less uniform and further reduces the chance of being blocked.
The target website for this guide is an online bookstore that imitates a real e-commerce site. Each book on it has a name, a price and an availability status. Since this guide focuses on rotating proxies rather than on organizing the extracted data, the results will simply be printed to the console.
Before we can start writing the functions that rotate the proxies and scrape the website, we need to install and import a few Python modules.
pip install requests beautifulsoup4 lxml
The command above installs the three external packages this scraping script needs: Requests lets us send HTTP requests to the website, BeautifulSoup4 extracts information from the HTML that Requests returns, and lxml is the parser BeautifulSoup uses to read that HTML.
In addition, we need the built-in threading module to test the proxies concurrently, json to read the proxy list from a JSON file, and time to pause while the checks finish.
import requests
import threading
from requests.auth import HTTPProxyAuth
import json
from bs4 import BeautifulSoup
import lxml
import time
url_to_scrape = "https://books.toscrape.com"
valid_proxies = []
book_names = []
book_price = []
book_availability = []
next_button_link = ""
A script that rotates proxies needs a list of proxies to choose from. Some proxies require authentication and others do not, so we create a list of dictionaries holding each proxy's details, including a username and password where authentication is needed.
The cleanest approach is to keep this proxy information in a separate JSON file organized like the one below:
[
    {
        "proxy_address": "XX.X.XX.X:XX",
        "proxy_username": "",
        "proxy_password": ""
    },
    {
        "proxy_address": "XX.X.XX.X:XX",
        "proxy_username": "",
        "proxy_password": ""
    },
    {
        "proxy_address": "XX.X.XX.X:XX",
        "proxy_username": "",
        "proxy_password": ""
    },
    {
        "proxy_address": "XX.X.XX.X:XX",
        "proxy_username": "",
        "proxy_password": ""
    }
]
In the “proxy_address” field, enter the IP address and port separated by a colon. In the “proxy_username” and “proxy_password” fields, provide the credentials used for authorization; both can be left empty for a proxy that requires no authentication. The file above gives the script four proxies to choose from.
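For reference, each entry in this file maps onto the proxies argument that Requests expects: a dictionary keyed by URL scheme. When a proxy requires authentication, many providers also accept the credentials embedded directly in the proxy URL, which tends to be the most dependable option for HTTPS targets. Below is a small sketch of that mapping; build_proxies() is a hypothetical helper of our own, not part of the script that follows.

def build_proxies(proxy: dict) -> dict:
    # Hypothetical helper: turn one JSON entry into a Requests proxies mapping.
    address = proxy['proxy_address']
    if proxy['proxy_username'] and proxy['proxy_password']:
        # Embed the credentials directly in the proxy URL.
        address = f"{proxy['proxy_username']}:{proxy['proxy_password']}@{address}"
    return {
        "http": f"http://{address}",
        "https": f"http://{address}",  # the same proxy also tunnels HTTPS traffic
    }

With the proxy list in place, we can move on to the first function of the script, verify_proxies():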
def verify_proxies(proxy: dict):
    try:
        if proxy['proxy_username'] != "" and proxy['proxy_password'] != "":
            proxy_auth = HTTPProxyAuth(proxy['proxy_username'], proxy['proxy_password'])
            res = requests.get(
                url_to_scrape,
                auth=proxy_auth,
                proxies={
                    "http": proxy['proxy_address'],
                    # include "https" too, so requests to the https:// URL also go through the proxy
                    "https": proxy['proxy_address'],
                }
            )
        else:
            res = requests.get(url_to_scrape, proxies={
                "http": proxy['proxy_address'],
                "https": proxy['proxy_address'],
            })
        if res.status_code == 200:
            valid_proxies.append(proxy)
            print(f"Proxy Validated: {proxy['proxy_address']}")
    except Exception:
        print("Proxy Invalidated, Moving on")
As a precaution, this function makes sure the proxies provided are active and working. For each dictionary from the JSON file it sends a GET request to the website through that proxy; if a status code of 200 comes back, the proxy is added to valid_proxies, the list we created earlier to hold the working proxies from the file. If the request fails, the function simply reports it and execution continues.
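One practical refinement: a dead proxy can leave requests.get() waiting for a long time, so passing a timeout lets each check fail fast. The sketch below is a self-contained variant of the same idea; check_proxy() and the 5-second value are our own illustrative choices, not part of the guide's script.

import requests
from requests.auth import HTTPProxyAuth

def check_proxy(proxy: dict, url: str, timeout_s: float = 5.0) -> bool:
    # Hypothetical helper: returns True if `url` can be fetched through `proxy`
    # within `timeout_s` seconds, False otherwise.
    proxies = {"http": proxy['proxy_address'], "https": proxy['proxy_address']}
    auth = None
    if proxy['proxy_username'] and proxy['proxy_password']:
        auth = HTTPProxyAuth(proxy['proxy_username'], proxy['proxy_password'])
    try:
        res = requests.get(url, auth=auth, proxies=proxies, timeout=timeout_s)
        return res.status_code == 200
    except requests.RequestException:
        return False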
Since BeautifulSoup needs the page's HTML to extract data from, we have created request_function(), which takes a URL and the proxy of choice and returns the HTML as text. The proxy parameter lets us route each request through a different proxy, which is what rotates the proxies.
def request_function(url, proxy):
    try:
        if proxy['proxy_username'] != "" and proxy['proxy_password'] != "":
            proxy_auth = HTTPProxyAuth(proxy['proxy_username'], proxy['proxy_password'])
            response = requests.get(
                url,
                auth=proxy_auth,
                proxies={
                    "http": proxy['proxy_address'],
                    "https": proxy['proxy_address'],
                }
            )
        else:
            response = requests.get(url, proxies={
                "http": proxy['proxy_address'],
                "https": proxy['proxy_address'],
            })
        if response.status_code == 200:
            return response.text
    except Exception as err:
        print(f"Switching Proxies, URL access was unsuccessful: {err}")
        return None
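The main script below calls request_function() once per valid proxy. If you ever need to fetch more pages than you have proxies, a common pattern is to cycle through the valid list round-robin with itertools.cycle; the sketch below assumes a page_urls iterable of your own and is not part of this guide's script.

from itertools import cycle

# Round-robin rotation: reuse the valid proxies in order once the list runs out.
proxy_pool = cycle(valid_proxies)
for page_url in page_urls:  # page_urls is a hypothetical list of URLs to fetch
    html = request_function(page_url, next(proxy_pool))
    if html is not None:
        data_extract(html)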
data_extract() pulls the data we need out of the HTML it is given. It first gathers the HTML elements housing each book's information (the name, price and availability) and also extracts the link to the next page.
The next-page link is a little tricky because its format changes from page to page, so the function has to account for that. Finally, it loops through the books, stores each name, price and availability, and returns the next-page link, which we use to retrieve the HTML of the following page.
def data_extract(response):
    soup = BeautifulSoup(response, "lxml")
    books = soup.find_all("li", class_="col-xs-6 col-sm-4 col-md-3 col-lg-3")
    next_button_link = soup.find("li", class_="next").find('a').get('href')
    next_button_link = f"{url_to_scrape}/{next_button_link}" if "catalogue" in next_button_link else f"{url_to_scrape}/catalogue/{next_button_link}"
    for each in books:
        book_names.append(each.find("img").get("alt"))
        book_price.append(each.find("p", class_="price_color").text)
        book_availability.append(each.find("p", class_="instock availability").text.strip())
    return next_button_link
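To make the dynamic-link handling concrete: on the landing page the next-page href usually carries a catalogue/ prefix, while on later pages it does not, and the conditional above normalizes both into a full URL. A quick, self-contained illustration with example href values (the exact strings depend on which page is being parsed):

url_to_scrape = "https://books.toscrape.com"

for href in ("catalogue/page-2.html", "page-3.html"):
    link = f"{url_to_scrape}/{href}" if "catalogue" in href else f"{url_to_scrape}/catalogue/{href}"
    print(link)
# both print a full URL under https://books.toscrape.com/catalogue/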
To link everything together, we load the proxy list from the JSON file, verify the proxies in parallel threads, and then rotate through the valid ones, fetching one page per proxy:
with open("proxy-list.json") as json_file:
proxies = json.load(json_file)
for each in proxies:
threading.Thread(target=verify_proxies, args=(each, )).start()
time.sleep(4)
for i in range(len(valid_proxies)):
response = request_function(url_to_scrape, valid_proxies[i])
if response != None:
next_button_link = data_extract(response)
break
else:
continue
for proxy in valid_proxies:
print(f"Using Proxy: {proxy['proxy_address']}")
response = request_function(next_button_link, proxy)
if response is not None:
next_button_link = data_extract(response)
else:
continue
for each in range(len(book_names)):
print(f"No {each+1}: Book Name: {book_names[each]} Book Price: {book_price[each]} and Availability {book_availability[each]}")
import requests
import threading
from requests.auth import HTTPProxyAuth
import json
from bs4 import BeautifulSoup
import time
url_to_scrape = "https://books.toscrape.com"
valid_proxies = []
book_names = []
book_price = []
book_availability = []
next_button_link = ""
# Checks that a proxy is active and working
def verify_proxies(proxy: dict):
    try:
        if proxy['proxy_username'] != "" and proxy['proxy_password'] != "":
            proxy_auth = HTTPProxyAuth(proxy['proxy_username'], proxy['proxy_password'])
            res = requests.get(
                url_to_scrape,
                auth=proxy_auth,
                proxies={
                    "http": proxy['proxy_address'],
                    "https": proxy['proxy_address'],
                }
            )
        else:
            res = requests.get(url_to_scrape, proxies={
                "http": proxy['proxy_address'],
                "https": proxy['proxy_address'],
            })
        if res.status_code == 200:
            valid_proxies.append(proxy)
            print(f"Proxy Validated: {proxy['proxy_address']}")
    except Exception:
        print("Proxy Invalidated, Moving on")
# Retrieves the HTML of a page through the given proxy
def request_function(url, proxy):
    try:
        if proxy['proxy_username'] != "" and proxy['proxy_password'] != "":
            proxy_auth = HTTPProxyAuth(proxy['proxy_username'], proxy['proxy_password'])
            response = requests.get(
                url,
                auth=proxy_auth,
                proxies={
                    "http": proxy['proxy_address'],
                    "https": proxy['proxy_address'],
                }
            )
        else:
            response = requests.get(url, proxies={
                "http": proxy['proxy_address'],
                "https": proxy['proxy_address'],
            })
        if response.status_code == 200:
            return response.text
    except Exception as err:
        print(f"Switching Proxies, URL access was unsuccessful: {err}")
        return None
# Scraping: extracts the book data and the next-page link
def data_extract(response):
    soup = BeautifulSoup(response, "lxml")
    books = soup.find_all("li", class_="col-xs-6 col-sm-4 col-md-3 col-lg-3")
    next_button_link = soup.find("li", class_="next").find('a').get('href')
    next_button_link = f"{url_to_scrape}/{next_button_link}" if "catalogue" in next_button_link else f"{url_to_scrape}/catalogue/{next_button_link}"
    for each in books:
        book_names.append(each.find("img").get("alt"))
        book_price.append(each.find("p", class_="price_color").text)
        book_availability.append(each.find("p", class_="instock availability").text.strip())
    return next_button_link
# Get the proxies from the JSON file
with open("proxy-list.json") as json_file:
    proxies = json.load(json_file)

# Verify every proxy in its own thread
for each in proxies:
    threading.Thread(target=verify_proxies, args=(each,)).start()

# Give the verification threads time to finish
time.sleep(4)

# Fetch the first page with the first proxy that responds
for i in range(len(valid_proxies)):
    response = request_function(url_to_scrape, valid_proxies[i])
    if response is not None:
        next_button_link = data_extract(response)
        break
    else:
        continue

# Rotate through the valid proxies, fetching one page per proxy
for proxy in valid_proxies:
    print(f"Using Proxy: {proxy['proxy_address']}")
    response = request_function(next_button_link, proxy)
    if response is not None:
        next_button_link = data_extract(response)
    else:
        continue

# Print everything that was collected
for each in range(len(book_names)):
    print(
        f"No {each + 1}: Book Name: {book_names[each]} Book Price: {book_price[each]} and Availability {book_availability[each]}")
After a successful execution, the script prints one line per book to the console, going on to extract information on over 100 books with the working proxies from our list.
Using multiple proxies for web scraping lets you send more requests to the target resource and helps bypass blocking. To keep the scraping process stable, it is advisable to use IP addresses with high speed and a strong trust factor, such as static ISP and dynamic residential proxies. The script provided here can also be readily extended to cover other data scraping requirements.
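For example, instead of only printing to the console, the collected lists could be written to a CSV file. A minimal sketch, where books.csv is just an illustrative filename and the three lists are assumed to have been filled by data_extract() as above:

import csv

with open("books.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "price", "availability"])
    writer.writerows(zip(book_names, book_price, book_availability))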