Scraping Amazon reviews with Python is useful when conducting competitor analysis, checking reviews, and doing market research. This demonstrates how to scrape product reviews on Amazon efficiently with Python, BeautifulSoup, and Requests libraries.
Before diving into the scraping process, ensure you have the necessary Python libraries installed:
pip install requests
pip install beautifulsoup4
We will focus on extracting product reviews from the Amazon page and examine each stage of the scraping process step-by-step.
Inspect the HTML structure of the Amazon product reviews page to identify the elements we want to scrape: reviewer names, ratings, and comments.
Product title and URL:
Total rating:
Review section:
Author name:
Rating:
Comment:
Use the Requests library to send HTTP GET requests to the Amazon product reviews page. Set up headers to mimic legitimate browser behavior and avoid detection. Proxies and complete request headers are essential to avoid being blocked by Amazon.
Using proxies helps rotate IP addresses to avoid IP bans and rate limits from Amazon. It's crucial for large-scale scraping to maintain anonymity and prevent detection. Here, the proxy details are provided by the proxy service.
Including various headers like Accept-Encoding, Accept-Language, Referer, Connection, and Upgrade-Insecure-Requests mimics a legitimate browser request, reducing the chance of being flagged as a bot.
import requests
url = "https://www.amazon.com/Portable-Mechanical-Keyboard-MageGee-Backlit/product-reviews/B098LG3N6R/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews"
# Example of a proxy provided by the proxy service
proxy = {
'http': 'http://your_proxy_ip:your_proxy_port',
'https': 'https://your_proxy_ip:your_proxy_port'
}
headers = {
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
'accept-language': 'en-US,en;q=0.9',
'cache-control': 'no-cache',
'dnt': '1',
'pragma': 'no-cache',
'sec-ch-ua': '"Not/A)Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
'sec-ch-ua-mobile': '?0',
'sec-fetch-dest': 'document',
'sec-fetch-mode': 'navigate',
'sec-fetch-site': 'same-origin',
'sec-fetch-user': '?1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
}
# Send HTTP GET request to the URL with headers and proxy
try:
response = requests.get(url, headers=headers, proxies=proxy, timeout=10)
response.raise_for_status() # Raise an exception for bad response status
except requests.exceptions.RequestException as e:
print(f"Error: {e}")
Parse the HTML content of the response using BeautifulSoup to extract common product details such as URL, title, and total rating.
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')
# Extracting common product details
product_url = soup.find('a', {'data-hook': 'product-link'}).get('href', '')
product_title = soup.find('a', {'data-hook': 'product-link'}).get_text(strip=True)
total_rating = soup.find('span', {'data-hook': 'rating-out-of-text'}).get_text(strip=True)
Continue parsing the HTML content to extract reviewer names, ratings, and comments based on the identified XPath expressions.
reviews = []
review_elements = soup.find_all('div', {'data-hook': 'review'})
for review in review_elements:
author_name = review.find('span', class_='a-profile-name').get_text(strip=True)
rating_given = review.find('i', class_='review-rating').get_text(strip=True)
comment = review.find('span', class_='review-text').get_text(strip=True)
reviews.append({
'Product URL': product_url,
'Product Title': product_title,
'Total Rating': total_rating,
'Author': author_name,
'Rating': rating_given,
'Comment': comment,
})
Use Python's built-in CSV module to save the extracted data into a CSV file for further analysis.
import csv
# Define CSV file path
csv_file = 'amazon_reviews.csv'
# Define CSV fieldnames
fieldnames = ['Product URL', 'Product Title', 'Total Rating', 'Author', 'Rating', 'Comment']
# Writing data to CSV file
with open(csv_file, mode='w', newline='', encoding='utf-8') as file:
writer = csv.DictWriter(file, fieldnames=fieldnames)
writer.writeheader()
for review in reviews:
writer.writerow(review)
print(f"Data saved to {csv_file}")
Here is the complete code to scrape Amazon review data and save it to a CSV file:
import requests
from bs4 import BeautifulSoup
import csv
import urllib3
urllib3.disable_warnings()
# URL of the Amazon product reviews page
url = "https://www.amazon.com/Portable-Mechanical-Keyboard-MageGee-Backlit/product-reviews/B098LG3N6R/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews"
# Proxy provided by the proxy service with IP-authorization
path_proxy = 'your_proxy_ip:your_proxy_port'
proxy = {
'http': f'http://{path_proxy}',
'https': f'https://{path_proxy}'
}
# Headers for the HTTP request
headers = {
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
'accept-language': 'en-US,en;q=0.9',
'cache-control': 'no-cache',
'dnt': '1',
'pragma': 'no-cache',
'sec-ch-ua': '"Not/A)Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
'sec-ch-ua-mobile': '?0',
'sec-fetch-dest': 'document',
'sec-fetch-mode': 'navigate',
'sec-fetch-site': 'same-origin',
'sec-fetch-user': '?1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
}
# Send HTTP GET request to the URL with headers and handle exceptions
try:
response = requests.get(url, headers=headers, timeout=10, proxies=proxy, verify=False)
response.raise_for_status() # Raise an exception for bad response status
except requests.exceptions.RequestException as e:
print(f"Error: {e}")
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')
# Extracting common product details
product_url = soup.find('a', {'data-hook': 'product-link'}).get('href', '') # Extract product URL
product_title = soup.find('a', {'data-hook': 'product-link'}).get_text(strip=True) # Extract product title
total_rating = soup.find('span', {'data-hook': 'rating-out-of-text'}).get_text(strip=True) # Extract total rating
# Extracting individual reviews
reviews = []
review_elements = soup.find_all('div', {'data-hook': 'review'})
for review in review_elements:
author_name = review.find('span', class_='a-profile-name').get_text(strip=True) # Extract author name
rating_given = review.find('i', class_='review-rating').get_text(strip=True) # Extract rating given
comment = review.find('span', class_='review-text').get_text(strip=True) # Extract review comment
# Store each review in a dictionary
reviews.append({
'Product URL': product_url,
'Product Title': product_title,
'Total Rating': total_rating,
'Author': author_name,
'Rating': rating_given,
'Comment': comment,
})
# Define CSV file path
csv_file = 'amazon_reviews.csv'
# Define CSV fieldnames
fieldnames = ['Product URL', 'Product Title', 'Total Rating', 'Author', 'Rating', 'Comment']
# Writing data to CSV file
with open(csv_file, mode='w', newline='', encoding='utf-8') as file:
writer = csv.DictWriter(file, fieldnames=fieldnames)
writer.writeheader()
for review in reviews:
writer.writerow(review)
# Print confirmation message
print(f"Data saved to {csv_file}")
In conclusion, it is crucial to emphasize that selecting reliable proxy servers is a key step in scriptwriting for web scraping. This ensures effective bypassing of blockages and protection against anti-bot filters. The most suitable options for scraping are residential proxy servers, which offer a high trust factor and dynamic IP addresses, along with static ISP proxies that provide high speed and operational stability.
Comments: 0