Guide to scraping Amazon reviews using Python

Scraping Amazon reviews with Python is useful for competitor analysis, monitoring customer feedback, and market research. This guide demonstrates how to scrape Amazon product reviews efficiently with Python using the Requests and BeautifulSoup libraries.

Step 1. Installing required libraries

Before diving into the scraping process, ensure you have the necessary Python libraries installed:

pip install requests
pip install beautifulsoup4

Step 2. Configuring the scraping process

We will focus on extracting product reviews from an Amazon product page and walk through each stage of the scraping process step by step.

Understanding the website's structure

Inspect the HTML structure of the Amazon product reviews page to identify the elements we want to scrape: reviewer names, ratings, and comments.

Specifically, locate the following elements:

Product title and URL
Total rating
Review section
Author name
Rating
Comment

Sending HTTP requests

Use the Requests library to send an HTTP GET request to the Amazon product reviews page. Set up headers that mimic legitimate browser behavior, and route the request through a proxy; both are essential to avoid being detected and blocked by Amazon.

Proxies

Using proxies helps rotate IP addresses to avoid IP bans and rate limits from Amazon. It's crucial for large-scale scraping to maintain anonymity and prevent detection. Here, the proxy details are provided by the proxy service.
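
For large-scale scraping you would typically rotate through a pool of proxies rather than reuse a single one. Below is a minimal sketch, assuming a hypothetical proxy_pool list with placeholder addresses from your provider; a proxy is picked at random for each request:

import random

# Hypothetical pool of proxy endpoints supplied by your proxy service (placeholders)
proxy_pool = [
    'http://proxy1_ip:proxy1_port',
    'http://proxy2_ip:proxy2_port',
    'http://proxy3_ip:proxy3_port',
]

def get_rotating_proxy():
    # Choose a random proxy for each request to spread traffic across IP addresses
    address = random.choice(proxy_pool)
    return {'http': address, 'https': address}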

Complete request headers

Including various headers like Accept-Encoding, Accept-Language, Referer, Connection, and Upgrade-Insecure-Requests mimics a legitimate browser request, reducing the chance of being flagged as a bot.


import requests

url = "https://www.amazon.com/Portable-Mechanical-Keyboard-MageGee-Backlit/product-reviews/B098LG3N6R/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews"

# Example of a proxy provided by the proxy service
proxy = {
    'http': 'http://your_proxy_ip:your_proxy_port',
    'https': 'https://your_proxy_ip:your_proxy_port'
}

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'sec-ch-ua': '"Not/A)Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
    'sec-ch-ua-mobile': '?0',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
}

# Send HTTP GET request to the URL with headers and proxy
try:
    response = requests.get(url, headers=headers, proxies=proxy, timeout=10)
    response.raise_for_status()  # Raise an exception for bad response status
except requests.exceptions.RequestException as e:
    print(f"Error: {e}")
    raise SystemExit(1)  # Stop here; the following steps need a successful response

Step 3. Extracting product details using BeautifulSoup

Parse the HTML content of the response using BeautifulSoup to extract common product details such as URL, title, and total rating.


from bs4 import BeautifulSoup

soup = BeautifulSoup(response.content, 'html.parser')

# Extracting common product details
product_url = soup.find('a', {'data-hook': 'product-link'}).get('href', '')
product_title = soup.find('a', {'data-hook': 'product-link'}).get_text(strip=True)
total_rating = soup.find('span', {'data-hook': 'rating-out-of-text'}).get_text(strip=True)
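
Note that find() returns None when an element is missing, for example if Amazon serves a CAPTCHA page instead of the reviews, and calling get_text() on None raises an AttributeError. A small helper such as the hypothetical safe_text below is one way to fail gracefully:

def safe_text(element):
    # Return the element's stripped text, or an empty string if it was not found
    return element.get_text(strip=True) if element else ''

# For example, the same total rating lookup, tolerant of a missing element
total_rating = safe_text(soup.find('span', {'data-hook': 'rating-out-of-text'}))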

Step 4. Extracting review data using BeautifulSoup

Continue parsing the HTML content to extract reviewer names, ratings, and comments from the review elements and attributes identified earlier.


reviews = []
review_elements = soup.find_all('div', {'data-hook': 'review'})
for review in review_elements:
    author_name = review.find('span', class_='a-profile-name').get_text(strip=True)
    rating_given = review.find('i', class_='review-rating').get_text(strip=True)
    comment = review.find('span', class_='review-text').get_text(strip=True)

    reviews.append({
        'Product URL': product_url,
        'Product Title': product_title,
        'Total Rating': total_rating,
        'Author': author_name,
        'Rating': rating_given,
        'Comment': comment,
    })
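
The rating is extracted as text; on Amazon it typically reads like "5.0 out of 5 stars". If you need a numeric value for later analysis, one option is to split off the leading number, as in this small sketch:

# The rating text usually looks like "5.0 out of 5 stars"; keep only the number if needed
rating_value = float(rating_given.split(' ')[0])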

Step 5. Saving data to CSV

Use Python's built-in CSV module to save the extracted data into a CSV file for further analysis.


import csv

# Define CSV file path
csv_file = 'amazon_reviews.csv'

# Define CSV fieldnames
fieldnames = ['Product URL', 'Product Title', 'Total Rating', 'Author', 'Rating', 'Comment']

# Writing data to CSV file
with open(csv_file, mode='w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()
    for review in reviews:
        writer.writerow(review)

print(f"Data saved to {csv_file}")

Complete code

Here is the complete code to scrape Amazon review data and save it to a CSV file:


import requests
from bs4 import BeautifulSoup
import csv
import urllib3

# Suppress the InsecureRequestWarning triggered by verify=False in the request below
urllib3.disable_warnings()

# URL of the Amazon product reviews page
url = "https://www.amazon.com/Portable-Mechanical-Keyboard-MageGee-Backlit/product-reviews/B098LG3N6R/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews"

# Proxy provided by the proxy service with IP-authorization
path_proxy = 'your_proxy_ip:your_proxy_port'
proxy = {
   'http': f'http://{path_proxy}',
   'https': f'https://{path_proxy}'
}

# Headers for the HTTP request
headers = {
   'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
   'accept-language': 'en-US,en;q=0.9',
   'cache-control': 'no-cache',
   'dnt': '1',
   'pragma': 'no-cache',
   'sec-ch-ua': '"Not/A)Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
   'sec-ch-ua-mobile': '?0',
   'sec-fetch-dest': 'document',
   'sec-fetch-mode': 'navigate',
   'sec-fetch-site': 'same-origin',
   'sec-fetch-user': '?1',
   'upgrade-insecure-requests': '1',
   'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
}

# Send HTTP GET request to the URL with headers and handle exceptions
try:
   response = requests.get(url, headers=headers, timeout=10, proxies=proxy, verify=False)
   response.raise_for_status()  # Raise an exception for bad response status
except requests.exceptions.RequestException as e:
   print(f"Error: {e}")
   raise SystemExit(1)  # Stop if the request failed; the parsing below needs a valid response

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

# Extracting common product details
product_url = soup.find('a', {'data-hook': 'product-link'}).get('href', '')  # Extract product URL
product_title = soup.find('a', {'data-hook': 'product-link'}).get_text(strip=True)  # Extract product title
total_rating = soup.find('span', {'data-hook': 'rating-out-of-text'}).get_text(strip=True)  # Extract total rating

# Extracting individual reviews
reviews = []
review_elements = soup.find_all('div', {'data-hook': 'review'})
for review in review_elements:
   author_name = review.find('span', class_='a-profile-name').get_text(strip=True)  # Extract author name
   rating_given = review.find('i', class_='review-rating').get_text(strip=True)  # Extract rating given
   comment = review.find('span', class_='review-text').get_text(strip=True)  # Extract review comment

   # Store each review in a dictionary
   reviews.append({
       'Product URL': product_url,
       'Product Title': product_title,
       'Total Rating': total_rating,
       'Author': author_name,
       'Rating': rating_given,
       'Comment': comment,
   })

# Define CSV file path
csv_file = 'amazon_reviews.csv'

# Define CSV fieldnames
fieldnames = ['Product URL', 'Product Title', 'Total Rating', 'Author', 'Rating', 'Comment']

# Writing data to CSV file
with open(csv_file, mode='w', newline='', encoding='utf-8') as file:
   writer = csv.DictWriter(file, fieldnames=fieldnames)
   writer.writeheader()
   for review in reviews:
       writer.writerow(review)

# Print confirmation message
print(f"Data saved to {csv_file}")

In conclusion, it is worth emphasizing that selecting reliable proxy servers is a key step when writing web scraping scripts, as it helps bypass blocks and protects against anti-bot filters. The most suitable options for scraping are residential proxies, which offer a high trust factor and dynamic IP addresses, and static ISP proxies, which provide high speed and stable operation.
