Amazon product data scraping with Python

Accessing data from e-commerce giants like Amazon is crucial for market analysis, pricing strategies, and product research. This data can be gathered through web scraping, a method that involves extracting large amounts of information from websites. However, Amazon rigorously protects its data, making traditional scraping techniques often ineffective. In this comprehensive guide, we'll delve into the methods of collecting product data from Amazon and discuss strategies to circumvent the platform’s robust anti-scraping systems. We'll explore the use of Python, proxies, and advanced scraping techniques that can help overcome these challenges and efficiently harvest the data needed for your business or research purposes.

Creating a Python script to scrape Amazon data

To successfully scrape data from Amazon, you can follow the structured algorithm outlined below; a compact end-to-end sketch follows the list of steps. This method ensures you retrieve the needed information efficiently and accurately.

Step 1: Sending HTTP requests to Amazon product pages:

  • Utilize the requests library to initiate HTTP GET requests targeting Amazon product pages;
  • Capture the raw HTML content from the HTTP response to prepare for parsing.

Step 2: Parsing the HTML content:

  • Employ the lxml library to parse the received raw HTML content;
  • Analyze the HTML to locate the specific data points you wish to extract;
  • Execute XPath queries to precisely target and extract these data points from the HTML structure.

Step 3: Storing the data:

  • After extracting the necessary data, proceed to save it in accessible formats like CSV or JSON, which facilitate easy analysis and integration with other applications.
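
As a compact preview of how these three steps fit together, consider the sketch below. The product URL and output file name are placeholders, the XPath expression is the one used later in this guide, and a bare request like this will usually be blocked by Amazon without the headers and proxies discussed in the following sections.

import csv
import requests
from lxml import html

url = "https://www.amazon.com/dp/EXAMPLE-ASIN"              # placeholder product URL
response = requests.get(url, timeout=30)                    # Step 1: send the HTTP request
tree = html.fromstring(response.text)                       # Step 2: parse the raw HTML
title = tree.xpath('//span[@id="productTitle"]/text()')     # Step 2: run an XPath query

# Step 3: store the result in a CSV file
with open("preview.csv", "a", newline="") as f:
    csv.writer(f).writerow([title[0].strip() if title else ""])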

Handling potential roadblocks

Amazon employs several measures to hinder scraping efforts, including connection speed limitations, CAPTCHA integration, and IP blocking. Users can adopt countermeasures to circumvent these obstacles, such as utilizing high-quality proxies.

For extensive scraping activities, advanced Python techniques can be employed to harvest substantial amounts of product data. These techniques include header stuffing (populating requests with full, realistic browser headers) and TLS fingerprinting, which help in evading detection and ensuring successful data extraction.
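
As an illustrative sketch of these countermeasures, the snippet below rotates through a small pool of proxies and user agents and retries failed requests with a randomized delay. The proxy endpoints and user-agent strings are placeholders, and TLS fingerprint adjustments require specialized tooling beyond the plain requests library, so they are not shown here.

import random
import time

import requests

# Placeholder pools; substitute real proxy endpoints and user-agent strings
proxy_pool = ["http://proxy1-host:port", "http://proxy2-host:port"]
useragent_pool = ["Mozilla/5.0 ... Chrome/123.0.0.0 Safari/537.36", "Mozilla/5.0 ... Firefox/124.0"]

def fetch_with_retries(url: str, retries: int = 3):
    """Fetch a URL, rotating the proxy and user agent on each attempt."""
    for attempt in range(retries):
        headers = {"user-agent": random.choice(useragent_pool)}  # rotate the user agent
        proxy = random.choice(proxy_pool)                        # rotate the proxy
        try:
            response = requests.get(
                url, headers=headers, proxies={"http": proxy, "https": proxy}, timeout=30
            )
            if response.status_code == 200:
                return response
        except requests.RequestException:
            pass
        time.sleep(random.uniform(2, 6))  # randomized delay before the next attempt
    return None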

These steps are explained in the coming sections of the article, where we will see their practical implementation using Python 3.12.2.

Prerequisites

To initiate a web scraping project, we will begin by setting up a basic scraper using the lxml library for HTML parsing and the requests library to manage HTTP requests directed at the Amazon web server.

Our focus will be on extracting essential information such as product names, prices, and ratings from Amazon product pages. We will also showcase techniques to efficiently parse HTML and manage requests, ensuring precise and organized extraction of data.

For maintaining project dependencies and avoiding conflicts, it is advisable to create a separate virtual environment for this web scraping endeavor. Using tools like “venv” or “pyenv” is recommended for setting up virtual environments.
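
For example, a virtual environment can be created and activated with venv as follows (the environment name is arbitrary):

 python -m venv scraping-env
source scraping-env/bin/activate  # On Windows: scraping-env\Scripts\activate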

Install third-party libraries

You'll need the following third-party Python libraries:

1. requests

Used to send HTTP requests and retrieve web content. It's often used for web scraping and interacting with web APIs.

Installation:

 pip install requests

2. lxml

A library for parsing and manipulating XML and HTML documents. It's frequently used for web scraping and working with structured data from web pages.

Installation:

 pip install lxml

Importing necessary libraries

Here we import the libraries required for our scraper to run: the requests library for handling HTTP requests, the csv library for CSV file operations, the random library for generating random values and making random choices, the lxml library for parsing the raw HTML content, and Dict and List from typing for type hinting.

 import requests
import csv
import random
from lxml import html
from typing import Dict, List

Reading inputs from the CSV file

The following code snippet reads a CSV file named amazon_product_urls.csv, where each line contains the URL of an Amazon product page. The code iterates over the rows, extracting the URL from each row and appending it to a list called urls.

urls = []
with open('amazon_product_urls.csv', 'r') as file:
    reader = csv.DictReader(file)
    for row in reader:
        urls.append(row['url'])

HTTP request headers and proxies

Request headers play an important role in HTTP requests, providing detailed client and request information to the server. While scraping, it is important to mimic the headers of a legitimate, authorized user to avoid detection and easily access the information you want. By mimicking commonly used headers, scrapers can bypass detection techniques, ensuring that data is extracted consistently while maintaining ethical standards.

Proxies act as intermediaries in web scraping, masking the scraper's IP address to prevent server-side detection and blocking. A rotating proxy allows you to send each request with a new IP address, avoiding potential blocks. Using residential or mobile proxies strengthens resilience against anti-scraping measures, because these IPs belong to real hosts and providers.

Code for integrating request headers and proxy servers with IP address authorization:

 headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-IN,en;q=0.9',
    'dnt': '1',
    'sec-ch-ua': '"Google Chrome";v="123", "Not:A-Brand";v="8", "Chromium";v="123"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
}
proxies = {'http': '', 'https': ''}
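
For reference, filled-in proxy values generally follow the patterns below; the host, port, and credentials are placeholders for your provider's details. With IP address authorization only the endpoint is needed, while username/password authorization embeds the credentials in the proxy URL.

# Placeholders; substitute your proxy provider's details
proxies = {'http': 'http://proxy-host:port', 'https': 'http://proxy-host:port'}
# With username/password authorization instead of IP authorization:
# proxies = {'http': 'http://user:pass@proxy-host:port', 'https': 'http://user:pass@proxy-host:port'}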

Useragent rotation

Here, we will create a list of user agents, from which a random one will be chosen for each request. Implementing a header rotation mechanism, such as rotating the User-Agent after each request, can further aid in bypassing bot detection measures.

useragents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4591.54 Safari/537.36",
    "Mozilla/5.0 (Windows NT 7_0_2; Win64; x64) AppleWebKit/541.38 (KHTML, like Gecko) Chrome/105.0.1585 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.7863.44 Safari/537.36",
]
headers['user-agent'] = random.choice(useragents)

Make HTTP requests to Amazon’s product page along with headers and proxy

The following line sends an HTTP GET request to the specified URL with the custom headers, the proxies defined above, and a timeout of 30 seconds.

response = requests.get(url=url, headers=headers, proxies=proxies, timeout=30)
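
Before parsing, it is also worth checking that the response contains an actual product page rather than a block or CAPTCHA page. The check below is a simple heuristic; the complete code later in this article uses a response-length check for the same purpose.

# Simple heuristic validation; blocked requests often return a short page or a CAPTCHA prompt
if response.status_code != 200 or 'captcha' in response.text.lower():
    print(f'Request to {url} appears to have been blocked or challenged')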

Identifying XPath/Selectors of required data points

Required data points: title, price, and ratings. Now, let's inspect and identify the corresponding XPath for the elements shown in the screenshots along with their respective data points.

The screenshot below shows the Chrome DevTools "Inspect" feature being used to find the XPath `//span[@id="productTitle"]/text()` for extracting the product title from an Amazon product page.

[Screenshot 1: product title element and its XPath in Chrome DevTools]

The following screenshot shows finding the respective XPath `//div[@id="corePrice_feature_div"]/div/div/span/span/text()` for extracting the product price from an Amazon product page.

[Screenshot 2: product price element and its XPath in Chrome DevTools]

The screenshot shows finding the respective XPath `//span[@id="acrPopover"]/@title` for extracting the product ratings from an Amazon product page.

[Screenshot 3: product ratings element and its XPath in Chrome DevTools]

Create a dictionary that provides XPath expressions to extract specific information from a webpage: the title, ratings, and price of a product.

xpath_queries = {
    'title': '//span[@id="productTitle"]/text()',
    'ratings': '//span[@id="acrPopover"]/@title',
    'price': '//span[@class="a-offscreen"]/text()',
}

Creating lxml parser from the HTML response

The code below parses the HTML content obtained from the GET request to the Amazon server into a structured tree-like format, allowing for easier navigation and manipulation of its elements and attributes.

tree = html.fromstring(response.text)

Extracting the required data

The following code snippet extracts data from the parsed HTML tree using an XPath query and assigns it to a dictionary under a specified key. strip() removes any leading and trailing whitespace. The snippet retrieves the first result of the XPath query and stores it under the given key in the extracted_data dictionary.

data = tree.xpath(xpath_query)[0].strip()
extracted_data[key] = data
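
Note that tree.xpath() returns a list, so indexing with [0] raises an IndexError when an element is missing, for example on a page layout variant. A slightly more defensive sketch looks like this:

results = tree.xpath(xpath_query)
extracted_data[key] = results[0].strip() if results else None  # fall back to None when the element is absent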

Saving the extracted data into CSV

The following code writes data from the extracted_data dictionary to a CSV file called product_data.csv. The header row is written only if the file is empty; otherwise the data is appended as an additional row. This allows the CSV file to be continuously updated with new extracted data without overwriting existing rows.

 csv_file_path = 'product_data.csv'
fieldnames = ['title', 'ratings', 'price']
with open(csv_file_path, 'a', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    if csvfile.tell() == 0:
        writer.writeheader()
    writer.writerow(extracted_data)
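
If you prefer the JSON format mentioned earlier, a simple alternative is to append each record as one JSON object per line (the JSON Lines convention); the file name here is just an example.

import json

with open('product_data.jsonl', 'a') as jsonfile:
    jsonfile.write(json.dumps(extracted_data) + '\n')  # one JSON object per line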

Code Implementation

Please refer to our complete code below, which will help you get started quickly. The code is well-structured and documented, making it beginner-friendly. To execute this code, you must have a CSV file called "amazon_product_urls.csv" in the same directory. Below is the structure of the CSV file:

[Screenshot 4: amazon_product_urls.csv with a "url" column header followed by one product page URL per row]

import requests
import csv
import random
from lxml import html
from typing import Dict, List


def send_requests(
    url: str, headers: Dict[str, str], proxies: Dict[str, str]
) -> requests.Response | None:
    """
    Sends an HTTP GET request to a URL with headers and proxies.

    Args:
        url (str): URL to send the request to.
        headers (Dict[str, str]): Dictionary containing request headers.
        proxies (Dict[str, str]): Dictionary containing proxy settings.

    Returns:
        Response: Response object for the URL, or None if the request failed
        or the page looks like a block/CAPTCHA response.
    """
    try:
        response = requests.get(url, headers=headers, proxies=proxies, timeout=30)
        # Response validation: a real product page is far longer than a block/CAPTCHA page
        if len(response.text) > 10000:
            return response
        return None
    except Exception as e:
        print(f"Error occurred while fetching URL {url}: {str(e)}")
        return None


def extract_data_from_html(
    response, xpath_queries: Dict[str, str]
) -> Dict[str, str]:
    """
    Extracts data from HTML content using XPath queries.

    Args:
        response (Response): Response Object.
        xpath_queries (Dict[str, str]): Dictionary containing XPath queries for data extraction.

    Returns:
        Dict[str, str]: Dictionary containing extracted data for each XPath query.
    """
    extracted_data = {}
    tree = html.fromstring(response.text)
    for key, xpath_query in xpath_queries.items():
        data = tree.xpath(xpath_query)[0].strip()
        extracted_data[key] = data
    return extracted_data


def save_to_csv(extracted_data: Dict[str, str]):
    """
    Saves a dictionary as a row in a CSV file using DictWriter.

    Args:
        extracted_data (Dict[str, str]): Dictionary representing a row of data.
    """
    csv_file_path = "product_data.csv"
    fieldnames = ["title", "ratings", "price"]
    with open(csv_file_path, "a", newline="") as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        if csvfile.tell() == 0:
            writer.writeheader()  # Write header only if the file is empty
        writer.writerow(extracted_data)


def main():
    # Reading URLs from a CSV file
    urls = []
    with open("amazon_product_urls.csv", "r") as file:
        reader = csv.DictReader(file)
        for row in reader:
            urls.append(row["url"])

    # Defining request headers
    headers = {
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
        "accept-language": "en-IN,en;q=0.9",
        "dnt": "1",
        "sec-ch-ua": '"Google Chrome";v="123", "Not:A-Brand";v="8", "Chromium";v="123"',
        "sec-ch-ua-mobile": "?0",
        "sec-ch-ua-platform": '"Windows"',
        "sec-fetch-dest": "document",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "same-origin",
        "sec-fetch-user": "?1",
        "upgrade-insecure-requests": "1",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
    }

    useragents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4591.54 Safari/537.36",
        "Mozilla/5.0 (Windows NT 7_0_2; Win64; x64) AppleWebKit/541.38 (KHTML, like Gecko) Chrome/105.0.1585 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.7863.44 Safari/537.36"
        ]

    # Defining proxies
    proxies = {"http": "IP:Port", "https": "IP:Port"}

    # Sending requests to URLs
    for url in urls:
        # Useragent rotation in headers
        headers["user-agent"] = random.choice(useragnets)
        response = send_requests(url, headers, proxies)
        if response:
            # Extracting data from HTML content
            xpath_queries = {
                "title": '//span[@id="productTitle"]/text()',
                "ratings": '//span[@id="acrPopover"]/@title',
                "price": '//span[@class="a-offscreen"]/text()',
            }
            extracted_data = extract_data_from_html(response, xpath_queries)

            # Saving extracted data to a CSV file
            save_to_csv(extracted_data)


if __name__ == "__main__":
    main()

Proxy suggestions for scraping Amazon through Python

Different proxy solutions, including datacenter IPv4, rotating mobile, ISP, and residential proxies, are available for uninterrupted data extraction. Proper rotation logic and user agents help simulate real user behaviour, while specialized proxies support large-scale scraping with built-in rotation and extensive IP pools. Understanding the pros and cons of each proxy type is crucial when choosing the right option for your project.

Datacenter proxies

  • Pros: high speed and performance; cost-effective; ideal for large volumes of requests.
  • Cons: can be easily detected and blacklisted; not reliable against anti-scraping or anti-bot systems.

Residential proxies

  • Pros: high legitimacy due to real residential IPs; wide global IP availability for location-specific data scraping; IP rotation capabilities.
  • Cons: more expensive than datacenter proxies.

Mobile proxies

  • Pros: highly legitimate IPs; effective for avoiding blocks and verification prompts.
  • Cons: more expensive than other proxy types; slower than datacenter proxies due to mobile network reliance.

ISP proxies

  • Pros: highly reliable IPs; faster than residential IPs.
  • Cons: limited IP availability; IP rotation not available.

Scraping product data from Amazon involves meticulous preparation to navigate the platform's anti-scraping mechanisms effectively. Utilizing proxy servers along with Python enables efficient data processing and targeted extraction of necessary information. When selecting proxies for web scraping, it's crucial to consider factors such as performance, cost, server reliability, and the specific requirements of your project. Employing dynamic proxies and implementing strategies to counteract security measures can minimize the risk of being blocked and enhance the overall efficiency of the scraping process.
