How to Scrape E-Commerce Websites with Python

Scraping product details from ecommerce sites is useful for competitive analysis, price monitoring, and market research, and Python makes the job convenient. This ecommerce scraping tutorial shows how to harvest product information from online stores using a combination of requests and lxml.

Ecommerce web scraping means extracting product information, such as the title, price, or identifier number, from online shops. Python's rich library ecosystem makes this both easy and efficient. In this article, we will scrape an ecommerce website with Python, using Costco's online store as our example.

Writing a Script for Ecommerce Data Scraping

To begin with, let’s install the Python libraries this script requires:


pip install requests
pip install lxml

We'll focus on extracting product names, features, and brands from specific pages on the website.

Step 1. Understanding the HTML structure of the website

Before building an ecommerce product scraper, you first have to understand how the target webpage is structured. Open the page you want to gather information from in a browser and inspect the required elements (e.g., the product's name, features, and brand) with the developer tools.
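
To make the target structure concrete, here is a hypothetical, heavily simplified fragment of the markup this tutorial works with, embedded in a small Python snippet. The attribute names (automation-id="productName", class="pdp-features", itemprop="brand") match the XPath expressions used in Step 3; the element text is purely illustrative.


from lxml import html

# Hypothetical simplified markup; real product pages contain far more structure
sample = """
<div>
    <h1 automation-id="productName">Kirkland Signature Men's Sneaker</h1>
    <div itemprop="brand">Kirkland Signature</div>
    <ul class="pdp-features">
        <li>Lightweight mesh upper</li>
    </ul>
</div>
"""

tree = html.fromstring(sample)
# Prints: ["Kirkland Signature Men's Sneaker"]
print(tree.xpath('//h1[@automation-id="productName"]/text()'))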

Step 2. Sending HTTP requests

First, we’ll import the requests library and send GET requests to the product pages. We’ll also configure the request headers to resemble those of a real browser.


import requests

# List of product URLs to scrape
urls = [
    "https://www.costco.com/kirkland-signature-men's-sneaker.product.4000216649.html",
    "https://www.costco.com/adidas-ladies'-puremotion-shoe.product.4000177646.html"
]

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'sec-ch-ua': '"Not/A)Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
    'sec-ch-ua-mobile': '?0',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
}

# Loop through each URL and send a GET request
for url in urls:
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        html_content = response.text
        # Further processing will be added in subsequent steps
    else:
        print(f"Failed to retrieve {url}. Status code: {response.status_code}")

Step 3. Extracting data using XPath and lxml

Next, we’ll parse the HTML with lxml and extract the desired fields using XPath expressions. This is the core of any ecommerce data scraping script.


from lxml import html

# List to store scraped data
scraped_data = []

# Loop through each URL and send a GET request
for url in urls:
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        html_content = response.content
        # Parse HTML content with lxml
        tree = html.fromstring(html_content)
        
        # Extract data using XPath
        product_name = tree.xpath('//h1[@automation-id="productName"]/text()')[0].strip()
        # Join feature text nodes into one string so the value serializes cleanly
        product_feature = ' | '.join(t.strip() for t in tree.xpath('//ul[@class="pdp-features"]//li//text()') if t.strip())
        product_brand = tree.xpath('//div[@itemprop="brand"]/text()')[0].strip()
        
        # Append extracted data to the list
        scraped_data.append({'Product Name': product_name, 'Product Feature': product_feature, 'Brand': product_brand})
    else:
        print(f"Failed to retrieve {url}. Status code: {response.status_code}")

# Print the scraped data
for item in scraped_data:
    print(item)
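
Note that indexing the XPath result with [0] raises an IndexError whenever an element is missing, for example when the page layout changes or an anti-bot page is returned instead of the product. As optional hardening (not part of the original script), a small helper can fall back to a default value inside the loop above:


def first_or_default(results, default='N/A'):
    """Return the first stripped XPath match, or a default when nothing matched."""
    return results[0].strip() if results else default

# Drop-in replacements for the two indexed lookups inside the loop
product_name = first_or_default(tree.xpath('//h1[@automation-id="productName"]/text()'))
product_brand = first_or_default(tree.xpath('//div[@itemprop="brand"]/text()'))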

Step 4. Addressing potential issues

When scraping an ecommerce website with Python, keep in mind that most sites run some form of anti-bot protection. Using proxies and rotating user agents helps make your requests look less like automated traffic.

Using proxies with IP-authorization:


proxies = {
    'http': 'http://your_proxy_ip:your_proxy_port',
    'https': 'https://your_proxy_ip:your_proxy_port'
}
response = requests.get(url, headers=headers, proxies=proxies)
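
If your proxies require username/password authentication instead of IP authorization, requests also accepts credentials embedded in the proxy URL. The placeholders below are hypothetical:


proxies = {
    'http': 'http://your_username:your_password@your_proxy_ip:your_proxy_port',
    'https': 'https://your_username:your_password@your_proxy_ip:your_proxy_port'
}
response = requests.get(url, headers=headers, proxies=proxies)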

Rotating User Agents:


import random

user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
    # Add more user agents as needed
]

headers['user-agent'] = random.choice(user_agents)

response = requests.get(url, headers=headers)
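
Spacing requests out also helps, since evenly timed, back-to-back requests are an easy bot signal. A randomized pause between requests is a simple countermeasure; the 2-5 second range below is an arbitrary choice:


import random
import time

for url in urls:
    # Pick a fresh user agent for every request
    headers['user-agent'] = random.choice(user_agents)
    response = requests.get(url, headers=headers)
    # Sleep for a random 2-5 seconds so requests are not evenly spaced
    time.sleep(random.uniform(2, 5))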

Step 5. Saving data to a CSV file

Finally, we’ll store the extracted data in a CSV file so it can be analyzed later.


import csv

csv_file = 'costco_products.csv'
fieldnames = ['Product Name', 'Product Feature', 'Brand']

# Writing data to CSV file
try:
    with open(csv_file, mode='w', newline='', encoding='utf-8') as file:
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        writer.writeheader()
        for item in scraped_data:
            writer.writerow(item)
    print(f"Data saved to {csv_file}")
except IOError:
    print(f"Error occurred while writing data to {csv_file}")

Complete code

Here is the final version of the script, combining all of the steps above. You can copy and paste it as a starting point.


import requests
import urllib3
from lxml import html
import csv
import random
import ssl

# Fall back to the standard-library SSL context and silence certificate
# warnings, since verification is disabled for the proxied requests below
ssl._create_default_https_context = ssl._create_stdlib_context
urllib3.disable_warnings()

# List of product URLs to scrape
urls = [
    "https://www.costco.com/kirkland-signature-men's-sneaker.product.4000216649.html",
    "https://www.costco.com/adidas-ladies'-puremotion-shoe.product.4000177646.html"
]

# Request headers that resemble a real browser
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'sec-ch-ua': '"Not/A)Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
    'sec-ch-ua-mobile': '?0',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
}

# List of user agents for rotating requests
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
    # Add more user agents as needed
]

# List of proxies for rotating requests
proxies = [
    {'http': 'http://your_proxy_ip:your_proxy_port', 'https': 'https://your_proxy_ip:your_proxy_port'},
    {'http': 'http://your_proxy_ip2:your_proxy_port2', 'https': 'https://your_proxy_ip2:your_proxy_port2'},
    # Add more proxies as needed
]

# List to store scraped data
scraped_data = []

# Loop through each URL and send a GET request
for url in urls:
    # Choose a random user agent for the request headers
    headers['user-agent'] = random.choice(user_agents)
    # Choose a random proxy for the request
    proxy = random.choice(proxies)

    # Send HTTP GET request to the URL with headers and proxy
    response = requests.get(url, headers=headers, proxies=proxy, verify=False)
    if response.status_code == 200:
        # Store the HTML content from the response
        html_content = response.content
        # Parse HTML content with lxml
        tree = html.fromstring(html_content)

        # Extract data using XPath
        product_name = tree.xpath('//h1[@automation-id="productName"]/text()')[0].strip()
        # Join feature text nodes into one string so the value serializes cleanly
        product_feature = ' | '.join(t.strip() for t in tree.xpath('//ul[@class="pdp-features"]//li//text()') if t.strip())
        product_brand = tree.xpath('//div[@itemprop="brand"]/text()')[0].strip()

        # Append extracted data to the list
        scraped_data.append({'Product Name': product_name, 'Product Feature': product_feature, 'Brand': product_brand})
    else:
        # Print error message if request fails
        print(f"Failed to retrieve {url}. Status code: {response.status_code}")

# CSV file setup
csv_file = 'costco_products.csv'
fieldnames = ['Product Name', 'Product Feature', 'Brand']

# Writing data to CSV file
try:
    with open(csv_file, mode='w', newline='', encoding='utf-8') as file:
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        writer.writeheader()
        for item in scraped_data:
            writer.writerow(item)
    print(f"Data saved to {csv_file}")
except IOError:
    # Print error message if writing to file fails
    print(f"Error occurred while writing data to {csv_file}")

The Python ecommerce scraper is now complete.

Ecommerce Data Scraping: Final Thoughts

Building a web scraper for Costco's online store shows how effective Python can be at obtaining product data for analysis and better business decision making. With the right script and libraries, requests and lxml, you can build automated extractors that keep running despite anti-bot protections. Finally, always scrape ethically and comply with the target site's terms of service when performing ecommerce web scraping.
