How to scrape e-commerce websites with Python

Scraping product data from e-commerce websites is essential for competitive analysis, price monitoring, and market research. E-commerce scraping means extracting product details such as names, prices, and IDs from online stores, and Python, with its versatile libraries, makes the task efficient and straightforward. In this guide, we will demonstrate how to scrape product information from Costco's website using a combination of the requests and lxml libraries.

Writing a script to extract product data

Before diving into the scraping process, ensure you have the necessary Python libraries installed:

pip install requests
pip install lxml

We'll focus on extracting product names, product features, and product brands from specific product pages on the website.

Step 1. Understanding the HTML structure of the website

To extract data from any website, you first need to understand the structure of its pages. Open a product page in your browser and inspect the elements you want to scrape (e.g., product name, features, brand).
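Before writing the full scraper, it helps to confirm that your XPath expressions actually match the markup. Below is a minimal sketch assuming the product title is an h1 element with an automation-id="productName" attribute; treat that attribute as an assumption and verify it in your browser's developer tools, since page markup changes over time.

from lxml import html

# A tiny HTML sample imitating the assumed product page structure
sample = '''
<html><body>
  <h1 automation-id="productName">Kirkland Signature Men's Sneaker</h1>
  <div itemprop="brand">Kirkland Signature</div>
</body></html>
'''

tree = html.fromstring(sample)
# Should print the product name if the XPath matches the structure
print(tree.xpath('//h1[@automation-id="productName"]/text()'))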

Step 2. Sending HTTP requests

First, we'll use the requests library to send HTTP GET requests to the product pages. We will also set up the request headers to mimic a real browser request.


import requests

# List of product URLs to scrape
urls = [
    "https://www.costco.com/kirkland-signature-men's-sneaker.product.4000216649.html",
    "https://www.costco.com/adidas-ladies'-puremotion-shoe.product.4000177646.html"
]

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'sec-ch-ua': '"Not/A)Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
    'sec-ch-ua-mobile': '?0',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
}

# Loop through each URL and send a GET request
for url in urls:
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        html_content = response.text
        # Further processing will be added in subsequent steps
    else:
        print(f"Failed to retrieve {url}. Status code: {response.status_code}")

Step 3. Extracting data using XPath and lxml

Next, we parse the HTML content with lxml and extract the required data points using XPath expressions.

from lxml import html

# List to store scraped data
scraped_data = []

# Loop through each URL and send a GET request
for url in urls:
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        html_content = response.content
        # Parse HTML content with lxml
        tree = html.fromstring(html_content)
        
        # Extract data using XPath

        product_name = tree.xpath('//h1[@automation-id="productName"]/text()')[0].strip()
        # Join the feature text nodes into one string for cleaner output
        product_feature = ', '.join(
            text.strip() for text in tree.xpath('//ul[@class="pdp-features"]//li//text()') if text.strip()
        )
        product_brand = tree.xpath('//div[@itemprop="brand"]/text()')[0].strip()
        
        # Append extracted data to the list
        scraped_data.append({'Product Name': product_name, 'Product Feature': product_feature, 'Brand': product_brand})
    else:
        print(f"Failed to retrieve {url}. Status code: {response.status_code}")

# Print the scraped data
for item in scraped_data:
    print(item)
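Note that the [0] indexing above raises an IndexError when an XPath matches nothing, for example if the page layout changes or an anti-bot page is returned instead of the product page. A small defensive helper avoids that (the 'N/A' default is an arbitrary choice):

def first_or_default(tree, xpath, default='N/A'):
    # Return the first stripped XPath match, or the default when nothing matches
    matches = tree.xpath(xpath)
    return matches[0].strip() if matches else default

product_name = first_or_default(tree, '//h1[@automation-id="productName"]/text()')
product_brand = first_or_default(tree, '//div[@itemprop="brand"]/text()')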

Step 4. Addressing potential issues

Websites often implement anti-bot measures. Using proxies and rotating user agents can help avoid detection.

Using proxies with IP authorization (the proxy provider whitelists your machine's IP, so no credentials are needed in the URL):


proxies = {
    'http': 'http://your_proxy_ip:your_proxy_port',
    'https': 'https://your_proxy_ip:your_proxy_port'
}
response = requests.get(url, proxies=proxies)
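If your provider uses username/password authentication instead of IP whitelisting, requests accepts credentials embedded in the proxy URL. A sketch with placeholder values:

proxies = {
    'http': 'http://your_username:your_password@your_proxy_ip:your_proxy_port',
    'https': 'https://your_username:your_password@your_proxy_ip:your_proxy_port'
}
response = requests.get(url, proxies=proxies)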

Rotating User Agents:


import random

user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0',
    # Add more user agents as needed
]

headers['user-agent'] = random.choice(user_agents)

response = requests.get(url, headers=headers)
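Pausing between requests also makes traffic look less automated and reduces the chance of rate limiting. A minimal sketch (the 1-3 second range is an arbitrary assumption; tune it to the target site):

import random
import time

for url in urls:
    headers['user-agent'] = random.choice(user_agents)
    response = requests.get(url, headers=headers)
    # Wait a random 1-3 seconds before the next request
    time.sleep(random.uniform(1, 3))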

Step 5. Saving data to a CSV file

Finally, we will save the scraped data to a CSV file for further analysis.

import csv

csv_file = 'costco_products.csv'
fieldnames = ['Product Name', 'Product Feature', 'Brand']

# Writing data to CSV file
try:
    with open(csv_file, mode='w', newline='', encoding='utf-8') as file:
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        writer.writeheader()
        for item in scraped_data:
            writer.writerow(item)
    print(f"Data saved to {csv_file}")
except IOError:
    print(f"Error occurred while writing data to {csv_file}")

Complete code


import requests
import urllib3
from lxml import html
import csv
import random
import ssl

# Relax SSL certificate verification and silence the resulting warnings;
# convenient when routing traffic through proxies, but unsafe for production use
ssl._create_default_https_context = ssl._create_stdlib_context
urllib3.disable_warnings()

# List of product URLs to scrape
urls = [
   "https://www.costco.com/kirkland-signature-men's-sneaker.product.4000216649.html",
   "https://www.costco.com/adidas-ladies'-puremotion-shoe.product.4000177646.html"
]

# headers
headers = {
   'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
   'accept-language': 'en-US,en;q=0.9',
   'cache-control': 'no-cache',
   'dnt': '1',
   'pragma': 'no-cache',
   'sec-ch-ua': '"Not/A)Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
   'sec-ch-ua-mobile': '?0',
   'sec-fetch-dest': 'document',
   'sec-fetch-mode': 'navigate',
   'sec-fetch-site': 'same-origin',
   'sec-fetch-user': '?1',
   'upgrade-insecure-requests': '1',
   'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
}

# List of user agents for rotating requests
user_agents = [
   'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
   'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0',
   # Add more user agents as needed
]


# List of proxies for rotating requests
proxies = [
    {'http': 'http://your_proxy_ip:your_proxy_port', 'https': 'https://your_proxy_ip:your_proxy_port'},
    {'http': 'http://your_proxy_ip2:your_proxy_port2', 'https': 'https://your_proxy_ip2:your_proxy_port2'},
    # Add more proxies as needed
]

# List to store scraped data
scraped_data = []

# Loop through each URL and send a GET request
for url in urls:
   # Choose a random user agent for the request headers
   headers['user-agent'] = random.choice(user_agents)
   # Choose a random proxy for the request
   proxy = random.choice(proxies)

   # Send HTTP GET request to the URL with headers and proxy
   response = requests.get(url, headers=headers, proxies=proxy, verify=False)
   if response.status_code == 200:
       # Store the HTML content from the response
       html_content = response.content
       # Parse HTML content with lxml
       tree = html.fromstring(html_content)

       # Extract data using XPath
       product_name = tree.xpath('//h1[@automation-id="productName"]/text()')[0].strip()
       # Join the feature text nodes into one string for cleaner CSV output
       product_feature = ', '.join(
           text.strip() for text in tree.xpath('//ul[@class="pdp-features"]//li//text()') if text.strip()
       )
       product_brand = tree.xpath('//div[@itemprop="brand"]/text()')[0].strip()

       # Append extracted data to the list
       scraped_data.append({'Product Name': product_name, 'Product Feature': product_feature, 'Brand': product_brand})
   else:
       # Print error message if request fails
       print(f"Failed to retrieve {url}. Status code: {response.status_code}")

# CSV file setup
csv_file = 'costco_products.csv'
fieldnames = ['Product Name', 'Product Feature', 'Brand']

# Writing data to CSV file
try:
   with open(csv_file, mode='w', newline='', encoding='utf-8') as file:
       writer = csv.DictWriter(file, fieldnames=fieldnames)
       writer.writeheader()
       for item in scraped_data:
           writer.writerow(item)
   print(f"Data saved to {csv_file}")
except IOError:
   # Print error message if writing to file fails
   print(f"Error occurred while writing data to {csv_file}")

Using Python to scrape e-commerce sites such as Costco is an effective way to collect product information for analysis and strategic decision-making. Libraries like requests and lxml make it straightforward to automate the extraction of data from HTML pages, while proxies and rotating user agents help you deal with anti-bot measures. As always, follow ethical scraping practices and respect each site's terms of service.
