Scraping product details from ecommerce sites is useful for competitive analysis, price monitoring, and market research, and Python makes the job convenient. This ecommerce scraping tutorial will show you how to harvest product information from online stores using a combination of requests and lxml.
Ecommerce web scraping means extracting product information, such as the title, price, or identifier number, from online shops. Python's many libraries make this not only easy but fairly efficient. In this article, we will focus on scraping ecommerce websites with Python, using Costco's website as our example.
To begin with, let's install the Python scraping libraries this script requires:
pip install requests
pip install lxml
We'll focus on extracting product names, features, and brands from specific pages on the website.
To start building an ecommerce product scraper, you first have to understand how a given webpage is structured. Go to the website, open the page you want to gather information from, and inspect the required elements (e.g. the product's name, features, and brand).
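For example, if the product name is rendered in an h1 element carrying an automation-id attribute (as on Costco's product pages at the time of writing), you can test the matching XPath expression against a simplified, made-up snippet before running it on the live page:
from lxml import html
# Simplified, made-up stand-in for a real product page, used only to test the XPath
sample = '<html><body><h1 automation-id="productName"> Kirkland Signature Sneaker </h1></body></html>'
tree = html.fromstring(sample)
# The same expression is used against the live pages later in this tutorial
print(tree.xpath('//h1[@automation-id="productName"]/text()')[0].strip())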
First, we'll import the requests library to send HTTP GET requests for the product pages. We'll also configure the request headers to resemble those of a real browser.
import requests
# List of product URLs to scrape
urls = [
    "https://www.costco.com/kirkland-signature-men's-sneaker.product.4000216649.html",
    "https://www.costco.com/adidas-ladies'-puremotion-shoe.product.4000177646.html"
]

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'sec-ch-ua': '"Not/A)Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
    'sec-ch-ua-mobile': '?0',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
}
# Loop through each URL and send a GET request
for url in urls:
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        html_content = response.text
        # Further processing will be added in subsequent steps
    else:
        print(f"Failed to retrieve {url}. Status code: {response.status_code}")
With lxml, we will now extract the desired information from the HTML. Parsing the markup correctly is the crucial step in any ecommerce data scraping workflow.
from lxml import html
# List to store scraped data
scraped_data = []
# Loop through each URL and send a GET request
for url in urls:
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        html_content = response.content
        # Parse HTML content with lxml
        tree = html.fromstring(html_content)
        # Extract data using XPath
        product_name = tree.xpath('//h1[@automation-id="productName"]/text()')[0].strip()
        product_feature = tree.xpath('//ul[@class="pdp-features"]//li//text()')
        product_brand = tree.xpath('//div[@itemprop="brand"]/text()')[0].strip()
        # Append extracted data to the list
        scraped_data.append({'Product Name': product_name, 'Product Feature': product_feature, 'Brand': product_brand})
    else:
        print(f"Failed to retrieve {url}. Status code: {response.status_code}")

# Print the scraped data
for item in scraped_data:
    print(item)
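Note that indexing with [0] assumes every XPath expression matches; if Costco changes its markup, the script will crash with an IndexError. One defensive option is a small helper (hypothetical, not part of the final script below) that falls back to a default value:
def first_or_default(tree, xpath, default='N/A'):
    # Return the first stripped XPath match, or the default if nothing matches
    results = tree.xpath(xpath)
    return results[0].strip() if results else default

product_name = first_or_default(tree, '//h1[@automation-id="productName"]/text()')
product_brand = first_or_default(tree, '//div[@itemprop="brand"]/text()')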
When scraping an ecommerce website with Python, keep in mind that most sites run some form of anti-bot software. Using proxies and rotating user agents can help you avoid detection.
Using proxies with IP-authorization:
proxies = {
    'http': 'http://your_proxy_ip:your_proxy_port',
    'https': 'https://your_proxy_ip:your_proxy_port'
}
response = requests.get(url, headers=headers, proxies=proxies)
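If your provider uses username/password authentication rather than IP allowlisting, the credentials go directly into the proxy URL; the placeholders below are yours to fill in:
proxies = {
    'http': 'http://your_username:your_password@your_proxy_ip:your_proxy_port',
    'https': 'https://your_username:your_password@your_proxy_ip:your_proxy_port'
}
response = requests.get(url, headers=headers, proxies=proxies)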
Rotating User Agents:
import random
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0',
    # Add more user agents as needed
]
headers['user-agent'] = random.choice(user_agents)
response = requests.get(url, headers=headers)
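Rotating identities helps, but firing requests back to back is itself a bot signal. A common additional measure, shown here with an arbitrary 2-5 second range, is a randomized pause between requests:
import time

for url in urls:
    headers['user-agent'] = random.choice(user_agents)
    response = requests.get(url, headers=headers)
    # A random pause makes the request pattern look less mechanical
    time.sleep(random.uniform(2, 5))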
Finally, we'll store the extracted data in CSV format so it can be analyzed later or fed into a more advanced ecommerce data scraping pipeline.
import csv
csv_file = 'costco_products.csv'
fieldnames = ['Product Name', 'Product Feature', 'Brand']
# Writing data to CSV file
try:
    with open(csv_file, mode='w', newline='', encoding='utf-8') as file:
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        writer.writeheader()
        for item in scraped_data:
            writer.writerow(item)
    print(f"Data saved to {csv_file}")
except IOError:
    print(f"Error occurred while writing data to {csv_file}")
Here is the final version of the ecommerce data scraping script. You can copy and paste it to use as a starting point.
import requests
import urllib3
from lxml import html
import csv
import random
import ssl
# Relax SSL verification (the proxied requests below use verify=False)
ssl._create_default_https_context = ssl._create_stdlib_context
urllib3.disable_warnings()

# List of product URLs to scrape
urls = [
    "https://www.costco.com/kirkland-signature-men's-sneaker.product.4000216649.html",
    "https://www.costco.com/adidas-ladies'-puremotion-shoe.product.4000177646.html"
]

# Request headers
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'sec-ch-ua': '"Not/A)Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
    'sec-ch-ua-mobile': '?0',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
}

# List of user agents for rotating requests
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0',
    # Add more user agents as needed
]

# List of proxies for rotating requests
proxies = [
    {'http': 'http://your_proxy_ip:your_proxy_port', 'https': 'https://your_proxy_ip:your_proxy_port'},
    {'http': 'http://your_proxy_ip2:your_proxy_port2', 'https': 'https://your_proxy_ip2:your_proxy_port2'},
    # Add more proxies as needed
]

# List to store scraped data
scraped_data = []

# Loop through each URL and send a GET request
for url in urls:
    # Choose a random user agent for the request headers
    headers['user-agent'] = random.choice(user_agents)
    # Choose a random proxy for the request
    proxy = random.choice(proxies)
    # Send HTTP GET request to the URL with headers and proxy
    response = requests.get(url, headers=headers, proxies=proxy, verify=False)
    if response.status_code == 200:
        # Store the HTML content from the response
        html_content = response.content
        # Parse HTML content with lxml
        tree = html.fromstring(html_content)
        # Extract data using XPath
        product_name = tree.xpath('//h1[@automation-id="productName"]/text()')[0].strip()
        product_feature = tree.xpath('//ul[@class="pdp-features"]//li//text()')
        product_brand = tree.xpath('//div[@itemprop="brand"]/text()')[0].strip()
        # Append extracted data to the list
        scraped_data.append({'Product Name': product_name, 'Product Feature': product_feature, 'Brand': product_brand})
    else:
        # Print error message if request fails
        print(f"Failed to retrieve {url}. Status code: {response.status_code}")

# CSV file setup
csv_file = 'costco_products.csv'
fieldnames = ['Product Name', 'Product Feature', 'Brand']

# Writing data to CSV file
try:
    with open(csv_file, mode='w', newline='', encoding='utf-8') as file:
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        writer.writeheader()
        for item in scraped_data:
            writer.writerow(item)
    print(f"Data saved to {csv_file}")
except IOError:
    # Print error message if writing to file fails
    print(f"Error occurred while writing data to {csv_file}")
The Python ecommerce scraper is now complete.
This ecommerce web scraper for Costco's online store showcases how effective Python can be at obtaining product data for analysis and better business decision making. With the right scripts, and with Requests and lxml powering automated extraction, it is possible to scrape the site without workflow interruptions from anti-bot measures. Finally, always comply with the site's terms and applicable regulations, and keep your ecommerce web scraping ethical.