Scraping product data from e-commerce websites is essential for competitive analysis, price monitoring, and market research, and Python lets you extract data from product pages efficiently. In this guide, we will demonstrate how to scrape product information from online stores using a combination of requests and lxml.
E-commerce scraping involves extracting product details such as names, prices, and IDs from online stores. Python, with its versatile libraries, makes this task efficient and straightforward. Throughout this guide, we will scrape product information from Costco's website.
Before diving into the scraping process, ensure you have the necessary Python libraries installed:
pip install requests
pip install lxml
We'll focus on extracting product names, product features, and product brands from specific product pages on the website.
To extract data from any website, you need to understand the structure of the webpage. Open a product page and inspect the elements you want to scrape (e.g., product name, features, brand) to find the HTML tags and attributes that identify them.
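Before writing the full scraper, it can help to verify your XPath expressions against a saved copy of a product page. Below is a minimal sketch, assuming you have saved one page locally as sample.html (a hypothetical filename); the XPath shown is the one we use later for the product name.
from lxml import html

# Load a locally saved product page and test an XPath expression against it
with open('sample.html', 'r', encoding='utf-8') as f:
    tree = html.fromstring(f.read())

# Print whatever the candidate XPath matches so the expression can be refined
print(tree.xpath('//h1[@automation-id="productName"]/text()'))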
First, we'll use the requests library to send HTTP GET requests to the product pages. We will also set up the request headers to mimic a real browser request.
import requests
# List of product URLs to scrape
urls = [
    "https://www.costco.com/kirkland-signature-men's-sneaker.product.4000216649.html",
    "https://www.costco.com/adidas-ladies'-puremotion-shoe.product.4000177646.html"
]
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'sec-ch-ua': '"Not/A)Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
    'sec-ch-ua-mobile': '?0',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
}
# Loop through each URL and send a GET request
for url in urls:
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        html_content = response.text
        # Further processing will be added in subsequent steps
    else:
        print(f"Failed to retrieve {url}. Status code: {response.status_code}")
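Since we are requesting several product pages, an optional variation is to use a requests.Session so the headers are set once and the underlying connection is reused between requests. A minimal sketch of the same loop with a session:
# A session stores the headers and reuses the connection across requests
session = requests.Session()
session.headers.update(headers)

for url in urls:
    response = session.get(url)
    if response.status_code == 200:
        html_content = response.text
    else:
        print(f"Failed to retrieve {url}. Status code: {response.status_code}")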
Using lxml, we will extract the required data points from the parsed HTML.
from lxml import html
# List to store scraped data
scraped_data = []
# Loop through each URL and send a GET request
for url in urls:
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        html_content = response.content
        # Parse HTML content with lxml
        tree = html.fromstring(html_content)
        # Extract data using XPath
        product_name = tree.xpath('//h1[@automation-id="productName"]/text()')[0].strip()
        product_feature = tree.xpath('//ul[@class="pdp-features"]//li//text()')
        product_brand = tree.xpath('//div[@itemprop="brand"]/text()')[0].strip()
        # Append extracted data to the list
        scraped_data.append({'Product Name': product_name, 'Product Feature': product_feature, 'Brand': product_brand})
    else:
        print(f"Failed to retrieve {url}. Status code: {response.status_code}")
# Print the scraped data
for item in scraped_data:
    print(item)
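Note that the features XPath returns a list of text fragments rather than a single string. If you prefer one cleaned-up string per product, you can normalize the list before appending it to scraped_data; a minimal sketch:
# Strip whitespace from each fragment and join the non-empty ones into one readable string
product_feature = ' | '.join(fragment.strip() for fragment in product_feature if fragment.strip())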
Websites often implement anti-bot measures. Using proxies and rotating user agents can help avoid detection.
Using proxies with IP-authorization:
proxies = {
    'http': 'http://your_proxy_ip:your_proxy_port',
    'https': 'https://your_proxy_ip:your_proxy_port'
}
response = requests.get(url, proxies=proxies)
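If your proxy uses username/password authentication instead of IP authorization, the credentials go directly into the proxy URL. A minimal sketch with hypothetical placeholder values:
# Proxy URLs with hypothetical username/password placeholders
proxies = {
    'http': 'http://your_username:your_password@your_proxy_ip:your_proxy_port',
    'https': 'http://your_username:your_password@your_proxy_ip:your_proxy_port'
}
response = requests.get(url, headers=headers, proxies=proxies)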
Rotating User Agents:
import random
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0',
    # Add more user agents as needed
]
headers['user-agent'] = random.choice(user_agents)
response = requests.get(url, headers=headers)
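Spacing requests out also helps avoid detection. A minimal sketch that pauses for a random few seconds between requests (the 2 to 5 second range is an assumption; tune it to your needs):
import random
import time

# Wait a random 2-5 seconds before the next request to keep traffic looking human
time.sleep(random.uniform(2, 5))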
Finally, we will save the scraped data to a CSV file for further analysis.
import csv
csv_file = 'costco_products.csv'
fieldnames = ['Product Name', 'Product Feature', 'Brand']
# Writing data to CSV file
try:
    with open(csv_file, mode='w', newline='', encoding='utf-8') as file:
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        writer.writeheader()
        for item in scraped_data:
            writer.writerow(item)
    print(f"Data saved to {csv_file}")
except IOError:
    print(f"Error occurred while writing data to {csv_file}")
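To confirm the file was written as expected, you can read it straight back with the standard library; a minimal check:
import csv

# Read the CSV back and print each row as a dictionary
with open('costco_products.csv', mode='r', newline='', encoding='utf-8') as file:
    for row in csv.DictReader(file):
        print(row)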
Putting all the steps together, here is the complete script:
import requests
import urllib3
from lxml import html
import csv
import random
import ssl
# Use the standard-library (unverified) SSL context so certificate errors do not stop the requests
ssl._create_default_https_context = ssl._create_stdlib_context
# Suppress the insecure-request warnings urllib3 emits when verification is disabled
urllib3.disable_warnings()
# List of product URLs to scrape
urls = [
    "https://www.costco.com/kirkland-signature-men's-sneaker.product.4000216649.html",
    "https://www.costco.com/adidas-ladies'-puremotion-shoe.product.4000177646.html"
]
# headers
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'sec-ch-ua': '"Not/A)Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
    'sec-ch-ua-mobile': '?0',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
}
# List of user agents for rotating requests
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0',
    # Add more user agents as needed
]
# List of proxies for rotating requests
proxies = [
    {'http': 'http://your_proxy_ip:your_proxy_port', 'https': 'https://your_proxy_ip:your_proxy_port'},
    {'http': 'http://your_proxy_ip2:your_proxy_port2', 'https': 'https://your_proxy_ip2:your_proxy_port2'},
    # Add more proxies as needed
]
# List to store scraped data
scraped_data = []
# Loop through each URL and send a GET request
for url in urls:
    # Choose a random user agent for the request headers
    headers['user-agent'] = random.choice(user_agents)
    # Choose a random proxy for the request
    proxy = random.choice(proxies)
    # Send HTTP GET request to the URL with headers and proxy
    response = requests.get(url, headers=headers, proxies=proxy, verify=False)
    if response.status_code == 200:
        # Store the HTML content from the response
        html_content = response.content
        # Parse HTML content with lxml
        tree = html.fromstring(html_content)
        # Extract data using XPath
        product_name = tree.xpath('//h1[@automation-id="productName"]/text()')[0].strip()
        product_feature = tree.xpath('//ul[@class="pdp-features"]//li//text()')
        product_brand = tree.xpath('//div[@itemprop="brand"]/text()')[0].strip()
        # Append extracted data to the list
        scraped_data.append({'Product Name': product_name, 'Product Feature': product_feature, 'Brand': product_brand})
    else:
        # Print error message if request fails
        print(f"Failed to retrieve {url}. Status code: {response.status_code}")
# CSV file setup
csv_file = 'costco_products.csv'
fieldnames = ['Product Name', 'Product Feature', 'Brand']
# Writing data to CSV file
try:
    with open(csv_file, mode='w', newline='', encoding='utf-8') as file:
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        writer.writeheader()
        for item in scraped_data:
            writer.writerow(item)
    print(f"Data saved to {csv_file}")
except IOError:
    # Print error message if writing to file fails
    print(f"Error occurred while writing data to {csv_file}")
Using Python to scrape e-commerce sites such as Costco is an effective way to collect product information for analysis and strategic decision-making. With libraries like requests and lxml, you can automate the retrieval and parsing of HTML content, while proxies, rotating user agents, or a dedicated anti-bot solution help you avoid detection. Keep in mind that ethical scraping practices must always be followed.