Business intelligence, research, and analysis are just a few of the many possibilities opened up by web scraping. A large retailer like Walmart is a perfect source for collecting this kind of information: product data such as the name, price, and review details can be scraped from its many product pages using a handful of standard techniques.
In this article, we will walk through the process of scraping Walmart data, using requests to send HTTP requests and lxml to parse the returned HTML documents.
When it comes to scraping product data from multiple retail sites, Python is among the most effective options available, and it integrates smoothly into extraction projects of this kind.
Using it for retail projects not only simplifies the technical side but also improves the efficiency and scope of the analysis, making it a prime choice for anyone aiming to gain a deeper understanding of the market. These advantages are especially useful when you decide to scrape Walmart data.
Now, let’s begin building a Walmart web scraping tool.
To start, make sure Python is installed on your machine. The required libraries can be installed with pip:
pip install requests
pip install lxml
pip install urllib3
Next, import the required libraries:
import requests
from lxml import html
import csv
import random
import urllib3
import ssl
A list of product URLs to scrape Walmart data from can be defined like this:
product_urls = [
    'link with https',
    'link with https',
    'link with https'
]
When web scraping Walmart, it is crucial to send the correct HTTP headers, especially the User-Agent header, in order to mimic a real browser. The site's anti-bot systems can also be circumvented by rotating proxy servers. In the example below, a set of User-Agent strings is defined along with a list of proxies that use authorization by IP address.
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
]
proxy = [
    '<ip>:<port>',
    '<ip>:<port>',
    '<ip>:<port>',
]
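The entries above assume proxies that are authorized by your machine's IP address, so no credentials appear in the URL. If your proxies use username/password authorization instead, each entry would typically follow the standard requests proxy-URL format. Here is a minimal sketch; the credentials, host, and port are placeholders:

proxy = [
    # Hypothetical credentials for illustration; the 'http://' scheme is added later
    # in the loop (f'http://{random.choice(proxy)}'), so it is omitted here.
    '<username>:<password>@<ip>:<port>',
    '<username>:<password>@<ip>:<port>',
]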
Request headers should be set so that the requests appear to come from a regular user's browser; this helps a lot when trying to scrape Walmart data. Here is an example of how they might look:
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-IN,en;q=0.9',
    'dnt': '1',
    'priority': 'u=0, i',
    'sec-ch-ua': '"Not/A)Brand";v="8", "Chromium";v="126", "Google Chrome";v="126"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Linux"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
}
The next step is to create a list that will hold the extracted product information.
product_details = []
Iterating over the URLs works as follows: for every product URL, a GET request is sent with a randomly chosen User-Agent and proxy. Once the HTML response is returned, it is parsed for product details, including the name, price, and review summary. The extracted values are stored in a dictionary, which is then appended to the list created earlier.
for url in product_urls:
    headers['user-agent'] = random.choice(user_agents)
    proxies = {
        'http': f'http://{random.choice(proxy)}',
        'https': f'http://{random.choice(proxy)}',
    }
    try:
        # Send an HTTP GET request to the URL
        response = requests.get(url=url, headers=headers, proxies=proxies, verify=False)
        print(response.status_code)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(f'Error fetching data: {e}')
        continue  # Skip this URL if the request failed

    # Parse the HTML content using lxml
    parser = html.fromstring(response.text)

    # Extract product title
    title = ''.join(parser.xpath('//h1[@id="main-title"]/text()'))

    # Extract product price
    price = ''.join(parser.xpath('//span[@itemprop="price"]/text()'))

    # Extract review details
    review_details = ''.join(parser.xpath('//div[@data-testid="reviews-and-ratings"]/div/span[@class="w_iUH7"]/text()'))

    # Store extracted details in a dictionary
    product_detail = {
        'title': title,
        'price': price,
        'review_details': review_details
    }

    # Append product details to the list
    product_details.append(product_detail)
For each product, the script collects the title, price, and review details. Finally, the gathered data is written to a CSV file:
with open('walmart_products.csv', 'w', newline='') as csvfile:
    fieldnames = ['title', 'price', 'review_details']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for product_detail in product_details:
        writer.writerow(product_detail)
The complete Python script for web scraping Walmart is provided below. Comments are included to make each section easier to follow.
import requests
from lxml import html
import csv
import random
import urllib3
import ssl
ssl._create_default_https_context = ssl._create_stdlib_context
urllib3.disable_warnings()
# List of product URLs to scrape Walmart data
product_urls = [
    'link with https',
    'link with https',
    'link with https'
]

# Randomized User-Agent strings for anonymity
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
]

# Proxy list for IP rotation
proxy = [
    '<ip>:<port>',
    '<ip>:<port>',
    '<ip>:<port>',
]

# Headers to mimic browser requests
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-IN,en;q=0.9',
    'dnt': '1',
    'priority': 'u=0, i',
    'sec-ch-ua': '"Not/A)Brand";v="8", "Chromium";v="126", "Google Chrome";v="126"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Linux"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
}

# Initialize an empty list to store product details
product_details = []

# Loop through each product URL
for url in product_urls:
    headers['user-agent'] = random.choice(user_agents)
    proxies = {
        'http': f'http://{random.choice(proxy)}',
        'https': f'http://{random.choice(proxy)}',
    }
    try:
        # Send an HTTP GET request to the URL
        response = requests.get(url=url, headers=headers, proxies=proxies, verify=False)
        print(response.status_code)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(f'Error fetching data: {e}')
        continue  # Skip this URL if the request failed

    # Parse the HTML content using lxml
    parser = html.fromstring(response.text)

    # Extract product title
    title = ''.join(parser.xpath('//h1[@id="main-title"]/text()'))

    # Extract product price
    price = ''.join(parser.xpath('//span[@itemprop="price"]/text()'))

    # Extract review details
    review_details = ''.join(parser.xpath('//div[@data-testid="reviews-and-ratings"]/div/span[@class="w_iUH7"]/text()'))

    # Store extracted details in a dictionary
    product_detail = {
        'title': title,
        'price': price,
        'review_details': review_details
    }

    # Append product details to the list
    product_details.append(product_detail)

# Write the extracted data to a CSV file
with open('walmart_products.csv', 'w', newline='') as csvfile:
    fieldnames = ['title', 'price', 'review_details']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for product_detail in product_details:
        writer.writerow(product_detail)
For those using Python to interface with a Walmart scraping API, it is important to develop robust methods that reliably capture Walmart prices and Walmart reviews. Such an API provides a direct pipeline to extensive product data, facilitating real-time analytics on pricing and customer feedback.
Employing these specific strategies enhances the precision and scope of the information collected, allowing businesses to adapt quickly to market changes and consumer trends. Through strategic application of the Walmart API in Python, companies can optimize their data gathering processes, ensuring comprehensive market analysis and informed decision-making.
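As a simple illustration of that kind of analysis, the sketch below reads the walmart_products.csv file produced by the script above and computes a basic price summary. It assumes the price field may contain a currency symbol such as '$' and thousands separators; your extracted values may differ, so treat this as a starting point rather than a finished tool.

import csv

# Minimal sketch: summarize prices from the CSV produced by the scraper above.
prices = []
with open('walmart_products.csv', newline='') as csvfile:
    for row in csv.DictReader(csvfile):
        # Strip a leading currency symbol and separators (assumption about the data format)
        raw_price = row['price'].replace('$', '').replace(',', '').strip()
        try:
            prices.append(float(raw_price))
        except ValueError:
            pass  # Skip prices that cannot be parsed as numbers

if prices:
    print(f'Products with a price: {len(prices)}')
    print(f'Average price: {sum(prices) / len(prices):.2f}')
    print(f'Min price: {min(prices):.2f}, Max price: {max(prices):.2f}')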
In this tutorial, we explained how to use Python libraries to scrape Walmart data and save it to a CSV file for later analysis. The script provided is basic and serves as a starting point that you can modify to improve the efficiency of the scraping process. Possible improvements include adding random time intervals between requests to simulate human browsing, rotating user-agents and proxies to mask the bot, and implementing more advanced error handling to recover from scraping interruptions or failures, as shown in the sketch below.
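Here is a minimal sketch of the random-delay and retry ideas mentioned above. The fetch_with_retries helper is a hypothetical name introduced for illustration; it reuses the headers, user_agents, and proxy lists defined earlier in the article.

import random
import time

import requests

def fetch_with_retries(url, headers, user_agents, proxy, max_retries=3):
    """Hypothetical helper: GET a URL with random delays and simple retries."""
    for attempt in range(1, max_retries + 1):
        headers['user-agent'] = random.choice(user_agents)
        proxies = {
            'http': f'http://{random.choice(proxy)}',
            'https': f'http://{random.choice(proxy)}',
        }
        # Random pause before each attempt to simulate human browsing
        time.sleep(random.uniform(2, 6))
        try:
            response = requests.get(url, headers=headers, proxies=proxies, verify=False)
            response.raise_for_status()
            return response
        except requests.exceptions.RequestException as e:
            print(f'Attempt {attempt} failed for {url}: {e}')
    return None  # Give up after max_retries failed attempts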