Guide on How to Scrape Walmart Data with Python

Web scraping opens up endless possibilities for business intelligence, research, and analysis. A retail giant like Walmart is a perfect source for this kind of information: its product pages expose names, prices, review data, and more, all of which can be gathered with a handful of standard scraping techniques.

In this article, we are going to break down the process of how to scrape Walmart data. We will be using requests to send HTTP requests and lxml to parse the returned HTML documents.

Understanding Walmart Data: What Can You Scrape?

When you scrape Walmart data, you tap into a wealth of information crucial for various business needs. Here's what you can scrape and how each data type serves your goals:

| Data type | Key details scraped | Business goal |
| --- | --- | --- |
| Product details | Name, brand, SKU, UPC, specifications, and category path | Catalog enrichment and accurate product listings |
| Pricing info | Current and previous prices, unit prices, promo labels | Dynamic pricing and margin analysis |
| Stock status | Inventory availability and store-level stock | Supply chain insights and demand forecasting |
| Customer reviews | Text, star rating, verified badges, images, and review counts | Sentiment analysis and understanding customer experience |
| Media assets | Images, 360° views, spec videos | Content creation and better merchandising |
| Seller details | Seller name/ID, ratings, shipping rates, return policies, and fulfillment type | Vendor assessment and partnership decisions |

You’ll want to scrape product data from Walmart regularly, with the cadence tailored to the data type (a minimal scheduling sketch follows the list):

  • Price changes happen frequently; daily scraping usually works best.
  • Stock status may update several times a day, depending on product demand.
  • Customer reviews and ratings change less frequently; weekly scraping is usually sufficient.
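
As a minimal sketch of that cadence using the third-party schedule library, separate jobs can run at different intervals. The job bodies below are placeholders for your own scraping functions.

import time

import schedule  # pip install schedule


def scrape_prices():
    print('running daily price scrape')  # placeholder for your price scraper


def scrape_stock():
    print('running stock check')  # placeholder for your stock scraper


def scrape_reviews():
    print('running weekly review scrape')  # placeholder for your review scraper


# Prices daily, stock several times a day, reviews weekly
schedule.every().day.at('06:00').do(scrape_prices)
schedule.every(4).hours.do(scrape_stock)
schedule.every().monday.at('07:00').do(scrape_reviews)

while True:
    schedule.run_pending()
    time.sleep(60)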

To keep your data clean and reliable, use Python tools like pandas for data manipulation and cleaning. Validate JSON responses with jsonschema to make sure your data structure remains consistent. Custom Python scripts can remove duplicates, normalize formats, and handle missing values effectively.
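
As an illustration, here is a hedged sketch of that cleanup, assuming the scraped records use the title, price, and review_details fields produced by the script later in this guide:

import pandas as pd
from jsonschema import ValidationError, validate

# Expected shape of a single scraped record (fields assumed from the script below)
record_schema = {
    'type': 'object',
    'properties': {
        'title': {'type': 'string'},
        'price': {'type': 'string'},
        'review_details': {'type': 'string'},
    },
    'required': ['title', 'price'],
}


def clean_records(records: list) -> pd.DataFrame:
    # Keep only records that match the expected structure
    valid = []
    for record in records:
        try:
            validate(instance=record, schema=record_schema)
            valid.append(record)
        except ValidationError:
            continue

    if not valid:
        return pd.DataFrame(columns=['title', 'price', 'review_details'])

    df = pd.DataFrame(valid)
    # Remove duplicates and normalize the price column to a number
    df = df.drop_duplicates(subset=['title'])
    df['price'] = pd.to_numeric(
        df['price'].str.replace(r'[^0-9.]', '', regex=True), errors='coerce'
    )
    # Drop rows where the price could not be parsed
    return df.dropna(subset=['price'])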

Why Use Python to Scrape Walmart Data?

When it comes to scraping product data on multiple retail sites, Python is among the most effective options available. Here’s how it integrates seamlessly into extraction projects:

  • Advanced libraries. With requests for web interaction and lxml for HTML parsing, you can work through vast online catalogs quickly and reliably.
  • Ease of use. Python’s straightforward syntax lets users build data-retrieval scripts with little prior experience and get straight to the data.
  • Community support. Retail websites are complex, and there is a wealth of community resources and support to help you resolve the problems that come up.
  • Data handling and analysis. With pandas for data manipulation and Matplotlib for visualization, Python covers the whole pipeline from collection to analysis.
  • Handling dynamic content. Selenium makes it possible to interact with dynamic web elements, so data can be collected even from JavaScript-rendered pages.
  • Effective scaling. Python copes with both small and massive datasets and holds up well during long-running, extensive extraction jobs.

Using Python for retail projects not only simplifies the technical side but also increases the efficiency and scope of analysis, making it a prime choice for anyone aiming to build a deep understanding of the market. These strengths are especially useful when you decide to scrape Walmart data.

Now, let’s begin building a Walmart web scraping tool.

Setting Up the Environment to Scrape Walmart Data

To start off, make sure Python is installed on your computer. The required libraries can be installed with pip:

pip install requests
pip install lxml
pip install urllib3

Now let's import the libraries we will use:

  • requests – to retrieve web pages over HTTP;
  • lxml – to build parse trees from HTML documents;
  • csv – to write the collected data to CSV files;
  • random – to pick random proxies and User-Agent strings;
  • urllib3 and ssl – to adjust SSL handling and silence certificate warnings, since requests are sent with verification disabled.
import requests
from lxml import html
import csv
import random
import urllib3
import ssl

Define Product URLs

A list of product URLs to scrape Walmart data from can be defined like this:

product_urls = [
    'link with https',
    'link with https',
    'link with https'
]

User-Agent Strings and Proxies

When web scraping Walmart, it is crucial to send the right HTTP headers, especially the User-Agent header, in order to mimic an actual browser. The site's anti-bot systems can also be sidestepped by rotating proxy servers. The example below lists several User-Agent strings along with placeholders for proxy servers that are authorized by IP address.

user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
]

proxy = [
    '<ip>:<port>',
    '<ip>:<port>',
    '<ip>:<port>',
]

Headers for Requests

Request headers should be set so that requests look as if they come from a real user’s browser, which helps a lot when scraping Walmart data. Here’s an example of how they might look:

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-IN,en;q=0.9',
    'dnt': '1',
    'priority': 'u=0, i',
    'sec-ch-ua': '"Not/A)Brand";v="8", "Chromium";v="126", "Google Chrome";v="126"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Linux"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
}

Initialize Data Storage

The next step is to create a list that will hold the product information.

product_details = []

The scraping loop works as follows: for every product URL, a GET request is sent with a randomly chosen User-Agent and proxy. Once the HTML response is returned, it is parsed for the product details, namely the name, price, and review summary. These details are stored in a dictionary, which is then appended to the list created above.

for url in product_urls:
   headers['user-agent'] = random.choice(user_agents)
   proxies = {
       'http': f'http://{random.choice(proxy)}',
       'https': f'http://{random.choice(proxy)}',
   }
   try:
       # Send an HTTP GET request to the URL
       response = requests.get(url=url, headers=headers, proxies=proxies, verify=False)
       print(response.status_code)
       response.raise_for_status()
   except requests.exceptions.RequestException as e:
       print(f'Error fetching data: {e}')
       # Skip this URL if the request could not be completed
       continue

   # Parse the HTML content using lxml
   parser = html.fromstring(response.text)
   # Extract product title
   title = ''.join(parser.xpath('//h1[@id="main-title"]/text()'))
   # Extract product price
   price = ''.join(parser.xpath('//span[@itemprop="price"]/text()'))
   # Extract review details
   review_details = ''.join(parser.xpath('//div[@data-testid="reviews-and-ratings"]/div/span[@class="w_iUH7"]/text()'))

   # Store extracted details in a dictionary
   product_detail = {
       'title': title,
       'price': price,
       'review_details': review_details
   }
   # Append product details to the list
   product_details.append(product_detail)

The screenshots 1.png (title), 2.png (price), and 3.png (review details) show the page elements targeted by the XPath expressions above.

Save Data to CSV

  1. Open a new CSV file in write mode.
  2. Define the field names (columns) for the CSV file.
  3. Create a csv.DictWriter object to write dictionaries to the file.
  4. Write the header row.
  5. Loop through product_details and write each dictionary as a row.
with open('walmart_products.csv', 'w', newline='') as csvfile:
    fieldnames = ['title', 'price', 'review_details']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for product_detail in product_details:
        writer.writerow(product_detail)

Complete Code

When web scraping Walmart, the complete Python script looks like the one below. Comments are included to make each section easier to follow.

import requests
from lxml import html
import csv
import random
import urllib3
import ssl

ssl._create_default_https_context = ssl._create_stdlib_context
urllib3.disable_warnings()


# List of product URLs to scrape Walmart data
product_urls = [
   'link with https',
   'link with https',
   'link with https'
]

# Randomized User-Agent strings for anonymity
user_agents = [
   'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
   'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
   'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
]

# Proxy list for IP rotation
proxy = [
    '<ip>:<port>',
    '<ip>:<port>',
    '<ip>:<port>',
]


# Headers to mimic browser requests
headers = {
   'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
   'accept-language': 'en-IN,en;q=0.9',
   'dnt': '1',
   'priority': 'u=0, i',
   'sec-ch-ua': '"Not/A)Brand";v="8", "Chromium";v="126", "Google Chrome";v="126"',
   'sec-ch-ua-mobile': '?0',
   'sec-ch-ua-platform': '"Linux"',
   'sec-fetch-dest': 'document',
   'sec-fetch-mode': 'navigate',
   'sec-fetch-site': 'none',
   'sec-fetch-user': '?1',
   'upgrade-insecure-requests': '1',
}

# Initialize an empty list to store product details
product_details = []

# Loop through each product URL
for url in product_urls:
   headers['user-agent'] = random.choice(user_agents)
   proxies = {
       'http': f'http://{random.choice(proxy)}',
       'https': f'http://{random.choice(proxy)}',
   }
   try:
       # Send an HTTP GET request to the URL
       response = requests.get(url=url, headers=headers, proxies=proxies, verify=False)
       print(response.status_code)
       response.raise_for_status()
   except requests.exceptions.RequestException as e:
       print(f'Error fetching data: {e}')
       # Skip this URL if the request could not be completed
       continue

   # Parse the HTML content using lxml
   parser = html.fromstring(response.text)
   # Extract product title
   title = ''.join(parser.xpath('//h1[@id="main-title"]/text()'))
   # Extract product price
   price = ''.join(parser.xpath('//span[@itemprop="price"]/text()'))
   # Extract review details
   review_details = ''.join(parser.xpath('//div[@data-testid="reviews-and-ratings"]/div/span[@class="w_iUH7"]/text()'))

   # Store extracted details in a dictionary
   product_detail = {
       'title': title,
       'price': price,
       'review_details': review_details
   }
   # Append product details to the list
   product_details.append(product_detail)

# Write the extracted data to a CSV file
with open('walmart_products.csv', 'w', newline='') as csvfile:
   fieldnames = ['title', 'price', 'review_details']
   writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
   writer.writeheader()
   for product_detail in product_details:
       writer.writerow(product_detail)

Additional Suggestions

For those using Python with a Walmart scraping API, it's important to build robust methods that reliably collect Walmart prices and reviews. Such an API provides a direct pipeline to extensive product data, facilitating real-time analytics on pricing and customer feedback.

Employing these specific strategies enhances the precision and scope of the information collected, allowing businesses to adapt quickly to market changes and consumer trends. Through strategic application of the Walmart API in Python, companies can optimize their data-gathering processes, ensuring comprehensive market analysis and informed decision-making.

Challenges in Scraping Walmart

Scraping Walmart data is far from simple because of robust anti-bot defenses. Walmart uses advanced technologies like Akamai Bot Manager and PerimeterX. They detect automated traffic through several methods:

  • TLS fingerprint analysis: Walmart examines TLS handshakes. Tools like curl-impersonate and curl-cffi can mimic real browser TLS patterns, helping you bypass this layer (see the sketch after this list).
  • IP reputation and geolocation: Data center IPs often get flagged. Residential IPs have a lower detection risk because they look like actual users.
  • Request frequency and headers: Sending too many requests too fast raises flags. Always set headers like User-Agent, Accept-Language, and Referer to appear genuine.
  • JavaScript execution and device fingerprinting: Walmart detects scrapers that don’t run JS or lack proper browser/device signals. Static scrapers often fail here.
  • Progressive CAPTCHA: You may encounter “press-and-hold” or visual puzzles. Integrating anti-CAPTCHA services or building custom solutions can help automate bypassing.
  • IP rate limiting and bans: Rotate IPs regularly and handle block responses gracefully to avoid permanent bans.
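
To illustrate the TLS-fingerprint point from the list above, here is a minimal sketch using curl_cffi, whose requests-like API can impersonate a real browser's TLS handshake. The impersonation target and URL are only examples; available target names depend on the installed curl_cffi version.

from curl_cffi import requests as cffi_requests  # pip install curl_cffi

# Send a request whose TLS handshake looks like Chrome's
response = cffi_requests.get(
    'https://www.walmart.com',
    impersonate='chrome110',  # target name is an assumption; check your curl_cffi version
    timeout=30,
)
print(response.status_code)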

Walmart’s modern site uses Next.js with SSR, hydration, lazy loading, and obfuscated CSS class names. This complicates DOM parsing and requires adaptive scraping logic.
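
One way to cope with obfuscated class names on a Next.js site is to read the JSON that Next.js embeds in a script tag with the id __NEXT_DATA__, rather than relying on brittle CSS selectors. This is only a sketch: the blob is not guaranteed to be present on every page, and the keys inside it vary and change over time.

import json

from lxml import html


def extract_next_data(page_html: str) -> dict:
    # Return the __NEXT_DATA__ JSON blob embedded in a Next.js page, if present
    tree = html.fromstring(page_html)
    raw = tree.xpath('//script[@id="__NEXT_DATA__"]/text()')
    return json.loads(raw[0]) if raw else {}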

Session management is another hurdle. Use Python libraries like requests.Session or httpx.AsyncClient to maintain cookies and session persistence across requests.
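
A minimal sketch of session reuse with requests.Session, which keeps cookies and default headers across requests (the URLs here are placeholders):

import requests

session = requests.Session()
# Headers set on the session are sent with every request made through it
session.headers.update({'User-Agent': 'Mozilla/5.0'})

# Cookies returned by the first response are stored on the session...
first = session.get('https://www.walmart.com', timeout=30)
# ...and sent automatically with any follow-up request
second = session.get('https://www.walmart.com/cart', timeout=30)
print(session.cookies.get_dict())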

When JavaScript-heavy pages block you, consider headless browsers such as Playwright or Puppeteer (via Pyppeteer). These tools can render pages fully and execute JavaScript, making scraping more reliable.
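
As a reference point, here is a minimal Playwright sketch (sync API) that renders a page before handing the HTML to lxml. Selectors, wait conditions, and proxy settings would need tuning for real product pages, and the browsers must first be installed with playwright install.

from lxml import html
from playwright.sync_api import sync_playwright  # pip install playwright


def fetch_rendered_html(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until='domcontentloaded')
        content = page.content()
        browser.close()
    return content


tree = html.fromstring(fetch_rendered_html('https://www.walmart.com'))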

Here is a quick checklist to handle Walmart scraping challenges effectively:

  • Mimic real browser TLS handshakes.
  • Use residential IPs for lower detection.
  • Rotate IPs strategically.
  • Set all necessary HTTP headers.
  • Manage sessions with persistent cookies.
  • Handle and solve CAPTCHAs.
  • Use headless browsers for JS content.
  • Adapt to dynamic page structures.

Scaling Up: How to Avoid Getting Blocked

When scaling your Walmart scraping, avoid using just one IP. Multiple requests from a single IP trigger rate limits and blacklists. Residential proxies solve this by offering real ISP-assigned IPs that look like normal users. You can rotate these IPs to stay under Walmart’s radar.

Benefits of Residential IPs:

  • Rotation of IP addresses per request or per session.
  • Lower detection risk compared to data center proxies.
  • Better access to geo-targeted Walmart content.

Data center proxies are faster but easier to block. Mobile proxies are harder to get but offer high anonymity. Residential proxies strike a good balance for Walmart scraping.

Let’s see how to integrate residential proxies like Decodo:

  • Store your proxy credentials in a .env file, e.g. as a comma-separated list of username:password@host:port entries under a variable such as PROXY_LIST.
  • Use python-dotenv to load them securely in your Python script.
  • Pass proxies to your requests session like: proxies={"http": f"http://{proxy}", "https": f"http://{proxy}"}.
  • Rotate proxies on every request, or after encountering errors, to avoid bans.

Proxy-Seller proxies offer 115 million+ ethically sourced residential IPs across 195+ locations. This lets you geo-target Walmart content globally. Built-in IP rotation and session stickiness keep your scraping smooth.

Here’s an example of rotating proxies with requests (the same pattern works with curl-cffi’s requests-compatible API):

from dotenv import load_dotenv
import os
import random
import requests

load_dotenv()
proxy_list = os.getenv("PROXY_LIST").split(",") # List of proxies from env

def get_proxy():
    return random.choice(proxy_list)

session = requests.Session()
proxy = get_proxy()
proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
response = session.get("https://www.walmart.com", proxies=proxies, headers={"User-Agent": "Mozilla/5.0"})

Proxy health monitoring and switching after failures are essential. Also, keep an eye on concurrency limits and use connection pooling to optimize performance, as sketched below.
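
As a rough sketch of those ideas, the snippet below retries with a different proxy after each failure and mounts an HTTPAdapter to control the connection pool. The proxy_list values are placeholders for your own username:password@host:port entries.

import random
from typing import Optional

import requests
from requests.adapters import HTTPAdapter

proxy_list = ['<ip>:<port>', '<ip>:<port>']  # placeholders for real proxy entries

session = requests.Session()
# Connection pooling keeps sockets open between requests to the same host
session.mount('https://', HTTPAdapter(pool_connections=10, pool_maxsize=10))


def fetch_with_rotation(url: str, attempts: int = 3) -> Optional[requests.Response]:
    for _ in range(attempts):
        proxy = random.choice(proxy_list)
        proxies = {'http': f'http://{proxy}', 'https': f'http://{proxy}'}
        try:
            response = session.get(url, proxies=proxies, timeout=30)
            response.raise_for_status()
            return response
        except requests.exceptions.RequestException:
            continue  # switch to a different proxy on the next attempt
    return None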

Proxy-Seller: A Solution for Scaling

Proxy-Seller leads in proxy solutions for scraping Walmart data at scale because they provide:

  • Diverse proxy types, including residential, ISP, datacenter IPv4/IPv6, and mobile proxies.
  • Over 20 million rotating residential IPs with precise geo-targeting.
  • Fast servers with speeds up to 1 Gbps and 99% uptime.
  • User-friendly control panel, API access, and 24/7 expert support.
  • Compliance with GDPR, CCPA, and ISO standards for legal and secure proxy use.
  • Flexible plans with rotation modes suitable for scraping price changes daily or reviewing data weekly.
  • Global proxy coverage in 220+ countries enables localized Walmart scraping.
  • Additional tools like a proxy health checker and monitoring dashboard.

With Proxy-Seller, you can securely configure your proxy credentials, automate proxy rotation, and maintain stable Walmart data extraction at any scale. Their expert support helps you tackle setup or integration issues fast, making them an ideal choice whether you scrape product data from Walmart as an individual developer or run enterprise-level projects.

Conclusion

In this tutorial, we explained how to use Python libraries to scrape Walmart data and save it to a CSV file for later analysis. The script provided is basic and serves as a starting point that you can modify to make the scraping process more efficient. Possible improvements include adding random time intervals between requests to simulate human browsing, rotating user-agents and proxies to mask the bot, and implementing more advanced error handling to deal with interruptions or failures.
