Guide to scraping Walmart data with Python


Web scraping is a powerful way to extract data from websites for purposes such as analysis, research, and business intelligence. This tutorial walks you through scraping Walmart product information with Python, focusing on the key strategies and techniques involved. Walmart serves as the example: we will extract product details such as the name, price, and reviews from several product pages on the Walmart site.

This guide will use the requests library for making HTTP requests and the lxml library for parsing HTML content.

Setting up the environment

Before we start, ensure you have Python installed on your machine. You can install the required libraries using pip:

pip install requests
pip install lxml
pip install urllib3

Next, let's import the necessary libraries:

  • requests: for making HTTP requests to retrieve web pages;
  • lxml: for parsing HTML content;
  • csv: for writing the extracted data to a CSV file;
  • random: for selecting random proxies and User-Agent strings;
  • urllib3 and ssl: for relaxing HTTPS certificate handling when requests go through proxies.

import requests
from lxml import html
import csv
import random
import urllib3
import ssl

# Fall back to an unverified SSL context and silence urllib3's insecure-request warnings
ssl._create_default_https_context = ssl._create_stdlib_context
urllib3.disable_warnings()

Define product URLs

Define the list of Walmart product URLs to scrape.

product_urls = [
    'link with https',
    'link with https',
    'link with https'
]

User-Agent strings and proxies

To scrape a website, it is important to use the right headers, especially the User-Agent header, so that requests look like they come from a real browser. In addition, rotating proxy servers helps you avoid being blocked by the site's anti-bot measures. Below are example User-Agent strings, along with a proxy list in the format used by proxy servers that authorize clients by IP address.

user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
]

proxy = [
    '<ip>:<port>',
    '<ip>:<port>',
    '<ip>:<port>',
]

Headers for requests

Set headers to mimic browser requests and avoid detection.

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-IN,en;q=0.9',
    'dnt': '1',
    'priority': 'u=0, i',
    'sec-ch-ua': '"Not/A)Brand";v="8", "Chromium";v="126", "Google Chrome";v="126"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Linux"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
}

Initialize data storage

Create an empty list to store product details.

product_details = []

The scraping loop works as follows: for each product URL, a GET request is sent with a randomly chosen User-Agent and proxy. Once the HTML response is received, it is parsed to extract the product name, price, and reviews. The extracted data is stored in a dictionary, which is then appended to the list created earlier.

for url in product_urls:
    # Rotate the User-Agent and proxy for each request
    headers['user-agent'] = random.choice(user_agents)
    proxies = {
        'http': f'http://{random.choice(proxy)}',
        'https': f'http://{random.choice(proxy)}',
    }
    try:
        # Send an HTTP GET request to the URL
        response = requests.get(url=url, headers=headers, proxies=proxies, verify=False)
        print(response.status_code)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(f'Error fetching data: {e}')
        continue  # Skip this URL if the request failed

    # Parse the HTML content using lxml
    parser = html.fromstring(response.text)
    # Extract product title
    title = ''.join(parser.xpath('//h1[@id="main-title"]/text()'))
    # Extract product price
    price = ''.join(parser.xpath('//span[@itemprop="price"]/text()'))
    # Extract review details
    review_details = ''.join(parser.xpath('//div[@data-testid="reviews-and-ratings"]/div/span[@class="w_iUH7"]/text()'))

    # Store extracted details in a dictionary
    product_detail = {
        'title': title,
        'price': price,
        'review_details': review_details
    }
    # Append product details to the list
    product_details.append(product_detail)

[Screenshots: the title, price, and review detail elements targeted by the XPath expressions above.]
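
After the loop completes, you can quickly confirm that the XPath expressions still match Walmart's markup by printing the collected records. This is a minimal check that only uses the variables defined above:

# Quick sanity check: print each collected record
for item in product_details:
    print(item['title'], item['price'], item['review_details'], sep=' | ')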

Save data to CSV

  1. Open a new CSV file for writing.
  2. Define the field names (columns) for the CSV file.
  3. Create a csv.DictWriter object to write dictionaries to the CSV file.
  4. Write the header row to the CSV file.
  5. Loop through the product_details list and write each product dictionary as a row in the CSV file.
with open('walmart_products.csv', 'w', newline='') as csvfile:
    fieldnames = ['title', 'price', 'review_details']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for product_detail in product_details:
        writer.writerow(product_detail)

Complete code:

Here is the complete code, with comments to help you understand it better:

import requests
from lxml import html
import csv
import random
import urllib3
import ssl

# Fall back to an unverified SSL context and silence urllib3's insecure-request warnings
ssl._create_default_https_context = ssl._create_stdlib_context
urllib3.disable_warnings()


# List of product URLs to scrape
product_urls = [
    'link with https',
    'link with https',
    'link with https'
]

# Randomized User-Agent strings for anonymity
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
]

# Proxy list for IP rotation
proxy = [
    '<ip>:<port>',
    '<ip>:<port>',
    '<ip>:<port>',
]

# Headers to mimic browser requests
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-IN,en;q=0.9',
    'dnt': '1',
    'priority': 'u=0, i',
    'sec-ch-ua': '"Not/A)Brand";v="8", "Chromium";v="126", "Google Chrome";v="126"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Linux"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
}

# Initialize an empty list to store product details
product_details = []

# Loop through each product URL
for url in product_urls:
    # Rotate the User-Agent and proxy for each request
    headers['user-agent'] = random.choice(user_agents)
    proxies = {
        'http': f'http://{random.choice(proxy)}',
        'https': f'http://{random.choice(proxy)}',
    }
    try:
        # Send an HTTP GET request to the URL
        response = requests.get(url=url, headers=headers, proxies=proxies, verify=False)
        print(response.status_code)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(f'Error fetching data: {e}')
        continue  # Skip this URL if the request failed

    # Parse the HTML content using lxml
    parser = html.fromstring(response.text)
    # Extract product title
    title = ''.join(parser.xpath('//h1[@id="main-title"]/text()'))
    # Extract product price
    price = ''.join(parser.xpath('//span[@itemprop="price"]/text()'))
    # Extract review details
    review_details = ''.join(parser.xpath('//div[@data-testid="reviews-and-ratings"]/div/span[@class="w_iUH7"]/text()'))

    # Store extracted details in a dictionary
    product_detail = {
        'title': title,
        'price': price,
        'review_details': review_details
    }
    # Append product details to the list
    product_details.append(product_detail)

# Write the extracted data to a CSV file
with open('walmart_products.csv', 'w', newline='') as csvfile:
    fieldnames = ['title', 'price', 'review_details']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for product_detail in product_details:
        writer.writerow(product_detail)

Our tutorial demonstrates how to use Python libraries to scrape product data from Walmart and save it in CSV format for later analysis. The script provided is a basic foundation that can be extended to make the scraping process more robust and efficient. Possible enhancements include introducing random delays between requests to mimic human browsing patterns, rotating the User-Agent and proxy more thoroughly to avoid detection, and adding stronger error handling to recover from interruptions or failures, as sketched below.
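
As an illustration of these enhancements, here is a minimal sketch of a hypothetical fetch_with_retries helper (the name and parameters are ours, not part of the script above). It waits a random interval before each request and retries a failed request a few times, choosing a fresh User-Agent and proxy on every attempt; it assumes the user_agents, proxy, and headers structures defined earlier.

import random
import time

import requests


def fetch_with_retries(url, headers, user_agents, proxy_pool, max_retries=3):
    # Hypothetical helper: random delays, User-Agent/proxy rotation, and retries
    for attempt in range(1, max_retries + 1):
        # Random pause to mimic human browsing patterns
        time.sleep(random.uniform(2, 6))
        # Pick a fresh User-Agent and proxy for every attempt
        headers['user-agent'] = random.choice(user_agents)
        chosen = random.choice(proxy_pool)
        proxies = {'http': f'http://{chosen}', 'https': f'http://{chosen}'}
        try:
            response = requests.get(url, headers=headers, proxies=proxies, verify=False)
            response.raise_for_status()
            return response
        except requests.exceptions.RequestException as e:
            print(f'Attempt {attempt} failed for {url}: {e}')
    return None  # Give up after max_retries attempts

In the main loop, response = fetch_with_retries(url, headers, user_agents, proxy) could replace the direct requests.get call, and any URL for which it returns None would simply be skipped.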
