Scraping data from Yelp can provide valuable insights into local restaurants, including details like the name, URL, cuisines, and rating of each business. Using the requests and lxml Python libraries, this tutorial shows how to scrape Yelp search results. Several techniques are covered, including using proxies, handling headers, and extracting data with XPath.
Before starting the scraping process, ensure you have Python installed and the required libraries:
pip install requests
pip install lxml
These libraries will help us send HTTP requests to Yelp, parse the HTML content, and extract the data we need.
First, we need to send a GET request to the Yelp search results page to fetch the HTML content. Here's how to do it:
import requests
# Yelp search page URL
url = "https://www.yelp.com/search?find_desc=restaurants&find_loc=San+Francisco%2C+CA"
# Send a GET request to fetch the HTML content
response = requests.get(url)
# Check if the request was successful
if response.status_code == 200:
    print("Successfully fetched the page content")
else:
    print("Failed to retrieve the page content")
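In practice, requests can also time out or fail transiently (network hiccups, rate limiting), so it helps to retry a few times before giving up. The sketch below is a generic retry wrapper, not anything Yelp-specific: it accepts any zero-argument callable, such as a lambda wrapping `requests.get(url, timeout=10)`.

```python
import time

def fetch_with_retries(get_page, max_retries=3, backoff=2.0):
    """Call get_page() up to max_retries times, sleeping between attempts.

    get_page is any zero-argument callable that returns a response-like
    object or raises on failure (e.g. a wrapped requests.get call).
    """
    last_error = None
    for attempt in range(max_retries):
        try:
            return get_page()
        except Exception as exc:  # requests.RequestException in practice
            last_error = exc
            time.sleep(backoff * (attempt + 1))  # linear backoff between tries
    raise last_error
```

You would use it as `response = fetch_with_retries(lambda: requests.get(url, timeout=10))`, so a single flaky response does not abort the whole run.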
When making requests to a website, it's essential to include the appropriate HTTP headers. Headers can contain metadata about the request, such as the user agent, which identifies the browser or tool making the request. Including these headers can help avoid blocking or throttling by the target website.
Here’s how you can set up headers:
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-IN,en;q=0.9',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'priority': 'u=0, i',
    'sec-ch-ua': '"Not)A;Brand";v="99", "Google Chrome";v="127", "Chromium";v="127"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Linux"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36',
}
response = requests.get(url, headers=headers)
When scraping a large volume of pages, there's a risk of your IP address being blocked by the target site. To prevent this, using proxy servers is recommended. For this guide, it's advisable to use dynamic proxy servers that feature automatic rotation. This way, you only need to set up the proxy server settings once, and the rotation will help maintain access by periodically changing the IP address, reducing the likelihood of being blocked.
proxies = {
    'http': 'http://username:password@proxy-server:port',
    'https': 'https://username:password@proxy-server:port'
}
response = requests.get(url, headers=headers, proxies=proxies)
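If your provider gives you a static list of proxy endpoints rather than a single auto-rotating gateway, you can rotate through them yourself. The sketch below picks a random entry per request; the pool addresses are placeholders, not real servers.

```python
import random

# Hypothetical pool of proxy endpoints -- replace with your provider's list.
PROXY_POOL = [
    "http://username:password@proxy1.example.com:8080",
    "http://username:password@proxy2.example.com:8080",
    "http://username:password@proxy3.example.com:8080",
]

def pick_proxies(pool):
    """Return a requests-style proxies dict built from a random pool entry."""
    proxy = random.choice(pool)
    return {"http": proxy, "https": proxy}
```

Each request then gets a (potentially) different exit IP: `response = requests.get(url, headers=headers, proxies=pick_proxies(PROXY_POOL))`.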
Once we have the HTML content, the next step is to parse it and extract the relevant data. We’ll use the lxml library for this purpose.
from lxml import html
# Parse the HTML content using lxml
parser = html.fromstring(response.content)
We need to target the individual restaurant listings on the search results page. These elements can be identified using XPath expressions. For Yelp, the listings are usually wrapped in a div element with a specific data-testid attribute.
# Extract individual restaurant elements
elements = parser.xpath('//div[@data-testid="serp-ia-card"]')[2:-1]
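As a sanity check, the same selection pattern can be exercised on a tiny HTML snippet before pointing it at a live page. The markup below is made up purely for illustration:

```python
from lxml import html

# Minimal stand-in for a search results page (illustrative markup only).
sample = """
<div>
  <div data-testid="serp-ia-card"><h3><a href="/biz/a">Cafe A</a></h3></div>
  <div data-testid="serp-ia-card"><h3><a href="/biz/b">Cafe B</a></h3></div>
  <div data-testid="other">not a listing</div>
</div>
"""

tree = html.fromstring(sample)
# Select only the divs whose data-testid marks them as listing cards
cards = tree.xpath('//div[@data-testid="serp-ia-card"]')
names = [card.xpath('.//h3/a/text()')[0] for card in cards]
print(names)  # ['Cafe A', 'Cafe B']
```

The third div is ignored because its data-testid does not match, which is exactly how the attribute filter isolates restaurant cards on the real page.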
XPath is a powerful tool for navigating and selecting nodes in an HTML document. In this tutorial, we use XPath expressions to extract the restaurant name, URL, cuisines, and rating from each restaurant element.
Here are the specific XPaths for each data point:
Name: .//div[@class="businessName__09f24__HG_pC y-css-ohs7lg"]/div/h3/a/text()
URL: .//div[@class="businessName__09f24__HG_pC y-css-ohs7lg"]/div/h3/a/@href
Cuisines: .//div[@class="priceCategory__09f24___4Wsg iaPriceCategory__09f24__x9YrM y-css-2hdccn"]/div/div/div/a/button/span/text()
Rating: .//div[@class="y-css-9tnml4"]/@aria-label
Note that these class names are auto-generated and change whenever Yelp updates its front end, so verify them in your browser's developer tools before running the scraper.
Now that we have the restaurant elements, we can extract the required data from each listing.
restaurants_data = []

# Iterate over each restaurant element
for element in elements:
    # Extract the restaurant name
    name = element.xpath('.//div[@class="businessName__09f24__HG_pC y-css-ohs7lg"]/div/h3/a/text()')[0]
    # Extract the restaurant URL
    url = element.xpath('.//div[@class="businessName__09f24__HG_pC y-css-ohs7lg"]/div/h3/a/@href')[0]
    # Extract the cuisines
    cuisines = element.xpath('.//div[@class="priceCategory__09f24___4Wsg iaPriceCategory__09f24__x9YrM y-css-2hdccn"]/div/div/div/a/button/span/text()')
    # Extract the rating
    rating = element.xpath('.//div[@class="y-css-9tnml4"]/@aria-label')[0]
    # Create a dictionary to store the data
    restaurant_info = {
        "name": name,
        "url": url,
        "cuisines": cuisines,
        "rating": rating
    }
    # Add the restaurant info to the list
    restaurants_data.append(restaurant_info)
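One fragility in the loop above: indexing with `[0]` raises an IndexError as soon as a card is missing one of the targeted elements, which happens often because Yelp's auto-generated class names change. A small helper (a sketch, not part of the original loop) lets the scraper degrade gracefully instead of crashing mid-run:

```python
def first(results, default=None):
    """Return the first XPath match, or a default when the list is empty.

    Yelp's auto-generated class names change frequently, so any of the
    queries in the loop may come back empty; this avoids an IndexError.
    """
    return results[0] if results else default
```

Inside the loop you would then write, for example, `name = first(element.xpath('...'), default="unknown")` and skip or log records where a required field stayed at its default.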
After extracting the data, we need to save it in a structured format. JSON is a widely used format for this purpose.
import json
# Save the data to a JSON file
with open('yelp_restaurants.json', 'w') as f:
    json.dump(restaurants_data, f, indent=4)

print("Data extraction complete. Saved to yelp_restaurants.json")
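If a spreadsheet-friendly format is preferred, the same list of dictionaries can be written as CSV with the standard library. This is an optional alternative to the JSON output above; the only wrinkle is flattening the cuisines list into a single cell:

```python
import csv

def save_csv(restaurants, path):
    """Write the scraped records to a CSV file, one row per restaurant."""
    fieldnames = ["name", "url", "cuisines", "rating"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for record in restaurants:
            row = dict(record)
            # cuisines is a list of strings; join it for a single CSV cell
            row["cuisines"] = "; ".join(row.get("cuisines") or [])
            writer.writerow(row)
```

Calling `save_csv(restaurants_data, 'yelp_restaurants.csv')` after the scrape produces a file that opens directly in Excel or Google Sheets.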
Here is the complete script, combining all of the steps above:

import requests
from lxml import html
import json

# Yelp search page URL
url = "https://www.yelp.com/search?find_desc=restaurants&find_loc=San+Francisco%2C+CA"

# Set up headers to mimic a browser request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.5'
}

# Set up proxies if required
proxies = {
    'http': 'http://username:password@proxy-server:port',
    'https': 'https://username:password@proxy-server:port'
}

# Send a GET request to fetch the HTML content
response = requests.get(url, headers=headers, proxies=proxies)

# Check if the request was successful
if response.status_code == 200:
    print("Successfully fetched the page content")
else:
    print("Failed to retrieve the page content")

# Parse the HTML content using lxml
parser = html.fromstring(response.content)

# Extract individual restaurant elements
elements = parser.xpath('//div[@data-testid="serp-ia-card"]')[2:-1]

# Initialize a list to hold the extracted data
restaurants_data = []

# Iterate over each restaurant element
for element in elements:
    # Extract the restaurant name
    name = element.xpath('.//div[@class="businessName__09f24__HG_pC y-css-ohs7lg"]/div/h3/a/text()')[0]
    # Extract the restaurant URL
    url = element.xpath('.//div[@class="businessName__09f24__HG_pC y-css-ohs7lg"]/div/h3/a/@href')[0]
    # Extract the cuisines
    cuisines = element.xpath('.//div[@class="priceCategory__09f24___4Wsg iaPriceCategory__09f24__x9YrM y-css-2hdccn"]/div/div/div/a/button/span/text()')
    # Extract the rating
    rating = element.xpath('.//div[@class="y-css-9tnml4"]/@aria-label')[0]
    # Create a dictionary to store the data
    restaurant_info = {
        "name": name,
        "url": url,
        "cuisines": cuisines,
        "rating": rating
    }
    # Add the restaurant info to the list
    restaurants_data.append(restaurant_info)

# Save the data to a JSON file
with open('yelp_restaurants.json', 'w') as f:
    json.dump(restaurants_data, f, indent=4)

print("Data extraction complete. Saved to yelp_restaurants.json")
It's crucial for users to properly configure HTTP headers and utilize proxies to circumvent restrictions and avoid blocking. For an optimized and safer scraping experience, consider automating IP rotation. Employing dynamic residential or mobile proxies can significantly enhance this process, reducing the likelihood of being detected and blocked.