Scraping data from Yelp can provide valuable insights into local restaurants, including details like the name, URL, cuisines, and rating of each business. Using the requests and lxml Python libraries, this tutorial shows how to scrape Yelp search results. Several techniques are covered, including using proxies, handling headers, and extracting data with XPath.
Before starting the scraping process, ensure you have Python installed and the required libraries:
pip install requests
pip install lxml
These libraries will help us send HTTP requests to Yelp, parse the HTML content, and extract the data we need.
First, we need to send a GET request to the Yelp search results page to fetch the HTML content. Here's how to do it:
import requests
# Yelp search page URL
url = "https://www.yelp.com/search?find_desc=restaurants&find_loc=San+Francisco%2C+CA"
# Send a GET request to fetch the HTML content
response = requests.get(url)
# Check if the request was successful
if response.status_code == 200:
    print("Successfully fetched the page content")
else:
    print("Failed to retrieve the page content")
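In practice, requests can also time out or fail transiently (network hiccups, rate limiting), so it helps to retry a few times before giving up. The sketch below is a generic retry wrapper, not anything Yelp-specific: it accepts any zero-argument callable, such as a lambda wrapping `requests.get(url, timeout=10)`.

```python
import time

def fetch_with_retries(get_page, max_retries=3, backoff=2.0):
    """Call get_page() up to max_retries times, sleeping between attempts.

    get_page is any zero-argument callable that returns a response-like
    object or raises on failure (e.g. a wrapped requests.get call).
    """
    last_error = None
    for attempt in range(max_retries):
        try:
            return get_page()
        except Exception as exc:  # requests.RequestException in practice
            last_error = exc
            time.sleep(backoff * (attempt + 1))  # linear backoff between tries
    raise last_error
```

You would use it as `response = fetch_with_retries(lambda: requests.get(url, timeout=10))`, so a single flaky response does not abort the whole run.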
When making requests to a website, it's essential to include the appropriate HTTP headers. Headers can contain metadata about the request, such as the user agent, which identifies the browser or tool making the request. Including these headers can help avoid blocking or throttling by the target website.
Here’s how you can set up headers:
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-IN,en;q=0.9',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'priority': 'u=0, i',
    'sec-ch-ua': '"Not)A;Brand";v="99", "Google Chrome";v="127", "Chromium";v="127"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Linux"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36',
}
response = requests.get(url, headers=headers)
When scraping a large volume of pages, there's a risk of your IP address being blocked by the target site. To prevent this, using proxy servers is recommended. For this guide, it's advisable to use dynamic proxy servers that feature automatic rotation. This way, you only need to set up the proxy server settings once, and the rotation will help maintain access by periodically changing the IP address, reducing the likelihood of being blocked.
proxies = {
    'http': 'http://username:password@proxy-server:port',
    'https': 'https://username:password@proxy-server:port'
}
response = requests.get(url, headers=headers, proxies=proxies)
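If your provider gives you a static list of proxy endpoints rather than a single auto-rotating gateway, you can rotate through them yourself. The sketch below picks a random entry per request; the pool addresses are placeholders, not real servers.

```python
import random

# Hypothetical pool of proxy endpoints -- replace with your provider's list.
PROXY_POOL = [
    "http://username:password@proxy1.example.com:8080",
    "http://username:password@proxy2.example.com:8080",
    "http://username:password@proxy3.example.com:8080",
]

def pick_proxies(pool):
    """Return a requests-style proxies dict built from a random pool entry."""
    proxy = random.choice(pool)
    return {"http": proxy, "https": proxy}
```

Each request then gets a (potentially) different exit IP: `response = requests.get(url, headers=headers, proxies=pick_proxies(PROXY_POOL))`.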
Once we have the HTML content, the next step is to parse it and extract the relevant data. We’ll use the lxml library for this purpose.
from lxml import html
# Parse the HTML content using lxml
parser = html.fromstring(response.content)
We need to target the individual restaurant listings on the search results page. These elements can be identified using XPath expressions. For Yelp, the listings are usually wrapped in a div element with a specific data-testid attribute.
# Extract individual restaurant elements
elements = parser.xpath('//div[@data-testid="serp-ia-card"]')[2:-1]
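As a sanity check, the same selection pattern can be exercised on a tiny HTML snippet before pointing it at a live page. The markup below is made up purely for illustration:

```python
from lxml import html

# Minimal stand-in for a search results page (illustrative markup only).
sample = """
<div>
  <div data-testid="serp-ia-card"><h3><a href="/biz/a">Cafe A</a></h3></div>
  <div data-testid="serp-ia-card"><h3><a href="/biz/b">Cafe B</a></h3></div>
  <div data-testid="other">not a listing</div>
</div>
"""

tree = html.fromstring(sample)
# Select only the divs whose data-testid marks them as listing cards
cards = tree.xpath('//div[@data-testid="serp-ia-card"]')
names = [card.xpath('.//h3/a/text()')[0] for card in cards]
print(names)  # ['Cafe A', 'Cafe B']
```

The third div is ignored because its data-testid does not match, which is exactly how the attribute filter isolates restaurant cards on the real page.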
XPath is a powerful tool for navigating and selecting nodes in an HTML document. In this tutorial, we use XPath expressions to extract the restaurant name, URL, cuisines, and rating from each restaurant element.
Here are the specific XPaths for each data point:
Name: .//div[@class="businessName__09f24__HG_pC y-css-ohs7lg"]/div/h3/a/text()
URL: .//div[@class="businessName__09f24__HG_pC y-css-ohs7lg"]/div/h3/a/@href
Cuisines: .//div[@class="priceCategory__09f24___4Wsg iaPriceCategory__09f24__x9YrM y-css-2hdccn"]/div/div/div/a/button/span/text()
Rating: .//div[@class="y-css-9tnml4"]/@aria-label
Note that these class names are auto-generated and change whenever Yelp updates its front end, so verify them in your browser's developer tools before running the scraper.
Now that we have the restaurant elements, we can extract the required data from each listing.
restaurants_data = []

# Iterate over each restaurant element
for element in elements:
    # Extract the restaurant name
    name = element.xpath('.//div[@class="businessName__09f24__HG_pC y-css-ohs7lg"]/div/h3/a/text()')[0]
    # Extract the restaurant URL
    url = element.xpath('.//div[@class="businessName__09f24__HG_pC y-css-ohs7lg"]/div/h3/a/@href')[0]
    # Extract the cuisines
    cuisines = element.xpath('.//div[@class="priceCategory__09f24___4Wsg iaPriceCategory__09f24__x9YrM y-css-2hdccn"]/div/div/div/a/button/span/text()')
    # Extract the rating
    rating = element.xpath('.//div[@class="y-css-9tnml4"]/@aria-label')[0]
    # Create a dictionary to store the data
    restaurant_info = {
        "name": name,
        "url": url,
        "cuisines": cuisines,
        "rating": rating
    }
    # Add the restaurant info to the list
    restaurants_data.append(restaurant_info)
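One fragility in the loop above: indexing with `[0]` raises an IndexError as soon as a card is missing one of the targeted elements, which happens often because Yelp's auto-generated class names change. A small helper (a sketch, not part of the original loop) lets the scraper degrade gracefully instead of crashing mid-run:

```python
def first(results, default=None):
    """Return the first XPath match, or a default when the list is empty.

    Yelp's auto-generated class names change frequently, so any of the
    queries in the loop may come back empty; this avoids an IndexError.
    """
    return results[0] if results else default
```

Inside the loop you would then write, for example, `name = first(element.xpath('...'), default="unknown")` and skip or log records where a required field stayed at its default.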
After extracting the data, we need to save it in a structured format. JSON is a widely used format for this purpose.
import json
# Save the data to a JSON file
with open('yelp_restaurants.json', 'w') as f:
    json.dump(restaurants_data, f, indent=4)

print("Data extraction complete. Saved to yelp_restaurants.json")
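If a spreadsheet-friendly format is preferred, the same list of dictionaries can be written as CSV with the standard library. This is an optional alternative to the JSON output above; the only wrinkle is flattening the cuisines list into a single cell:

```python
import csv

def save_csv(restaurants, path):
    """Write the scraped records to a CSV file, one row per restaurant."""
    fieldnames = ["name", "url", "cuisines", "rating"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for record in restaurants:
            row = dict(record)
            # cuisines is a list of strings; join it for a single CSV cell
            row["cuisines"] = "; ".join(row.get("cuisines") or [])
            writer.writerow(row)
```

Calling `save_csv(restaurants_data, 'yelp_restaurants.csv')` after the scrape produces a file that opens directly in Excel or Google Sheets.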
Here is the complete script, combining all of the steps above:

import requests
from lxml import html
import json

# Yelp search page URL
url = "https://www.yelp.com/search?find_desc=restaurants&find_loc=San+Francisco%2C+CA"

# Set up headers to mimic a browser request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.5'
}

# Set up proxies if required
proxies = {
    'http': 'http://username:password@proxy-server:port',
    'https': 'https://username:password@proxy-server:port'
}

# Send a GET request to fetch the HTML content
response = requests.get(url, headers=headers, proxies=proxies)

# Check if the request was successful
if response.status_code == 200:
    print("Successfully fetched the page content")
else:
    print("Failed to retrieve the page content")

# Parse the HTML content using lxml
parser = html.fromstring(response.content)

# Extract individual restaurant elements
elements = parser.xpath('//div[@data-testid="serp-ia-card"]')[2:-1]

# Initialize a list to hold the extracted data
restaurants_data = []

# Iterate over each restaurant element
for element in elements:
    # Extract the restaurant name
    name = element.xpath('.//div[@class="businessName__09f24__HG_pC y-css-ohs7lg"]/div/h3/a/text()')[0]
    # Extract the restaurant URL
    url = element.xpath('.//div[@class="businessName__09f24__HG_pC y-css-ohs7lg"]/div/h3/a/@href')[0]
    # Extract the cuisines
    cuisines = element.xpath('.//div[@class="priceCategory__09f24___4Wsg iaPriceCategory__09f24__x9YrM y-css-2hdccn"]/div/div/div/a/button/span/text()')
    # Extract the rating
    rating = element.xpath('.//div[@class="y-css-9tnml4"]/@aria-label')[0]
    # Create a dictionary to store the data
    restaurant_info = {
        "name": name,
        "url": url,
        "cuisines": cuisines,
        "rating": rating
    }
    # Add the restaurant info to the list
    restaurants_data.append(restaurant_info)

# Save the data to a JSON file
with open('yelp_restaurants.json', 'w') as f:
    json.dump(restaurants_data, f, indent=4)

print("Data extraction complete. Saved to yelp_restaurants.json")
It's crucial for users to properly configure HTTP headers and utilize proxies to circumvent restrictions and avoid blocking. For an optimized and safer scraping experience, consider automating IP rotation. Employing dynamic residential or mobile proxies can significantly enhance this process, reducing the likelihood of being detected and blocked.