In this article, we demonstrate how to collect data from Booking.com with Python. We will extract hotel names, ratings, prices, addresses, descriptions, and more. The code retrieves each hotel page, parses its HTML content, and extracts the embedded JSON data.
Before running the code to scrape data from Booking.com, install the required Python libraries with pip:
pip install requests lxml
These are the only external libraries you need, and the rest (json, csv) come pre-installed with Python.
When scraping data from Booking.com, it's important to understand the structure of the webpage and the kind of data you want to extract. Each hotel page on Booking.com contains embedded structured data in the form of JSON-LD, a format that allows easy extraction of details like name, location, and pricing. We'll be scraping this data.
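To make the target concrete, here is a small sketch of what a JSON-LD block looks like and how it decodes into a Python dictionary. The HTML snippet and its values are invented for illustration and are not copied from Booking.com; a real page is much larger and its fields may differ. The sketch also uses the standard library's html.parser instead of lxml, purely so that it runs without extra installs.

```python
import json
from html.parser import HTMLParser

# Invented sample page; a real hotel page is far larger and its fields may differ.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@type": "Hotel", "name": "Example Hotel",
 "address": {"addressLocality": "Example City"}}
</script>
</head><body></body></html>
"""


class JsonLdCollector(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self.blocks.append("")

    def handle_data(self, data):
        # Script content may arrive in several chunks, so append to the last block.
        if self._inside:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False


collector = JsonLdCollector()
collector.feed(SAMPLE_HTML)
hotel = json.loads(collector.blocks[0])
print(hotel["name"], "-", hotel["address"]["addressLocality"])
```

Once decoded, the block is an ordinary nested dictionary, which is why the rest of the article can read fields with plain key lookups.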
As Booking.com is a dynamic site and implements measures to combat automated actions, we will use appropriate HTTP headers and proxies to ensure seamless scraping without the risk of blocking.
Headers mimic a user session in a browser and prevent detection by Booking.com's anti-scraping systems. Without properly configured headers, the server can easily identify automated scripts, which may lead to IP blocking or captcha challenges.
The following snippet sends an HTTP request with a set of browser-like headers:
import requests
from lxml.html import fromstring
urls_list = ["https links"]
for url in urls_list:
    headers = {
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
        'accept-language': 'en-IN,en;q=0.9',
        'cache-control': 'no-cache',
        'dnt': '1',
        'pragma': 'no-cache',
        'priority': 'u=0, i',
        'sec-ch-ua': '"Chromium";v="130", "Google Chrome";v="130", "Not?A_Brand";v="99"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Linux"',
        'sec-fetch-dest': 'document',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-site': 'none',
        'sec-fetch-user': '?1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36',
    }
    response = requests.get(url, headers=headers)
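The loop above sends requests back to back and never checks whether the server actually returned the page. A more defensive variant might look like the sketch below. The helper name fetch_all, the two-second default pause, and the idea of passing the HTTP getter in as a parameter are all assumptions made for illustration; in the article's code the getter would be something like `lambda u: requests.get(u, headers=headers, timeout=30)`.

```python
import time


def fetch_all(urls, get, delay_seconds=2.0, sleep=time.sleep):
    """Fetch each URL with `get`, skipping responses that are not HTTP 200.

    `get` is any callable that takes a URL and returns an object with
    `status_code` and `text` attributes (a requests.Response qualifies).
    """
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # pause between consecutive requests
        response = get(url)
        if response.status_code != 200:
            # A 403 or 429 here often means the request was blocked or rate limited.
            print(f"Skipping {url}: HTTP {response.status_code}")
            continue
        pages[url] = response.text
    return pages
```

Taking the getter as an argument keeps the helper independent of the HTTP library and makes it easy to exercise with a stub before pointing it at a live site.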
Using proxies is necessary when scraping sites like Booking.com, which apply strict request rate limits or track IP addresses. Proxies help distribute the load of requests across different IP addresses, preventing blocks. For this purpose, both free proxies and paid proxy services with authentication by username and password or IP address can be used. In our example, we use the latter option.
proxies = {
    'http': '',   # put your proxy URL here
    'https': ''   # put your proxy URL here
}
response = requests.get(url, headers=headers, proxies=proxies)
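For a proxy that authenticates by username and password, the credentials go into the proxy URL itself. The sketch below shows one way to build the dictionary; the helper name build_proxies, the host proxy.example.com, the port, and the credentials are all placeholders invented for this example. It assumes an HTTP proxy, so both entries use the http:// scheme.

```python
from urllib.parse import quote


def build_proxies(host, port, username=None, password=None):
    """Return a requests-style proxies dict for an HTTP proxy."""
    if username and password:
        # Percent-encode credentials so characters like '@' or ':' stay valid in the URL.
        credentials = f"{quote(username, safe='')}:{quote(password, safe='')}@"
    else:
        credentials = ""  # proxy that authenticates by IP address: no credentials needed
    proxy_url = f"http://{credentials}{host}:{port}"
    # The same proxy handles both plain HTTP and HTTPS target URLs.
    return {"http": proxy_url, "https": proxy_url}


proxies = build_proxies("proxy.example.com", 8080, "user", "p@ss:word")
print(proxies["https"])
```

The resulting dictionary can be passed to requests.get through the proxies argument, exactly as in the snippet above.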
After sending the request, we parse the HTML content using lxml to locate the embedded JSON-LD data that contains hotel details. This step extracts the structured data from the web page that includes hotel names, prices, locations, and more.
import json
parser = fromstring(response.text)
# Extract embedded JSON data
embeded_jsons = parser.xpath('//script[@type="application/ld+json"]/text()')
# The first JSON-LD block is assumed to describe the hotel
json_data = json.loads(embeded_jsons[0])
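Indexing the first block directly raises an IndexError when the page has no JSON-LD at all (for example, when a block page is returned instead of the hotel page), and it assumes the hotel description always comes first. A more tolerant selection could look like this sketch. The helper name find_hotel_block and the assumption that the hotel object carries the type "Hotel" are ours; check the value of @type on the pages you actually scrape.

```python
import json


def find_hotel_block(raw_blocks, wanted_type="Hotel"):
    """Return the first decoded JSON-LD object whose @type matches, else None."""
    for raw in raw_blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of crashing
        # A JSON-LD script may hold a single object or a list of objects.
        candidates = data if isinstance(data, list) else [data]
        for candidate in candidates:
            if isinstance(candidate, dict) and candidate.get("@type") == wanted_type:
                return candidate
    return None
```

With this helper, `json_data = find_hotel_block(embeded_jsons)` returns None for pages without usable data, so the loop can skip them instead of stopping with an exception.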
Once we have the parsed JSON data, we can extract relevant fields such as hotel name, address, rating, and pricing. Below is the code to extract hotel information from the JSON:
name = json_data['name']
location = json_data['hasMap']
priceRange = json_data['priceRange']
description = json_data['description']
url = json_data['url']
ratingValue = json_data['aggregateRating']['ratingValue']
reviewCount = json_data['aggregateRating']['reviewCount']
type_ = json_data['@type']
postalCode = json_data['address']['postalCode']
addressLocality = json_data['address']['addressLocality']
addressCountry = json_data['address']['addressCountry']
addressRegion = json_data['address']['addressRegion']
streetAddress = json_data['address']['streetAddress']
image_url = json_data['image']
room_types = parser.xpath("//a[contains(@href, '#RD')]/span/text()")
# Append the data to the all_data list
all_data.append({
    "Name": name,
    "Location": location,
    "Price Range": priceRange,
    "Rating": ratingValue,
    "Review Count": reviewCount,
    "Type": type_,
    "Postal Code": postalCode,
    "Address Locality": addressLocality,
    "Country": addressCountry,
    "Region": addressRegion,
    "Street Address": streetAddress,
    "URL": url,
    "Image URL": image_url,
    "Room Types": room_types
})
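The direct key lookups above raise a KeyError as soon as one field is absent, for example if a listing has no reviews yet and therefore no aggregateRating object. A small helper can make the extraction tolerant of gaps. The name dig and the sample dictionary are hypothetical and only illustrate the idea.

```python
def dig(data, *keys, default=None):
    """Walk nested dicts by key, returning `default` if any level is missing."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Invented sample standing in for json_data
sample = {"name": "Example Hotel", "address": {"addressCountry": "XX"}}
print(dig(sample, "address", "addressCountry"))
print(dig(sample, "aggregateRating", "ratingValue", default="N/A"))
```

In the scraper, a line such as `ratingValue = dig(json_data, 'aggregateRating', 'ratingValue')` would then yield None for hotels without a rating instead of aborting the whole run.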
Once the data has been extracted, we can save it into a CSV file for further analysis:
# After all URLs are processed, write the data into a CSV file
# (UTF-8 keeps non-ASCII hotel names and addresses intact)
with open('booking_data.csv', 'w', newline='', encoding='utf-8') as csvfile:
    fieldnames = ["Name", "Location", "Price Range", "Rating", "Review Count", "Type", "Postal Code",
                  "Address Locality", "Country", "Region", "Street Address", "URL", "Image URL", "Room Types"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    # Write the header
    writer.writeheader()
    # Write the rows of data
    writer.writerows(all_data)
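One detail worth noting: the "Room Types" value comes from an XPath query and is therefore a Python list, which csv.DictWriter writes as its string representation, brackets and quotes included. The sketch below joins the list into a single readable cell before writing. The sample row and the "; " separator are arbitrary choices for this example, and an in-memory buffer stands in for the real file.

```python
import csv
import io

# Invented sample row; in the scraper this would be an entry of all_data
rows = [{"Name": "Example Hotel", "Room Types": ["Double Room", "Suite"]}]

buffer = io.StringIO()  # stands in for the real file so the sketch has no side effects
writer = csv.DictWriter(buffer, fieldnames=["Name", "Room Types"])
writer.writeheader()
for row in rows:
    flat = dict(row)  # copy so the original row is left untouched
    flat["Room Types"] = "; ".join(part.strip() for part in row["Room Types"])
    writer.writerow(flat)

csv_text = buffer.getvalue()
print(csv_text)
```

A separator other than a comma avoids extra quoting in the output and makes the cell easy to split again during analysis.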
Here’s the complete code combining all the sections:
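import requests
from lxml.html import fromstring
import json
import csv
# List of hotel URLs to scrape
urls_list = [
    "Https link",
    "Https link"
]
# Initialize an empty list to hold all the scraped data
all_data = []
# Proxy settings; hotel pages are served over HTTPS, so the 'https' entry is required for the proxy to be used
proxies = {
    'http': '',
    'https': ''
}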
# Loop through each URL to scrape data
for url in urls_list:
    headers = {
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
        'accept-language': 'en-IN,en;q=0.9',
        'cache-control': 'no-cache',
        'dnt': '1',
        'pragma': 'no-cache',
        'priority': 'u=0, i',
        'sec-ch-ua': '"Chromium";v="130", "Google Chrome";v="130", "Not?A_Brand";v="99"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Linux"',
        'sec-fetch-dest': 'document',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-site': 'none',
        'sec-fetch-user': '?1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36',
    }
    # Sending the request to the website
    response = requests.get(url, headers=headers, proxies=proxies)
    # Parsing the HTML content
    parser = fromstring(response.text)
    # Extract embedded JSON data
    embeded_jsons = parser.xpath('//script[@type="application/ld+json"]/text()')
    json_data = json.loads(embeded_jsons[0])
    # Extract all hotel details from JSON
    name = json_data['name']
    location = json_data['hasMap']
    priceRange = json_data['priceRange']
    description = json_data['description']
    url = json_data['url']
    ratingValue = json_data['aggregateRating']['ratingValue']
    reviewCount = json_data['aggregateRating']['reviewCount']
    type_ = json_data['@type']
    postalCode = json_data['address']['postalCode']
    addressLocality = json_data['address']['addressLocality']
    addressCountry = json_data['address']['addressCountry']
    addressRegion = json_data['address']['addressRegion']
    streetAddress = json_data['address']['streetAddress']
    image_url = json_data['image']
    room_types = parser.xpath("//a[contains(@href, '#RD')]/span/text()")
    # Append the data to the all_data list
    all_data.append({
        "Name": name,
        "Location": location,
        "Price Range": priceRange,
        "Rating": ratingValue,
        "Review Count": reviewCount,
        "Type": type_,
        "Postal Code": postalCode,
        "Address Locality": addressLocality,
        "Country": addressCountry,
        "Region": addressRegion,
        "Street Address": streetAddress,
        "URL": url,
        "Image URL": image_url,
        "Room Types": room_types
    })
# After all URLs are processed, write the data into a CSV file
# (UTF-8 keeps non-ASCII hotel names and addresses intact)
with open('booking_data.csv', 'w', newline='', encoding='utf-8') as csvfile:
    fieldnames = ["Name", "Location", "Price Range", "Rating", "Review Count", "Type", "Postal Code",
                  "Address Locality", "Country", "Region", "Street Address", "URL", "Image URL", "Room Types"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    # Write the header
    writer.writeheader()
    # Write the rows of data
    writer.writerows(all_data)
print("Data successfully saved to booking_data.csv")
This article showed how to scrape hotel data from Booking.com using Python. We emphasized the importance of using appropriate HTTP headers and proxies to bypass anti-scraping measures. The extracted data can be saved in a CSV file for further analysis. When scraping websites, always check their terms of service to avoid violating them.