How to scrape Booking.com data using Python


In this article, we will demonstrate how to collect data from the Booking.com website with Python. We will obtain information including, but not limited to, hotel names, ratings, prices, addresses, and descriptions. The code provided retrieves data from hotel pages by parsing the HTML content and extracting the embedded JSON data.

Installing required libraries

Before running the code to scrape data from Booking.com, you'll need to install the necessary Python libraries. Here’s how you can install the required dependencies:

  1. Requests Library: This is used to send HTTP requests to the website and fetch the HTML content of the pages.
  2. LXML Library: This is used for parsing the HTML content and extracting data using XPath.
  3. JSON: Built-in Python module used to handle JSON data.
  4. CSV: Built-in Python module used for writing scraped data into a CSV file.

To install the necessary libraries, you can use pip:


pip install requests lxml

These are the only external libraries you need, and the rest (json, csv) come pre-installed with Python.

Understanding the URL and data structure

When scraping data from Booking.com, it's important to understand the structure of the webpage and the kind of data you want to extract. Each hotel page on Booking.com contains embedded structured data in the form of JSON-LD, a format that allows easy extraction of details like name, location, and pricing. We'll be scraping this data.
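For illustration, the JSON-LD block on a hotel page looks roughly like the following. The values below are invented for this example; real pages contain more fields, but the field names follow the schema.org Hotel type used in the extraction code later in this article:

```python
import json

# Simplified, invented example of the JSON-LD payload embedded in a hotel page
sample_json_ld = '''
{
  "@type": "Hotel",
  "name": "Example Hotel",
  "priceRange": "$50 - $120",
  "aggregateRating": {"ratingValue": 8.5, "reviewCount": 1234},
  "address": {
    "streetAddress": "1 Example Street",
    "addressLocality": "Example City",
    "postalCode": "00000",
    "addressCountry": "NL"
  }
}
'''

# Once parsed, the fields can be accessed like any Python dict
data = json.loads(sample_json_ld)
print(data["name"])                            # Example Hotel
print(data["aggregateRating"]["ratingValue"])  # 8.5
```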

Step-by-Step Scraping Process

Because Booking.com is a dynamic site that implements measures against automated access, we will use appropriate HTTP headers and proxies to scrape reliably without being blocked.

Sending HTTP Requests with Headers

Headers mimic a real browser session and help avoid detection by Booking.com's anti-scraping systems. Without properly configured headers, the server can easily identify automated scripts, which may lead to IP blocking or CAPTCHA challenges. Here's how to send an HTTP request with appropriate headers:


import requests
from lxml.html import fromstring

# Replace with the Booking.com hotel page URLs you want to scrape
urls_list = ["https://..."]

for url in urls_list:
    headers = {
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
        'accept-language': 'en-IN,en;q=0.9',
        'cache-control': 'no-cache',
        'dnt': '1',
        'pragma': 'no-cache',
        'priority': 'u=0, i',
        'sec-ch-ua': '"Chromium";v="130", "Google Chrome";v="130", "Not?A_Brand";v="99"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Linux"',
        'sec-fetch-dest': 'document',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-site': 'none',
        'sec-fetch-user': '?1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36',
    }

    response = requests.get(url, headers=headers)
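Before parsing, it is worth checking the response status: a 403 or 429 usually means the anti-bot system has flagged the request. A small helper (an addition to the snippet above, not part of the original code) makes that check explicit:

```python
def is_successful(status_code: int) -> bool:
    """Return True for 2xx responses; anything else should not be parsed."""
    return 200 <= status_code < 300

# Inside the loop, after requests.get(...):
# if not is_successful(response.status_code):
#     print(f"Request failed with status {response.status_code}, skipping {url}")
#     continue
print(is_successful(200))  # True
print(is_successful(403))  # False
```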

Importance of proxies

Using proxies is necessary when scraping sites like Booking.com, which apply strict request rate limits or track IP addresses. Proxies help distribute the load of requests across different IP addresses, preventing blocks. For this purpose, both free proxies and paid proxy services with authentication by username and password or IP address can be used. In our example, we use the latter option.


proxies = {
    'http': '',   # e.g. 'http://username:password@proxy-address:port'
    'https': ''   # fill in your proxy credentials here
}
response = requests.get(url, headers=headers, proxies=proxies)
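To spread requests across several IP addresses, you can rotate through a pool of proxies round-robin. A minimal sketch follows; the proxy endpoints below are placeholders, not working addresses:

```python
from itertools import cycle

# Placeholder proxy endpoints -- substitute your own authenticated proxies
proxy_pool = cycle([
    {'http': 'http://user:pass@proxy1.example:8080', 'https': 'http://user:pass@proxy1.example:8080'},
    {'http': 'http://user:pass@proxy2.example:8080', 'https': 'http://user:pass@proxy2.example:8080'},
])

# Each request takes the next proxy in the pool:
# response = requests.get(url, headers=headers, proxies=next(proxy_pool))
first = next(proxy_pool)
second = next(proxy_pool)
print(first['http'])   # http://user:pass@proxy1.example:8080
print(second['http'])  # http://user:pass@proxy2.example:8080
```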

Parsing the HTML and extracting JSON data

After sending the request, we parse the HTML content using lxml to locate the embedded JSON-LD data that contains hotel details. This step extracts the structured data from the web page that includes hotel names, prices, locations, and more.


import json

parser = fromstring(response.text)

# Extract embedded JSON-LD data
embedded_jsons = parser.xpath('//script[@type="application/ld+json"]/text()')
json_data = json.loads(embedded_jsons[0])
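Note that a page can embed more than one `application/ld+json` block (for example, breadcrumb data alongside the hotel record), so blindly taking the first element may pick the wrong one. A defensive sketch that scans the extracted script texts for the Hotel entry (the helper name is our own, not from the original code):

```python
import json

def pick_hotel_json(script_texts):
    """Return the first JSON-LD object whose @type is 'Hotel', or None."""
    for text in script_texts:
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            continue  # skip malformed blocks
        if data.get('@type') == 'Hotel':
            return data
    return None

# Usage: json_data = pick_hotel_json(scripts), where `scripts` is the XPath result above
sample = ['{"@type": "BreadcrumbList"}', '{"@type": "Hotel", "name": "Example Hotel"}']
print(pick_hotel_json(sample)['name'])  # Example Hotel
```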

Extracting hotel information

Once we have the parsed JSON data, we can extract relevant fields such as hotel name, address, rating, and pricing. Below is the code to extract hotel information from the JSON:


name = json_data['name']
location = json_data['hasMap']
priceRange = json_data['priceRange']
description = json_data['description']
url = json_data['url']
ratingValue = json_data['aggregateRating']['ratingValue']
reviewCount = json_data['aggregateRating']['reviewCount']
type_ = json_data['@type']
postalCode = json_data['address']['postalCode']
addressLocality = json_data['address']['addressLocality']
addressCountry = json_data['address']['addressCountry']
addressRegion = json_data['address']['addressRegion']
streetAddress = json_data['address']['streetAddress']
image_url = json_data['image']
room_types = parser.xpath("//a[contains(@href, '#RD')]/span/text()")

# Append the data to the all_data list
all_data.append({
    "Name": name,
    "Location": location,
    "Price Range": priceRange,
    "Rating": ratingValue,
    "Review Count": reviewCount,
    "Type": type_,
    "Postal Code": postalCode,
    "Address Locality": addressLocality,
    "Country": addressCountry,
    "Region": addressRegion,
    "Street Address": streetAddress,
    "URL": url,
    "Image URL": image_url,
    "Room Types": ", ".join(room_types)  # join the list into one CSV-friendly string
})
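Direct indexing raises a KeyError when a field is missing, and not every listing exposes every field (priceRange or aggregateRating may be absent). A hedged alternative is a small helper that walks nested keys and falls back to a default; this helper is our own addition, not part of the original code:

```python
def get_nested(data, *keys, default=''):
    """Walk nested dict keys, returning `default` if any level is missing."""
    for key in keys:
        if not isinstance(data, dict) or key not in data:
            return default
        data = data[key]
    return data

# Invented sample record missing the aggregateRating field
sample = {'name': 'Example Hotel', 'address': {'postalCode': '00000'}}
print(get_nested(sample, 'address', 'postalCode'))           # 00000
print(get_nested(sample, 'aggregateRating', 'ratingValue'))  # (empty string)
```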

Saving the data to CSV

Once the data has been extracted, we can save it into a CSV file for further analysis:


# After all URLs are processed, write the data into a CSV file
with open('booking_data.csv', 'w', newline='', encoding='utf-8') as csvfile:
    fieldnames = ["Name", "Location", "Price Range", "Rating", "Review Count", "Type", "Postal Code", 
                  "Address Locality", "Country", "Region", "Street Address", "URL", "Image URL", "Room Types"]
    
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    
    # Write the header
    writer.writeheader()
    
    # Write the rows of data
    writer.writerows(all_data)
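To sanity-check the output, the file can be read back with csv.DictReader. The snippet below writes an invented sample row to a temporary file and re-reads it to confirm the round trip:

```python
import csv
import os
import tempfile

# Invented sample row, matching a subset of the fieldnames used above
rows = [{"Name": "Example Hotel", "Rating": "8.5"}]
fieldnames = ["Name", "Rating"]

# Write the sample file, then read it back to verify the structure
path = os.path.join(tempfile.mkdtemp(), 'sample.csv')
with open(path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)

with open(path, newline='', encoding='utf-8') as f:
    read_back = list(csv.DictReader(f))
print(read_back[0]["Name"])  # Example Hotel
```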

Complete code

Here’s the complete code combining all the sections:


import requests
from lxml.html import fromstring
import json
import csv

# List of hotel URLs to scrape
urls_list = [
    "https://...",  # replace with Booking.com hotel page URLs
    "https://..."
]

# Initialize an empty list to hold all the scraped data
all_data = []

proxies = {
    'http': '',   # e.g. 'http://username:password@proxy-address:port'
    'https': ''   # fill in your proxy credentials here
}

# Loop through each URL to scrape data
for url in urls_list:
    headers = {
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
        'accept-language': 'en-IN,en;q=0.9',
        'cache-control': 'no-cache',
        'dnt': '1',
        'pragma': 'no-cache',
        'priority': 'u=0, i',
        'sec-ch-ua': '"Chromium";v="130", "Google Chrome";v="130", "Not?A_Brand";v="99"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Linux"',
        'sec-fetch-dest': 'document',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-site': 'none',
        'sec-fetch-user': '?1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36',
    }

    # Sending the request to the website
    response = requests.get(url, headers=headers, proxies=proxies)
    
    # Parsing the HTML content
    parser = fromstring(response.text)
    
    # Extract embedded JSON-LD data
    embedded_jsons = parser.xpath('//script[@type="application/ld+json"]/text()')
    json_data = json.loads(embedded_jsons[0])

    # Extract all hotel details from JSON
    name = json_data['name']
    location = json_data['hasMap']
    priceRange = json_data['priceRange']
    description = json_data['description']
    url = json_data['url']
    ratingValue = json_data['aggregateRating']['ratingValue']
    reviewCount = json_data['aggregateRating']['reviewCount']
    type_ = json_data['@type']
    postalCode = json_data['address']['postalCode']
    addressLocality = json_data['address']['addressLocality']
    addressCountry = json_data['address']['addressCountry']
    addressRegion = json_data['address']['addressRegion']
    streetAddress = json_data['address']['streetAddress']
    image_url = json_data['image']

    room_types = parser.xpath("//a[contains(@href, '#RD')]/span/text()")
    
    # Append the data to the all_data list
    all_data.append({
        "Name": name,
        "Location": location,
        "Price Range": priceRange,
        "Rating": ratingValue,
        "Review Count": reviewCount,
        "Type": type_,
        "Postal Code": postalCode,
        "Address Locality": addressLocality,
        "Country": addressCountry,
        "Region": addressRegion,
        "Street Address": streetAddress,
        "URL": url,
        "Image URL": image_url,
        "Room Types": ", ".join(room_types)  # join the list into one CSV-friendly string
    })

# After all URLs are processed, write the data into a CSV file
with open('booking_data.csv', 'w', newline='', encoding='utf-8') as csvfile:
    fieldnames = ["Name", "Location", "Price Range", "Rating", "Review Count", "Type", "Postal Code", 
                  "Address Locality", "Country", "Region", "Street Address", "URL", "Image URL", "Room Types"]
    
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    
    # Write the header
    writer.writeheader()
    
    # Write the rows of data
    writer.writerows(all_data)

print("Data successfully saved to booking_data.csv")
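In practice, the request loop should also tolerate transient failures such as timeouts or 429 responses. A small retry sketch with exponential backoff; the helper and its backoff values are our own choices, independent of the code above:

```python
import time

def fetch_with_retries(fetch, attempts=3, backoff=1.0):
    """Call `fetch()` up to `attempts` times, doubling the delay after each failure."""
    delay = backoff
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: re-raise the last error
            time.sleep(delay)
            delay *= 2

# Usage: fetch_with_retries(lambda: requests.get(url, headers=headers, proxies=proxies, timeout=30))
calls = {'n': 0}
def flaky():
    # Simulated request that fails twice before succeeding
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('transient failure')
    return 'ok'

print(fetch_with_retries(flaky, backoff=0.01))  # ok
```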

This article showed how to scrape hotel data from Booking.com using Python. We emphasized the importance of using appropriate HTTP headers and proxies to bypass anti-scraping measures. The extracted data can be saved in a CSV file for further analysis. When scraping websites, always check the site's terms of service to avoid violating them.
