Scraping Airbnb listing data with Python

Access to Airbnb data is valuable for analyzing the real estate market, researching rental price dynamics, conducting competitive analysis, and assessing reviews and ratings. One way to obtain this data is by scraping the web pages themselves. This can be challenging, however: scraping may violate the site's terms of use, and the platform actively restricts automated access.

Below is a step-by-step guide to building a web scraper that extracts data from Airbnb listings using Python and Selenium, including how to avoid the blocks and restrictions the platform may impose.

Understanding the architecture of Airbnb's website

The first step in creating a web scraper is understanding how to access the web pages you're interested in, since the structure of websites can often change. To familiarize yourself with the structure of a site, you can use the browser's developer tools to inspect the HTML of the web page.

To access Developer Tools, right-click on the webpage and select “Inspect” or use the shortcut:

  • Ctrl + Shift + I on Windows;
  • Option + ⌘ + I on macOS.

Each listing container is wrapped in a div element with the following attribute: class="g1qv1ctd".


By clicking the "location" field and typing "London, UK", we can bring up the accommodation offered in London. The website also suggests adding check-in and check-out dates, which lets it calculate the nightly price of the rooms.


The URL for this page would look something like this:

url = "https://www.airbnb.com/s/London--United-Kingdom/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&flexible_trip_lengths%5B%5D=one_week&monthly_start_date=2024-01-01&monthly_length=3&price_filter_input_type=0&channel=EXPLORE&query=London%2C%20United%20Kingdom&place_id=ChIJdd4hrwug2EcRmSrV3Vo6llI&date_picker_type=calendar&source=structured_search_input_header&search_type=autocomplete_click"
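
Since this URL is just a base path plus a set of query parameters, you can also assemble it programmatically. Here is a small sketch using Python's urllib.parse; the parameter names are copied from the URL above and may change on Airbnb's side at any time.

from urllib.parse import urlencode

# Rebuild the search URL from its query parameters so other cities or
# filters are easy to swap in. Parameter names come from the example URL.
base = "https://www.airbnb.com/s/London--United-Kingdom/homes"
params = {
    "tab_id": "home_tab",
    "refinement_paths[]": "/homes",
    "query": "London, United Kingdom",
    "channel": "EXPLORE",
    "date_picker_type": "calendar",
    "source": "structured_search_input_header",
    "search_type": "autocomplete_click",
}
url = f"{base}?{urlencode(params)}"
print(url)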

From the search page, we will scrape the following attributes of each listing:

  • Listing URL;
  • Title;
  • Description;
  • Rating;
  • Price;
  • Additional listing information (number of beds and available dates).


Step-by-step guide on building an Airbnb scraping program

To start web scraping for Airbnb data, you need to set up your development environment first. Here are the steps to do that:

Step 1: Creating a virtual environment

Virtual environments allow you to isolate Python packages and their dependencies for different projects. This helps prevent conflicts and ensures that each project has the correct dependencies installed.

Creating a virtual environment on Windows

Open a command prompt and run the following command to create a new virtual environment named “venv”:

python -m venv venv

Activate the virtual environment:

venv\Scripts\activate

Creating a virtual environment on macOS/Linux

Open a terminal and run the following command to create a new virtual environment named “venv”:

python3 -m venv venv

Activate the virtual environment:

source venv/bin/activate

To deactivate the virtual environment, simply run the following command:

deactivate

Step 2: Installing the required libraries

Now that you have a virtual environment set up, you can install the necessary libraries.

Understanding the libraries:

  • Selenium: A browser automation tool that allows you to programmatically control a web browser. This lets you interact with web pages, including clicking buttons, filling forms, and navigating through pages as if you were a real user.
  • Seleniumwire: This library extends Selenium by allowing you to intercept and inspect HTTP requests and integrate proxies with your scraping operations. This is important here because Selenium on its own offers no convenient support for authenticated proxies.
  • BeautifulSoup4: This is a library designed for parsing HTML and XML files. It helps you extract specific information from web pages in a structured and efficient way.
  • lxml: A fast and robust HTML and XML parser that complements BeautifulSoup.

Within your activated virtual environment, run the following command to install the required libraries:

pip install selenium beautifulsoup4 lxml selenium-wire

Selenium drivers

Selenium requires a driver to interface with the chosen browser. We will use Chrome for this guide. However, please ensure you have installed the appropriate WebDriver for the browser of your choice.

Once downloaded, ensure the driver is placed in a directory accessible by your system's PATH environment variable. This will allow Selenium to find the driver and control the browser.
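
Recent Selenium versions (4.6 and newer) can also fetch a matching driver automatically via Selenium Manager. Either way, a quick smoke test like the sketch below (plain Selenium, no proxy yet) can confirm that the browser launches before you add more moving parts:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Minimal smoke test: launch headless Chrome and print the page title.
# If the driver cannot be found or started, this raises a WebDriverException.
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get("https://www.airbnb.com")
print(driver.title)
driver.quit()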

Step 3: Import libraries

At the beginning of your Python file, import the Seleniumwire and BeautifulSoup libraries. This is how you do it:

from seleniumwire import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import time
import csv
import random

We also import the standard library modules `time`, `csv`, and `random` for various utilities.

Step 4: Proxy integration

Next, we define a list of proxies to avoid being blocked by Airbnb. When trying to send a request without a premium proxy, you may encounter an "Access Denied" response.

4.png

You can set up a proxy as follows:

# List of proxies
proxies = [
    "username:password@Your_proxy_IP_Address:Your_proxy_port1",
    "username:password@Your_proxy_IP_Address:Your_proxy_port2",
    "username:password@Your_proxy_IP_Address:Your_proxy_port3",
    "username:password@Your_proxy_IP_Address:Your_proxy_port4",
    "username:password@Your_proxy_IP_Address:Your_proxy_port5",
]

Be sure to replace "Your_proxy_IP_Address" and "Your_proxy_port" with the actual proxy address you obtained from Proxy-seller, and replace the values of “username” and “password” with your actual credentials.
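
Hard-coding credentials in the script is risky if the file is ever shared or committed to version control. One option is to read them from environment variables, as in this sketch; the variable names PROXY_USER and PROXY_PASS are our own choice, not a Proxy-seller convention:

import os

# Illustrative only: read proxy credentials from the environment instead of
# hard-coding them. Set PROXY_USER and PROXY_PASS in your shell first.
PROXY_USER = os.environ["PROXY_USER"]
PROXY_PASS = os.environ["PROXY_PASS"]

proxies = [
    f"{PROXY_USER}:{PROXY_PASS}@Your_proxy_IP_Address:Your_proxy_port1",
    f"{PROXY_USER}:{PROXY_PASS}@Your_proxy_IP_Address:Your_proxy_port2",
]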

Step 5: Rotating proxies

Rotating proxies is a crucial aspect of web scraping. Websites often block or restrict access to bots and scrapers when they receive multiple requests from the same IP address. By rotating through different proxy IP addresses, you can avoid detection, appear as multiple organic users, and bypass most anti-scraping measures implemented on the website.

To set up proxy rotation, we use the `random` library imported earlier and define a function `get_proxy()` that selects a proxy from our list. It randomly picks one using the random.choice() method and returns it.

def get_proxy():
    return random.choice(proxies)
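
Note that random.choice() can pick the same proxy twice in a row. If you would rather cycle through the list evenly, a round-robin variant built on itertools.cycle is one possible alternative:

import itertools

# Alternative rotation: cycle through the proxies in order rather than
# picking one at random, so each proxy gets an even share of requests.
_proxy_pool = itertools.cycle(proxies)

def get_proxy():
    return next(_proxy_pool)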

Step 6: Set up WebDriver

Next, we define the main function called `listings()`. This is where we set up our ChromeDriver. The function uses Selenium to navigate to the property listings page, waits for the page to load, and parses the HTML using Beautiful Soup.

def listings(url):

    proxy = get_proxy()
    proxy_options = {
        "proxy": {
            "http": f"http://{proxy}",
            "https": f"http://{proxy}",
            "no_proxy": "localhost,127.0.0.1",
        }
    }

    chrome_options = Options()
    chrome_options.add_argument("--headless")

    s = Service(
        "C:/Path_To_Your_WebDriver"
    )  # Replace with your path to ChromeDriver
    driver = webdriver.Chrome(
        service=s, seleniumwire_options=proxy_options, options=chrome_options
    )

    driver.get(url)

    time.sleep(8)  # Adjust based on website's load time

    soup = BeautifulSoup(driver.page_source, "lxml")

    driver.quit()

Here, we start by selecting a random proxy and setting up the proxy options, which configure the WebDriver to route traffic through the proxy server. Next, we set up the Chrome options, adding the --headless argument so that the browser runs in the background without a graphical user interface.

We then initialize the WebDriver with the service, Seleniumwire options, and Chrome options, and use it to navigate to the given URL. A sleep time of 8 seconds allows the page to load completely; the returned HTML is then parsed with Beautiful Soup, after which the WebDriver is closed.
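
A fixed 8-second sleep is simple but brittle: it wastes time on fast connections and may be too short on slow ones. If you prefer, Selenium's explicit waits block only until the listing containers (the same g1qv1ctd class we parse later) are present. A sketch that could replace the time.sleep(8) call:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 15 seconds for at least one listing container to appear,
# instead of sleeping for a fixed interval.
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div.g1qv1ctd"))
)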

Step 7: Finding and extracting the listing data

Once you have successfully obtained the HTML content, the next step is to extract relevant data for each listing. Using BeautifulSoup, we can easily navigate through the HTML structure and locate the elements containing the listing information.

Extracting listing elements

First, we identify all the listing elements on the page. These elements contain the data we're interested in, such as the listing URL, title, description, rating, price, and additional information.

listing_elements = soup.find_all("div", class_="g1qv1ctd")
for listing_element in listing_elements:

This code uses BeautifulSoup's find_all() method to locate all div elements with the class “g1qv1ctd”. These elements represent individual listings on the Airbnb page. It then loops through each of these listing elements to extract the relevant data.

Extracting listing URL

For each listing element found, we extract the URL of the listing.

url_element = listing_element.find("a", class_="rfexzly")
listing_data["Listing URL"] = (
    "https://www.airbnb.com" + url_element["href"] if url_element else ""
)

Here, we search within the current listing element for an anchor tag with the class “rfexzly” (searching the whole “soup” object instead would return the first listing's link for every row). If the element is found, we extract its 'href' attribute (which contains the relative URL) and append it to the base URL to create the complete listing URL. If the element is not found, we assign an empty string to avoid errors.

Extracting the listing title

Next, we extract the title of each listing.

title_element = listing_element.find("div", class_="t1jojoys")
listing_data["Title"] = (
    title_element.get_text(strip=True) if title_element else ""
)

The title is contained within a “div” element with the class “t1jojoys”. We retrieve the text content of this element, stripping any leading or trailing whitespace. An empty string is stored if the element is not found.

Extracting the listing description

description_element = listing_element.find("span", class_="t6mzqp7")
listing_data["Description"] = (
    description_element.get_text(strip=True) if description_element else ""
)

Similar to the title extraction, this code finds a span element with the class "t6mzqp7". We then extract and clean the text content of this element, which contains a short description of the listing.

Extracting the listing rating

rating_element = listing_element.find("span", class_="ru0q88m")
listing_data["Rating"] = (
    rating_element.get_text(strip=True) if rating_element else ""
)

As seen in the code above, a span element with the class “ru0q88m” holds the rating value. We extract this value, ensuring to strip any unnecessary whitespace.

Extracting the listing price

Finally, we extract the price of the listing.

price_element = listing_element.select_one("._1y74zjx")
listing_data["Price"] = (
    f"{price_element.get_text(strip=True)} per night" if price_element else ""
)

This code locates the element with the class "_1y74zjx" within the current listing_element. If this element, which typically contains the price information, is found, its text content is extracted, cleaned, and appended with "per night" to form a more informative price string.
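
The stored value is a display string such as "£75 per night". If you plan to sort or aggregate prices later, you may want the numeric part on its own; a simple regex-based sketch, assuming the price follows a "currency symbol plus digits" format:

import re

def parse_price(price_text):
    """Extract the numeric part of a price string like '£75' or '$1,200'."""
    match = re.search(r"[\d,]+(?:\.\d+)?", price_text)
    return float(match.group().replace(",", "")) if match else None

print(parse_price("£1,250 per night"))  # 1250.0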

Extracting additional listing information

Some listings may have additional information that we can extract.

listing_info_element = listing_element.find("span", {"aria-hidden": "true"})
listing_data["Additional Listing information"] = (
    listing_info_element.get_text(strip=True) if listing_info_element else ""
)

We search for a span element with the attribute aria-hidden="true" to find any additional information about the listing. After extracting all relevant data from each listing element, we append the collected data to a list of listings.

all_listings.append(listing_data)

Once all listings have been processed, we return the list of listings, each represented as a dictionary containing the extracted data.

return all_listings
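
As an aside, each extraction above repeats the same pattern: find an element, then take its text if it exists. If you prefer, that pattern can be factored into a small helper like this sketch:

def text_or_empty(parent, name, **attrs):
    """Return the stripped text of the first matching tag, or '' if absent."""
    element = parent.find(name, **attrs)
    return element.get_text(strip=True) if element else ""

# Usage inside the loop, for example:
# listing_data["Title"] = text_or_empty(listing_element, "div", class_="t1jojoys")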

Step 8: Writing data to a CSV file

After successfully scraping data from Airbnb's listing pages, the next important step is storing this information for future analysis and reference. We use the csv library for this task: we open a CSV file in write mode, create a csv.DictWriter object, and write the header and the data to the file.

airbnb_listings = listings(url)

csv_file_path = "proxy_web_listings_output.csv"

with open(csv_file_path, "w", encoding="utf-8", newline="") as csv_file:
    fieldnames = [
        "Listing URL",
        "Title",
        "Description",
        "Rating",
        "Price",
        "Additional Listing information",
    ]
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()
    for listing in airbnb_listings:
        writer.writerow(listing)

print(f"Data has been exported to {csv_file_path}")

Here is the complete code used in this tutorial:

from seleniumwire import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import time
import csv
import random

# List of proxies
proxies = [
    "username:password@Your_proxy_IP_Address:Your_proxy_port1",
    "username:password@Your_proxy_IP_Address:Your_proxy_port2",
    "username:password@Your_proxy_IP_Address:Your_proxy_port3",
    "username:password@Your_proxy_IP_Address:Your_proxy_port4",
    "username:password@Your_proxy_IP_Address:Your_proxy_port5",
]

def get_proxy():
    return random.choice(proxies)


def listings(url):

    proxy = get_proxy()
    proxy_options = {
        "proxy": {
            "http": f"http://{proxy}",
            "https": f"http://{proxy}",
            "no_proxy": "localhost,127.0.0.1",
        }
    }

    chrome_options = Options()
    chrome_options.add_argument("--headless")

    s = Service(
        "C:/Path_To_Your_WebDriver"
    )  # Replace with your path to ChromeDriver
    driver = webdriver.Chrome(
        service=s, seleniumwire_options=proxy_options, options=chrome_options
    )

    driver.get(url)

    time.sleep(8)  # Adjust based on website's load time

    soup = BeautifulSoup(driver.page_source, "lxml")

    driver.quit()

    all_listings = []

    # Find all the listing elements on the page
    listing_elements = soup.find_all("div", class_="g1qv1ctd")

    for listing_element in listing_elements:
        # Extract data from each listing element
        listing_data = {}

        # Listing URL
        url_element = listing_element.find("a", class_="rfexzly")
        listing_data["Listing URL"] = (
            "https://www.airbnb.com" + url_element["href"] if url_element else ""
        )

        # Title
        title_element = listing_element.find("div", class_="t1jojoys")
        listing_data["Title"] = (
            title_element.get_text(strip=True) if title_element else ""
        )

        # Description
        description_element = listing_element.find("span", class_="t6mzqp7")
        listing_data["Description"] = (
            description_element.get_text(strip=True) if description_element else ""
        )

        # Rating
        rating_element = listing_element.find("span", class_="ru0q88m")
        listing_data["Rating"] = (
            rating_element.get_text(strip=True) if rating_element else ""
        )

        # Price
        price_element = listing_element.select_one("._1y74zjx")
        listing_data["Price"] = (
            f"{price_element.get_text(strip=True)} per night" if price_element else ""
        )

        # Additional listing info
        listing_info_element = listing_element.find("span", {"aria-hidden": "true"})
        listing_data["Additional Listing information"] = (
            listing_info_element.get_text(strip=True) if listing_info_element else ""
        )

        # Append the listing data to the list
        all_listings.append(listing_data)

    return all_listings


url = "https://www.airbnb.com/s/London--United-Kingdom/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&flexible_trip_lengths%5B%5D=one_week&monthly_start_date=2024-01-01&monthly_length=3&price_filter_input_type=0&channel=EXPLORE&query=London%2C%20United%20Kingdom&place_id=ChIJdd4hrwug2EcRmSrV3Vo6llI&date_picker_type=calendar&source=structured_search_input_header&search_type=autocomplete_click"


airbnb_listings = listings(url)

csv_file_path = "proxy_web_listings_output.csv"

with open(csv_file_path, "w", encoding="utf-8", newline="") as csv_file:
    fieldnames = [
        "Listing URL",
        "Title",
        "Description",
        "Rating",
        "Price",
        "Additional Listing information",
    ]
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()
    for listing in airbnb_listings:
        writer.writerow(listing)

print(f"Data has been exported to {csv_file_path}")

This part of the code ensures that the scraped data is stored in a CSV file named "proxy_web_listings_output.csv".

Results

The results of our scraper are saved to the CSV file “proxy_web_listings_output.csv”.
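
As a quick sanity check, you can load the file back and count the rows, for example:

import csv

# Quick sanity check: load the export back and count the rows.
with open("proxy_web_listings_output.csv", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))
print(f"Scraped {len(rows)} listings")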


This guide has walked through how to scrape data from Airbnb listings using Python and Selenium, extracting key details such as prices, ratings, and availability. Using premium proxies and rotating them is essential to avoid being blocked by Airbnb's anti-bot measures.
