Gaining access to Airbnb data is crucial for analyzing the real estate market, researching rental price dynamics, conducting competitive analysis, and assessing reviews and ratings. This data can be collected through web scraping; however, doing so can be challenging, as scraping may violate the site's terms of use.
Below is a step-by-step guide on how to develop a web scraper that extracts data from Airbnb listings using Python and Selenium. This guide also covers how to avoid potential blocks and restrictions imposed by the platform.
The first step in creating a web scraper is understanding how to access the web pages you're interested in, since the structure of websites can often change. To familiarize yourself with the structure of a site, you can use the browser's developer tools to inspect the HTML of the web page.
To access Developer Tools, right-click on the webpage and select “Inspect”, or use the keyboard shortcut Ctrl+Shift+I (Cmd+Option+I on macOS).
Each listing container is wrapped in a div element with the following attribute: class="g1qv1ctd".
By clicking the "location" field and typing "London, UK", we can search for accommodation in London. The site also suggests adding check-in and check-out dates, which allows it to calculate a nightly price for each listing.
The URL for this page would look something like this:
url = "https://www.airbnb.com/s/London--United-Kingdom/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&flexible_trip_lengths%5B%5D=one_week&monthly_start_date=2024-01-01&monthly_length=3&price_filter_input_type=0&channel=EXPLORE&query=London%2C%20United%20Kingdom&place_id=ChIJdd4hrwug2EcRmSrV3Vo6llI&date_picker_type=calendar&source=structured_search_input_header&search_type=autocomplete_click"
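The long query string simply encodes the search parameters. If you want to generate such URLs programmatically, urllib.parse can assemble them; note that the parameter names below are copied from the example above and may change whenever Airbnb updates its site:
from urllib.parse import urlencode

# Assemble a search URL from its query parameters (names taken from the
# example URL above; Airbnb may change them at any time)
params = {
    "tab_id": "home_tab",
    "query": "London, United Kingdom",
    "place_id": "ChIJdd4hrwug2EcRmSrV3Vo6llI",
    "date_picker_type": "calendar",
    "source": "structured_search_input_header",
    "search_type": "autocomplete_click",
}
url = "https://www.airbnb.com/s/London--United-Kingdom/homes?" + urlencode(params)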
From the search page, we will scrape the following attributes for each listing:
- Listing URL
- Title
- Description
- Rating
- Price
- Additional listing information
To start web scraping for Airbnb data, you need to set up your development environment first. Here are the steps to do that:
Virtual environments allow you to isolate Python packages and their dependencies for different projects. This helps prevent conflicts and ensures that each project has the correct dependencies installed.
On Windows, open a command prompt and run the following command to create a new virtual environment named “venv”:
python -m venv venv
Activate the virtual environment:
venv\Scripts\activate
On macOS or Linux, open a terminal and run the following command to create a new virtual environment named “venv”:
python3 -m venv venv
Activate the virtual environment:
source venv/bin/activate
To deactivate the virtual environment, simply run the following command:
deactivate
Now that you have a virtual environment set up, you can install the necessary libraries.
Within your activated virtual environment, run the following command to install the required libraries:
pip install selenium beautifulsoup4 lxml selenium-wire
Selenium requires a driver to interface with the chosen browser. We will use Chrome for this guide. However, please ensure you have installed the appropriate WebDriver for the browser of your choice.
Once downloaded, ensure the driver is placed in a directory accessible by your system's PATH environment variable. This will allow Selenium to find the driver and control the browser.
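If you are on Selenium 4.6 or newer, the bundled Selenium Manager can usually download and locate a matching ChromeDriver automatically, in which case the explicit path is optional. A quick way to verify your setup (a minimal sketch):
from selenium import webdriver

# With Selenium 4.6+, Selenium Manager resolves a matching ChromeDriver
# automatically, so no explicit driver path is needed
driver = webdriver.Chrome()
driver.get("https://www.airbnb.com")
print(driver.title)
driver.quit()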
At the beginning of your Python file, import the Selenium Wire and BeautifulSoup libraries. This is how you do it:
from seleniumwire import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import time
import csv
import random
We will also import the `random`, `time`, and `csv` libraries for various utilities.
Next, we define a list of proxies to avoid being blocked by Airbnb. When trying to send a request without a premium proxy, you may encounter an "Access Denied" response.
You can set up a proxy as follows:
# List of proxies
proxies = [
"username:password@Your_proxy_IP_Address:Your_proxy_port1",
"username:password@Your_proxy_IP_Address:Your_proxy_port2",
"username:password@Your_proxy_IP_Address:Your_proxy_port3",
"username:password@Your_proxy_IP_Address:Your_proxy_port4",
"username:password@Your_proxy_IP_Address:Your_proxy_port5",
]
Be sure to replace "Your_proxy_IP_Address" and "Your_proxy_port" with the actual proxy address and port you obtained from Proxy-seller, and replace the values of “username” and “password” with your actual credentials.
Rotating proxies is a crucial aspect of web scraping. Websites often block or restrict access to bots and scrapers when they receive multiple requests from the same IP address. By rotating through different proxy IP addresses, you can avoid detection, appear as multiple organic users, and bypass most anti-scraping measures implemented on the website.
To set up proxy rotation, we use the `random` library imported earlier and define a function `get_proxy()` that picks a proxy from our list. It selects a proxy at random using `random.choice()` and returns it.
def get_proxy():
return random.choice(proxies)
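Note that random.choice() may pick the same proxy several times in a row. If you prefer strict round-robin rotation, itertools.cycle is a drop-in alternative (a minimal sketch):
import itertools

# Alternative: rotate through the proxies in a fixed order
proxy_pool = itertools.cycle(proxies)

def get_proxy():
    return next(proxy_pool)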
Next, we define the main function called `listings()`. This is where we’ll set up our ChromeDriver. The function uses Selenium to navigate to the property listings page, waits for the page to load, and parses the HTML using Beautiful Soup.
def listings(url):
proxy = get_proxy()
proxy_options = {
"proxy": {
"http": f"http://{proxy}",
"https": f"http://{proxy}",
"no_proxy": "localhost,127.0.0.1",
}
}
chrome_options = Options()
chrome_options.add_argument("--headless")
s = Service(
"C:/Path_To_Your_WebDriver"
) # Replace with your path to ChromeDriver
    driver = webdriver.Chrome(
        service=s, seleniumwire_options=proxy_options, options=chrome_options
    )
driver.get(url)
time.sleep(8) # Adjust based on website's load time
soup = BeautifulSoup(driver.page_source, "lxml")
driver.quit()
Here, we start by selecting a random proxy and setting up the proxy options. These options configure the webdriver to route traffic through the proxy server. Next, we set up the Chrome options and add the --headless argument to run the browser in headless mode, meaning the browser runs in the background without a graphical user interface.
We then initialize the webdriver with the service, seleniumwire options, and Chrome options, and navigate to the given URL. A sleep time of 8 seconds allows the page to load completely before we parse the returned HTML using Beautiful Soup. After parsing is done, we close the webdriver.
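A fixed sleep is simple but fragile: too short and the page isn't ready, too long and the scraper wastes time. As an optional alternative (a minimal sketch, assuming the driver from the snippet above), Selenium's explicit waits can block until the listing containers actually appear in the DOM:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 15 seconds for at least one listing container to appear,
# instead of sleeping for a fixed interval
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div.g1qv1ctd"))
)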
Once you have successfully obtained the HTML content, the next step is to extract relevant data for each listing. Using BeautifulSoup, we can easily navigate through the HTML structure and locate the elements containing the listing information.
First, we identify all the listing elements on the page. These elements contain the data we're interested in, such as the listing URL, title, description, rating, price, and additional information.
listing_elements = soup.find_all("div", class_="g1qv1ctd")
for listing_element in listing_elements:
This code uses BeautifulSoup's find_all() method to locate all div elements with the class “g1qv1ctd”. These elements represent individual listings on the Airbnb page. It then loops through each of these listing elements to extract the relevant data.
For each listing element found, we extract the URL of the listing.
url_element = listing_element.find("a", class_="rfexzly")
listing_data["Listing URL"] = (
    "https://www.airbnb.com" + url_element["href"] if url_element else ""
)
Here, we search within the current listing element for an anchor tag with the class “rfexzly”. If the element is found, we extract its 'href' attribute (which contains the relative URL) and append it to the base URL to create the complete listing URL. If the element is not found, an empty string is assigned to avoid errors. Note that we search the listing element rather than the whole “soup” object; otherwise, every listing would receive the URL of the first anchor on the page.
Next, we extract the title of the listing.
title_element = listing_element.find("div", class_="t1jojoys")
listing_data["Title"] = (
title_element.get_text(strip=True) if title_element else ""
)
The title is contained within a “div” element with the class “t1jojoys”. We retrieve the text content of this element, stripping any leading or trailing whitespace. An empty string is stored if the element is not found.
description_element = listing_element.find("span", class_="t6mzqp7")
listing_data["Description"] = (
    description_element.get_text(strip=True) if description_element else ""
)
Similar to the title extraction, this code finds a span element with the class "t6mzqp7". We then extract and clean the text content of this element, which contains a short description of the listing.
rating_element = listing_element.find("span", class_="ru0q88m")
listing_data["Rating"] = (
rating_element.get_text(strip=True) if rating_element else ""
)
As seen in the code above, a span element with the class “ru0q88m” holds the rating value. We extract this value, ensuring to strip any unnecessary whitespace.
Next, we extract the price of the listing.
price_element = listing_element.select_one("._1y74zjx")
listing_data["Price"] = (
f"{price_element.get_text(strip=True)} per night" if price_element else ""
)
This code locates the element with the class "_1y74zjx" within the current listing_element. If this element, which typically contains the price information, is found, its text content is extracted, cleaned, and appended with "per night" to form a more informative price string.
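If you plan to analyze prices numerically later, a small hypothetical helper (not part of the scraper itself) can strip the currency symbol and the "per night" text, assuming prices look like "£150 per night":
import re

def parse_price(price_text):
    # Pull the first numeric value out of a string like "£1,250 per night";
    # returns None when no number is present
    match = re.search(r"[\d,]+(?:\.\d+)?", price_text)
    return float(match.group().replace(",", "")) if match else None

print(parse_price("£1,250 per night"))  # 1250.0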
Some listings may have additional information that we can extract.
listing_info_element = listing_element.find("span", {"aria-hidden": "true"})
listing_data["Additional Listing information"] = (
listing_info_element.get_text(strip=True) if listing_info_element else ""
)
We search for a span element with the attribute aria-hidden="true" to find any additional information about the listing. After extracting all relevant data from each listing element, we append the collected data to a list of listings.
listings_data.append(listing_data)
Once all listings have been processed, we return the list of listings, each represented as a dictionary containing the extracted data.
return listings_data
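Because get_proxy() picks a proxy at random on every call, simply calling listings() again is an easy way to retry with a different IP if a run fails. A hypothetical wrapper (not part of the original script) might look like this:
def listings_with_retry(url, attempts=3):
    # Hypothetical helper: each call to listings() picks a fresh random proxy,
    # so retrying naturally rotates to a different IP
    for attempt in range(attempts):
        try:
            results = listings(url)
            if results:  # an empty list can mean a blocked or empty page
                return results
        except Exception as exc:
            print(f"Attempt {attempt + 1} failed: {exc}")
    return []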
After successfully scraping data from Airbnb's listing pages, the next important step is storing this valuable information for future analysis and reference. We use the csv library for this task: we open a CSV file in write mode, create a csv.DictWriter object, and then write the header and the data to the file.
airbnb_listings = listings(url)
csv_file_path = "proxy_web_listings_output.csv"
with open(csv_file_path, "w", encoding="utf-8", newline="") as csv_file:
fieldnames = [
"Listing URL",
"Title",
"Description",
"Rating",
"Price",
"Additional Listing information",
]
writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
writer.writeheader()
for listing in airbnb_listings:
writer.writerow(listing)
print(f"Data has been exported to {csv_file_path}")
Here is the complete code we used for this tutorial:
from seleniumwire import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import time
import csv
import random
# List of proxies
proxies = [
"username:password@Your_proxy_IP_Address:Your_proxy_port1",
"username:password@Your_proxy_IP_Address:Your_proxy_port2",
"username:password@Your_proxy_IP_Address:Your_proxy_port3",
"username:password@Your_proxy_IP_Address:Your_proxy_port4",
"username:password@Your_proxy_IP_Address:Your_proxy_port5",
]
def get_proxy():
return random.choice(proxies)
def listings(url):
proxy = get_proxy()
proxy_options = {
"proxy": {
"http": f"http://{proxy}",
"https": f"http://{proxy}",
"no_proxy": "localhost,127.0.0.1",
}
}
chrome_options = Options()
chrome_options.add_argument("--headless")
s = Service(
"C:/Path_To_Your_WebDriver"
) # Replace with your path to ChromeDriver
    driver = webdriver.Chrome(
        service=s, seleniumwire_options=proxy_options, options=chrome_options
    )
driver.get(url)
time.sleep(8) # Adjust based on website's load time
soup = BeautifulSoup(driver.page_source, "lxml")
driver.quit()
    listings_data = []
# Find all the listing elements on the page
listing_elements = soup.find_all("div", class_="g1qv1ctd")
for listing_element in listing_elements:
# Extract data from each listing element
listing_data = {}
# Listing URL
        url_element = listing_element.find("a", class_="rfexzly")
        listing_data["Listing URL"] = (
            "https://www.airbnb.com" + url_element["href"] if url_element else ""
        )
# Title
title_element = listing_element.find("div", class_="t1jojoys")
listing_data["Title"] = (
title_element.get_text(strip=True) if title_element else ""
)
# Description
        description_element = listing_element.find("span", class_="t6mzqp7")
        listing_data["Description"] = (
            description_element.get_text(strip=True) if description_element else ""
        )
# Rating
rating_element = listing_element.find("span", class_="ru0q88m")
listing_data["Rating"] = (
rating_element.get_text(strip=True) if rating_element else ""
)
# Price
price_element = listing_element.select_one("._1y74zjx")
listing_data["Price"] = (
f"{price_element.get_text(strip=True)} per night" if price_element else ""
)
# Additional listing info
listing_info_element = listing_element.find("span", {"aria-hidden": "true"})
listing_data["Additional Listing information"] = (
listing_info_element.get_text(strip=True) if listing_info_element else ""
)
# Append the listing data to the list
        listings_data.append(listing_data)
    return listings_data
url = "https://www.airbnb.com/s/London--United-Kingdom/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&flexible_trip_lengths%5B%5D=one_week&monthly_start_date=2024-01-01&monthly_length=3&price_filter_input_type=0&channel=EXPLORE&query=London%2C%20United%20Kingdom&place_id=ChIJdd4hrwug2EcRmSrV3Vo6llI&date_picker_type=calendar&source=structured_search_input_header&search_type=autocomplete_click"
airbnb_listings = listings(url)
csv_file_path = "proxy_web_listings_output.csv"
with open(csv_file_path, "w", encoding="utf-8", newline="") as csv_file:
fieldnames = [
"Listing URL",
"Title",
"Description",
"Rating",
"Price",
"Additional Listing information",
]
writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
writer.writeheader()
for listing in airbnb_listings:
writer.writerow(listing)
print(f"Data has been exported to {csv_file_path}")
This final part of the script saves the scraped data to a CSV file named "proxy_web_listings_output.csv".
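As a quick sanity check, you can read the file back with the same csv module and print a few rows:
import csv

# Print the first five scraped listings to verify the export
with open("proxy_web_listings_output.csv", encoding="utf-8") as csv_file:
    for i, row in enumerate(csv.DictReader(csv_file)):
        print(row["Title"], "-", row["Price"])
        if i == 4:
            break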
This guide has explained how to scrape data from Airbnb listings using Python, extracting key details such as prices, titles, descriptions, and ratings. It also emphasized the importance of using and rotating proxies to avoid being blocked by Airbnb's anti-bot measures.