TripAdvisor is a widely used travel portal where users write reviews for accommodation, dining, sightseeing, etc. Enriching the Data with TripAdvisor reviews could be helpful for Travel Analysis, Competitor Studies, etc. In this guide, we shall cover how TripAdvisor data can be extracted using python and the data stored in a csv format.
To build this scraper, we’ll use the following Python libraries:
Install the required libraries using pip:
pip install requests lxml
When scraping data from websites like TripAdvisor, it's crucial to properly configure the request headers, especially the User-Agent. By setting this header, you can disguise your requests as those coming from legitimate users, which minimizes the risk of your scraping activities triggering blocks due to unusual traffic patterns. Additionally, employing proxy servers is essential to circumvent restrictions related to the number of requests permissible from a single IP address, thus facilitating more extensive data collection efforts.
We’ll go through the process of scraping a list of hotel pages, extracting details, and saving them into a CSV file. Let’s break down each part.
To begin, import the necessary libraries:
import requests
from lxml.html import fromstring
import csv
Then, define a list of URLs for the hotel pages you plan to scrape data from:
urls_list = [
'Https link',
'Https link'
]
To ensure that your requests mimic those from a real browser, it's crucial to configure headers properly. This step helps to bypass anti-bot protection systems on websites and minimizes the risk of being blocked.
headers = {
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
'accept-language': 'en-IN,en;q=0.9',
'cache-control': 'no-cache',
'dnt': '1',
'pragma': 'no-cache',
'sec-ch-ua': '"Chromium";v="130", "Google Chrome";v="130", "Not?A_Brand";v="99"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Linux"',
'sec-fetch-dest': 'document',
'sec-fetch-mode': 'navigate',
'sec-fetch-site': 'none',
'sec-fetch-user': '?1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36',
}
Proxies can help bypass IP-based restrictions. In the example below, we use a proxy with IP address authentication. Here’s how to add a proxy to requests.
proxies = {
'http': 'http://your_proxy_address:port',
'https': 'http://your_proxy_address:port',
}
response = requests.get(url, headers=headers, proxies=proxies)
For each URL, send a request and parse the response HTML:
extracted_data = []
for url in urls_list:
response = requests.get(url, headers=headers) # Add proxies=proxies if needed
parser = fromstring(response.text)
Using XPath, we can target specific elements on the page:
title = parser.xpath('//h1[@data-automation="mainH1"]/text()')[0]
about = parser.xpath('//div[@class="_T FKffI bmUTE"]/div/div/text()')[0].strip()
images_url = parser.xpath('//div[@data-testid="media_window_test"]/div/div/button/picture/source/@srcset')
price = parser.xpath('//div[@data-automation="commerce_module_visible_price"]/text()')[0]
ratings = parser.xpath('//div[@class="jVDab W f u w GOdjs"]/@aria-label')[0].split(' ')[0]
features = parser.xpath('//div[@class="f Q2 _Y tyUdl"]/div[2]/span/span/span/text()')
reviews = parser.xpath('//span[@class="JguWG"]/span//text()')
listing_by = parser.xpath('//div[@class="biGQs _P pZUbB KxBGd"]/text()')[0]
similar_experiences = parser.xpath('//div[@data-automation="shelfCard"]/a/@href')
Store the extracted information in a dictionary and append it to a list:
data = {
'title': title,
'about': about,
'price': price,
'listing by': listing_by,
'ratings': ratings,
'image_urls': images_url,
'features': features,
'reviews': reviews,
'similar_experiences': similar_experiences
}
extracted_data.append(data)
After scraping, save the data into a CSV file:
csv_columns = ['title', 'about', 'price', 'listing by', 'ratings', 'image_urls', 'features', 'reviews', 'similar_experiences']
with open("tripadvisor_data.csv", 'w', newline='', encoding='utf-8') as csvfile:
writer = csv.DictWriter(csvfile, fieldnames=csv_columns)
writer.writeheader()
for data in extracted_data:
writer.writerow(data)
import requests
from lxml.html import fromstring
import csv
urls_list = [
'Https link',
'Https link'
]
headers = {
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
'accept-language': 'en-IN,en;q=0.9',
'cache-control': 'no-cache',
'dnt': '1',
'pragma': 'no-cache',
'sec-ch-ua': '"Chromium";v="130", "Google Chrome";v="130", "Not?A_Brand";v="99"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Linux"',
'sec-fetch-dest': 'document',
'sec-fetch-mode': 'navigate',
'sec-fetch-site': 'none',
'sec-fetch-user': '?1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36',
}
proxies = {
'http': 'http://your_proxy_address:port',
'https': 'http://your_proxy_address:port',
}
extracted_data = []
for url in urls_list:
response = requests.get(url, headers=headers, proxies=proxies)
parser = fromstring(response.text)
title = parser.xpath('//h1[@data-automation="mainH1"]/text()')[0]
about = parser.xpath('//div[@class="_T FKffI bmUTE"]/div/div/text()')[0].strip()
images_url = parser.xpath('//div[@data-testid="media_window_test"]/div/div/button/picture/source/@srcset')
price = parser.xpath('//div[@data-automation="commerce_module_visible_price"]/text()')[0]
ratings = parser.xpath('//div[@class="jVDab W f u w GOdjs"]/@aria-label')[0].split(' ')[0]
features = parser.xpath('//div[@class="f Q2 _Y tyUdl"]/div[2]/span/span/span/text()')
reviews = parser.xpath('//span[@class="JguWG"]/span//text()')
listing_by = parser.xpath('//div[@class="biGQs _P pZUbB KxBGd"]/text()')[0]
similar_experiences = parser.xpath('//div[@data-automation="shelfCard"]/a/@href')
data = {
'title': title,
'about': about,
'price': price,
'listing by': listing_by,
'ratings': ratings,
'image_urls': images_url,
'features': features,
'reviews': reviews,
'similar_experiences': similar_experiences
}
extracted_data.append(data)
csv_columns = ['title', 'about', 'price', 'listing by', 'ratings', 'image_urls', 'features', 'reviews', 'similar_experiences']
with open("tripadvisor_data.csv", 'w', newline='', encoding='utf-8') as csvfile:
writer = csv.DictWriter(csvfile, fieldnames=csv_columns)
writer.writeheader()
for data in extracted_data:
writer.writerow(data)
print('saved into tripadvisor_data.csv')
This guide not only lays a technical foundation for data scraping but also opens up avenues for comprehensive analysis in the tourism sector. The methods and techniques detailed here empower users to delve deeper into market trends and consumer behaviors. Such insights are crucial for developing robust reputation management strategies and conducting competitive analyses. By leveraging this guide, users can enhance their understanding of the dynamics within the tourism industry, making it possible to devise informed, strategic decisions that drive success.
Comments: 0