Whether for travel planning, competitor analysis, or research, scraping flight information from Google Flights can yield significant insights. This step-by-step tutorial shows how to scrape flight data using Python with the Playwright and lxml libraries.
Before diving into the scraping process, ensure you have the necessary Python libraries installed:
pip install playwright
pip install lxml
To use Playwright, you also need to install the browser binaries:
playwright install chromium
We'll focus on extracting flight data from the Google Flights search results page.
To scrape Google Flights effectively, you first need to familiarize yourself with the HTML structure of the page. Use Chrome DevTools to inspect elements and derive the XPath expressions needed for scraping; a snippet for sanity-checking these expressions follows the list below.
List of XPath expressions used:
From Location: //input[@aria-label="Where from?"]/@value
To Location: //input[@aria-label="Where to?"]/@value
Departure Date: //input[@placeholder="Departure"]/@value
Return Date: //input[@placeholder="Return"]/@value
Flight Elements: //li[@class="pIav2d"]
Note: This XPath returns multiple elements, one per flight; the expressions below are evaluated relative to each flight element. Bear in mind that these class names are auto-generated by Google and may change over time, so re-inspect the page if a lookup comes back empty.
Airway: .//div[@class="sSHqwe tPgKwe ogfYpf"]/span/text()
Details: .//span[@class="mv1WYe"]/@aria-label
Departure Time: .//span[@jscontroller="cNtv4b"]/span/text() (first match)
Arrival Time: .//span[@jscontroller="cNtv4b"]/span/text() (second match)
Travel Time: .//div[@class="gvkrdb AdWm1c tPgKwe ogfYpf"]/text()
Price: .//div[@class="U3gSDe"]/div/div[2]/span/text()
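Before wiring these expressions into the scraper, it helps to sanity-check them against a saved copy of the results page. A minimal sketch, assuming you have saved the page source locally (the file name flights.html is just an example):

from lxml import html

# Load a locally saved copy of the Google Flights results page
with open('flights.html', 'r', encoding='utf-8') as f:
    tree = html.fromstring(f.read())

# Each match corresponds to one flight card in the results list
flight_elements = tree.xpath('//li[@class="pIav2d"]')
print(f"Found {len(flight_elements)} flight elements")

If the count is zero, the class names have likely changed and need to be re-inspected in DevTools.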
We use Playwright to drive a real browser, which handles the dynamic content that Google Flights loads with JavaScript. The function below launches a headless browser, navigates to the URL, and returns the rendered page content.
from playwright.sync_api import sync_playwright

# URL for the Google Flights search page (replace with your own search URL)
url = "https link"

def get_page_content(url):
    """Fetches the HTML content of the given URL using Playwright."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)  # Launch browser in headless mode
        context = browser.new_context()  # Create a new browser context
        page = context.new_page()  # Open a new page
        page.goto(url)  # Navigate to the specified URL
        content = page.content()  # Get the page content
        browser.close()  # Close the browser
        return content

# Fetch the page content
page_content = get_page_content(url)
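One caveat: Google Flights renders its results with JavaScript, so the page may still be loading when page.content() is called. A more robust variation, sketched below under the assumption that the li.pIav2d class from the previous section still marks the flight cards, waits for the results to appear before grabbing the content:

def get_page_content(url):
    """Fetches the HTML content once the flight results have rendered."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        # Block until at least one flight card is attached to the DOM
        page.wait_for_selector('li.pIav2d', timeout=30000)
        content = page.content()
        browser.close()
        return content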
Next, we parse the returned HTML with lxml to extract common flight details such as the origin, destination, and travel dates.
from lxml import html
# Creating the parser
tree = html.fromstring(page_content)
# Extracting common flight details using XPath
from_location = tree.xpath('//input[@aria-label="Where from?"]/@value')[0] # Get the 'from' location
to_location = tree.xpath('//input[@aria-label="Where to?"]/@value')[0] # Get the 'to' location
departure_date = tree.xpath('//input[@placeholder="Departure"]/@value')[0] # Get the departure date
return_date = tree.xpath('//input[@placeholder="Return"]/@value')[0] # Get the return date
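Indexing [0] into an XPath result raises an IndexError if nothing matches, which happens whenever Google changes its markup. As an optional hardening step, the lookups can be wrapped in a small helper (first_or_default is our own hypothetical name, not part of lxml):

def first_or_default(results, default='N/A'):
    """Returns the first XPath match, or a default if the list is empty."""
    return results[0].strip() if results else default

from_location = first_or_default(tree.xpath('//input[@aria-label="Where from?"]/@value'))
to_location = first_or_default(tree.xpath('//input[@aria-label="Where to?"]/@value'))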
We then parse the HTML content to extract specific flight information based on the identified XPath expressions.
# Initialize an empty list to store flight details
flights = []

# Extract flight elements from the parsed HTML using XPath
flight_elements = tree.xpath('//li[@class="pIav2d"]')

# Loop through each flight element and extract details
for flight in flight_elements:
    # Extract the airline name
    airway = flight.xpath('.//div[@class="sSHqwe tPgKwe ogfYpf"]/span/text()')[0].strip()
    # Extract flight details such as layovers
    details = flight.xpath('.//span[@class="mv1WYe"]/@aria-label')[0]
    # Extract the departure time
    departure = flight.xpath('.//span[@jscontroller="cNtv4b"]/span/text()')[0].strip()
    # Extract the arrival time
    arrival = flight.xpath('.//span[@jscontroller="cNtv4b"]/span/text()')[1].strip()
    # Extract the total travel time
    travel_time = flight.xpath('.//div[@class="gvkrdb AdWm1c tPgKwe ogfYpf"]/text()')[0].strip()
    # Extract the price of the flight
    price = flight.xpath('.//div[@class="U3gSDe"]/div/div[2]/span/text()')[0].strip()
    # Append the extracted details to the flights list as a dictionary
    flights.append({
        'Airway': airway,
        'Details': details,
        'Departure': departure,
        'Arrival': arrival,
        'Travel Time': travel_time,
        'Price': price,
        'From': from_location,
        'To': to_location,
        'Departure Date': departure_date,
        'Return Date': return_date
    })
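Before writing anything to disk, a quick spot check confirms the extraction worked:

# Print the first extracted flight as a sanity check
if flights:
    for key, value in flights[0].items():
        print(f"{key}: {value}")
else:
    print('No flights extracted - check the XPath expressions')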
Finally, we use Python's built-in CSV module to save the extracted data into a CSV file for further analysis.
import csv
# Define CSV file path
csv_file = 'google_flights.csv'
# Define CSV fieldnames
fieldnames = ['Airway', 'Details', 'Departure', 'Arrival', 'Travel Time', 'Price', 'From', 'To', 'Departure Date', 'Return Date']
# Writing data to CSV file
with open(csv_file, mode='w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()
    for flight in flights:
        writer.writerow(flight)

print(f"Data saved to {csv_file}")
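To verify the output, the file can be read back with the same csv module:

# Read the CSV back and print a short summary of each row
with open(csv_file, newline='', encoding='utf-8') as file:
    for row in csv.DictReader(file):
        print(row['Airway'], '-', row['Price'])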
Here is the complete script, combining all the steps above:
from playwright.sync_api import sync_playwright
from lxml import html
import csv

# URL for the Google Flights search page (replace with your own search URL)
url = "https link"
def get_page_content(url):
    """Fetches the HTML content of the given URL using Playwright."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)  # Launch browser in headful mode
        context = browser.new_context()  # Create a new browser context
        page = context.new_page()  # Open a new page
        page.goto(url)  # Navigate to the specified URL
        page.wait_for_timeout(10000)  # Wait for 10 seconds to ensure the page loads completely
        content = page.content()  # Get the page content
        browser.close()  # Close the browser
        return content

# Fetch the page content
page_content = get_page_content(url)
# Parse the HTML content using lxml
tree = html.fromstring(page_content)
# Extracting flight search details
from_location = tree.xpath('//input[@aria-label="Where from?"]/@value')[0]
to_location = tree.xpath('//input[@aria-label="Where to?"]/@value')[0]
departure_date = tree.xpath('//input[@placeholder="Departure"]/@value')[0]
return_date = tree.xpath('//input[@placeholder="Return"]/@value')[0]
# Initialize a list to store flight details
flights = []
# Extract flight elements from the parsed HTML
flight_elements = tree.xpath('//li[@class="pIav2d"]')
for flight in flight_elements:
    airway = flight.xpath('.//div[@class="sSHqwe tPgKwe ogfYpf"]/span/text()')[0].strip()
    details = flight.xpath('.//span[@class="mv1WYe"]/@aria-label')[0]
    departure = flight.xpath('.//span[@jscontroller="cNtv4b"]/span/text()')[0].strip()
    arrival = flight.xpath('.//span[@jscontroller="cNtv4b"]/span/text()')[1].strip()
    travel_time = flight.xpath('.//div[@class="gvkrdb AdWm1c tPgKwe ogfYpf"]/text()')[0].strip()
    price = flight.xpath('.//div[@class="U3gSDe"]/div/div[2]/span/text()')[0].strip()
    # Append flight details to the list
    flights.append({
        'Airway': airway,
        'Details': details,
        'Departure': departure,
        'Arrival': arrival,
        'Travel Time': travel_time,
        'Price': price,
        'From': from_location,
        'To': to_location,
        'Departure Date': departure_date,
        'Return Date': return_date
    })
# Define the CSV file path
csv_file = 'google_flights.csv'
# Define CSV fieldnames
fieldnames = ['Airway', 'Details', 'Departure', 'Arrival', 'Travel Time', 'Price', 'From', 'To', 'Departure Date', 'Return Date']
# Writing the extracted flight details to a CSV file
with open(csv_file, mode='w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()  # Write the header row
    for flight in flights:
        writer.writerow(flight)  # Write each flight's details

print(f"Data saved to {csv_file}")
To reduce the risk of detection while scraping, incorporate delays between requests and route traffic through proxies. Delays mimic human interaction, making automated activity harder to detect. For proxies, rotating residential proxies are a good choice: they carry a high trust level and, because the IP changes, are less likely to be blocked. Alternatively, a pool of static ISP proxies offers a stable, fast connection that improves the reliability of data extraction. Together, these strategies help evade the protective measures sites use to identify and block scraping bots.
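As a minimal sketch of both ideas, Playwright accepts proxy settings at launch, and random pauses can be inserted between actions. The proxy endpoint and credentials below are placeholders for whatever provider you use:

import random
from playwright.sync_api import sync_playwright

url = "https link"  # Same placeholder search URL as above

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=True,
        proxy={
            'server': 'http://proxy.example.com:8080',  # Placeholder proxy endpoint
            'username': 'user',                         # Placeholder credentials
            'password': 'pass'
        }
    )
    page = browser.new_page()
    page.goto(url)
    # Random pause to mimic a human reading the page
    page.wait_for_timeout(random.uniform(3000, 8000))
    content = page.content()
    browser.close()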