Guide to Google Flights Data Scraping with Python


Whether you are planning travel, analyzing competitors, or doing market research, scraping flight information from Google Flights can yield significant insights. Here is a step-by-step tutorial on how to scrape flight data using Python with the Playwright and lxml libraries.

Setting up your environment

Before diving into the scraping process, ensure you have the necessary Python libraries installed:

pip install playwright
pip install lxml

To use Playwright, you also need to install the browser binaries:

playwright install chromium

Step-by-step scraping process

We'll focus on extracting flight data from the Google Flights search results page.

Step 1. Understanding the website's structure

To scrape data from Google Flights effectively, you must familiarize yourself with the HTML structure of the website. Here's how you can use Chrome DevTools to inspect elements and retrieve the necessary XPath expressions for scraping:

  1. Open Chrome DevTools by right-clicking on the Google Flights page and selecting "Inspect", or use the shortcut Ctrl+Shift+I (Windows/Linux) or Cmd+Option+I (Mac).
  2. Inspect elements by hovering over different parts of the page. This will highlight the HTML structure in the DevTools. Click on the specific elements to view their attributes, which are crucial for creating accurate XPath expressions.
  3. Retrieve XPath expressions by right-clicking on the desired element in the Elements panel, selecting "Copy", and then choosing "Copy XPath". This copies the XPath expression directly to your clipboard, ready for use in your scraping script.

List of XPath expressions used:

From Location: //input[@aria-label="Where from?"]/@value
To Location: //input[@aria-label="Where to?"]/@value
Departure Date: //input[@placeholder="Departure"]/@value
Return Date: //input[@placeholder="Return"]/@value

Flight Elements: //li[@class="pIav2d"]

Note: This XPath returns multiple elements, each corresponding to an individual flight. The expressions below are evaluated relative to each flight element:

Airway: .//div[@class="sSHqwe tPgKwe ogfYpf"]/span/text()
Details: .//span[@class="mv1WYe"]/@aria-label
Departure Time: .//span[@jscontroller="cNtv4b"]/span/text() (first match)
Arrival Time: .//span[@jscontroller="cNtv4b"]/span/text() (second match)
Travel Time: .//div[@class="gvkrdb AdWm1c tPgKwe ogfYpf"]/text()
Price: .//div[@class="YMlIz FpEdX"]/span/text()

Note that attributes such as aria-describedby (e.g. "gEvJbfc1583") are generated dynamically and change between page loads, which is why the script targets the stable jscontroller attribute for the departure and arrival times instead.
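Before wiring these expressions into the scraper, it can help to sanity-check them offline with lxml. The sketch below runs two of the relative XPaths against a small hand-written snippet of markup that mimics the structure above (the snippet is made up for illustration; real Google Flights markup is far larger and its class names may change at any time):

```python
from lxml import html

# Hand-written sample mimicking one flight <li> from the results list
SAMPLE = """
<ul>
  <li class="pIav2d">
    <div class="sSHqwe tPgKwe ogfYpf"><span>Example Air</span></div>
    <div class="gvkrdb AdWm1c tPgKwe ogfYpf">5 hr 30 min</div>
  </li>
</ul>
"""

tree = html.fromstring(SAMPLE)
flights = tree.xpath('//li[@class="pIav2d"]')
print(len(flights))  # number of flight elements matched

for flight in flights:
    # Relative XPaths (note the leading ".") are evaluated per flight element
    airway = flight.xpath('.//div[@class="sSHqwe tPgKwe ogfYpf"]/span/text()')[0]
    travel_time = flight.xpath('.//div[@class="gvkrdb AdWm1c tPgKwe ogfYpf"]/text()')[0]
    print(airway, "-", travel_time)
```

If an expression returns an empty list here, it will also return nothing on the live page, so this is a cheap way to catch typos in class names before running the full scraper.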

Step 2. Sending HTTP requests and extracting page content with Playwright

We use Playwright to interact with the web page and extract its content. Playwright launches a headless browser, navigates to the URL, lets JavaScript render the dynamic results, and then returns the final HTML, which a plain HTTP request would miss.

from playwright.sync_api import sync_playwright

# URL for the Google Flights search page (replace with your own search URL)
url = "https link"

def get_page_content(url):
    """Fetches the HTML content of the given URL using Playwright."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)  # Launch browser in headless mode
        context = browser.new_context()  # Create a new browser context
        page = context.new_page()  # Open a new page
        page.goto(url)  # Navigate to the specified URL
        page.wait_for_timeout(10000)  # Wait 10 seconds for the dynamic results to load
        content = page.content()  # Get the page content
        browser.close()  # Close the browser
    return content

# Fetch the page content
page_content = get_page_content(url)

Step 3. Extracting common details using XPath

Next, we parse the fetched HTML content with lxml to extract common flight details such as the locations and the departure and return dates.

from lxml import html

# Creating the parser
tree = html.fromstring(page_content)

# Extracting common flight details using XPath
from_location = tree.xpath('//input[@aria-label="Where from?"]/@value')[0]  # Get the 'from' location
to_location = tree.xpath('//input[@aria-label="Where to?"]/@value')[0]  # Get the 'to' location
departure_date = tree.xpath('//input[@placeholder="Departure"]/@value')[0]  # Get the departure date
return_date = tree.xpath('//input[@placeholder="Return"]/@value')[0]  # Get the return date

Step 4. Extracting specific flight data using lxml

We then parse the HTML content to extract specific flight information based on the identified XPath expressions.

# Initialize an empty list to store flight details
flights = []

# Extract flight elements from the parsed HTML using XPath
flight_elements = tree.xpath('//li[@class="pIav2d"]')

# Loop through each flight element and extract details
for flight in flight_elements:
    # Extract the airline name
    airway = flight.xpath('.//div[@class="sSHqwe tPgKwe ogfYpf"]/span/text()')[0].strip()
    
    # Extract flight details such as layovers
    details = flight.xpath('.//span[@class="mv1WYe"]/@aria-label')[0]
    
    # Extract the departure time
    departure = flight.xpath('.//span[@jscontroller="cNtv4b"]/span/text()')[0].strip()
    
    # Extract the arrival time
    arrival = flight.xpath('.//span[@jscontroller="cNtv4b"]/span/text()')[1].strip()
    
    # Extract the total travel time
    travel_time = flight.xpath('.//div[@class="gvkrdb AdWm1c tPgKwe ogfYpf"]/text()')[0].strip()
    
    # Extract the price of the flight
    price = flight.xpath('.//div[@class="U3gSDe"]/div/div[2]/span/text()')[0].strip()

    # Append the extracted details to the flights list as a dictionary
    flights.append({
        'Airway': airway,
        'Details': details,
        'Departure': departure,
        'Arrival': arrival,
        'Travel Time': travel_time,
        'Price': price,
        'From': from_location,
        'To': to_location,
        'Departure Date': departure_date,
        'Return Date': return_date
    })
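Google frequently changes its class names, so any of the indexed lookups above can raise an IndexError the moment a selector stops matching. One way to degrade gracefully is a small helper that returns a default instead of crashing. This is a sketch of our own (the `first` helper is an addition for illustration, not part of the tutorial's script):

```python
from lxml import html

def first(element, xpath, default="N/A", index=0):
    """Return the stripped match at `index` for `xpath`, or `default` if absent."""
    matches = element.xpath(xpath)
    if len(matches) > index:
        return matches[index].strip()
    return default

# Example: the airline div is missing in this made-up markup,
# so `first` falls back to the default instead of raising IndexError
flight = html.fromstring(
    '<li class="pIav2d"><span class="mv1WYe" aria-label="1 stop"></span></li>'
)
print(first(flight, './/div[@class="sSHqwe tPgKwe ogfYpf"]/span/text()'))  # N/A
print(first(flight, './/span[@class="mv1WYe"]/@aria-label'))  # 1 stop
```

Using such a helper for every field keeps one broken selector from discarding the whole row.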

Step 5. Saving data to CSV

Finally, we use Python's built-in CSV module to save the extracted data into a CSV file for further analysis.

import csv

# Define CSV file path
csv_file = 'google_flights.csv'

# Define CSV fieldnames
fieldnames = ['Airway', 'Details', 'Departure', 'Arrival', 'Travel Time', 'Price', 'From', 'To', 'Departure Date', 'Return Date']

# Writing data to CSV file
with open(csv_file, mode='w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()
    for flight in flights:
        writer.writerow(flight)

print(f"Data saved to {csv_file}")
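To confirm the file was written correctly, you can read it back with csv.DictReader, which yields each row as a dictionary keyed by the header. A minimal round-trip check, using made-up rows rather than real scraped data:

```python
import csv

# Made-up sample rows standing in for scraped flight dictionaries
rows = [
    {"Airway": "Example Air", "Price": "$120"},
    {"Airway": "Demo Jet", "Price": "$95"},
]

# Write the rows with DictWriter, exactly as in the scraper
with open("flights_check.csv", mode="w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["Airway", "Price"])
    writer.writeheader()
    writer.writerows(rows)

# Read them back; DictReader maps each row to the header fields
with open("flights_check.csv", newline="", encoding="utf-8") as f:
    loaded = list(csv.DictReader(f))

print(loaded[0]["Airway"])  # Example Air
```

Because csv.DictReader returns every value as a string, the round-trip is lossless for this all-string data.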

Putting everything together

from playwright.sync_api import sync_playwright
from lxml import html
import csv

# URL for the Google Flights search page
url = "https link"

def get_page_content(url):
    """Fetches the HTML content of the given URL using Playwright."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)  # Launch browser in headful mode
        context = browser.new_context()  # Create a new browser context
        page = context.new_page()  # Open a new page
        page.goto(url)  # Navigate to the specified URL
        page.wait_for_timeout(10000)  # Wait for 10 seconds to ensure the page loads completely
        content = page.content()  # Get the page content
        browser.close()  # Close the browser
    return content

# Fetch the page content
page_content = get_page_content(url)

# Parse the HTML content using lxml
tree = html.fromstring(page_content)

# Extracting flight search details
from_location = tree.xpath('//input[@aria-label="Where from?"]/@value')[0]
to_location = tree.xpath('//input[@aria-label="Where to?"]/@value')[0]
departure_date = tree.xpath('//input[@placeholder="Departure"]/@value')[0]
return_date = tree.xpath('//input[@placeholder="Return"]/@value')[0]

# Initialize a list to store flight details
flights = []

# Extract flight elements from the parsed HTML
flight_elements = tree.xpath('//li[@class="pIav2d"]')
for flight in flight_elements:
    airway = flight.xpath('.//div[@class="sSHqwe tPgKwe ogfYpf"]/span/text()')[0].strip()
    details = flight.xpath('.//span[@class="mv1WYe"]/@aria-label')[0]
    departure = flight.xpath('.//span[@jscontroller="cNtv4b"]/span/text()')[0].strip()
    arrival = flight.xpath('.//span[@jscontroller="cNtv4b"]/span/text()')[1].strip()
    travel_time = flight.xpath('.//div[@class="gvkrdb AdWm1c tPgKwe ogfYpf"]/text()')[0].strip()
    price = flight.xpath('.//div[@class="U3gSDe"]/div/div[2]/span/text()')[0].strip()

    # Append flight details to the list
    flights.append({
        'Airway': airway,
        'Details': details,
        'Departure': departure,
        'Arrival': arrival,
        'Travel Time': travel_time,
        'Price': price,
        'From': from_location,
        'To': to_location,
        'Departure Date': departure_date,
        'Return Date': return_date
    })

# Define the CSV file path
csv_file = 'google_flights.csv'

# Define CSV fieldnames
fieldnames = ['Airway', 'Details', 'Departure', 'Arrival', 'Travel Time', 'Price', 'From', 'To', 'Departure Date', 'Return Date']

# Writing the extracted flight details to a CSV file
with open(csv_file, mode='w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()  # Write the header row
    for flight in flights:
        writer.writerow(flight)  # Write each flight's details

print(f"Data saved to {csv_file}")

To reduce the risk of detection while scraping data, it's advisable to incorporate delays between requests and utilize proxies. Implementing delays helps mimic human interaction, making it harder for websites to detect automated scraping activities. For proxy selection, residential dynamic proxies are recommended because they offer a high trust level and are less likely to be blocked due to their dynamic nature. Alternatively, you can use a pool of static ISP proxies, which provide a stable and fast connection, enhancing the reliability of your data extraction process. These strategies help evade protective measures that websites use to identify and block scraping bots.
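The delay and proxy advice above can be sketched as follows. Playwright accepts a proxy option at launch time; the server address and credentials below are placeholders you would replace with your provider's details, and the jittered sleep mimics a human pause between requests:

```python
import random
import time

def human_delay(min_s=2.0, max_s=6.0):
    """Sleep for a random interval to mimic human pacing between requests."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay

# Proxy settings in the shape Playwright's launch(proxy=...) expects;
# the address and credentials here are placeholders, not real endpoints.
proxy_config = {
    "server": "http://proxy.example.com:8080",
    "username": "user",
    "password": "pass",
}

# Usage with Playwright (not executed here):
# browser = p.chromium.launch(headless=True, proxy=proxy_config)
```

A randomized delay between successive searches is harder to fingerprint than a fixed sleep, and rotating the proxy per session spreads requests across IP addresses.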
