Extracting real estate data from Zillow can provide valuable insight for market analysis and investment decisions. This post walks through scraping Zillow property listings with Python, focusing on the essential steps and practical guidelines, and shows how to extract information from the Zillow website using the requests and lxml libraries.
Before we start, ensure you have Python installed on your system. You’ll also need to install the following libraries:
pip install requests
pip install lxml
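To confirm the installation, you can import both libraries and print their versions (a quick sanity check; the version numbers will vary by environment):
import requests
from lxml import etree

# Confirm both libraries are importable and show their installed versions
print(requests.__version__)
print(etree.__version__)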
To extract data from Zillow, you need to understand the structure of the webpage. Open a property listing page on Zillow and inspect the elements you want to scrape (e.g., property title, rent estimate price, and assessment price).
In this example, the property title is rendered in an <h1> element, and the price details (rent estimate and assessment price) in <span> elements; both carry auto-generated class names that we will target with XPath below.
Now let's send HTTP requests. First, we need to fetch the HTML content of the Zillow page. We'll use the requests library to send an HTTP GET request to the target URL, setting request headers that mimic a real browser and, optionally, routing traffic through proxies to avoid IP blocking.
import requests
# Define the target URL for the Zillow property listing
url = "https://www.zillow.com/homedetails/1234-Main-St-Some-City-CA-90210/12345678_zpid/"
# Set up the request headers to mimic a browser request
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'sec-ch-ua': '"Not/A)Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
    'sec-ch-ua-mobile': '?0',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
}
# Optionally, set up proxies to avoid IP blocking
proxies = {
    'http': 'http://username:password@your_proxy_address',
    'https': 'https://username:password@your_proxy_address',
}
# Send the HTTP GET request with headers and proxies
response = requests.get(url, headers=headers, proxies=proxies)
response.raise_for_status() # Ensure we got a valid response
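If you have access to several proxies, rotating through them spreads your requests across multiple IP addresses. Here is a minimal sketch, assuming a pool of placeholder proxy URLs (replace them with your own credentials):
import random

# Hypothetical pool of proxy endpoints; substitute your real proxy credentials
proxy_pool = [
    'http://username:password@proxy1_address',
    'http://username:password@proxy2_address',
]

# Pick a random proxy for this request and use it for both HTTP and HTTPS traffic
proxy_url = random.choice(proxy_pool)
proxies = {'http': proxy_url, 'https': proxy_url}
response = requests.get(url, headers=headers, proxies=proxies)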
Next, we need to parse the HTML content using lxml. We’ll use the fromstring function from the lxml.html module to parse the HTML content of the webpage into an Element object.
from lxml.html import fromstring
# Parse the HTML content using lxml
parser = fromstring(response.text)
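Zillow may return a CAPTCHA or block page instead of the listing, in which case the XPath queries below come back empty. A simple sanity check might inspect the page <title> (the 'denied' marker here is illustrative; check the actual block page you receive to pick a reliable one):
# Check the page <title> to detect an obvious block or CAPTCHA page
page_title = parser.xpath('//title/text()')
if not page_title or 'denied' in page_title[0].lower():
    raise RuntimeError(f"Possible block page received for {url}")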
Now, we will extract specific data points such as the property title, rent estimate price, and assessment price using XPath queries on the parsed HTML content. Note that Zillow's class names (such as "Text-c11n-8-99-3__sc-aiai24-0") are auto-generated and change regularly, so verify them in your browser's developer tools before running the code.
# Extracting the property title using XPath
title = ' '.join(parser.xpath('//h1[@class="Text-c11n-8-99-3__sc-aiai24-0 dFxMdJ"]/text()'))
# Extracting the property rent estimate price using XPath
rent_estimate_price = parser.xpath('//span[@class="Text-c11n-8-99-3__sc-aiai24-0 dFhjAe"]//text()')[-2]
# Extracting the property assessment price using XPath
assessment_price = parser.xpath('//span[@class="Text-c11n-8-99-3__sc-aiai24-0 dFhjAe"]//text()')[-1]
# Store the extracted data in a dictionary
property_data = {
    'title': title,
    'Rent estimate price': rent_estimate_price,
    'Assessment price': assessment_price
}
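The negative indices above assume the price spans always appear in the same order; if the XPath query returns an empty list, indexing raises an IndexError. A defensive helper (a sketch, with a hypothetical safe_extract name) avoids crashing on missing elements:
def safe_extract(elements, index, default='N/A'):
    # Return elements[index] if it exists, otherwise a default placeholder
    try:
        return elements[index].strip()
    except IndexError:
        return default

prices = parser.xpath('//span[@class="Text-c11n-8-99-3__sc-aiai24-0 dFhjAe"]//text()')
rent_estimate_price = safe_extract(prices, -2)
assessment_price = safe_extract(prices, -1)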
Finally, we will save the extracted data to a JSON file for further processing.
import json
# Define the output JSON file name
output_file = 'zillow_properties.json'
# Open the file in write mode and dump the data
with open(output_file, 'w') as f:
    json.dump(property_data, f, indent=4)
print(f"Scraped data saved to {output_file}")
To scrape multiple property listings, we will iterate over a list of URLs and repeat the data extraction process for each one.
# List of URLs to scrape
urls = [
    "https://www.zillow.com/homedetails/1234-Main-St-Some-City-CA-90210/12345678_zpid/",
    "https://www.zillow.com/homedetails/5678-Another-St-Some-City-CA-90210/87654321_zpid/"
]
# List to store data for all properties
all_properties = []
for url in urls:
    # Send the HTTP GET request with headers and proxies
    response = requests.get(url, headers=headers, proxies=proxies)
    response.raise_for_status()  # Ensure we got a valid response
    # Parse the HTML content using lxml
    parser = fromstring(response.text)
    # Extract data using XPath
    title = ' '.join(parser.xpath('//h1[@class="Text-c11n-8-99-3__sc-aiai24-0 dFxMdJ"]/text()'))
    rent_estimate_price = parser.xpath('//span[@class="Text-c11n-8-99-3__sc-aiai24-0 dFhjAe"]//text()')[-2]
    assessment_price = parser.xpath('//span[@class="Text-c11n-8-99-3__sc-aiai24-0 dFhjAe"]//text()')[-1]
    # Store the extracted data in a dictionary
    property_data = {
        'title': title,
        'Rent estimate price': rent_estimate_price,
        'Assessment price': assessment_price
    }
    # Append the property data to the list
    all_properties.append(property_data)
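When looping over many URLs, it is good practice to pause between requests so the traffic looks less like automation; a randomized delay is a common choice (the 2-5 second range below is an arbitrary example):
import random
import time

for url in urls:
    response = requests.get(url, headers=headers, proxies=proxies)
    response.raise_for_status()
    # ... extract and store the data as shown above ...
    # Sleep for a random interval between requests to reduce the chance of blocking
    time.sleep(random.uniform(2, 5))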
Here is the complete code to scrape Zillow property data and save it to a JSON file:
import requests
from lxml.html import fromstring
import json
# Define the target URLs for Zillow property listings
urls = [
    "https://www.zillow.com/homedetails/1234-Main-St-Some-City-CA-90210/12345678_zpid/",
    "https://www.zillow.com/homedetails/5678-Another-St-Some-City-CA-90210/87654321_zpid/"
]
# Set up the request headers to mimic a browser request
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'sec-ch-ua': '"Not/A)Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
    'sec-ch-ua-mobile': '?0',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
}
# Optionally, set up proxies to avoid IP blocking
proxies = {
    'http': 'http://username:password@your_proxy_address',
    'https': 'https://username:password@your_proxy_address',
}
# List to store data for all properties
all_properties = []
for url in urls:
    try:
        # Send the HTTP GET request with headers and proxies
        response = requests.get(url, headers=headers, proxies=proxies)
        response.raise_for_status()  # Ensure we got a valid response
        # Parse the HTML content using lxml
        parser = fromstring(response.text)
        # Extract data using XPath
        title = ' '.join(parser.xpath('//h1[@class="Text-c11n-8-99-3__sc-aiai24-0 dFxMdJ"]/text()'))
        rent_estimate_price = parser.xpath('//span[@class="Text-c11n-8-99-3__sc-aiai24-0 dFhjAe"]//text()')[-2]
        assessment_price = parser.xpath('//span[@class="Text-c11n-8-99-3__sc-aiai24-0 dFhjAe"]//text()')[-1]
        # Store the extracted data in a dictionary
        property_data = {
            'title': title,
            'Rent estimate price': rent_estimate_price,
            'Assessment price': assessment_price
        }
        # Append the property data to the list
        all_properties.append(property_data)
    except requests.exceptions.HTTPError as e:
        print(f"HTTP error occurred: {e}")
    except requests.exceptions.RequestException as e:
        print(f"An error occurred: {e}")
# Define the output JSON file name
output_file = 'zillow_properties.json'
# Open the file in write mode and dump the data
with open(output_file, 'w') as f:
    json.dump(all_properties, f, indent=4)
print(f"Scraped data saved to {output_file}")
By understanding the structure of HTML pages and leveraging powerful libraries such as requests and lxml, you can efficiently extract property details. Employing proxies and rotating User-Agents lets you make a large volume of requests to sites like Zillow while reducing the risk of being blocked; static ISP proxies or rotating residential proxies are generally good choices for this kind of work.
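Rotating User-Agents can be as simple as keeping a small pool of browser strings and picking one per request. A minimal sketch (the strings below are illustrative examples):
import random

# Illustrative pool of desktop browser User-Agent strings
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1 Safari/605.1.15',
]

# Use a different User-Agent for each request
headers['user-agent'] = random.choice(user_agents)
response = requests.get(url, headers=headers, proxies=proxies)
Combined with the proxy setup shown earlier, this keeps each request looking slightly different.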