Extracting real estate data from Zillow can provide valuable insight for market analysis and investment decisions. This post walks through scraping Zillow property listings with Python, focusing on the essential steps and practical guidelines, and shows how to extract information from the Zillow website using the requests and lxml libraries.
Before we start, ensure you have Python installed on your system. You’ll also need to install the following libraries:
pip install requests
pip install lxml
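To confirm the installation, you can import both libraries and print their versions (a quick sanity check; the version numbers will vary by environment):
import requests
from lxml import etree

# Confirm both libraries are importable and show their installed versions
print(requests.__version__)
print(etree.__version__)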
To extract data from Zillow, you need to understand the structure of the webpage. Open a property listing page on Zillow and inspect the elements you want to scrape (e.g., property title, rent estimate price, and assessment price).
In this example, the property title is rendered in an <h1> element, and the price details (rent estimate and assessment price) in <span> elements; both carry auto-generated class names that we will target with XPath below.
Now let's send HTTP requests. First, we need to fetch the HTML content of the Zillow page. We'll use the requests library to send an HTTP GET request to the target URL, setting request headers that mimic a real browser and, optionally, routing traffic through proxies to avoid IP blocking.
import requests
# Define the target URL for the Zillow property listing
url = "https://www.zillow.com/homedetails/1234-Main-St-Some-City-CA-90210/12345678_zpid/"
# Set up the request headers to mimic a browser request
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'sec-ch-ua': '"Not/A)Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
    'sec-ch-ua-mobile': '?0',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
}
# Optionally, set up proxies to avoid IP blocking
proxies = {
    'http': 'http://username:password@your_proxy_address',
    'https': 'https://username:password@your_proxy_address',
}
# Send the HTTP GET request with headers and proxies
response = requests.get(url, headers=headers, proxies=proxies)
response.raise_for_status() # Ensure we got a valid response
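If you have access to several proxies, rotating through them spreads your requests across multiple IP addresses. Here is a minimal sketch, assuming a pool of placeholder proxy URLs (replace them with your own credentials):
import random

# Hypothetical pool of proxy endpoints; substitute your real proxy credentials
proxy_pool = [
    'http://username:password@proxy1_address',
    'http://username:password@proxy2_address',
]

# Pick a random proxy for this request and use it for both HTTP and HTTPS traffic
proxy_url = random.choice(proxy_pool)
proxies = {'http': proxy_url, 'https': proxy_url}
response = requests.get(url, headers=headers, proxies=proxies)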
Next, we need to parse the HTML content using lxml. We’ll use the fromstring function from the lxml.html module to parse the HTML content of the webpage into an Element object.
from lxml.html import fromstring
# Parse the HTML content using lxml
parser = fromstring(response.text)
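Zillow may return a CAPTCHA or block page instead of the listing, in which case the XPath queries below come back empty. A simple sanity check might inspect the page <title> (the 'denied' marker here is illustrative; check the actual block page you receive to pick a reliable one):
# Check the page <title> to detect an obvious block or CAPTCHA page
page_title = parser.xpath('//title/text()')
if not page_title or 'denied' in page_title[0].lower():
    raise RuntimeError(f"Possible block page received for {url}")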
Now, we will extract specific data points such as the property title, rent estimate price, and assessment price using XPath queries on the parsed HTML content. Note that Zillow's class names (such as "Text-c11n-8-99-3__sc-aiai24-0") are auto-generated and change regularly, so verify them in your browser's developer tools before running the code.
# Extracting the property title using XPath
title = ' '.join(parser.xpath('//h1[@class="Text-c11n-8-99-3__sc-aiai24-0 dFxMdJ"]/text()'))
# Extracting the property rent estimate price using XPath
rent_estimate_price = parser.xpath('//span[@class="Text-c11n-8-99-3__sc-aiai24-0 dFhjAe"]//text()')[-2]
# Extracting the property assessment price using XPath
assessment_price = parser.xpath('//span[@class="Text-c11n-8-99-3__sc-aiai24-0 dFhjAe"]//text()')[-1]
# Store the extracted data in a dictionary
property_data = {
    'title': title,
    'Rent estimate price': rent_estimate_price,
    'Assessment price': assessment_price
}
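The negative indices above assume the price spans always appear in the same order; if the XPath query returns an empty list, indexing raises an IndexError. A defensive helper (a sketch, with a hypothetical safe_extract name) avoids crashing on missing elements:
def safe_extract(elements, index, default='N/A'):
    # Return elements[index] if it exists, otherwise a default placeholder
    try:
        return elements[index].strip()
    except IndexError:
        return default

prices = parser.xpath('//span[@class="Text-c11n-8-99-3__sc-aiai24-0 dFhjAe"]//text()')
rent_estimate_price = safe_extract(prices, -2)
assessment_price = safe_extract(prices, -1)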
Finally, we will save the extracted data to a JSON file for further processing.
import json
# Define the output JSON file name
output_file = 'zillow_properties.json'
# Open the file in write mode and dump the data
with open(output_file, 'w') as f:
    json.dump(property_data, f, indent=4)
print(f"Scraped data saved to {output_file}")
To scrape multiple property listings, we will iterate over a list of URLs and repeat the data extraction process for each one.
# List of URLs to scrape
urls = [
    "https://www.zillow.com/homedetails/1234-Main-St-Some-City-CA-90210/12345678_zpid/",
    "https://www.zillow.com/homedetails/5678-Another-St-Some-City-CA-90210/87654321_zpid/"
]
# List to store data for all properties
all_properties = []
for url in urls:
    # Send the HTTP GET request with headers and proxies
    response = requests.get(url, headers=headers, proxies=proxies)
    response.raise_for_status()  # Ensure we got a valid response
    # Parse the HTML content using lxml
    parser = fromstring(response.text)
    # Extract data using XPath
    title = ' '.join(parser.xpath('//h1[@class="Text-c11n-8-99-3__sc-aiai24-0 dFxMdJ"]/text()'))
    rent_estimate_price = parser.xpath('//span[@class="Text-c11n-8-99-3__sc-aiai24-0 dFhjAe"]//text()')[-2]
    assessment_price = parser.xpath('//span[@class="Text-c11n-8-99-3__sc-aiai24-0 dFhjAe"]//text()')[-1]
    # Store the extracted data in a dictionary
    property_data = {
        'title': title,
        'Rent estimate price': rent_estimate_price,
        'Assessment price': assessment_price
    }
    # Append the property data to the list
    all_properties.append(property_data)
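When looping over many URLs, it is good practice to pause between requests so the traffic looks less like automation; a randomized delay is a common choice (the 2-5 second range below is an arbitrary example):
import random
import time

for url in urls:
    response = requests.get(url, headers=headers, proxies=proxies)
    response.raise_for_status()
    # ... extract and store the data as shown above ...
    # Sleep for a random interval between requests to reduce the chance of blocking
    time.sleep(random.uniform(2, 5))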
Here is the complete code to scrape Zillow property data and save it to a JSON file:
import requests
from lxml.html import fromstring
import json
# Define the target URLs for Zillow property listings
urls = [
    "https://www.zillow.com/homedetails/1234-Main-St-Some-City-CA-90210/12345678_zpid/",
    "https://www.zillow.com/homedetails/5678-Another-St-Some-City-CA-90210/87654321_zpid/"
]
# Set up the request headers to mimic a browser request
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'sec-ch-ua': '"Not/A)Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
    'sec-ch-ua-mobile': '?0',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
}
# Optionally, set up proxies to avoid IP blocking
proxies = {
    'http': 'http://username:password@your_proxy_address',
    'https': 'https://username:password@your_proxy_address',
}
# List to store data for all properties
all_properties = []
for url in urls:
    try:
        # Send the HTTP GET request with headers and proxies
        response = requests.get(url, headers=headers, proxies=proxies)
        response.raise_for_status()  # Ensure we got a valid response
        # Parse the HTML content using lxml
        parser = fromstring(response.text)
        # Extract data using XPath
        title = ' '.join(parser.xpath('//h1[@class="Text-c11n-8-99-3__sc-aiai24-0 dFxMdJ"]/text()'))
        rent_estimate_price = parser.xpath('//span[@class="Text-c11n-8-99-3__sc-aiai24-0 dFhjAe"]//text()')[-2]
        assessment_price = parser.xpath('//span[@class="Text-c11n-8-99-3__sc-aiai24-0 dFhjAe"]//text()')[-1]
        # Store the extracted data in a dictionary
        property_data = {
            'title': title,
            'Rent estimate price': rent_estimate_price,
            'Assessment price': assessment_price
        }
        # Append the property data to the list
        all_properties.append(property_data)
    except requests.exceptions.HTTPError as e:
        print(f"HTTP error occurred: {e}")
    except requests.exceptions.RequestException as e:
        print(f"An error occurred: {e}")
# Define the output JSON file name
output_file = 'zillow_properties.json'
# Open the file in write mode and dump the data
with open(output_file, 'w') as f:
    json.dump(all_properties, f, indent=4)
print(f"Scraped data saved to {output_file}")
By understanding the structure of HTML pages and leveraging powerful libraries such as requests and lxml, you can efficiently extract property details. Employing proxies and rotating User-Agents lets you make a large volume of requests to sites like Zillow while reducing the risk of being blocked; static ISP proxies or rotating residential proxies are generally good choices for this kind of work.
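Rotating User-Agents can be as simple as keeping a small pool of browser strings and picking one per request. A minimal sketch (the strings below are illustrative examples):
import random

# Illustrative pool of desktop browser User-Agent strings
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1 Safari/605.1.15',
]

# Use a different User-Agent for each request
headers['user-agent'] = random.choice(user_agents)
response = requests.get(url, headers=headers, proxies=proxies)
Combined with the proxy setup shown earlier, this keeps each request looking slightly different.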