Guide to Scraping Zillow Real Estate Data with Python

31.10.2024

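Contents:

Installing the required libraries and getting started

Step 1. Understanding Zillow's HTML structure
Step 2. Making the HTTP request
Step 3. Parsing the HTML content
Step 4. Extracting the data
Step 5. Saving the data as JSON

Handling multiple URLs
Complete code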

Property data extracted from Zillow can support market and investment analysis. This article walks through the basic steps of scraping Zillow real estate listings with Python, using the requests and lxml libraries.

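Installing the required libraries and getting started

Before you begin, make sure Python is installed on your system. You will also need to install the following libraries: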

pip install requests
pip install lxml

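Step 1. Understanding Zillow's HTML structure

To extract data from Zillow, you need to understand the structure of the page. Open a property listing page on Zillow and inspect the elements you want to scrape (for example, the property title, the rent estimate price, and the assessment price).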

[Screenshots: title element; price details element]
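
The class names used in the XPath queries later in this guide look auto-generated and may change, so it can help to practice on a small fragment first. The sketch below runs two XPath styles against a made-up fragment. The markup and class names in `SAMPLE_HTML` are assumptions for illustration, not Zillow's real page structure.

```python
from lxml.html import fromstring

# Hypothetical fragment: the layout and class names are invented for practice,
# they are not copied from a real Zillow page.
SAMPLE_HTML = """
<html><body>
  <h1 class="Text-abc title-x1">1234 Main St, Some City, CA 90210</h1>
  <span class="Text-abc price-y2">$2,500/mo</span>
  <span class="Text-abc price-y2">$450,000</span>
</body></html>
"""

doc = fromstring(SAMPLE_HTML)

# Exact match on the whole class attribute (the style used in this guide)
title_exact = doc.xpath('//h1[@class="Text-abc title-x1"]/text()')

# Looser match on one part of the class attribute
prices_loose = doc.xpath('//span[contains(@class, "price-")]/text()')

print(title_exact)
print(prices_loose)
```

An exact `@class` match stops working as soon as any part of the attribute changes, while `contains()` only depends on the substring you pick.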

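Step 2. Making the HTTP request

First, we need to fetch the HTML content of the Zillow page. We will use the requests library to send an HTTP GET request to the target URL. We will also set request headers to mimic a real browser request and use a proxy to reduce the chance of IP blocking.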

import requests

# Define the target URL of the Zillow property listing
url = "https://www.zillow.com/homedetails/1234-Main-St-Some-City-CA-90210/12345678_zpid/"

# Set request headers to mimic a browser request
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'sec-ch-ua': '"Not/A)Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
    'sec-ch-ua-mobile': '?0',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
}

# Optionally set up proxies to avoid IP blocking
proxies = {
    'http': 'http://username:password@your_proxy_address',
    'https': 'https://username:password@your_proxy_address',
}

# Send the HTTP GET request with headers and proxies
response = requests.get(url, headers=headers, proxies=proxies)
response.raise_for_status()  # Ensure we got a valid response
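
A `requests.get` call without a timeout can wait indefinitely if the proxy or server stalls. One possible hardening, sketched below, is a `requests.Session` with automatic retries plus a per-request timeout. The helper name `build_session`, the retry count, the back-off factor, and the list of status codes are all assumed values, not part of the original code.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session(total_retries=3, backoff_factor=1.0):
    """Return a Session that retries transient failures (assumed policy)."""
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff_factor,
        status_forcelist=[429, 500, 502, 503, 504],  # assumed retryable statuses
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


session = build_session()

# Usage (not executed here): same headers and proxies as above, plus a timeout in seconds
# response = session.get(url, headers=headers, proxies=proxies, timeout=15)
# response.raise_for_status()
```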

Step 3. Parsing the HTML content

Next, we parse the HTML content with lxml. The fromstring function from the lxml.html module parses the page's HTML into an element object.

from lxml.html import fromstring

# Parse the HTML content with lxml
parser = fromstring(response.text)
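
Before running the extraction queries, a quick look at the parsed page can show whether the response is the listing or some other page. The sketch below reads the `<title>` text and counts `<h1>` elements. The HTML string is invented and stands in for `response.text`.

```python
from lxml.html import fromstring

# Invented HTML standing in for response.text
html_text = (
    "<html><head><title>1234 Main St | Example</title></head>"
    "<body><h1>1234 Main St</h1></body></html>"
)

doc = fromstring(html_text)

page_title = doc.findtext(".//title")  # text of the <title> element, or None if absent
h1_count = len(doc.xpath("//h1"))      # number of <h1> elements in the page

print(page_title, h1_count)
```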

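Step 4. Extracting the data

Now we run XPath queries against the parsed HTML to extract specific data points: the property title, the rent estimate price, and the assessment price.

# Extract the property title with XPath
title = ' '.join(parser.xpath('//h1[@class="Text-c11n-8-99-3__sc-aiai24-0 dFxMdJ"]/text()'))

# Extract the property's rent estimate price with XPath
rent_estimate_price = parser.xpath('//span[@class="Text-c11n-8-99-3__sc-aiai24-0 dFhjAe"]//text()')[-2]

# Extract the property's assessment price with XPath
assessment_price = parser.xpath('//span[@class="Text-c11n-8-99-3__sc-aiai24-0 dFhjAe"]//text()')[-1]

# Store the extracted data in a dictionary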
property_data = {
    'title': title,
    'Rent estimate price': rent_estimate_price,
    'Assessment price': assessment_price
}
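
The extracted prices are plain strings, and indexing with `[-2]` or `[-1]` raises `IndexError` when the XPath matches fewer text nodes than expected. The helpers below sketch one way to guard against that and to turn a price string into a number for analysis. `pick` and `parse_price` are hypothetical names, and the sample strings are invented.

```python
import re


def pick(items, index, default=None):
    """Return items[index], or default when the index is out of range."""
    try:
        return items[index]
    except IndexError:
        return default


def parse_price(text):
    """Return the first number in a price string such as '$2,500/mo', or None."""
    if text is None:
        return None
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return None
    return int(match.group().replace(",", ""))


# Invented sample values standing in for the XPath results
price_texts = ["$2,500/mo", "$450,000"]

rent_estimate = parse_price(pick(price_texts, -2))
assessment = parse_price(pick(price_texts, -1))
missing = parse_price(pick([], -1))  # empty result list, so no price is available

print(rent_estimate, assessment, missing)
```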

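Step 5. Saving the data as JSON

Finally, we save the extracted data to a JSON file for further processing.

import json

# Define the output JSON file name
output_file = 'zillow_properties.json'

# Open the file in write mode and dump the data
# (property_data is the dictionary built in Step 4)
with open(output_file, 'w') as f:
    json.dump(property_data, f, indent=4)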

print(f"Scraped data saved to {output_file}")
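
To confirm the file is usable for further processing, it can be read back with `json.load`. The sketch below writes an invented record to a temporary directory and reloads it. Passing `ensure_ascii=False` together with UTF-8 encoding is an optional choice that keeps any non-ASCII characters readable in the file.

```python
import json
import os
import tempfile

# Invented record with the same keys used in this guide
sample = [{
    "title": "1234 Main St, Some City, CA 90210",
    "Rent estimate price": "$2,500/mo",
    "Assessment price": "$450,000",
}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "zillow_properties.json")

    # Write the data as UTF-8 JSON
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, indent=4, ensure_ascii=False)

    # Read it back to check the round trip
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded == sample)
```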

Handling multiple URLs

To scrape multiple property listings, we iterate over a list of URLs and repeat the extraction process for each one.

# List of URLs to scrape
urls = [
    "https://www.zillow.com/homedetails/1234-Main-St-Some-City-CA-90210/12345678_zpid/",
    "https://www.zillow.com/homedetails/5678-Another-St-Some-City-CA-90210/87654321_zpid/"
]

# List that stores the data for all properties
all_properties = []

for url in urls:
    # Send the HTTP GET request with headers and proxies
    response = requests.get(url, headers=headers, proxies=proxies)
    response.raise_for_status()  # Ensure we got a valid response

    # Parse the HTML content with lxml
    parser = fromstring(response.text)

    # Extract the data with XPath
    title = ' '.join(parser.xpath('//h1[@class="Text-c11n-8-99-3__sc-aiai24-0 dFxMdJ"]/text()'))
    rent_estimate_price = parser.xpath('//span[@class="Text-c11n-8-99-3__sc-aiai24-0 dFhjAe"]//text()')[-2]
    assessment_price = parser.xpath('//span[@class="Text-c11n-8-99-3__sc-aiai24-0 dFhjAe"]//text()')[-1]

    # Store the extracted data in a dictionary
    property_data = {
        'title': title,
        'Rent estimate price': rent_estimate_price,
        'Assessment price': assessment_price
    }

    # Add the property data to the list
    all_properties.append(property_data)
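
Requesting many pages back to back can trigger rate limiting. A possible refinement, sketched below, pauses for a random interval between URLs. The helper name `paced` and the 2 to 5 second range are assumptions, and the `sleep` parameter exists only so the pause can be swapped out in a demonstration or test.

```python
import random
import time


def paced(urls, min_delay=2.0, max_delay=5.0, sleep=time.sleep):
    """Yield each URL, sleeping a random interval before every URL except the first."""
    for i, url in enumerate(urls):
        if i > 0:
            sleep(random.uniform(min_delay, max_delay))
        yield url


# Demonstration with a recording stand-in for time.sleep, so nothing actually waits
recorded = []
demo_urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
visited = list(paced(demo_urls, sleep=recorded.append))

print(visited, len(recorded))
```

In the loop above, `for url in urls:` would become `for url in paced(urls):`.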

Complete code

Here is the complete code for scraping Zillow real estate data and saving it to a JSON file:

import requests
from lxml.html import fromstring
import json

# Define the target URLs of the Zillow property listings
urls = [
    "https://www.zillow.com/homedetails/1234-Main-St-Some-City-CA-90210/12345678_zpid/",
    "https://www.zillow.com/homedetails/5678-Another-St-Some-City-CA-90210/87654321_zpid/"
]

# Set request headers to mimic a browser request
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'sec-ch-ua': '"Not/A)Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
    'sec-ch-ua-mobile': '?0',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
}

# Optionally set up proxies to avoid IP blocking
proxies = {
    'http': 'http://username:password@your_proxy_address',
    'https': 'https://username:password@your_proxy_address',
}

# List that stores the data for all properties
all_properties = []

for url in urls:
    try:
        # Send the HTTP GET request with headers and proxies
        response = requests.get(url, headers=headers, proxies=proxies)
        response.raise_for_status()  # Ensure we got a valid response

        # Parse the HTML content with lxml
        parser = fromstring(response.text)

        # Extract the data with XPath
        title = ' '.join(parser.xpath('//h1[@class="Text-c11n-8-99-3__sc-aiai24-0 dFxMdJ"]/text()'))
        rent_estimate_price = parser.xpath('//span[@class="Text-c11n-8-99-3__sc-aiai24-0 dFhjAe"]//text()')[-2]
        assessment_price = parser.xpath('//span[@class="Text-c11n-8-99-3__sc-aiai24-0 dFhjAe"]//text()')[-1]

        # Store the extracted data in a dictionary
        property_data = {
            'title': title,
            'Rent estimate price': rent_estimate_price,
            'Assessment price': assessment_price
        }

        # Add the property data to the list
        all_properties.append(property_data)

    except requests.exceptions.HTTPError as e:
        print(f"HTTP error occurred: {e}")
    except requests.exceptions.RequestException as e:
        print(f"An error occurred: {e}")
    except IndexError:
        # The price XPath matched fewer than two text nodes on this page
        print(f"Price fields not found for {url}")

# Define the output JSON file name
output_file = 'zillow_properties.json'

# Open the file in write mode and dump the data
with open(output_file, 'w') as f:
    json.dump(all_properties, f, indent=4)

print(f"Scraped data saved to {output_file}")

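By understanding the structure of the HTML page and using libraries such as requests and lxml, you can extract property details efficiently. Using proxies and rotating user agents lets you send a larger volume of requests to sites like Zillow with a lower risk of being blocked. For this kind of work, static ISP proxies or rotating residential proxies are considered the best options.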
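
Rotating user agents can be sketched as picking a `user-agent` value at random for each request. The strings in `USER_AGENTS` are illustrative examples and the helper name `with_random_user_agent` is hypothetical. A real pool would need current browser strings.

```python
import random

# Illustrative pool; replace with up-to-date browser strings
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0",
]


def with_random_user_agent(base_headers, rng=random):
    """Return a copy of base_headers with a randomly chosen user-agent."""
    headers = dict(base_headers)  # copy, so the caller's dictionary is not modified
    headers["user-agent"] = rng.choice(USER_AGENTS)
    return headers


base = {"accept-language": "en-US,en;q=0.9"}
rotated = with_random_user_agent(base)

print(rotated["user-agent"])
```

If browser-specific headers such as `sec-ch-ua` are sent as well, they would need to agree with the chosen string, so a simpler header set may be easier to rotate.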
