How to Scrape Yelp with Python

26.11.2024

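Contents:

Step 1: Set Up the Environment
Step 2: Send a Request to Yelp

Understanding HTTP Headers
Implementing Proxy Rotation

Step 3: Parse the HTML Content with lxml

Identifying the Elements to Scrape
Extracting Data with XPath

Step 4: Extract Data from Each Restaurant Listing
Step 5: Save the Data as JSON
Complete Code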

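Scraping Yelp can provide valuable information about local restaurants, including details such as name, URL, cuisines, and rating. This tutorial shows how to scrape Yelp search results with the requests and lxml Python libraries. It covers several techniques, including using proxies, handling headers, and extracting data with XPath.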

Step 1: Set Up the Environment

Before you start scraping, make sure Python and the required libraries are installed:

pip install requests
pip install lxml

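These libraries let us send HTTP requests to Yelp, parse the HTML content, and extract the data we need.

Step 2: Send a Request to Yelp

First, we send a GET request to the Yelp search results page to fetch its HTML content: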

import requests

# Yelp search page URL
url = "https://www.yelp.com/search?find_desc=restaurants&find_loc=San+Francisco%2C+CA"

# Send a GET request to fetch the HTML content
response = requests.get(url)

# Check whether the request succeeded
if response.status_code == 200:
    print("Successfully fetched the page content")
else:
    print("Failed to retrieve the page content")
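
The search URL above has its query string encoded by hand. As a minimal sketch, the same URL can be assembled with the standard library's urlencode, which handles the escaping of spaces and commas. The helper name build_search_url and the constant BASE_SEARCH_URL are illustrative and not part of the original tutorial.

```python
from urllib.parse import urlencode

# Base path of the search page used in this tutorial.
BASE_SEARCH_URL = "https://www.yelp.com/search"


def build_search_url(description, location):
    """Build a search URL with properly escaped query parameters."""
    query = urlencode({"find_desc": description, "find_loc": location})
    return f"{BASE_SEARCH_URL}?{query}"


search_url = build_search_url("restaurants", "San Francisco, CA")
print(search_url)
```

Changing the two arguments is then enough to target another search term or city without touching the escaping.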

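Understanding HTTP Headers

When sending requests to a website, include appropriate HTTP headers. Headers carry metadata about the request, such as the User-Agent, which identifies the browser or tool making the request. Including them helps avoid being blocked or throttled by the target site.

Here is how to set the headers: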

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-IN,en;q=0.9',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'priority': 'u=0, i',
    'sec-ch-ua': '"Not)A;Brand";v="99", "Google Chrome";v="127", "Chromium";v="127"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Linux"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36',
}

response = requests.get(url, headers=headers)
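
If several pages are fetched, the headers can be attached once to a requests.Session instead of being passed on every call. The sketch below only prepares a request without sending it, so the outgoing headers can be inspected offline. The shortened BROWSER_HEADERS dictionary and the make_session helper are illustrative assumptions.

```python
import requests

# A shortened header set for illustration; the full dictionary above works the same way.
BROWSER_HEADERS = {
    "accept-language": "en-US,en;q=0.9",
    "user-agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36"
    ),
}


def make_session(headers):
    """Create a session that sends the given headers with every request."""
    session = requests.Session()
    session.headers.update(headers)
    return session


session = make_session(BROWSER_HEADERS)

# Prepare (but do not send) a request to see what would go over the wire.
request = requests.Request(
    "GET",
    "https://www.yelp.com/search",
    params={"find_desc": "restaurants"},
)
prepared = session.prepare_request(request)
print(prepared.url)
print(prepared.headers["User-Agent"])
```

A session also reuses the underlying connection, which is usually a little faster than issuing independent requests.get calls.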

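Implementing Proxy Rotation

When scraping a large number of pages, your IP address may be blocked by the target site. To avoid this, use proxy servers. For this guide, dynamic proxies with automatic rotation are recommended: you configure the proxy once, and rotation keeps access available by changing the IP address periodically, reducing the likelihood of a block.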

proxies = {
    'http': 'http://username:password@proxy-server:port',
    'https': 'https://username:password@proxy-server:port'
}

response = requests.get(url, headers=headers, proxies=proxies)
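
Credentials that contain characters such as "@" or ":" break a proxy URL unless they are percent-encoded. The following sketch builds the proxies dictionary from separate parts. The build_proxies helper, the host proxy.example.com, and the port 8080 are hypothetical, and the sketch assumes the provider accepts a plain http:// proxy URL for both keys; check your provider's instructions for the scheme it expects.

```python
from urllib.parse import quote


def build_proxies(username, password, host, port):
    """Return a requests-style proxies dict with percent-encoded credentials."""
    # safe="" makes quote() escape every reserved character, including "@" and ":".
    credentials = f"{quote(username, safe='')}:{quote(password, safe='')}"
    # Assumption: the same http:// endpoint handles HTTP and HTTPS traffic.
    proxy_url = f"http://{credentials}@{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


# Hypothetical credentials and endpoint, for illustration only.
proxies = build_proxies("user", "p@ss:word", "proxy.example.com", 8080)
print(proxies["http"])
```

The resulting dictionary can be passed to requests.get through the proxies argument exactly as in the snippet above.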

Step 3: Parse the HTML Content with lxml

Once we have the HTML content, the next step is to parse it and extract the relevant data. We use the lxml library for this.

from lxml import html

# Parse the HTML content with lxml
parser = html.fromstring(response.content)
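
To see what html.fromstring returns without touching the network, the sketch below parses a small hand-written document. The sample markup is invented for illustration and is far simpler than a real Yelp page.

```python
from lxml import html

# A tiny stand-in document; bytes are accepted just like response.content.
sample = (
    b"<html><head><title>Sample page</title></head>"
    b"<body><h1>Hello</h1></body></html>"
)

tree = html.fromstring(sample)

# xpath() always returns a list, even when only one node matches.
title = tree.xpath("//title/text()")
heading = tree.xpath("//h1/text()")
print(title, heading)
```

Because xpath() returns a list, an expression that matches nothing yields an empty list, which matters for the indexing done later in the tutorial.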

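Identifying the Elements to Scrape

We need to target the individual restaurant listings on the search results page. These elements can be identified with XPath expressions. On Yelp, listings are typically wrapped in a div element with a specific data-testid attribute.

# Extract the individual restaurant elements (the slice drops the first two matches and the last one)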
elements = parser.xpath('//div[@data-testid="serp-ia-card"]')[2:-1]
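
The effect of the attribute filter and the [2:-1] slice can be checked on a toy page. In the sketch below the markup and the card contents are invented; the tutorial does not say why the first two cards and the last one are skipped, so treating them as non-listing cards is an assumption.

```python
from lxml import html

# Six hypothetical cards plus one unrelated div.
sample = """
<html><body>
  <div data-testid="serp-ia-card"><h3>Sponsored A</h3></div>
  <div data-testid="serp-ia-card"><h3>Sponsored B</h3></div>
  <div data-testid="serp-ia-card"><h3>Cafe One</h3></div>
  <div data-testid="serp-ia-card"><h3>Cafe Two</h3></div>
  <div data-testid="serp-ia-card"><h3>Cafe Three</h3></div>
  <div data-testid="serp-ia-card"><h3>Footer card</h3></div>
  <div data-testid="something-else"><h3>Ignored</h3></div>
</body></html>
"""

tree = html.fromstring(sample)

# Select only the divs that carry the card attribute.
cards = tree.xpath('//div[@data-testid="serp-ia-card"]')

# Drop the first two cards and the last one, as the tutorial does.
kept = cards[2:-1]
names = [card.xpath(".//h3/text()")[0] for card in kept]
print(len(cards), names)
```

If the page layout changes, the slice boundaries are the first thing to re-check, since they depend on the position of the cards rather than on their content.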

Extracting Data with XPath

XPath is a powerful tool for navigating HTML documents and selecting nodes. In this tutorial, we use XPath expressions to extract the name, URL, cuisines, and rating from each restaurant element.

Here are the XPath expressions for each data point:

Restaurant name: .//div[@class="businessName__09f24__HG_pC y-css-ohs7lg"]/div/h3/a/text()
Restaurant URL: .//div[@class="businessName__09f24__HG_pC y-css-ohs7lg"]/div/h3/a/@href
Cuisines: .//div[@class="priceCategory__09f24___4Wsg iaPriceCategory__09f24__x9YrM y-css-2hdccn"]/div/div/div/a/button/span/text()
Rating: .//div[@class="y-css-9tnml4"]/@aria-label
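
These expressions match the class attribute exactly, so they stop working as soon as any part of the class string changes. One possible, more tolerant variant uses contains() on the stable-looking prefix of the class name. The sketch below compares both forms on an invented card; the sample markup and the assumption that the "businessName" prefix stays stable are illustrative, not something the tutorial confirms.

```python
from lxml import html

# A hypothetical card that mimics the structure targeted by the expressions above.
sample = """
<html><body>
<div data-testid="serp-ia-card">
  <div class="businessName__09f24__HG_pC y-css-ohs7lg">
    <div><h3><a href="/biz/cafe-one">Cafe One</a></h3></div>
  </div>
  <div class="y-css-9tnml4" aria-label="4.5 star rating"></div>
</div>
</body></html>
"""

card = html.fromstring(sample).xpath('//div[@data-testid="serp-ia-card"]')[0]

# Exact class match, as used in the tutorial.
exact = card.xpath(
    './/div[@class="businessName__09f24__HG_pC y-css-ohs7lg"]/div/h3/a/text()'
)

# Looser match on part of the class name.
loose = card.xpath('.//div[contains(@class, "businessName")]//h3/a/text()')
href = card.xpath('.//div[contains(@class, "businessName")]//h3/a/@href')
rating = card.xpath('.//div[@aria-label]/@aria-label')
print(exact, loose, href, rating)
```

The leading dot keeps each expression relative to the current card; without it, the query would search the whole document and return data from every listing.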

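Step 4: Extract Data from Each Restaurant Listing

With the HTML content in hand and potential IP blocking addressed, we can extract the required data from each restaurant listing.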

restaurants_data = []

# Iterate over each restaurant element
for element in elements:
    # Extract the restaurant name
    name = element.xpath('.//div[@class="businessName__09f24__HG_pC y-css-ohs7lg"]/div/h3/a/text()')[0]

    # Extract the restaurant URL
    url = element.xpath('.//div[@class="businessName__09f24__HG_pC y-css-ohs7lg"]/div/h3/a/@href')[0]

    # Extract the cuisines
    cuisines = element.xpath('.//div[@class="priceCategory__09f24___4Wsg iaPriceCategory__09f24__x9YrM y-css-2hdccn"]/div/div/div/a/button/span/text()')

    # Extract the rating
    rating = element.xpath('.//div[@class="y-css-9tnml4"]/@aria-label')[0]

    # Build a dictionary to store the data
    restaurant_info = {
        "name": name,
        "url": url,
        "cuisines": cuisines,
        "rating": rating
    }

    # Append the restaurant info to the list
    restaurants_data.append(restaurant_info)
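
Indexing an XPath result with [0] raises IndexError whenever a card lacks that field. A small sketch of a more defensive post-processing step follows. The helper names are illustrative, and two details are assumptions: that hrefs are site-relative paths such as "/biz/...", and that the aria-label reads like "4.5 star rating".

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.yelp.com"


def first_or_none(values):
    """Return the first stripped string from an XPath result list, or None."""
    return values[0].strip() if values else None


def parse_rating(label):
    """Pull the first number out of a label such as '4.5 star rating'."""
    if not label:
        return None
    match = re.search(r"\d+(?:\.\d+)?", label)
    return float(match.group()) if match else None


def build_record(name_nodes, href_nodes, cuisine_nodes, rating_nodes):
    """Turn raw XPath result lists into one cleaned dictionary."""
    href = first_or_none(href_nodes)
    return {
        "name": first_or_none(name_nodes),
        # Assumption: relative hrefs are resolved against the site root.
        "url": urljoin(BASE_URL, href) if href else None,
        "cuisines": [c.strip() for c in cuisine_nodes],
        "rating": parse_rating(first_or_none(rating_nodes)),
    }


# Hypothetical XPath results for a complete card and for an empty one.
record = build_record(
    ["Cafe One"], ["/biz/cafe-one"], ["Italian", "Pizza"], ["4.5 star rating"]
)
empty = build_record([], [], [], [])
print(record)
print(empty)
```

Inside the loop, each element.xpath(...) result would be passed to build_record without the [0] index, so a missing field becomes None instead of stopping the run.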

Step 5: Save the Data as JSON

After extracting the data, we need to save it in a structured format. JSON is a widely used format for this purpose.

import json

# Save the data to a JSON file
with open('yelp_restaurants.json', 'w') as f:
    json.dump(restaurants_data, f, indent=4)

print("Data extraction complete. Saved to yelp_restaurants.json")
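
Restaurant names often contain accented or non-Latin characters. By default json.dump escapes them as \uXXXX sequences; as an optional variation, ensure_ascii=False together with an explicit UTF-8 encoding keeps them readable in the file. The sample record and the temporary directory below are illustrative only.

```python
import json
import os
import tempfile

# A made-up record with a non-ASCII name.
restaurants = [
    {
        "name": "Café Uno",
        "url": "https://www.yelp.com/biz/cafe-uno",
        "cuisines": ["Italian"],
        "rating": "4.5 star rating",
    }
]

# Write into a temporary directory so the example leaves the working folder untouched.
path = os.path.join(tempfile.mkdtemp(), "yelp_restaurants.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(restaurants, f, indent=4, ensure_ascii=False)

# Read the file back to confirm the round trip.
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded == restaurants)
```

Either setting produces valid JSON; the difference is only in how the file looks when opened in an editor.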

Complete Code

import requests
from lxml import html
import json

# Yelp search page URL
url = "https://www.yelp.com/search?find_desc=restaurants&find_loc=San+Francisco%2C+CA"

# Set headers to mimic a browser request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.5'
}

# Configure proxies as needed
proxies = {
    'http': 'http://username:password@proxy-server:port',
    'https': 'https://username:password@proxy-server:port'
}

# Send a GET request to fetch the HTML content
response = requests.get(url, headers=headers, proxies=proxies)

# Check whether the request succeeded
if response.status_code == 200:
    print("Successfully fetched the page content")
else:
    print("Failed to retrieve the page content")

# Parse the HTML content with lxml
parser = html.fromstring(response.content)

# Extract the individual restaurant elements (the slice drops the first two matches and the last one)
elements = parser.xpath('//div[@data-testid="serp-ia-card"]')[2:-1]

# Initialize a list to hold the extracted data
restaurants_data = []

# Iterate over each restaurant element
for element in elements:
    # Extract the restaurant name
    name = element.xpath('.//div[@class="businessName__09f24__HG_pC y-css-ohs7lg"]/div/h3/a/text()')[0]

    # Extract the restaurant URL
    url = element.xpath('.//div[@class="businessName__09f24__HG_pC y-css-ohs7lg"]/div/h3/a/@href')[0]

    # Extract the cuisines
    cuisines = element.xpath('.//div[@class="priceCategory__09f24___4Wsg iaPriceCategory__09f24__x9YrM y-css-2hdccn"]/div/div/div/a/button/span/text()')

    # Extract the rating
    rating = element.xpath('.//div[@class="y-css-9tnml4"]/@aria-label')[0]

    # Build a dictionary to store the data
    restaurant_info = {
        "name": name,
        "url": url,
        "cuisines": cuisines,
        "rating": rating
    }

    # Append the restaurant info to the list
    restaurants_data.append(restaurant_info)

# Save the data to a JSON file
with open('yelp_restaurants.json', 'w') as f:
    json.dump(restaurants_data, f, indent=4)

print("Data extraction complete. Saved to yelp_restaurants.json")

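Configuring HTTP headers correctly and using proxies are essential for getting around restrictions and blocks. For a smoother and safer scraping experience, consider automatic IP rotation. Dynamic residential or mobile proxies can strengthen this further and reduce the likelihood of detection and blocking.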
