A Guide to Scraping Amazon Reviews with Python


Scraping Amazon reviews with Python is remarkably useful, whether you are analyzing competitors, reading what real customers say, or mining market trends. If you have wondered how to scrape Amazon reviews with Python, this tutorial walks you through collecting review content programmatically with the Requests package and BeautifulSoup.

Step 1. Install the required libraries

Before starting, you need to install a couple of libraries. The two core dependencies, Requests for network calls and BeautifulSoup for traversing the HTML tree, can both be installed with a single terminal command:

pip install requests beautifulsoup4
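To confirm the installation worked before moving on, you can check that both packages import cleanly (note that beautifulsoup4 is imported under the module name bs4):

```python
# Confirm both dependencies are importable before moving on.
import importlib

for name in ("requests", "bs4"):
    module = importlib.import_module(name)
    print(name, "imported as", module.__name__)
```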

Step 2. Configure the scraping process

We will focus on scraping Amazon reviews with Python and examine each stage of the process step by step.

Understand the site structure

Understanding the site's HTML structure is essential for identifying the review elements. The review section includes fields such as the reviewer's handle, the star rating, and the written review; these fields must be located with your browser's inspection tools.

Use the browser's inspection tools to locate each of the following elements (the original screenshots are omitted here):

- Product title and URL
- Overall rating
- Review section
- Author name
- Rating
- Review text

Send the HTTP request

Headers play an important role. The User-Agent string and other headers are set to mimic an ordinary browser, reducing the chance of being detected. To do this properly, set up the headers and proxies as shown below so that your requests go through smoothly without being flagged.

Proxies

Proxies enable IP rotation, which lowers the risk of bans and rate limiting. They are especially important for large-scale scraping.
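The rotation itself can be as simple as cycling through a pool of endpoints. A minimal sketch, assuming you have a list of proxy addresses from your provider (the example.com hosts below are placeholders):

```python
from itertools import cycle

# Placeholder endpoints -- substitute the addresses your proxy provider gives you.
PROXY_POOL = [
    "http://proxy1.example.com:8000",
    "http://proxy2.example.com:8000",
    "http://proxy3.example.com:8000",
]

_proxy_cycle = cycle(PROXY_POOL)

def next_proxy():
    """Return a requests-style proxies dict, advancing round-robin through the pool."""
    address = next(_proxy_cycle)
    return {"http": address, "https": address}

# Pass the result straight to requests.get(url, proxies=next_proxy(), ...)
```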

Complete request headers

Including a full set of headers, such as Accept-Encoding, Accept-Language, Referer, Connection, and Upgrade-Insecure-Requests, mimics a legitimate browser request and reduces the chance of being flagged as a bot.

import requests

url = "https://www.amazon.com/Portable-Mechanical-Keyboard-MageGee-Backlit/product-reviews/B098LG3N6R/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews"

# Example of a proxy provided by the proxy service
proxy = {
    'http': 'http://your_proxy_ip:your_proxy_port',
    'https': 'https://your_proxy_ip:your_proxy_port'
}

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'sec-ch-ua': '"Not/A)Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
    'sec-ch-ua-mobile': '?0',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
}

# Send HTTP GET request to the URL with headers and proxy
try:
    response = requests.get(url, headers=headers, proxies=proxy, timeout=10)
    response.raise_for_status()  # Raise an error if the request failed

except requests.exceptions.RequestException as e:
    print(f"Error: {e}")
    raise SystemExit(1)  # Stop here: there is no response to parse

Step 3. Extract product details with BeautifulSoup

Once the page has loaded, BeautifulSoup turns the raw HTML into a searchable tree. From that structure, the scraper grabs the canonical product link, the page title, and any visible rating summary. (Note that soup.find returns None when a selector misses, so a change in Amazon's markup will raise an AttributeError here.)

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.content, 'html.parser')

# Extracting common product details
product_url = soup.find('a', {'data-hook': 'product-link'}).get('href', '')
product_title = soup.find('a', {'data-hook': 'product-link'}).get_text(strip=True)
total_rating = soup.find('span', {'data-hook': 'rating-out-of-text'}).get_text(strip=True)

Step 4. Extract review data with BeautifulSoup

We return to the same HTML structure, this time focusing on collecting reviewer names, star ratings, and written comments, all extracted efficiently with predefined selectors.

reviews = []
review_elements = soup.find_all('div', {'data-hook': 'review'})
for review in review_elements:
    author_name = review.find('span', class_='a-profile-name').get_text(strip=True)
    rating_given = review.find('i', class_='review-rating').get_text(strip=True)
    comment = review.find('span', class_='review-text').get_text(strip=True)

    reviews.append({
        'Product URL': product_url,
        'Product Title': product_title,
        'Total Rating': total_rating,
        'Author': author_name,
        'Rating': rating_given,
        'Comment': comment,
    })
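The rating_given string captured above typically reads like "4.0 out of 5 stars". If you want a numeric value for analysis, a small helper can pull out the leading number (a sketch; the exact wording of Amazon's rating text may vary):

```python
import re

def parse_rating(text):
    """Extract the first number from a rating string such as '4.0 out of 5 stars'.

    Returns None when no number is present.
    """
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None

print(parse_rating("4.0 out of 5 stars"))  # 4.0
print(parse_rating("no rating found"))     # None
```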

Step 5. Save the data to CSV

Python's built-in csv.DictWriter can save the collected review data to a .csv file for later analysis.

import csv

# Define CSV file path
csv_file = 'amazon_reviews.csv'

# Define CSV fieldnames
fieldnames = ['Product URL', 'Product Title', 'Total Rating', 'Author', 'Rating', 'Comment']

# Writing data to CSV file
with open(csv_file, mode='w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()
    for review in reviews:
        writer.writerow(review)

print(f"Data saved to {csv_file}")
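To confirm the data round-trips cleanly, you can read the file back with csv.DictReader, the counterpart of the DictWriter used above. The sample rows below are made up for illustration, and an in-memory buffer stands in for the file:

```python
import csv
import io

fieldnames = ["Author", "Rating", "Comment"]
rows = [
    {"Author": "Alice", "Rating": "5.0 out of 5 stars", "Comment": "Great keyboard"},
    {"Author": "Bob", "Rating": "3.0 out of 5 stars", "Comment": "Keys feel stiff"},
]

# Write to an in-memory buffer the same way the script writes to disk...
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)

# ...then read it back and check the rows survived intact.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
print(loaded[0]["Author"])  # Alice
```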

Full code

The code block below chains the request, parsing, and file-output steps together, wrapping the entire scraping workflow in a single runnable script:

import requests
from bs4 import BeautifulSoup
import csv
import urllib3

urllib3.disable_warnings()

# URL of the Amazon product reviews page
url = "https://www.amazon.com/Portable-Mechanical-Keyboard-MageGee-Backlit/product-reviews/B098LG3N6R/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews"

# Proxy provided by the proxy service with IP-authorization
path_proxy = 'your_proxy_ip:your_proxy_port'
proxy = {
   'http': f'http://{path_proxy}',
   'https': f'https://{path_proxy}'
}

# Headers for the HTTP request
headers = {
   'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
   'accept-language': 'en-US,en;q=0.9',
   'cache-control': 'no-cache',
   'dnt': '1',
   'pragma': 'no-cache',
   'sec-ch-ua': '"Not/A)Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
   'sec-ch-ua-mobile': '?0',
   'sec-fetch-dest': 'document',
   'sec-fetch-mode': 'navigate',
   'sec-fetch-site': 'same-origin',
   'sec-fetch-user': '?1',
   'upgrade-insecure-requests': '1',
   'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
}

# Send HTTP GET request to the URL with headers and handle exceptions
try:
   response = requests.get(url, headers=headers, timeout=10, proxies=proxy, verify=False)
   response.raise_for_status()  # Raise an error if the request failed

except requests.exceptions.RequestException as e:
   print(f"Error: {e}")
   raise SystemExit(1)  # Stop here: there is no response to parse

# Use BeautifulSoup to parse the HTML and grab the data you need
soup = BeautifulSoup(response.content, 'html.parser')

# Extracting common product details
product_url = soup.find('a', {'data-hook': 'product-link'}).get('href', '')  # Extract product URL
product_title = soup.find('a', {'data-hook': 'product-link'}).get_text(strip=True)  # Extract product title
total_rating = soup.find('span', {'data-hook': 'rating-out-of-text'}).get_text(strip=True)  # Extract total rating

# Extracting individual reviews
reviews = []
review_elements = soup.find_all('div', {'data-hook': 'review'})
for review in review_elements:
   author_name = review.find('span', class_='a-profile-name').get_text(strip=True)  # Extract author name
   rating_given = review.find('i', class_='review-rating').get_text(strip=True)  # Extract rating given
   comment = review.find('span', class_='review-text').get_text(strip=True)  # Extract review comment

   # Store each review in a dictionary
   reviews.append({
       'Product URL': product_url,
       'Product Title': product_title,
       'Total Rating': total_rating,
       'Author': author_name,
       'Rating': rating_given,
       'Comment': comment,
   })

# Define CSV file path
csv_file = 'amazon_reviews.csv'

# Define CSV fieldnames
fieldnames = ['Product URL', 'Product Title', 'Total Rating', 'Author', 'Rating', 'Comment']

# Writing data to CSV file
with open(csv_file, mode='w', newline='', encoding='utf-8') as file:
   writer = csv.DictWriter(file, fieldnames=fieldnames)
   writer.writeheader()
   for review in reviews:
       writer.writerow(review)

# Print confirmation message
print(f"Data saved to {csv_file}")
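The script above fetches only the first page of reviews. Review listings are paginated via a query parameter; the pageNumber parameter below is an assumption based on how Amazon review URLs commonly look, so verify it in your browser's address bar before relying on it. This sketch only builds the page URLs rather than fetching them:

```python
from urllib.parse import urlencode

# Base review URL for the product scraped above.
BASE = "https://www.amazon.com/Portable-Mechanical-Keyboard-MageGee-Backlit/product-reviews/B098LG3N6R/"

def review_page_url(page):
    # 'pageNumber' is assumed from Amazon's typical review-page URLs -- confirm it yourself.
    params = {"reviewerType": "all_reviews", "pageNumber": page}
    return f"{BASE}?{urlencode(params)}"

urls = [review_page_url(p) for p in range(1, 4)]
print(urls[1])
```

Each URL can then be fetched in the same try/except loop as the single-page script, ideally with a polite delay between requests.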

Reliable proxies improve the odds of bypassing blocks and help reduce detection by anti-bot filters. For scraping, residential proxies are often preferred for their trust factor, while static ISP proxies offer speed and stability.

Conclusion

Scraping Amazon product reviews with Python is entirely feasible, and Python provides the tools to do it. With just a few libraries and some careful inspection of the page, you can extract all kinds of useful information: from what customers really think to where competitors fall short.

Of course, there are obstacles: Amazon does not exactly welcome scrapers. So if you want to scrape Amazon product reviews with Python at scale, you will need proxies to keep a low profile. The most reliable options are residential proxies (high trust, rotating IPs) or static ISP proxies (fast and stable).
