How to Scrape E-commerce Websites with Python

22.06.2025

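Contents:

Writing an e-commerce data scraping script

Step 1. Understand the website's HTML structure
Step 2. Send HTTP requests
Step 3. Extract data with XPath and lxml
Step 4. Handle potential problems
Step 5. Save the data to a CSV file

Complete code
E-commerce data scraping: final thoughts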

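Scraping product details from e-commerce sites supports competitive analysis, price monitoring, and market research. Python makes it straightforward to collect data from product pages. This tutorial shows how to combine the requests and lxml libraries to retrieve information from an online store.

Scraping an e-commerce site means collecting product information from online stores, such as titles, prices, or identifier numbers. Python's libraries make this both simple and reasonably efficient. This article focuses on scraping an e-commerce website with Python, using the Costco website as the target.

Writing an e-commerce data scraping script

First, make sure the Python libraries this script needs are installed: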


pip install requests
pip install lxml

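We will extract the product name, features, and brand from specific pages of the site.

Step 1. Understand the website's HTML structure

To build an e-commerce product scraper, you first need to understand the structure of the target page. Visit the site, open the page you want to collect information from, and inspect the elements you need (product name, features, brand, and so on) with the browser's developer tools.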
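To make the link between page structure and extraction concrete, here is a minimal sketch that runs XPath queries against a small inline HTML fragment. The fragment is invented for illustration; only the attribute names (`automation-id="productName"`, `itemprop="brand"`, `class="pdp-features"`) are taken from the XPath expressions used in this tutorial, and the real product page may be structured differently.

```python
from lxml import html

# Invented sample markup; the real page is larger and may differ.
SAMPLE_HTML = """
<html><body>
  <h1 automation-id="productName"> Example Sneaker </h1>
  <div itemprop="brand">ExampleBrand</div>
  <ul class="pdp-features">
    <li>Lightweight sole</li>
    <li>Breathable mesh</li>
  </ul>
</body></html>
"""

tree = html.fromstring(SAMPLE_HTML)

# Each XPath expression selects text nodes by tag name and attribute value.
name = tree.xpath('//h1[@automation-id="productName"]/text()')[0].strip()
brand = tree.xpath('//div[@itemprop="brand"]/text()')[0].strip()
features = [
    text.strip()
    for text in tree.xpath('//ul[@class="pdp-features"]//li//text()')
    if text.strip()
]

print(name, brand, features)
```

Testing expressions against a saved copy of the page in this way lets you adjust them without sending repeated requests to the site.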

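Step 2. Send HTTP requests

First, we import the requests library and send a GET request to each product page. We also configure the request headers so that the request resembles one sent by a browser.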


import requests

# List of product URLs to scrape
urls = [
    "https://www.costco.com/kirkland-signature-men's-sneaker.product.4000216649.html",
    "https://www.costco.com/adidas-ladies'-puremotion-shoe.product.4000177646.html"
]

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'sec-ch-ua': '"Not/A)Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
    'sec-ch-ua-mobile': '?0',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
}

# Loop over each URL and send a GET request
for url in urls:
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        html_content = response.text
        # Further processing is added in the following steps
    else:
        print(f"Failed to retrieve {url}. Status code: {response.status_code}")

Step 3. Extract data with XPath and lxml

With lxml we can extract the required information from the HTML. This is the core step of e-commerce data scraping.


from lxml import html

# List that stores the scraped data
scraped_data = []

# Loop over each URL and send a GET request
for url in urls:
    # Reuse the browser-like headers defined in Step 2
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        html_content = response.content
        # Parse the HTML content with lxml
        tree = html.fromstring(html_content)

        # Extract data with XPath
        product_name = tree.xpath('//h1[@automation-id="productName"]/text()')[0].strip()
        product_feature = tree.xpath('//ul[@class="pdp-features"]//li//text()')
        product_brand = tree.xpath('//div[@itemprop="brand"]/text()')[0].strip()

        # Append the extracted data to the list
        scraped_data.append({'Product Name': product_name, 'Product Feature': product_feature, 'Brand': product_brand})
    else:
        print(f"Failed to retrieve {url}. Status code: {response.status_code}")

# Print the scraped data
for item in scraped_data:
    print(item)
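The `[0]` indexing in the extraction code raises an IndexError whenever an XPath expression matches nothing, which can happen if the page layout changes or a block page is returned instead of the product page. A minimal sketch of a more defensive approach follows. `first_text` and `clean_features` are hypothetical helper names, not part of lxml; they operate on the list of strings that `tree.xpath(...)` returns for `text()` queries, so plain lists stand in for XPath results here.

```python
def first_text(nodes, default=""):
    """Return the first non-empty, stripped string from an XPath result list."""
    for node in nodes:
        text = str(node).strip()
        if text:
            return text
    # Nothing matched (or only whitespace matched): fall back to the default.
    return default


def clean_features(nodes):
    """Strip whitespace from each text node and drop the empty ones."""
    return [text for text in (str(node).strip() for node in nodes) if text]


# Plain lists stand in for XPath results in this demonstration.
print(first_text(["  ", " Example Sneaker "]))
print(first_text([], default="N/A"))
print(clean_features(["\n", " Lightweight sole ", "Breathable mesh"]))
```

In the scraping loop this would be called as `first_text(tree.xpath('//h1[@automation-id="productName"]/text()'))`, so a missing element yields an empty string rather than stopping the run.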

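Step 4. Handle potential problems

When scraping e-commerce sites with Python, keep in mind that most sites run some form of anti-bot protection. Using proxies and rotating user agents can help make requests look less suspicious.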
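A related precaution, not covered by the original script, is to retry failed requests with increasing pauses rather than immediately hammering the site again. The sketch below is an assumption-laden illustration: `fetch_with_retries` is a hypothetical helper, `fetch` is any callable that returns an object with a `status_code` attribute (for example a small wrapper around `requests.get`), and the delay values are arbitrary.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, base=1.0, cap=30.0, sleep=time.sleep):
    """Call fetch(url) up to `attempts` times, pausing longer after each failure.

    Returns the first response with status code 200, or None if every attempt fails.
    """
    for attempt in range(attempts):
        response = fetch(url)
        if response.status_code == 200:
            return response
        # Exponential backoff: base, 2*base, 4*base, ... capped at `cap` seconds.
        if attempt < attempts - 1:
            sleep(min(cap, base * 2 ** attempt))
    return None


class FakeResponse:
    """Stand-in for a response object so the example runs without network access."""

    def __init__(self, status_code):
        self.status_code = status_code


# Simulate two failures followed by a success, recording the pauses instead of sleeping.
codes = iter([429, 503, 200])
pauses = []
result = fetch_with_retries(
    lambda url: FakeResponse(next(codes)),
    "https://example.com/product",
    sleep=pauses.append,
)
print(result.status_code, pauses)
```

With requests, `fetch` could be something like `lambda u: requests.get(u, headers=headers, timeout=10)`.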

Using a proxy with IP authorization:


proxies = {
    'http': 'http://your_proxy_ip:your_proxy_port',
    'https': 'https://your_proxy_ip:your_proxy_port'
}
response = requests.get(url, proxies=proxies)

Rotating user agents:


import random

user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
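    # Example second entry (an assumed macOS Chrome string) so that rotation actually varies the value
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    # Add more user agents as needed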
]

headers['user-agent'] = random.choice(user_agents)

response = requests.get(url, headers=headers)
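The snippet above overwrites the shared `headers` dictionary on every request. One possible alternative, shown here as a sketch, builds a fresh copy per request so the base headers are never mutated. `build_headers`, `BASE_HEADERS`, and `USER_AGENTS` are hypothetical names, and the header values are shortened placeholders rather than the full set from Step 2.

```python
import random

# Shortened placeholder headers; the tutorial's full header set can be used instead.
BASE_HEADERS = {
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'no-cache',
}

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
]


def build_headers(base_headers, user_agents, rng=random):
    """Return a copy of base_headers with a randomly chosen user-agent added."""
    headers = dict(base_headers)  # copy, so the shared dictionary stays unchanged
    headers['user-agent'] = rng.choice(user_agents)
    return headers


request_headers = build_headers(BASE_HEADERS, USER_AGENTS)
print(request_headers['user-agent'])
```

The result can be passed directly as `requests.get(url, headers=request_headers)`. Passing a seeded `random.Random` instance as `rng` makes the choice reproducible in tests.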

Step 5. Save the data to a CSV file

Finally, the extracted data is stored in CSV format so that it can be used later for more advanced analysis.


import csv

csv_file = 'costco_products.csv'
fieldnames = ['Product Name', 'Product Feature', 'Brand']

# Write the data to the CSV file
try:
    with open(csv_file, mode='w', newline='', encoding='utf-8') as file:
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        writer.writeheader()
        for item in scraped_data:
            writer.writerow(item)
    print(f"Data saved to {csv_file}")
except IOError:
    print(f"Error occurred while writing data to {csv_file}")
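Because `Product Feature` holds a list of strings, `csv.DictWriter` writes it as the list's Python representation (brackets and quotes included). If a plain-text cell is preferred, one option is to join the features before writing. The sketch below assumes a hypothetical `flatten_row` helper and an arbitrary ` | ` separator, and writes to an in-memory buffer with invented sample data so it runs without touching the disk.

```python
import csv
import io


def flatten_row(item, sep=" | "):
    """Return a copy of item with the feature list joined into one string."""
    row = dict(item)
    features = row.get('Product Feature', [])
    if isinstance(features, (list, tuple)):
        row['Product Feature'] = sep.join(s.strip() for s in features if s.strip())
    return row


# Invented sample record in the same shape as the scraped data.
sample = [{
    'Product Name': 'Example Sneaker',
    'Product Feature': ['Lightweight sole', ' Breathable mesh '],
    'Brand': 'ExampleBrand',
}]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['Product Name', 'Product Feature', 'Brand'])
writer.writeheader()
for item in sample:
    writer.writerow(flatten_row(item))

print(buffer.getvalue())
```

In the script above, the same effect would come from calling `writer.writerow(flatten_row(item))` inside the existing loop.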

Complete code

Here is the final version of the script for e-commerce data scraping. It can be copied and pasted for convenience.


import requests
import urllib3
from lxml import html
import csv
import random
import ssl

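# The next two lines turn off TLS certificate checks for the standard library
# and silence urllib3's warnings (paired with verify=False below); use for testing only
ssl._create_default_https_context = ssl._create_stdlib_context
urllib3.disable_warnings()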

# List of product URLs to scrape
urls = [
   "https://www.costco.com/kirkland-signature-men's-sneaker.product.4000216649.html",
   "https://www.costco.com/adidas-ladies'-puremotion-shoe.product.4000177646.html"
]

# headers
headers = {
   'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
   'accept-language': 'en-US,en;q=0.9',
   'cache-control': 'no-cache',
   'dnt': '1',
   'pragma': 'no-cache',
   'sec-ch-ua': '"Not/A)Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
   'sec-ch-ua-mobile': '?0',
   'sec-fetch-dest': 'document',
   'sec-fetch-mode': 'navigate',
   'sec-fetch-site': 'same-origin',
   'sec-fetch-user': '?1',
   'upgrade-insecure-requests': '1',
   'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
}

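# List of user agents to rotate between requests
user_agents = [
   'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
   # Example second entry (an assumed macOS Chrome string) so that rotation actually varies the value
   'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
   # Add more user agents as needed
]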


# List of proxies to rotate between requests
proxies = [
    {'http': 'http://your_proxy_ip:your_proxy_port', 'https': 'https://your_proxy_ip:your_proxy_port'},
    {'http': 'http://your_proxy_ip2:your_proxy_port2', 'https': 'https://your_proxy_ip2:your_proxy_port2'},
    # Add more proxies as needed
]

# List that stores the scraped data
scraped_data = []

# Loop over each URL and send a GET request
for url in urls:
   # Pick a random user agent for the request headers
   headers['user-agent'] = random.choice(user_agents)
   # Pick a random proxy for the request
   proxy = random.choice(proxies)

   # Send an HTTP GET request to the URL with the headers and proxy
   response = requests.get(url, headers=headers, proxies=proxy, verify=False)
   if response.status_code == 200:
       # Store the HTML content of the response
       html_content = response.content
       # Parse the HTML content with lxml
       tree = html.fromstring(html_content)

       # Extract data with XPath
       product_name = tree.xpath('//h1[@automation-id="productName"]/text()')[0].strip()
       product_feature = tree.xpath('//ul[@class="pdp-features"]//li//text()')
       product_brand = tree.xpath('//div[@itemprop="brand"]/text()')[0].strip()

       # Append the extracted data to the list
       scraped_data.append({'Product Name': product_name, 'Product Feature': product_feature, 'Brand': product_brand})
   else:
       # Print an error message if the request fails
       print(f"Failed to retrieve {url}. Status code: {response.status_code}")

# CSV file settings
csv_file = 'costco_products.csv'
fieldnames = ['Product Name', 'Product Feature', 'Brand']

# Write the data to the CSV file
try:
   with open(csv_file, mode='w', newline='', encoding='utf-8') as file:
       writer = csv.DictWriter(file, fieldnames=fieldnames)
       writer.writeheader()
       for item in scraped_data:
           writer.writerow(item)
   print(f"Data saved to {csv_file}")
except IOError:
   # Print an error message if writing the file fails
   print(f"Error occurred while writing data to {csv_file}")

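The Python e-commerce scraper is now complete.

E-commerce data scraping: final thoughts

This scraper for the Costco online store shows how Python can be used to collect product data for analysis and business decisions. With the right script and the requests and lxml libraries doing the extraction, a site can be scraped with less risk of anti-bot measures interrupting the workflow. Finally, always follow ethical guidelines when scraping e-commerce sites.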
