Extracting data from e-retailers such as AliExpress can be highly useful for gathering product information, monitoring price fluctuations, collecting reviews, and more. In this article, we will walk through the process of acquiring product details (such as name, price, and rating) and scraping product reviews. We will also demonstrate how to make the scraper dynamic by passing in a product URL, automatically retrieving the product ID, and saving the data to CSV files.
This tutorial will use Playwright to render dynamic content and requests to fetch review data. We’ll also ensure the scraper is ethical and complies with best practices.
Before we begin, ensure you have the following Python libraries installed. You can install them by running these commands:
# Install Playwright
pip install playwright
# Install Requests
pip install requests
# Install lxml for parsing HTML
pip install lxml
# Install Pandas for data manipulation and saving
pip install pandas
After installing Playwright, you will also need to install the required browser binaries:
playwright install
This will download and set up the necessary browser for Playwright to function properly.
AliExpress product pages are dynamic, meaning they load content via JavaScript. To handle this, we’ll use Playwright, a Python library that allows you to control a headless browser and interact with dynamic content.
Here's how you can send a request and navigate to the product page:
from playwright.async_api import async_playwright

async def get_page_content(url):
    async with async_playwright() as p:
        # Launch the browser with a proxy if needed (remove the proxy argument if you are not using one)
        browser = await p.firefox.launch(
            headless=False,
            proxy={"server": "http://your-proxy-server:port", "username": "your-username", "password": "your-password"}
        )
        page = await browser.new_page()
        await page.goto(url, timeout=60000)

        # Extract the fully rendered page content
        content = await page.content()
        await browser.close()
        return content

# Example URL
url = 'https://www.aliexpress.com/item/3256805354456256.html'
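Since get_page_content is a coroutine, it has to be driven by an event loop. Here is a minimal sketch of running it and previewing the rendered HTML (the 500-character preview is just for illustration):

import asyncio

# Fetch the rendered HTML and preview the first 500 characters
html = asyncio.run(get_page_content(url))
print(html[:500])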
Once we have the page content, we can extract the product data using lxml and XPath queries. We will gather details like the product title, price, rating, number of reviews, and the number of items sold.
from lxml.html import fromstring

def extract_product_data(content):
    parser = fromstring(content)

    # Extract product details using XPath
    title = parser.xpath('//h1[@data-pl="product-title"]/text()')[0].strip()
    price = parser.xpath('//div[@class="price--current--I3Zeidd product-price-current"]/span/text()')[0].strip()
    rating = ' '.join(parser.xpath('//a[@class="reviewer--rating--xrWWFzx"]/strong/text()')).strip()
    total_reviews = parser.xpath('//a[@class="reviewer--reviews--cx7Zs_V"]/text()')[0].strip()
    sold_count = parser.xpath('//span[@class="reviewer--sold--ytPeoEy"]/text()')[0].strip()

    product_data = {
        'title': title,
        'price': price,
        'rating': rating,
        'total_reviews': total_reviews,
        'sold_count': sold_count
    }

    return product_data
This code uses XPath to extract relevant product details from the HTML content of the page.
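Keep in mind that class names like price--current--I3Zeidd are auto-generated and change over time, and indexing with [0] raises an IndexError whenever an XPath matches nothing. A small helper such as the hypothetical first_or_none below is one way to make the extraction more forgiving:

from lxml.html import fromstring

def first_or_none(parser, xpath_query):
    """Return the first stripped XPath match, or None when nothing matches."""
    matches = parser.xpath(xpath_query)
    return matches[0].strip() if matches else None

# Demonstration on a snippet of HTML: a missing element yields None instead of an error
parser = fromstring('<html><h1 data-pl="product-title"> Sample item </h1></html>')
print(first_or_none(parser, '//h1[@data-pl="product-title"]/text()'))            # Sample item
print(first_or_none(parser, '//span[@class="reviewer--sold--ytPeoEy"]/text()'))  # None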
AliExpress has a separate API endpoint for fetching product reviews. You can extract the product ID from the URL dynamically and use it to fetch reviews via requests:
import requests

def extract_product_id(url):
    # The product ID is the last path segment, before '.html'
    product_id = url.split('/')[-1].split('.')[0]
    return product_id

def scrape_reviews(product_id, page_num=1, page_size=10):
    headers = {
        'accept': 'application/json, text/plain, */*',
        'accept-language': 'en-IN,en;q=0.9',
        'referer': f'https://www.aliexpress.com/item/{product_id}.html',
        'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36',
    }
    params = {
        'productId': product_id,
        'lang': 'en_US',
        'country': 'US',
        'page': str(page_num),
        'pageSize': str(page_size),
        'filter': 'all',
        'sort': 'complex_default',
    }
    response = requests.get('https://feedback.aliexpress.com/pc/searchEvaluation.do', params=params, headers=headers)
    reviews = response.json()['data']['evaViewList']

    # Extract the review text only
    review_texts = [review['buyerFeedback'] for review in reviews]

    return review_texts
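Note that scrape_reviews fetches a single page of reviews. Below is a minimal sketch of collecting several pages with a polite delay between requests; the three-page limit and one-second pause are arbitrary choices, not part of the API:

import time

def scrape_all_reviews(product_id, max_pages=3, page_size=10):
    all_reviews = []
    for page_num in range(1, max_pages + 1):
        page_reviews = scrape_reviews(product_id, page_num=page_num, page_size=page_size)
        if not page_reviews:
            break  # no more reviews to fetch
        all_reviews.extend(page_reviews)
        time.sleep(1)  # polite delay between requests
    return all_reviews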
After scraping the product details and reviews, we save this data into CSV files using the pandas library.
import pandas as pd

def save_to_csv(product_data, reviews, product_id):
    # Save product details to CSV
    df_product = pd.DataFrame([product_data])
    df_product.to_csv(f'product_{product_id}_data.csv', index=False)

    # Save reviews to CSV
    df_reviews = pd.DataFrame({'reviews': reviews})
    df_reviews.to_csv(f'product_{product_id}_reviews.csv', index=False)

    print(f"Data saved for product {product_id}.")
The product details and reviews are saved into separate CSV files with the product ID included in the filename for easy identification.
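If you want to sanity-check the function before wiring everything together, you can call it with dummy data; the values below are made up purely for illustration:

# Hypothetical sample data to verify the CSV output
sample_product = {
    'title': 'Sample product',
    'price': '$9.99',
    'rating': '4.8',
    'total_reviews': '120 Reviews',
    'sold_count': '500+ sold'
}
sample_reviews = ['Great quality!', 'Fast shipping.']
save_to_csv(sample_product, sample_reviews, '0000000000')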
Here’s how the complete dynamic workflow works:
import asyncio

from playwright.async_api import async_playwright
from lxml.html import fromstring
import requests
import pandas as pd

# Get page content using Playwright
async def get_page_content(url):
    async with async_playwright() as p:
        # Remove the proxy argument if you are not using one
        browser = await p.firefox.launch(
            headless=False,
            proxy={"server": "http://your-proxy-server:port", "username": "your-username", "password": "your-password"}
        )
        page = await browser.new_page()
        await page.goto(url, timeout=60000)
        content = await page.content()
        await browser.close()
        return content
# Extract product data
def extract_product_data(content):
    parser = fromstring(content)
    title = parser.xpath('//h1[@data-pl="product-title"]/text()')[0].strip()
    price = parser.xpath('//div[@class="price--current--I3Zeidd product-price-current"]/span/text()')[0].strip()
    rating = ' '.join(parser.xpath('//a[@class="reviewer--rating--xrWWFzx"]/strong/text()')).strip()
    total_reviews = parser.xpath('//a[@class="reviewer--reviews--cx7Zs_V"]/text()')[0].strip()
    sold_count = parser.xpath('//span[@class="reviewer--sold--ytPeoEy"]/text()')[0].strip()
    return {
        'title': title,
        'price': price,
        'rating': rating,
        'total_reviews': total_reviews,
        'sold_count': sold_count
    }

# Extract product ID from the URL
def extract_product_id(url):
    return url.split('/')[-1].split('.')[0]
# Scrape reviews
def scrape_reviews(product_id, page_num=1, page_size=10):
    headers = {
        'accept': 'application/json, text/plain, */*',
        'accept-language': 'en-IN,en;q=0.9',
        'referer': f'https://www.aliexpress.com/item/{product_id}.html',
        'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36',
    }
    params = {
        'productId': product_id,
        'lang': 'en_US',
        'country': 'US',
        'page': str(page_num),
        'pageSize': str(page_size),
        'filter': 'all',
        'sort': 'complex_default',
    }
    response = requests.get('https://feedback.aliexpress.com/pc/searchEvaluation.do', params=params, headers=headers)
    reviews = response.json()['data']['evaViewList']
    return [review['buyerFeedback'] for review in reviews]
# Save product data and reviews to CSV
def save_to_csv(product_data, reviews, product_id):
    pd.DataFrame([product_data]).to_csv(f'product_{product_id}_data.csv', index=False)
    pd.DataFrame({'reviews': reviews}).to_csv(f'product_{product_id}_reviews.csv', index=False)
    print(f'Saved into: product_{product_id}_data.csv')
    print(f'Saved into: product_{product_id}_reviews.csv')
# Main function
async def main(url):
    content = await get_page_content(url)
    product_data = extract_product_data(content)
    product_id = extract_product_id(url)
    reviews = scrape_reviews(product_id)
    save_to_csv(product_data, reviews, product_id)

# Run the scraper
url = 'https://www.aliexpress.com/item/3256805354456256.html'
asyncio.run(main(url))
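To scrape several products in one run, one option is to loop over a list of URLs instead of the single asyncio.run(main(url)) call above. The list below is a placeholder, and the two-second pause is an arbitrary politeness delay:

product_urls = [
    'https://www.aliexpress.com/item/3256805354456256.html',
    # add more product URLs here
]

async def run_all(urls):
    for product_url in urls:
        await main(product_url)
        await asyncio.sleep(2)  # pause between products

asyncio.run(run_all(product_urls))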
When scraping data, it's important to follow ethical guidelines:

- Review AliExpress's terms of service and robots.txt, and respect the restrictions they state (a robots.txt check is sketched below).
- Rate-limit your requests and add delays between them so you don't overload the servers.
- Collect only the data you need, and handle any personal information in reviews (such as buyer names) responsibly.
- Identify your scraper honestly and prefer official APIs whenever they are available.
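Python's standard-library urllib.robotparser, for example, can check whether a URL is allowed for a given user agent before you fetch it. This is a minimal sketch; 'MyScraperBot' is a placeholder user-agent string:

from urllib.robotparser import RobotFileParser

# Load and parse the site's robots.txt
robots = RobotFileParser('https://www.aliexpress.com/robots.txt')
robots.read()

url = 'https://www.aliexpress.com/item/3256805354456256.html'
if robots.can_fetch('MyScraperBot', url):
    print('Allowed to fetch:', url)
else:
    print('Disallowed by robots.txt:', url)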
Following these guidelines will help you scrape ethically and responsibly, minimizing risks for both users and the AliExpress system.