How to scrape AliExpress data using Python

Scraping data from e-commerce platforms such as AliExpress can be highly useful for gathering product information, monitoring price fluctuations, collecting reviews, and so on. In this article, we will walk through extracting product details (name, price, rating, etc.) and scraping product reviews. We will also demonstrate how to make the scraper dynamic by passing a product URL, automatically retrieving the product ID, and saving the data to a CSV file.

This tutorial uses Playwright to render dynamic content and Requests to fetch review data. We’ll also make sure the scraper is ethical and complies with best practices.

Requirements

Before we begin, ensure you have the following Python libraries installed:

  • Playwright: used to interact with the browser and render dynamic content.
  • Requests: used to fetch reviews via the AliExpress API.
  • lxml: for parsing the HTML content.
  • Pandas: used to save the scraped data to a CSV file.

You can install these packages by running the following commands:


# Install Playwright
pip install playwright


# Install Requests
pip install requests


# Install lxml for parsing HTML
pip install lxml


# Install Pandas for data manipulation and saving
pip install pandas

After installing Playwright, you will also need to install the required browser binaries:


playwright install

This downloads and sets up the browser binaries (Chromium, Firefox, and WebKit) that Playwright needs to function properly.

Step 1. Sending requests with Playwright

AliExpress product pages are dynamic, meaning they load content via JavaScript. To handle this, we’ll use Playwright, a Python library that lets you drive a real browser (headless or headed) and interact with dynamic content.

Here's how you can send a request and navigate to the product page:


from playwright.async_api import async_playwright

async def get_page_content(url):
    async with async_playwright() as p:
        # Launch the browser; pass a proxy dict to launch() if you need one, e.g.
        # proxy={'server': 'http://host:port', 'username': '...', 'password': '...'}
        browser = await p.firefox.launch(headless=False)
        page = await browser.new_page()
        await page.goto(url, timeout=60000)

        # Extract page content
        content = await page.content()
        await browser.close()
        
        return content

# Example URL
url = 'https://www.aliexpress.com/item/3256805354456256.html'
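Since get_page_content is a coroutine, it has to be run through an event loop. A minimal usage sketch:


import asyncio

# Fetch the rendered HTML and preview the first 500 characters
content = asyncio.run(get_page_content(url))
print(content[:500])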

Step 2. Extracting product data

Once we have the page content, we can extract the product data using lxml and XPath queries. We will gather details like the product title, price, rating, number of reviews, and the number of items sold.


from lxml.html import fromstring

def extract_product_data(content):
    parser = fromstring(content)
    
    # Extract product details using XPath
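    # Note: AliExpress class names are auto-generated and change often;
    # if a query returns an empty list, update the selectors below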
    title = parser.xpath('//h1[@data-pl="product-title"]/text()')[0].strip()
    price = parser.xpath('//div[@class="price--current--I3Zeidd product-price-current"]/span/text()')[0].strip()
    rating = ' '.join(parser.xpath('//a[@class="reviewer--rating--xrWWFzx"]/strong/text()')).strip()
    total_reviews = parser.xpath('//a[@class="reviewer--reviews--cx7Zs_V"]/text()')[0].strip()
    sold_count = parser.xpath('//span[@class="reviewer--sold--ytPeoEy"]/text()')[0].strip()

    product_data = {
        'title': title,
        'price': price,
        'rating': rating,
        'total_reviews': total_reviews,
        'sold_count': sold_count
    }

    return product_data


This code uses XPath to extract relevant product details from the HTML content of the page.
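Putting steps 1 and 2 together, a short usage sketch:


import asyncio

content = asyncio.run(get_page_content(url))
print(extract_product_data(content))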

Step 3. Scraping product reviews

AliExpress exposes a separate API endpoint for fetching product reviews. You can extract the product ID from the URL dynamically and use it to call this endpoint with Requests. In the function below:

  1. The product ID is extracted from the product URL dynamically.
  2. We fetch the reviews using the AliExpress review API.
  3. The review texts are extracted and returned as a list.

import requests

def extract_product_id(url):
    # Extract product ID from the URL
    product_id = url.split('/')[-1].split('.')[0]
    return product_id

def scrape_reviews(product_id, page_num=1, page_size=10):
    headers = {
        'accept': 'application/json, text/plain, */*',
        'accept-language': 'en-IN,en;q=0.9',
        'referer': f'https://www.aliexpress.com/item/{product_id}.html',
        'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36',
    }

    params = {
        'productId': product_id,
        'lang': 'en_US',
        'country': 'US',
        'page': str(page_num),
        'pageSize': str(page_size),
        'filter': 'all',
        'sort': 'complex_default',
    }

    response = requests.get('https://feedback.aliexpress.com/pc/searchEvaluation.do', params=params, headers=headers)
    response.raise_for_status()  # fail fast on HTTP errors
    reviews = response.json()['data']['evaViewList']

    # Extract review text only
    review_texts = [review['buyerFeedback'] for review in reviews]
    
    return review_texts
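scrape_reviews fetches a single page of reviews. If you need more, you can loop over page_num until the API returns an empty batch. A minimal sketch (scrape_all_reviews is a helper name of our choosing, and the page cap and one-second delay are arbitrary starting points, not AliExpress requirements):


import time

def scrape_all_reviews(product_id, max_pages=5, delay=1.0):
    all_reviews = []
    for page in range(1, max_pages + 1):
        batch = scrape_reviews(product_id, page_num=page)
        if not batch:
            break  # no more reviews to fetch
        all_reviews.extend(batch)
        time.sleep(delay)  # pause between requests to avoid hammering the API
    return all_reviews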

Step 4. Saving data into a CSV file

After scraping the product details and reviews, we save this data into a CSV file using the pandas library.


import pandas as pd

def save_to_csv(product_data, reviews, product_id):
    # Save product details to CSV
    df_product = pd.DataFrame([product_data])
    df_product.to_csv(f'product_{product_id}_data.csv', index=False)

    # Save reviews to CSV
    df_reviews = pd.DataFrame({'reviews': reviews})
    df_reviews.to_csv(f'product_{product_id}_reviews.csv', index=False)
    
    print(f"Data saved for product {product_id}.")

The product details and reviews are saved into separate CSV files with the product ID included in the filename for easy identification.
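To sanity-check the output, you can read the files back with pandas (the filenames below use the product ID from the example URL):


import pandas as pd

df_product = pd.read_csv('product_3256805354456256_data.csv')
df_reviews = pd.read_csv('product_3256805354456256_reviews.csv')
print(df_product.head())
print(f'{len(df_reviews)} reviews saved')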

Step 5. Dynamic product ID retrieval

Here’s how the complete dynamic workflow works:

  1. Pass any AliExpress product URL.
  2. The product ID is extracted from the URL.
  3. The scraper fetches product data and reviews.
  4. Data is saved into CSV files with the product ID included.

# Extract product ID from the URL
def extract_product_id(url):
    return url.split('/')[-1].split('.')[0]
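The split-based approach works for canonical item URLs, but a regex variant is more defensive if URLs carry extra query strings or path segments. A drop-in alternative (a sketch that assumes the standard /item/<id>.html pattern):


import re

def extract_product_id(url):
    # Match the numeric ID in URLs like
    # https://www.aliexpress.com/item/3256805354456256.html?spm=...
    match = re.search(r'/item/(\d+)\.html', url)
    if not match:
        raise ValueError(f'No product ID found in: {url}')
    return match.group(1)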

Final complete code


import asyncio

from playwright.async_api import async_playwright
from lxml.html import fromstring
import requests
import pandas as pd

# Get page content using Playwright
async def get_page_content(url):
    async with async_playwright() as p:
        # Pass a proxy dict to launch() if you need one
        browser = await p.firefox.launch(headless=False)
        page = await browser.new_page()
        await page.goto(url, timeout=60000)
        content = await page.content()
        await browser.close()
        return content

# Extract product data
def extract_product_data(content):
    parser = fromstring(content)
    title = parser.xpath('//h1[@data-pl="product-title"]/text()')[0].strip()
    price = parser.xpath('//div[@class="price--current--I3Zeidd product-price-current"]/span/text()')[0].strip()
    rating = ' '.join(parser.xpath('//a[@class="reviewer--rating--xrWWFzx"]/strong/text()')).strip()
    total_reviews = parser.xpath('//a[@class="reviewer--reviews--cx7Zs_V"]/text()')[0].strip()
    sold_count = parser.xpath('//span[@class="reviewer--sold--ytPeoEy"]/text()')[0].strip()

    return {
        'title': title,
        'price': price,
        'rating': rating,
        'total_reviews': total_reviews,
        'sold_count': sold_count
    }

# Extract product ID from the URL
def extract_product_id(url):
    return url.split('/')[-1].split('.')[0]

# Scrape reviews
def scrape_reviews(product_id, page_num=1, page_size=10):
    headers = {
        'accept': 'application/json, text/plain, */*',
        'referer': f'https://www.aliexpress.com/item/{product_id}.html',
        'user-agent': 'Mozilla/5.0'
    }
    params = {
        'productId': product_id,
        'lang': 'en_US',
        'page': str(page_num),
        'pageSize': str(page_size),
    }
    response = requests.get('https://feedback.aliexpress.com/pc/searchEvaluation.do', params=params, headers=headers)
    response.raise_for_status()
    reviews = response.json()['data']['evaViewList']
    return [review['buyerFeedback'] for review in reviews]

# Save product data and reviews to CSV
def save_to_csv(product_data, reviews, product_id):
    pd.DataFrame([product_data]).to_csv(f'product_{product_id}_data.csv', index=False)
    pd.DataFrame({'reviews': reviews}).to_csv(f'product_{product_id}_reviews.csv', index=False)
    print(f'Saved into: product_{product_id}_data.csv')
    print(f'Saved into: product_{product_id}_reviews.csv')

# Main function
async def main(url):
    content = await get_page_content(url)
    product_data = extract_product_data(content)
    product_id = extract_product_id(url)
    reviews = scrape_reviews(product_id)
    save_to_csv(product_data, reviews, product_id)

# Run the scraper
url = 'https://www.aliexpress.com/item/3256805354456256.html'
asyncio.run(main(url))

Ethical considerations

When scraping data, it's important to follow ethical guidelines:

  1. Respect AliExpress’s terms of service: Always check the terms of service before scraping a website. Avoid violating their rules to prevent getting banned.
  2. Throttle your requests: Sending too many requests in a short time can overload the servers. Add delays between requests (see the sketch below).
  3. Avoid personal data: Do not collect or scrape personal information without proper consent.

Following these guidelines will help you scrape ethically and responsibly, minimizing risks for both users and the AliExpress system.
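For point 2, one simple approach is to wrap requests.get with a randomized delay. A sketch (polite_get is a hypothetical helper, and the 1-3 second range is an arbitrary starting point, not a documented limit):


import random
import time

import requests

def polite_get(url, min_delay=1.0, max_delay=3.0, **kwargs):
    # Sleep a random interval before each request to spread out traffic
    time.sleep(random.uniform(min_delay, max_delay))
    return requests.get(url, **kwargs)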
