How to scrape Google Trends data using Python


Using Python and Playwright to scrape data from Google Trends enables a detailed examination of keyword popularity and the monitoring of trend shifts over time. This approach delivers crucial insights for marketing analytics.

Prerequisites

Before diving into the code, ensure you have the following tools installed:

  • Python 3.7+;
  • Playwright library.

You can install Playwright using pip:

pip install playwright

To use Playwright with asynchronous code, you’ll also need the asyncio library, which is included in Python 3.7+ by default.

Configuring Playwright for working with Google Trends

We'll use Playwright, a powerful browser automation tool, to navigate the Google Trends website and download CSV files containing trend data. This tutorial will guide you through the entire process.

Playwright installation

First, ensure Playwright is installed:

playwright install

If you don’t want to install all the browsers, use this command to install the Chromium browser only:

playwright install chromium

Proxy configuration

When scraping platforms like Google, which actively counter bot activity, using proxies is essential. Proxies enable IP rotation, helping to reduce the risk of getting blocked. In our script, we utilize private proxies to route our requests.

proxy = {
    "server": "IP:PORT",
    "username": "your_username",
    "password": "your_password"
}

Replace IP, PORT, your_username, and your_password with your proxy server’s actual credentials.
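Hard-coding credentials is easy to leak into version control. As a sketch, the same dictionary can be built from environment variables instead (the variable names PROXY_SERVER, PROXY_USERNAME, and PROXY_PASSWORD here are assumptions; use whatever naming fits your setup):

```python
import os

# Read proxy credentials from the environment, falling back to the
# placeholder values used throughout this tutorial.
proxy = {
    "server": os.environ.get("PROXY_SERVER", "IP:PORT"),
    "username": os.environ.get("PROXY_USERNAME", "your_username"),
    "password": os.environ.get("PROXY_PASSWORD", "your_password"),
}
```

The resulting dictionary can be passed directly as the `proxy` argument to `playwright.chromium.launch()`.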

Step-by-step process of working with Playwright

In this example, we first navigate to google.com to bypass any potential blocks before heading to the Google Trends page. This is done to mimic normal user behavior and avoid detection.

Step 1: Preparing to work with Google Trends

This step involves preliminary actions to prevent being flagged and blocked by Google:

  • Launching the browser: start a Chromium instance configured with proxy settings. Routing traffic through proxies disguises the scraping activity as regular browser usage and reduces the chances of detection;
  • Navigating to Google: visiting google.com first presents what Google’s tracking systems perceive as a regular new user. This simple step lowers the likelihood of subsequent activity being classified as bot-like, avoiding an immediate block.

import asyncio
from playwright.async_api import Playwright, async_playwright

async def run(playwright: Playwright) -> None:
    # Launching the browser with proxy settings
    browser = await playwright.chromium.launch(headless=False, proxy={
        "server": "IP:PORT",
        "username": "your_username",
        "password": "your_password"
    })
    
    # Creating a new browser context
    context = await browser.new_context()
    
    # Opening a new page
    page = await context.new_page()
    
    # Visiting Google to mimic normal browsing
    await page.goto("https://google.com")

Step 2: Navigating and downloading data from Google Trends

Next, navigate directly to the Google Trends page that holds the required data. Google Trends can export the data directly as a CSV file, which simplifies extraction. Automate clicking the “Download” button: once it becomes visible, the script clicks it and the CSV download begins, so no manual intervention is needed.

    # Navigating to Google Trends
    await page.goto("https://trends.google.com/trends/explore?q=%2Fg%2F11bc6c__s2&date=now%201-d&geo=US&hl=en-US")
    
    # Waiting for the download button and clicking it
    async with page.expect_download() as download_info:
        await page.get_by_role("button", name="file_download").first.click()
    
    # Handling the download
    download = await download_info.value
    print(download.suggested_filename)

Step 3: Saving data and ending the session

Playwright stores the download in a temporary location; call save_as() to copy the CSV file to a directory of your choice on your local device.

    # Saving the downloaded file
    await download.save_as("/path/to/save/" + download.suggested_filename)
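After saving, it’s worth sanity-checking that the file actually contains trend rows. As a sketch (the header layout shown below is an assumption about the Trends export format, not something the export guarantees), the CSV can be parsed with Python’s standard csv module:

```python
import csv
import io

def read_trends_csv(text: str) -> list:
    """Parse Google Trends CSV text into rows, skipping blank lines."""
    reader = csv.reader(io.StringIO(text))
    return [row for row in reader if row]

# Example with a made-up snippet mimicking the Trends export layout:
sample = "Category: All categories\n\nTime,python: (United States)\n2024-01-01T00,75\n"
rows = read_trends_csv(sample)
```

In the actual script you would read the saved file with `open(destination_path)` instead of the inline sample string.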

Complete code example

Here’s the complete code for downloading Google Trends data as a CSV file using Playwright:

import asyncio
import os
import re
from playwright.async_api import Playwright, async_playwright


async def run(playwright: Playwright) -> None:
    # Launch browser with proxy settings
    browser = await playwright.chromium.launch(headless=False, proxy={
        "server": "IP:PORT",
        "username": "your_username",
        "password": "your_password"
    })

    # Create a new browser context
    context = await browser.new_context()

    # Open a new page
    page = await context.new_page()

    # Visit Google to avoid detection
    await page.goto("https://google.com")

    # Navigate to Google Trends
    await page.goto("https://trends.google.com/trends/explore?q=%2Fg%2F11bc6c__s2&date=now%201-d&geo=US&hl=en-US")

    # Click the download button
    async with page.expect_download() as download_info:
        await page.get_by_role("button", name=re.compile(r"file_download")).first.click()

    # Save the downloaded file
    download = await download_info.value
    destination_path = os.path.join("path/to/save", download.suggested_filename)
    await download.save_as(destination_path)

    # Close the context and browser
    await context.close()
    await browser.close()


async def main() -> None:
    async with async_playwright() as playwright:
        await run(playwright)


asyncio.run(main())

By following this guide, you can efficiently download trend data, manage proxy rotation, and bypass bot-protection mechanisms. Reliable proxy servers are crucial for avoiding blocks. Residential proxies are highly recommended: they provide dynamic IP addresses and require no rotation configuration. Static ISP proxies are also effective; purchase the required number of IPs and set up regular IP rotation in your script. Either choice minimizes the risk of blocks and captchas, making scraping faster and smoother.
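The rotation idea for static ISP proxies can be sketched as a simple cycle over a pool of endpoints (the pool contents below are placeholders; supply your own proxy details):

```python
import itertools

# Hypothetical proxy pool -- replace with your own endpoints.
proxy_pool = [
    {"server": "IP1:PORT", "username": "your_username", "password": "your_password"},
    {"server": "IP2:PORT", "username": "your_username", "password": "your_password"},
    {"server": "IP3:PORT", "username": "your_username", "password": "your_password"},
]

# Cycle through proxies so each browser launch gets the next IP.
proxy_cycle = itertools.cycle(proxy_pool)

def next_proxy() -> dict:
    """Return the next proxy configuration for playwright.chromium.launch()."""
    return next(proxy_cycle)
```

Each scraping run (or each browser launch) calls `next_proxy()` and passes the result as the `proxy` argument, so consecutive requests leave from different IPs.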
