Step-by-step guide to scraping Google News with Python

Scraping Google News is an invaluable way to gather the most recent headlines, monitor news trends, and perform sentiment analysis on current events. In this article, we will walk through the process of scraping Google News with Python, using the requests library to fetch the page content and lxml to parse the HTML and extract the data we need. By the end of this tutorial, you will know how to extract news headlines and their respective links from Google News into a structured JSON format.

Step 1: Setting Up the Environment

Before we begin, make sure you have Python installed on your system. You can install the required libraries by running the following commands:

pip install requests 
pip install lxml

These libraries will allow us to make HTTP requests and parse the HTML content of the webpage.

Step 2: Understanding the Target URL and XPath Structure

We'll be scraping the Google News page at the following URL:

URL = "https://news.google.com/topics/CAAqKggKIiRDQkFTRlFvSUwyMHZNRGRqTVhZU0JXVnVMVWRDR2dKSlRpZ0FQAQ?hl=en-US&gl=US&ceid=US%3Aen"

This page contains multiple news items, each with a main headline and related articles. The XPath structure for these items is as follows:

  • Main News Container: //c-wiz[@jsrenderer="ARwRbe"]
  • Main News Title: //c-wiz[@jsrenderer="ARwRbe"]/c-wiz/div/article/a/text()
  • Main News Link: //c-wiz[@jsrenderer="ARwRbe"]/c-wiz/div/article/a/@href
  • Related News Container: //c-wiz[@jsrenderer="ARwRbe"]/c-wiz/div/div/article
  • Related News Title: //c-wiz[@jsrenderer="ARwRbe"]/c-wiz/div/div/article/a/text()
  • Related News Link: //c-wiz[@jsrenderer="ARwRbe"]/c-wiz/div/div/article/a/@href

The HTML structure of Google News is consistent across pages of this kind, so the XPath expressions above should work on other topic pages as well. Keep in mind, though, that Google can change its markup at any time, so it is worth re-verifying these selectors periodically.

Step 3: Fetching Google News content

We'll start by fetching the content of the Google News page using the requests library. Here's the code to fetch the page content:

import requests

url = "https://news.google.com/topics/CAAqKggKIiRDQkFTRlFvSUwyMHZNRGRqTVhZU0JXVnVMVWRDR2dKSlRpZ0FQAQ?hl=en-US&gl=US&ceid=US%3Aen"
response = requests.get(url)

if response.status_code == 200:
    page_content = response.content
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")

This code sends a GET request to the Google News URL and, if the request succeeds, stores the HTML content of the page in the page_content variable.
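If you prefer the request to fail fast rather than hang, a more defensive variant, shown here as an optional sketch using only standard requests features, adds a timeout and raises on HTTP error codes:

import requests

try:
    # Fail fast if Google News is slow or returns an error status
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    page_content = response.content
except requests.RequestException as e:
    print(f"Request failed: {e}")
    raise SystemExit(1)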

Step 4: Parsing the HTML content with lxml

With the HTML content in hand, we can use lxml to parse the page and extract the news headlines and links.

from lxml import html

# Parse the HTML content
parser = html.fromstring(page_content)
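Before extracting anything, it is worth sanity-checking that the container XPath from Step 2 actually matches elements on the fetched page. A quick count like the one below (a minimal check) printing 0 usually means Google has changed its markup:

# Quick sanity check: count the main news containers on the page
containers = parser.xpath('//c-wiz[@jsrenderer="ARwRbe"]')
print(f"Found {len(containers)} main news containers")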

Step 5: Extracting news data

Google News organizes its articles in specific containers. We'll first extract these containers using their XPath and then iterate through them to extract the individual news headlines and links.

Extracting main news articles

The main news articles are located under the following XPath:

main_news_elements = parser.xpath('//c-wiz[@jsrenderer="ARwRbe"]')

We can now loop through the first 10 elements and extract the titles and links, checking that both exist before storing them:

news_data = []

for element in main_news_elements[:10]:
    title = element.xpath('.//c-wiz/div/article/a/text()')
    link = element.xpath('.//c-wiz/div/article/a/@href')

    # Ensure both values exist before appending to the list
    if title and link:
        news_data.append({
            "main_title": title[0],
            "main_link": "https://news.google.com" + link[0][1:],
        })

Extracting related articles within each main news element

Each main news element contains a subsection where related news is present. We can extract it with a similar approach, placing the logic inside the same loop. Here is the expanded loop body, where the final append now also carries the related articles:

for element in main_news_elements[:10]:
    title = element.xpath('.//c-wiz/div/article/a/text()')
    link = element.xpath('.//c-wiz/div/article/a/@href')

    # Skip elements where the title or link is missing
    if not (title and link):
        continue

    related_articles = []
    related_news_elements = element.xpath('.//c-wiz/div/div/article')

    for related_element in related_news_elements:
        related_title = related_element.xpath('.//a/text()')
        related_link = related_element.xpath('.//a/@href')
        if related_title and related_link:
            related_articles.append({
                "title": related_title[0],
                "link": "https://news.google.com" + related_link[0][1:],
            })

    news_data.append({
        "main_title": title[0],
        "main_link": "https://news.google.com" + link[0][1:],
        "related_articles": related_articles
    })

Step 6: Saving the data as JSON

After extracting the data, we can save it in a JSON file for later use.

import json

with open('google_news_data.json', 'w') as f:
    json.dump(news_data, f, indent=4)

This code will create a file named google_news_data.json containing all the scraped news headlines and their corresponding links.
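To verify the output, you can load the file back and inspect the first record (a minimal spot check, assuming at least one item was scraped):

import json

with open('google_news_data.json') as f:
    saved = json.load(f)

# Print the first headline and its link as a spot check
if saved:
    print(saved[0]["main_title"], "->", saved[0]["main_link"])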

Using proxies

When scraping large amounts of data, especially from high-traffic sites like Google News, you might encounter issues like IP blocking or rate limiting. To avoid this, you can use proxies. Proxies allow you to route your requests through different IP addresses, making it harder for the website to detect and block your scraping activities.

For this tutorial, you can use a proxy by modifying the requests.get call:

proxies = {
    "http": "http://your_proxy_ip:port",
    "https": "https://your_proxy_ip:port",
}

response = requests.get(url, proxies=proxies)

If you're working with a service provider that manages proxy rotation, you only need to configure the service in your requests. The provider will handle the rotation and IP pool management on their end.
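For example, with a provider that exposes a single rotating endpoint, the configuration usually looks like the sketch below; the host, port, and credentials are placeholders, not a real service:

# Hypothetical rotating-proxy endpoint; substitute your provider's details
proxy_user = "username"
proxy_pass = "password"
proxy_host = "rotating.example-proxy.com"
proxy_port = 8000

proxies = {
    "http": f"http://{proxy_user}:{proxy_pass}@{proxy_host}:{proxy_port}",
    "https": f"http://{proxy_user}:{proxy_pass}@{proxy_host}:{proxy_port}",
}

response = requests.get(url, proxies=proxies)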

Customizing request headers

Sometimes, websites may block requests that don't have the proper headers, such as a user-agent string that identifies the request as coming from a browser. You can customize your headers to avoid detection:

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-IN,en;q=0.9',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'priority': 'u=0, i',
    'sec-ch-ua': '"Not)A;Brand";v="99", "Google Chrome";v="127", "Chromium";v="127"',
    'sec-ch-ua-arch': '"x86"',
    'sec-ch-ua-bitness': '"64"',
    'sec-ch-ua-full-version-list': '"Not)A;Brand";v="99.0.0.0", "Google Chrome";v="127.0.6533.72", "Chromium";v="127.0.6533.72"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-model': '""',
    'sec-ch-ua-platform': '"Linux"',
    'sec-ch-ua-platform-version': '"6.5.0"',
    'sec-ch-ua-wow64': '?0',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36',
}

response = requests.get(url, headers=headers)
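If you plan to make more than one request, a small optional refinement is to attach the headers to a requests.Session once, so every call reuses them along with connection pooling:

import requests

session = requests.Session()
session.headers.update(headers)  # the headers dict defined above

# Every request through this session now carries the same headers
response = session.get(url)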

Complete code example

Here’s the full code, combining all the steps:

import requests
import urllib3
from lxml import html
import json

# Suppress TLS warnings triggered by verify=False below
urllib3.disable_warnings()

# URL and headers
url = "https://news.google.com/topics/CAAqKggKIiRDQkFTRlFvSUwyMHZNRGRqTVhZU0JXVnVMVWRDR2dKSlRpZ0FQAQ?hl=en-US&gl=US&ceid=US%3Aen"
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-IN,en;q=0.9',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'priority': 'u=0, i',
    'sec-ch-ua': '"Not)A;Brand";v="99", "Google Chrome";v="127", "Chromium";v="127"',
    'sec-ch-ua-arch': '"x86"',
    'sec-ch-ua-bitness': '"64"',
    'sec-ch-ua-full-version-list': '"Not)A;Brand";v="99.0.0.0", "Google Chrome";v="127.0.6533.72", "Chromium";v="127.0.6533.72"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-model': '""',
    'sec-ch-ua-platform': '"Linux"',
    'sec-ch-ua-platform-version': '"6.5.0"',
    'sec-ch-ua-wow64': '?0',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36',
}

# Proxies configuration (replace with your proxy details)
proxy = 'ip:port'
proxies = {
    "http": f"http://{proxy}",
    "https": f"https://{proxy}",
}

# Fetch the page content with the specified headers and proxies
response = requests.get(url, headers=headers, proxies=proxies, verify=False)

# Check if the request was successful
if response.status_code == 200:
    page_content = response.content
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")
    exit()

# Parse the HTML content using lxml
parser = html.fromstring(page_content)

# Extract the main news containers using the XPath from Step 2
main_news_elements = parser.xpath('//c-wiz[@jsrenderer="ARwRbe"]')

# Initialize a list to store the extracted news data
news_data = []

# Loop through the first 10 main news elements
for element in main_news_elements[:10]:
    # Extract the main news title and link (xpath returns lists)
    title = element.xpath('.//c-wiz/div/article/a/text()')
    link = element.xpath('.//c-wiz/div/article/a/@href')

    # Skip elements where the title or link is missing
    if not (title and link):
        continue

    # Initialize a list to store related articles for this main news
    related_articles = []

    # Extract related news elements within the same block
    related_news_elements = element.xpath('.//c-wiz/div/div/article')

    # Loop through each related news element and extract the title and link
    for related_element in related_news_elements:
        related_title = related_element.xpath('.//a/text()')
        related_link = related_element.xpath('.//a/@href')
        if related_title and related_link:
            related_articles.append({
                "title": related_title[0],
                "link": "https://news.google.com" + related_link[0][1:],
            })

    # Append the main news and its related articles to the news_data list
    news_data.append({
        "main_title": title[0],
        "main_link": "https://news.google.com" + link[0][1:],
        "related_articles": related_articles
    })

# Save the extracted data to a JSON file
with open("google_news_data.json", "w") as json_file:
    json.dump(news_data, json_file, indent=4)

print('Data extraction complete. Saved to google_news_data.json')

Scraping Google News with Python, using the requests and lxml libraries, makes detailed analysis of news trends straightforward. Implementing proxies and configuring request headers is crucial to avoid blocks and keep the scraper stable. Ideal proxies for this purpose include IPv4 and IPv6 datacenter proxies and ISP proxies, which offer high speeds and low latency; dynamic residential proxies are also highly effective thanks to their higher trust factor.
