How to scrape YouTube using Python: a step-by-step guide

Scraping data from YouTube can be challenging due to its dynamic content and anti-scraping measures. However, with the right tools and techniques, you can efficiently extract useful information. In this article, we'll walk you through the process of scraping YouTube video data using Python, Playwright, and lxml.

Environment setup

Install the necessary libraries using pip:

pip install playwright 
pip install lxml

Install the Playwright browser binaries:

playwright install

To install only the Chromium browser binaries, use the following command:

playwright install chromium

For web scraping YouTube data with Python, you'll primarily need the following libraries:

  1. Playwright: A powerful library for automating headless browsers, enabling you to interact with web pages as if you were a real user;
  2. lxml: A fast and feature-rich library for processing XML and HTML in Python, supporting XPath for querying documents;
  3. CSV Module: A built-in Python library for saving extracted data into a CSV file.

Step 1: Import required libraries

import asyncio
from playwright.async_api import Playwright, async_playwright
from lxml import html
import csv

Step 2: Headless browser automation

Launch a headless browser with Playwright, navigate to the YouTube video URL, and wait for the page to fully load.

Scroll the page to load more comments.

browser = await playwright.chromium.launch(headless=True)
context = await browser.new_context()
page = await context.new_page()

# Navigating to the YouTube video URL
await page.goto("https://www.youtube.com/watch?v=Ct8Gxo8StBU", wait_until="networkidle")

# Scrolling down to load more comments
for _ in range(20):
    await page.mouse.wheel(0, 200)
    await asyncio.sleep(0.2)

# Giving some time for additional content to load
await page.wait_for_timeout(1000)
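
If a fixed number of scrolls doesn't load enough comments, a common alternative is to keep scrolling until the page height stops growing. The following sketch is not part of the original script; it reuses the same page object and caps the loop so it can't scroll forever:

# Scrolling until the page height stops growing (capped at 30 attempts)
previous_height = 0
for _ in range(30):
    current_height = await page.evaluate("document.documentElement.scrollHeight")
    if current_height == previous_height:
        break
    previous_height = current_height
    await page.mouse.wheel(0, current_height)
    await page.wait_for_timeout(1000)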

Step 3: HTML content parsing

Extract the page's HTML content using Playwright and parse it with lxml.

# Extracting the page content
page_content = await page.content()

# Parsing the HTML content
parser = html.fromstring(page_content)

Step 4: Data extraction

Extract the required data points (e.g., title, channel, comments) using XPath expressions.

Collect all relevant data, including video metadata and comments.

# Extracting video data
title = parser.xpath('//div[@id="title"]/h1/yt-formatted-string/text()')[0]
channel = parser.xpath('//yt-formatted-string[@id="text"]/a/text()')[0]
channel_link = 'https://www.youtube.com' + parser.xpath('//yt-formatted-string[@id="text"]/a/@href')[0]
posted = parser.xpath('//yt-formatted-string[@id="info"]/span/text()')[2]
total_views = parser.xpath('//yt-formatted-string[@id="info"]/span/text()')[0]
total_comments = parser.xpath('//h2[@id="count"]/yt-formatted-string/span/text()')[0]
comments_list = parser.xpath('//yt-attributed-string[@id="content-text"]/span/text()')
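
Since YouTube changes its markup frequently, any of these XPath queries can return an empty list and make the [0] indexing raise an IndexError. A small helper such as the hypothetical first_or_default below (not part of the original script) keeps the extraction from crashing:

# Hypothetical helper: return the first XPath match, or a default if nothing matched
def first_or_default(results, default="N/A"):
    return results[0] if results else default

title = first_or_default(parser.xpath('//div[@id="title"]/h1/yt-formatted-string/text()'))
channel = first_or_default(parser.xpath('//yt-formatted-string[@id="text"]/a/text()'))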

Step 5: Saving data

Save the extracted data into a CSV file for easy analysis and storage.

# Saving the data to a CSV file
with open('youtube_video_data.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(["Title", "Channel", "Channel Link", "Posted", "Total Views", "Total Comments", "Comments"])
    writer.writerow([title, channel, channel_link, posted, total_views, total_comments, ", ".join(comments_list)])
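
Joining every comment into one CSV cell keeps the output to a single row, but it can be awkward to analyze later. If you prefer one comment per row, a simple variation (writing to a hypothetical second file, youtube_video_comments.csv) looks like this:

# Optional: write each comment as its own row in a separate CSV file
with open('youtube_video_comments.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(["Comment"])
    for comment in comments_list:
        writer.writerow([comment])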

Proxy implementation

Proxies play a crucial role in web scraping, especially when dealing with large-scale data extraction or sites with strict anti-bot measures like YouTube. Here's how proxies are implemented in the Playwright script:

Proxy setup:

  1. The proxy parameter in playwright.chromium.launch() is used to route all browser traffic through a specified proxy server.
  2. The proxy server details, including the server address, username, and password, must be configured.

browser = await playwright.chromium.launch(
    headless=True,
    proxy={"server": "http://your_proxy_ip:port", "username": "your_username", "password": "your_password"}
)

Benefits of using proxies:

  • IP masking: proxies hide your original IP address, reducing the likelihood of being blocked.
  • Request distribution: by rotating proxies, you can distribute requests across different IP addresses, mimicking traffic from multiple users (see the sketch after this list).
  • Access restricted content: proxies can help bypass regional restrictions or access content that might be limited to certain IP ranges.
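
As a concrete illustration of request distribution, the sketch below picks a proxy at random from a small pool on each browser launch. The endpoints and credentials are placeholders you would replace with your own:

import random

# Hypothetical proxy pool; replace with your own endpoints and credentials
PROXIES = [
    {"server": "http://proxy1_ip:port", "username": "your_username", "password": "your_password"},
    {"server": "http://proxy2_ip:port", "username": "your_username", "password": "your_password"},
]

# Use a different proxy for each launch to spread requests across IP addresses
proxy = random.choice(PROXIES)
browser = await playwright.chromium.launch(headless=True, proxy=proxy)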

This implementation ensures your scraping activities are less likely to be detected and blocked by YouTube's anti-bot mechanisms.

Complete code implementation

Below is the complete code to scrape YouTube video data using Playwright and lxml, including proxy implementation.

import asyncio
from playwright.async_api import Playwright, async_playwright
from lxml import html
import csv

# Asynchronous function to run Playwright and extract data
async def run(playwright: Playwright) -> None:
    # Launching headless browser with proxy settings
    browser = await playwright.chromium.launch(
        headless=True,
        proxy={"server": "http://your_proxy_ip:port", "username": "your_username", "password": "your_password"}
    )
    context = await browser.new_context()
    page = await context.new_page()

    # Navigating to the YouTube video URL
    await page.goto("https://www.youtube.com/watch?v=Ct8Gxo8StBU", wait_until="networkidle")

    # Scrolling down to load more comments
    for _ in range(20):
        await page.mouse.wheel(0, 200)
        await asyncio.sleep(0.2)
    
    # Giving some time for additional content to load
    await page.wait_for_timeout(1000)
    
    # Extracting the page content
    page_content = await page.content()

    # Closing browser
    await context.close()
    await browser.close()

    # Parsing the HTML content
    parser = html.fromstring(page_content)

    # Extracting video data
    title = parser.xpath('//div[@id="title"]/h1/yt-formatted-string/text()')[0]
    channel = parser.xpath('//yt-formatted-string[@id="text"]/a/text()')[0]
    channel_link = 'https://www.youtube.com' + parser.xpath('//yt-formatted-string[@id="text"]/a/@href')[0]
    posted = parser.xpath('//yt-formatted-string[@id="info"]/span/text()')[2]
    total_views = parser.xpath('//yt-formatted-string[@id="info"]/span/text()')[0]
    total_comments = parser.xpath('//h2[@id="count"]/yt-formatted-string/span/text()')[0]
    comments_list = parser.xpath('//yt-attributed-string[@id="content-text"]/span/text()')

    # Saving the data to a CSV file
    with open('youtube_video_data.csv', 'w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(["Title", "Channel", "Channel Link", "Posted", "Total Views", "Total Comments", "Comments"])
        writer.writerow([title, channel, channel_link, posted, total_views, total_comments, ", ".join(comments_list)])

# Running the asynchronous function
async def main():
    async with async_playwright() as playwright:
        await run(playwright)

asyncio.run(main())

When setting up an environment for scraping data from YouTube, it's important to use proxies to work around platform restrictions. Choose proxy servers carefully to minimize the risk of blocks and keep your operation anonymous. Static ISP proxies are a strong option thanks to their fast connection speeds and stability, while residential proxies provide dynamic IP addresses with a high trust factor, making them less likely to be flagged by YouTube's security systems. Finally, follow ethical standards during data collection and avoid violating YouTube's terms of service.
