How to Scrape Vital YouTube Data With Python

YouTube content creators need to analyze how each video performs, which means reviewing comments as well as comparing their own content with other videos in the same or neighboring categories.

Going through all of these videos manually is tedious and error-prone, and this is exactly where a dedicated script comes in handy. In this guide, we will show you how to scrape YouTube by building a script that automates the data gathering you would otherwise do by hand.

API vs Web-Scraping: When to Use Which

Accessing YouTube data usually comes down to two options: calling the official YouTube Data API or scraping the website with tools such as Selenium. The API is designed and supported for this purpose, offering stable access to video metadata, channel statistics, comments, and search results with documented quotas and predictable responses. In most cases, development teams should rely on the API as the primary integration layer.
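
As an illustration, pulling core video statistics through the Data API takes only a few lines. Below is a minimal sketch using the requests library against the public videos endpoint; the API key and video ID are placeholders you would supply yourself:

import requests

API_KEY = "YOUR_API_KEY"    # placeholder: create a key in Google Cloud Console
VIDEO_ID = "XXXXXXXXXXX"    # placeholder video ID

# Ask the Data API v3 "videos" endpoint for metadata and statistics
response = requests.get(
    "https://www.googleapis.com/youtube/v3/videos",
    params={"part": "snippet,statistics", "id": VIDEO_ID, "key": API_KEY},
    timeout=10,
)
response.raise_for_status()
items = response.json().get("items", [])
if items:
    snippet = items[0]["snippet"]
    stats = items[0]["statistics"]
    print(snippet["title"], "by", snippet["channelTitle"])
    print("views:", stats.get("viewCount"), "likes:", stats.get("likeCount"))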

Web-scraping becomes relevant only when the UI exposes fields that the API does not. Typical gaps include:

  • Rich description formatting and inline links;
  • Lazily loaded or filtered comments and replies;
  • Live engagement widgets and experimental UI-only elements.

Scraping remains a last resort. It is slower, more fragile (front-end changes can break selectors and the whole scraper as a result), and introduces ongoing compliance and maintenance risks.

Important disclaimer: Scraping YouTube may violate YouTube’s Terms of Service and can trigger rate limiting or IP blocking. Always respect local laws and platform policies, avoid aggressive request patterns, and prefer official API access whenever it covers the required data.

Understanding Data Structure to Scrape Youtube with Python

Before diving into web-scraping YouTube with Python, it helps to understand how the site structures its data. The platform exposes a wide range of data points covering both user activity and video statistics. Key parameters include video titles and descriptions, tags, view counts, likes and comments, as well as channel and playlist information. These elements matter not only to content marketers and creators, but also to analytics pipelines that assess video performance and inform content strategy.

As described above, the YouTube Data API already exposes most of these metrics programmatically. It also provides subscriber counts and the list of videos on a channel, which covers a good share of typical analysis and integration needs.
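
For instance, channel-level figures such as subscriber and video counts come from the channels endpoint. A minimal sketch, again with placeholder credentials:

import requests

API_KEY = "YOUR_API_KEY"                  # placeholder
CHANNEL_ID = "UCxxxxxxxxxxxxxxxxxxxxxx"   # placeholder channel ID

# The "channels" endpoint exposes subscriberCount, videoCount and viewCount
response = requests.get(
    "https://www.googleapis.com/youtube/v3/channels",
    params={"part": "statistics", "id": CHANNEL_ID, "key": API_KEY},
    timeout=10,
)
response.raise_for_status()
stats = response.json()["items"][0]["statistics"]
print("subscribers:", stats.get("subscriberCount"))
print("videos:", stats.get("videoCount"))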

Yet certain UI-only elements remain unavailable through the API and can only be retrieved via web-scraping. For example, detailed viewer-engagement signals such as comment sentiment, rich description formatting, or how viewers interact with specific parts of the page require scraping the rendered YouTube pages. This technique is usually more complicated and more fragile, because it depends on ever-changing front-end markup and must respect the platform's policies and technical limits on automated access.

In the following sections, the article demonstrates how to build a Python script that scrapes selected YouTube data efficiently, while staying aligned with the constraints and trade-offs outlined above.

Using Proxies to Avoid Detection While Scraping YouTube

To scrape Youtube videos with Python at any meaningful scale, proxies are essential for avoiding IP bans and anti-bot countermeasures. Here are the main types and what they offer:

  1. Residential proxies use genuine IP addresses assigned to real households, so websites treat the connections as authentic. For scraping Youtube data, where a high level of trust is needed to avoid getting flagged, they are usually the best option: the scraper behaves like a genuine user, which minimizes the chance of being detected as a bot.
  2. ISP proxies offer a middle ground between residential and datacenter proxies. They use authentic IP addresses issued by internet service providers, which are difficult to flag as proxies. This makes ISP proxies very effective when you need to scrape Youtube search results, a task that demands both authenticity and solid performance.
  3. Datacenter proxies boast the highest speeds, but platforms like YouTube can identify them easily because their IP ranges belong to large data centers. The risk of being blocked while scraping is therefore high, even though they are efficient and easy to use. They are best suited to cases where raw speed outweighs the risk of detection.
  4. Mobile proxies route connections through devices on cellular networks and therefore look the most legitimate. Because carriers rotate mobile IPs frequently, these proxies are the least likely to get flagged or blocked. The trade-off is that their speed can be noticeably lower than that of the other types.

Used strategically, these proxy types make it possible to scrape data from Youtube with a much lower risk of detection and to keep access stable over time (keeping in mind the Terms of Service caveat above). Understanding their trade-offs will also help when you need to choose a proxy for scraping.

Additional Proxy Tips and IP Rotation

Beyond choosing the right proxy type, it is important to rotate IP addresses and vary your behavior:

  • rotate the proxy/IP after each video or after a small batch of requests,
  • introduce random delays between actions (for example, between 2 and 7 seconds),
  • avoid running long scraping sessions from a single IP,
  • combine IP rotation with different User-Agent strings and window sizes.

IP rotation helps to distribute your requests across multiple identities, which reduces the chance that YouTube will detect and block your scraper based on repeated traffic from a single address.
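
As a rough illustration of these tips, the sketch below rotates through a small pool of proxies and User-Agent strings, creates a fresh driver for each video, and sleeps a random 2–7 seconds between actions. The proxy URLs, User-Agent strings and video IDs are placeholders:

import random
import time
from selenium.webdriver.chrome.options import Options
from seleniumwire import webdriver as wiredriver

# Placeholder pools: substitute your own proxies and realistic User-Agent strings
PROXIES = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0 Safari/537.36",
]
VIDEO_IDS = ["XXXXXXXXXXX", "YYYYYYYYYYY"]  # placeholder video IDs


def make_driver(proxy: str, user_agent: str):
    """Create a Chrome driver bound to one proxy and one User-Agent."""
    options = Options()
    options.add_argument(f"--user-agent={user_agent}")
    seleniumwire_options = {"proxy": {"http": proxy, "https": proxy}}
    return wiredriver.Chrome(options=options, seleniumwire_options=seleniumwire_options)


for video_id in VIDEO_IDS:
    # Rotate identity after every video and pause a random 2-7 seconds between actions
    driver = make_driver(random.choice(PROXIES), random.choice(USER_AGENTS))
    try:
        driver.get(f"https://www.youtube.com/watch?v={video_id}")
        time.sleep(random.uniform(2, 7))
        # ... extract whatever fields you need here ...
    finally:
        driver.quit()
    time.sleep(random.uniform(2, 7))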

Creating a Scraper to Extract Data From YouTube

Let’s start building the script. A few packages must be installed first: Selenium itself and Selenium Wire, an extension of Selenium that adds proxy support and a few extra classes and modules. Run the following command in your terminal to install them:

pip install selenium-wire selenium blinker==1.7.0

Let us now turn our attention to the imports.

Step 1: Importing libraries and packages

At this point, we load the libraries and packages the script needs to locate and interact with the relevant web elements, along with the modules used for data processing and timing control.

from seleniumwire import webdriver as wiredriver  # selenium-wire driver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import json
import time
The json module converts the scraped data into properly formatted JSON for output, while the time module adds pauses between actions so the scraper's behavior is less predictable and less bot-like.

WebDriverWait, together with expected_conditions, ensures the elements we want to read have actually loaded before we touch them. The remaining imports are the Selenium classes and submodules whose roles are explained later, in the relevant parts of the code.

Step 2: Setting up Selenium Chrome Driver

When a Selenium instance is launched from a Python script, it uses our own IP address for every request it makes. That is a problem on sites like YouTube that actively detect and restrict automated access (it is also worth reviewing the site's robots.txt file to understand what is disallowed). As a result, your IP can receive temporary bans when trying to scrape Youtube channels, for example.

To mitigate these problems, we route the traffic through a proxy. First, we store the proxy details in three variables. Then we define a chrome_options object and a seleniumwire_options dictionary, which are passed to the wiredriver.Chrome() instance so Selenium Wire knows which proxy to use while scraping. Here is how to scrape Youtube with a proxy:

proxy_address = ""
proxy_username = ""
proxy_password = ""
chrome_options = Options()
seleniumwire_options = {
    "proxy": {
        "http": f"http://{proxy_username}:{proxy_password}@{proxy_address}",
        "https": f"http://{proxy_username}:{proxy_password}@{proxy_address}",
    }
}
driver = wiredriver.Chrome(options=chrome_options, seleniumwire_options=seleniumwire_options)

Step 3: How to Scrape Youtube Video Pages

Set up a variable called “youtube_url_to_scrape” to hold the address of the YouTube video you want to scrape. This variable is passed to Selenium’s “driver.get()” method, so it knows which page to open. When the script runs, a new Chrome window opens and navigates to that URL.

youtube_url_to_scrape = "https://www.youtube.com/watch?v=XXXXXXXXXXX"
driver.get(youtube_url_to_scrape)

# Wait for the title to appear (main indicator that video loaded)
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.XPATH, '//*[@id="title"]/h1'))
)

# JS scroll
driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight / 2);")
time.sleep(2)
driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
time.sleep(2)

Next, we define the “extract_information()” function which, as its name suggests, extracts the information from the page.

Before reading anything, we make sure the relevant elements have loaded. WebDriverWait suspends the script until the “...more” expander button is present; once it is, Selenium clicks it so the full video description becomes visible.

Dynamic Comment Issue

Comments are loaded dynamically as the page is scrolled, so a freshly opened page contains few or none of them. To work around this, we use the ActionChains class together with the time module to scroll down twice with a 10-second pause each time, which lets us scrape as many YouTube comments as possible before reading the page.

def extract_information() -> dict:
    try:
        element = WebDriverWait(driver, 15).until(
            EC.presence_of_element_located((By.XPATH, '//*[@id="expand"]'))
        )

        element.click()

        time.sleep(10)
        actions = ActionChains(driver)
        actions.send_keys(Keys.END).perform()
        time.sleep(10)
        actions.send_keys(Keys.END).perform()
        time.sleep(10)

Selenium WebDriver offers several ways to locate elements: by ID, by CLASS_NAME, by XPATH, among others. In this tutorial on how to scrape data from Youtube with Python, we pick whichever locator fits each element best rather than sticking to a single approach.

XPATH is the most flexible way to locate elements because it describes a path or pattern through the page markup. It is often considered the hardest locator to write by hand, but Chrome makes it easy to obtain.

In Chrome DevTools, inspect the element you need, right-click it in the Elements panel and choose Copy > Copy XPath. Once you have the XPath, the ‘find_elements’ method can be used to collect every element that holds the relevant details, such as the video’s title, description and so on.

Keep in mind that several elements on the page can match the same locator, which is why ‘find_elements()’ returns a list rather than a single string. In that case you have to inspect the list and work out which index holds the information you need before reading its text.
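
A quick way to do this is to print every match together with its index and note which one holds the value you need. For example, for the '//*[@id="info"]/span' locator used later in this script:

# Print each match with its index to see which span holds views, date, etc.
info_spans = driver.find_elements(By.XPATH, '//*[@id="info"]/span')
for index, span in enumerate(info_spans):
    print(index, repr(span.text))

# In this script index 0 turned out to be the view count and index 2 the publish
# date, but verify this yourself, since the layout can change at any time.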

Finally, the function returns a dictionary named “data” that holds everything gathered during scraping, which the next section relies on.

        video_title = driver.find_elements(By.XPATH, '//*[@id="title"]/h1')[0].text

        owner = driver.find_elements(By.XPATH, '//*[@id="text"]/a')[0].text

        total_number_of_subscribers = driver.find_elements(
            By.XPATH,
            "//div[@id='upload-info']//yt-formatted-string[@id='owner-sub-count']"
        )[0].text

        video_description = driver.find_elements(
            By.XPATH,
            '//*[@id="description-inline-expander"]/yt-attributed-string/span/span'
        )
        result = []
        for i in video_description:
            result.append(i.text)
        description = ''.join(result)

        publish_date = driver.find_elements(By.XPATH, '//*[@id="info"]/span')[2].text
        total_views = driver.find_elements(By.XPATH, '//*[@id="info"]/span')[0].text

        number_of_likes = driver.find_elements(
            By.XPATH,
            '//*[@id="top-level-buttons-computed"]/segmented-like-dislike-button-view-model/yt-smartimation/div/div/like-button-view-model/toggle-button-view-model/button-view-model/button/div'
        )[1].text

        comment_names = driver.find_elements(By.XPATH, '//*[@id="author-text"]/span')
        comment_content = driver.find_elements(By.XPATH, '//*[@id="content-text"]/span')
        comment_library = []

        for each in range(len(comment_names)):
            name = comment_names[each].text
            content = comment_content[each].text
            indie_comment = {
                'name': name,
                'comment': content
            }
            comment_library.append(indie_comment)

        data = {
            'owner': owner,
            'subscribers': total_number_of_subscribers,
            'video_title': video_title,
            'description': description,
            'date': publish_date,
            'views': total_views,
            'likes': number_of_likes,
            'comments': comment_library
        }

        return data

    except Exception as err:
        print(f"Error: {err}")

A user-defined function named “organize_write_data” accepts the “data” dictionary from the previous step, serializes it to JSON and writes it to a file called output.json, handling any errors that occur while writing.
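
Here is the function, as it also appears in the full script below:

def organize_write_data(data: dict):
    """Write scraped data to output.json in a readable format."""
    output = json.dumps(data, indent=2, ensure_ascii=False)
    try:
        with open("output.json", 'w', encoding='utf-8') as file:
            file.write(output)
    except Exception as err:
        print(f"Error encountered: {err}")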

Full Code

With the individual steps in place, the complete scraper now looks as follows:

from selenium.webdriver.chrome.options import Options
from seleniumwire import webdriver as wiredriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
import json
import time

# Proxy configuration (replace with your real proxy or leave empty if not needed)
proxy_address = ""
proxy_username = ""
proxy_password = ""

chrome_options = Options()
seleniumwire_options = {}
if proxy_address:
    # Chrome has no command-line flag for proxy credentials, so the authenticated
    # proxy is configured through Selenium Wire instead (as in Step 2).
    credentials = f"{proxy_username}:{proxy_password}@" if proxy_username and proxy_password else ""
    seleniumwire_options = {
        "proxy": {
            "http": f"http://{credentials}{proxy_address}",
            "https": f"http://{credentials}{proxy_address}",
        }
    }

driver = wiredriver.Chrome(options=chrome_options, seleniumwire_options=seleniumwire_options)

youtube_url_to_scrape = "https://www.youtube.com/watch?v=XXXXXXXXXXX"
driver.get(youtube_url_to_scrape)
time.sleep(5)  # basic wait for the page to load


def extract_information() -> dict:
    """Minimal example that collects core video data and top-level comments."""
    try:
        # Expand description if the button is present
        try:
            element = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.XPATH, '//*[@id="expand"]'))
            )
            element.click()
            time.sleep(2)
        except Exception:
            pass  # if the button is not found, continue with what is visible

        # Scroll to load some comments
        actions = ActionChains(driver)
        actions.send_keys(Keys.END).perform()
        time.sleep(3)
        actions.send_keys(Keys.END).perform()
        time.sleep(3)

        video_title = driver.find_elements(By.XPATH, '//*[@id="title"]/h1')[0].text
        owner = driver.find_elements(By.XPATH, '//*[@id="text"]/a')[0].text
        total_number_of_subscribers = driver.find_elements(
            By.XPATH,
            "//div[@id='upload-info']//yt-formatted-string[@id='owner-sub-count']"
        )[0].text

        video_description_nodes = driver.find_elements(
            By.XPATH,
            '//*[@id="description-inline-expander"]/yt-attributed-string/span/span'
        )
        description = ''.join([node.text for node in video_description_nodes])

        publish_date = driver.find_elements(By.XPATH, '//*[@id="info"]/span')[2].text
        total_views = driver.find_elements(By.XPATH, '//*[@id="info"]/span')[0].text

        number_of_likes = driver.find_elements(
            By.XPATH,
            '//*[@id="top-level-buttons-computed"]/segmented-like-dislike-button-view-model/yt-smartimation/div/div/like-button-view-model/toggle-button-view-model/button-view-model/button/div'
        )[1].text

        comment_names = driver.find_elements(By.XPATH, '//*[@id="author-text"]/span')
        comment_content = driver.find_elements(By.XPATH, '//*[@id="content-text"]/span')

        comment_library = []
        for each in range(min(len(comment_names), len(comment_content))):
            name = comment_names[each].text
            content = comment_content[each].text
            comment_library.append({"name": name, "comment": content})

        data = {
            'owner': owner,
            'subscribers': total_number_of_subscribers,
            'video_title': video_title,
            'description': description,
            'date': publish_date,
            'views': total_views,
            'likes': number_of_likes,
            'comments': comment_library
        }
        return data

    except Exception as err:
        print(f"Error: {err}")
        return {}


def organize_write_data(data: dict):
    """Write scraped data to output.json in a readable format."""
    output = json.dumps(data, indent=2, ensure_ascii=False)
    try:
        with open("output.json", 'w', encoding='utf-8') as file:
            file.write(output)
    except Exception as err:
        print(f"Error encountered: {err}")


organize_write_data(extract_information())
driver.quit()

Minimal “Bare-Bones” Example

If you only want a very compact script that opens a video and extracts basic data (no proxies, scrolling, or comments), you can start from this skeleton:

from selenium import webdriver
from selenium.webdriver.common.by import By
import json, time

driver = webdriver.Chrome()
driver.get("https://www.youtube.com/watch?v=XXXXXXXXXXX")
time.sleep(5)

title = driver.find_element(By.XPATH, '//*[@id="title"]/h1').text
owner = driver.find_element(By.XPATH, '//*[@id="text"]/a').text
views = driver.find_elements(By.XPATH, '//*[@id="info"]/span')[0].text

data = {
    "title": title,
    "owner": owner,
    "views": views
}

with open("output_minimal.json", "w", encoding="utf-8") as f:
    json.dump(data, f, indent=2, ensure_ascii=False)

driver.quit()

Results

At this point we also know how to scrape Youtube comments with Python. The resulting output.json looks like this:

{
  "owner": "Suzie Taylor",
  "subscribers": "165K subscribers",
  "video_title": "I Spent $185,000 From MrBeast",
  "description": "@MrBeast blessed me with $185,000, after SURVIVING 100 DAYS trapped with a STRANGER. Now I want to bless you!\nGive This A Like if you Enjoyed :))\n-\nTo Enter The Giveaway: \nFollow me on Instagram: https://www.instagram.com/sooztaylor/...\nSubscribe to me on Youtube: \n   / @suzietaylor  \nI am picking winners for the giveaway ONE WEEK from today (December 23rd) \n-\nThank you everyone for all of your love already. This is my dream!",
  "date": "Dec 16, 2023",
  "views": "4,605,785 ",
  "likes": "230K",
  "comments": [
    {
      "name": "",
      "comment": "The right person got the money "
    },
    {
      "name": "@Scottster",
      "comment": "Way to go Suzie, a worthy winner! Always the thought that counts and you put a lot into it!"
    },
    {
      "name": "@cidsx",
      "comment": "I'm so glad that she's paying it forward! She 100% deserved the reward"
    },
    {
      "name": "@Basicskill720",
      "comment": "This is amazing Suzie. Way to spread kindness in this dark world. It is much needed !"
    },
    {
      "name": "@eliasnull",
      "comment": "You are such a blessing Suzie! The world needs more people like you."
    },
    {
      "name": "@iceline22",
      "comment": "That's so awesome you're paying it forward! You seem so genuine, and happy to pass along your good fortune! Amazing! Keep it up!"
    },
    {
      "name": "",
      "comment": "Always nice to see a Mr. Beast winner turn around and doing nice things for others. I know this was but a small portion of what you won and nobody expects you to not take care of yourself with that money, but to give back even in small ways can mean so much. Thank you for doing this."
    }
  ]
}
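
Once output.json is written, the data is easy to reuse in any analysis step. A minimal example that loads it back, assuming the file sits next to the script:

import json

# Load the scraped data for further analysis
with open("output.json", encoding="utf-8") as f:
    video = json.load(f)

print(video["video_title"], "-", video["views"])
print("comments collected:", len(video["comments"]))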

How to Scrape Youtube: Final Thoughts

Being able to collect YouTube data programmatically is incredibly valuable, provided your automation and proxy setup stays as compliant with the platform’s rules as possible. To wrap up, here is a short, practical sequence you can follow:

  1. Check the YouTube Data API first and use it wherever possible.
  2. Decide what really requires web-scraping (UI-only or non-API fields).
  3. Set up Selenium (optionally with proxies) and confirm a single video can be loaded correctly.
  4. Implement a minimal extraction function that collects only the fields you need.
  5. Add scrolling and comment parsing if you need engagement data from dynamically loaded sections.
  6. Introduce proxy rotation and random delays to reduce the risk of IP blocks.
  7. Write the results to JSON and validate that the output matches your analysis needs.
  8. Monitor YouTube layout changes and update your locators when the scraper breaks.

Used responsibly, this workflow lets you automate the collection of vital YouTube data while minimizing technical risk and staying as close as possible to platform rules.
