How to Scrape Vital YouTube Data With Python

YouTube content creators need to understand how each of their videos performs. That means reviewing comments as well as comparing their own content against other videos in the same or related categories.

Doing this manually for every video is exhausting and error-prone, which is exactly where a dedicated script comes in handy. In this guide, we will show you how to scrape YouTube by building a script that automates the data gathering that is usually carried out by hand.

Understanding the Data Structure to Scrape YouTube with Python

Before we get into how to scrape YouTube, we need to understand its structure. The platform exposes an enormous range of data related to user activity and video statistics. Key parameters include video titles and descriptions, tags, view counts, likes, comments, and channel and playlist information. These elements matter to content marketers and creators for assessing how videos perform and for planning future content.

With the YouTube Data API, developers can access most of these metrics programmatically. The API also exposes subscriber counts and the list of videos on a channel, which provides plenty of data for analysis and integration purposes.
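
For context, here is a minimal sketch of pulling some of these metrics through the official API using the google-api-python-client package. The API key and video ID below are placeholders, and the field names follow the public v3 API:


from googleapiclient.discovery import build  # pip install google-api-python-client

API_KEY = "YOUR_API_KEY"      # placeholder: your own API key
VIDEO_ID = "dQw4w9WgXcQ"      # placeholder: the video you want statistics for

# Build a client for version 3 of the YouTube Data API
youtube = build("youtube", "v3", developerKey=API_KEY)

# Request the snippet (title, tags) and statistics (views, likes, comments) of one video
response = youtube.videos().list(part="snippet,statistics", id=VIDEO_ID).execute()

item = response["items"][0]
print(item["snippet"]["title"], item["statistics"]["viewCount"])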

Still, some elements cannot be obtained through the API and can only be retrieved by web scraping. For example, detailed viewer-engagement signals, such as the sentiment of comments or the time at which viewers engaged, require approaches that scrape YouTube pages directly. This technique is more complicated and carries risks, both because the platform's page rendering changes frequently and because of its strict rules on data scraping.

In the following sections, we will show you how to build such a script and how to scrape data from YouTube in Python efficiently.

Using Proxies to Avoid Detection While Scraping YouTube

To scrape YouTube videos with Python, using proxies is essential for evading IP bans and the platform's bot-detection measures. Here are the main proxy types and how they compare:

  1. Residential proxies are tied to real IP addresses issued to home users, so websites treat them as authentic connections. For scraping YouTube data, where a high level of trust is needed to avoid detection, residential proxies are the best option. They let the scraper behave like a genuine user, minimizing the chance of being flagged as a bot.
  2. ISP proxies are the middle ground between residential IPs and datacenter proxies. They are issued by internet service providers, so the addresses are genuine and difficult to flag as proxies. This makes ISP proxies very effective when you need to scrape YouTube search results, which calls for both authenticity and solid performance.
  3. Datacenter proxies boast the highest speeds, but platforms like YouTube can identify them easily because they originate from large data centers. The risk of being blocked while scraping with them is high, even though they are efficient and easy to use. They are the best choice when raw processing speed outweighs the risk of detection.
  4. Mobile proxies provide the most legitimate-looking traffic because connections are routed through real devices on cellular networks. Mobile IPs are rotated frequently by carriers, so these proxies are the least likely to be flagged or blocked. Note, however, that they tend to be noticeably slower than the other types.

Used strategically, these proxy types make it possible to scrape data from YouTube without being detected, allowing continued access to data while abiding by the platform's Terms of Service. Understanding them properly will also help a great deal when you need to find a proxy for scraping.
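
To illustrate how this might look in practice, here is a minimal sketch that picks a random proxy endpoint from a pool before starting a selenium-wire session. The proxy URLs are placeholders, and the session setup mirrors the configuration used later in this guide:


import random

from selenium.webdriver.chrome.options import Options
from seleniumwire import webdriver as wiredriver

# Placeholder pool of authenticated proxy endpoints (user:pass@host:port)
PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
]

def new_driver_with_random_proxy():
    # Pick a proxy at random so consecutive sessions do not reuse the same IP
    proxy = random.choice(PROXY_POOL)
    seleniumwire_options = {"proxy": {"http": proxy, "https": proxy}}
    return wiredriver.Chrome(options=Options(), seleniumwire_options=seleniumwire_options)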

Creating a Scraper to Extract Data From YouTube

Let’s start building the scraper. Before we write any code, a few packages must be installed: selenium-wire, a proxy-friendly extension of Selenium, along with Selenium itself and a pinned version of blinker. Use the following command in your terminal to install these packages:


pip install selenium-wire selenium blinker==1.7.0

Let us now turn our attention to the imports.

Step 1: Importing libraries and packages

At this point, we load the libraries and packages needed to locate and interact with the web elements the script targets, plus the modules for data processing and runtime management that keep the script running smoothly.


from selenium.webdriver.chrome.options import Options
from seleniumwire import webdriver as wiredriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
import json
import time

The json module converts the scraped data into properly formatted JSON for cleaner output. The time module pauses the script between actions, which both makes its behavior less predictable and gives dynamically loaded elements time to appear on the page before we extract them. The rest of the imports are Selenium classes and submodules whose roles are covered in the relevant parts of the code below.
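
If you want those pauses to be genuinely random rather than fixed, a small hypothetical helper like the one below (not part of the original script) can be dropped in wherever the code calls time.sleep():


import random
import time

def human_pause(minimum: float = 5, maximum: float = 12) -> None:
    # Sleep for a random interval so the scraper's timing is less predictable
    time.sleep(random.uniform(minimum, maximum))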

Step 2: Setting up Selenium Chrome Driver

When a Selenium instance is launched from a Python script, it uses our own IP address by default for every request it makes. That is a problem on sites like YouTube, which actively restrict automated scraping of their pages; it is also advisable to review the site's robots.txt file before you begin. Without precautions, your IP can receive temporary bans when trying to scrape YouTube channels, for example.

To mitigate these issues, we first store the details of the proxy we will use in three variables. We then define a chrome_options object for the browser and pass the proxy credentials to selenium-wire through seleniumwire_options, so every request the browser makes is routed through the proxy while scraping. Here is the proxy setup for scraping YouTube:


# Specify the proxy server address with username and password
proxy_address = ""
proxy_username = ""
proxy_password = ""
# Set up Chrome options
chrome_options = Options()
# Pass the authenticated proxy to selenium-wire so all traffic is routed through it
seleniumwire_options = {
    'proxy': {
        'http': f'http://{proxy_username}:{proxy_password}@{proxy_address}',
        'https': f'https://{proxy_username}:{proxy_password}@{proxy_address}',
    }
}
# Create a WebDriver instance with selenium-wire
driver = wiredriver.Chrome(options=chrome_options, seleniumwire_options=seleniumwire_options)
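
As a quick sanity check (not part of the script itself), you can point the driver at an IP-echo service such as httpbin.org and confirm that the address it reports belongs to the proxy rather than to your own connection:


# Hypothetical check: the reported origin IP should be the proxy's, not yours
driver.get("https://httpbin.org/ip")
print(driver.find_element(By.TAG_NAME, "body").text)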

Step 3: How to Scrape YouTube Video Pages

Set up a variable called “youtube_url_to_scrape” to hold the address of the YouTube video page you want to scrape. This variable is passed to Selenium’s “driver.get()” method so it knows which page to open. Running the script at this point will open a new Chrome window and navigate to that URL.


youtube_url_to_scrape = ""
# Perform your Selenium automation with the enhanced capabilities of selenium-wire
driver.get(youtube_url_to_scrape)

Next, we define the “extract_information()” function, which, as its name suggests, extracts the information from the page.

First we make sure the elements we need have loaded. WebDriverWait suspends the script until the element behind the “more” button in the description is present; once it is, Selenium clicks it to expand the full description of the video.
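
If that regular click is ever intercepted by an overlay, a JavaScript click can be used as a fallback. This is a hedged variant rather than part of the script below, and it reuses the element and driver objects defined there:


# Hedged variant: fall back to a JavaScript click if the normal click fails
try:
    element.click()
except Exception:
    driver.execute_script("arguments[0].click();", element)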

Dynamic Comment Issue

YouTube loads comments dynamically as you scroll, so only a handful are present when the page first opens. To work around this, we use the ActionChains class together with the time module to scroll to the bottom of the page twice, pausing 10 seconds between scrolls, so that as many YouTube comments as possible are loaded before we scrape them. This avoids problems with dynamically loaded content.


def extract_information() -> dict:
    try:
        element = WebDriverWait(driver, 15).until(
            EC.presence_of_element_located((By.XPATH, '//*[@id="expand"]'))
        )

        element.click()

        time.sleep(10)
        actions = ActionChains(driver)
        actions.send_keys(Keys.END).perform()
        time.sleep(10)
        actions.send_keys(Keys.END).perform()
        time.sleep(10)
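
As an aside, instead of a fixed number of scrolls you can keep scrolling until the number of loaded comments stops growing. This is a hedged sketch that reuses the driver and imports from this script and is not part of the original code:


def scroll_until_comments_stop_loading(max_rounds: int = 10, pause: int = 5) -> None:
    previous_count = 0
    for _ in range(max_rounds):
        ActionChains(driver).send_keys(Keys.END).perform()
        time.sleep(pause)
        # Count the comment bodies currently rendered on the page
        current_count = len(driver.find_elements(By.XPATH, '//*[@id="content-text"]'))
        if current_count == previous_count:
            break
        previous_count = current_count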

Selenium WebDriver offers several ways to locate elements: by ID, by CLASS_NAME, by XPATH, and so on. In this tutorial on how to scrape data from YouTube in Python, we rely primarily on XPATH selectors.

XPATH can look intimidating because it describes an element's position in the page's markup as a pattern, and it is often regarded as the hardest locator to write by hand. Chrome, however, makes it painless to obtain.

Open Chrome's Inspect panel, right-click the element in the code view, and choose the Copy XPath option. With the XPATH in hand, the find_elements() method can be used to fetch all of the components that hold the details we want, such as the video's title and description.

Note that find_elements() always returns a list rather than a single string, and several elements on the page can match the same XPATH. In that case you have to inspect the list, work out which index holds the required information, and read the text from that element.
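
For example, this small hypothetical snippet (using the driver from this script) prints every match for one of the XPATHs used below, so you can see which index holds the view count and which holds the publish date:


# Inspect every element matching the XPATH and note which index holds what
spans = driver.find_elements(By.XPATH, '//*[@id="info"]/span')
for index, span in enumerate(spans):
    print(index, repr(span.text))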

Finally, the function returns a dictionary named “data” that encapsulates all of the information gathered during scraping; it is the input for the next section.


        video_title = driver.find_elements(By.XPATH, '//*[@id="title"]/h1')[0].text

        owner = driver.find_elements(By.XPATH, '//*[@id="text"]/a')[0].text

        total_number_of_subscribers = driver.find_elements(
            By.XPATH, "//div[@id='upload-info']//yt-formatted-string[@id='owner-sub-count']"
        )[0].text

        video_description = driver.find_elements(
            By.XPATH, '//*[@id="description-inline-expander"]/yt-attributed-string/span/span'
        )
        result = []
        for i in video_description:
            result.append(i.text)
        description = ''.join(result)

        publish_date = driver.find_elements(By.XPATH, '//*[@id="info"]/span')[2].text
        total_views = driver.find_elements(By.XPATH, '//*[@id="info"]/span')[0].text

        number_of_likes = driver.find_elements(
            By.XPATH,
            '//*[@id="top-level-buttons-computed"]/segmented-like-dislike-button-view-model/yt-smartimation/div/div/like-button-view-model/toggle-button-view-model/button-view-model/button/div'
        )[1].text

        comment_names = driver.find_elements(By.XPATH, '//*[@id="author-text"]/span')
        comment_content = driver.find_elements(By.XPATH, '//*[@id="content-text"]/span')
        comment_library = []

        for each in range(len(comment_names)):
            name = comment_names[each].text
            content = comment_content[each].text
            indie_comment = {
                'name': name,
                'comment': content
            }
            comment_library.append(indie_comment)

        data = {
            'owner': owner,
            'subscribers': total_number_of_subscribers,
            'video_title': video_title,
            'description': description,
            'date': publish_date,
            'views': total_views,
            'likes': number_of_likes,
            'comments': comment_library
        }

        return data

    except Exception as err:
        print(f"Error: {err}")

A user-defined function named “organize_write_data” accepts the “data” dictionary returned by the previous step. Its job is to serialize the data as JSON and write it to a file called output.json, using UTF-8 encoding and handling any errors that occur while writing.
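
Here is that function essentially as it appears in the full listing below, with an explanatory comment added; the ASCII encode/decode round trip simply strips characters (such as emoji) that cannot be represented in ASCII:


def organize_write_data(data: dict):
    # Serialize to JSON; the encode/decode pass drops any non-ASCII characters (e.g. emoji)
    output = json.dumps(data, indent=2, ensure_ascii=False).encode("ascii", "ignore").decode("utf-8")
    try:
        with open("output.json", 'w', encoding='utf-8') as file:
            file.write(output)
    except Exception as err:
        print(f"Error encountered: {err}")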

Full Code

Now we understand how to scrape YouTube properly. Here is the full code of our scraping program:


from selenium.webdriver.chrome.options import Options
from seleniumwire import webdriver as wiredriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
import json
import time

# Specify the proxy server address with username and password
proxy_address = ""
proxy_username = ""
proxy_password = ""

# Set up Chrome options
chrome_options = Options()

# Pass the authenticated proxy to selenium-wire so all traffic is routed through it
seleniumwire_options = {
    'proxy': {
        'http': f'http://{proxy_username}:{proxy_password}@{proxy_address}',
        'https': f'https://{proxy_username}:{proxy_password}@{proxy_address}',
    }
}

# Create a WebDriver instance with selenium-wire
driver = wiredriver.Chrome(options=chrome_options, seleniumwire_options=seleniumwire_options)

youtube_url_to_scrape = ""

# Perform your Selenium automation with the enhanced capabilities of selenium-wire
driver.get(youtube_url_to_scrape)


def extract_information() -> dict:
   try:
       element = WebDriverWait(driver, 15).until(
           EC.presence_of_element_located((By.XPATH, '//*[@id="expand"]'))
       )
       element.click()

       time.sleep(10)
       actions = ActionChains(driver)
       actions.send_keys(Keys.END).perform()
       time.sleep(10)
       actions.send_keys(Keys.END).perform()
       time.sleep(10)

       video_title = driver.find_elements(By.XPATH, '//*[@id="title"]/h1')[0].text

       owner = driver.find_elements(By.XPATH, '//*[@id="text"]/a')[0].text
       total_number_of_subscribers = \
           driver.find_elements(By.XPATH, "//div[@id='upload-info']//yt-formatted-string[@id='owner-sub-count']")[
               0].text

       video_description = driver.find_elements(By.XPATH,
                                                '//*[@id="description-inline-expander"]/yt-attributed-string/span/span')
       result = []
       for i in video_description:
           result.append(i.text)
       description = ''.join(result)

       publish_date = driver.find_elements(By.XPATH, '//*[@id="info"]/span')[2].text
       total_views = driver.find_elements(By.XPATH, '//*[@id="info"]/span')[0].text

       number_of_likes = driver.find_elements(By.XPATH,
                                              '//*[@id="top-level-buttons-computed"]/segmented-like-dislike-button-view-model/yt-smartimation/div/div/like-button-view-model/toggle-button-view-model/button-view-model/button/div')[
           1].text

       comment_names = driver.find_elements(By.XPATH, '//*[@id="author-text"]/span')
       comment_content = driver.find_elements(By.XPATH,
                                              '//*[@id="content-text"]/span')
       comment_library = []

       for each in range(len(comment_names)):
           name = comment_names[each].text
           content = comment_content[each].text
           indie_comment = {
               'name': name,
               'comment': content
           }
           comment_library.append(indie_comment)

       data = {
           'owner': owner,
           'subscribers': total_number_of_subscribers,
           'video_title': video_title,
           'description': description,
           'date': publish_date,
           'views': total_views,
           'likes': number_of_likes,
           'comments': comment_library
       }

       return data

   except Exception as err:
       print(f"Error: {err}")


# Record data to JSON
def organize_write_data(data: dict):
   output = json.dumps(data, indent=2, ensure_ascii=False).encode("ascii", "ignore").decode("utf-8")
   try:
       with open("output.json", 'w', encoding='utf-8') as file:
           file.write(output)
   except Exception as err:
       print(f"Error encountered: {err}")


organize_write_data(extract_information())
driver.quit()

Results

We now also know how to scrape YouTube comments with Python. The output of the script looks like this:


{
  "owner": "Suzie Taylor",
  "subscribers": "165K subscribers",
  "video_title": "I Spent $185,000 From MrBeast",
  "description": "@MrBeast blessed me with $185,000, after SURVIVING 100 DAYS trapped with a STRANGER. Now I want to bless you!\nGive This A Like if you Enjoyed :))\n-\nTo Enter The Giveaway: \nFollow me on Instagram: https://www.instagram.com/sooztaylor/...\nSubscribe to me on Youtube: \n   / @suzietaylor  \nI am picking winners for the giveaway ONE WEEK from today (December 23rd) \n-\nThank you everyone for all of your love already. This is my dream!",
  "date": "Dec 16, 2023",
  "views": "4,605,785 ",
  "likes": "230K",
  "comments": [
    {
      "name": "",
      "comment": "The right person got the money "
    },
    {
      "name": "@Scottster",
      "comment": "Way to go Suzie, a worthy winner! Always the thought that counts and you put a lot into it!"
    },
    {
      "name": "@cidsx",
      "comment": "I'm so glad that she's paying it forward! She 100% deserved the reward"
    },
    {
      "name": "@Basicskill720",
      "comment": "This is amazing Suzie. Way to spread kindness in this dark world. It is much needed !"
    },
    {
      "name": "@eliasnull",
      "comment": "You are such a blessing Suzie! The world needs more people like you."
    },
    {
      "name": "@iceline22",
      "comment": "That's so awesome you're paying it forward! You seem so genuine, and happy to pass along your good fortune! Amazing! Keep it up!"
    },
    {
      "name": "",
      "comment": "Always nice to see a Mr. Beast winner turn around and doing nice things for others. I know this was but a small portion of what you won and nobody expects you to not take care of yourself with that money, but to give back even in small ways can mean so much. Thank you for doing this."
    }
  ]
}

How to Scrape YouTube: Final Thoughts

Being able to leverage data from YouTube is incredibly valuable, provided that your automation and proxy setup keeps you compliant with the platform's rules. The approach to scraping YouTube described above lets you harvest data responsibly while avoiding bans or rate limits imposed by the platform.
