How to scrape vital YouTube data with Python

YouTube creators need to assess how their videos perform: analysing positive and negative comments and comparing their content with other videos in the same or different categories are essential parts of that process.

Manually sifting through posted videos is tedious and time-consuming for creators, and this is precisely where a YouTube scraping script becomes invaluable. In this guide, we will build a script that automates the data-gathering process.

Creating a scraper to extract data from YouTube

For the script to function correctly, we need to install a few packages. The first is selenium-wire, an extension of Selenium that enables proper proxy configuration; we also need Selenium itself for its core classes and modules, plus blinker pinned to version 1.7.0 for compatibility with selenium-wire. To install these packages, run the following command in your command-line interface:

pip install selenium-wire selenium blinker==1.7.0

Now let's focus on the imports.

Step 1: Importing libraries and packages

At this stage, it's important to import the libraries and packages that will be utilized in our script for interacting with web elements. Additionally, we should include modules for data processing and runtime management to ensure efficient execution of the script.

from selenium.webdriver.chrome.options import Options
from seleniumwire import webdriver as wiredriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
import json
import time

The json module converts the extracted data into properly formatted JSON, ensuring clean output. Even though our IP address is masked by a proxy, the time module is essential for pausing between actions so the behaviour looks less script-like.

Additionally, this module helps make sure that the elements we want to extract have had time to load on the page. The remaining imports are classes and submodules that perform distinct actions; they will be explained in the sections of code that follow.
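
The script below uses fixed ten-second pauses. If you would like the delays to vary a little so the traffic looks even less uniform, a small helper along these lines can be swapped in (just a sketch using Python's built-in random module, not part of the original script):

import random

def human_pause(minimum: float = 5, maximum: float = 12) -> None:
    # Sleep for a random number of seconds between minimum and maximum
    time.sleep(random.uniform(minimum, maximum))

You would then call human_pause() wherever the script currently calls time.sleep(10).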

Step 2: Setting up Selenium Chrome Driver

Whenever you run a Selenium instance from a Python script, it uses your own IP address for whatever activity you perform. This is risky on websites like YouTube, which has strict policies against scraping (you can check its robots.txt file for reference), and the consequence could be a temporary restriction of your IP address from accessing YouTube content.
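
For reference, you can check what YouTube's robots.txt allows programmatically using Python's built-in urllib.robotparser. The snippet below is only a quick illustration and is not part of the scraper:

from urllib.robotparser import RobotFileParser

# Read YouTube's robots.txt and check whether a given path may be fetched
robot_parser = RobotFileParser()
robot_parser.set_url("https://www.youtube.com/robots.txt")
robot_parser.read()
print(robot_parser.can_fetch("*", "https://www.youtube.com/watch"))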

To avoid all of that, we need to do a couple of things. First, we create three variables to hold the details of the proxy through which we will access the page. We also create an options variable, chrome_options, which we pass into the Chrome WebDriver instance. Because Chrome itself cannot take proxy credentials as a command-line argument, we hand the authenticated proxy to selenium-wire through its seleniumwire_options parameter, and our proxy is set.

# Specify the proxy server address (host:port) with username and password
proxy_address = ""
proxy_username = ""
proxy_password = ""
# Set up Chrome options
chrome_options = Options()
# Hand the authenticated proxy to selenium-wire (Chrome has no --proxy-auth flag)
seleniumwire_options = {
    'proxy': {
        'http': f'http://{proxy_username}:{proxy_password}@{proxy_address}',
        'https': f'https://{proxy_username}:{proxy_password}@{proxy_address}',
    }
}
# Create a WebDriver instance with selenium-wire
driver = wiredriver.Chrome(options=chrome_options, seleniumwire_options=seleniumwire_options)
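
To confirm that traffic really goes through the proxy, you can optionally load an IP-echo service before scraping and check the address it reports. This is just a quick sanity check under the setup above (httpbin.org is used purely as an example), not part of the original script:

# Optional check: print the IP address the target server sees
driver.get("https://httpbin.org/ip")
print(driver.find_element(By.TAG_NAME, "body").text)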

Step 3: Extracting the information from the YouTube video page

Create a variable named “youtube_url_to_scrape” to store the URL of the YouTube video page. This variable is then passed to the “driver.get()” method to direct Selenium to open that specific page for scraping. Executing this action opens a separate Chrome window when the script is run.

youtube_url_to_scrape = ""
# Perform your Selenium automation with the enhanced capabilities of selenium-wire
driver.get(youtube_url_to_scrape)

Next, we define the “extract_information()” function, which, as the name suggests, extracts the necessary information from the page.

It is important to make sure that the elements we need have loaded on the page. To do this, we use the WebDriverWait class to pause the script until the description's "...more" button is present; it is stored in the "element" variable. Once the button is available, Selenium clicks it, which expands the video's full description.

Comments are loaded dynamically as the page is scrolled, so we need to handle that as well. Using the ActionChains class and the time module, we press the End key twice, pausing ten seconds after each scroll, so that as many comments as possible are loaded before scraping (an optional variant that keeps scrolling until no new comments appear is sketched after the snippet below). This protects against missing content that has not yet been rendered.

def extract_information() -> dict:
   try:
       element = WebDriverWait(driver, 15).until(
           EC.presence_of_element_located((By.XPATH, '//*[@id="expand"]'))
       )

       element.click()

       time.sleep(10)
       actions = ActionChains(driver)
       actions.send_keys(Keys.END).perform()
       time.sleep(10)
       actions.send_keys(Keys.END).perform()
       time.sleep(10)
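
The two fixed End-key presses above only load a couple of batches of comments. If you prefer to keep scrolling until no new comments appear, a loop along these lines could replace them; it is an optional sketch that reuses the comment XPath from the script and is not part of the original code:

# Optional variant: keep pressing End until the number of loaded comments stops growing
def scroll_until_comments_stop_loading(max_rounds: int = 10) -> None:
    actions = ActionChains(driver)
    previous_count = 0
    for _ in range(max_rounds):
        actions.send_keys(Keys.END).perform()
        time.sleep(10)  # give YouTube time to load the next batch of comments
        current_count = len(driver.find_elements(By.XPATH, '//*[@id="content-text"]/span'))
        if current_count == previous_count:
            break  # nothing new loaded, stop scrolling
        previous_count = current_count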

There are different ways to locate elements with Selenium WebDriver, such as by ID, CLASS_NAME, or XPATH. For this guide, we will use a combination of methods rather than just one.

XPath is a more intricate, pattern-based way of locating elements while scraping. It is often considered the most complicated option; however, Chrome makes it easy to obtain.

While inspecting the page with Chrome’s DevTools, simply right-click an element and copy its XPath. Once copied, you can use the `find_elements` function to locate all the elements containing the desired information, such as the video title, description, and so on.

It's crucial to note that `find_elements()` always returns a list, and several elements on the page may share similar attributes. In such cases, you need to examine the list, pinpoint the index of the element that holds the relevant information, and extract its text.
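
As a quick illustration, here is how the different strategies look side by side; the locators are examples only (the XPath is the one used later for views and date), not additional scraping steps:

# Illustrative examples of locator strategies (not additional scraping steps):
title_element = driver.find_element(By.ID, "title")                  # a single element by ID
info_spans = driver.find_elements(By.XPATH, '//*[@id="info"]/span')  # several elements by XPath

# find_elements() always returns a list, so pick the item you need by index:
if info_spans:
    print(info_spans[0].text)  # the first span holds the view count on a video page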

Finally, the function returns a dictionary named `data` containing all the information gathered during scraping, which is exactly what the next section needs.

       video_title = driver.find_elements(By.XPATH, '//*[@id="title"]/h1')[0].text

       owner = driver.find_elements(By.XPATH, '//*[@id="text"]/a')[0].text

       total_number_of_subscribers = \
           driver.find_elements(By.XPATH, "//div[@id='upload-info']//yt-formatted-string[@id='owner-sub-count']")[
               0].text

       video_description = driver.find_elements(By.XPATH,
                                                '//*[@id="description-inline-expander"]/yt-attributed-string/span/span')
       result = []
       for i in video_description:
           result.append(i.text)
       description = ''.join(result)

       publish_date = driver.find_elements(By.XPATH, '//*[@id="info"]/span')[2].text
       total_views = driver.find_elements(By.XPATH, '//*[@id="info"]/span')[0].text

       number_of_likes = driver.find_elements(By.XPATH,
                                              '//*[@id="top-level-buttons-computed"]/segmented-like-dislike-button-view-model/yt-smartimation/div/div/like-button-view-model/toggle-button-view-model/button-view-model/button/div')[
           1].text

       comment_names = driver.find_elements(By.XPATH, '//*[@id="author-text"]/span')
       comment_content = driver.find_elements(By.XPATH, '//*[@id="content-text"]/span')
       comment_library = []

       for each in range(len(comment_names)):
           name = comment_names[each].text
           content = comment_content[each].text
           indie_comment = {
               'name': name,
               'comment': content
           }
           comment_library.append(indie_comment)

       data = {
           'owner': owner,
           'subscribers': total_number_of_subscribers,
           'video_title': video_title,
           'description': description,
           'date': publish_date,
           'views': total_views,
           'likes': number_of_likes,
           'comments': comment_library
       }

       return data

   except Exception as err:
       print(f"Error: {err}")

Step 4: Write the data gathered to a JSON file

def organize_write_data(data: dict):
    output = json.dumps(data, indent=2, ensure_ascii=False).encode("ascii", "ignore").decode("utf-8")
    try:
        with open("output.json", 'w', encoding='utf-8') as file:
            file.write(output)
    except Exception as err:
        print(f"Error encountered: {err}")

The function `organize_write_data()` takes the returned `data` as input and converts it into a formatted JSON structure. It then writes this data to an output file named "output.json" while handling potential errors during the file-writing process. Note that the encode/decode step strips any characters that cannot be represented in ASCII, such as emoji, from the output.
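
If you want to verify the file afterwards, you can load it back with json.load; this is a small illustration rather than part of the script:

# Optional: read the file back to confirm it contains valid JSON
with open("output.json", encoding="utf-8") as file:
    saved_data = json.load(file)
print(saved_data["video_title"], len(saved_data["comments"]), "comments")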

Full code

Here is the full code of our scraping program:

from selenium.webdriver.chrome.options import Options
from seleniumwire import webdriver as wiredriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
import json
import time

# Specify the proxy server address (host:port) with username and password
proxy_address = ""
proxy_username = ""
proxy_password = ""

# Set up Chrome options
chrome_options = Options()

# Hand the authenticated proxy to selenium-wire (Chrome has no --proxy-auth flag)
seleniumwire_options = {
    'proxy': {
        'http': f'http://{proxy_username}:{proxy_password}@{proxy_address}',
        'https': f'https://{proxy_username}:{proxy_password}@{proxy_address}',
    }
}

# Create a WebDriver instance with selenium-wire
driver = wiredriver.Chrome(options=chrome_options, seleniumwire_options=seleniumwire_options)

youtube_url_to_scrape = ""

# Perform your Selenium automation with the enhanced capabilities of selenium-wire
driver.get(youtube_url_to_scrape)


def extract_information() -> dict:
   try:
       element = WebDriverWait(driver, 15).until(
           EC.presence_of_element_located((By.XPATH, '//*[@id="expand"]'))
       )
       element.click()

       time.sleep(10)
       actions = ActionChains(driver)
       actions.send_keys(Keys.END).perform()
       time.sleep(10)
       actions.send_keys(Keys.END).perform()
       time.sleep(10)

       video_title = driver.find_elements(By.XPATH, '//*[@id="title"]/h1')[0].text

       owner = driver.find_elements(By.XPATH, '//*[@id="text"]/a')[0].text
       total_number_of_subscribers = \
           driver.find_elements(By.XPATH, "//div[@id='upload-info']//yt-formatted-string[@id='owner-sub-count']")[
               0].text

       video_description = driver.find_elements(By.XPATH,
                                                '//*[@id="description-inline-expander"]/yt-attributed-string/span/span')
       result = []
       for i in video_description:
           result.append(i.text)
       description = ''.join(result)

       publish_date = driver.find_elements(By.XPATH, '//*[@id="info"]/span')[2].text
       total_views = driver.find_elements(By.XPATH, '//*[@id="info"]/span')[0].text

       number_of_likes = driver.find_elements(By.XPATH,
                                              '//*[@id="top-level-buttons-computed"]/segmented-like-dislike-button-view-model/yt-smartimation/div/div/like-button-view-model/toggle-button-view-model/button-view-model/button/div')[
           1].text

       comment_names = driver.find_elements(By.XPATH, '//*[@id="author-text"]/span')
       comment_content = driver.find_elements(By.XPATH,
                                              '//*[@id="content-text"]/span')
       comment_library = []

       for each in range(len(comment_names)):
           name = comment_names[each].text
           content = comment_content[each].text
           indie_comment = {
               'name': name,
               'comment': content
           }
           comment_library.append(indie_comment)

       data = {
           'owner': owner,
           'subscribers': total_number_of_subscribers,
           'video_title': video_title,
           'description': description,
           'date': publish_date,
           'views': total_views,
           'likes': number_of_likes,
           'comments': comment_library
       }

       return data

   except Exception as err:
       print(f"Error: {err}")


# Record data to JSON
def organize_write_data(data: dict):
   output = json.dumps(data, indent=2, ensure_ascii=False).encode("ascii", "ignore").decode("utf-8")
   try:
       with open("output.json", 'w', encoding='utf-8') as file:
           file.write(output)
   except Exception as err:
       print(f"Error encountered: {err}")


organize_write_data(extract_information())
driver.quit()

Results

The output looks like this:

[Screenshot_1.png: a sample of the generated output.json file]

Harnessing YouTube’s wealth of information is significantly easier with a well-crafted script that uses proxies and respects the platform’s policies. The approach discussed above facilitates responsible data extraction and mitigates the risk of restrictions imposed by the platform.
