YouTube content creators need to analyze how every video performs, which includes both reviewing comments and comparing their own content with other videos in the same or a different category.
Going through all of these videos manually can be exhausting and difficult, and that is exactly where a dedicated script comes in handy. In this guide, we will show you how to scrape YouTube by creating a script that automates the data-gathering process usually carried out by hand.
Accessing YouTube data usually comes down to two options: calling the official YouTube Data API or scraping the website with tools such as Selenium. The API is designed and supported for this purpose, offering stable access to video metadata, channel statistics, comments, and search results with documented quotas and predictable responses. In most cases, development teams should rely on the API as the primary integration layer.
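For comparison, the core metadata mentioned above (title, channel, views, likes) can be retrieved from the Data API with a single request. The sketch below assumes you already have an API key and the requests library installed; the key and video ID are placeholders.
import requests

API_KEY = "YOUR_API_KEY"      # placeholder: create a key in the Google Cloud console
VIDEO_ID = "XXXXXXXXXXX"      # placeholder video ID

# One call to the videos endpoint returns both snippet and statistics data
response = requests.get(
    "https://www.googleapis.com/youtube/v3/videos",
    params={"part": "snippet,statistics", "id": VIDEO_ID, "key": API_KEY},
    timeout=10,
)
item = response.json()["items"][0]

print(item["snippet"]["title"])            # video title
print(item["snippet"]["channelTitle"])     # channel name
print(item["statistics"].get("viewCount"))
print(item["statistics"].get("likeCount"))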
Web-scraping becomes relevant only when the UI exposes fields that the API does not, such as rich description formatting or other presentation-level details covered later in this guide.
Scraping remains a last resort. It is slower, more fragile (front-end changes can break selectors and the whole scraper as a result), and introduces ongoing compliance and maintenance risks.
Important disclaimer: Scraping YouTube may violate YouTube’s Terms of Service and can trigger rate limiting or IP blocking. Always respect local laws and platform policies, avoid aggressive request patterns, and prefer official API access whenever it covers the required data.
Before diving into web-scraping YouTube with Python, it helps to understand how the site structures its data. The platform offers a wide range of data types related to user activity and video statistics. Key parameters include video titles and descriptions, tags, the number of views, likes and comments, and channel and playlist information. These elements matter not only to content marketers and creators, but also to analytics pipelines that assess video performance and inform content strategy.
As described above, the YouTube Data API already exposes most of these metrics programmatically. The API also provides subscriber counts and the list of videos on a channel, which covers a good amount of data for analysis and integration purposes.
Yet, certain UI-only elements remain unavailable through the API and can only be retrieved by scraping the page itself. For example, obtaining detailed viewer-engagement signals, such as the sentiment of comments, rich description formatting, or the way viewers interact with specific parts of the page, requires scraping the rendered YouTube page. This technique is more complicated and more fragile, because it depends on ever-changing front-end markup and must respect the platform’s policies and technical limits on automated access.
In the following sections, the article demonstrates how to build a Python script that scrapes selected YouTube data efficiently, while staying aligned with the constraints and trade-offs outlined above.
To scrape YouTube videos with Python, proxies are essential for avoiding IP bans and other anti-bot defenses. The main proxy types, such as residential, datacenter, and mobile proxies, differ in cost, speed, and how easily they are detected.
When these types are used strategically, it is possible to scrape data from YouTube without being detected, allowing continued access to data while abiding by the platform’s Terms of Service. Understanding them properly also helps a great deal when you need to find a proxy for scraping.
Beyond choosing the right proxy type, it is important to rotate IP addresses and vary your request behavior. IP rotation distributes your requests across multiple identities, which reduces the chance that YouTube will detect and block your scraper based on repeated traffic from a single address.
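As an illustration, a simple rotation scheme can cycle through a pool of proxy endpoints and add randomized pauses between page loads. The sketch below is only an example under assumed conditions, not part of the script built later in this guide, and the proxy URLs are placeholders.
import itertools
import random
import time

# Hypothetical proxy pool; replace the placeholders with your own endpoints
PROXY_POOL = [
    "http://user:pass@proxy-1.example.com:8000",
    "http://user:pass@proxy-2.example.com:8000",
    "http://user:pass@proxy-3.example.com:8000",
]
proxy_cycle = itertools.cycle(PROXY_POOL)

def next_seleniumwire_options() -> dict:
    """Return Selenium Wire options pointing at the next proxy in the pool."""
    proxy = next(proxy_cycle)
    return {"proxy": {"http": proxy, "https": proxy}}

def polite_pause():
    """Sleep for a randomized interval so request timing looks less machine-like."""
    time.sleep(random.uniform(2.0, 6.0))
Each new browser session can then be created with a fresh set of Selenium Wire options, reusing the same proxy-configuration pattern shown later in this guide.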
Let’s start building the script for scraping YouTube. A few packages must be installed first: Selenium Wire, an extension of Selenium that adds proxy support, together with Selenium itself and a pinned version of the blinker package. Run the following command in your terminal to install them:
pip install selenium-wire selenium blinker==1.7.0
Let us now turn our attention to the imports.
At this point, we load the libraries and classes the script uses to locate and wait for web elements, along with modules for data processing and timing control, so the script runs efficiently.
from seleniumwire import webdriver as wiredriver  # Selenium Wire driver with proxy support
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import json
import time
The json module converts the scraped data into properly formatted JSON for readable output. The time module introduces pauses between actions, which makes the scraper’s behavior less predictable and also gives dynamically loaded elements time to appear on the page before extraction. The rest of the imports are Selenium classes and submodules whose roles are discussed in the relevant parts of the code below.
When a Selenium instance is launched from a Python script, it uses our own IP address for every request it performs. This is problematic on sites like YouTube that actively restrict automated access to their pages; it is also advisable to review the website’s robots.txt file for a better understanding of what is allowed. Without precautions, your IP may receive temporary bans when trying to scrape YouTube channels, for example.
To mitigate these challenges, we first store the proxy details in three variables. We then define an options variable, chrome_options, which is passed to the wiredriver.Chrome() instance, and a seleniumwire_options dictionary that tells Selenium Wire which proxy to use while scraping. Here is the proxy setup for the scraper:
proxy_address = ""
proxy_username = ""
proxy_password = ""

chrome_options = Options()
seleniumwire_options = {
    "proxy": {
        "http": f"http://{proxy_username}:{proxy_password}@{proxy_address}",
        "https": f"http://{proxy_username}:{proxy_password}@{proxy_address}",
    }
}

driver = wiredriver.Chrome(options=chrome_options, seleniumwire_options=seleniumwire_options)
Set up a variable called “youtube_url_to_scrape” to hold the address of the YouTube video page you want to scrape. This variable is passed to Selenium’s “driver.get()” method so it knows which page to open. Running the script will launch a new Chrome window that navigates to that URL.
youtube_url_to_scrape = "https://www.youtube.com/watch?v=XXXXXXXXXXX"
driver.get(youtube_url_to_scrape)

# Wait for the title to appear (main indicator that video loaded)
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.XPATH, '//*[@id="title"]/h1'))
)

# JS scroll
driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight / 2);")
time.sleep(2)
driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
time.sleep(2)
Next, we define the “extract_information()” function, which, as its name suggests, extracts the required information from the page.
First, we make sure the relevant elements on the page are loaded. WebDriverWait suspends the script until the description’s “…more” button is present, and Selenium then clicks it so the full video description becomes visible.
Comments are loaded dynamically as the page is scrolled, so to avoid missing them we use the ActionChains class together with the time module to press the End key twice, pausing 10 seconds after each scroll. This loads as many YouTube comments as possible and prevents issues with dynamically loaded content.
def extract_information() -> dict:
    try:
        element = WebDriverWait(driver, 15).until(
            EC.presence_of_element_located((By.XPATH, '//*[@id="expand"]'))
        )
        element.click()
        time.sleep(10)

        actions = ActionChains(driver)
        actions.send_keys(Keys.END).perform()
        time.sleep(10)
        actions.send_keys(Keys.END).perform()
        time.sleep(10)
Using Selenium WebDriver, there are several distinct ways to locate elements: by ID, by CLASS_NAME, by XPATH, and others. In this tutorial on how to scrape data from YouTube with Python, XPATH does most of the work.
XPATH is a more involved way of locating elements because it is pattern-oriented, and it is often regarded as the trickiest locator strategy; Chrome, however, makes it easy to obtain.
In Chrome’s DevTools (Inspect), right-click the element’s markup and choose Copy > Copy XPath. With the XPATH copied, the find_elements method can be used to locate the components containing the relevant details, such as the video’s title and description.
It is important to note that some elements on the page share the same attributes, which is why the basic find_elements() call returns a list rather than a single string. In that case you have to inspect the list, identify which index holds the required information, and retrieve its text, as shown in the sketch below.
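To make this concrete, here is a small diagnostic sketch (not part of the final scraper) that prints every match for one of the XPATH expressions used below, so you can see which index holds the value you need.
# find_elements() always returns a list, even when only one node matches,
# so print the candidates once to learn which index carries the right text.
info_spans = driver.find_elements(By.XPATH, '//*[@id="info"]/span')
for index, span in enumerate(info_spans):
    print(index, repr(span.text))
# On a typical watch page the first span holds the view count and the third
# holds the publish date, which is why the code below indexes [0] and [2].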
Finally, a dictionary variable named “data” is returned, encapsulating all the information gathered during scraping; it is the input for the next step.
        video_title = driver.find_elements(By.XPATH, '//*[@id="title"]/h1')[0].text
        owner = driver.find_elements(By.XPATH, '//*[@id="text"]/a')[0].text
        total_number_of_subscribers = driver.find_elements(
            By.XPATH,
            "//div[@id='upload-info']//yt-formatted-string[@id='owner-sub-count']"
        )[0].text

        video_description = driver.find_elements(
            By.XPATH,
            '//*[@id="description-inline-expander"]/yt-attributed-string/span/span'
        )
        result = []
        for i in video_description:
            result.append(i.text)
        description = ''.join(result)

        publish_date = driver.find_elements(By.XPATH, '//*[@id="info"]/span')[2].text
        total_views = driver.find_elements(By.XPATH, '//*[@id="info"]/span')[0].text
        number_of_likes = driver.find_elements(
            By.XPATH,
            '//*[@id="top-level-buttons-computed"]/segmented-like-dislike-button-view-model/yt-smartimation/div/div/like-button-view-model/toggle-button-view-model/button-view-model/button/div'
        )[1].text

        comment_names = driver.find_elements(By.XPATH, '//*[@id="author-text"]/span')
        comment_content = driver.find_elements(By.XPATH, '//*[@id="content-text"]/span')
        comment_library = []
        # Use min() so a mismatch between the two lists cannot raise an IndexError
        for each in range(min(len(comment_names), len(comment_content))):
            name = comment_names[each].text
            content = comment_content[each].text
            indie_comment = {
                'name': name,
                'comment': content
            }
            comment_library.append(indie_comment)

        data = {
            'owner': owner,
            'subscribers': total_number_of_subscribers,
            'video_title': video_title,
            'description': description,
            'date': publish_date,
            'views': total_views,
            'likes': number_of_likes,
            'comments': comment_library
        }
        return data
    except Exception as err:
        print(f"Error: {err}")
        return {}
A user-defined function named “organize_write_data” accepts the “data” dictionary returned by the previous step. It serializes the data as JSON and writes it to a file called output.json, handling any errors that occur while writing to the file.
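In isolation, the helper looks like this; the same definition appears again in the complete script below.
def organize_write_data(data: dict):
    """Write scraped data to output.json in a readable format."""
    output = json.dumps(data, indent=2, ensure_ascii=False)
    try:
        with open("output.json", 'w', encoding='utf-8') as file:
            file.write(output)
    except Exception as err:
        print(f"Error encountered: {err}")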
With the individual steps in place, the complete scraper now looks as follows:
from selenium.webdriver.chrome.options import Options
from seleniumwire import webdriver as wiredriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
import json
import time
# Proxy configuration (replace with your real proxy or leave empty if not needed)
proxy_address = ""
proxy_username = ""
proxy_password = ""

chrome_options = Options()
seleniumwire_options = {}
if proxy_address:
    # Selenium Wire handles authenticated proxies; plain Chrome flags do not
    seleniumwire_options["proxy"] = {
        "http": f"http://{proxy_username}:{proxy_password}@{proxy_address}",
        "https": f"http://{proxy_username}:{proxy_password}@{proxy_address}",
    }

driver = wiredriver.Chrome(options=chrome_options, seleniumwire_options=seleniumwire_options)
youtube_url_to_scrape = "https://www.youtube.com/watch?v=XXXXXXXXXXX"
driver.get(youtube_url_to_scrape)
time.sleep(5) # basic wait for the page to load
def extract_information() -> dict:
    """Minimal example that collects core video data and top-level comments."""
    try:
        # Expand description if the button is present
        try:
            element = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.XPATH, '//*[@id="expand"]'))
            )
            element.click()
            time.sleep(2)
        except Exception:
            pass  # if the button is not found, continue with what is visible

        # Scroll to load some comments
        actions = ActionChains(driver)
        actions.send_keys(Keys.END).perform()
        time.sleep(3)
        actions.send_keys(Keys.END).perform()
        time.sleep(3)

        video_title = driver.find_elements(By.XPATH, '//*[@id="title"]/h1')[0].text
        owner = driver.find_elements(By.XPATH, '//*[@id="text"]/a')[0].text
        total_number_of_subscribers = driver.find_elements(
            By.XPATH,
            "//div[@id='upload-info']//yt-formatted-string[@id='owner-sub-count']"
        )[0].text

        video_description_nodes = driver.find_elements(
            By.XPATH,
            '//*[@id="description-inline-expander"]/yt-attributed-string/span/span'
        )
        description = ''.join([node.text for node in video_description_nodes])

        publish_date = driver.find_elements(By.XPATH, '//*[@id="info"]/span')[2].text
        total_views = driver.find_elements(By.XPATH, '//*[@id="info"]/span')[0].text
        number_of_likes = driver.find_elements(
            By.XPATH,
            '//*[@id="top-level-buttons-computed"]/segmented-like-dislike-button-view-model/yt-smartimation/div/div/like-button-view-model/toggle-button-view-model/button-view-model/button/div'
        )[1].text

        comment_names = driver.find_elements(By.XPATH, '//*[@id="author-text"]/span')
        comment_content = driver.find_elements(By.XPATH, '//*[@id="content-text"]/span')
        comment_library = []
        for each in range(min(len(comment_names), len(comment_content))):
            name = comment_names[each].text
            content = comment_content[each].text
            comment_library.append({"name": name, "comment": content})

        data = {
            'owner': owner,
            'subscribers': total_number_of_subscribers,
            'video_title': video_title,
            'description': description,
            'date': publish_date,
            'views': total_views,
            'likes': number_of_likes,
            'comments': comment_library
        }
        return data
    except Exception as err:
        print(f"Error: {err}")
        return {}

def organize_write_data(data: dict):
    """Write scraped data to output.json in a readable format."""
    output = json.dumps(data, indent=2, ensure_ascii=False)
    try:
        with open("output.json", 'w', encoding='utf-8') as file:
            file.write(output)
    except Exception as err:
        print(f"Error encountered: {err}")

organize_write_data(extract_information())
driver.quit()
In case you only want a very compact script that just opens a video and extracts basic data (without proxies, scrolling or comments), you can start from this skeleton:
from selenium import webdriver
from selenium.webdriver.common.by import By
import json
import time

driver = webdriver.Chrome()
driver.get("https://www.youtube.com/watch?v=XXXXXXXXXXX")
time.sleep(5)

title = driver.find_element(By.XPATH, '//*[@id="title"]/h1').text
owner = driver.find_element(By.XPATH, '//*[@id="text"]/a').text
views = driver.find_elements(By.XPATH, '//*[@id="info"]/span')[0].text

data = {
    "title": title,
    "owner": owner,
    "views": views
}

with open("output_minimal.json", "w", encoding="utf-8") as f:
    json.dump(data, f, indent=2, ensure_ascii=False)

driver.quit()
We now also know how to scrape YouTube comments with Python. The output looks like this:
{
"owner": "Suzie Taylor",
"subscribers": "165K subscribers",
"video_title": "I Spent $185,000 From MrBeast",
"description": "@MrBeast blessed me with $185,000, after SURVIVING 100 DAYS trapped with a STRANGER. Now I want to bless you!\nGive This A Like if you Enjoyed :))\n-\nTo Enter The Giveaway: \nFollow me on Instagram: https://www.instagram.com/sooztaylor/...\nSubscribe to me on Youtube: \n / @suzietaylor \nI am picking winners for the giveaway ONE WEEK from today (December 23rd) \n-\nThank you everyone for all of your love already. This is my dream!",
"date": "Dec 16, 2023",
"views": "4,605,785 ",
"likes": "230K",
"comments": [
{
"name": "",
"comment": "The right person got the money "
},
{
"name": "@Scottster",
"comment": "Way to go Suzie, a worthy winner! Always the thought that counts and you put a lot into it!"
},
{
"name": "@cidsx",
"comment": "I'm so glad that she's paying it forward! She 100% deserved the reward"
},
{
"name": "@Basicskill720",
"comment": "This is amazing Suzie. Way to spread kindness in this dark world. It is much needed !"
},
{
"name": "@eliasnull",
"comment": "You are such a blessing Suzie! The world needs more people like you."
},
{
"name": "@iceline22",
"comment": "That's so awesome you're paying it forward! You seem so genuine, and happy to pass along your good fortune! Amazing! Keep it up!"
},
{
"name": "",
"comment": "Always nice to see a Mr. Beast winner turn around and doing nice things for others. I know this was but a small portion of what you won and nobody expects you to not take care of yourself with that money, but to give back even in small ways can mean so much. Thank you for doing this."
}
]
}
The ability to leverage data from YouTube is incredibly valuable when automation and proxy scripts help you stay compliant with the platform’s rules. To wrap up, here is a short, practical sequence you can follow:
1. Check whether the YouTube Data API already covers the fields you need, and scrape only the UI-only gaps.
2. Configure a proxy (and rotation, if necessary) through Selenium Wire before launching the browser.
3. Open the target video, wait for key elements to load, expand the description, and scroll to load comments.
4. Extract the title, channel, views, likes, description, and comments with XPATH selectors.
5. Write the results to a JSON file for later analysis.
6. Keep request rates low and respect YouTube’s Terms of Service at every step.
Used responsibly, this workflow lets you automate the collection of vital YouTube data while minimizing technical risk and staying as close as possible to platform rules.