How to Scrape Important YouTube Data with Python

19.08.2024

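Article contents:

Building a scraper to extract data from YouTube

Step 1: Import libraries and packages
Step 2: Set up the Selenium Chrome driver
Step 3: Extract information from the YouTube video page
Step 4: Write the collected data to a JSON file

Full code
Results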

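YouTube creators need to evaluate how their videos perform. Analyzing positive and negative comments and comparing their content with other videos, in the same category or in different ones, is essential.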

Sifting through published videos by hand is tedious and time-consuming for creators. This is where a YouTube scraping script comes in handy. In this guide, we will develop a YouTube script designed to automate the data collection process.

Building a scraper to extract data from YouTube

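For the script to work, a few packages must be installed first: selenium-wire, an extension of Selenium that enables proper proxy configuration, and Selenium itself, which provides the essential classes and modules. The command also pins blinker to version 1.7.0, because selenium-wire is known to fail on import with later blinker releases. To install these packages, run the following command on the command line: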

pip install selenium-wire selenium blinker==1.7.0

Now let's turn to the imports.

Step 1: Import libraries and packages

In this step, we import the libraries and packages that the script uses to interact with web elements, along with the modules it needs for data handling and timing.

from selenium.webdriver.chrome.options import Options
from seleniumwire import webdriver as wiredriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
import json
import time

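The json module converts the extracted data into properly formatted JSON so that it is easy to read. The time module pauses the script between actions. Hiding your IP is the proxy's job, while the pauses help avoid the rapid-fire behavior typical of an automated script.

The pauses also give the elements that the script needs time to load before their data is extracted. The remaining imports are classes and submodules that each perform a specific action; they are explained in the following sections.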
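The pauses can also be randomized rather than fixed. The following is a minimal sketch, not part of the original script: `random_pause` and its parameters are hypothetical names, and the bounds are arbitrary assumptions.

```python
import random
import time


def random_pause(min_seconds: float = 2.0, max_seconds: float = 5.0) -> float:
    """Sleep for a random duration within the given bounds and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay


# Very small bounds so that the demo finishes quickly
waited = random_pause(0.0, 0.05)
print(f"Paused for {waited:.3f} seconds")
```

In the scraper, a call such as `random_pause(8, 12)` could stand in for a fixed `time.sleep(10)`; the exact range is a matter of taste.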

Step 2: Set up the Selenium Chrome driver

Whenever a Python script runs a Selenium instance, every action it performs uses your own IP address. This is risky with sites such as YouTube, which enforce strict policies against scraping their content; you can check the site's robots.txt file for reference. As a result, your IP may be temporarily restricted from accessing YouTube content.
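A robots.txt file can also be checked programmatically. Here is a minimal sketch using the standard library's `urllib.robotparser`; the helper name `is_allowed` is hypothetical, and the rules are made up for illustration (they are not the contents of any real robots.txt).

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines: list[str], url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt rules permit fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


# Made-up rules for illustration only
sample_rules = [
    "User-agent: *",
    "Disallow: /private/",
]

print(is_allowed(sample_rules, "https://example.com/watch?v=abc"))
print(is_allowed(sample_rules, "https://example.com/private/page"))
```

To check a live site instead of in-memory rules, `RobotFileParser` offers `set_url()` and `read()`, which download the file over the network.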

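To avoid this, a few things need to be done. First, create three variables that hold the details of the proxy through which the page will be accessed. Because Chrome's --proxy-server switch does not carry a username and password, the authenticated proxy is configured through selenium-wire: build a proxy URL from the three variables, put it in a seleniumwire_options dictionary, and pass that dictionary to the Chrome WebDriver instance together with a chrome_options object. Selenium then knows which proxy to use while scraping.

# Proxy server address (host:port) with its username and password
proxy_address = ""
proxy_username = ""
proxy_password = ""
# Chrome options (add regular Chrome flags here if needed)
chrome_options = Options()
# selenium-wire handles the authenticated upstream proxy
proxy_url = f'http://{proxy_username}:{proxy_password}@{proxy_address}'
seleniumwire_options = {
    'proxy': {'http': proxy_url, 'https': proxy_url}
}
# Create the WebDriver instance with selenium-wire
driver = wiredriver.Chrome(options=chrome_options, seleniumwire_options=seleniumwire_options)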

Step 3: Extract information from the YouTube video page

Create a variable named "youtube_url_to_scrape" to store the URL of the YouTube page to scrape. This variable is then passed to the "driver.get()" method, which tells Selenium to open that page for scraping. When the script runs, this opens a separate Chrome window.

youtube_url_to_scrape = ""
# Open the page with the selenium-wire driver
driver.get(youtube_url_to_scrape)

Next, define the "extract_information()" function, which, as the name suggests, extracts the required information from the page.

It is important to make sure the page elements have loaded. To do this, the WebDriverWait class pauses the script until the "Show more" button, stored in the "element" variable, is available and clickable. Once it is, Selenium clicks it to expand the video's full description.

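YouTube loads comments dynamically as the page is scrolled, so the script has to scroll before it can collect them. Using the ActionChains class and the time module, it scrolls to the bottom of the page twice, pausing 10 seconds around each scroll, so that as many comments as possible are loaded before scraping.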

def extract_information() -> dict:
   try:
       # Wait up to 15 seconds for the "Show more" button to become clickable
       element = WebDriverWait(driver, 15).until(
           EC.element_to_be_clickable((By.XPATH, '//*[@id="expand"]'))
       )

       element.click()

       # Scroll to the bottom twice, pausing so that comments can load
       time.sleep(10)
       actions = ActionChains(driver)
       actions.send_keys(Keys.END).perform()
       time.sleep(10)
       actions.send_keys(Keys.END).perform()
       time.sleep(10)
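A fixed number of scrolls may load fewer comments than are available. One possible alternative, sketched here under assumptions, keeps scrolling until the number of loaded items stops growing. `scroll_until_stable`, `count_items`, and `scroll_once` are hypothetical names that are not part of the original script; the two callables stand in for Selenium calls so that the sketch runs without a browser.

```python
import time
from typing import Callable


def scroll_until_stable(
    count_items: Callable[[], int],
    scroll_once: Callable[[], None],
    max_scrolls: int = 10,
    pause_seconds: float = 2.0,
) -> int:
    """Scroll until the item count stops growing or max_scrolls is reached."""
    previous = count_items()
    for _ in range(max_scrolls):
        scroll_once()
        time.sleep(pause_seconds)  # give new content time to load
        current = count_items()
        if current == previous:
            break  # nothing new appeared, so stop scrolling
        previous = current
    return previous


# Fake page for demonstration: each scroll loads 5 more items, up to 12
state = {"loaded": 0}


def fake_scroll() -> None:
    state["loaded"] = min(state["loaded"] + 5, 12)


total = scroll_until_stable(lambda: state["loaded"], fake_scroll, pause_seconds=0)
print(total)
```

With the driver from this guide, `count_items` could wrap `len(driver.find_elements(...))` for the comment elements, and `scroll_once` could send `Keys.END` through `ActionChains`, as the fragment above does.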

There are several ways to locate elements with the Selenium WebDriver: by ID, CLASS_NAME, XPATH, and so on. The code in this guide uses XPATH throughout.

XPATH is a more complex but pattern-based way of locating elements while scraping. It is often considered the most difficult locator, but Chrome makes it easy.

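While inspecting the page with Chrome's inspect tool, simply right-click an element and copy its XPATH. You can then use the `find_elements` function to identify all elements that contain the desired information, such as the video title or description.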
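To make the pattern idea concrete, here is a small sketch that runs XPath-style expressions against a made-up markup fragment using the standard library's `xml.etree.ElementTree`, which supports a limited subset of XPath. The markup is an assumption for illustration and is far simpler than YouTube's real page; Selenium evaluates full XPath in the browser instead.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup for illustration only
sample = """
<html>
  <body>
    <div id="title"><h1>Sample video</h1></div>
    <div id="info">
      <span>1,234 views</span>
      <span> </span>
      <span>Aug 19, 2024</span>
    </div>
  </body>
</html>
"""

root = ET.fromstring(sample)

# ElementTree paths are relative, so they start with ".//" instead of "//"
titles = root.findall(".//*[@id='title']/h1")
info_spans = root.findall(".//*[@id='info']/span")

print(titles[0].text)      # text of the first matching <h1>
print(info_spans[0].text)  # first <span> under the "info" block
print(info_spans[2].text)  # third <span> under the "info" block
```

Each expression returns a list of matches, which is why the results are indexed before their text is read.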

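Note that some elements on a page share similar attributes, so a `find_elements()` call returns a list of matching elements rather than a single string. In such cases, you have to examine the list, find the index of the relevant element, and extract its text.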
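Indexing into such a list raises an IndexError when the page layout changes and fewer elements match. A small guard, sketched below, returns a default instead. `text_at` is a hypothetical helper that is not part of the original script, and `SimpleNamespace` objects stand in for Selenium elements so that the example runs without a browser.

```python
from types import SimpleNamespace


def text_at(elements, index: int = 0, default: str = "") -> str:
    """Return the text of elements[index], or default if the index is missing."""
    try:
        return elements[index].text
    except IndexError:
        return default


# Stand-ins for the objects returned by find_elements()
fake_elements = [SimpleNamespace(text="1,234 views"), SimpleNamespace(text="Aug 19, 2024")]

print(text_at(fake_elements, 1))
print(text_at(fake_elements, 5, default="N/A"))
```

In the scraper, a call such as `text_at(driver.find_elements(By.XPATH, ...), 2)` would replace direct indexing.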

Finally, the function returns a dictionary named `data` that holds all the information collected during scraping; it is the input for the next section.

       video_title = driver.find_elements(By.XPATH, '//*[@id="title"]/h1')[0].text

       owner = driver.find_elements(By.XPATH, '//*[@id="text"]/a')[0].text

       total_number_of_subscribers = \
           driver.find_elements(By.XPATH, "//div[@id='upload-info']//yt-formatted-string[@id='owner-sub-count']")[
               0].text

       # The description is split across several spans; join their text
       video_description = driver.find_elements(By.XPATH,
                                                '//*[@id="description-inline-expander"]/yt-attributed-string/span/span')
       result = []
       for i in video_description:
           result.append(i.text)
       description = ''.join(result)

       publish_date = driver.find_elements(By.XPATH, '//*[@id="info"]/span')[2].text
       total_views = driver.find_elements(By.XPATH, '//*[@id="info"]/span')[0].text

       number_of_likes = driver.find_elements(By.XPATH,
                                              '//*[@id="top-level-buttons-computed"]/segmented-like-dislike-button-view-model/yt-smartimation/div/div/like-button-view-model/toggle-button-view-model/button-view-model/button/div')[
           1].text

       comment_names = driver.find_elements(By.XPATH, '//*[@id="author-text"]/span')
       comment_content = driver.find_elements(By.XPATH, '//*[@id="content-text"]/span')
       comment_library = []

       # Pair each author with a comment; zip() stops at the shorter list
       for author, comment in zip(comment_names, comment_content):
           indie_comment = {
               'name': author.text,
               'comment': comment.text
           }
           comment_library.append(indie_comment)

       data = {
           'owner': owner,
           'subscribers': total_number_of_subscribers,
           'video_title': video_title,
           'description': description,
           'date': publish_date,
           'views': total_views,
           'likes': number_of_likes,
           'comments': comment_library
       }

       return data

   except Exception as err:
       print(f"Error: {err}")
       # Return an empty dict so the caller still receives a dict
       return {}

Step 4: Write the collected data to a JSON file

def organize_write_data(data: dict):
    # ensure_ascii=False keeps non-ASCII text (e.g. non-English comments) intact
    output = json.dumps(data, indent=2, ensure_ascii=False)
    try:
        with open("output.json", 'w', encoding='utf-8') as file:
            file.write(output)
    except Exception as err:
        print(f"Error encountered: {err}")

The `organize_write_data()` function takes the returned `data` as input and formats it as indented JSON. It then writes the result to an output file named "output.json", handling any errors that occur during the write.
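Comments often contain non-English text and emoji, so it can be worth confirming that they survive the round trip to disk. The sketch below writes a made-up record to a temporary file with `ensure_ascii=False` and UTF-8 encoding, then reads it back. The sample record and the temporary path are assumptions for illustration only.

```python
import json
import os
import tempfile

# Made-up record containing non-ASCII text, for illustration only
sample = {
    "owner": "Example Channel",
    "comments": [{"name": "@user", "comment": "좋은 영상 👍"}],
}

path = os.path.join(tempfile.mkdtemp(), "output.json")

# Write the record as human-readable UTF-8 JSON
with open(path, "w", encoding="utf-8") as file:
    json.dump(sample, file, indent=2, ensure_ascii=False)

# Read it back and compare with the original
with open(path, encoding="utf-8") as file:
    restored = json.load(file)

print(restored == sample)
```

Without `ensure_ascii=False`, the same data would still round-trip, but the file would contain `\uXXXX` escapes instead of readable characters.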

Full code

Here is the complete code of the scraping program:

from selenium.webdriver.chrome.options import Options
from seleniumwire import webdriver as wiredriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
import json
import time

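# Proxy server address (host:port) with its username and password
proxy_address = ""
proxy_username = ""
proxy_password = ""

# Chrome options (add regular Chrome flags here if needed)
chrome_options = Options()

# selenium-wire handles the authenticated upstream proxy
proxy_url = f'http://{proxy_username}:{proxy_password}@{proxy_address}'
seleniumwire_options = {
    'proxy': {'http': proxy_url, 'https': proxy_url}
}

# Create the WebDriver instance with selenium-wire
driver = wiredriver.Chrome(options=chrome_options, seleniumwire_options=seleniumwire_options)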

youtube_url_to_scrape = ""

# Open the page with the selenium-wire driver
driver.get(youtube_url_to_scrape)


def extract_information() -> dict:
   try:
       # Wait up to 15 seconds for the "Show more" button to become clickable
       element = WebDriverWait(driver, 15).until(
           EC.element_to_be_clickable((By.XPATH, '//*[@id="expand"]'))
       )
       element.click()

       time.sleep(10)
       actions = ActionChains(driver)
       actions.send_keys(Keys.END).perform()
       time.sleep(10)
       actions.send_keys(Keys.END).perform()
       time.sleep(10)

       video_title = driver.find_elements(By.XPATH, '//*[@id="title"]/h1')[0].text

       owner = driver.find_elements(By.XPATH, '//*[@id="text"]/a')[0].text
       total_number_of_subscribers = \
           driver.find_elements(By.XPATH, "//div[@id='upload-info']//yt-formatted-string[@id='owner-sub-count']")[
               0].text

       video_description = driver.find_elements(By.XPATH,
                                                '//*[@id="description-inline-expander"]/yt-attributed-string/span/span')
       result = []
       for i in video_description:
           result.append(i.text)
       description = ''.join(result)

       publish_date = driver.find_elements(By.XPATH, '//*[@id="info"]/span')[2].text
       total_views = driver.find_elements(By.XPATH, '//*[@id="info"]/span')[0].text

       number_of_likes = driver.find_elements(By.XPATH,
                                              '//*[@id="top-level-buttons-computed"]/segmented-like-dislike-button-view-model/yt-smartimation/div/div/like-button-view-model/toggle-button-view-model/button-view-model/button/div')[
           1].text

       comment_names = driver.find_elements(By.XPATH, '//*[@id="author-text"]/span')
       comment_content = driver.find_elements(By.XPATH,
                                              '//*[@id="content-text"]/span')
       comment_library = []

       # Pair each author with a comment; zip() stops at the shorter list
       for author, comment in zip(comment_names, comment_content):
           indie_comment = {
               'name': author.text,
               'comment': comment.text
           }
           comment_library.append(indie_comment)

       data = {
           'owner': owner,
           'subscribers': total_number_of_subscribers,
           'video_title': video_title,
           'description': description,
           'date': publish_date,
           'views': total_views,
           'likes': number_of_likes,
           'comments': comment_library
       }

       return data

   except Exception as err:
       print(f"Error: {err}")
       # Return an empty dict so the caller still receives a dict
       return {}


# Write the data to a JSON file
def organize_write_data(data: dict):
   # ensure_ascii=False keeps non-ASCII text (e.g. non-English comments) intact
   output = json.dumps(data, indent=2, ensure_ascii=False)
   try:
       with open("output.json", 'w', encoding='utf-8') as file:
           file.write(output)
   except Exception as err:
       print(f"Error encountered: {err}")


organize_write_data(extract_information())
driver.quit()

Results

The output looks like this:

[Image: screenshot of the script's JSON output]

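A well-crafted script that uses proxies and respects the platform's policies and regulations is a significant advantage when tapping into YouTube's wealth of information safely. The approach described above facilitates responsible data extraction and reduces the risk of restrictions imposed by the platform.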
