Scraping Netflix can yield detailed information about movies and TV series, including titles, release dates, genres, and descriptions. This article demonstrates how to extract data from multiple Netflix movie pages using Python, requests, and lxml. Since Netflix doesn't provide an open API for movie data, scraping allows us to gather content data that can support recommendations, content analysis, and other applications.
To start, ensure that the requests and lxml libraries are installed. Use the following commands to set up your environment:
pip install requests
pip install lxml
These libraries are essential for sending HTTP requests to the Netflix pages and parsing HTML content to extract required data.
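Before touching Netflix itself, a quick offline sketch shows how lxml's XPath extraction works once a page has been fetched. The HTML fragment below is invented for illustration; it only mimics the attribute patterns used later in this article.

```python
from lxml.html import fromstring

# Invented HTML standing in for a fetched Netflix page
html = '''
<html><body>
  <h1 class="title-title">Example Movie</h1>
  <span data-uia="more-details-item-genres"><a>Drama</a><a>Thriller</a></span>
</body></html>
'''

parser = fromstring(html)
# A single match: take the first result with [0]
title = parser.xpath('//h1[@class="title-title"]/text()')[0]
# Multiple matches: keep the whole list
genres = parser.xpath('//span[@data-uia="more-details-item-genres"]/a/text()')
print(title, genres)
```

The same two patterns, taking the first match versus keeping the full list, reappear throughout the extraction code below.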
To access Netflix pages, we need a list of URLs to iterate through and retrieve movie details from. This tutorial scrapes each specified page for the movie's title, year, duration, description, genre, and more.
Netflix employs strict anti-bot measures, so using correct headers and proxies (if necessary) can prevent detection. In this script, we mimic a real browser by setting up custom headers with a User-Agent, language preferences, and other parameters, making our requests look more legitimate.
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-IN,en;q=0.9',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'priority': 'u=0, i',
    'sec-ch-ua': '"Chromium";v="130", "Google Chrome";v="130", "Not?A_Brand";v="99"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Linux"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36',
}
If needed, proxies can be added to route requests through different IP addresses, further reducing the likelihood of being flagged. Here is an example of integrating a proxy with IP-address authentication:
proxies = {
    'http': 'http://your_proxy_address:port',
    'https': 'http://your_proxy_address:port',
}
response = requests.get(url, headers=headers, proxies=proxies)
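Requests can also be routed through a Session with automatic retries, so transient failures or throttling responses are retried with a backoff instead of aborting the run. This is a sketch using urllib3's Retry class; the retry count, backoff factor, and status codes are illustrative choices, not Netflix-specific requirements.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Retry up to 3 times on common throttling/server errors, waiting longer each time
retry = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503])
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)

# session.get(url, headers=headers, proxies=proxies) can now replace requests.get()
```

A Session also reuses the underlying TCP connection across requests, which is slightly faster when iterating over many URLs.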
We specify a list of Netflix movie URLs, which our script will iterate through to extract data.
urls_list = [
    'Https link',
    'Https link'
]
Each URL is accessed with the requests.get() method, passing the headers to avoid detection.
response = requests.get(url, headers=headers)
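Netflix may answer automated traffic with a 403 or 429 status instead of the page, so it is worth checking the status code before parsing. The helper below is a sketch, not part of the original script: the getter argument is injected (normally requests.get) so the retry logic can be exercised without network access, and the retry and delay values are arbitrary.

```python
import time

def fetch_with_backoff(url, getter, headers=None, retries=3, delay=2):
    """Retry when the request looks blocked (403/429); raise on other errors.

    `getter` is any callable with the requests.get signature.
    """
    for attempt in range(retries):
        response = getter(url, headers=headers)
        if response.status_code == 200:
            return response
        if response.status_code in (403, 429):
            # Back off a little longer on each failed attempt
            time.sleep(delay * (attempt + 1))
            continue
        response.raise_for_status()
    return None
```

Usage would look like `response = fetch_with_backoff(url, requests.get, headers=headers)`; a return value of None means every attempt was blocked, and that URL can be skipped rather than crashing the loop.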
Using lxml, we parse the HTML response to navigate and extract data using XPath expressions.
from lxml.html import fromstring
parser = fromstring(response.text)
Using XPath, we capture essential movie details like title, year, duration, description, genre, subtitles, and more. Below is how each field is extracted:
title = parser.xpath('//h1[@class="title-title"]/text()')[0]
year = parser.xpath('//span[@data-uia="item-year"]/text()')[0]
duration = parser.xpath('//span[@class="duration"]/text()')[0]
description = parser.xpath('//div[@class="title-info-synopsis"]/text()')[0]
maturity_number = parser.xpath('//span[@class="maturity-number"]/text()')[0]
starring = parser.xpath('//span[@data-uia="info-starring"]/text()')[0]
genre = parser.xpath('//a[@data-uia="item-genre"]/text()')[0]
genres = parser.xpath('//span[@data-uia="more-details-item-genres"]/a/text()')
subtitles = ''.join(parser.xpath('//span[@data-uia="more-details-item-subtitle"]/text()'))
audio = ''.join(parser.xpath('//span[@data-uia="more-details-item-audio"]/text()'))
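Each of the single-value expressions above indexes `[0]` on the XPath result, which raises an IndexError if Netflix changes its markup and a selector stops matching. A small helper (an addition of ours, not part of the original script) makes extraction fail soft by returning a default instead:

```python
from lxml.html import fromstring

def first(parser, xpath, default=''):
    """Return the first XPath match, stripped, or `default` when nothing matches."""
    matches = parser.xpath(xpath)
    return matches[0].strip() if matches else default

# Demo on an invented fragment: a missing selector no longer crashes the loop
demo = fromstring('<h1 class="title-title">Example Movie</h1>')
title = first(demo, '//h1[@class="title-title"]/text()')
missing = first(demo, '//span[@class="duration"]/text()')
```

With this in place, a page that lacks one field yields an empty string in the CSV rather than aborting the whole run.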
We store each movie’s data in a dictionary and append it to a list that is initialized once, before the loop. This approach keeps the data organized and ready to be saved as CSV.
extracted_data = []

data = {
    'title': title,
    'year': year,
    'duration': duration,
    'description': description,
    'maturity_number': maturity_number,
    'starring': starring,
    'genre': genre,
    'genres': genres,
    'subtitles': subtitles,
    'audio': audio
}
extracted_data.append(data)
Finally, after iterating through all URLs, we write the accumulated data to a CSV file.
import csv
with open('netflix_data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=extracted_data[0].keys())
    writer.writeheader()
    writer.writerows(extracted_data)
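Because DictWriter derives its columns from the first row's keys, a quick round trip through an in-memory buffer confirms the layout reads back cleanly. The row values here are invented stand-ins for the scraped fields.

```python
import csv
import io

# Hypothetical row mirroring a subset of the fields collected above
rows = [{'title': 'Example Movie', 'year': '2021', 'genre': 'Drama'}]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
writer.writeheader()
writer.writerows(rows)

buf.seek(0)
restored = list(csv.DictReader(buf))
print(restored[0]['title'])
```

One caveat: the `genres` field in our script is a Python list, which DictWriter would serialize as its repr (e.g. `['Drama', 'Thriller']`). If you want a cleaner CSV cell, join it first with something like `', '.join(genres)` before building the row dictionary.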
The complete code combines all steps with headers and proxy setup.
import requests
from lxml.html import fromstring
import csv
urls_list = [
    'Https link',
    'Https link'
]
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-IN,en;q=0.9',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'priority': 'u=0, i',
    'sec-ch-ua': '"Chromium";v="130", "Google Chrome";v="130", "Not?A_Brand";v="99"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Linux"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36',
}
proxies = {
    'http': 'http://your_proxy_address:port',
    'https': 'http://your_proxy_address:port',
}
extracted_data = []
for url in urls_list:
    # Fetch the page with browser-like headers (and proxies, if configured)
    response = requests.get(url, headers=headers, proxies=proxies)
    parser = fromstring(response.text)

    # Extract each field with XPath; [0] takes the first match
    title = parser.xpath('//h1[@class="title-title"]/text()')[0]
    year = parser.xpath('//span[@data-uia="item-year"]/text()')[0]
    duration = parser.xpath('//span[@class="duration"]/text()')[0]
    description = parser.xpath('//div[@class="title-info-synopsis"]/text()')[0]
    maturity_number = parser.xpath('//span[@class="maturity-number"]/text()')[0]
    starring = parser.xpath('//span[@data-uia="info-starring"]/text()')[0]
    genre = parser.xpath('//a[@data-uia="item-genre"]/text()')[0]
    genres = parser.xpath('//span[@data-uia="more-details-item-genres"]/a/text()')
    subtitles = ''.join(parser.xpath('//span[@data-uia="more-details-item-subtitle"]/text()'))
    audio = ''.join(parser.xpath('//span[@data-uia="more-details-item-audio"]/text()'))

    data = {
        'title': title,
        'year': year,
        'duration': duration,
        'description': description,
        'maturity_number': maturity_number,
        'starring': starring,
        'genre': genre,
        'genres': genres,
        'subtitles': subtitles,
        'audio': audio
    }
    extracted_data.append(data)

# Write all collected rows to CSV once every URL has been processed
with open('netflix_data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=extracted_data[0].keys())
    writer.writeheader()
    writer.writerows(extracted_data)

print('saved into netflix_data.csv')
Scraping Netflix data with Python provides a practical way to access content details in the absence of an official API. With browser-like headers, optional proxies, and XPath parsing, the script gathers and saves the data reliably. It can be adapted for streaming analytics, recommendation systems, or content monitoring.