Scraping Netflix can yield detailed information about movies and TV series, including titles, release dates, genres, and descriptions. This article demonstrates how to extract data from multiple Netflix movie pages using Python, requests, and lxml. Since Netflix doesn't provide an open API for movie data, scraping allows us to gather content data that can support recommendations, content analysis, and other applications.
To start, ensure that the requests and lxml libraries are installed. Use the following commands to set up your environment:
pip install requests
pip install lxml
These libraries are essential for sending HTTP requests to the Netflix pages and parsing HTML content to extract required data.
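Before touching Netflix itself, a quick offline sketch shows how lxml's XPath extraction works once a page has been fetched. The HTML fragment below is invented for illustration; it only mimics the attribute patterns used later in this article.

```python
from lxml.html import fromstring

# Invented HTML standing in for a fetched Netflix page
html = '''
<html><body>
  <h1 class="title-title">Example Movie</h1>
  <span data-uia="more-details-item-genres"><a>Drama</a><a>Thriller</a></span>
</body></html>
'''

parser = fromstring(html)
# A single match: take the first result with [0]
title = parser.xpath('//h1[@class="title-title"]/text()')[0]
# Multiple matches: keep the whole list
genres = parser.xpath('//span[@data-uia="more-details-item-genres"]/a/text()')
print(title, genres)
```

The same two patterns, taking the first match versus keeping the full list, reappear throughout the extraction code below.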
To access Netflix pages, we need a list of URLs to iterate through and retrieve movie details from. This tutorial scrapes each specified page for the movie's title, year, duration, description, genre, and more.
Netflix employs strict anti-bot measures, so using correct headers and proxies (if necessary) can prevent detection. In this script, we mimic a real browser by setting up custom headers with a User-Agent, language preferences, and other parameters, making our requests look more legitimate.
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-IN,en;q=0.9',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'priority': 'u=0, i',
    'sec-ch-ua': '"Chromium";v="130", "Google Chrome";v="130", "Not?A_Brand";v="99"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Linux"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36',
}
If needed, proxies can be added to route requests through different IP addresses, further reducing the likelihood of being flagged. Here is an example of integrating a proxy with IP-address authentication:
proxies = {
    'http': 'http://your_proxy_address:port',
    'https': 'http://your_proxy_address:port',
}
response = requests.get(url, headers=headers, proxies=proxies)
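Requests can also be routed through a Session with automatic retries, so transient failures or throttling responses are retried with a backoff instead of aborting the run. This is a sketch using urllib3's Retry class; the retry count, backoff factor, and status codes are illustrative choices, not Netflix-specific requirements.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Retry up to 3 times on common throttling/server errors, waiting longer each time
retry = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503])
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)

# session.get(url, headers=headers, proxies=proxies) can now replace requests.get()
```

A Session also reuses the underlying TCP connection across requests, which is slightly faster when iterating over many URLs.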
We specify a list of Netflix movie URLs, which our script will iterate through to extract data.
urls_list = [
    'Https link',
    'Https link'
]
Each URL is accessed with the requests.get() method, passing the headers to avoid detection.
response = requests.get(url, headers=headers)
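Netflix may answer automated traffic with a 403 or 429 status instead of the page, so it is worth checking the status code before parsing. The helper below is a sketch, not part of the original script: the getter argument is injected (normally requests.get) so the retry logic can be exercised without network access, and the retry and delay values are arbitrary.

```python
import time

def fetch_with_backoff(url, getter, headers=None, retries=3, delay=2):
    """Retry when the request looks blocked (403/429); raise on other errors.

    `getter` is any callable with the requests.get signature.
    """
    for attempt in range(retries):
        response = getter(url, headers=headers)
        if response.status_code == 200:
            return response
        if response.status_code in (403, 429):
            # Back off a little longer on each failed attempt
            time.sleep(delay * (attempt + 1))
            continue
        response.raise_for_status()
    return None
```

Usage would look like `response = fetch_with_backoff(url, requests.get, headers=headers)`; a return value of None means every attempt was blocked, and that URL can be skipped rather than crashing the loop.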
Using lxml, we parse the HTML response to navigate and extract data using XPath expressions.
from lxml.html import fromstring
parser = fromstring(response.text)
Using XPath, we capture essential movie details like title, year, duration, description, genre, subtitles, and more. Below is how each field is extracted:
title = parser.xpath('//h1[@class="title-title"]/text()')[0]
year = parser.xpath('//span[@data-uia="item-year"]/text()')[0]
duration = parser.xpath('//span[@class="duration"]/text()')[0]
description = parser.xpath('//div[@class="title-info-synopsis"]/text()')[0]
maturity_number = parser.xpath('//span[@class="maturity-number"]/text()')[0]
starring = parser.xpath('//span[@data-uia="info-starring"]/text()')[0]
genre = parser.xpath('//a[@data-uia="item-genre"]/text()')[0]
genres = parser.xpath('//span[@data-uia="more-details-item-genres"]/a/text()')
subtitles = ''.join(parser.xpath('//span[@data-uia="more-details-item-subtitle"]/text()'))
audio = ''.join(parser.xpath('//span[@data-uia="more-details-item-audio"]/text()'))
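Each of the single-value expressions above indexes `[0]` on the XPath result, which raises an IndexError if Netflix changes its markup and a selector stops matching. A small helper (an addition of ours, not part of the original script) makes extraction fail soft by returning a default instead:

```python
from lxml.html import fromstring

def first(parser, xpath, default=''):
    """Return the first XPath match, stripped, or `default` when nothing matches."""
    matches = parser.xpath(xpath)
    return matches[0].strip() if matches else default

# Demo on an invented fragment: a missing selector no longer crashes the loop
demo = fromstring('<h1 class="title-title">Example Movie</h1>')
title = first(demo, '//h1[@class="title-title"]/text()')
missing = first(demo, '//span[@class="duration"]/text()')
```

With this in place, a page that lacks one field yields an empty string in the CSV rather than aborting the whole run.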
We store each movie’s data in a dictionary and append it to a list that is initialized once, before the loop. This approach keeps the data organized and ready to be saved as CSV.
extracted_data = []

data = {
    'title': title,
    'year': year,
    'duration': duration,
    'description': description,
    'maturity_number': maturity_number,
    'starring': starring,
    'genre': genre,
    'genres': genres,
    'subtitles': subtitles,
    'audio': audio
}
extracted_data.append(data)
Finally, after iterating through all URLs, we write the accumulated data to a CSV file.
import csv
with open('netflix_data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=extracted_data[0].keys())
    writer.writeheader()
    writer.writerows(extracted_data)
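Because DictWriter derives its columns from the first row's keys, a quick round trip through an in-memory buffer confirms the layout reads back cleanly. The row values here are invented stand-ins for the scraped fields.

```python
import csv
import io

# Hypothetical row mirroring a subset of the fields collected above
rows = [{'title': 'Example Movie', 'year': '2021', 'genre': 'Drama'}]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
writer.writeheader()
writer.writerows(rows)

buf.seek(0)
restored = list(csv.DictReader(buf))
print(restored[0]['title'])
```

One caveat: the `genres` field in our script is a Python list, which DictWriter would serialize as its repr (e.g. `['Drama', 'Thriller']`). If you want a cleaner CSV cell, join it first with something like `', '.join(genres)` before building the row dictionary.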
The complete code combines all steps with headers and proxy setup.
import requests
from lxml.html import fromstring
import csv
urls_list = [
    'Https link',
    'Https link'
]
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-IN,en;q=0.9',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'priority': 'u=0, i',
    'sec-ch-ua': '"Chromium";v="130", "Google Chrome";v="130", "Not?A_Brand";v="99"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Linux"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36',
}
proxies = {
    'http': 'http://your_proxy_address:port',
    'https': 'http://your_proxy_address:port',
}
extracted_data = []
for url in urls_list:
    # Fetch the page with browser-like headers (and proxies, if configured)
    response = requests.get(url, headers=headers, proxies=proxies)
    parser = fromstring(response.text)

    # Extract each field with XPath; [0] takes the first match
    title = parser.xpath('//h1[@class="title-title"]/text()')[0]
    year = parser.xpath('//span[@data-uia="item-year"]/text()')[0]
    duration = parser.xpath('//span[@class="duration"]/text()')[0]
    description = parser.xpath('//div[@class="title-info-synopsis"]/text()')[0]
    maturity_number = parser.xpath('//span[@class="maturity-number"]/text()')[0]
    starring = parser.xpath('//span[@data-uia="info-starring"]/text()')[0]
    genre = parser.xpath('//a[@data-uia="item-genre"]/text()')[0]
    genres = parser.xpath('//span[@data-uia="more-details-item-genres"]/a/text()')
    subtitles = ''.join(parser.xpath('//span[@data-uia="more-details-item-subtitle"]/text()'))
    audio = ''.join(parser.xpath('//span[@data-uia="more-details-item-audio"]/text()'))

    data = {
        'title': title,
        'year': year,
        'duration': duration,
        'description': description,
        'maturity_number': maturity_number,
        'starring': starring,
        'genre': genre,
        'genres': genres,
        'subtitles': subtitles,
        'audio': audio
    }
    extracted_data.append(data)

# Write all collected rows to CSV once every URL has been processed
with open('netflix_data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=extracted_data[0].keys())
    writer.writeheader()
    writer.writerows(extracted_data)

print('saved into netflix_data.csv')
Scraping Netflix data with Python provides a practical way to access content details in the absence of an official API. With browser-like headers, optional proxies, and XPath parsing, the script gathers and saves the data reliably. It can be adapted for streaming analytics, recommendation systems, or content monitoring.