How to scrape IMDB data using Python

Extracting data from sites such as IMDB is an effective way to collect movie-related information for research or personal projects. In this tutorial, we will walk through scraping the Top 250 movies on IMDB using Python, extracting details such as movie titles, summaries, ratings, genres, and more.

When scraping websites like IMDB, it's crucial to simulate the behavior of a real user to minimize the risk of detection and ensure successful data retrieval. Here are some strategies that can be employed:

  1. Avoid IP Blocking: Websites often limit the number of requests that can be made from a single IP address to prevent scraping. By using proxies, you can spread your requests over multiple IP addresses, reducing the risk of being blocked.
  2. Ensure Anonymity: Proxies mask your real IP address, which not only helps to protect your privacy but also makes it harder for websites to track scraping activities back to you.
  3. Comply with Speed Limits: Distributing requests through multiple proxies can help manage the frequency of your queries, staying within the website’s rate limits and reducing the likelihood of triggering anti-scraping measures.
  4. Blend In with Browser Traffic: Sending headers that mimic those of a typical browser, such as User-Agent, makes your scraping requests look like ordinary page visits. This reduces the chance of the server flagging your activity as suspicious.
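The rotation and pacing described above can be sketched in a few lines. The proxy URLs below are placeholders, and the delay range is an assumption you should tune for your own use:

```python
import itertools
import random

# Hypothetical proxy pool -- replace with endpoints from your own provider.
PROXY_POOL = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]

_proxy_cycle = itertools.cycle(PROXY_POOL)

def next_request_config():
    """Rotate to the next proxy and pick a jittered delay between requests."""
    proxy = next(_proxy_cycle)
    proxies = {"http": proxy, "https": proxy}
    delay = random.uniform(1.0, 3.0)  # assumed polite pacing, in seconds
    return proxies, delay
```

Each call returns a proxies dict ready to pass to requests.get, plus a pause to sleep before the next request, so consecutive requests come from different IPs at irregular intervals.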

Step 1: Preparing the Scraper

For this tutorial, we will use Python's requests library to download the web content, lxml to parse the HTML, and the built-in json module to handle the structured data embedded in the page. We will also use pandas later to save the results. First, install the required third-party libraries.

Installing the libraries

Before starting, you need to install the necessary Python libraries. Run the following command in your terminal to install them:


pip install requests lxml pandas

These libraries will be used to make HTTP requests, parse HTML content, and process the extracted data.

Configuring HTTP Request Headers

To make our requests resemble those from a real web browser, it’s crucial to set up the HTTP headers accordingly. Here is an example of how you might configure these headers in your script:


import requests

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-IN,en;q=0.9',
    'cache-control': 'no-cache',
    'dnt': '1',  # Do Not Track header
    'pragma': 'no-cache',
    'sec-ch-ua': '"Google Chrome";v="129", "Not=A?Brand";v="8", "Chromium";v="129"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Linux"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36',
}

response = requests.get('https://www.imdb.com/chart/top/', headers=headers)
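A request that comes back blocked (for example, a 403) will still "succeed" and then parse as an empty page, so it is worth failing fast before parsing. A small sketch of such a guard (ensure_ok is a hypothetical helper, not part of requests):

```python
def ensure_ok(response):
    """Raise for HTTP errors and confirm the body looks like HTML."""
    response.raise_for_status()
    content_type = response.headers.get("content-type", "")
    if "text/html" not in content_type:
        raise ValueError(f"Unexpected content type: {content_type!r}")
    return response
```

Call ensure_ok(response) right after requests.get so a block or redirect surfaces as an exception instead of a confusing parse failure later.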

Setting Up Proxies

Proxies are useful for large-scale scraping. They help you avoid getting blocked by distributing your requests across multiple IPs. Here’s how you can include a proxy:


proxies = {
    "http": "http://your_proxy_server",
    "https": "https://your_proxy_server"
}

response = requests.get('https://www.imdb.com/chart/top/', headers=headers, proxies=proxies)

Replace "your_proxy_server" with the details of a proxy you have access to. This keeps your real IP address from being exposed and helps you avoid getting blocked.
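Many commercial proxies require authentication, which requests accepts as credentials embedded in the proxy URL. The values below are placeholders standing in for your provider's details:

```python
# Placeholder credentials -- substitute your own proxy provider's details.
username = "proxy_user"
password = "proxy_pass"
host = "proxy.example.com"
port = 8080

proxy_url = f"http://{username}:{password}@{host}:{port}"
proxies = {
    "http": proxy_url,
    "https": proxy_url,
}
```

The resulting dict is passed to requests.get exactly as in the snippet above.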

Step 2: Parsing the HTML Content

After fetching the webpage content, we need to parse it to extract movie details. We'll use lxml to parse the HTML and json to handle the structured data:


from lxml.html import fromstring
import json

# Parse the HTML response
parser = fromstring(response.text)

# Extract the JSON-LD data (structured data) from the script tag
raw_data = parser.xpath('//script[@type="application/ld+json"]/text()')[0]
json_data = json.loads(raw_data)

# Now we have structured movie data in JSON format
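Indexing the XPath result with [0] raises an IndexError if the page ever comes back without a JSON-LD script tag (for example, when a request is blocked). A defensive variant, sketched as a small helper:

```python
import json
from lxml.html import fromstring

def extract_json_ld(html):
    """Return the first JSON-LD block in the page, or None if absent."""
    parser = fromstring(html)
    scripts = parser.xpath('//script[@type="application/ld+json"]/text()')
    if not scripts:
        return None
    return json.loads(scripts[0])

# Minimal sample page for illustration.
sample = ('<html><head><script type="application/ld+json">'
          '{"@type": "ItemList"}</script></head></html>')
```

Checking for None lets the scraper log a useful error instead of crashing mid-run.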

Step 3: Extracting Movie Details

The IMDB Top 250 page includes structured data embedded in the HTML, which can be easily accessed using XPath and parsed as JSON. We’ll extract movie details such as name, description, ratings, genres, and more:


movies_details = json_data.get('itemListElement', [])

# Loop through the movie data
movies_data = []
for movie in movies_details:
    item = movie['item']
    rating = item.get('aggregateRating', {})  # rating block may be absent

    # Append each movie's data to the list
    movies_data.append({
        'movie_type': item.get('@type'),
        'url': item.get('url'),
        'name': item.get('name'),
        'description': item.get('description'),
        'image': item.get('image'),
        'bestRating': rating.get('bestRating'),
        'worstRating': rating.get('worstRating'),
        'ratingValue': rating.get('ratingValue'),
        'ratingCount': rating.get('ratingCount'),
        'contentRating': item.get('contentRating'),
        'genre': item.get('genre'),
        'duration': item.get('duration')
    })
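The duration field comes back as an ISO 8601 string such as "PT2H22M" rather than a number. If you want runtimes in minutes, a small converter helps; the format here is an assumption based on schema.org conventions, so unexpected values fall back to None:

```python
import re

def iso_duration_to_minutes(duration):
    """Convert an ISO 8601 duration like 'PT2H22M' to total minutes."""
    match = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?", duration or "")
    if not match:
        return None  # format did not match our assumption
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return hours * 60 + minutes
```

You could apply this inside the loop, storing both the raw string and the computed minutes.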

Step 4: Storing the Data

Once the data is extracted, it’s important to store it in a format that’s easy to analyze. In this case, we'll save it to a CSV file using the pandas library:


import pandas as pd

# Convert the list of movies to a pandas DataFrame
df = pd.DataFrame(movies_data)

# Save the data to a CSV file
df.to_csv('imdb_top_250_movies.csv', index=False)

print("IMDB Top 250 movies data saved to imdb_top_250_movies.csv")
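CSV is not the only option. Writing the same list of dicts as JSON preserves nested values, such as a movie's list of genres, which CSV flattens into a string. A minimal sketch, with a sample record standing in for the movies_data list built in Step 3:

```python
import json

# Sample record standing in for the movies_data list built in Step 3.
movies_data = [{"name": "The Shawshank Redemption",
                "ratingValue": "9.3",
                "genre": ["Drama"]}]

# Write the records as pretty-printed JSON.
with open("imdb_top_250_movies.json", "w", encoding="utf-8") as f:
    json.dump(movies_data, f, ensure_ascii=False, indent=2)
```

The file can later be reloaded with json.load with the nested structure intact.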

Complete Code

Here’s the complete code for scraping IMDB’s Top 250 movies:


import requests
from lxml.html import fromstring
import json
import pandas as pd

# Define headers for the request
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-IN,en;q=0.9',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'sec-ch-ua': '"Google Chrome";v="129", "Not=A?Brand";v="8", "Chromium";v="129"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Linux"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36',
}

# Optionally, set up proxies (replace the placeholders with real proxy
# details, or drop the proxies argument below to connect directly)
proxies = {
    "http": "http://your_proxy_server",
    "https": "https://your_proxy_server"
}

# Send the request to the IMDB Top 250 page
response = requests.get('https://www.imdb.com/chart/top/', headers=headers, proxies=proxies)
response.raise_for_status()  # stop early if the request was blocked or failed

# Parse the HTML response
parser = fromstring(response.text)

# Extract the JSON-LD data
raw_data = parser.xpath('//script[@type="application/ld+json"]/text()')[0]
json_data = json.loads(raw_data)

# Extract movie details
movies_details = json_data.get('itemListElement', [])

movies_data = []
for movie in movies_details:
    item = movie['item']
    rating = item.get('aggregateRating', {})  # rating block may be absent

    movies_data.append({
        'movie_type': item.get('@type'),
        'url': item.get('url'),
        'name': item.get('name'),
        'description': item.get('description'),
        'image': item.get('image'),
        'bestRating': rating.get('bestRating'),
        'worstRating': rating.get('worstRating'),
        'ratingValue': rating.get('ratingValue'),
        'ratingCount': rating.get('ratingCount'),
        'contentRating': item.get('contentRating'),
        'genre': item.get('genre'),
        'duration': item.get('duration')
    })

# Save the data to a CSV file
df = pd.DataFrame(movies_data)
df.to_csv('imdb_top_250_movies.csv', index=False)
print("IMDB Top 250 movies data saved to imdb_top_250_movies.csv")

Ethical Considerations

Before scraping any website, it’s important to consider ethical and legal issues:

  • Respect the Robots.txt: Check IMDB’s robots.txt file to see what parts of the website are allowed for scraping. Always adhere to the website’s policies.
  • Avoid Overloading the Server: Scrape data responsibly by limiting the frequency of your requests to avoid putting unnecessary load on the server.
  • Respect Terms of Service: Ensure that scraping doesn’t violate IMDB’s terms of service.

Always be mindful of the rules and use web scraping for legitimate purposes.
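One concrete way to honor these guidelines is to enforce a minimum gap between requests. A sketch of such a throttle (polite_get is a hypothetical wrapper, and the two-second interval is an assumption to tune for the site you are scraping):

```python
import time

REQUEST_INTERVAL = 2.0  # assumed minimum seconds between requests

_last_request = 0.0

def polite_get(session, url, **kwargs):
    """Wrap session.get, sleeping so calls stay REQUEST_INTERVAL apart."""
    global _last_request
    wait = REQUEST_INTERVAL - (time.monotonic() - _last_request)
    if wait > 0:
        time.sleep(wait)
    _last_request = time.monotonic()
    return session.get(url, **kwargs)
```

Using this wrapper everywhere in place of requests.get ensures your scraper never hammers the server, even when following many links.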
