Extracting data from online platforms such as IMDB is an effective way to gather movie information for research or personal projects. In this tutorial, we will walk through scraping the Top 250 movies on IMDB using Python, extracting details such as titles, summaries, ratings, and genres.
When scraping websites like IMDB, it's crucial to make your requests resemble those of a real user to minimize the risk of detection and ensure reliable data retrieval. The two strategies covered below are sending realistic browser headers and routing requests through proxies.
For this tutorial, we will use Python's requests library to download the page, lxml to parse the HTML, the standard-library json module to handle the embedded structured data, and pandas to save the results. Run the following command in your terminal to install the third-party packages:
pip install requests lxml pandas
These libraries will be used to make HTTP requests, parse HTML content, and process the extracted data.
To make our requests resemble those from a real web browser, it’s crucial to set up the HTTP headers accordingly. Here is an example of how you might configure these headers in your script:
import requests
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-IN,en;q=0.9',
    'cache-control': 'no-cache',
    'dnt': '1',  # Do Not Track header
    'pragma': 'no-cache',
    'sec-ch-ua': '"Google Chrome";v="129", "Not=A?Brand";v="8", "Chromium";v="129"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Linux"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36',
}
response = requests.get('https://www.imdb.com/chart/top/', headers=headers)
Proxies are useful for large-scale scraping. They help you avoid getting blocked by distributing your requests across multiple IPs. Here’s how you can include a proxy:
proxies = {
    "http": "http://your_proxy_server",
    "https": "https://your_proxy_server"
}
response = requests.get('https://www.imdb.com/chart/top/', headers=headers, proxies=proxies)
Replace "your_proxy_server" with the endpoint of a proxy you actually have access to. Routing requests through a proxy keeps your own IP address out of the site's logs and reduces the chance of being blocked.
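For larger jobs, a single static proxy is often not enough. One common approach is to rotate through a pool of endpoints, picking one per request. Here is a minimal sketch; the hostnames below are placeholders, not real servers:

```python
import random

# Hypothetical proxy pool -- replace with endpoints you actually have.
proxy_pool = [
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
    'http://proxy3.example.com:8080',
]

def pick_proxies(pool):
    """Pick one proxy at random and build the mapping requests expects."""
    proxy = random.choice(pool)
    return {'http': proxy, 'https': proxy}

proxies = pick_proxies(proxy_pool)
# The resulting dict is passed to requests.get(..., proxies=proxies)
```

Calling pick_proxies before each request spreads traffic across the pool, so no single IP accumulates a suspicious request volume.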
After fetching the webpage content, we need to parse it to extract movie details. We'll use lxml to parse the HTML and json to handle the structured data:
from lxml.html import fromstring
import json
# Parse the HTML response
parser = fromstring(response.text)
# Extract the JSON-LD data (structured data) from the script tag
raw_data = parser.xpath('//script[@type="application/ld+json"]/text()')[0]
json_data = json.loads(raw_data)
# Now we have structured movie data in JSON format
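To see the same extraction working end to end without a network request, here is a self-contained sketch that runs the XPath and json.loads steps against a small inline HTML sample (the sample markup is illustrative, not IMDB's actual page):

```python
import json
from lxml.html import fromstring

# Minimal HTML mimicking a page with embedded JSON-LD structured data.
sample_html = '''
<html><head>
<script type="application/ld+json">
{"@type": "ItemList",
 "itemListElement": [
   {"item": {"@type": "Movie", "name": "The Shawshank Redemption"}}
 ]}
</script>
</head><body></body></html>
'''

parser = fromstring(sample_html)
# Grab the text content of the JSON-LD script tag and decode it.
raw = parser.xpath('//script[@type="application/ld+json"]/text()')[0]
data = json.loads(raw)
print(data['itemListElement'][0]['item']['name'])
```

The real page follows the same shape: an ItemList whose itemListElement entries each wrap one movie record.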
The IMDB Top 250 page includes structured data embedded in the HTML, which can be easily accessed using XPath and parsed as JSON. We’ll extract movie details such as name, description, ratings, genres, and more:
movies_details = json_data.get('itemListElement')

# Loop through the movie data
movies_data = []
for movie in movies_details:
    item = movie['item']
    rating = item['aggregateRating']
    # Append each movie's data to the list
    movies_data.append({
        'movie_type': item['@type'],
        'url': item['url'],
        'name': item['name'],
        'description': item['description'],
        'image': item['image'],
        'bestRating': rating['bestRating'],
        'worstRating': rating['worstRating'],
        'ratingValue': rating['ratingValue'],
        'ratingCount': rating['ratingCount'],
        'contentRating': item.get('contentRating'),
        'genre': item['genre'],
        'duration': item['duration'],
    })
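One wrinkle worth noting: schema.org structured data encodes duration as an ISO 8601 string (for example, PT2H22M for 2 hours 22 minutes), which is awkward to sort or average as-is. A small helper — hypothetical, not part of the scraper above — can convert it to minutes:

```python
import re

def iso8601_to_minutes(duration):
    """Convert an ISO 8601 duration such as 'PT2H22M' to total minutes."""
    match = re.fullmatch(r'PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?', duration or '')
    if not match:
        return None
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes + round(seconds / 60)

print(iso8601_to_minutes('PT2H22M'))  # 142
```

You could apply this to the duration field while building movies_data if you prefer numeric runtimes in the output.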
Once the data is extracted, it’s important to store it in a format that’s easy to analyze. In this case, we'll save it to a CSV file using the pandas library:
import pandas as pd
# Convert the list of movies to a pandas DataFrame
df = pd.DataFrame(movies_data)
# Save the data to a CSV file
df.to_csv('imdb_top_250_movies.csv', index=False)
print("IMDB Top 250 movies data saved to imdb_top_250_movies.csv")
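As a quick sanity check that the CSV round trip preserves the data, here is a self-contained sketch using an in-memory buffer and two made-up rows in place of the real scrape output:

```python
import io
import pandas as pd

# Two sample rows standing in for the scraped records.
movies_data = [
    {'name': 'Movie A', 'ratingValue': 9.3, 'genre': 'Drama'},
    {'name': 'Movie B', 'ratingValue': 9.2, 'genre': 'Crime'},
]

df = pd.DataFrame(movies_data)
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# Reading the CSV back should reproduce the same rows and columns.
round_trip = pd.read_csv(buffer)
print(round_trip.shape)  # (2, 3)
```

The same pd.read_csv call works on the imdb_top_250_movies.csv file written above, which is a convenient way to verify the scrape before analysis.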
Here’s the complete code for scraping IMDB’s Top 250 movies:
import requests
from lxml.html import fromstring
import json
import pandas as pd
# Define headers for the request
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-IN,en;q=0.9',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'sec-ch-ua': '"Google Chrome";v="129", "Not=A?Brand";v="8", "Chromium";v="129"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Linux"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36',
}
# Optionally, set up proxies
proxies = {
    "http": "http://your_proxy_server",
    "https": "https://your_proxy_server"
}
# Send the request to the IMDB Top 250 page
response = requests.get('https://www.imdb.com/chart/top/', headers=headers, proxies=proxies)
# Parse the HTML response
parser = fromstring(response.text)
# Extract the JSON-LD data
raw_data = parser.xpath('//script[@type="application/ld+json"]/text()')[0]
json_data = json.loads(raw_data)
# Extract movie details
movies_details = json_data.get('itemListElement')
movies_data = []
for movie in movies_details:
    item = movie['item']
    rating = item['aggregateRating']
    movies_data.append({
        'movie_type': item['@type'],
        'url': item['url'],
        'name': item['name'],
        'description': item['description'],
        'image': item['image'],
        'bestRating': rating['bestRating'],
        'worstRating': rating['worstRating'],
        'ratingValue': rating['ratingValue'],
        'ratingCount': rating['ratingCount'],
        'contentRating': item.get('contentRating'),
        'genre': item['genre'],
        'duration': item['duration'],
    })
# Save the data to a CSV file
df = pd.DataFrame(movies_data)
df.to_csv('imdb_top_250_movies.csv', index=False)
print("IMDB Top 250 movies data saved to imdb_top_250_movies.csv")
Before scraping any website, it’s important to consider ethical and legal issues:
Check the site's terms of service and robots.txt, keep your request rate modest, and use web scraping only for legitimate purposes.
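One concrete way to honour those rules is to consult a site's robots.txt before fetching. The sketch below uses the standard-library urllib.robotparser against a made-up rule set (not IMDB's actual robots.txt) to show how a scraper can check a URL programmatically:

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt rules -- not IMDB's real file.
rules = [
    'User-agent: *',
    'Disallow: /private/',
    'Allow: /chart/',
]

rp = RobotFileParser()
rp.parse(rules)

# can_fetch(user_agent, url) reports whether the rules permit the fetch.
print(rp.can_fetch('*', 'https://www.example.com/chart/top/'))    # True
print(rp.can_fetch('*', 'https://www.example.com/private/data'))  # False
```

In a real scraper you would call rp.set_url('https://www.imdb.com/robots.txt') followed by rp.read() to load the live rules, then gate each request on can_fetch.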