Gaining access to Instagram data can be tricky due to anti-bot mechanisms, login requirements, and rate limits. Still, with the right tools and techniques you can extract useful information from public profiles. This article walks you through scraping Instagram user data with Python: making API requests to Instagram's backend, extracting information from the returned JSON, and saving it to a JSON file.
Before we get into the code, make sure you have installed the required Python libraries.
pip install requests python-box
We'll break the code into sections for clarity: sending the request, obtaining and parsing the data, using proxies to avoid detection, and simplifying JSON parsing with the Box library.
Instagram's frontend is heavily protected, but its backend exposes API endpoints that can be queried without logging in. We will use one of these endpoints going forward.
It returns detailed information about a user's profile, including their biography, follower count, and posts. Let's look at how to request this data using the requests library in Python.
import requests
# Define headers to mimic a real browser request
headers = {
    "x-ig-app-id": "936619743392459",  # Instagram app ID to authenticate the request
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9,ru;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "*/*",
}
# Replace this with the username you want to scrape
username = 'testtest'
# Send an API request to get profile data
response = requests.get(f'https://i.instagram.com/api/v1/users/web_profile_info/?username={username}', headers=headers)
response_json = response.json()  # Parse the JSON response into a Python dictionary
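Usernames typed by users can contain characters that are unsafe in a URL, so it is worth percent-encoding them before building the request. The helper below is a small sketch introduced here (not part of the original script), using the standard library's urllib.parse.quote:

```python
from urllib.parse import quote

# Hypothetical helper: builds the profile-info URL, percent-encoding the
# username so unusual characters cannot break the query string.
def build_profile_url(username: str) -> str:
    return ('https://i.instagram.com/api/v1/users/web_profile_info/'
            f'?username={quote(username)}')

print(build_profile_url('testtest'))
# https://i.instagram.com/api/v1/users/web_profile_info/?username=testtest
```

You can then pass the result straight to requests.get along with the headers defined above.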
Since Instagram restricts repeated requests from the same IP address, using proxies is essential for large-scale scraping. A proxy routes your requests through different IP addresses, helping you avoid detection.
To set up a proxy server, you will need the IP address, port number, and, if required, a username and password for authentication.
proxies = {
    'http': 'http://<proxy_username>:<proxy_password>@<proxy_ip>:<proxy_port>',
    'https': 'https://<proxy_username>:<proxy_password>@<proxy_ip>:<proxy_port>',
}
response = requests.get(f'https://i.instagram.com/api/v1/users/web_profile_info/?username={username}', headers=headers, proxies=proxies)
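For larger scraping jobs, a single proxy will still hit rate limits; rotating through a pool of proxies spreads the requests across IP addresses. Here is a minimal round-robin sketch using itertools.cycle (the proxy URLs below are placeholders, not real endpoints):

```python
from itertools import cycle

# Placeholder proxy URLs -- replace with your own credentials and endpoints.
proxy_urls = [
    'http://user:pass@203.0.113.10:8080',
    'http://user:pass@203.0.113.11:8080',
]

proxy_pool = cycle(proxy_urls)

def next_proxies() -> dict:
    """Return a requests-style proxies dict using the next proxy in the pool."""
    proxy = next(proxy_pool)
    return {'http': proxy, 'https': proxy}

# Each call hands back the next proxy in round-robin order, e.g.:
# requests.get(url, headers=headers, proxies=next_proxies())
```

Because cycle never runs out, the pool wraps around once every proxy has been used.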
Instagram's API returns a complex nested JSON structure, which can be difficult to navigate using traditional dictionary-based access. To make parsing easier, we can use the Box library, which allows accessing JSON data using dot notation instead of dictionary keys.
from box import Box
response_json = Box(response.json())
# Extract user profile data
user_data = {
    'full name': response_json.data.user.full_name,
    'id': response_json.data.user.id,
    'biography': response_json.data.user.biography,
    'business account': response_json.data.user.is_business_account,
    'professional account': response_json.data.user.is_professional_account,
    'category name': response_json.data.user.category_name,
    'is verified': response_json.data.user.is_verified,
    'profile pic url': response_json.data.user.profile_pic_url_hd,
    'followers': response_json.data.user.edge_followed_by.count,
    'following': response_json.data.user.edge_follow.count,
}
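Some of these fields (category_name, for instance) can be null or missing for certain accounts, and a plain attribute chain then raises an error. If you prefer not to rely on Box's defaults, a small stdlib-only helper can walk a dotted path defensively. get_nested is an illustrative name introduced here, not part of Box or the original script:

```python
from functools import reduce

def get_nested(data: dict, path: str, default=None):
    """Walk a dotted path through nested dicts, returning `default`
    if any key along the way is missing."""
    def step(current, key):
        return current.get(key, default) if isinstance(current, dict) else default
    return reduce(step, path.split('.'), data)

# Example on a trimmed-down response shape:
sample = {'data': {'user': {'full_name': 'Test User'}}}
print(get_nested(sample, 'data.user.full_name'))      # Test User
print(get_nested(sample, 'data.user.category_name'))  # None
```

Box itself offers a similar safety net via its default_box option, which returns empty Box objects instead of raising on missing attributes.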
Once the profile data is extracted, we can also scrape data from the user’s video timeline and regular posts.
# Extract video data
profile_video_data = []
for element in response_json.data.user.edge_felix_video_timeline.edges:
    video_data = {
        'id': element.node.id,
        'short code': element.node.shortcode,
        'video url': element.node.video_url,
        'view count': element.node.video_view_count,
        'comment count': element.node.edge_media_to_comment.count,
        'like count': element.node.edge_liked_by.count,
        'duration': element.node.video_duration,
    }
    profile_video_data.append(video_data)
# Extract timeline media data (photos and videos)
profile_timeline_media_data = []
for element in response_json.data.user.edge_owner_to_timeline_media.edges:
    media_data = {
        'id': element.node.id,
        'short code': element.node.shortcode,
        'media url': element.node.display_url,
        'comment count': element.node.edge_media_to_comment.count,
        'like count': element.node.edge_liked_by.count,
    }
    profile_timeline_media_data.append(media_data)
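One request only returns the first page of a user's media. In GraphQL-style responses like this one, each edge collection typically also carries a page_info object with has_next_page and end_cursor fields that drive pagination; the exact shape may vary, so treat this as an assumption. A small sketch for pulling out the next-page cursor:

```python
def next_page_cursor(connection: dict):
    """Return the cursor for the next page, or None when there is none.
    Assumes the GraphQL-style `page_info` shape
    ({'has_next_page': bool, 'end_cursor': str}) -- an assumption about
    this endpoint, not a documented contract."""
    page_info = connection.get('page_info', {})
    if page_info.get('has_next_page'):
        return page_info.get('end_cursor')
    return None

# Example with a sample connection object:
sample = {'edges': [], 'page_info': {'has_next_page': True, 'end_cursor': 'abc123'}}
print(next_page_cursor(sample))  # abc123
```

If a cursor comes back, you would pass it in a follow-up request to fetch the next batch of posts; when it is None, you have reached the end of the timeline.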
Once you’ve extracted all the data, the next step is to save it to a JSON file for further analysis or storage. We use Python's json module to write the extracted data to JSON files. Each file will be neatly formatted, thanks to the indent=4 parameter, which makes it easy to read and process the data.
import json
# Save user data to a JSON file
with open(f'{username}_profile_data.json', 'w') as file:
    json.dump(user_data, file, indent=4)
# Save video data to a JSON file
with open(f'{username}_video_data.json', 'w') as file:
    json.dump(profile_video_data, file, indent=4)
# Save timeline media data to a JSON file
with open(f'{username}_timeline_media_data.json', 'w') as file:
    json.dump(profile_timeline_media_data, file, indent=4)
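The three with-blocks above are nearly identical, so they can be folded into one small helper. save_json is a name introduced here for illustration; ensure_ascii=False keeps non-ASCII characters (common in biographies) readable in the output files:

```python
import json

def save_json(data, path):
    """Write `data` to `path` as pretty-printed UTF-8 JSON."""
    with open(path, 'w', encoding='utf-8') as file:
        json.dump(data, file, indent=4, ensure_ascii=False)
    print(f'saved json: {path}')

# Example usage with a stand-in payload:
save_json({'followers': 42}, 'example_profile_data.json')
```

With this in place, each of the three saves becomes a one-liner such as save_json(user_data, f'{username}_profile_data.json').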
Here’s the complete Python script that combines all the previously discussed sections. This code scrapes user profile data, video data, and timeline media data from Instagram, handles the necessary headers and proxies, and saves the extracted information to JSON files.
import requests
from box import Box
import json
# Headers to mimic a real browser request to Instagram's backend API
headers = {
    "x-ig-app-id": "936619743392459",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9,ru;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "*/*",
}
# Set a proxy to avoid rate-limiting and detection (optional)
proxies = {
    'http': 'http://<proxy_username>:<proxy_password>@<proxy_ip>:<proxy_port>',
    'https': 'https://<proxy_username>:<proxy_password>@<proxy_ip>:<proxy_port>',
}
# The Instagram username to scrape
username = 'testtest'
# Send a request to Instagram's backend API to get profile data
response = requests.get(f'https://i.instagram.com/api/v1/users/web_profile_info/?username={username}',
                        headers=headers, proxies=proxies)
response_json = Box(response.json())  # Convert the response to a Box object for easy navigation
# Extract user profile data
user_data = {
    'full name': response_json.data.user.full_name,
    'id': response_json.data.user.id,
    'biography': response_json.data.user.biography,
    'business account': response_json.data.user.is_business_account,
    'professional account': response_json.data.user.is_professional_account,
    'category name': response_json.data.user.category_name,
    'is verified': response_json.data.user.is_verified,
    'profile pic url': response_json.data.user.profile_pic_url_hd,
    'followers': response_json.data.user.edge_followed_by.count,
    'following': response_json.data.user.edge_follow.count,
}
# Extract video data from the user's video timeline
profile_video_data = []
for element in response_json.data.user.edge_felix_video_timeline.edges:
    video_data = {
        'id': element.node.id,
        'short code': element.node.shortcode,
        'video url': element.node.video_url,
        'view count': element.node.video_view_count,
        'comment count': element.node.edge_media_to_comment.count,
        'like count': element.node.edge_liked_by.count,
        'duration': element.node.video_duration,
    }
    profile_video_data.append(video_data)
# Extract timeline media data (photos and videos)
profile_timeline_media_data = []
for element in response_json.data.user.edge_owner_to_timeline_media.edges:
    media_data = {
        'id': element.node.id,
        'short code': element.node.shortcode,
        'media url': element.node.display_url,
        'comment count': element.node.edge_media_to_comment.count,
        'like count': element.node.edge_liked_by.count,
    }
    profile_timeline_media_data.append(media_data)
# Save user profile data to a JSON file
with open(f'{username}_profile_data.json', 'w') as file:
    json.dump(user_data, file, indent=4)
print(f'saved json: {username}_profile_data.json')
# Save video data to a JSON file
with open(f'{username}_video_data.json', 'w') as file:
    json.dump(profile_video_data, file, indent=4)
print(f'saved json: {username}_video_data.json')
# Save timeline media data to a JSON file
with open(f'{username}_timeline_media_data.json', 'w') as file:
    json.dump(profile_timeline_media_data, file, indent=4)
print(f'saved json: {username}_timeline_media_data.json')
Scraping Instagram data with Python can be done by leveraging the backend API provided by Instagram, which helps bypass some of the front-end restrictions. Using the right headers to mimic browser behavior and employing proxies to avoid rate-limiting are critical steps. The Box library further simplifies the process by making JSON parsing more intuitive with dot notation. Before you start scraping Instagram at scale, remember to comply with Instagram's terms of service, and make sure your scraping efforts do not violate their policies.