Gaining access to Instagram data can be tricky due to various anti-bot mechanisms, login requirements, and rate limits. However, you can extract useful information from public profiles with the right tools and techniques. This article will guide you through how to scrape Instagram user data using Python by making API requests to Instagram’s backend, extracting information from the returned JSON data, and saving it into a JSON file.
Before we get into the code, make sure you have installed the required Python libraries.
pip install requests python-box
We'll break the code into different sections for better understanding, including sending the request, obtaining and parsing the data, using proxies to avoid detection, and simplifying JSON parsing with the Box library.
The frontend of Instagram is heavily secured, but the backend exposes API endpoints that can be called without authentication. We will use one of these endpoints going forward.
This endpoint returns detailed information about a user's profile, including their biography, follower count, and posts. Let's look at how to request that data using the requests library in Python.
import requests
# Define headers to mimic a real browser request
headers = {
    "x-ig-app-id": "936619743392459",  # Instagram app ID to authenticate the request
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9,ru;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "*/*",
}
# Replace this with the username you want to scrape
username = 'testtest'
# Send an API request to get profile data
response = requests.get(f'https://i.instagram.com/api/v1/users/web_profile_info/?username={username}', headers=headers)
response_json = response.json() # Parse the response into a JSON object
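Before parsing, it's worth a quick sanity check: the endpoint can return a login wall or an HTML error page instead of JSON. A minimal guard might look like this:
# Stop early if Instagram returned an error or a login wall instead of JSON
if response.status_code != 200:
    raise RuntimeError(f'Request failed with HTTP {response.status_code}')
if 'application/json' not in response.headers.get('Content-Type', ''):
    raise RuntimeError('Response is not JSON; likely a login wall or block page')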
Since Instagram restricts repeated requests from the same IP address, using proxies is a must for large-scale scraping. A proxy routes your requests through different IP addresses, helping you avoid detection. The same approach can be applied to other social platforms; for example, when working with Snapchat data, setting up a proxy for Snapchat helps distribute requests more evenly and keep sessions stable during scraping.
To set up a proxy server, you will need the IP address, port number, and, if required, a username and password for authentication.
proxies = {
    # The proxy URL itself typically uses the http:// scheme (HTTP CONNECT),
    # even for https:// targets; most providers do not speak TLS to the proxy
    'http': 'http://<proxy_username>:<proxy_password>@<proxy_ip>:<proxy_port>',
    'https': 'http://<proxy_username>:<proxy_password>@<proxy_ip>:<proxy_port>',
}
response = requests.get(f'https://i.instagram.com/api/v1/users/web_profile_info/?username={username}', headers=headers, proxies=proxies)
For production use, consider a dedicated Instagram proxy to improve stability and spread traffic across multiple IP addresses.
Instagram's API returns a complex nested JSON structure, which can be difficult to navigate using traditional dictionary-based access. To make parsing easier, we can use the Box library, which allows accessing JSON data using dot notation instead of dictionary keys.
from box import Box
response_json = Box(response.json())
# Extract user profile data
user_data = {
    'full name': response_json.data.user.full_name,
    'id': response_json.data.user.id,
    'biography': response_json.data.user.biography,
    'business account': response_json.data.user.is_business_account,
    'professional account': response_json.data.user.is_professional_account,
    'category name': response_json.data.user.category_name,
    'is verified': response_json.data.user.is_verified,
    'profile pic url': response_json.data.user.profile_pic_url_hd,
    'followers': response_json.data.user.edge_followed_by.count,
    'following': response_json.data.user.edge_follow.count,
}
Once the profile data is extracted, we can also scrape data from the user’s video timeline and regular posts.
# Extract video data
profile_video_data = []
for element in response_json.data.user.edge_felix_video_timeline.edges:
    video_data = {
        'id': element.node.id,
        'short code': element.node.shortcode,
        'video url': element.node.video_url,
        'view count': element.node.video_view_count,
        'comment count': element.node.edge_media_to_comment.count,
        'like count': element.node.edge_liked_by.count,
        'duration': element.node.video_duration,
    }
    profile_video_data.append(video_data)
# Extract timeline media data (photos and videos)
profile_timeline_media_data = []
for element in response_json.data.user.edge_owner_to_timeline_media.edges:
    media_data = {
        'id': element.node.id,
        'short code': element.node.shortcode,
        'media url': element.node.display_url,
        'comment count': element.node.edge_media_to_comment.count,
        'like count': element.node.edge_liked_by.count,
    }
    profile_timeline_media_data.append(media_data)
Once you’ve extracted all the data, the next step is to save it to a JSON file for further analysis or storage. We use Python's json module to write the extracted data to JSON files. Each file will be neatly formatted, thanks to the indent=4 parameter, which makes it easy to read and process the data.
import json
# Save user data to a JSON file
with open(f'{username}_profile_data.json', 'w') as file:
    json.dump(user_data, file, indent=4)
# Save video data to a JSON file
with open(f'{username}_video_data.json', 'w') as file:
    json.dump(profile_video_data, file, indent=4)
# Save timeline media data to a JSON file
with open(f'{username}_timeline_media_data.json', 'w') as file:
    json.dump(profile_timeline_media_data, file, indent=4)
Here’s the complete Python script that combines all the previously discussed sections. This code scrapes user profile data, video data, and timeline media data from Instagram, handles the necessary headers and proxies, and saves the extracted information to JSON files.
import requests
from box import Box
import json
# Headers to mimic a real browser request to Instagram's backend API
headers = {
    "x-ig-app-id": "936619743392459",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9,ru;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "*/*",
}
# Set a proxy to avoid rate-limiting and detection (optional)
proxies = {
    # The proxy URL itself typically uses the http:// scheme, even for https:// targets
    'http': 'http://<proxy_username>:<proxy_password>@<proxy_ip>:<proxy_port>',
    'https': 'http://<proxy_username>:<proxy_password>@<proxy_ip>:<proxy_port>',
}
# The Instagram username to scrape
username = 'testtest'
# Send a request to Instagram's backend API to get profile data
response = requests.get(f'https://i.instagram.com/api/v1/users/web_profile_info/?username={username}',
                        headers=headers, proxies=proxies)
response_json = Box(response.json())  # Convert the response to a Box object for easy navigation
# Extract user profile data
user_data = {
    'full name': response_json.data.user.full_name,
    'id': response_json.data.user.id,
    'biography': response_json.data.user.biography,
    'business account': response_json.data.user.is_business_account,
    'professional account': response_json.data.user.is_professional_account,
    'category name': response_json.data.user.category_name,
    'is verified': response_json.data.user.is_verified,
    'profile pic url': response_json.data.user.profile_pic_url_hd,
    'followers': response_json.data.user.edge_followed_by.count,
    'following': response_json.data.user.edge_follow.count,
}
# Extract video data from the user's video timeline
profile_video_data = []
for element in response_json.data.user.edge_felix_video_timeline.edges:
    video_data = {
        'id': element.node.id,
        'short code': element.node.shortcode,
        'video url': element.node.video_url,
        'view count': element.node.video_view_count,
        'comment count': element.node.edge_media_to_comment.count,
        'like count': element.node.edge_liked_by.count,
        'duration': element.node.video_duration,
    }
    profile_video_data.append(video_data)
# Extract timeline media data (photos and videos)
profile_timeline_media_data = []
for element in response_json.data.user.edge_owner_to_timeline_media.edges:
    media_data = {
        'id': element.node.id,
        'short code': element.node.shortcode,
        'media url': element.node.display_url,
        'comment count': element.node.edge_media_to_comment.count,
        'like count': element.node.edge_liked_by.count,
    }
    profile_timeline_media_data.append(media_data)
# Save user profile data to a JSON file
with open(f'{username}_profile_data.json', 'w') as file:
    json.dump(user_data, file, indent=4)
print(f'saved json: {username}_profile_data.json')
# Save video data to a JSON file
with open(f'{username}_video_data.json', 'w') as file:
    json.dump(profile_video_data, file, indent=4)
print(f'saved json: {username}_video_data.json')
# Save timeline media data to a JSON file
with open(f'{username}_timeline_media_data.json', 'w') as file:
    json.dump(profile_timeline_media_data, file, indent=4)
print(f'saved json: {username}_timeline_media_data.json')
Let's outline the reasons an Instagram scraper in Python may be difficult to build and keep running.
Instagram has put up a strong anti-bot login wall. This wall makes it hard to access data, even from public pages. When you visit Instagram through a scraper, you often hit login prompts that block your requests. You might think a VPN or datacenter IP can fix this, but they usually can't. These IPs are quickly detected and blocked.
Residential proxies offer a better way around these login walls. They use real IPs from regular users, so Instagram sees them as normal visitors. This reduces the chances of being blocked. You need proxies that rotate IPs and locations for the best results.
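As a client-side sketch of what rotation can look like (the proxy URLs below are placeholders for the addresses your provider gives you), you can cycle through a pool of proxies between requests:
import itertools
import requests

# Placeholder proxy URLs; substitute the credentials and hosts from your provider
proxy_pool = itertools.cycle([
    'http://user:pass@proxy1.example.com:8000',
    'http://user:pass@proxy2.example.com:8000',
    'http://user:pass@proxy3.example.com:8000',
])

def get_with_rotation(url, headers):
    # Each call routes through the next proxy in the pool
    proxy = next(proxy_pool)
    return requests.get(url, headers=headers, proxies={'http': proxy, 'https': proxy}, timeout=30)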
Another problem is Instagram’s intermittent login modals when you try to access posts directly. These pop-ups appear randomly, interrupting your scraper's flow. This causes your Python Instagram scraper to fail, forcing retries.
You could try intercepting Instagram’s hidden API calls to get data. But this is complex and needs constant maintenance. Instagram changes its API frequently and aggressively blocks scrapers. You must stay up to date with the exact API request headers to avoid errors and bans.
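One way to soften intermittent login modals and transient blocks is a retry loop with exponential backoff around the profile request. This is a minimal sketch, not production-grade error handling:
import time
import requests

def fetch_profile(username, headers, proxies=None, max_retries=4):
    url = f'https://i.instagram.com/api/v1/users/web_profile_info/?username={username}'
    for attempt in range(max_retries):
        try:
            response = requests.get(url, headers=headers, proxies=proxies, timeout=30)
            # 200 with a JSON body means real data, not a login wall
            if response.status_code == 200 and 'application/json' in response.headers.get('Content-Type', ''):
                return response.json()
        except requests.RequestException:
            pass  # Network errors fall through to the backoff below
        time.sleep(2 ** attempt)  # 1s, 2s, 4s, 8s between attempts
    raise RuntimeError(f'Could not fetch profile for {username} after {max_retries} attempts')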
Using residential proxies along with updated headers is key. Proxy-Seller is a top provider of such proxies, offering over 20 million rotating residential IP addresses worldwide. You can target proxies precisely by country, city, or ISP, and choose sticky sessions or rotation by time or request count. These options make Proxy-Seller well suited to Instagram scraping projects, helping your Python scraper stay under Instagram's radar without interruptions.
For a reliable and sustainable scraping workflow, use the Apify platform. Apify hosts ready-made web scrapers, called Actors, including many designed for Instagram. You don’t have to build everything from scratch, which saves time and hassle.
There are over 230 Instagram Scraping Actors available on the Apify Store. They cover tasks like profile data extraction, post scraping, comment collection, and more. Using these Actors means you get tried-and-tested solutions that handle Instagram’s challenges.
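If you go this route, the apify-client package lets you start an Actor and read its results from Python. A minimal sketch, where the token and Actor ID are placeholders and the run_input keys depend on the Actor you pick:
from apify_client import ApifyClient

client = ApifyClient('<YOUR_APIFY_TOKEN>')  # Placeholder: your API token from the Apify console

# Placeholder Actor ID: pick an Instagram scraping Actor from the Apify Store;
# the run_input schema varies per Actor, so check its documentation
run = client.actor('<ACTOR_ID>').call(run_input={'usernames': ['testtest']})

# Each Actor run writes its results to a dataset we can iterate over
for item in client.dataset(run['defaultDatasetId']).iterate_items():
    print(item)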
Cloud-based scrapers like those on Apify offer key benefits: managed anti-blocking infrastructure, built-in scheduling, and no scraping servers of your own to maintain.
Using Apify’s Instagram Scraper Python Actors makes your scraping setup sustainable. You save yourself from bot detection headaches by relying on Apify’s anti-blocking infrastructure.
Once you master basic scraping, extend your scripts with pagination or batched scraping to collect more data beyond default limits. Apify Actors support cursor-based pagination, letting you automatically scrape multiple pages of Instagram posts.
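Whether you paginate through an Actor or against an endpoint yourself, the pattern is the same: each page of results carries a page_info block with has_next_page and end_cursor, and you pass that cursor back to request the next page. A generic sketch, where fetch_media_page is a hypothetical helper standing in for whichever paginated call you use:
def scrape_all_media(username):
    all_media = []
    cursor = None
    while True:
        # Hypothetical helper: returns (items, page_info) for one page of results
        items, page_info = fetch_media_page(username, after=cursor)
        all_media.extend(items)
        if not page_info['has_next_page']:
            break
        cursor = page_info['end_cursor']
    return all_media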
Exporting scraped data is simple. You can save results locally as JSON or CSV using Python’s built-in json and csv modules. Alternatively, upload data directly to cloud databases like MongoDB Atlas or PostgreSQL. Use pymongo for MongoDB or psycopg2 for PostgreSQL to integrate smoothly.
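As an illustration, here is a sketch that writes the timeline media records from the earlier script to CSV and inserts the same records into MongoDB; the connection string is a placeholder, and the code assumes the list is non-empty:
import csv
from pymongo import MongoClient

# Write the timeline media records to a CSV file, using the dict keys as the header row
with open(f'{username}_timeline_media_data.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=profile_timeline_media_data[0].keys())
    writer.writeheader()
    writer.writerows(profile_timeline_media_data)

# Insert the same records into a MongoDB collection (placeholder connection string)
mongo = MongoClient('mongodb+srv://<user>:<password>@cluster.example.mongodb.net')
mongo['instagram']['timeline_media'].insert_many(profile_timeline_media_data)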
After scraping, analyze post metrics using Python libraries like pandas for data handling and matplotlib or seaborn for visualization. This lets you track trends over time, helping you understand Instagram engagement patterns.
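A small sketch, assuming the timeline media JSON file from the earlier script exists:
import pandas as pd
import matplotlib.pyplot as plt

# Load the timeline media file saved earlier into a DataFrame
df = pd.read_json(f'{username}_timeline_media_data.json')

# Plot like counts per post to eyeball engagement
df.plot(x='short code', y='like count', kind='bar', legend=False)
plt.ylabel('likes')
plt.tight_layout()
plt.show()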
To automate scraping, use Apify’s ScheduleClient. It lets you set up cron-like jobs that run your scraper at regular intervals. This setup eliminates manual runs, so you collect fresh data continuously.
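Here is a hedged sketch of creating such a schedule with the apify-client package; the token and Actor ID are placeholders, and the parameter names should be verified against the SDK version you install:
from apify_client import ApifyClient

client = ApifyClient('<YOUR_APIFY_TOKEN>')  # Placeholder token

# Run the chosen Actor every day at 06:00; parameter names follow the
# apify-client docs, so double-check them against the current SDK
client.schedules().create(
    name='daily-instagram-scrape',
    cron_expression='0 6 * * *',
    is_enabled=True,
    is_exclusive=True,
    actions=[{'type': 'RUN_ACTOR', 'actorId': '<ACTOR_ID>'}],
)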
If you need custom scrapers beyond Apify's ready-made Actors, integrate residential or rotating proxies into your configurations. Proxy-Seller simplifies this with flexible rotation options and precise geo-targeting, making proxy integration in Python scripts and frameworks straightforward. You can fine-tune proxy rotation and scale according to your scraping needs.
Finally, monitor your API usage and implement rate limit checks in your code to avoid blocks. Set up logging using Python’s logging module or third-party tools like Sentry. This helps catch scraper failures early and maintain stable operations.
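A minimal pattern combines Python's logging module with simple pacing between requests; the delays below are illustrative, not Instagram's documented limits:
import logging
import time
import requests

logging.basicConfig(level=logging.INFO, format='%(asctime)s %(levelname)s %(message)s')
logger = logging.getLogger('ig-scraper')

MIN_DELAY_SECONDS = 5  # Illustrative pacing, not an official limit

def polite_get(url, headers, proxies=None):
    response = requests.get(url, headers=headers, proxies=proxies, timeout=30)
    logger.info('GET %s -> %s', url, response.status_code)
    if response.status_code == 429:
        # 429 means we are rate-limited; back off before the caller retries
        logger.warning('Rate limited, sleeping 60s')
        time.sleep(60)
    time.sleep(MIN_DELAY_SECONDS)
    return response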
Scraping Instagram data with Python can be done by leveraging the backend API provided by Instagram, which helps bypass some of the frontend restrictions. Using the right headers to mimic browser behavior and employing proxies to avoid rate-limiting are critical steps. The Box library further simplifies the process by making JSON parsing more intuitive with dot notation. Before you start scraping Instagram at scale, remember to comply with Instagram's terms of service, and make sure your scraping efforts do not violate their policies.
And remember, by combining Apify’s Instagram scraper Actors with Proxy-Seller’s premium proxies and the advanced strategies above, you’ll build a robust, scalable Instagram scraper that runs smoothly and sustainably.