How to Scrape Instagram Data Using Python

Gaining access to Instagram data can be tricky due to various anti-bot mechanisms, login requirements, and rate limits. However, you can extract useful information from public profiles with the right tools and techniques. This article will guide you through how to scrape Instagram user data using Python by making API requests to Instagram’s backend, extracting information from the returned JSON data, and saving it into a JSON file.

Setting Up the Required Libraries

Before we get into the code, make sure you have installed the required Python libraries.

pip install requests python-box
  • requests: To make HTTP requests.
  • python-box: Simplifies data access by turning dictionaries into objects that allow dot notation access.

We'll break the code into different sections for better understanding, including sending the request, obtaining and parsing the data, using proxies to avoid detection, and simplifying JSON parsing with the Box library.

Step 1. Making the API Request

The frontend of Instagram is heavily secured, but the backend offers API endpoints that can be used without authentication. We will use one of these endpoints going forward.

This API provides detailed information about a user's profile, including their description, follower count, and posts. Let's explore how to request data using the requests library in Python (read more about the best Python libraries for web scraping).

Explanation:

  1. Headers: Instagram blocks most bot requests by analyzing request headers. The x-ig-app-id is essential because it mimics a request coming from the Instagram app itself.
  2. The User-Agent string represents the browser making the request, tricking Instagram into believing it's a real user.
  3. Backend API Request: The URL https://i.instagram.com/api/v1/users/web_profile_info/?username={username} is part of Instagram’s backend API. It provides detailed information about a public profile.
  4. Handling the JSON Response: We use response.json() to convert the API response into a JSON object that we can easily parse and extract information from.

import requests

# Define headers to mimic a real browser request
headers = {
    "x-ig-app-id": "936619743392459",  # Instagram app ID to authenticate the request
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9,ru;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "*/*",
}

# Replace this with the username you want to scrape
username = 'testtest'

# Send an API request to get profile data
response = requests.get(f'https://i.instagram.com/api/v1/users/web_profile_info/?username={username}', headers=headers)
response_json = response.json()  # Parse the response into a JSON object
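
Note that this request can fail or return an HTML login wall instead of JSON, in which case response.json() raises an error. A minimal defensive check (a sketch added here, not part of the original flow) verifies the status code and content type before parsing:

# A defensive sketch: verify the response before calling response.json()
if response.status_code != 200:
    raise RuntimeError(f'Request failed with status {response.status_code}')
if 'application/json' not in response.headers.get('Content-Type', ''):
    raise RuntimeError('Got a non-JSON response (likely a login wall)')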

Step 2. Handling Proxies to Bypass Rate-Limiting

Since Instagram restricts repeated requests from the same IP address (click here to learn how to check your IP address), using proxies is a must for large-scale scraping. A proxy routes your requests through different IP addresses, helping you avoid detection. The same approach applies to other social platforms: for example, when working with Snapchat data, setting up a proxy for Snapchat helps distribute requests more evenly and keeps sessions stable during scraping.

To set up a proxy server, you will need the IP address, port number, and, if required, a username and password for authentication.


proxies = {
    'http': 'http://<proxy_username>:<proxy_password>@<proxy_ip>:<proxy_port>',
    'https': 'https://<proxy_username>:<proxy_password>@<proxy_ip>:<proxy_port>',
}

response = requests.get(f'https://i.instagram.com/api/v1/users/web_profile_info/?username={username}', headers=headers, proxies=proxies)

For production use, consider a dedicated Instagram proxy to improve stability and distribute traffic across authenticated endpoints.
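
If you have several proxies, a simple way to spread the load is to rotate through them per request. The sketch below assumes a hypothetical pool of proxy URLs and cycles over it with itertools:

import itertools

# Hypothetical pool of proxy URLs; substitute your own credentials
proxy_pool = itertools.cycle([
    'http://user:pass@proxy1.example:8080',
    'http://user:pass@proxy2.example:8080',
    'http://user:pass@proxy3.example:8080',
])

# Use the next proxy in the pool for each request
proxy_url = next(proxy_pool)
proxies = {'http': proxy_url, 'https': proxy_url}
response = requests.get(
    f'https://i.instagram.com/api/v1/users/web_profile_info/?username={username}',
    headers=headers,
    proxies=proxies,
)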

Step 3. Simplifying JSON Parsing with Box

Instagram's API returns a complex nested JSON structure, which can be difficult to navigate using traditional dictionary-based access. To make parsing easier, we can use the Box library, which allows accessing JSON data using dot notation instead of dictionary keys.

Explanation:

  1. Box: This library converts a JSON dictionary into an object, allowing us to access deeply nested fields using dot notation. For instance, instead of writing response_json['data']['user']['full_name'], we can simply write response_json.data.user.full_name.
  2. Extracting Data: We extract useful profile information like the user's full name, ID, biography, whether it’s a business or professional account, verification status, and follower count.

from box import Box

response_json = Box(response.json())

# Extract user profile data
user_data = {
    'full name': response_json.data.user.full_name,
    'id': response_json.data.user.id,
    'biography': response_json.data.user.biography,
    'business account': response_json.data.user.is_business_account,
    'professional account': response_json.data.user.is_professional_account,
    'category name': response_json.data.user.category_name,
    'is verified': response_json.data.user.is_verified,
    'profile pic url': response_json.data.user.profile_pic_url_hd,
    'followers': response_json.data.user.edge_followed_by.count,
    'following': response_json.data.user.edge_follow.count,
}
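
One caveat: dot access raises an error when a field is missing from the response, for example on a restricted profile. python-box's default_box=True option returns an empty (falsy) Box for absent keys instead of raising, which makes the extraction more forgiving:

# Tolerate missing fields: absent keys return an empty Box instead of raising
response_json = Box(response.json(), default_box=True)

# A missing field now comes back empty rather than crashing the script
category = response_json.data.user.category_name or 'unknown'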

Step 4. Extracting Video and Timeline Data

Once the profile data is extracted, we can also scrape data from the user’s video timeline and regular posts.

Explanation:

  1. Video Data: This section extracts data about the user’s Instagram videos, including the video URL, view count, comment count, and the video's duration.
  2. Timeline Media: Similarly, this section extracts data from the user's timeline, capturing the post's media URL, likes, and comments.

# Extract video data
profile_video_data = []
for element in response_json.data.user.edge_felix_video_timeline.edges:
    video_data = {
        'id': element.node.id,
        'short code': element.node.shortcode,
        'video url': element.node.video_url,
        'view count': element.node.video_view_count,
        'comment count': element.node.edge_media_to_comment.count,
        'like count': element.node.edge_liked_by.count,
        'duration': element.node.video_duration,
    }
    profile_video_data.append(video_data)

# Extract timeline media data (photos and videos)
profile_timeline_media_data = []
for element in response_json.data.user.edge_owner_to_timeline_media.edges:
    media_data = {
        'id': element.node.id,
        'short code': element.node.shortcode,
        'media url': element.node.display_url,
        'comment count': element.node.edge_media_to_comment.count,
        'like count': element.node.edge_liked_by.count,
    }
    profile_timeline_media_data.append(media_data)

Step 5. Saving the Data to JSON Files

Once you’ve extracted all the data, the next step is to save it to a JSON file for further analysis or storage. We use Python's json module to write the extracted data to JSON files. Each file will be neatly formatted, thanks to the indent=4 parameter, which makes it easy to read and process the data.


import json

# Save user data to a JSON file
with open(f'{username}_profile_data.json', 'w') as file:
    json.dump(user_data, file, indent=4)

# Save video data to a JSON file
with open(f'{username}_video_data.json', 'w') as file:
    json.dump(profile_video_data, file, indent=4)

# Save timeline media data to a JSON file
with open(f'{username}_timeline_media_data.json', 'w') as file:
    json.dump(profile_timeline_media_data, file, indent=4)

Complete Code

Here’s the complete Python script that combines all the previously discussed sections. This code scrapes user profile data, video data, and timeline media data from Instagram, handles the necessary headers and proxies, and saves the extracted information to JSON files.


import requests
from box import Box
import json

# Headers to mimic a real browser request to Instagram's backend API
headers = {
    "x-ig-app-id": "936619743392459", 
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9,ru;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "*/*",
}

# Set a proxy to avoid rate-limiting and detection (optional)
proxies = {
    'http': 'http://<proxy_username>:<proxy_password>@<proxy_ip>:<proxy_port>',
    'https': 'https://<proxy_username>:<proxy_password>@<proxy_ip>:<proxy_port>',
}

# The Instagram username to scrape
username = 'testtest'

# Send a request to Instagram's backend API to get profile data
response = requests.get(f'https://i.instagram.com/api/v1/users/web_profile_info/?username={username}', 
                        headers=headers, proxies=proxies)
response_json = Box(response.json())  # Convert the response to Box object for easy navigation

# Extract user profile data
user_data = {
    'full name': response_json.data.user.full_name,
    'id': response_json.data.user.id,
    'biography': response_json.data.user.biography,
    'business account': response_json.data.user.is_business_account,
    'professional account': response_json.data.user.is_professional_account,
    'category name': response_json.data.user.category_name,
    'is verified': response_json.data.user.is_verified,
    'profile pic url': response_json.data.user.profile_pic_url_hd,
    'followers': response_json.data.user.edge_followed_by.count,
    'following': response_json.data.user.edge_follow.count,
}

# Extract video data from the user's video timeline
profile_video_data = []
for element in response_json.data.user.edge_felix_video_timeline.edges:
    video_data = {
        'id': element.node.id,
        'short code': element.node.shortcode,
        'video url': element.node.video_url,
        'view count': element.node.video_view_count,
        'comment count': element.node.edge_media_to_comment.count,
        'like count': element.node.edge_liked_by.count,
        'duration': element.node.video_duration,
    }
    profile_video_data.append(video_data)

# Extract timeline media data (photos and videos)
profile_timeline_media_data = []
for element in response_json.data.user.edge_owner_to_timeline_media.edges:
    media_data = {
        'id': element.node.id,
        'short code': element.node.shortcode,
        'media url': element.node.display_url,
        'comment count': element.node.edge_media_to_comment.count,
        'like count': element.node.edge_liked_by.count,
    }
    profile_timeline_media_data.append(media_data)

# Save user profile data to a JSON file
with open(f'{username}_profile_data.json', 'w') as file:
    json.dump(user_data, file, indent=4)
print(f'saved json: {username}_profile_data.json')

# Save video data to a JSON file
with open(f'{username}_video_data.json', 'w') as file:
    json.dump(profile_video_data, file, indent=4)
print(f'saved json: {username}_video_data.json')

# Save timeline media data to a JSON file
with open(f'{username}_timeline_media_data.json', 'w') as file:
    json.dump(profile_timeline_media_data, file, indent=4)
print(f'saved json: {username}_timeline_media_data.json')

Why Scraping Instagram with Python Is Tricky

Let's outline the reasons an Instagram scraper written in Python can be difficult to run reliably.

Strong Anti-Bot Login Wall

Instagram has put up a strong anti-bot login wall. This wall makes it hard to access data, even from public pages. When you visit Instagram through a scraper, you often hit login prompts that block your requests. You might think a VPN or datacenter IP can fix this, but they usually can't. These IPs are quickly detected and blocked.

Need for Residential Proxies

Residential proxies offer a better way around these login walls. They use real IPs from regular users, so Instagram sees them as normal visitors. This reduces the chances of being blocked. You need proxies that rotate IPs and locations for the best results.

Intermittent Login Modals

Another problem is Instagram’s intermittent login modals when you try to access posts directly. These pop-ups appear randomly, interrupting your scraper's flow. This causes your Python Instagram scraper to fail, forcing retries.

Complex API and Maintenance Burden

You could try intercepting Instagram’s hidden API calls to get data. But this is complex and needs constant maintenance. Instagram changes its API frequently and aggressively blocks scrapers. You must stay up to date with the exact API request headers to avoid errors and bans.
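
Because blocks appear intermittently, a retry loop with exponential backoff is a common safeguard. Here is a minimal sketch (an addition to this tutorial, not part of the original script) that retries a request a few times before giving up:

import time
import requests

def fetch_with_retries(url, headers, retries=3):
    # Retry with exponential backoff when Instagram throttles or blocks us
    for attempt in range(retries):
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response
        wait = 2 ** attempt  # 1s, then 2s, then 4s
        print(f'Got status {response.status_code}, retrying in {wait}s')
        time.sleep(wait)
    raise RuntimeError(f'Request failed after {retries} attempts')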

Proxy-Seller: The Essential Proxy Engine for Bypassing Instagram Blocks

Using residential proxies along with updated headers is key. Proxy-Seller is a top provider of such proxies. They offer over 20 million rotating residential IP addresses worldwide. You can target proxies by country, city, or ISP precisely. You also get options like sticky sessions or rotation by time or request count.

Here are the Proxy-Seller features you’ll find useful for Python Instagram scraping projects:

  • High trust factor and excellent reputation among users
  • Full compliance with GDPR, CCPA, and other privacy laws
  • Fast proxy speeds up to 1 Gbps, suitable for heavy scraping
  • 24/7 reliable customer support
  • Flexible pricing plans for scrapers of all scales

This makes Proxy-Seller ideal for scraping Instagram data without interruptions. These proxies help your Python Instagram scraper stay under Instagram’s radar.

Cloud Platform for Sustainable Instagram Scraping

For a reliable and sustainable scraping workflow, use the Apify platform. Apify hosts ready-made web scrapers, called Actors, including many designed for Instagram. You don’t have to build everything from scratch, which saves time and hassle.

There are over 230 Instagram Scraping Actors available on the Apify Store. They cover tasks like profile data extraction, post scraping, comment collection, and more. Using these Actors means you get tried-and-tested solutions that handle Instagram’s challenges.

Cloud-based API scrapers like those on Apify offer key benefits:

  • They run on Apify’s servers, so your local machine isn’t involved.
  • They use rotating proxies built into the platform, reducing blocks.
  • Updates to the scrapers happen seamlessly, keeping up with Instagram API changes.
  • You get easy-to-use APIs to trigger scraping and download results (sketched below).
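
As a rough sketch of that flow using the apify-client package (the Actor name and input below are assumptions; check the Actor's documentation for its actual input schema):

from apify_client import ApifyClient

# Hypothetical token; create one in the Apify console
client = ApifyClient('<APIFY_TOKEN>')

# Run an Instagram scraping Actor and wait for it to finish
run = client.actor('apify/instagram-scraper').call(
    run_input={'directUrls': ['https://www.instagram.com/instagram/']},
)

# Download the results from the run's default dataset
for item in client.dataset(run['defaultDatasetId']).iterate_items():
    print(item)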

Using Apify’s Instagram scraping Actors makes your Python scraping setup sustainable. You save yourself from bot-detection headaches by relying on Apify’s anti-blocking infrastructure.

Next Steps and Advanced Usage for Your Python Instagram Scraper

Once you master basic scraping, extend your scripts with pagination or batched scraping to collect more data beyond default limits. Apify Actors support cursor-based pagination, letting you automatically scrape multiple pages of Instagram posts.
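
The exact pagination interface depends on the Actor or endpoint you use, but the cursor pattern itself is generic. In the sketch below, fetch_page is a hypothetical callable that returns a batch of posts plus the cursor for the next page (None when there are no more pages):

def scrape_all_posts(fetch_page):
    # Keep requesting pages until the endpoint stops returning a cursor
    posts, cursor = [], None
    while True:
        batch, cursor = fetch_page(cursor)
        posts.extend(batch)
        if cursor is None:
            return posts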

Exporting scraped data is simple. You can save results locally as JSON or CSV using Python’s built-in json and csv modules. Alternatively, upload data directly to cloud databases like MongoDB Atlas or PostgreSQL. Use pymongo for MongoDB or psycopg2 for PostgreSQL to integrate smoothly.
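
For instance, the timeline records collected earlier can be written to CSV with the standard library alone (assuming the list is non-empty):

import csv

# Write the timeline media records to a CSV file, one row per post
with open(f'{username}_timeline_media_data.csv', 'w', newline='') as file:
    writer = csv.DictWriter(file, fieldnames=profile_timeline_media_data[0].keys())
    writer.writeheader()
    writer.writerows(profile_timeline_media_data)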

After scraping, analyze post metrics using Python libraries like pandas for data handling and matplotlib or seaborn for visualization. This lets you track trends over time, helping you understand Instagram engagement patterns.
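
A quick sketch of that workflow, reading one of the JSON files produced earlier and charting likes per post:

import pandas as pd
import matplotlib.pyplot as plt

# Load the saved timeline data and chart like counts per post
df = pd.read_json(f'{username}_timeline_media_data.json')
df.plot(x='short code', y='like count', kind='bar', legend=False)
plt.ylabel('likes')
plt.tight_layout()
plt.show()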

To automate scraping, use Apify’s ScheduleClient. It lets you set up cron-like jobs that run your scraper at regular intervals. This setup eliminates manual runs, so you collect fresh data continuously.
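
If you prefer to stay local instead of using Apify’s scheduler, the crudest stand-in is a loop that re-runs the scrape at a fixed interval (a simplistic sketch; cron or a proper task scheduler is more robust in production):

import time

# Hypothetical entry point wrapping the scraping logic shown earlier
def run_scrape():
    print('scraping...')

# Re-run the scrape every 6 hours
while True:
    run_scrape()
    time.sleep(6 * 60 * 60)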

Integrating Premium Proxies for Custom Scripts

If you need custom scrapers beyond Apify’s ready Actors, integrate residential or rotating proxies in your configurations. Proxy-Seller simplifies proxy management with:

  • Support for SOCKS5 and HTTP(S) protocols
  • Two authentication types: username/password and IP whitelisting
  • Easy-to-use API and dashboard for proxy control
  • Variety of proxy types, including ISP proxies, datacenter IPv4/IPv6, and mobile proxies
  • 24/7 customer support helping with configurations and troubleshooting

These features make proxy integration in Python scripts and frameworks straightforward; for instance, requests can route traffic through a SOCKS5 proxy, as sketched below. You can fine-tune proxy rotation and scale it to match your scraping needs.
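
A SOCKS5 configuration with requests needs the optional SOCKS dependency (pip install requests[socks]); the credentials here are placeholders:

import requests

# SOCKS5 proxy with username/password authentication (placeholder values)
proxies = {
    'http': 'socks5://<proxy_username>:<proxy_password>@<proxy_ip>:<proxy_port>',
    'https': 'socks5://<proxy_username>:<proxy_password>@<proxy_ip>:<proxy_port>',
}

# api.ipify.org echoes the IP it sees, confirming the proxy is in use
response = requests.get('https://api.ipify.org', proxies=proxies)
print(response.text)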

Monitoring and Troubleshooting

Finally, monitor your API usage and implement rate limit checks in your code to avoid blocks. Set up logging using Python’s logging module or third-party tools like Sentry. This helps catch scraper failures early and maintain stable operations.
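
A minimal setup with the standard library might look like this (assuming the username and headers variables from earlier):

import logging
import requests

# Log to a file with timestamps so failures can be traced later
logging.basicConfig(
    filename='scraper.log',
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(message)s',
)

logging.info('starting scrape for %s', username)
try:
    response = requests.get(
        f'https://i.instagram.com/api/v1/users/web_profile_info/?username={username}',
        headers=headers,
    )
    response.raise_for_status()
except requests.RequestException as exc:
    logging.error('request failed: %s', exc)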

To Sum Up

Scraping Instagram data with Python can be done by leveraging the backend API provided by Instagram, which helps bypass some of the frontend restrictions. Using the right headers to mimic browser behavior and employing proxies to avoid rate-limiting are critical steps. The Box library further simplifies the process by making JSON parsing more intuitive with dot notation. Before you start scraping Instagram at scale, remember to comply with Instagram's terms of service, and make sure your scraping efforts do not violate their policies.

And remember: by combining Apify’s Instagram scraping Actors with Proxy-Seller’s premium proxies and the advanced strategies covered above, you’ll build a robust, scalable Instagram scraper that runs smoothly and sustainably.
