How to scrape Instagram data using Python


Gaining access to Instagram data can be tricky due to anti-bot mechanisms, login requirements, and rate limits. However, with the right tools and techniques you can extract useful information from public profiles. This article walks you through scraping Instagram user data with Python: making API requests to Instagram's backend, extracting information from the returned JSON, and saving it to JSON files.

Setting up the required libraries

Before we get into the code, make sure you have installed the required Python libraries.


pip install requests python-box

  • requests: To make HTTP requests.
  • python-box: Simplifies data access by turning dictionaries into objects that allow dot notation access.

We'll break the code into different sections for better understanding, including sending the request, obtaining and parsing the data, using proxies to avoid detection, and simplifying JSON parsing with the Box library.

Step 1. Making the API request

The frontend of Instagram is heavily secured, but the backend exposes API endpoints that can be used without authentication. We will use one of these endpoints going forward.

This endpoint provides detailed information about a user's profile, including their biography, follower count, and posts. Let's explore how to request data using the requests library in Python.

Explanation:

  1. Headers: Instagram blocks most bot requests by analyzing request headers. The x-ig-app-id header is essential because it mimics a request coming from the Instagram app itself.
  2. User-Agent: This string identifies the browser making the request, making the request look like it comes from a real user.
  3. Backend API Request: The URL https://i.instagram.com/api/v1/users/web_profile_info/?username={username} is part of Instagram’s backend API. It provides detailed information about a public profile.
  4. Handling the JSON Response: We use response.json() to parse the API response into a Python dictionary that we can easily extract information from.

import requests

# Define headers to mimic a real browser request
headers = {
    "x-ig-app-id": "936619743392459",  # Instagram app ID to authenticate the request
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9,ru;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "*/*",
}

# Replace this with the username you want to scrape
username = 'testtest'

# Send an API request to get profile data
response = requests.get(f'https://i.instagram.com/api/v1/users/web_profile_info/?username={username}', headers=headers)
response_json = response.json()  # Parse the response body into a Python dictionary
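
Instagram may answer blocked or rate-limited requests with a non-200 status or an HTML login page instead of JSON, so it is worth guarding the response.json() call. Here is a minimal sketch of such a guard (the helper's name is my own, not part of the article's code):

```python
def parse_profile_response(response):
    """Return the parsed profile JSON, or None if the request was blocked."""
    if response.status_code != 200:
        return None
    content_type = response.headers.get("Content-Type", "")
    if "application/json" not in content_type:
        # Blocked or rate-limited requests often come back as an HTML login page
        return None
    return response.json()
```

With this in place you would write response_json = parse_profile_response(response) and skip the profile whenever it returns None instead of crashing on malformed JSON.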

Step 2. Handling proxies to bypass rate-limiting

Since Instagram restricts repeated requests from the same IP address, using proxies is essential for large-scale scraping. A proxy routes your requests through different IP addresses, helping you avoid detection.

To set up a proxy server, you will need the IP address, port number, and, if required, a username and password for authentication.


proxies = {
    'http': 'http://<proxy_username>:<proxy_password>@<proxy_ip>:<proxy_port>',
    'https': 'http://<proxy_username>:<proxy_password>@<proxy_ip>:<proxy_port>',  # proxy URL scheme stays http even for https traffic
}

response = requests.get(f'https://i.instagram.com/api/v1/users/web_profile_info/?username={username}', headers=headers, proxies=proxies)
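
For larger jobs a single proxy is rarely enough, and you may want to rotate through a pool of them. A minimal sketch using itertools.cycle (the addresses below are placeholders, not real proxies):

```python
from itertools import cycle

# Placeholder proxy addresses - substitute your own pool
proxy_pool = cycle([
    'http://user:pass@192.0.2.10:8000',
    'http://user:pass@192.0.2.11:8000',
    'http://user:pass@192.0.2.12:8000',
])

def next_proxies():
    """Build a requests-style proxies dict from the next proxy in the pool."""
    proxy = next(proxy_pool)
    return {'http': proxy, 'https': proxy}
```

Each call to requests.get(..., proxies=next_proxies()) then goes out through the next IP in the pool, wrapping around when the pool is exhausted.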

Step 3. Simplifying JSON parsing with Box

Instagram's API returns a complex nested JSON structure, which can be difficult to navigate using traditional dictionary-based access. To make parsing easier, we can use the Box library, which allows accessing JSON data using dot notation instead of dictionary keys.

Explanation:

  1. Box: This library converts a JSON dictionary into an object, allowing us to access deeply nested fields using dot notation. For instance, instead of writing response_json['data']['user']['full_name'], we can simply write response_json.data.user.full_name.
  2. Extracting Data: We extract useful profile information like the user's full name, ID, biography, whether it’s a business or professional account, verification status, and follower count.

from box import Box

response_json = Box(response.json())

# Extract user profile data
user_data = {
    'full name': response_json.data.user.full_name,
    'id': response_json.data.user.id,
    'biography': response_json.data.user.biography,
    'business account': response_json.data.user.is_business_account,
    'professional account': response_json.data.user.is_professional_account,
    'category name': response_json.data.user.category_name,
    'is verified': response_json.data.user.is_verified,
    'profile pic url': response_json.data.user.profile_pic_url_hd,
    'followers': response_json.data.user.edge_followed_by.count,
    'following': response_json.data.user.edge_follow.count,
}
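
Note that Box raises an error if a key is missing from the response, which can happen when a request was blocked and the data field is absent. If you prefer to stay with plain dictionaries, chained .get() calls achieve the same extraction without crashing; a sketch covering a few of the fields above (the function name is my own):

```python
def extract_user(response_json):
    """Pull profile fields from the raw dict, tolerating missing keys."""
    user = response_json.get('data', {}).get('user') or {}
    return {
        'full name': user.get('full_name'),
        'id': user.get('id'),
        'biography': user.get('biography'),
        'followers': (user.get('edge_followed_by') or {}).get('count'),
        'following': (user.get('edge_follow') or {}).get('count'),
    }
```

Missing fields come back as None instead of raising, so a blocked response yields an empty record rather than an exception.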

Step 4. Extracting video and timeline data

Once the profile data is extracted, we can also scrape data from the user’s video timeline and regular posts.

Explanation:

  1. Video Data: This section extracts data about the user’s Instagram videos, including the video URL, view count, comment count, and the video's duration.
  2. Timeline Media: Similarly, this section extracts data from the user's timeline, capturing the post's media URL, likes, and comments.

# Extract video data
profile_video_data = []
for element in response_json.data.user.edge_felix_video_timeline.edges:
    video_data = {
        'id': element.node.id,
        'short code': element.node.shortcode,
        'video url': element.node.video_url,
        'view count': element.node.video_view_count,
        'comment count': element.node.edge_media_to_comment.count,
        'like count': element.node.edge_liked_by.count,
        'duration': element.node.video_duration,
    }
    profile_video_data.append(video_data)

# Extract timeline media data (photos and videos)
profile_timeline_media_data = []
for element in response_json.data.user.edge_owner_to_timeline_media.edges:
    media_data = {
        'id': element.node.id,
        'short code': element.node.shortcode,
        'media url': element.node.display_url,
        'comment count': element.node.edge_media_to_comment.count,
        'like count': element.node.edge_liked_by.count,
    }
    profile_timeline_media_data.append(media_data)
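
Once the timeline data is in a plain list of dictionaries, ordinary Python is enough for quick analysis, such as finding the most-liked post and the average like count. A small sketch, shown here on made-up sample records in the same shape as profile_timeline_media_data:

```python
def summarize_likes(timeline_media):
    """Return (top post id, average like count) for a list of media dicts."""
    if not timeline_media:
        return None, 0.0
    top = max(timeline_media, key=lambda m: m['like count'])
    average = sum(m['like count'] for m in timeline_media) / len(timeline_media)
    return top['id'], average

# Made-up sample records for illustration
sample = [
    {'id': 'a', 'like count': 10, 'comment count': 1},
    {'id': 'b', 'like count': 30, 'comment count': 4},
    {'id': 'c', 'like count': 20, 'comment count': 2},
]
top_id, avg_likes = summarize_likes(sample)  # top_id == 'b', avg_likes == 20.0
```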

Step 5. Saving the data to JSON files

Once you’ve extracted all the data, the next step is to save it to a JSON file for further analysis or storage. We use Python's json module to write the extracted data to JSON files. Each file will be neatly formatted, thanks to the indent=4 parameter, which makes it easy to read and process the data.


import json

# Save user data to a JSON file
with open(f'{username}_profile_data.json', 'w') as file:
    json.dump(user_data, file, indent=4)

# Save video data to a JSON file
with open(f'{username}_video_data.json', 'w') as file:
    json.dump(profile_video_data, file, indent=4)

# Save timeline media data to a JSON file
with open(f'{username}_timeline_media_data.json', 'w') as file:
    json.dump(profile_timeline_media_data, file, indent=4)
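
One caveat: json.dump escapes non-ASCII characters, which are common in biographies, into \uXXXX sequences by default. Passing ensure_ascii=False keeps the text readable, as this quick comparison on a made-up record shows:

```python
import json

bio = {'biography': 'Café ☕ in Zürich'}

escaped = json.dumps(bio, indent=4)                       # contains "Caf\u00e9"
readable = json.dumps(bio, indent=4, ensure_ascii=False)  # contains "Café ☕ in Zürich"
```

Both forms load back to the same data; if you write the readable form to a file, open it with encoding='utf-8'.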

Complete code

Here’s the complete Python script that combines all the previously discussed sections. This code scrapes user profile data, video data, and timeline media data from Instagram, handles the necessary headers and proxies, and saves the extracted information to JSON files.


import requests
from box import Box
import json

# Headers to mimic a real browser request to Instagram's backend API
headers = {
    "x-ig-app-id": "936619743392459", 
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9,ru;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "*/*",
}

# Set a proxy to avoid rate-limiting and detection (optional)
proxies = {
    'http': 'http://<proxy_username>:<proxy_password>@<proxy_ip>:<proxy_port>',
    'https': 'http://<proxy_username>:<proxy_password>@<proxy_ip>:<proxy_port>',  # proxy URL scheme stays http even for https traffic
}

# The Instagram username to scrape
username = 'testtest'

# Send a request to Instagram's backend API to get profile data
response = requests.get(f'https://i.instagram.com/api/v1/users/web_profile_info/?username={username}', 
                        headers=headers, proxies=proxies)
response_json = Box(response.json())  # Convert the response to a Box object for easy navigation

# Extract user profile data
user_data = {
    'full name': response_json.data.user.full_name,
    'id': response_json.data.user.id,
    'biography': response_json.data.user.biography,
    'business account': response_json.data.user.is_business_account,
    'professional account': response_json.data.user.is_professional_account,
    'category name': response_json.data.user.category_name,
    'is verified': response_json.data.user.is_verified,
    'profile pic url': response_json.data.user.profile_pic_url_hd,
    'followers': response_json.data.user.edge_followed_by.count,
    'following': response_json.data.user.edge_follow.count,
}

# Extract video data from the user's video timeline
profile_video_data = []
for element in response_json.data.user.edge_felix_video_timeline.edges:
    video_data = {
        'id': element.node.id,
        'short code': element.node.shortcode,
        'video url': element.node.video_url,
        'view count': element.node.video_view_count,
        'comment count': element.node.edge_media_to_comment.count,
        'like count': element.node.edge_liked_by.count,
        'duration': element.node.video_duration,
    }
    profile_video_data.append(video_data)

# Extract timeline media data (photos and videos)
profile_timeline_media_data = []
for element in response_json.data.user.edge_owner_to_timeline_media.edges:
    media_data = {
        'id': element.node.id,
        'short code': element.node.shortcode,
        'media url': element.node.display_url,
        'comment count': element.node.edge_media_to_comment.count,
        'like count': element.node.edge_liked_by.count,
    }
    profile_timeline_media_data.append(media_data)

# Save user profile data to a JSON file
with open(f'{username}_profile_data.json', 'w') as file:
    json.dump(user_data, file, indent=4)
print(f'saved json: {username}_profile_data.json')

# Save video data to a JSON file
with open(f'{username}_video_data.json', 'w') as file:
    json.dump(profile_video_data, file, indent=4)
print(f'saved json: {username}_video_data.json')

# Save timeline media data to a JSON file
with open(f'{username}_timeline_media_data.json', 'w') as file:
    json.dump(profile_timeline_media_data, file, indent=4)
print(f'saved json: {username}_timeline_media_data.json')

Scraping Instagram data with Python can be done by leveraging Instagram's backend API, which bypasses some of the frontend restrictions. Using the right headers to mimic browser behavior and employing proxies to avoid rate-limiting are critical steps. The Box library further simplifies the process by making JSON parsing more intuitive with dot notation. Before you start scraping Instagram at scale, remember to comply with Instagram's terms of service, and make sure your scraping efforts do not violate their policies.
