An important ability for obtaining data from web pages is web scraping. Pinterest and Instagram, which dynamically load content through user interaction with them are examples of these types of websites. Regular scraping methods are insufficient when handling JavaScript-based material. In this article, we will elaborate on Playwright as the automation tool while lxml will be used for data extraction from such dynamic sites that require Javascript to work properly. On this note, we may discuss utilizing proxies in Playwright to evade detection as bots. In this tutorial, we'll scrape the Instagram profile to retrieve all the post URLs by simulating user behavior, such as scrolling and waiting for posts to load.
Tools we’ll use in this guide:
We will illustrate the process using the example of scraping an Instagram profile to extract post URLs, simulating user actions such as scrolling through the page and waiting for new data to load. Dynamic websites asynchronously load their content via AJAX requests, which means not all page content is immediately accessible.
Before we start, install the necessary packages:
pip install playwright
pip install lxml
You'll also need to install Playwright browsers:
playwright install
We'll use Playwright to automate the browser, load Instagram's dynamic content, and scroll through the page to load more posts. Let's create a basic automation script:
Automation script (Headless Browser):
import asyncio
from playwright.async_api import async_playwright
async def scrape_instagram():
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True) # Headless mode No visual feedback
page = await browser.new_page()
# Visit the profile URL
await page.goto("https://www.instagram.com/profile name/", wait_until="networkidle")
# Click the button to load more posts
await page.get_by_role("button", name="Show more posts from").click()
# Scroll the page to load dynamic content
scroll_count = 5 # Customize this based on how many times you want to scroll
for _ in range(scroll_count):
await page.evaluate('window.scrollBy(0, 700);')
await page.wait_for_timeout(3000) # Wait for posts to load
await page.wait_for_load_state("networkidle")
# Get the page content
content = await page.content()
await browser.close()
return content
# Run the asynchronous function
asyncio.run(scrape_instagram())
Once the content is loaded, we can use lxml to parse the HTML and extract data using XPath. In this case, we're extracting the URLs of all posts from the profile.
Parsing the page content and extracting post URLs:
from lxml import html
import json
def extract_post_urls(page_content):
# Parse the HTML content using lxml
tree = html.fromstring(page_content)
# XPath for extracting post URLs
post_urls_xpath = '//a[contains(@href, "/p/")]/@href'
# Extract URLs
post_urls = tree.xpath(post_urls_xpath)
# Convert relative URLs to absolute
base_url = "https://www.instagram.com"
post_urls = [f"{base_url}{url}" for url in post_urls]
return post_urls
Example function to save extracted data in JSON format:
def save_data(profile_url, post_urls):
data = {profile_url: post_urls}
with open('instagram_posts.json', 'w') as json_file:
json.dump(data, json_file, indent=4)
# Scrape and extract URLs
page_content = asyncio.run(scrape_instagram())
post_urls = extract_post_urls(page_content)
# Save the extracted URLs in a JSON file
save_data("https://www.instagram.com/profile name/", post_urls)
To scrape dynamic websites, you often need to simulate infinite scrolling. In our script, we scroll the page using JavaScript:
(window.scrollBy(0, 700))
And wait for new content to load using this command:
wait_for_load_state("networkidle")
Instagram has strict rate limits and anti-bot measures. To avoid being blocked, you can use proxies to rotate IP addresses and distribute requests. Playwright makes it easy to integrate proxies into your scraping automation.
Implementing proxies in Playwright:
async def scrape_with_proxy():
async with async_playwright() as p:
browser = await p.chromium.launch(
headless=False,
proxy={"server": "http://your-proxy-server:port"}
)
page = await browser.new_page()
await page.goto("https://www.instagram.com/profile name/", wait_until="networkidle")
# Continue scraping as before...
Playwright also supports proxy to be passed as username password and server example is given below.
async def scrape_with_proxy():
async with async_playwright() as p:
browser = await p.chromium.launch(
headless=False,
proxy={"server": "http://your-proxy-server:port", "username": "username", "password": "password"}
)
page = await browser.new_page()
await page.goto("https://www.instagram.com/profile name/", wait_until="networkidle")
# Continue scraping as before...
Proxies help avoid IP bans, CAPTCHA challenges, and ensure smooth scraping of data-heavy or restricted websites like Instagram.
import asyncio
from playwright.async_api import async_playwright
from lxml import html
import json
# Function to automate browser and scrape dynamic content with proxies
async def scrape_instagram(profile_url, proxy=None):
async with async_playwright() as p:
# Set up browser with proxy if provided
browser_options = {
'headless': True, # Use headed browser to see the action (can set to True for headless mode)
}
if proxy:
browser_options['proxy'] = proxy
# Launch the browser
browser = await p.chromium.launch(**browser_options)
page = await browser.new_page()
# Visit the Instagram profile page
await page.goto(profile_url, wait_until="networkidle")
# Try clicking the "Show more posts" button (optional, might fail if button not found)
try:
await page.click('button:has-text("Show more posts from")')
except Exception as e:
print(f"No 'Show more posts' button found: {e}")
# Scroll the page to load more posts
scroll_count = 5 # Number of scrolls to load posts
for _ in range(scroll_count):
await page.evaluate('window.scrollBy(0, 500);')
await page.wait_for_timeout(3000) # Wait for new posts to load
await page.wait_for_load_state("networkidle")
# Get the complete page content after scrolling
content = await page.content()
await browser.close() # Close the browser once done
return content
# Function to parse the scraped page content and extract post URLs
def extract_post_urls(page_content):
# Parse the HTML content using lxml
tree = html.fromstring(page_content)
# XPath for extracting post URLs
post_urls_xpath = '//a[contains(@href, "/p/")]/@href'
# Extract post URLs using the XPath
post_urls = tree.xpath(post_urls_xpath)
# Convert relative URLs to absolute URLs
base_url = "https://www.instagram.com"
post_urls = [f"{base_url}{url}" for url in post_urls]
return post_urls
# Function to save the extracted post URLs into a JSON file
def save_data(profile_url, post_urls):
# Structure the data in JSON format
data = {profile_url: post_urls}
# Save the data to a file
with open('instagram_posts.json', 'w') as json_file:
json.dump(data, json_file, indent=4)
print(f"Data saved to instagram_posts.json")
# Main function to run the scraper and save the data
async def main():
# Define the Instagram profile URL
profile_url = "https://www.instagram.com/profile name/"
# Optionally, set up a proxy
proxy = {"server": "server", "username": "username", "password": "password"} # Use None if no proxy is required
# Scrape the Instagram page with proxies
page_content = await scrape_instagram(profile_url, proxy=proxy)
# Extract post URLs from the scraped page content
post_urls = extract_post_urls(page_content)
# Save the extracted post URLs into a JSON file
save_data(profile_url, post_urls)
if __name__ == '__main__':
asyncio.run(main())
While Playwright is an excellent choice for scraping dynamic websites, other tools might be suitable for different scenarios:
Each tool offers unique strengths and can be chosen based on the specific needs and conditions of the project.
For successful scraping of dynamic websites that actively use JavaScript and AJAX requests, powerful tools capable of efficiently handling infinite scrolling and complex interactive elements are necessary. One such solution is Playwright—a tool from Microsoft that provides full browser automation, making it an ideal choice for platforms like Instagram. Combined with the lxml library for HTML parsing, Playwright greatly simplifies data extraction, allowing for the automation of interactions with page elements and the parsing process without manual intervention. Additionally, the use of proxy servers helps circumvent anti-bot protection and prevents IP blocking, ensuring stable and uninterrupted scraping operations.
Comments: 0