Guide to scraping dynamic websites with Python

Web scraping is an essential skill for extracting data from web pages. Sites such as Pinterest and Instagram, which load their content dynamically as the user interacts with them, are typical examples: regular scraping methods fall short when the material is rendered by JavaScript. In this article, we will use Playwright for browser automation and lxml for data extraction from dynamic sites that require JavaScript to work properly. We will also cover using proxies in Playwright to avoid being detected as a bot. As a worked example, we'll scrape an Instagram profile to retrieve all of its post URLs by simulating user behavior, such as scrolling and waiting for posts to load.

Tools we’ll use in this guide:

  • Playwright (for browser automation);
  • lxml (for data extraction using XPath);
  • Python (as our programming language).

Step-by-step guide to scraping Instagram posts

We will illustrate the process using the example of scraping an Instagram profile to extract post URLs, simulating user actions such as scrolling through the page and waiting for new data to load. Dynamic websites asynchronously load their content via AJAX requests, which means not all page content is immediately accessible.
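
A quick way to see this limitation: fetching the profile page with a plain HTTP request returns HTML that usually does not contain the post links a real browser renders (Instagram may also serve a login page to unauthenticated clients). A minimal check, assuming the requests library is installed:


import requests
from lxml import html

# Fetch the raw HTML without executing any JavaScript
raw = requests.get("https://www.instagram.com/profile_name/").text
tree = html.fromstring(raw)

# The post links rendered by a real browser are typically missing here
print(len(tree.xpath('//a[contains(@href, "/p/")]')))  # usually prints 0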

Step 1. Install required libraries

Before we start, install the necessary packages:


pip install playwright
pip install lxml

You'll also need to install Playwright browsers:


playwright install

Step 2. Playwright setup for dynamic website scraping

We'll use Playwright to automate the browser, load Instagram's dynamic content, and scroll through the page to load more posts. Let's create a basic automation script:

Automation script (Headless Browser):


import asyncio
from playwright.async_api import async_playwright

async def scrape_instagram():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)  # Headless mode: no visual feedback
        page = await browser.new_page()
        
        # Visit the profile URL
        await page.goto("https://www.instagram.com/profile_name/", wait_until="networkidle")

        # Click the button to load more posts 
        await page.get_by_role("button", name="Show more posts from").click()
        
        # Scroll the page to load dynamic content
        scroll_count = 5  # Customize this based on how many times you want to scroll
        for _ in range(scroll_count):
            await page.evaluate('window.scrollBy(0, 700);')
            await page.wait_for_timeout(3000)  # Wait for posts to load
            await page.wait_for_load_state("networkidle")
        
        # Get the page content
        content = await page.content()
        await browser.close()
        
        return content

# Run the asynchronous function
asyncio.run(scrape_instagram())

Step 3. Parsing the page with lxml and XPath

Once the content is loaded, we can use lxml to parse the HTML and extract data using XPath. In this case, we're extracting the URLs of all posts from the profile.

Parsing the page content and extracting post URLs:


from lxml import html
import json

def extract_post_urls(page_content):
    # Parse the HTML content using lxml
    tree = html.fromstring(page_content)
    
    # XPath for extracting post URLs
    post_urls_xpath = '//a[contains(@href, "/p/")]/@href'
    
    # Extract URLs
    post_urls = tree.xpath(post_urls_xpath)
    
    # Convert relative URLs to absolute
    base_url = "https://www.instagram.com"
    post_urls = [f"{base_url}{url}" for url in post_urls]
    
    return post_urls

Example function to save extracted data in JSON format:


def save_data(profile_url, post_urls):
    data = {profile_url: post_urls}
    with open('instagram_posts.json', 'w') as json_file:
        json.dump(data, json_file, indent=4)

# Scrape and extract URLs
page_content = asyncio.run(scrape_instagram())
post_urls = extract_post_urls(page_content)

# Save the extracted URLs in a JSON file
save_data("https://www.instagram.com/profile_name/", post_urls)

Step 4. Handling infinite scroll with Playwright

To scrape dynamic websites, you often need to simulate infinite scrolling. In our script, we scroll the page using JavaScript:


window.scrollBy(0, 700);

And wait for new content to load using this command:


await page.wait_for_load_state("networkidle")
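
Scrolling a fixed number of times works for a handful of screens, but you can also keep scrolling until the page stops growing. Below is a minimal sketch of that approach; the helper name and scroll limit are our own, and it assumes page is an already opened Playwright page:


async def scroll_until_stable(page, max_scrolls=20):
    # Compare the page height before and after each scroll; stop once it no longer grows
    previous_height = await page.evaluate("document.body.scrollHeight")
    for _ in range(max_scrolls):
        await page.evaluate("window.scrollBy(0, document.body.scrollHeight);")
        await page.wait_for_timeout(3000)  # Give new posts time to load
        await page.wait_for_load_state("networkidle")
        current_height = await page.evaluate("document.body.scrollHeight")
        if current_height == previous_height:  # Nothing new was loaded
            break
        previous_height = current_height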

Step 5. Using proxies with Playwright

Instagram has strict rate limits and anti-bot measures. To avoid being blocked, you can use proxies to rotate IP addresses and distribute requests. Playwright makes it easy to integrate proxies into your scraping automation.

Implementing proxies in Playwright:


async def scrape_with_proxy():
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=False, 
            proxy={"server": "http://your-proxy-server:port"}
        )
        page = await browser.new_page()
        await page.goto("https://www.instagram.com/profile_name/", wait_until="networkidle")
        # Continue scraping as before...

Playwright also supports authenticated proxies: the username and password can be passed along with the server, as shown in the example below.


async def scrape_with_proxy():
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=False, 
            proxy={"server": "http://your-proxy-server:port", "username": "username", "password": "password"}
        )
        page = await browser.new_page()
        await page.goto("https://www.instagram.com/profile_name/", wait_until="networkidle")
        # Continue scraping as before...

Proxies help avoid IP bans, CAPTCHA challenges, and ensure smooth scraping of data-heavy or restricted websites like Instagram.
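
The snippets above attach one proxy to the browser for its whole lifetime. To actually rotate IP addresses, a simple approach is to cycle through a list of proxies and launch the browser with a different one for each profile you scrape. A minimal sketch follows; the proxy entries and profile URL are placeholders:


import asyncio
from itertools import cycle
from playwright.async_api import async_playwright

# Placeholder proxies and target profiles; replace with your own values
PROXIES = [
    {"server": "http://proxy-1:port", "username": "username", "password": "password"},
    {"server": "http://proxy-2:port", "username": "username", "password": "password"},
]
PROFILES = ["https://www.instagram.com/profile_name/"]

async def scrape_all(profiles, proxies):
    proxy_pool = cycle(proxies)  # Round-robin over the proxy list
    async with async_playwright() as p:
        for profile_url in profiles:
            # Launch a fresh browser with the next proxy for each profile
            browser = await p.chromium.launch(headless=True, proxy=next(proxy_pool))
            page = await browser.new_page()
            await page.goto(profile_url, wait_until="networkidle")
            # ... scroll and collect page.content() as shown above ...
            await browser.close()

asyncio.run(scrape_all(PROFILES, PROXIES))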

Complete code


import asyncio
from playwright.async_api import async_playwright
from lxml import html
import json

# Function to automate browser and scrape dynamic content with proxies
async def scrape_instagram(profile_url, proxy=None):
    async with async_playwright() as p:
        # Set up browser with proxy if provided
        browser_options = {
            'headless': True,  # Set to False to watch the browser actions while debugging
        }
        if proxy:
            browser_options['proxy'] = proxy

        # Launch the browser
        browser = await p.chromium.launch(**browser_options)
        page = await browser.new_page()

        # Visit the Instagram profile page
        await page.goto(profile_url, wait_until="networkidle")
        
        # Try clicking the "Show more posts" button (optional, might fail if button not found)
        try:
            await page.click('button:has-text("Show more posts from")')
        except Exception as e:
            print(f"No 'Show more posts' button found: {e}")

        # Scroll the page to load more posts
        scroll_count = 5  # Number of scrolls to load posts
        for _ in range(scroll_count):
            await page.evaluate('window.scrollBy(0, 500);')
            await page.wait_for_timeout(3000)  # Wait for new posts to load
            await page.wait_for_load_state("networkidle")

        # Get the complete page content after scrolling
        content = await page.content()
        await browser.close()  # Close the browser once done
        
        return content

# Function to parse the scraped page content and extract post URLs
def extract_post_urls(page_content):
    # Parse the HTML content using lxml
    tree = html.fromstring(page_content)
    
    # XPath for extracting post URLs
    post_urls_xpath = '//a[contains(@href, "/p/")]/@href'
    
    # Extract post URLs using the XPath
    post_urls = tree.xpath(post_urls_xpath)
    
    # Convert relative URLs to absolute URLs
    base_url = "https://www.instagram.com"
    post_urls = [f"{base_url}{url}" for url in post_urls]
    
    return post_urls

# Function to save the extracted post URLs into a JSON file
def save_data(profile_url, post_urls):
    # Structure the data in JSON format
    data = {profile_url: post_urls}
    
    # Save the data to a file
    with open('instagram_posts.json', 'w') as json_file:
        json.dump(data, json_file, indent=4)
    print("Data saved to instagram_posts.json")

# Main function to run the scraper and save the data
async def main():
    # Define the Instagram profile URL
    profile_url = "https://www.instagram.com/profile_name/"
    
    # Optionally, set up a proxy
    proxy = {"server": "server", "username": "username", "password": "password"}  # Use None if no proxy is required
    
    # Scrape the Instagram page with proxies
    page_content = await scrape_instagram(profile_url, proxy=proxy)
    
    # Extract post URLs from the scraped page content
    post_urls = extract_post_urls(page_content)
    
    # Save the extracted post URLs into a JSON file
    save_data(profile_url, post_urls)

if __name__ == '__main__':
    asyncio.run(main())

Alternative automation tools for web scraping

While Playwright is an excellent choice for scraping dynamic websites, other tools might be suitable for different scenarios:

  1. Selenium: Selenium is one of the oldest browser automation frameworks and works similarly to Playwright. It's highly versatile but lacks some modern capabilities that Playwright offers, such as handling multiple browsers with a single API;
  2. Puppeteer: Puppeteer is another popular tool for browser automation, especially for scraping JavaScript-heavy websites. Like Playwright, it controls headless browsers and allows interaction with dynamic content;
  3. Requests + BeautifulSoup: for simpler websites that don’t require JavaScript to load content, the Requests library combined with BeautifulSoup is a lightweight alternative; a short sketch of this approach follows the list. However, it doesn’t handle dynamic content well.
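
For comparison with the Playwright approach, here is a minimal sketch of the static Requests + BeautifulSoup workflow from point 3. The URL and selector are placeholders for a page whose content is already present in the initial HTML:


import requests
from bs4 import BeautifulSoup

# Fetch the raw HTML; no JavaScript is executed
response = requests.get("https://example.com/articles")
soup = BeautifulSoup(response.text, "html.parser")

# Collect all link URLs, the static counterpart of the XPath used earlier
urls = [a["href"] for a in soup.select("a[href]")]
print(urls)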

Each tool offers unique strengths and can be chosen based on the specific needs and conditions of the project.

For successful scraping of dynamic websites that actively use JavaScript and AJAX requests, powerful tools capable of efficiently handling infinite scrolling and complex interactive elements are necessary. One such solution is Playwright—a tool from Microsoft that provides full browser automation, making it an ideal choice for platforms like Instagram. Combined with the lxml library for HTML parsing, Playwright greatly simplifies data extraction, allowing for the automation of interactions with page elements and the parsing process without manual intervention. Additionally, the use of proxy servers helps circumvent anti-bot protection and prevents IP blocking, ensuring stable and uninterrupted scraping operations.
