For job hunters, employers, or anyone monitoring trends in the job market, scraping Indeed's job listings can provide useful information. In this tutorial, we combine Playwright for browser automation with lxml for HTML parsing to collect job details (title, recruiting company, location, description, and posting link) and then save the results to a CSV file.
To successfully perform scraping, the following Python libraries need to be installed.
Playwright for browser automation:
pip install playwright
lxml for parsing HTML:
pip install lxml
pandas for saving data to a CSV file:
pip install pandas
Install Playwright browsers:
After installing Playwright, run this command to install the necessary browser binaries:
playwright install
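If you want to confirm the setup before writing the scraper, an optional sanity check is to launch and immediately close a browser. This snippet is only a quick check and not part of the tutorial code:
import asyncio
from playwright.async_api import async_playwright

async def check_playwright():
    # Launch and immediately close Chromium to confirm the binaries are installed
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        await browser.close()
    print("Playwright and Chromium are ready")

asyncio.run(check_playwright())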
Playwright lets you automate and interact with web browsers. We start by setting up Playwright to launch a Chromium browser, visit a page, and extract its content. This is also where we can pass proxy settings to Playwright.
Why use proxies?
Websites often have rate-limiting or anti-scraping measures in place to block repeated requests from the same IP address. Proxies allow you to rotate IP addresses, avoid those blocks and rate limits, and keep scraping reliably even when a single address would be flagged.
import asyncio
from playwright.async_api import async_playwright

async def get_page_content(url):
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=False,  # Headed browser
            proxy={
                'server': '',    # Fill in your proxy details, e.g. 'http://host:port'
                'username': '',
                'password': ''
            }
        )
        page = await browser.new_page()
        await page.goto(url)
        # Extract the page's content
        content = await page.content()
        await browser.close()  # Close the browser once done
        return content
In this code, async_playwright launches a headed browser, navigates to the specified URL, and fetches the page's content.
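As a quick standalone check (assuming you have filled in real proxy details above or removed the proxy argument), you can run the function directly and look at how much HTML comes back; the URL below is just an example:
# Quick test of get_page_content on an example URL
html_content = asyncio.run(get_page_content('https://www.indeed.com/jobs?q=usa'))
print(f"Fetched {len(html_content)} characters of HTML")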
Next, we will parse the page content to extract meaningful data. lxml is used for this purpose because it provides robust support for parsing and querying HTML content using XPath.
from lxml import html

def parse_job_listings(content):
    # Parse HTML content
    parser = html.fromstring(content)

    # Extract each job posting using XPath
    job_posting = parser.xpath('//ul[@class="css-zu9cdh eu4oa1w0"]/li')

    jobs_data = []
    for element in job_posting[:-1]:  # Skip the last element if it's an ad or irrelevant
        title = ''.join(element.xpath('.//h2/a/span/@title'))
        if title:
            link = ''.join(element.xpath('.//h2/a/@href'))
            location = ''.join(element.xpath('.//div[@data-testid="text-location"]/text()'))
            description = ', '.join(element.xpath('.//div[@class="css-9446fg eu4oa1w0"]/ul//li/text()'))
            company_name = ''.join(element.xpath('.//span[@data-testid="company-name"]/text()'))

            # Append extracted data to the jobs_data list
            jobs_data.append({
                'Title': title,
                'Link': f"https://www.indeed.com{link}",
                'Location': location,
                'Description': description,
                'Company': company_name
            })

    return jobs_data
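Since parse_job_listings only needs an HTML string, you can also test it without launching a browser, for example against a page saved to disk earlier; the file name here is just a placeholder:
# Optional: test the parser against a previously saved page (placeholder file name)
with open('indeed_page.html', 'r', encoding='utf-8') as f:
    saved_content = f.read()

for job in parse_job_listings(saved_content):
    print(job['Title'], '-', job['Company'])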
Now that we have both the browser automation and parsing steps set up, let’s combine them to scrape job listings from the Indeed page.
import pandas as pd

async def scrape_indeed_jobs(url):
    # Step 1: Get page content using Playwright
    content = await get_page_content(url)
    # Step 2: Parse the HTML and extract job details
    jobs_data = parse_job_listings(content)
    return jobs_data

# URL to scrape
url = 'https://www.indeed.com/q-usa-jobs.html'

# Scraping and saving data
async def main():
    # Scrape job data from the specified URL
    jobs = await scrape_indeed_jobs(url)

    # Step 3: Save data to CSV using pandas
    df = pd.DataFrame(jobs)
    df.to_csv('indeed_jobs.csv', index=False)
    print("Data saved to indeed_jobs.csv")

# Run the main function
asyncio.run(main())
Indeed paginates its job listings, and you can easily extend the scraper to handle multiple pages. The page URL is adjusted using a query parameter start, which increments by 10 for each new page.
To collect data from more than one page, you can implement a scrape_multiple_pages function that takes the base URL and increments the start parameter for each subsequent page. Stepping through the pages this way widens the scope of the data you collect, such as vacancies, and gives you a more comprehensive dataset.
async def scrape_multiple_pages(base_url, pages=3):
    all_jobs = []

    for page_num in range(pages):
        # Update URL for pagination
        url = f"{base_url}&start={page_num * 10}"
        print(f"Scraping page: {url}")

        # Scrape job data from each page
        jobs = await scrape_indeed_jobs(url)
        all_jobs.extend(jobs)

    # Save all jobs to CSV
    df = pd.DataFrame(all_jobs)
    df.to_csv('indeed_jobs_all_pages.csv', index=False)
    print("Data saved to indeed_jobs_all_pages.csv")

# Scrape multiple pages of job listings
asyncio.run(scrape_multiple_pages('https://www.indeed.com/jobs?q=usa', pages=3))
To target specific job titles or keywords, configure the q (query) parameter in the Indeed URL. This lets the scraper collect data for particular jobs or sectors. For instance, to search for Python developer positions on http://www.indeed.com, you would set the query parameter to "Python+developer" or other relevant keywords.
query = "python+developer"
base_url = f"https://www.indeed.com/jobs?q={query}"
asyncio.run(scrape_multiple_pages(base_url, pages=3))
By modifying this parameter according to your data collection needs, you can focus your scraping on specific jobs, enhancing the flexibility and efficiency of your data collection process. This approach is especially useful for adapting to the dynamic demands of the job market.
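If the search term comes from user input rather than a hand-typed string, it is safer to URL-encode it with the standard library instead of inserting '+' manually; the search term below is just an example:
from urllib.parse import quote_plus

search_term = "data engineer"  # example free-text input
query = quote_plus(search_term)  # becomes 'data+engineer'
base_url = f"https://www.indeed.com/jobs?q={query}"
asyncio.run(scrape_multiple_pages(base_url, pages=3))
Putting all of the pieces together, the complete scraper looks like this: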
import asyncio
from playwright.async_api import async_playwright
from lxml import html
import pandas as pd
# Step 1: Fetch page content using Playwright
async def get_page_content(url):
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=False,  # Run browser in headed mode
            proxy={
                'server': '',    # Fill in your proxy details before running
                'username': '',
                'password': ''
            }
        )
        page = await browser.new_page()
        await page.goto(url, wait_until='networkidle')

        # Extract page content
        content = await page.content()
        await browser.close()  # Close browser after use
        return content
# Step 2: Parse the HTML content using lxml
def parse_job_listings(content):
    # Parse the HTML using lxml
    parser = html.fromstring(content)

    # Select individual job postings using XPath
    job_posting = parser.xpath('//ul[@class="css-zu9cdh eu4oa1w0"]/li')

    # Extract job data
    jobs_data = []
    for element in job_posting[:-1]:
        title = ''.join(element.xpath('.//h2/a/span/@title'))
        if title:
            link = ''.join(element.xpath('.//h2/a/@href'))
            location = ''.join(element.xpath('.//div[@data-testid="text-location"]/text()'))
            description = ', '.join(element.xpath('.//div[@class="css-9446fg eu4oa1w0"]/ul//li/text()'))
            company_name = ''.join(element.xpath('.//span[@data-testid="company-name"]/text()'))

            # Append extracted data to the jobs_data list
            jobs_data.append({
                'Title': title,
                'Link': f"https://www.indeed.com{link}",
                'Location': location,
                'Description': description,
                'Company': company_name
            })

    return jobs_data
# Step 3: Scrape Indeed jobs for a single page
async def scrape_indeed_jobs(url):
    # Get page content using Playwright
    content = await get_page_content(url)
    # Parse HTML and extract job data
    jobs_data = parse_job_listings(content)
    return jobs_data
# Step 4: Handle pagination and scrape multiple pages
async def scrape_multiple_pages(base_url, query, pages=3):
    all_jobs = []

    for page_num in range(pages):
        # Update the URL to handle pagination and add the search query
        url = f"{base_url}?q={query}&start={page_num * 10}"
        print(f"Scraping page: {url}")

        # Scrape jobs for the current page
        jobs = await scrape_indeed_jobs(url)
        all_jobs.extend(jobs)

    # Save all jobs to a CSV file
    df = pd.DataFrame(all_jobs)
    df.to_csv(f'indeed_jobs_{query}.csv', index=False)
    print(f"Data saved to indeed_jobs_{query}.csv")
# Function to run the scraper with dynamic query input
async def run_scraper():
    # Step 5: Ask user for input query and number of pages to scrape
    query = input("Enter the job title or keywords to search (e.g., python+developer): ")
    pages = int(input("Enter the number of pages to scrape: "))

    # Scrape jobs across multiple pages based on the query
    base_url = 'https://www.indeed.com/jobs'
    await scrape_multiple_pages(base_url, query, pages)

# Run the scraper
asyncio.run(run_scraper())
To keep the scraping process smooth and reduce the risk of blocks and CAPTCHAs, it's crucial to choose the right proxy server. The best option for scraping is ISP proxies, which provide high speed, stable connections, and a high trust factor, so they are rarely blocked by platforms. This type of proxy is static, so for large-scale scraping you need to build a pool of ISP proxies and set up IP rotation to change them regularly. An alternative is residential proxies, which are dynamic and have the broadest geographic coverage of any proxy type.
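As a rough sketch of how such rotation might look with the code from this tutorial, you could keep a small pool of proxies and pick one per request; the proxy addresses below are placeholders, and the details depend on your provider:
import random
from playwright.async_api import async_playwright

# Placeholder pool: replace with your own ISP proxy credentials
PROXY_POOL = [
    {'server': 'http://proxy1.example.com:8000', 'username': 'user', 'password': 'pass'},
    {'server': 'http://proxy2.example.com:8000', 'username': 'user', 'password': 'pass'},
]

async def get_page_content_rotating(url):
    # Same logic as get_page_content, but with a randomly chosen proxy per call
    proxy = random.choice(PROXY_POOL)
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False, proxy=proxy)
        page = await browser.new_page()
        await page.goto(url, wait_until='networkidle')
        content = await page.content()
        await browser.close()
        return content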