How to Scrape Glassdoor Data Using Python

Glassdoor is one of the leading platforms for both job seekers and employers, offering salary data, company reviews, and job listings. In this guide, we’ll walk through the process of scraping job listings from Glassdoor using Python and Playwright. Playwright is essential here because Glassdoor employs strong anti-bot measures that can flag and block traditional scraping libraries. With Playwright, we can simulate a real browser and route traffic through proxies, helping us bypass these detection systems.

Due to Glassdoor's robust anti-scraping mechanisms, direct HTTP requests with libraries like requests can lead to IP blocking or CAPTCHA challenges. Playwright lets us automate a real browser, making our interactions more human-like. Adding proxies and realistic browser headers further reduces the chance of detection.
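
As an illustration, here is a minimal sketch of how a custom user agent and extra headers might be attached through a Playwright browser context (the header values below are examples, not required ones):


from playwright.async_api import async_playwright

async def open_page_with_headers(url):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        # A browser context carries the user agent and extra HTTP headers
        context = await browser.new_context(
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
            ),
            extra_http_headers={"Accept-Language": "en-US,en;q=0.9"},
        )
        page = await context.new_page()
        await page.goto(url, timeout=60000)
        content = await page.content()
        await browser.close()
        return content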

Requirements

To get started, you’ll need to install Playwright and the lxml library for HTML parsing. You can install them as follows:


pip install playwright lxml
playwright install
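
If you only need Chromium, running playwright install chromium downloads just that browser instead of all of Playwright's supported browsers.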

Scraping Glassdoor job listings

We’ll walk through each step, from loading the page with Playwright to extracting job details and saving the data into a CSV file.

Step 1. Setting up the browser and making requests

First, set up Playwright with a proxy to connect to Glassdoor. This helps prevent getting blocked and allows the browser to load the page as if a real user were visiting the site.


from playwright.async_api import async_playwright
from lxml.html import fromstring

async def scrape_job_listings():
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=False,
            proxy={"server": '', 'username': '', 'password': ''}
        )
        page = await browser.new_page()
        await page.goto('https link', timeout=60000)
        content = await page.content()
        await browser.close()
        return content

# Call the function to retrieve page content
html_content = await scrape_job_listings()
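
Note that top-level await like this only works where an event loop is already running, for example in a Jupyter notebook. In a regular script, wrap the call with asyncio.run:


import asyncio

html_content = asyncio.run(scrape_job_listings())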

Step 2. Parsing the HTML and extracting data

After loading the page, use lxml to parse the HTML content and extract relevant job information. Here’s how to parse the job title, location, salary, and other details for each job listing:


parser = fromstring(html_content)
job_posting_elements = parser.xpath('//li[@data-test="jobListing"]')

jobs_data = []
for element in job_posting_elements:
    job_title = element.xpath('.//a[@data-test="job-title"]/text()')[0]
    job_location = element.xpath('.//div[@data-test="emp-location"]/text()')[0]
    salary = ' '.join(element.xpath('.//div[@data-test="detailSalary"]/text()')).strip()
    job_link = "https://www.glassdoor.com" + element.xpath('.//a[@data-test="job-title"]/@href')[0]
    easy_apply = bool(element.xpath('.//div[@data-role-variant="featured"]'))
    company = element.xpath('.//span[@class="EmployerProfile_compactEmployerName__LE242"]/text()')[0]
    
    job_data = {
        'company': company,
        'job_title': job_title,
        'job_location': job_location,
        'job_link': job_link,
        'salary': salary,
        'easy_apply': easy_apply
    }
    jobs_data.append(job_data)
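
Indexing XPath results with [0] raises an IndexError whenever a field is missing, and not every listing includes a salary or a company name (the hashed class name used for the company is also prone to change). A small helper makes the extraction more forgiving; first_or_default below is our own name, not part of lxml:


def first_or_default(element, xpath_expr, default=''):
    # Return the first XPath match, or a default when nothing matches
    matches = element.xpath(xpath_expr)
    return matches[0] if matches else default

# Example usage inside the loop above
job_title = first_or_default(element, './/a[@data-test="job-title"]/text()', default='N/A')
salary = first_or_default(element, './/div[@data-test="detailSalary"]/text()', default='Not listed')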

Step 3. Saving data to a CSV file

Once we’ve extracted the job details, we can save them into a CSV file for easy data analysis.


import csv

with open('glassdoor_job_listings.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=['company', 'job_title', 'job_location', 'job_link', 'salary', 'easy_apply'])
    writer.writeheader()
    writer.writerows(jobs_data)
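
To sanity-check the export, the file can be read back with csv.DictReader:


import csv

with open('glassdoor_job_listings.csv', newline='', encoding='utf-8') as file:
    for row in csv.DictReader(file):
        print(row['company'], '-', row['job_title'])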

Complete code


import asyncio
import csv

from playwright.async_api import async_playwright
from lxml.html import fromstring

async def scrape_job_listings():
    # Setup the Playwright browser with proxy to avoid detection
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=False,
            proxy={"server": '', 'username': '', 'password': ''}
        )
        page = await browser.new_page()
        await page.goto('https://www.glassdoor.com/Job/united-states-software-engineer-jobs-SRCH_IL.0,13_IN1_KO14,31.htm', timeout=60000)
        
        # Retrieve the page content and close the browser
        content = await page.content()
        await browser.close()
        
        # Parse the content with lxml
        parser = fromstring(content)
        job_posting_elements = parser.xpath('//li[@data-test="jobListing"]')
        
        # Extract data for each job listing
        jobs_data = []
        for element in job_posting_elements:
            job_title = element.xpath('.//a[@data-test="job-title"]/text()')[0]
            job_location = element.xpath('.//div[@data-test="emp-location"]/text()')[0]
            salary = ' '.join(element.xpath('.//div[@data-test="detailSalary"]/text()')).strip()
            job_link = "https://www.glassdoor.com" + element.xpath('.//a[@data-test="job-title"]/@href')[0]
            easy_apply = bool(element.xpath('.//div[@data-role-variant="featured"]'))
            company = element.xpath('.//span[@class="EmployerProfile_compactEmployerName__LE242"]/text()')[0]
            
            job_data = {
                'company': company,
                'job_title': job_title,
                'job_location': job_location,
                'job_link': job_link,
                'salary': salary,
                'easy_apply': easy_apply
            }
            jobs_data.append(job_data)
    
        # Save the data to a CSV file
        with open('glassdoor_job_listings.csv', 'w', newline='', encoding='utf-8') as file:
            writer = csv.DictWriter(file, fieldnames=['company', 'job_title', 'job_location', 'job_link', 'salary', 'easy_apply'])
            writer.writeheader()
            writer.writerows(jobs_data)

# Run the scraping function
asyncio.run(scrape_job_listings())

Explanation of the complete code

  1. Browser Setup with Proxy: The code starts a Playwright browser session with a proxy to mimic human browsing behavior. The headless=False setting opens a visible browser window, which can further help bypass bot detection.
  2. Navigating to the Job Listings Page: The script visits the Glassdoor job listings URL for software engineering jobs in the United States.
  3. Parsing the Content: The job data is extracted with lxml. We capture the job title, location, salary, job link, company name, and whether the listing supports Easy Apply.
  4. Saving to CSV: After extracting the data, the script saves it to a CSV file, glassdoor_job_listings.csv, with a column for each attribute.

Respecting Glassdoor's terms of service

When scraping Glassdoor or any other website, it’s essential to follow responsible scraping practices:

  • Respect Rate Limits: Avoid overwhelming the server by adding delays between requests (see the sketch after this list).
  • Use Rotating Proxies: Minimize the risk of getting banned by rotating proxies and IP addresses.
  • Comply with Terms of Service: Review the website’s terms of service regularly and avoid actions that violate them.
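
Here is a minimal sketch of what randomized delays and proxy rotation can look like in practice. The proxy addresses below are placeholders, and fetch_politely is a name of our own, not part of Playwright:


import asyncio
import random
from playwright.async_api import async_playwright

# Placeholder proxy pool; substitute your own endpoints and credentials
PROXIES = [
    {"server": "http://proxy1.example.com:8080", "username": "user", "password": "pass"},
    {"server": "http://proxy2.example.com:8080", "username": "user", "password": "pass"},
]

async def fetch_politely(urls):
    contents = []
    async with async_playwright() as p:
        for url in urls:
            # Rotate: pick a different proxy for each request
            browser = await p.chromium.launch(headless=False, proxy=random.choice(PROXIES))
            page = await browser.new_page()
            await page.goto(url, timeout=60000)
            contents.append(await page.content())
            await browser.close()
            # Randomized delay so requests don't arrive at a fixed rhythm
            await asyncio.sleep(random.uniform(2, 6))
    return contents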

Knowing how to scrape Glassdoor data with Python and Playwright makes it much easier to collect job listings at scale. Combined with proxies and appropriate headers, this technique greatly reduces the risk of being blocked by Glassdoor. Keep ethical scraping practices in mind as well, so that you don't overload Glassdoor's servers. With these measures in place, you can collect and process useful employment data from Glassdoor for your own use or that of your company.
