Glassdoor is one of the best platforms for both job seekers and employers, offering salary information, employer reviews, and job applications. In this guide, we’ll walk through the process of scraping job listings from Glassdoor using Python and Playwright. Playwright is essential here because Glassdoor employs strong anti-bot measures that can flag and block traditional scraping libraries. With Playwright, we can simulate a real browser and route traffic through proxies, helping us bypass these detection systems.
Because of Glassdoor's robust anti-scraping mechanisms, direct HTTP requests with libraries like requests can lead to IP blocking or CAPTCHA challenges. Playwright lets us automate a real browser, making our interactions more human-like. By adding proxies and realistic browser headers, we can further reduce the chance of detection.
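For instance, Playwright lets you attach a realistic user agent and extra HTTP headers to a browser context. Here’s a minimal sketch (the fetch_with_headers name, the user-agent string, and the Accept-Language value are illustrative choices, not required values):

from playwright.async_api import async_playwright

async def fetch_with_headers(url):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        # A browser context carries the user agent and headers for every request
        context = await browser.new_context(
            user_agent=(
                'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                'AppleWebKit/537.36 (KHTML, like Gecko) '
                'Chrome/126.0.0.0 Safari/537.36'
            ),
            extra_http_headers={'Accept-Language': 'en-US,en;q=0.9'},
        )
        page = await context.new_page()
        await page.goto(url, timeout=60000)
        content = await page.content()
        await browser.close()
        return content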
To get started, you’ll need to install Playwright and the lxml library for HTML parsing. You can install them as follows (the second command downloads the browser binaries Playwright drives):
pip install playwright lxml
playwright install
We’ll walk through each step, from loading the page with Playwright to extracting job details and saving the data into a CSV file.
First, set up Playwright with a proxy to connect to Glassdoor. This helps prevent getting blocked and allows the browser to load the page as if a real user were visiting the site.
import asyncio
from playwright.async_api import async_playwright
from lxml.html import fromstring

async def scrape_job_listings():
    async with async_playwright() as p:
        # Launch Chromium with a proxy so requests appear to come from a regular user
        browser = await p.chromium.launch(
            headless=False,
            proxy={"server": '', 'username': '', 'password': ''}  # fill in your proxy details
        )
        page = await browser.new_page()
        await page.goto('https://www.glassdoor.com/Job/united-states-software-engineer-jobs-SRCH_IL.0,13_IN1_KO14,31.htm', timeout=60000)
        content = await page.content()
        await browser.close()
        return content

# Call the function to retrieve the page content
html_content = asyncio.run(scrape_job_listings())
After loading the page, use lxml to parse the HTML content and extract relevant job information. Here’s how to parse the job title, location, salary, and other details for each job listing:
parser = fromstring(html_content)
job_posting_elements = parser.xpath('//li[@data-test="jobListing"]')

jobs_data = []
for element in job_posting_elements:
    # Pull each field out of the listing element with a relative XPath
    job_title = element.xpath('.//a[@data-test="job-title"]/text()')[0]
    job_location = element.xpath('.//div[@data-test="emp-location"]/text()')[0]
    salary = ' '.join(element.xpath('.//div[@data-test="detailSalary"]/text()')).strip()
    job_link = element.xpath('.//a[@data-test="job-title"]/@href')[0]
    easy_apply = bool(element.xpath('.//div[@data-role-variant="featured"]'))
    company = element.xpath('.//span[@class="EmployerProfile_compactEmployerName__LE242"]/text()')[0]

    job_data = {
        'company': company,
        'job_title': job_title,
        'job_location': job_location,
        'job_link': job_link,
        'salary': salary,
        'easy_apply': easy_apply
    }
    jobs_data.append(job_data)
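One caveat: indexing with [0] raises an IndexError when a listing omits a field. If you want the loop to tolerate missing values, a small helper can fall back to a default. This is a sketch; first_text is a name introduced here, not part of lxml:

def first_text(element, xpath, default='N/A'):
    # Return the first XPath text match, or a default when nothing matches
    matches = element.xpath(xpath)
    return matches[0].strip() if matches else default

# Example usage inside the loop above
job_title = first_text(element, './/a[@data-test="job-title"]/text()')
job_location = first_text(element, './/div[@data-test="emp-location"]/text()')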
Once we’ve extracted the job details, we can save them into a CSV file for easy data analysis.
import csv

with open('glassdoor_job_listings.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=['company', 'job_title', 'job_location', 'job_link', 'salary', 'easy_apply'])
    writer.writeheader()
    writer.writerows(jobs_data)
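To sanity-check the output, you can read the file back with the standard library. For example, counting how many listings support Easy Apply (note that csv stores the boolean as the string 'True'):

import csv

with open('glassdoor_job_listings.csv', newline='', encoding='utf-8') as file:
    reader = csv.DictReader(file)
    # Count rows where the easy_apply flag was written as 'True'
    easy_apply_count = sum(1 for row in reader if row['easy_apply'] == 'True')
print(f'Easy Apply listings: {easy_apply_count}')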
Finally, here is the complete script, putting all of the steps together:

import asyncio
import csv
from playwright.async_api import async_playwright
from lxml.html import fromstring

async def scrape_job_listings():
    # Set up the Playwright browser with a proxy to avoid detection
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=False,
            proxy={"server": '', 'username': '', 'password': ''}  # fill in your proxy details
        )
        page = await browser.new_page()
        await page.goto('https://www.glassdoor.com/Job/united-states-software-engineer-jobs-SRCH_IL.0,13_IN1_KO14,31.htm', timeout=60000)

        # Retrieve the page content and close the browser
        content = await page.content()
        await browser.close()

    # Parse the content with lxml
    parser = fromstring(content)
    job_posting_elements = parser.xpath('//li[@data-test="jobListing"]')

    # Extract data for each job listing
    jobs_data = []
    for element in job_posting_elements:
        job_title = element.xpath('.//a[@data-test="job-title"]/text()')[0]
        job_location = element.xpath('.//div[@data-test="emp-location"]/text()')[0]
        salary = ' '.join(element.xpath('.//div[@data-test="detailSalary"]/text()')).strip()
        job_link = "https://www.glassdoor.com" + element.xpath('.//a[@data-test="job-title"]/@href')[0]
        easy_apply = bool(element.xpath('.//div[@data-role-variant="featured"]'))
        company = element.xpath('.//span[@class="EmployerProfile_compactEmployerName__LE242"]/text()')[0]

        job_data = {
            'company': company,
            'job_title': job_title,
            'job_location': job_location,
            'job_link': job_link,
            'salary': salary,
            'easy_apply': easy_apply
        }
        jobs_data.append(job_data)

    # Save the data to a CSV file
    with open('glassdoor_job_listings.csv', 'w', newline='', encoding='utf-8') as file:
        writer = csv.DictWriter(file, fieldnames=['company', 'job_title', 'job_location', 'job_link', 'salary', 'easy_apply'])
        writer.writeheader()
        writer.writerows(jobs_data)

# Run the scraping function
asyncio.run(scrape_job_listings())
Explanation of the Complete Code:
The script combines the steps covered above: it launches Chromium through Playwright with a proxy attached, loads the Glassdoor search results page, parses the returned HTML with lxml to pull the company, title, location, link, salary, and Easy Apply status from each listing, and writes all of the records to glassdoor_job_listings.csv.
When scraping Glassdoor or any other website, it’s essential to follow responsible scraping practices:
- Respect rate limits: add delays between requests so you don’t overwhelm Glassdoor’s servers (see the sketch below).
- Use rotating proxies: spreading requests across IP addresses reduces the load from any single address and lowers the risk of bans.
- Comply with the site’s terms: review Glassdoor’s terms of service and robots.txt, and only collect data you are permitted to use.
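As one example of rate limiting, you can pause between page loads with asyncio.sleep. This is a sketch; fetch_pages_politely is a hypothetical helper, and the two-second default is an arbitrary illustration to tune for your own use:

import asyncio

async def fetch_pages_politely(page, urls, delay_seconds=2):
    # Load each URL in turn, pausing between requests to limit server load
    contents = []
    for url in urls:
        await page.goto(url, timeout=60000)
        contents.append(await page.content())
        await asyncio.sleep(delay_seconds)  # illustrative delay between requests
    return contents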
Knowing how to scrape Glassdoor with Python and Playwright greatly expands your ability to collect job listing data. Combined with proxies and appropriate headers, this technique significantly reduces the risk of being blocked. Keep ethical scraping practices in mind so you don’t overload Glassdoor’s servers; by following these guidelines, you can gather and process useful employment information from Glassdoor for your own use or that of your company.