For job hunters, employers, or anyone monitoring trends in the job market, scraping Indeed's job listings can provide useful information. In this tutorial, we combine Playwright for browser automation with lxml for HTML parsing to collect job details (title, recruiting company, location, description, and posting link) and then save the results to a CSV file.
To successfully perform scraping, the following Python libraries need to be installed.
Playwright for browser automation:
pip install playwright
lxml for parsing HTML:
pip install lxml
pandas for saving data to a CSV file:
pip install pandas
Install Playwright browsers:
After installing Playwright, run this command to install the necessary browser binaries:
playwright install
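If you want to confirm the setup before writing the scraper, an optional sanity check is to launch and immediately close a browser. This snippet is only a quick check and not part of the tutorial code:
import asyncio
from playwright.async_api import async_playwright

async def check_playwright():
    # Launch and immediately close Chromium to confirm the binaries are installed
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        await browser.close()
    print("Playwright and Chromium are ready")

asyncio.run(check_playwright())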
Playwright lets you automate and interact with web browsers. We start by setting up Playwright to launch a Chromium browser, visit a page, and extract its content. This is also where we can pass proxy settings to Playwright.
Why use proxies?
Websites often have rate-limiting or anti-scraping measures in place to block repeated requests from the same IP address. Proxies allow you to rotate IP addresses, avoid those blocks and rate limits, and keep scraping reliably even when a single address would be flagged.
import asyncio
from playwright.async_api import async_playwright

async def get_page_content(url):
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=False,  # Headed browser
            proxy={
                'server': '',    # Fill in your proxy details, e.g. 'http://host:port'
                'username': '',
                'password': ''
            }
        )
        page = await browser.new_page()
        await page.goto(url)
        # Extract the page's content
        content = await page.content()
        await browser.close()  # Close the browser once done
        return content
In this code, async_playwright launches a headed browser, navigates to the specified URL, and fetches the page's content.
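As a quick standalone check (assuming you have filled in real proxy details above or removed the proxy argument), you can run the function directly and look at how much HTML comes back; the URL below is just an example:
# Quick test of get_page_content on an example URL
html_content = asyncio.run(get_page_content('https://www.indeed.com/jobs?q=usa'))
print(f"Fetched {len(html_content)} characters of HTML")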
Next, we will parse the page content to extract meaningful data. lxml is used for this purpose because it provides robust support for parsing and querying HTML content using XPath.
from lxml import html

def parse_job_listings(content):
    # Parse HTML content
    parser = html.fromstring(content)

    # Extract each job posting using XPath
    job_posting = parser.xpath('//ul[@class="css-zu9cdh eu4oa1w0"]/li')

    jobs_data = []
    for element in job_posting[:-1]:  # Skip the last element if it's an ad or irrelevant
        title = ''.join(element.xpath('.//h2/a/span/@title'))
        if title:
            link = ''.join(element.xpath('.//h2/a/@href'))
            location = ''.join(element.xpath('.//div[@data-testid="text-location"]/text()'))
            description = ', '.join(element.xpath('.//div[@class="css-9446fg eu4oa1w0"]/ul//li/text()'))
            company_name = ''.join(element.xpath('.//span[@data-testid="company-name"]/text()'))

            # Append extracted data to the jobs_data list
            jobs_data.append({
                'Title': title,
                'Link': f"https://www.indeed.com{link}",
                'Location': location,
                'Description': description,
                'Company': company_name
            })

    return jobs_data
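Since parse_job_listings only needs an HTML string, you can also test it without launching a browser, for example against a page saved to disk earlier; the file name here is just a placeholder:
# Optional: test the parser against a previously saved page (placeholder file name)
with open('indeed_page.html', 'r', encoding='utf-8') as f:
    saved_content = f.read()

for job in parse_job_listings(saved_content):
    print(job['Title'], '-', job['Company'])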
Now that we have both the browser automation and parsing steps set up, let’s combine them to scrape job listings from the Indeed page.
import pandas as pd

async def scrape_indeed_jobs(url):
    # Step 1: Get page content using Playwright
    content = await get_page_content(url)
    # Step 2: Parse the HTML and extract job details
    jobs_data = parse_job_listings(content)
    return jobs_data

# URL to scrape
url = 'https://www.indeed.com/q-usa-jobs.html'

# Scraping and saving data
async def main():
    # Scrape job data from the specified URL
    jobs = await scrape_indeed_jobs(url)

    # Step 3: Save data to CSV using pandas
    df = pd.DataFrame(jobs)
    df.to_csv('indeed_jobs.csv', index=False)
    print("Data saved to indeed_jobs.csv")

# Run the main function
asyncio.run(main())
Indeed paginates its job listings, and you can easily extend the scraper to handle multiple pages. The page URL is adjusted using a query parameter start, which increments by 10 for each new page.
To collect data from more than one page, you can implement a scrape_multiple_pages function that takes the base URL and increments the start parameter for each subsequent page. Stepping through the pages this way widens the scope of the data you collect, such as vacancies, and gives you a more comprehensive dataset.
async def scrape_multiple_pages(base_url, pages=3):
    all_jobs = []

    for page_num in range(pages):
        # Update URL for pagination
        url = f"{base_url}&start={page_num * 10}"
        print(f"Scraping page: {url}")

        # Scrape job data from each page
        jobs = await scrape_indeed_jobs(url)
        all_jobs.extend(jobs)

    # Save all jobs to CSV
    df = pd.DataFrame(all_jobs)
    df.to_csv('indeed_jobs_all_pages.csv', index=False)
    print("Data saved to indeed_jobs_all_pages.csv")

# Scrape multiple pages of job listings
asyncio.run(scrape_multiple_pages('https://www.indeed.com/jobs?q=usa', pages=3))
To target specific job titles or keywords, configure the q (query) parameter in the Indeed URL. This lets the scraper collect data for particular jobs or sectors. For instance, to search for Python developer positions on http://www.indeed.com, you would set the query parameter to "Python+developer" or other relevant keywords.
query = "python+developer"
base_url = f"https://www.indeed.com/jobs?q={query}"
asyncio.run(scrape_multiple_pages(base_url, pages=3))
By modifying this parameter according to your data collection needs, you can focus your scraping on specific jobs, enhancing the flexibility and efficiency of your data collection process. This approach is especially useful for adapting to the dynamic demands of the job market.
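If the search term comes from user input rather than a hand-typed string, it is safer to URL-encode it with the standard library instead of inserting '+' manually; the search term below is just an example:
from urllib.parse import quote_plus

search_term = "data engineer"  # example free-text input
query = quote_plus(search_term)  # becomes 'data+engineer'
base_url = f"https://www.indeed.com/jobs?q={query}"
asyncio.run(scrape_multiple_pages(base_url, pages=3))
Putting all of the pieces together, the complete scraper looks like this: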
import asyncio
from playwright.async_api import async_playwright
from lxml import html
import pandas as pd
# Step 1: Fetch page content using Playwright
async def get_page_content(url):
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=False,  # Run browser in headed mode
            proxy={
                'server': '',    # Fill in your proxy details before running
                'username': '',
                'password': ''
            }
        )
        page = await browser.new_page()
        await page.goto(url, wait_until='networkidle')

        # Extract page content
        content = await page.content()
        await browser.close()  # Close browser after use
        return content
# Step 2: Parse the HTML content using lxml
def parse_job_listings(content):
    # Parse the HTML using lxml
    parser = html.fromstring(content)

    # Select individual job postings using XPath
    job_posting = parser.xpath('//ul[@class="css-zu9cdh eu4oa1w0"]/li')

    # Extract job data
    jobs_data = []
    for element in job_posting[:-1]:
        title = ''.join(element.xpath('.//h2/a/span/@title'))
        if title:
            link = ''.join(element.xpath('.//h2/a/@href'))
            location = ''.join(element.xpath('.//div[@data-testid="text-location"]/text()'))
            description = ', '.join(element.xpath('.//div[@class="css-9446fg eu4oa1w0"]/ul//li/text()'))
            company_name = ''.join(element.xpath('.//span[@data-testid="company-name"]/text()'))

            # Append extracted data to the jobs_data list
            jobs_data.append({
                'Title': title,
                'Link': f"https://www.indeed.com{link}",
                'Location': location,
                'Description': description,
                'Company': company_name
            })

    return jobs_data
# Step 3: Scrape Indeed jobs for a single page
async def scrape_indeed_jobs(url):
    # Get page content using Playwright
    content = await get_page_content(url)
    # Parse HTML and extract job data
    jobs_data = parse_job_listings(content)
    return jobs_data
# Step 4: Handle pagination and scrape multiple pages
async def scrape_multiple_pages(base_url, query, pages=3):
    all_jobs = []

    for page_num in range(pages):
        # Update the URL to handle pagination and add the search query
        url = f"{base_url}?q={query}&start={page_num * 10}"
        print(f"Scraping page: {url}")

        # Scrape jobs for the current page
        jobs = await scrape_indeed_jobs(url)
        all_jobs.extend(jobs)

    # Save all jobs to a CSV file
    df = pd.DataFrame(all_jobs)
    df.to_csv(f'indeed_jobs_{query}.csv', index=False)
    print(f"Data saved to indeed_jobs_{query}.csv")
# Function to run the scraper with dynamic query input
async def run_scraper():
    # Step 5: Ask user for input query and number of pages to scrape
    query = input("Enter the job title or keywords to search (e.g., python+developer): ")
    pages = int(input("Enter the number of pages to scrape: "))

    # Scrape jobs across multiple pages based on the query
    base_url = 'https://www.indeed.com/jobs'
    await scrape_multiple_pages(base_url, query, pages)

# Run the scraper
asyncio.run(run_scraper())
To keep the scraping process smooth and reduce the risk of blocks and CAPTCHAs, it's crucial to choose the right proxy server. The best option for scraping is ISP proxies, which provide high speed, stable connections, and a high trust factor, so they are rarely blocked by platforms. This type of proxy is static, so for large-scale scraping you need to build a pool of ISP proxies and set up IP rotation to change them regularly. An alternative is residential proxies, which are dynamic and have the broadest geographic coverage of any proxy type.
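As a rough sketch of how such rotation might look with the code from this tutorial, you could keep a small pool of proxies and pick one per request; the proxy addresses below are placeholders, and the details depend on your provider:
import random
from playwright.async_api import async_playwright

# Placeholder pool: replace with your own ISP proxy credentials
PROXY_POOL = [
    {'server': 'http://proxy1.example.com:8000', 'username': 'user', 'password': 'pass'},
    {'server': 'http://proxy2.example.com:8000', 'username': 'user', 'password': 'pass'},
]

async def get_page_content_rotating(url):
    # Same logic as get_page_content, but with a randomly chosen proxy per call
    proxy = random.choice(PROXY_POOL)
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False, proxy=proxy)
        page = await browser.new_page()
        await page.goto(url, wait_until='networkidle')
        content = await page.content()
        await browser.close()
        return content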