Scraping LinkedIn data can be incredibly valuable, whether you're analyzing the job market, researching companies, or sourcing candidates for recruitment.
This article covers the key techniques and strategies, starting with how to avoid detection by using proxies and realistic request headers. The requests library will be used to make HTTP requests, while lxml will parse the returned HTML.
Before you begin, make sure you have Python installed on your machine.
Install the required libraries using pip:
pip install requests
pip install lxml
Below is a step-by-step walkthrough of the code for scraping LinkedIn job listings with Python.
We'll need several Python libraries:
import requests
from lxml import html
import csv
import random
Start by defining the LinkedIn job search URL that you want to scrape.
url = 'https://...'  # replace with the LinkedIn job search URL you want to scrape
To scrape LinkedIn effectively, it’s crucial to use the correct headers, especially the User-Agent header, to mimic requests from an actual browser.
user_agents = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
]
Modern proxy providers often support internal rotation, meaning they automatically rotate IP addresses for you, which removes the need to pick proxies from a list manually. Still, for illustrative purposes, here's how you would handle proxy rotation yourself:
proxy_list = ['http://IP1:PORT', 'http://IP2:PORT']  # placeholder proxy endpoints
proxies = {
    'http': random.choice(proxy_list),
    'https': random.choice(proxy_list)
}
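If your provider exposes a single rotating gateway instead, one entry is enough. Here's a minimal sketch, assuming a hypothetical gateway hostname, port, and credentials that you would replace with your provider's details:
# Hypothetical rotating gateway: the provider swaps the exit IP for you on each request
proxies = {
    'http': 'http://USERNAME:PASSWORD@gate.example-provider.com:7777',
    'https': 'http://USERNAME:PASSWORD@gate.example-provider.com:7777'
}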
Successful LinkedIn scraping hinges on headers that emulate the behavior of a real browser. Properly configured headers help you get past anti-bot protection systems and reduce the chance of your scraping activity being blocked.
headers = {
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
'accept-language': 'en-IN,en;q=0.9',
'dnt': '1',
'sec-ch-ua': '"Not/A)Brand";v="8", "Chromium";v="126", "Google Chrome";v="126"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Linux"',
'sec-fetch-dest': 'document',
'sec-fetch-mode': 'navigate',
'sec-fetch-site': 'none',
'sec-fetch-user': '?1',
'upgrade-insecure-requests': '1',
}
To collect and store the job posting information, start by initializing an empty list. Job details will be appended to it as they are extracted from the HTML, keeping the data in one place for later processing or analysis.
job_details = []
Next, set a random User-Agent and your proxy, send the HTTP GET request, and parse the returned HTML with the lxml library. Parsing lets us navigate the HTML structure and identify the data we want to extract.
# Set a random User-Agent and a proxy that uses IP authorization (replace IP:PORT)
headers['user-agent'] = random.choice(user_agents)
proxies = {
    'http': 'http://IP:PORT',
    'https': 'http://IP:PORT'
}
# Send an HTTP GET request to the URL
response = requests.get(url=url, headers=headers, proxies=proxies)
parser = html.fromstring(response.content)
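It's also worth confirming that LinkedIn actually returned a successful response before relying on the parsed content; the requests library can raise an exception for HTTP error status codes:
# Abort early if LinkedIn responded with an HTTP error (e.g. when the request is blocked)
response.raise_for_status()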
Once the HTML content is parsed, we can extract specific job details such as title, company name, location, and job URL using XPath queries. These details are stored in a dictionary and appended to a list.
# Extract job details from the HTML content
for job in parser.xpath('//ul[@class="jobs-search__results-list"]/li'):
    title = ''.join(job.xpath('.//div/a/span/text()')).strip()
    company = ''.join(job.xpath('.//div/div[2]/h4/a/text()')).strip()
    location = ''.join(job.xpath('.//div/div[2]/div/span/text()')).strip()
    job_url = job.xpath('.//div/a/@href')[0]
    job_detail = {
        'title': title,
        'company': company,
        'location': location,
        'job_url': job_url
    }
    job_details.append(job_detail)
After collecting the job data, save it to a CSV file.
with open('linkedin_jobs.csv', 'w', newline='', encoding='utf-8') as csvfile:
    fieldnames = ['title', 'company', 'location', 'job_url']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for job_detail in job_details:
        writer.writerow(job_detail)
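To quickly sanity-check the output, you can read the file straight back with the same csv module:
# Print the saved rows to confirm the CSV was written as expected
with open('linkedin_jobs.csv', newline='', encoding='utf-8') as csvfile:
    for row in csv.DictReader(csvfile):
        print(row)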
Here is the complete requests and lxml code, combining all the sections above into one script. The proxy endpoint and the LinkedIn search URL are placeholders that you'll need to fill in yourself:
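import csv
import random

import requests
from lxml import html

# LinkedIn job search URL to scrape (placeholder)
url = 'https://...'

# Desktop browser User-Agent strings to rotate through
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
]

# Browser-like headers; the User-Agent is filled in below
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-IN,en;q=0.9',
    'dnt': '1',
    'sec-ch-ua': '"Not/A)Brand";v="8", "Chromium";v="126", "Google Chrome";v="126"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Linux"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
}
headers['user-agent'] = random.choice(user_agents)

# Proxy endpoint placeholders (IP-authorization method)
proxies = {
    'http': 'http://IP:PORT',
    'https': 'http://IP:PORT'
}

# Fetch and parse the search results page
response = requests.get(url=url, headers=headers, proxies=proxies)
response.raise_for_status()
parser = html.fromstring(response.content)

# Extract the job details with XPath
job_details = []
for job in parser.xpath('//ul[@class="jobs-search__results-list"]/li'):
    title = ''.join(job.xpath('.//div/a/span/text()')).strip()
    company = ''.join(job.xpath('.//div/div[2]/h4/a/text()')).strip()
    location = ''.join(job.xpath('.//div/div[2]/div/span/text()')).strip()
    job_url = job.xpath('.//div/a/@href')[0]
    job_details.append({
        'title': title,
        'company': company,
        'location': location,
        'job_url': job_url
    })

# Save the results to CSV
with open('linkedin_jobs.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=['title', 'company', 'location', 'job_url'])
    writer.writeheader()
    for job_detail in job_details:
        writer.writerow(job_detail)

If you prefer a browser-based alternative, for example when the page needs JavaScript rendering or your proxy requires username and password authentication, the same extraction can be done with Selenium and the selenium-wire package (pip install selenium selenium-wire):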
from selenium.webdriver.chrome.options import Options
from seleniumwire import webdriver as wiredriver
from selenium.webdriver.common.by import By
import csv
# Specify the proxy server address with username and password (placeholders)
proxy_address = ""
proxy_username = ""
proxy_password = ""

# selenium-wire handles authenticated proxies through its own options dictionary
seleniumwire_options = {
    'proxy': {
        'http': f'http://{proxy_username}:{proxy_password}@{proxy_address}',
        'https': f'https://{proxy_username}:{proxy_password}@{proxy_address}',
    }
}

chrome_options = Options()

# Create a WebDriver instance with selenium-wire
driver = wiredriver.Chrome(seleniumwire_options=seleniumwire_options, options=chrome_options)
url = 'https://...'  # replace with the LinkedIn job search URL you want to scrape
# Perform your Selenium automation with the enhanced capabilities of selenium-wire
driver.get(url)
job_details = []
all_elements = driver.find_elements(By.XPATH, '//*[@id="main-content"]/section/ul/li')

for element in all_elements:
    title = element.find_element(By.XPATH, './/div/div/h3').text
    company = element.find_element(By.XPATH, './/div/div[2]/h4/a').text
    location = element.find_element(By.XPATH, './/div/div[2]/div/span').text
    job_url = element.find_element(By.XPATH, './/div/a').get_attribute('href')
    job_detail = {
        'title': title,
        'company': company,
        'location': location,
        'job_url': job_url
    }
    job_details.append(job_detail)
with open('linkedin_jobs.csv', 'w', newline='', encoding='utf-8') as csvfile:
    fieldnames = ['title', 'company', 'location', 'job_url']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for job_detail in job_details:
        writer.writerow(job_detail)
driver.quit()
Extracting data from LinkedIn with Python's requests and lxml libraries is a powerful way to analyze the job market and support recruiting. For a smooth scraping process, use high-speed datacenter proxies or ISP proxies, whose higher trust factor reduces the risk of automated requests being blocked.