Guide to scraping LinkedIn data with Python

Scraping LinkedIn data can be incredibly valuable for several reasons:

  • Job market analysis: analyze trends in job listings, such as the most in-demand skills and industries;
  • Recruitment: gather data on job postings to inform hiring strategies;
  • Competitor research: monitor hiring patterns and strategies of competitors.

This article covers the key techniques and strategies, above all how to avoid detection by using proxies and realistic request headers. The requests library will be used for making HTTP requests, while lxml will be employed to parse the HTML content.

Setting up the environment

Before you begin, make sure you have Python installed on your machine.

Install the required libraries using pip:


pip install requests
pip install lxml
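
To confirm the installation, you can import both libraries and print their versions (a quick, optional check):

import requests
from lxml import etree

# Print the installed versions to confirm the environment is ready
print(requests.__version__)
print(etree.__version__)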

Getting started with the scraper

Below is a step-by-step walkthrough of a complete code example for scraping LinkedIn job listings using Python:

Import libraries

We'll need several Python libraries:

  • requests: For making HTTP requests to retrieve web pages.
  • lxml: For parsing HTML content.
  • csv: For writing the extracted data to a CSV file.
  • random: For choosing a random User-Agent (and proxy) for each request.

import requests
from lxml import html
import csv
import random

Define job search URL

Start by defining the LinkedIn job search URL that you want to scrape.


# Placeholder: replace with the LinkedIn job search URL you want to scrape
url = 'https link'

User-agent strings and proxies

To scrape LinkedIn effectively, it’s crucial to use the correct headers, especially the User-Agent header, to mimic requests from an actual browser.


user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
]

Modern-day proxy providers often support internal rotation, meaning they automatically rotate IP addresses for you. This eliminates the need to manually select proxies from a list. However, for illustrative purposes, here’s how you would handle proxy rotation if needed:


# Example proxy pool (placeholders); pick one entry at random for each request
proxy_list = ['IP:PORT', 'IP:PORT']
proxy = random.choice(proxy_list)
proxies = {
    'http': proxy,
    'https': proxy
}
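
If your provider uses username and password authorization instead of IP allow-listing, requests also accepts credentials embedded in the proxy URL. The address and credentials below are placeholders:

# Placeholder credentials and address; replace with your provider's details
proxies = {
    'http': 'http://username:password@IP:PORT',
    'https': 'http://username:password@IP:PORT'
}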

Headers for requests

Successful LinkedIn scraping hinges on the correct setup of headers that emulate the behavior of a real browser. Properly configured headers not only aid in circumventing anti-bot protection systems but also diminish the chances of your scraping activities being blocked.


headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-IN,en;q=0.9',
    'dnt': '1',
    'sec-ch-ua': '"Not/A)Brand";v="8", "Chromium";v="126", "Google Chrome";v="126"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Linux"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
}
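
If you plan to make more than one request, one convenient option is to attach these headers to a requests.Session so they are sent automatically with every call (a small, optional sketch):

# Optional: a session sends the configured headers with every request it makes
session = requests.Session()
session.headers.update(headers)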

Initialize data storage

To collect the job posting information, start by initializing an empty list. Job details will be appended to it as they are extracted from the HTML content, which keeps the data organized and easy to process or analyze later.

job_details = []

Parsing the HTML content

After sending an HTTP GET request, the next step is to parse the HTML content using the lxml library. This will allow us to navigate through the HTML structure and identify the data we want to extract.


# Set a random User-Agent and a proxy with the IP authorization method (replace IP:PORT with your proxy address)
headers['user-agent'] = random.choice(user_agents)
proxies = {
    'http': 'IP:PORT',
    'https': 'IP:PORT'
}

# Send an HTTP GET request to the URL
response = requests.get(url=url, headers=headers, proxies=proxies)
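
# Optional sanity check (an assumption, not part of the original flow):
# stop early if LinkedIn did not return a successful response
if response.status_code != 200:
    raise RuntimeError(f'Request failed with status code {response.status_code}')
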
parser = html.fromstring(response.content)

Extracting job data

Once the HTML content is parsed, we can extract specific job details such as title, company name, location, and job URL using XPath queries. These details are stored in a dictionary and appended to a list.


# Extract job details from the HTML content
for job in parser.xpath('//ul[@class="jobs-search__results-list"]/li'):
    title = ''.join(job.xpath('.//div/a/span/text()')).strip()
    company = ''.join(job.xpath('.//div/div[2]/h4/a/text()')).strip()
    location = ''.join(job.xpath('.//div/div[2]/div/span/text()')).strip()
    job_url = job.xpath('.//div/a/@href')[0]
    
    job_detail = {
        'title': title,
        'company': company,
        'location': location,
        'job_url': job_url
    }
    job_details.append(job_detail)
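
As a quick, optional sanity check, you can preview the first extracted record before saving:

# Print the first extracted job, if any were found
if job_details:
    print(job_details[0])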

Save data to CSV

After collecting the job data, save it to a CSV file.


with open('linkedin_jobs.csv', 'w', newline='', encoding='utf-8') as csvfile:
    fieldnames = ['title', 'company', 'location', 'job_url']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for job_detail in job_details:
        writer.writerow(job_detail)
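
To verify the export, you can read the file back with csv.DictReader and count the rows (a small, optional check):

# Read the file back and report how many job rows were written
with open('linkedin_jobs.csv', 'r', encoding='utf-8') as csvfile:
    rows = list(csv.DictReader(csvfile))
print(f'Saved {len(rows)} job listings to linkedin_jobs.csv')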

Complete code

Here is the complete code, combining all the sections above:


import requests
from lxml import html
import csv
import random

# Placeholder: replace with the LinkedIn job search URL you want to scrape
url = 'https link'

user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
]

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-IN,en;q=0.9',
    'dnt': '1',
    'sec-ch-ua': '"Not/A)Brand";v="8", "Chromium";v="126", "Google Chrome";v="126"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Linux"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
}

# Set a random User-Agent and a proxy with the IP authorization method
headers['user-agent'] = random.choice(user_agents)
proxies = {
    'http': 'IP:PORT',
    'https': 'IP:PORT'
}

job_details = []

# Send an HTTP GET request to the URL and parse the HTML content
response = requests.get(url=url, headers=headers, proxies=proxies)
parser = html.fromstring(response.content)

# Extract job details from the HTML content
for job in parser.xpath('//ul[@class="jobs-search__results-list"]/li'):
    title = ''.join(job.xpath('.//div/a/span/text()')).strip()
    company = ''.join(job.xpath('.//div/div[2]/h4/a/text()')).strip()
    location = ''.join(job.xpath('.//div/div[2]/div/span/text()')).strip()
    job_url = job.xpath('.//div/a/@href')[0]

    job_detail = {
        'title': title,
        'company': company,
        'location': location,
        'job_url': job_url
    }
    job_details.append(job_detail)

# Save the collected data to a CSV file
with open('linkedin_jobs.csv', 'w', newline='', encoding='utf-8') as csvfile:
    fieldnames = ['title', 'company', 'location', 'job_url']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for job_detail in job_details:
        writer.writerow(job_detail)

Extracting data from LinkedIn with Python, using the requests and lxml libraries, offers a powerful way to analyze the job market and support recruitment. To keep the scraping process smooth, use high-speed datacenter proxies or ISP proxies, which carry a higher trust factor and reduce the risk of automated activity being blocked.
