Guide to Scraping Craigslist with Python

Craigslist remains a significant platform for classified advertisements in today's digital landscape, and Python makes it straightforward to extract details from its listings. Python's flexibility and strong ecosystem of libraries, such as Requests and BeautifulSoup, enable productive web scraping. This guide walks through scraping Craigslist with Python, covering content extraction with BeautifulSoup and Requests, alongside proxy rotation to navigate anti-bot defenses effectively.

Basic steps to scrape Craigslist with Python

Next, we'll go through the scraping process step by step, starting with sending HTTP requests and extracting specific page elements, and finishing with saving the data in the required format.

Setting up your environment

You'll need to install the necessary libraries:


pip install beautifulsoup4
pip install requests
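
Optionally, you can isolate these dependencies in a virtual environment before installing (the commands below assume a Unix-like shell; on Windows, activate with venv\Scripts\activate):


python -m venv venv
source venv/bin/activate
pip install beautifulsoup4 requests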

Sending HTTP requests to Craigslist pages

Use the requests library to send HTTP GET requests to Craigslist listing pages.


import requests

# List of Craigslist URLs to scrape
urls = [
    "link",
    "link"
]

for url in urls:
    # Send a GET request to the URL
    response = requests.get(url)
    
    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Extract HTML content from the response
        html_content = response.text
        
    else:
        # If the request failed, print an error message with the status code
        print(f"Failed to retrieve {url}. Status code: {response.status_code}")

Parsing HTML content with BeautifulSoup

Use BeautifulSoup for HTML parsing and navigating through the retrieved content.


from bs4 import BeautifulSoup

# Iterate through each URL in the list
for url in urls:
    # Send a GET request to the URL
    response = requests.get(url)
    
    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Extract HTML content from the response
        html_content = response.text
        
        # Parse the HTML content using BeautifulSoup
        soup = BeautifulSoup(html_content, 'html.parser')
        
    else:
        # If the request failed, print an error message with the status code
        print(f"Failed to retrieve {url}. Status code: {response.status_code}")

Extracting data using BeautifulSoup methods

Extract data such as item titles and prices from Craigslist listings using BeautifulSoup methods.


from bs4 import BeautifulSoup

# Iterate through each URL in the list
for url in urls:
    # Send a GET request to the URL
    response = requests.get(url)
    
    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Extract HTML content from the response
        html_content = response.text
        
        # Parse the HTML content using BeautifulSoup
        soup = BeautifulSoup(html_content, 'html.parser')
        
        # Extracting specific data points
        # Find the title of the listing
        title = soup.find('span', id='titletextonly').text.strip()
        
        # Find the price of the listing
        price = soup.find('span', class_='price').text.strip()
        
        # Find the description of the listing (may contain multiple paragraphs)
        description = soup.find('section', id='postingbody').get_text(strip=True, separator='\n')
        
        # Print extracted data (for demonstration purposes)
        print(f"Title: {title}")
        print(f"Price: {price}")
        print(f"Description: {description}")
        
    else:
        # If the request fails, print an error message with the status code
        print(f"Failed to retrieve {url}. Status code: {response.status_code}")

(Screenshots: the extracted title, price, and description as printed by the script.)
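
Note that find() returns None when an element is missing from the page, so the attribute access in the snippet above will raise an AttributeError for listings without a title, price, or description. A more defensive version of the extraction step might look like this:


# Defensive extraction: fall back to a placeholder when an element is missing
title_tag = soup.find('span', id='titletextonly')
title = title_tag.text.strip() if title_tag else 'N/A'

price_tag = soup.find('span', class_='price')
price = price_tag.text.strip() if price_tag else 'N/A'

body_tag = soup.find('section', id='postingbody')
description = body_tag.get_text(strip=True, separator='\n') if body_tag else 'N/A'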

Saving scraped data to a CSV file

Once the data is extracted, save it to a CSV file for further analysis or integration with other tools. The snippet below assumes the extracted values have been collected into a list of dictionaries named scraped_data, as is done in the full script at the end of this guide.


import csv

# Define the CSV file path and field names
csv_file = 'craigslist_data.csv'
fieldnames = ['Title', 'Price', 'Description']

# Writing data to CSV file
try:
    # Open the CSV file in write mode with UTF-8 encoding
    with open(csv_file, mode='w', newline='', encoding='utf-8') as file:
        # Create a CSV DictWriter object with the specified fieldnames
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        
        # Write the header row in the CSV file
        writer.writeheader()
        
        # Iterate through each item in the scraped_data list
        for item in scraped_data:
            # Write each item as a row in the CSV file
            writer.writerow(item)
        
    # Print a success message after writing data to the CSV file
    print(f"Data saved to {csv_file}")

except IOError:
    # Print an error message if an IOError occurs while writing to the CSV file
    print(f"Error occurred while writing data to {csv_file}")

Handling potential roadblocks

Craigslist may implement measures to prevent scraping, such as IP blocking or CAPTCHA challenges. To mitigate these issues, consider using proxies and rotating user agents.

Using proxies:

This example demonstrates the use of a proxy with IP address authorization.


proxies = {
    'http': 'http://your_proxy_ip:your_proxy_port',
    'https': 'https://your_proxy_ip:your_proxy_port'
}

response = requests.get(url, proxies=proxies)
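
If your proxy uses username/password authentication rather than IP address authorization, requests accepts credentials embedded in the proxy URL (the placeholders below are illustrative):


proxies = {
    'http': 'http://your_username:your_password@your_proxy_ip:your_proxy_port',
    'https': 'https://your_username:your_password@your_proxy_ip:your_proxy_port'
}

response = requests.get(url, proxies=proxies)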

User-Agent rotation:


import random

user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
    # Add more user agents as needed
]

headers = {
    'User-Agent': random.choice(user_agents)
}

response = requests.get(url, headers=headers)
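
For repeated requests, you can attach the headers and proxy to a requests.Session and add automatic retries; a possible setup (the retry counts and status codes are example values):


import random

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()

# Retry failed requests up to 3 times, backing off between attempts
retries = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503])
session.mount('http://', HTTPAdapter(max_retries=retries))
session.mount('https://', HTTPAdapter(max_retries=retries))

# Apply a random user agent and a proxy to every request made through the session
session.headers.update({'User-Agent': random.choice(user_agents)})
session.proxies.update({
    'http': 'http://your_proxy_ip:your_proxy_port',
    'https': 'https://your_proxy_ip:your_proxy_port'
})

response = session.get(url, timeout=30)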

Full code

This complete Python script shows how to combine the pieces above into an efficient Craigslist scraper that retrieves, parses, and extracts data from multiple URLs.


import requests
import urllib3
from bs4 import BeautifulSoup
import csv
import random
import ssl

# Relax default SSL certificate handling and silence the related warnings,
# since the requests below are sent with verify=False
ssl._create_default_https_context = ssl._create_stdlib_context
urllib3.disable_warnings()


# List of Craigslist URLs to scrape
urls = [
    "link",
    "link"
]

# User agents and proxies
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
]

proxies = [
    {'http': 'http://your_proxy_ip1:your_proxy_port1', 'https': 'https://your_proxy_ip1:your_proxy_port1'},
    {'http': 'http://your_proxy_ip2:your_proxy_port2', 'https': 'https://your_proxy_ip2:your_proxy_port2'},
]

# List to store scraped data
scraped_data = []

# Loop through each URL in the list
for url in urls:
    # Rotate user agent for each request to avoid detection
    headers = {
        'User-Agent': random.choice(user_agents)
    }

    # Use a different proxy for each request to avoid IP blocking
    proxy = random.choice(proxies)

    try:
        # Send GET request to the Craigslist URL with headers and proxy
        response = requests.get(url, headers=headers, proxies=proxy, timeout=30, verify=False)
        
        # Check if the request was successful (status code 200)
        if response.status_code == 200:
            # Parse HTML content of the response
            html_content = response.text
            soup = BeautifulSoup(html_content, 'html.parser')

            # Extract data from the parsed HTML
            title = soup.find('span', id='titletextonly').text.strip()
            price = soup.find('span', class_='price').text.strip()
            description = soup.find('section', id='postingbody').get_text(strip=True, separator='\n')  # Extracting description

            # Append scraped data as a dictionary to the list
            scraped_data.append({'Title': title, 'Price': price, 'Description': description})
            print(f"Data scraped for {url}")
        else:
            # Print error message if request fails
            print(f"Failed to retrieve {url}. Status code: {response.status_code}")
    except Exception as e:
        # Print exception message if an error occurs during scraping
        print(f"Exception occurred while scraping {url}: {str(e)}")

# CSV file setup for storing scraped data
csv_file = 'craigslist_data.csv'
fieldnames = ['Title', 'Price', 'Description']

# Writing scraped data to CSV file
try:
    with open(csv_file, mode='w', newline='', encoding='utf-8') as file:
        writer = csv.DictWriter(file, fieldnames=fieldnames)

        # Write header row in the CSV file
        writer.writeheader()

        # Iterate through scraped_data list and write each item to the CSV file
        for item in scraped_data:
            writer.writerow(item)

    # Print success message if data is saved successfully
    print(f"Data saved to {csv_file}")
except IOError:
    # Print error message if there is an IOError while writing to the CSV file
    print(f"Error occurred while writing data to {csv_file}")

Craigslist matters because its classified ads are a rich source of information for market research, lead generation, and more. Python makes scraping Craigslist straightforward with libraries such as BeautifulSoup and Requests. The key tactics covered in this tutorial are rotating user agents and proxies to get around anti-bot measures. By leveraging Python responsibly, you can extract actionable insights from Craigslist listings, supporting informed decision-making across various domains.
