Step-by-step Python web scraping guide for beginners

Python stands out as a top choice for web scraping due to its robust libraries and straightforward syntax. In this article, we'll explore the fundamentals of web scraping and guide you through setting up your Python environment to create your first web scraper. We'll introduce you to key Python libraries suited for scraping tasks, including Beautiful Soup, Playwright, and lxml.

Python libraries for web scraping

Python provides several libraries to make web scraping easier. Here are some of the most commonly used ones:

  • requests: a simple and elegant HTTP library for Python, used to send HTTP requests to fetch web pages.
  • Beautiful Soup: great for parsing HTML and XML documents. It creates parse trees from page source code that make it easy to extract data.
  • lxml: known for its speed and efficiency, lxml is excellent for parsing XML and HTML documents.
  • Playwright: a robust tool for dynamic content scraping and interacting with web pages.

Introduction to HTTP requests

HTTP (HyperText Transfer Protocol) is an application layer protocol for transferring data across the web. When you type a URL into your browser, the browser generates an HTTP request and sends it to the web server. The server sends back an HTTP response, which the browser renders as an HTML page. For web scraping, you need to mimic this process and send HTTP requests from your script to fetch the HTML content of web pages programmatically.
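
For illustration, here is a minimal sketch of such a request made with the requests library; the User-Agent header value below is just an example of how a script can present itself like a browser:


import requests

# Send a GET request with a browser-like User-Agent header
# (the header value is only an illustrative example)
headers = {'User-Agent': 'Mozilla/5.0 (compatible; MyScraper/1.0)'}
response = requests.get('http://example.com', headers=headers)

# The raw HTML of the page is available as text
print(response.text[:200])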

Setting up your environment

First, ensure you have Python installed on your system. You can download it from Python's official website.

A virtual environment helps manage dependencies. Use these commands to create and activate a virtual environment (the activation command below is for macOS/Linux):


python -m venv scraping_env
source scraping_env/bin/activate
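
If you are on Windows, the activation command is slightly different:


scraping_env\Scripts\activate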

Next, install the required packages using the following commands:


pip install requests
pip install beautifulsoup4 
pip install lxml
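
To confirm the installation, you can run a quick sanity check that simply imports each library and prints its version:


# Quick sanity check: these imports should succeed without errors
import requests
import bs4
from lxml import etree

print('requests version:', requests.__version__)
print('beautifulsoup4 version:', bs4.__version__)
print('lxml version:', etree.__version__)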

Building a web scraper with Beautiful Soup

Let’s start with a simple web scraper that uses the requests library and Beautiful Soup to scrape static HTML content.

Making an HTTP GET request

The most common type of HTTP request is the GET request, which is used to retrieve data from a specified URL. Here is a basic example of how to perform a GET request to http://example.com.


import requests
url = 'http://example.com'
response = requests.get(url)

Handling HTTP responses

The requests library provides several ways to handle and process the response:

Check status code: ensure the request was successful.


if response.status_code == 200:
    print('Request was successful!')
else:
    print('Request failed with status code:', response.status_code)

Extracting content: extract the text or JSON content from the response.


# Get response content as text
page_content = response.text
print(page_content)

# Get response content as JSON (if the response is in JSON format)
json_content = response.json()
print(json_content)

Handling HTTP and network errors

HTTP and network errors may occur when a resource is not reachable, a request times out, or the server returns an error status code (e.g. 404 Not Found, 500 Internal Server Error). You can catch the exceptions raised by requests to handle these situations.


import requests

url = 'http://example.com'

try:
    response = requests.get(url, timeout=10)  # Set a timeout for the request
    response.raise_for_status()  # Raises an HTTPError for bad responses
except requests.exceptions.HTTPError as http_err:
    print(f'HTTP error occurred: {http_err}')
except requests.exceptions.ConnectionError:
    print('Failed to connect to the server.')
except requests.exceptions.Timeout:
    print('The request timed out.')
except requests.exceptions.RequestException as req_err:
    print(f'Request error: {req_err}')
else:
    print('Request was successful!')

Extracting data from HTML elements

For web scraping, we often need to extract data from the HTML content. This section covers how to locate and extract data from HTML elements using libraries such as Beautiful Soup and lxml.

HTML (HyperText Markup Language) is the standard markup language for creating web pages. It consists of nested elements represented by tags, such as <div>, <p>, <a>, etc. Each tag can have attributes and contain text, other tags, or both.

XPath and CSS selectors provide a versatile way to select HTML elements based on their attributes or their position in the document.

Finding XPath and CSS selectors

When web scraping, extracting specific data from web pages often requires identifying the correct XPath or CSS selectors to target HTML elements. Here’s how you can find these selectors efficiently:

Most modern web browsers come with built-in developer tools that allow you to inspect the HTML structure of web pages. Here’s a step-by-step guide on how to use these tools:

  1. Open developer tools:
    • In Chrome: Right-click on the page and select "Inspect" or press Ctrl+Shift+I (Windows/Linux) or Cmd+Opt+I (Mac).
    • In Firefox: Right-click on the page and select "Inspect Element" or press Ctrl+Shift+I (Windows/Linux) or Cmd+Opt+I (Mac).
  2. Inspect the element:
    • Use the inspect tool (a cursor icon) to hover over and click the element you want to scrape. This will highlight the element in the HTML structure view.
  3. Copy XPath or CSS selector:
    • Right-click on the highlighted HTML element in the developer tools pane.
    • Select "Copy" and then choose either "Copy XPath" or "Copy selector" (CSS selector).

For example, copying the selectors for a page's main heading might give:

XPath: /html/body/div/h1

CSS Selector: body > div > h1
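
As a quick check, the short sketch below parses a small inline HTML document (made up for illustration) and shows that the copied XPath and CSS selector both point at the same <h1> element:


from bs4 import BeautifulSoup
from lxml.html import fromstring

# A made-up HTML document matching the selectors shown above
html = '<html><body><div><h1>Example Domain</h1></div></body></html>'

# CSS selector with Beautiful Soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.select_one('body > div > h1').text)    # Example Domain

# XPath with lxml
tree = fromstring(html)
print(tree.xpath('/html/body/div/h1/text()')[0])  # Example Domain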

Extraction using Beautiful Soup

Beautiful Soup is a Python library for parsing HTML and XML documents. It provides simple methods and attributes to navigate and search through the HTML structure.


from bs4 import BeautifulSoup
import requests

# URL of the webpage to scrape
url = 'https://example.com'

# Send an HTTP GET request to the URL
response = requests.get(url)

# Parse the HTML content of the response using Beautiful Soup
soup = BeautifulSoup(response.content, 'html.parser')

# Use the CSS selector to find all <h1> tags that are within <div> tags
# that are direct children of the <body> tag
h1_tags = soup.select('body > div > h1')

# Iterate over the list of found <h1> tags and print their text content
for tag in h1_tags:
    print(tag.text)
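
CSS selectors are not the only option: Beautiful Soup also provides find() and find_all() for locating elements by tag name and attributes. Here is a brief sketch; the class name 'intro' is only an illustrative assumption:


from bs4 import BeautifulSoup
import requests

response = requests.get('https://example.com')
soup = BeautifulSoup(response.content, 'html.parser')

# find() returns the first matching element (or None), find_all() returns a list
first_h1 = soup.find('h1')
if first_h1 is not None:
    print(first_h1.get_text())

# Elements can also be matched by attributes, such as a class name
# (the class name 'intro' is an assumption for illustration)
for paragraph in soup.find_all('p', class_='intro'):
    print(paragraph.get_text())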

Handling parsing errors

Parsing errors occur when the HTML or XML structure is not as expected, causing issues in data extraction. These can be managed by handling exceptions like AttributeError.


from bs4 import BeautifulSoup
import requests

# URL of the webpage to scrape
url = 'https://example.com'

# Send an HTTP GET request to the URL
response = requests.get(url)

try:
    # Parse the HTML content of the response using Beautiful Soup
    soup = BeautifulSoup(response.content, 'html.parser')

    # Use the CSS selector to find all <h1> tags that are within <div> tags
    # that are direct children of the <body> tag
    h1_tags = soup.select('body > div > h1')

    # Iterate over the list of found <h1> tags and print their text content
    for tag in h1_tags:
        print(tag.text)
except AttributeError as attr_err:
    # Handle cases where an AttributeError might occur (e.g., if the response.content is None)
    print(f'Attribute error occurred: {attr_err}')
except Exception as parse_err:
    # Handle any other exceptions that might occur during the parsing
    print(f'Error while parsing HTML: {parse_err}')

Extraction using lxml

In addition to Beautiful Soup, another popular library for parsing HTML and XML documents in Python is lxml. While Beautiful Soup focuses on providing a convenient interface for navigating and manipulating parsed data, lxml is known for its speed and flexibility, making it a preferred choice for performance-critical tasks.


from lxml.html import fromstring
import requests

# URL of the webpage to scrape
url = 'https://example.com'

# Send an HTTP GET request to the URL
response = requests.get(url)

# Parse the HTML content of the response using lxml's fromstring method
parser = fromstring(response.text)

# Use XPath to find the text content of the first <h1> tag
# that is within a <div> tag, which is a direct child of the <body> tag
title = parser.xpath('/html/body/div/h1/text()')[0]

# Print the title
print(title)
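
An XPath query can also return several matches at once. The sketch below collects the text of every <p> tag on the page instead of a single title:


from lxml.html import fromstring
import requests

response = requests.get('https://example.com')
parser = fromstring(response.text)

# xpath() returns a list, so iterate over it to handle any number of matches
for text in parser.xpath('//p/text()'):
    print(text.strip())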

Handling parsing errors

Similar to Beautiful Soup, lxml allows you to handle parsing errors gracefully by catching exceptions like lxml.etree.XMLSyntaxError.


from lxml.html import fromstring
from lxml import etree
import requests

# URL of the webpage to scrape
url = 'https://example.com'

# Send an HTTP GET request to the URL
response = requests.get(url)

try:
    # Parse the HTML content of the response using lxml's fromstring method
    parser = fromstring(response.text)

    # Use XPath to find the text content of the first <h1> tag
    # that is within a <div> tag, which is a direct child of the <body> tag
    title = parser.xpath('/html/body/div/h1/text()')[0]

    # Print the title
    print(title)
except IndexError:
    # Handle the case where the XPath query does not return any results
    print('No <h1> tag found in the specified location.')
except etree.XMLSyntaxError as parse_err:
    # Handle XML syntax errors during parsing
    print(f'Error while parsing HTML: {parse_err}')
except Exception as e:
    # Handle any other exceptions
    print(f'An unexpected error occurred: {e}')

Saving extracted data

Once you have successfully extracted data from HTML elements, the next step is to save this data. Python provides several options for saving scraped data, including saving to CSV files, JSON files, and databases. Here’s an overview of how to save extracted data using different formats:

Saving data to a CSV file

CSV (Comma-Separated Values) is a simple and widely used format for storing tabular data. Python's built-in csv module makes it easy to write data to CSV files.


import csv

# Sample data
data = {
    'title': 'Example Title',
    'paragraphs': ['Paragraph 1', 'Paragraph 2', 'Paragraph 3']
}

# Save data to a CSV file
with open('scraped_data.csv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Title', 'Paragraph'])
    for paragraph in data['paragraphs']:
        writer.writerow([data['title'], paragraph])

print('Data saved to scraped_data.csv')

Saving data to a JSON file

JSON (JavaScript Object Notation) is a lightweight data-interchange format that is easy to read and write. Python's built-in json module provides methods to save data in JSON format.


import json

# Sample data
data = {
    'title': 'Example Title',
    'paragraphs': ['Paragraph 1', 'Paragraph 2', 'Paragraph 3']
}

# Save data to a JSON file
with open('scraped_data.json', mode='w', encoding='utf-8') as file:
    json.dump(data, file, ensure_ascii=False, indent=4)

print('Data saved to scraped_data.json')
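
Scraped data can also go straight into a database. The following is a minimal sketch using Python's built-in sqlite3 module; the database file name and table schema are assumptions chosen for illustration:


import sqlite3

# Sample data
data = {
    'title': 'Example Title',
    'paragraphs': ['Paragraph 1', 'Paragraph 2', 'Paragraph 3']
}

# Connect to (or create) a local SQLite database file
connection = sqlite3.connect('scraped_data.db')
cursor = connection.cursor()

# Create a simple table for the scraped data (schema chosen for illustration)
cursor.execute('CREATE TABLE IF NOT EXISTS pages (title TEXT, paragraph TEXT)')

# Insert one row per paragraph
for paragraph in data['paragraphs']:
    cursor.execute(
        'INSERT INTO pages (title, paragraph) VALUES (?, ?)',
        (data['title'], paragraph),
    )

connection.commit()
connection.close()

print('Data saved to scraped_data.db')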

Advanced web scraping techniques with Playwright

Playwright is a powerful tool for scraping dynamic content and interacting with web elements. It can handle JavaScript-heavy websites that static HTML parsers cannot.

Install Playwright and set it up:


pip install playwright
playwright install

Scraping dynamic content

Playwright allows you to interact with web elements like filling out forms and clicking buttons. It can wait for AJAX requests to complete before proceeding, making it ideal for scraping dynamic content.
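
Before walking through a full example, here is a minimal sketch of the basic Playwright workflow: launch a browser, open a page, wait for it to finish loading, and read the rendered HTML. The URL is just a placeholder:


from playwright.sync_api import sync_playwright

with sync_playwright() as playwright:
    # Launch a headless Chromium browser
    browser = playwright.chromium.launch(headless=True)
    page = browser.new_page()

    # Navigate to the page and wait for it to finish loading
    page.goto('https://example.com')
    page.wait_for_load_state('load')

    # content() returns the rendered HTML, including content added by JavaScript
    html_content = page.content()
    print(html_content[:200])

    browser.close()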


The provided code performs web scraping on an Amazon product page using Playwright and lxml. Initially, the necessary modules are imported. A run function is defined to encapsulate the scraping logic. The function begins by setting up a proxy server and launching a new browser instance with the proxy and in non-headless mode, allowing us to observe the browser actions. Within the browser context, a new page is opened and navigated to the specified Amazon product URL, with a timeout of 60 seconds to ensure the page fully loads.

The script then interacts with the page to select a specific product style from a dropdown menu and a product option, using locators and XPath selectors. After ensuring these interactions are complete and the page has fully loaded again, the HTML content of the page is captured.

The HTML content is then parsed using lxml's fromstring method to create an element tree. An XPath query is used to extract the text content of the product title from the element with the ID productTitle. The script includes error handling to manage cases where the XPath query does not return results, where there are XML syntax errors during parsing, or any other unexpected exceptions. Finally, the extracted product title is printed, and the browser context and browser are closed to end the session.

The run function is executed within a Playwright session started by sync_playwright, ensuring that the entire process is managed and executed within a controlled environment. This structure ensures robustness and error resilience while performing the web scraping task.


from playwright.sync_api import Playwright, sync_playwright
from lxml.html import fromstring
from lxml import etree


def run(playwright: Playwright) -> None:
    # Define the proxy server
    proxy = {"server": "https://IP:PORT", "username": "LOGIN", "password": "PASSWORD"}

    # Launch a new browser instance with the specified proxy and in non-headless mode
    browser = playwright.chromium.launch(
        headless=False,
        proxy=proxy,
        slow_mo=50,
        args=['--ignore-certificate-errors'],
    )

    # Create a new browser context
    context = browser.new_context(ignore_https_errors=True)

    # Open a new page in the browser context
    page = context.new_page()

    # Navigate to the specified Amazon product page with a 60-second timeout
    page.goto(
        "https://www.amazon.com/A315-24P-R7VH-Display-Quad-Core-Processor-Graphics/dp/B0BS4BP8FB/",
        timeout=60000,
    )

    # Wait for the page to fully load
    page.wait_for_load_state("load")

    # Select a specific product style from the dropdown menu
    page.locator("#dropdown_selected_style_name").click()

    # Select a specific product option
    page.click('//*[@id="native_dropdown_selected_style_name_1"]')
    page.wait_for_load_state("load")

    # Get the HTML content of the loaded page
    html_content = page.content()

    try:
        # Parse the HTML content using lxml's fromstring method
        parser = fromstring(html_content)

        # Use XPath to extract the text content of the product title
        product_title = parser.xpath('//span[@id="productTitle"]/text()')[0].strip()

        # Print the extracted product title
        print({"Product Title": product_title})
    except IndexError:
        # Handle the case where the XPath query does not return any results
        print('Product title not found in the specified location.')
    except etree.XMLSyntaxError as parse_err:
        # Handle XML syntax errors during parsing
        print(f'Error while parsing HTML: {parse_err}')
    except Exception as e:
        # Handle any other exceptions
        print(f'An unexpected error occurred: {e}')

    # Close the browser context and the browser
    context.close()
    browser.close()


# Use sync_playwright to start the Playwright session and run the script
with sync_playwright() as playwright:
    run(playwright)

Web scraping with Python is a powerful method for harvesting data from websites. The tools discussed facilitate the extraction, processing, and storage of web data for various purposes. In this process, rotating IP addresses through proxy servers and adding delays between requests are crucial for avoiding blocks. Beautiful Soup is user-friendly for beginners, while lxml is well suited to large datasets thanks to its efficiency. For more advanced scraping needs, especially on websites that load content dynamically with JavaScript, Playwright proves to be highly effective.
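
To illustrate those last two practices, here is a minimal sketch that routes requests through a proxy and pauses between them; the proxy address and URLs are placeholders to replace with real values:


import time
import requests

# Placeholder proxy endpoint; replace with a real proxy address and credentials
proxies = {
    'http': 'http://USER:PASSWORD@IP:PORT',
    'https': 'http://USER:PASSWORD@IP:PORT',
}

# Placeholder URLs to scrape
urls = ['https://example.com/page1', 'https://example.com/page2']

for url in urls:
    response = requests.get(url, proxies=proxies, timeout=10)
    print(url, response.status_code)

    # Pause between requests to reduce server load and the risk of being blocked
    time.sleep(2)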
