How to parse XML in Python


XML (eXtensible Markup Language) is a markup language for encoding documents in a format that is both human-readable and machine-readable. It is widely used for exchanging data, writing configuration files, and building web services.

In an XML document, each element is enclosed in a start tag and an end tag, which gives the data a clear, nested structure.

Example of XML:

<data>
    <item>
        <name>Item 1</name>
        <price>10</price>
    </item>
    <item>
        <name>Item 2</name>
        <price>20</price>
    </item>
</data>

This article explores different libraries and methods available for parsing XML in Python.

Python libraries for XML parsing

The sections below cover the main libraries for extracting XML data in Python. xml.etree.ElementTree and xml.dom.minidom ship with the standard library and need no installation. BeautifulSoup and lxml are third-party packages that can be installed with pip install beautifulsoup4 lxml.

xml.etree.ElementTree

xml.etree.ElementTree is a standard library module for parsing and creating XML data. It provides an efficient and straightforward API for parsing XML from strings and files and for creating XML documents.

Basic example:


import xml.etree.ElementTree as ET

xml_data = """
<data>
    <item>
        <name>Item 1</name>
        <price>10</price>
    </item>
    <item>
        <name>Item 2</name>
        <price>20</price>
    </item>
</data>
"""

root = ET.fromstring(xml_data)

for item in root.findall('item'):
    name = item.find('name').text
    price = item.find('price').text
    print(f'Name: {name}, Price: {price}')

Output:

Name: Item 1, Price: 10
Name: Item 2, Price: 20
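The module description above also mentions files and document creation. The following is a minimal sketch of both, assuming a hypothetical file name items.xml written to a temporary directory and a made-up "Item 3" entry:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Build a small document from scratch
root = ET.Element('data')
item = ET.SubElement(root, 'item')
ET.SubElement(item, 'name').text = 'Item 3'
ET.SubElement(item, 'price').text = '30'

# Write it to a file (items.xml is a hypothetical name used for illustration)
path = os.path.join(tempfile.mkdtemp(), 'items.xml')
ET.ElementTree(root).write(path, encoding='utf-8', xml_declaration=True)

# ET.parse returns an ElementTree; getroot() gives its root element
parsed_root = ET.parse(path).getroot()
names = [entry.findtext('name') for entry in parsed_root.findall('item')]
prices = [int(entry.findtext('price')) for entry in parsed_root.findall('item')]
print(names, prices)
```

Element text is always a string, so numeric values such as the price need an explicit conversion.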

xml.dom.minidom

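xml.dom.minidom is another built-in Python library. It provides a DOM (Document Object Model) representation of XML, in which elements, attributes, and text are all separate nodes. The API is more verbose than ElementTree, but it gives node-level control over the document.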

Basic example:

from xml.dom.minidom import parseString

xml_data = """
<data>
    <item>
        <name>Item 1</name>
        <price>10</price>
    </item>
    <item>
        <name>Item 2</name>
        <price>20</price>
    </item>
</data>
"""

dom = parseString(xml_data)
items = dom.getElementsByTagName('item')

for item in items:
    name = item.getElementsByTagName('name')[0].firstChild.data
    price = item.getElementsByTagName('price')[0].firstChild.data
    print(f'Name: {name}, Price: {price}')

Output:

Name: Item 1, Price: 10
Name: Item 2, Price: 20
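Beyond reading values, the DOM API can also modify a document. Here is a small sketch of that, where the currency attribute and the stock element are invented for illustration:

```python
from xml.dom.minidom import parseString

dom = parseString('<data><item><name>Item 1</name><price>10</price></item></data>')

# Add an attribute to the existing <item> element
# ('currency' is a made-up attribute for this example)
item = dom.getElementsByTagName('item')[0]
item.setAttribute('currency', 'USD')

# Create a new <stock> element holding a text node and append it to <item>
stock = dom.createElement('stock')
stock.appendChild(dom.createTextNode('5'))
item.appendChild(stock)

# Serialize the root element back to a string
result = dom.documentElement.toxml()
print(result)
```

Because text is a node of its own in the DOM, it has to be created with createTextNode and attached explicitly.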

BeautifulSoup XML parsing

BeautifulSoup is a popular library for parsing HTML and XML documents. It is particularly useful for scraping web data and for handling poorly formed markup. Its 'xml' parser, used below, relies on lxml being installed.

Basic example:


from bs4 import BeautifulSoup

xml_data = """
<data>
    <item>
        <name>Item 1</name>
        <price>10</price>
    </item>
    <item>
        <name>Item 2</name>
        <price>20</price>
    </item>
</data>
"""

soup = BeautifulSoup(xml_data, 'xml')
items = soup.find_all('item')

for item in items:
    name = item.find('name').text
    price = item.find('price').text
    print(f'Name: {name}, Price: {price}')

Output:

Name: Item 1, Price: 10
Name: Item 2, Price: 20

lxml library

lxml is a third-party library that combines an ElementTree-style API with the speed and features of the underlying libxml2 C library, including XPath queries. It supports both XML and HTML parsing.

Basic example:

from lxml import etree

xml_data = """
<data>
    <item>
        <name>Item 1</name>
        <price>10</price>
    </item>
    <item>
        <name>Item 2</name>
        <price>20</price>
    </item>
</data>
"""

root = etree.fromstring(xml_data)
items = root.findall('item')

for item in items:
    name = item.find('name').text
    price = item.find('price').text
    print(f'Name: {name}, Price: {price}')

Output:

Name: Item 1, Price: 10
Name: Item 2, Price: 20
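One feature lxml adds over the standard library is XPath support through the xpath() method. The sketch below runs two queries on the same sample data. The price threshold of 15 is an arbitrary value chosen for illustration.

```python
from lxml import etree

xml_data = """
<data>
    <item>
        <name>Item 1</name>
        <price>10</price>
    </item>
    <item>
        <name>Item 2</name>
        <price>20</price>
    </item>
</data>
"""

root = etree.fromstring(xml_data)

# Names of all items whose price is greater than 15
expensive = root.xpath('//item[price > 15]/name/text()')

# XPath can also compute values; sum() returns a float
total = root.xpath('sum(//item/price)')

print(expensive, total)
```

XPath compares and sums the price text as numbers, so no manual conversion is needed inside the query.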

XML to dictionary conversion

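Converting XML to a dictionary makes the data easier to process with ordinary Python code. Because sibling elements often share a tag, as the two item elements do here, repeated tags must be collected into a list. Otherwise each one overwrites the previous entry under the same key.

import xml.etree.ElementTree as ET

def xml_to_dict(element):
    # A leaf element (no children) is represented by its text
    if len(element) == 0:
        return element.text
    result = {}
    for child in element:
        value = xml_to_dict(child)
        if child.tag in result:
            # Collect repeated tags into a list so none are overwritten
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result

xml_data = """
<data>
    <item>
        <name>Item 1</name>
        <price>10</price>
    </item>
    <item>
        <name>Item 2</name>
        <price>20</price>
    </item>
</data>
"""

root = ET.fromstring(xml_data)
data_dict = xml_to_dict(root)
print(data_dict)

Output:

{'item': [{'name': 'Item 1', 'price': '10'}, {'name': 'Item 2', 'price': '20'}]}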

XML to CSV conversion

Converting XML to CSV makes the data easier to analyze and store, and lets you open it directly in spreadsheet applications.

import csv
import xml.etree.ElementTree as ET

xml_data = """
<data>
    <item>
        <name>Item 1</name>
        <price>10</price>
    </item>
    <item>
        <name>Item 2</name>
        <price>20</price>
    </item>
</data>
"""

root = ET.fromstring(xml_data)

with open('output.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['Name', 'Price'])
    
    for item in root.findall('item'):
        name = item.find('name').text
        price = item.find('price').text
        writer.writerow([name, price])
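The example above assumes every item has both a name and a price. When that is not guaranteed, one possible variation uses csv.DictWriter, which fills in missing columns. In this sketch the sample data deliberately omits one price, and the output goes to an in-memory buffer instead of a file:

```python
import csv
import io
import xml.etree.ElementTree as ET

xml_data = """<data>
    <item><name>Item 1</name><price>10</price></item>
    <item><name>Item 2</name></item>
</data>"""

root = ET.fromstring(xml_data)

# One dict per <item>, keyed by child tag
rows = [{child.tag: child.text for child in item} for item in root.findall('item')]

# Collect column names in first-seen order
fieldnames = []
for row in rows:
    for key in row:
        if key not in fieldnames:
            fieldnames.append(key)

# restval is written for any column a row does not have
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval='')
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, replace the buffer with open('output.csv', 'w', newline='') as in the example above.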

Handling XML errors

When parsing XML, handling errors is crucial to ensure that your code can manage unexpected or malformed data gracefully.

import xml.etree.ElementTree as ET

xml_data = """
<data>
    <item>
        <name>Item 1</name>
        <price>10</price>
    </item>
    <item>
        <name>Item 2</name>
        <price>20</price>
    </item>
</data>
"""

try:
    root = ET.fromstring(xml_data)
except ET.ParseError as e:
    print(f'Error parsing XML: {e}')
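The sample above is well-formed, so the except branch never runs. The sketch below shows the two failures that are most common in practice: malformed markup and a missing child element. The parse_items helper and its default values are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET

def parse_items(xml_text):
    """Return (name, price) pairs, or an empty list if the XML cannot be parsed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as e:
        print(f'Error parsing XML: {e}')
        return []
    # findtext returns the default instead of failing when a tag is missing,
    # whereas item.find('price').text would raise AttributeError
    return [(item.findtext('name', default='unknown'),
             item.findtext('price', default='0'))
            for item in root.findall('item')]

broken = '<data><item><name>Item 1</name></data>'          # <item> is never closed
partial = '<data><item><name>Item 1</name></item></data>'  # <price> is missing

broken_result = parse_items(broken)
partial_result = parse_items(partial)
print(broken_result)
print(partial_result)
```

Whether to return an empty list, substitute defaults, or re-raise depends on the application. The point is that both failure modes are handled explicitly.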

Parsing XML from a URL

Here's a practical example using BeautifulSoup to parse XML data from a URL:

import requests  # Importing the requests library to make HTTP requests
from bs4 import BeautifulSoup  # Importing BeautifulSoup from the bs4 library for parsing XML

# Define the URL of the XML data
url = "https://httpbin.org/xml"

# Send a GET request to the URL
response = requests.get(url)

# Parse the XML content of the response using BeautifulSoup
soup = BeautifulSoup(response.content, 'xml')

# Loop through all 'slide' elements in the XML
for slide in soup.find_all('slide'):
    # Find the 'title' element within each 'slide' and get its text
    title = slide.find('title').text
    # Print the title text
    print(f'Title: {title}')

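Output (for the sample slideshow document that httpbin.org serves at this URL):

Title: Wake up to WonderWidgets!
Title: Overview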
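The same extraction can be done with the standard library alone. This sketch keeps the parsing step separate from the download so that it can be run offline. The extract_titles function and the inline sample document are assumptions made for illustration, not part of the original example.

```python
import xml.etree.ElementTree as ET

def extract_titles(xml_bytes):
    """Return the text of the <title> inside every <slide> element."""
    root = ET.fromstring(xml_bytes)
    # iter() walks the whole tree, so slides are found at any depth
    return [slide.findtext('title', default='') for slide in root.iter('slide')]

# Inline stand-in for a downloaded document
sample = b"""<?xml version='1.0' encoding='us-ascii'?>
<slideshow>
    <slide><title>First</title></slide>
    <slide><title>Second</title></slide>
</slideshow>"""

titles = extract_titles(sample)
print(titles)
```

With requests, the function would receive response.content, ideally after calling response.raise_for_status() so that HTTP errors are not mistaken for XML.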

XML is a fundamental data format used in applications ranging from web services to configuration files. Python offers several robust libraries for parsing and manipulating XML, whether you need simple data extraction or more elaborate document processing. Working with XML is a valuable skill for Python developers, both for data interchange and for web scraping.