Tag Archive: lxml

Nov 24 2015

Parsing XML and HTML using xpath and lxml in Python

For the last few years my life has been full of the processing of HTML and XML using the lxml library for Python and the xpath query language. xpath is a query language designed specifically to search XML, unlike regular expressions which should definitely not be used to process XML related languages. Typically this has involved a lot of searching my own code to remind me how to do stuff. This blog post captures some handy snippets to avoid the inevitable Googling, and solidify for me exactly what I’ve been doing for the last few years!

But what does it {xml, html} look like?

xml and html are made up of “elements”, delimited by pointy brackets and attributes which are equal to things:

<element1 attribute1=”thing”>content</element1>

Elements can be nested inside other elements to make a tree structure. A wrinkle to be aware of is the so-called “tail” of an element. This is most often seen with <br/> tags (I think it is general):

<element1 attribute1=”thing”>content</br>tail</element1>

The “content” is accessed using text(), whilst the tail is accessed using .tail.

Web pages are made from HTML which is a “relaxed” XML format. XML is the basis of many other file formats found in the wild (such as GPX and GML). Dealing with XML is very similar to dealing with HTML except for namespaces, which I discuss in more detail at the end of this post.

XPath Helper

Before I get onto xpath I should introduce xpath helper – which is a plugin for Google Chrome which helps you develop xpath queries.

You can find XPath Helper in the Chrome Store, it is free. I use it in combination with the Google Chrome Developer tools, particular the “Inspect Element” functionality. XPath helper allows you to see the results of an xpath query live. You open up the XPath console (Ctrl+shift+x), type in your xpath and you see the results in both in the xpath helper console, and also as highlighting on the page.

You can get automatically generated xpath queries, however typically I have used these just as inspiration since they tend to be rather long and “brittle”.

Loading up the data

My Python scripts nearly always start with the following imports:

import lxml.html
import requests
import requests_cache
requests_cache.install_cache('demo_cache')

requests and requests_cache to access data on the web and lxml.html to parse the HTML. Then I can get a webpage using:

r = requests.get(url)
root = lxml.html.fromstring(r.content)

You might want to make any URLs absolute rather than relative:

root.make_links_absolute(base_url)

If I’m dealing with XML rather than HTML then I might do:

from lxml import etree

And then when it came to loading in a local XML file:

with open(input_file, "rb") as f:
	root = etree.XML(f.read())

XPath queries

With your root element in hand you can now get on with querying. Xpath queries are designed to extract a set of elements or attributes from an XML/HTML document by the name of the element, the value of an attribute on an element, by the relationship an element has with another element or by the content of an element.

Quite often xpath will return elements or lists of elements which, when printed in Python, don’t show you the content you want to see. To get the text content of an element you need to use .text, text_content(), or .tail, and make sure you ask for an array element rather than the whole array.

The follow examples show the key features of xpath. I’m using this blog (http:/www.ianhopkinson.org.uk/) as an example website so you can play along with xpath:

Specifying a complete path with / as separator

title = root.xpath('/html/body/div/div/div[2]/h1')

is the full path to my blog title. Notice how we request the 2nd element of the third set of div elements using div[2] – xpath arrays are one-based, not zero-based.

Specifying a path with wildcards using //

This expression also finds the title but the preamble of /html/body/div/div is absorbed by the // wildcard match:

title = root.xpath('//div[2]/h1')

To obtain the text of the title in Python, rather than an element object, we would do:

title_text = title[0].text.strip() or maybe title_text = title[0].text_content().strip()

text_content() would pick up any tail content, and any text in child elements. I use strip() here to remove leading and trailing whitespace

Selecting attribute values

we’ve seen that //element selects all of the elements of type “element”. We select attribute values like this:

ids = root.xpath('//li/@id')

which selects the id attribute from the list elements (li) on my blog

Specifying an element by attribute

We can select elements which have particular attribute values:

tagcloud = root.xpath('//*[@class="tagcloud"]')

this selects the tag cloud on my blog by selecting elements which having the class attribute “tagcloud”.

Select an element containing some specified text

We can do something similar with the text content of an element:

title = root.xpath(‘//h1[contains(., ‘SomeBeans’)]’)

This selects h1 elements which contain the text “SomeBeans”.

Select via a parent or sibling relationship

Sometimes we want to select elements by their relationship to another element, for example:

subtitle = root.xpath('//h1[contains(@class,"header_title")]/../h2')

this selects the h1 title of my blog (SomeBeans) then navigates to the parent with .. and selects the sibling h2 element (the subtitle “the makings of a small casserole”).

The same effect can be achieved with the following-sibling keyword:

subtitle = root.xpath('//h1[contains(@class,"header_title")]/following-sibling::h2')

XML Namespaces

When dealing with XML, we need to worry about namespaces. In principle the elements of an XML document are described in a schema which can be looked up and is universally unique. In practice the use of namespaces in XML documents can lead to much banging head against wall! This is largely because trivial examples of XML wrangling don’t use namespaces, except as a “special” example.

Here is a fragment of XML defining two namespaces:

<foo:Results xmlns:foo="http://www.foo.com" xmlns="http://www.bah.com">

xmlns:foo defines a namespace whose short form is “foo”, we select elements in this space using a namespace parameter to the xpath query:

records = root.xpath('//foo:Title', namespaces = {"foo": "http://www.foo.com"})

The “catch” here is we also define a default namespace xmlns = “http://www.bah.com”, which means that elements which don’t have a prefix cannot be selected unless we define the namespace in our xpath:

records = root.xpath('//bah:Title', namespaces = {"bah": http://www.bah.com})

Worse than that we need to include our namespace prefix in the query, even though it doesn’t appear in the file!

Conclusion

These snippets cover the majority of the xpath queries I’ve needed over the past few years, I’ll add any others as I find them. I’ve put all the code used here in a GitHub gist.

Xpath is the right tool for the job of extracting information from XML documents, including HTML – do not accept inferior alternatives!