XPath is a powerful query language that is often used in web scraping. It lets you select nodes or compute values from XML and HTML documents. CSS selectors serve a similar purpose, but XPath can do much more.
With XPath you can select data based on the text content of elements, not just on the page structure. So when you need to scrape a site with fairly messy markup, XPath can save you a lot of time.
This tutorial will introduce you to the basics of XPath, and then you can move on to more advanced features.
You can use this service to experiment with XPath.
Basics of Basics
Let’s say we have a document like this:
<!-- html -->
<html>
  <head>
    <title>This is page</title>
  </head>
  <body>
    <h2>Go to my cool <a href="#">page</a></h2>
    <p>It is the first paragraph.</p>
    <p>This is the second paragraph.</p>
  </body>
</html>
XPath represents any XML/HTML document as a tree of nodes. The root node is not itself part of the document's markup; it is the parent of the document element (for HTML, the html element). Roughly, the tree looks like this: the root node has the html element as its child, html contains head, head contains title, and title contains a text node.
There are several types of nodes in the XPath tree:
- Element: represents an HTML element (a tag), such as h2 or a.
- Attribute: represents an attribute of an element, such as the href attribute in <a href="#">.
- Text node: represents the text within an element, for example, "page" in <a href="#">page</a>.
- Comment: represents a comment in the document (<!-- html -->).
It’s important to be aware of the difference between these nodes. Now let’s dive into XPath.
This is how XPath is used to select the title element:
/html/head/title
Such a path is called a location path. It lets you specify a path relative to the context node (in this case, the root node). This path consists of three steps separated by slashes. It means "starting from the html element, look inside the head element, and select the title element inside it". The context node changes at every step, so at the last step it is the head element.
Often we don't know the exact node-to-node path (or are just too lazy to write it out), so we can search the whole document instead:
//title
This expression means "search the whole tree from the root (//) and find the title elements".
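As a rough illustration, both kinds of path can be tried with Python's standard-library xml.etree.ElementTree module. This is only a sketch under some assumptions: ElementTree supports a limited subset of XPath, its paths are relative to the element you call them on (so the step-by-step path becomes ./head/title on the html element, and the whole-document search becomes .//title), and the sample page is retyped here as well-formed XML with html, head and body wrappers.

```python
import xml.etree.ElementTree as ET

# The sample page as well-formed XML (the html/head/body wrappers are an assumption).
PAGE = """
<html>
  <head><title>This is page</title></head>
  <body>
    <h2>Go to my cool <a href="#">page</a></h2>
    <p>It is the first paragraph.</p>
    <p>This is the second paragraph.</p>
  </body>
</html>
"""

html = ET.fromstring(PAGE)  # the <html> element

# Step-by-step path: from html into head, then into title.
title_by_path = html.find("./head/title")

# Whole-document search: any title element below html, at any depth.
titles_anywhere = html.findall(".//title")

print(title_by_path.text)
print([t.text for t in titles_anywhere])
```

Both lookups find the same element here; the second one simply does not care where in the tree the title sits.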
Strictly speaking, the expressions we saw above use the abbreviated XPath syntax. The full version of the last expression looks like this:
/descendant-or-self::node()/child::title
// is short for /descendant-or-self::node()/, which means the current node or any node below it, at any depth. The descendant-or-self part is called the axis: it defines the set of nodes from which the selection is made (below, above or at the same level as the context node).
The next part of the expression is node(), which is called the node test: it decides whether a given node should be selected or not. In this case, nodes of all types are selected. Then comes another axis, child, which means "go through the children of the current node", and the node test in this case is title, which keeps only the nodes named title.
In other words, the axis determines which nodes, relative to the context node, are tested, and the nodes that pass the node test are returned as the result.
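To make the axis and node test idea more concrete, here is a hand-written Python sketch that emulates the two unabbreviated steps (descendant-or-self with a node() test, then child with a name test for title). It is an approximation, not how an XPath engine is implemented: the helper names descendant_or_self and child are made up for this example, ElementTree exposes only element nodes, and the walk starts at the html element rather than at the root node above it.

```python
import xml.etree.ElementTree as ET

# The sample page as well-formed XML (the html/head/body wrappers are an assumption).
PAGE = """
<html>
  <head><title>This is page</title></head>
  <body>
    <h2>Go to my cool <a href="#">page</a></h2>
    <p>It is the first paragraph.</p>
    <p>This is the second paragraph.</p>
  </body>
</html>
"""

html = ET.fromstring(PAGE)


def descendant_or_self(node):
    """Hypothetical helper: the node itself plus every element below it."""
    return list(node.iter())


def child(node, name):
    """Hypothetical helper: direct children of node that pass the name test."""
    return [c for c in node if c.tag == name]


matches = []
for context in descendant_or_self(html):      # step 1: pick the context nodes
    matches.extend(child(context, "title"))   # step 2: test their children

print([m.text for m in matches])
```

Each step produces a set of nodes, and every node in that set becomes the context node for the next step.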
You can select nodes both by name and by type.
Here are some examples:
/html – selects the html node relative to the root node.
/html/head – selects nodes named head that are children of the html node.
//title – selects all title nodes in the document.
//h2/a – selects all a nodes in the document that are children of an h2 node.
And here are some examples of selection by type:
//comment() – selects only comment nodes.
//node() – selects nodes of all types in the tree.
//text() – selects only text nodes.
//* – selects all element nodes (that is, no comments and no text nodes).
And of course we can combine these methods:
//p/text()
This expression selects the text nodes inside all p elements. In the HTML shown above, it selects "It is the first paragraph." and "This is the second paragraph.".
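A rough Python equivalent of selecting by type, again using the standard-library ElementTree module as an assumption. ElementTree has no separate text nodes: the text at the start of an element is stored in its text attribute, so the sketch reads that instead of selecting text nodes, and iterating over all elements stands in for selecting every element node.

```python
import xml.etree.ElementTree as ET

# The sample page as well-formed XML (the html/head/body wrappers are an assumption).
PAGE = """
<html>
  <head><title>This is page</title></head>
  <body>
    <h2>Go to my cool <a href="#">page</a></h2>
    <p>It is the first paragraph.</p>
    <p>This is the second paragraph.</p>
  </body>
</html>
"""

html = ET.fromstring(PAGE)

# Every element in document order (elements only, no text or comments).
all_elements = [el.tag for el in html.iter()]

# The text held directly by each p element.
paragraph_texts = [p.text for p in html.iter("p")]

print(all_elements)
print(paragraph_texts)
```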
Now let’s look at how we can filter the results. Suppose we have a document like this:
<ul>
  <li>Line 1</li>
  <li>Line 2 with <a href="...">link</a></li>
  <li>Line 3 with <a href="...">second link</a></li>
  <li><h2>Line 4 title</h2></li>
</ul>
We can select the first item in the list like this:
//li[position() = 1]
The expression in square brackets is called a predicate. It filters the nodes returned by the expression //li. In this case, it checks the position of each node using the position() function, which returns the position of the current node in the node set being filtered. Note that numbering starts at 1. This expression can also be written in abbreviated form:
//li[1]
Both expressions will return this:
<li>Line 1</li>
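The positional predicate can be tried with Python's standard-library ElementTree module, which accepts the abbreviated numeric form (but not the position() function). This is a sketch under that assumption; the list comprehension underneath spells out the same 1-based position check by hand.

```python
import xml.etree.ElementTree as ET

# The list document from above, kept as is (including the "..." href values).
LIST = """
<ul>
  <li>Line 1</li>
  <li>Line 2 with <a href="...">link</a></li>
  <li>Line 3 with <a href="...">second link</a></li>
  <li><h2>Line 4 title</h2></li>
</ul>
"""

ul = ET.fromstring(LIST)

# Abbreviated positional predicate: the first li among its siblings.
first = ul.findall(".//li[1]")

# The same check written out: keep the item whose 1-based position equals 1.
first_by_hand = [li for pos, li in enumerate(ul.findall("li"), start=1) if pos == 1]

print([ET.tostring(li, encoding="unicode").strip() for li in first])
```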
Here are some examples of predicates:
//li[position() mod 2 = 0] – selects the li elements at even positions.
//li[a] – selects the li elements that contain an a element.
//li[h2 or a] – selects the li elements that contain an h2 or an a element.
//li[a[text()="link"]] – selects the li elements that contain an a element with the text "link". It can also be written as //li[a/text()="link"].
//li[last()] – selects the last li element in the document.
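Some of these predicates can be tried directly with Python's standard-library ElementTree module, while others have to be emulated, because ElementTree implements only part of XPath (no position(), mod, or boolean or). The sketch below is written under that assumption; the positions helper is made up for this example and reports which list items were selected.

```python
import xml.etree.ElementTree as ET

# The list document from above, kept as is (including the "..." href values).
LIST = """
<ul>
  <li>Line 1</li>
  <li>Line 2 with <a href="...">link</a></li>
  <li>Line 3 with <a href="...">second link</a></li>
  <li><h2>Line 4 title</h2></li>
</ul>
"""

ul = ET.fromstring(LIST)
items = ul.findall("li")


def positions(selected):
    """Hypothetical helper: 1-based positions of the selected li elements."""
    return [items.index(li) + 1 for li in selected]


# Supported by ElementTree's XPath subset.
with_link = ul.findall(".//li[a]")          # li that has an a child
exact_link = ul.findall(".//li[a='link']")  # li whose a child has the text "link"
last_item = ul.findall(".//li[last()]")     # last li among its siblings

# Emulated in plain Python.
even = [li for pos, li in enumerate(items, start=1) if pos % 2 == 0]
h2_or_a = [li for li in items
           if li.find("h2") is not None or li.find("a") is not None]

for name, selected in [("has a", with_link), ("a is 'link'", exact_link),
                       ("last", last_item), ("even", even), ("h2 or a", h2_or_a)]:
    print(name, positions(selected))
```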
To summarize: a location path consists of steps separated by slashes, and each step consists of an axis, a node test, and optionally one or more predicates. Here is an example in which the li and h2 steps each have their own axis, node test, and predicate.
//li[ 4 ]/h2[ text() = "Line 4 title" ]
And here is the analog without the abbreviations:
/descendant-or-self::node() /child::li[ position() = 4 ] /child::h2[ text() = "Line 4 title" ]
We can also combine two expressions into one using the | operator. For example, we can select all a and h2 elements in a document:
//a | //h2
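Python's standard-library ElementTree module has no | operator, so the sketch below only emulates the idea: it walks the list document once and keeps every element whose tag is a or h2, which yields the combined set in document order. Treat it as an illustration of what the union selects, not as an XPath implementation.

```python
import xml.etree.ElementTree as ET

# The list document from above, kept as is (including the "..." href values).
LIST = """
<ul>
  <li>Line 1</li>
  <li>Line 2 with <a href="...">link</a></li>
  <li>Line 3 with <a href="...">second link</a></li>
  <li><h2>Line 4 title</h2></li>
</ul>
"""

ul = ET.fromstring(LIST)

# One pass over the tree keeps document order and avoids duplicates.
union = [el for el in ul.iter() if el.tag in {"a", "h2"}]

print([(el.tag, el.text) for el in union])
```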
Now consider the following document:
<ul>
  <li id="pr-lt"><a href="https://examplesite.com">examplesite</a></li>
  <li><a href="https://examplesitestack.com">examplesitestack</a></li>
  <li><a href="https://blog.examplesite.com">domhtmlstack blog</a></li>
  <li id="sh-cl"><a href="http://hub.toexamplesite.com">hub to examplesite</a></li>
</ul>
Let's imagine that we need to get all the links that point to HTTPS URLs. We can do this by checking the href attribute:
//a[starts-with(@href, "https")]
This expression first selects all the links on the page and then checks whether the href attribute starts with https. The attribute is accessed using the @ prefix, as in @href.
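The same filter can be sketched in Python. The standard-library ElementTree module does not provide the starts-with() function, so this example (an assumption-laden emulation, not a real XPath query) selects every a element and applies the prefix check to the href attribute in plain Python.

```python
import xml.etree.ElementTree as ET

# The links document from above.
LINKS = """
<ul>
  <li id="pr-lt"><a href="https://examplesite.com">examplesite</a></li>
  <li><a href="https://examplesitestack.com">examplesitestack</a></li>
  <li><a href="https://blog.examplesite.com">domhtmlstack blog</a></li>
  <li id="sh-cl"><a href="http://hub.toexamplesite.com">hub to examplesite</a></li>
</ul>
"""

ul = ET.fromstring(LINKS)

# Select all links, then keep those whose href starts with "https".
https_links = [a for a in ul.iter("a") if a.get("href", "").startswith("https")]

print([a.get("href") for a in https_links])
```

The plain http link is filtered out, so three of the four links remain.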
And some more examples:
//a[@href="https://examplesite.com"] – selects a elements that link to https://examplesite.com.
//a/@href – selects the URLs that the page links to (the href attribute nodes).
//li[@id] – selects only those li elements that have an id attribute.
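Attribute predicates are part of the XPath subset that Python's standard-library ElementTree module supports, so they can be tried almost verbatim (with paths written relative to the ul element). Selecting attribute nodes themselves is not supported there, so the sketch reads the href values with get() instead.

```python
import xml.etree.ElementTree as ET

# The links document from above.
LINKS = """
<ul>
  <li id="pr-lt"><a href="https://examplesite.com">examplesite</a></li>
  <li><a href="https://examplesitestack.com">examplesitestack</a></li>
  <li><a href="https://blog.examplesite.com">domhtmlstack blog</a></li>
  <li id="sh-cl"><a href="http://hub.toexamplesite.com">hub to examplesite</a></li>
</ul>
"""

ul = ET.fromstring(LINKS)

# Links whose href equals an exact value.
exact = ul.findall(".//a[@href='https://examplesite.com']")

# All link targets: select the a elements, then read their href attribute.
hrefs = [a.get("href") for a in ul.findall(".//a")]

# List items that carry an id attribute, whatever its value.
with_id = ul.findall(".//li[@id]")

print([a.text for a in exact])
print(hrefs)
print([li.get("id") for li in with_id])
```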
More about axes
We've seen two kinds of axes so far: descendant-or-self and child.
But there are many more. Consider this document:
<p>First paragraph</p>
<h2>Brand #1</h2>
<p>Another paragraph #1</p>
<p>Random one #1</p>
A second paragraph, with no markup
<h2>Brand #2</h2>
<p>Another paragraph #2</p>
<p>Random one #2</p>
A third paragraph, with no markup
<div><p>Footer data</p></div>
Now we want to extract only the first paragraph after each header. To do this, we can use the following-sibling axis, which selects all nodes on the same level that come after the context node:
//h2/following-sibling::p[1]
In this example, the context node to which following-sibling is applied is each h2 element.
But what if we want to select the text before the footer? We can use the preceding-sibling axis:
//div/preceding-sibling::text()[1]
In this case, we select the first text node before the footer ("A third paragraph, with no markup").
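Sibling axes are outside the XPath subset of Python's standard-library ElementTree module, so the sketch below emulates both lookups by walking the list of siblings by hand. Two assumptions to keep in mind: the fragment is wrapped in a body element so that it parses as XML, and ElementTree stores bare text that follows an element in that element's tail attribute rather than as a separate text node.

```python
import xml.etree.ElementTree as ET

# The document from above, wrapped in <body> (an assumption) so it is well-formed.
BODY = """
<body>
  <p>First paragraph</p>
  <h2>Brand #1</h2>
  <p>Another paragraph #1</p>
  <p>Random one #1</p>
  A second paragraph, with no markup
  <h2>Brand #2</h2>
  <p>Another paragraph #2</p>
  <p>Random one #2</p>
  A third paragraph, with no markup
  <div><p>Footer data</p></div>
</body>
"""

body = ET.fromstring(BODY)
siblings = list(body)

# For each h2, take the first p among the siblings that follow it.
first_after_header = []
for index, node in enumerate(siblings):
    if node.tag == "h2":
        following = [s for s in siblings[index + 1:] if s.tag == "p"]
        if following:
            first_after_header.append(following[0].text)

# The text just before the footer div is the tail of the element preceding it.
div_index = next(i for i, node in enumerate(siblings) if node.tag == "div")
text_before_footer = (siblings[div_index - 1].tail or "").strip()

print(first_after_header)
print(text_before_footer)
```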
XPath also allows us to select items based on their text content. We can use this feature along with the parent axis to get the parent node of the p element with the text "Footer data":
//p[ text()="Footer data" ]/..
As a result, we get the div element that contains the "Footer data" paragraph. As you can see, .. is used as an abbreviation for the parent axis (parent::node()).
An alternative to the expression above is this expression:
//*[p/text()="Footer data"]
It selects all elements that have a child p element with the text "Footer data".
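Both approaches have close counterparts in the XPath subset of Python's standard-library ElementTree module. The sketch assumes Python 3.7 or later (for the [.='text'] predicate) and wraps the fragment in a body element so that it parses; note that ElementTree compares the element's complete text content, which is slightly different from testing a text() node.

```python
import xml.etree.ElementTree as ET

# The document from above, wrapped in <body> (an assumption) so it is well-formed.
BODY = """
<body>
  <p>First paragraph</p>
  <h2>Brand #1</h2>
  <p>Another paragraph #1</p>
  <p>Random one #1</p>
  A second paragraph, with no markup
  <h2>Brand #2</h2>
  <p>Another paragraph #2</p>
  <p>Random one #2</p>
  A third paragraph, with no markup
  <div><p>Footer data</p></div>
</body>
"""

body = ET.fromstring(BODY)

# Find the p by its text, then step up to its parent with "..".
parents = body.findall(".//p[.='Footer data']/..")

# Alternative: select any element that has a p child with that text.
containers = body.findall(".//*[p='Footer data']")

print([el.tag for el in parents])
print([el.tag for el in containers])
```

Both queries end up at the same div element, reached from opposite directions.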
You can find the whole specification here.