XPath is a powerful query language often used for parsing websites. It lets you select nodes or compute values from XML and HTML documents. CSS selectors serve a similar purpose, but XPath can do much more.
With XPath you can extract data based on text content and not just on the page structure. So when you need to parse a fairly “crooked” site, XPath can save you a lot of time.
This tutorial will introduce you to the basics of XPath, and then you can move on to more advanced features.
You can use this service to experiment with XPath.
Basics of Basics
Let’s say we have a document like this:
<!-- html -->
<title>This is page</title>
<h2>Go to my cool <a href="#">page</a></h2>
<p>It is the first paragraph.</p>
<p>This is the second paragraph.</p>
XPath represents any XML/HTML document as a tree of nodes. The root node is not itself an element; it is considered the parent of the top-level element of the document (for HTML, the html element). This is roughly what it looks like:

As you can see, there are several types of nodes in the XPath tree.
- Element: represents an HTML element, that is, a tag such as title or a.
- Attribute: represents an attribute of an element, such as the href attribute of the a tag.
- Text node: represents the text inside an element, for example “page” inside the a tag.
- Comment: represents a comment in the document, such as <!-- html -->.
It’s important to be aware of the difference between these nodes. Now let’s dive into XPath.
This is how XPath is used to select an element:
/html/head/title
Such paths are called location paths. They let you specify a path relative to the context node (in this case, the root). This path consists of three steps separated by slashes. It means “starting from the html element, look inside the head element, and select the title element inside it”. The context node changes at every step, so at the last step it is the head element.
Often we don’t know the exact node-to-node path (or are just too lazy to write it out), so we can search the whole document instead:
//title
This expression means “search the whole tree starting from the root (//) and find the title elements.”
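As a rough illustration, both expressions above can be tried with Python's standard-library xml.etree.ElementTree module. This is only a sketch under assumptions: the sample page is wrapped in explicit html, head and body tags (added here so it parses as well-formed XML), and ElementTree implements only a limited subset of XPath in which paths are written relative to the element you search from. So /html/head/title becomes head/title and //title becomes .//title.

```python
import xml.etree.ElementTree as ET

# Assumed well-formed version of the sample page; the html/head/body
# wrappers are added here for illustration only.
PAGE = """<html>
  <head><title>This is page</title></head>
  <body>
    <h2>Go to my cool <a href="#">page</a></h2>
    <p>It is the first paragraph.</p>
    <p>This is the second paragraph.</p>
  </body>
</html>"""

root = ET.fromstring(PAGE)  # root is the html element

# Step-by-step location path, relative to html: head, then title.
by_path = root.find("head/title")

# Search the whole subtree, similar to //title.
by_search = root.find(".//title")

print(by_path.text)
print(by_search is by_path)
```

For full XPath (axes, functions, text() tests) a third-party library such as lxml is usually needed.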
Generally speaking, the expressions we saw above are abbreviated XPath syntax. The full version of the last expression will look like this:
/descendant-or-self::node()/child::title
That is, // is shorthand for /descendant-or-self::node()/, where descendant-or-self means the current node or any node at any level below it. This part of the expression is called the axis; it defines the set of nodes from which the selection is made (below, above, or on the same level as the context node).
The next part of the expression is node(), which is called the node test: it decides whether a given node on the axis should be selected. In this case, nodes of all types are selected. Then comes another axis, child, which means “go through the children of the current node”, and the node test in this case is title.
In other words, the axis determines which nodes are candidates, the node test is applied to each of them, and the nodes that pass are returned as the result.
You can select nodes both by name and by type.
Here are some examples:
- /html – selects the node named html directly under the root.
- /html/head – selects nodes named head that are children of the html node.
- //title – selects all title nodes in the document.
- //h2/a – selects all a nodes in the document that are children of an h2 node.
And here are some examples of selection by type:
- //comment() – selects only comment nodes.
- //node() – selects all nodes in the tree.
- //text() – selects only text nodes.
- //* – selects all element nodes (no comments or text nodes).
And of course we can combine these methods:
//p/text()
This expression selects the text nodes inside all p elements. In the HTML shown above, it will select “It is the first paragraph.” and “This is the second paragraph.”
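Here is a minimal sketch of selecting by name and reading text with Python's standard-library xml.etree.ElementTree, assuming the sample page is wrapped in html, head and body tags so that it is well-formed. ElementTree supports only a subset of XPath and has no text() or comment() node tests, so the text of each p element is read from its .text attribute, which is only a rough stand-in for //p/text().

```python
import xml.etree.ElementTree as ET

# Assumed well-formed version of the sample page (wrappers added here).
PAGE = """<html>
  <head><title>This is page</title></head>
  <body>
    <h2>Go to my cool <a href="#">page</a></h2>
    <p>It is the first paragraph.</p>
    <p>This is the second paragraph.</p>
  </body>
</html>"""

root = ET.fromstring(PAGE)

# Rough stand-in for //p/text(): the leading text of every p element.
paragraph_texts = [p.text for p in root.findall(".//p")]

# Similar to //h2/a: a elements that are children of an h2 element.
links = root.findall(".//h2/a")

# Similar to //*: every element below the html element, in document order.
all_tags = [el.tag for el in root.findall(".//*")]

print(paragraph_texts)
print([a.text for a in links])
print(all_tags)
```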
Now let’s look at how we can filter the results. Suppose we have a document like this:
<ul>
<li>Line 1</li>
<li>Line 2 with <a href="...">link</a></li>
<li>Line 3 with <a href="...">second link</a></li>
<li><h2>Line 4 title</h2></li>
</ul>
We can select the first item in the list like this:
//li[position() = 1]
The expression in square brackets is called a predicate. It filters the nodes returned by the expression //li. In this case, it checks the position of each node using the position() function, which returns the position of the current node in the node set being filtered. Note that numbering starts at 1. Alternatively, the same expression can be written more briefly:
//li[1]
Both expressions will return this:
<li>Line 1</li>
Here are some examples of predicates:
- //li[position() mod 2 = 0] – selects li elements at even positions.
- //li[a] – selects li elements that have a child element a.
- //li[h2 or a] – selects li elements that have a child element h2 or a.
- //li[a[text()="link"]] – selects li elements that have a child element a with the text “link”. It can also be written as //li[a/text()="link"].
- //li[last()] – selects the li element that is last within its parent (here, the last item of the list).
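Some of these predicates can be tried with Python's standard-library xml.etree.ElementTree, as in the sketch below. It is an approximation rather than full XPath: ElementTree supports positional predicates, last(), child-existence tests and [tag='text'] comparisons, but not position(), mod or or, so the even-position case is emulated with a Python slice.

```python
import xml.etree.ElementTree as ET

LIST = """<ul>
<li>Line 1</li>
<li>Line 2 with <a href="...">link</a></li>
<li>Line 3 with <a href="...">second link</a></li>
<li><h2>Line 4 title</h2></li>
</ul>"""

ul = ET.fromstring(LIST)  # ul is the root element here

first = ul.findall(".//li[1]")             # like //li[1]
with_link = ul.findall(".//li[a]")         # like //li[a]
exact_link = ul.findall(".//li[a='link']") # like //li[a/text()="link"]
last = ul.findall(".//li[last()]")         # like //li[last()]

# Stand-in for //li[position() mod 2 = 0]: every second li, in Python.
even = ul.findall("li")[1::2]

print(first[0].text)
print(len(with_link), len(exact_link), len(even))
print(last[0].find("h2").text)
```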
To summarize: a path consists of steps separated by slashes, and each step consists of an axis, a node test, and optional predicates. Here is an example of a two-step expression, each step with its own axis, node test, and predicate.
//li[ 4 ]/h2[ text() = "Line 4 title" ]
And here is the analog without the abbreviations:
/descendant-or-self::node()
/child::li[ position() = 4 ]
/child::h2[ text() = "Line 4 title" ]
We can also combine two expressions into one using the | operator. For example, we can select all a and h2 elements in a document:
//a | //h2
Now consider the following document:
<ul>
<li id="pr-lt"><a href="https://examplesite.com">examplesite</a></li>
<li><a href="https://examplesitestack.com">examplesitestack</a></li>
<li><a href="https://blog.examplesite.com">domhtmlstack blog</a></li>
<li id="sh-cl"><a href="http://hub.toexamplesite.com">hub to examplesite</a></li>
</ul>
Suppose we need all the links that point to HTTPS URLs. We can get them by checking the href attribute:
//a[starts-with(@href, "https")]
This expression first selects all the links on the page and then checks whether the href attribute of each one starts with https. An attribute is accessed using the @attributename syntax.
Here are some more examples:
- //a[@href="https://examplesite.com"] – selects a elements that link to https://examplesite.com.
- //a/@href – selects the addresses that the page links to.
- //li[@id] – selects only those li elements that have an id attribute.
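The attribute examples above can be sketched with Python's standard-library xml.etree.ElementTree. Keep in mind that this module supports only a subset of XPath: attribute existence and equality tests work, but functions such as starts-with() and attribute steps such as /@href do not, so those two cases are handled in plain Python here.

```python
import xml.etree.ElementTree as ET

LINKS = """<ul>
<li id="pr-lt"><a href="https://examplesite.com">examplesite</a></li>
<li><a href="https://examplesitestack.com">examplesitestack</a></li>
<li><a href="https://blog.examplesite.com">domhtmlstack blog</a></li>
<li id="sh-cl"><a href="http://hub.toexamplesite.com">hub to examplesite</a></li>
</ul>"""

ul = ET.fromstring(LINKS)

# Like //a[@href="https://examplesite.com"]
exact = ul.findall(".//a[@href='https://examplesite.com']")

# Like //li[@id]
with_id = ul.findall(".//li[@id]")

# Stand-in for //a/@href: read the attribute from each a element.
hrefs = [a.get("href") for a in ul.findall(".//a")]

# Stand-in for //a[starts-with(@href, "https")]: filter in Python.
https_links = [a for a in ul.findall(".//a")
               if a.get("href", "").startswith("https")]

print([a.text for a in exact])
print([li.get("id") for li in with_id])
print(hrefs)
print([a.text for a in https_links])
```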
More about axes
We’ve seen two kinds of axes before:
- descendant-or-self
- child
But there are many more. Consider a document like this:
<p>First paragraph</p>
<h2>Brand #1</h2>
<p>Another paragraph #1</p>
<p>Random one #1</p>
A second paragraph, with no markup
<h2>Brand #2</h2>
<p>Another paragraph #2</p>
<p>Random one #2</p>
A third paragraph, with no markup
<div id="footer"><p>Footer data</p></div>
Now we want to extract only the first paragraph after each header. To do this, we can use the following-sibling axis, which selects all nodes on the same level that come after the current node.
//h2/following-sibling::p[1]
In this example, the context node to which the following-sibling axis was applied is h2.
But what if we want to select the text right before the footer? We can use the preceding-sibling axis:
//div[@id='footer']/preceding-sibling::text()[1]
In this case, we are selecting the first text node before the footer (“A third paragraph, with no markup”).
XPath also allows us to select items based on their text content. We can use this feature along with the parent axis to get the parent node of the p element with the “Footer data” text.
//p[ text()="Footer data" ]/..
As a result, we get the div element that contains the “Footer data” paragraph. As you can see, “..” is used as an abbreviation for the parent axis.
An alternative to the expression above is this expression:
//*[p/text()="Footer data"]
It selects all elements that have a p child element with the text “Footer data”.
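Both ways of reaching the parent can be sketched with Python's standard-library xml.etree.ElementTree, which supports the .. step and text-equality predicates but not text() or the sibling axes. In this sketch the fragment is shortened and wrapped in a body element (an assumption made here so the document has a single root), and [.='Footer data'] is used in place of [text()="Footer data"].

```python
import xml.etree.ElementTree as ET

# Shortened fragment; the body wrapper is added here for illustration.
FRAGMENT = """<body>
<p>First paragraph</p>
<h2>Brand #2</h2>
<p>Another paragraph #2</p>
A third paragraph, with no markup
<div><p>Footer data</p></div>
</body>"""

body = ET.fromstring(FRAGMENT)

# Like //p[text()="Footer data"]/..
via_parent = body.findall(".//p[.='Footer data']/..")

# Like //*[p/text()="Footer data"]
via_predicate = body.findall(".//*[p='Footer data']")

print([el.tag for el in via_parent])
print(via_parent == via_predicate)
```

Both queries return the same div element, which matches the explanation above.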
You can find the whole specification here.