XPath HTML queries

XPath is a powerful query language that is often used for parsing websites. It lets you select nodes or compute values from XML and HTML documents. CSS selectors serve a similar purpose, but XPath can do much more.

With XPath you can select data based on the text content of elements, not just on the page structure. So when you need to parse a fairly “crooked” site, XPath can save you a lot of time.

This tutorial will introduce you to the basics of XPath, and then you can move on to more advanced features.

You can use this service to experiment with XPath.

Basics of Basics

Let’s say we have a document like this:

<!-- html -->
<title>This is page</title>
 <h2>Go to my cool <a href="#">page</a></h2>
  <p>It is the first paragraph.</p>
  <p>This is the second paragraph.</p>

XPath represents any XML/HTML document as a tree of nodes. The root node is not an element itself; it is the parent of the document’s top-level element (for HTML, that is the html element). This is roughly what it looks like:

[Figure: the document represented as a tree of nodes]

As you can see, there are several types of nodes in the XPath tree.

  • Element: represents an HTML element, such as h2 or a.
  • Attribute: represents an attribute of an element, such as the href attribute in <a href="#">page</a>.
  • Text node: represents the text within an element, for example, page in <a href="#">page</a>.
  • Comment: represents a comment in the document (<!-- html -->).

It’s important to be aware of the difference between these nodes. Now let’s dive into XPath.

This is how XPath is used to select an element:

/html/head/title

Such paths are called location paths. They let you specify a path relative to the context node (in this case, the root). This path consists of three steps separated by slashes. It means “starting from the root, select the html element, then the head element inside it, and then the title element inside that”. The context node changes at every step, so at the last step it is the head element.

Usually we don’t know the exact node-to-node path (or are just too lazy to write it out), so we can search the whole document instead:
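To see the context node change from step to step, the same path can be walked one step at a time. This is only a minimal sketch, assuming Python with the third-party lxml library (neither is mentioned in the original). The PAGE string is a hypothetical complete version of the sample document with the html, head and body tags written out, and all variable names are made up for illustration.

```python
from lxml import html

# Hypothetical full version of the sample page (html/head/body written out).
PAGE = """<html>
  <head><title>This is page</title></head>
  <body>
    <h2>Go to my cool <a href="#">page</a></h2>
    <p>It is the first paragraph.</p>
    <p>This is the second paragraph.</p>
  </body>
</html>"""

doc = html.document_fromstring(PAGE)

# The whole location path in one go, starting from the root node.
whole_path = doc.xpath("/html/head/title")

# The same path one step at a time: each result is the next context node.
(html_el,) = doc.xpath("/html")        # context node: root
(head_el,) = html_el.xpath("head")     # context node: html
(title_el,) = head_el.xpath("title")   # context node: head

print(title_el.text)
print([el.text for el in whole_path])
```

A relative path such as "head" is evaluated against the element that xpath() is called on, which is why the stepwise version ends up at the same title element.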

//title

This expression means “search the whole tree from the root (//) and find every title element.”

Generally speaking, the expressions we saw above are abbreviated XPath syntax. The full version of the last expression will look like this:

/descendant-or-self::node()/child::title

That is, // is short for /descendant-or-self::node()/, which means the current node or any node at any level below it. The descendant-or-self part of the expression is called the axis; it defines the set of nodes from which the selection is made (below, above, or on the same level as the context node).

The next part of the expression, node(), is called the node test: it decides whether a node on the axis should be selected or not. In this case, nodes of all types are selected. Then comes another axis, child, which means “go through the child nodes of the current node”, and the node test in this case is title.

That is, the axis determines which nodes, relative to the context node, are tested. The nodes that pass the node test are returned as the result.
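The equivalence of the abbreviated and the full syntax can be checked directly. The sketch below assumes Python with the third-party lxml library (an assumption, not something the original uses) and a small hypothetical page. It relies on lxml's getpath() method, which returns an absolute path for a node, to confirm that both expressions pick the same node.

```python
from lxml import html

# Hypothetical minimal page with a single title element.
doc = html.document_fromstring(
    "<html><head><title>This is page</title></head>"
    "<body><p>It is the first paragraph.</p></body></html>"
)
tree = doc.getroottree()

abbreviated = doc.xpath("//title")
unabbreviated = doc.xpath("/descendant-or-self::node()/child::title")

# getpath() gives an absolute path to each node, so equal lists mean
# that both expressions selected the same nodes.
abbreviated_paths = [tree.getpath(el) for el in abbreviated]
unabbreviated_paths = [tree.getpath(el) for el in unabbreviated]
print(abbreviated_paths)
print(unabbreviated_paths)
```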

You can select nodes both by name and by type.

Here are some examples:

  • /html – selects all nodes named html that are children of the root node.
  • /html/head – selects nodes named head inside the html node.
  • //title – selects all title nodes in the document.
  • //h2/a – selects all a nodes in the document that are children of an h2 node.

And here are some examples of selection by type:

  • //comment() – selects only comment nodes.
  • //node() – selects all nodes in the tree.
  • //text() – selects only the text nodes.
  • //* – selects all element nodes (comments and text nodes are left out).

And of course we can combine these methods:

//p/text()

This expression selects the text nodes inside all p elements. In the HTML shown above, it selects “It is the first paragraph.” and “This is the second paragraph.”.
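The name-based and type-based node tests above can be tried side by side. This is a rough sketch, assuming Python with the third-party lxml library. PAGE is a hypothetical complete version of the sample document, with the comment placed inside the body, and the variable names are illustrative only.

```python
from lxml import html

# Hypothetical full version of the sample page, including the comment.
PAGE = """<html>
  <head><title>This is page</title></head>
  <body>
    <!-- html -->
    <h2>Go to my cool <a href="#">page</a></h2>
    <p>It is the first paragraph.</p>
    <p>This is the second paragraph.</p>
  </body>
</html>"""

doc = html.document_fromstring(PAGE)

# //* matches element nodes only, in document order.
element_tags = [el.tag for el in doc.xpath("//*")]

# //comment() matches comment nodes only.
comments = [c.text.strip() for c in doc.xpath("//comment()")]

# A name test combined with a type test: text nodes inside p elements.
paragraph_texts = doc.xpath("//p/text()")

# A name test on two steps: a elements that are children of h2.
links_in_h2 = [a.text for a in doc.xpath("//h2/a")]

print(element_tags)
print(comments)
print(paragraph_texts)
print(links_in_h2)
```

In lxml, text() results come back as Python strings, while element and comment nodes come back as objects.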

Now let’s look at how we can filter the results. Suppose we have a document like this:

<ul>
      <li>Line 1</li>
      <li>Line 2 with <a href="...">link</a></li>
      <li>Line 3 with <a href="...">second link</a></li>
      <li><h2>Line 4 title</h2></li>
    </ul>

We can select the first item in the list like this:

//li[position() = 1]

The expression in square brackets is called a predicate. It filters the nodes returned by the expression //li. In this case, it checks the position of each node using the position() function, which returns the position of the current node in the set of nodes selected by the step (here, the li children of the same parent). Note that numbering starts at 1. This expression can be abbreviated as follows:

//li[1]

Both expressions will return this:

<li>Line 1</li>
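A quick way to confirm that the two spellings agree is to run both against the list. This is a minimal sketch, assuming Python with the third-party lxml library. The html and body wrapper around the list and the variable names are additions for illustration.

```python
from lxml import html

# The list from above, wrapped in html/body so it parses as a full page.
doc = html.document_fromstring("""<html><body>
<ul>
  <li>Line 1</li>
  <li>Line 2 with <a href="...">link</a></li>
  <li>Line 3 with <a href="...">second link</a></li>
  <li><h2>Line 4 title</h2></li>
</ul>
</body></html>""")

by_function = [li.text_content() for li in doc.xpath("//li[position() = 1]")]
by_number = [li.text_content() for li in doc.xpath("//li[1]")]
fourth = [li.text_content() for li in doc.xpath("//li[4]")]

print(by_function)  # ['Line 1']
print(by_number)    # ['Line 1']
print(fourth)       # ['Line 4 title']
```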

Here are some examples of predicates:

  • //li[position() mod 2 = 0] – selects li elements at even positions.
  • //li[a] – selects li elements that have an a child element.
  • //li[h2 or a] – selects li elements that have an h2 or an a child element.
  • //li[a[text()="link"]] – selects li elements that have an a child element with the text “link”. It can also be written as //li[a/text()="link"].
  • //li[last()] – selects the last li element within its parent (here, the last item in the list).
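The predicates in this list can be checked against the sample list. The sketch below assumes Python with the third-party lxml library. The texts() helper and the html/body wrapper are hypothetical, added for illustration. Note that XPath 1.0 spells the modulo operator as mod.

```python
from lxml import html

doc = html.document_fromstring("""<html><body>
<ul>
  <li>Line 1</li>
  <li>Line 2 with <a href="...">link</a></li>
  <li>Line 3 with <a href="...">second link</a></li>
  <li><h2>Line 4 title</h2></li>
</ul>
</body></html>""")


def texts(expression):
    """Return the full text of every node selected by the expression."""
    return [node.text_content() for node in doc.xpath(expression)]


even = texts("//li[position() mod 2 = 0]")    # items 2 and 4
with_link = texts("//li[a]")                  # items 2 and 3
with_h2_or_link = texts("//li[h2 or a]")      # items 2, 3 and 4
exact_link = texts('//li[a[text()="link"]]')  # item 2 only
last_item = texts("//li[last()]")             # item 4

for result in (even, with_link, with_h2_or_link, exact_link, last_item):
    print(result)
```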

To summarize: a path consists of steps separated by slashes, and each step contains an axis, a node test, and optionally one or more predicates. Here is an example of a two-step expression in which each step has its own axis, node test, and predicate.

//li[ 4 ]/h2[ text() = "Line 4 title" ]

And here is the analog without the abbreviations:

/descendant-or-self::node()
    /child::li[ position() = 4 ]
        /child::h2[ text() = "Line 4 title" ]

We can also combine two expressions into one using the | operator. For example, we can select all a and h2 elements in a document:

//a | //h2
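As an illustration, the union can be run against the list document from earlier. This is a sketch that assumes Python with the third-party lxml library. The html/body wrapper and the variable names are hypothetical.

```python
from lxml import html

doc = html.document_fromstring("""<html><body>
<ul>
  <li>Line 1</li>
  <li>Line 2 with <a href="...">link</a></li>
  <li>Line 3 with <a href="...">second link</a></li>
  <li><h2>Line 4 title</h2></li>
</ul>
</body></html>""")

# The | operator merges the node sets of both expressions.
merged = doc.xpath("//a | //h2")
merged_tags = [el.tag for el in merged]
merged_texts = [el.text for el in merged]

print(merged_tags)
print(merged_texts)
```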

Now consider the following document:

<ul>
      <li id="pr-lt"><a href="https://examplesite.com">examplesite</a></li>
      <li><a href="https://examplesitestack.com">examplesitestack</a></li>
      <li><a href="https://blog.examplesite.com">domhtmlstack blog</a></li>
      <li id="sh-cl"><a href="http://hub.toexamplesite.com">hub to examplesite</a></li>
    </ul>

Let’s imagine that we need to get all the links that point to HTTPS URLs. We can do this by checking the href attribute:

//a[starts-with(@href, "https")]

This expression first selects all the links on the page and then checks whether the href attribute of each one starts with https. An attribute is accessed using the @attributename syntax.

And some more examples:

  • //a[@href="https://examplesite.com"] – selects a elements that link to https://examplesite.com.
  • //a/@href – selects the addresses that the page links to.
  • //li[@id] – selects only those li elements that have an id attribute.
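The attribute expressions can be tried against the link list above. The following is a minimal sketch, assuming Python with the third-party lxml library. The html/body wrapper and the variable names are additions for illustration.

```python
from lxml import html

doc = html.document_fromstring("""<html><body>
<ul>
  <li id="pr-lt"><a href="https://examplesite.com">examplesite</a></li>
  <li><a href="https://examplesitestack.com">examplesitestack</a></li>
  <li><a href="https://blog.examplesite.com">domhtmlstack blog</a></li>
  <li id="sh-cl"><a href="http://hub.toexamplesite.com">hub to examplesite</a></li>
</ul>
</body></html>""")

# Links whose href attribute starts with "https".
https_links = [a.text for a in doc.xpath('//a[starts-with(@href, "https")]')]

# Selecting the attribute itself returns its values as strings.
all_hrefs = [str(href) for href in doc.xpath("//a/@href")]

# li elements that carry an id attribute, whatever its value.
ids = [li.get("id") for li in doc.xpath("//li[@id]")]

# Exact match on the attribute value.
exact = [a.text for a in doc.xpath('//a[@href="https://examplesite.com"]')]

print(https_links)
print(all_hrefs)
print(ids)
print(exact)
```

The exact-match comparison is literal, so a stray space inside the quoted URL would make the expression match nothing.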

More about axes

We’ve seen two kinds of axes before:

  • descendant-or-self
  • child

But there are many more. Consider a document like this:

<p>First paragraph</p>
    <h2>Brand #1</h2>
    <p>Another paragraph #1</p>
    <p>Random one #1</p>
    A second paragraph, with no markup
    <h2>Brand #2</h2>
    <p>Another paragraph #2</p>
    <p>Random one #2</p>
    A third paragraph, with no markup
    <div id="footer"><p>Footer data</p></div>

Now we want to extract only the first paragraph after each header. To do this, we can use the following-sibling axis, which selects all nodes on the same level that come after the context node.

//h2/following-sibling::p[1]

In this example, the context node to which the following-sibling axis was applied is h2.

But what if we want to select the text right before the footer? We can use the preceding-sibling axis:
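The effect of the [1] predicate on the following-sibling axis is easiest to see by running the step with and without it on the h2 headers. This sketch assumes Python with the third-party lxml library. The html/body wrapper and the variable names are hypothetical.

```python
from lxml import html

doc = html.document_fromstring("""<html><body>
<p>First paragraph</p>
<h2>Brand #1</h2>
<p>Another paragraph #1</p>
<p>Random one #1</p>
A second paragraph, with no markup
<h2>Brand #2</h2>
<p>Another paragraph #2</p>
<p>Random one #2</p>
A third paragraph, with no markup
<div><p>Footer data</p></div>
</body></html>""")

# Every p sibling that comes after any h2 header.
all_after = [p.text for p in doc.xpath("//h2/following-sibling::p")]

# Only the first p sibling after each h2 header.
first_after = [p.text for p in doc.xpath("//h2/following-sibling::p[1]")]

print(all_after)
print(first_after)
```

The paragraph inside the div is not selected in either case, because it is not on the same level as the headers.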

//div[@id='footer']/preceding-sibling::text()[1]

In this case, we select the first text node before the footer (“A third paragraph, with no markup”). On the preceding-sibling axis, positions are counted backwards from the context node, so [1] is the nearest preceding node.

XPath also allows us to select items based on their text content. We can use this feature along with the parent axis to get the parent node of the p element with the text “Footer data”.
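Here is a small sketch of the preceding-sibling axis in action, assuming Python with the third-party lxml library. It uses a shortened, hypothetical version of the document in which the footer div carries id="footer", since the expression depends on that attribute.

```python
from lxml import html

# Shortened document; the footer div is assumed to carry id="footer".
doc = html.document_fromstring("""<html><body>
<h2>Brand #2</h2>
<p>Another paragraph #2</p>
<p>Random one #2</p>
A third paragraph, with no markup
<div id="footer"><p>Footer data</p></div>
</body></html>""")

# Text node closest to the footer, looking backwards among its siblings.
text_nodes = doc.xpath("//div[@id='footer']/preceding-sibling::text()[1]")
nearest_text = text_nodes[0].strip()

# The same backwards counting applies to elements: p[1] is the nearest p.
nearest_p = [p.text for p in doc.xpath("//div[@id='footer']/preceding-sibling::p[1]")]

print(nearest_text)
print(nearest_p)
```

The text node includes the surrounding line breaks, which is why the sketch strips it before printing.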

//p[ text()="Footer data" ]/..

As a result, we get the div element that contains the “Footer data” paragraph. As you can see, “..” is used as an abbreviation for the parent axis.

An alternative to the expression above is this expression:

//*[p/text()="Footer data"]

It selects all elements that have a p child element with the text “Footer data”.
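Both ways of reaching the parent can be compared on a small example. This is a minimal sketch, assuming Python with the third-party lxml library. The document is a shortened, hypothetical version of the one above, and the third expression spells out the parent axis in full.

```python
from lxml import html

doc = html.document_fromstring("""<html><body>
<p>First paragraph</p>
<h2>Brand #1</h2>
<p>Another paragraph #1</p>
<div><p>Footer data</p></div>
</body></html>""")

# Step up from the matching p element with the ".." abbreviation.
via_dots = [el.tag for el in doc.xpath('//p[text()="Footer data"]/..')]

# The same step written with the full parent axis.
via_axis = [el.tag for el in doc.xpath('//p[text()="Footer data"]/parent::node()')]

# Start from any element and keep it only if it has a matching p child.
via_predicate = [el.tag for el in doc.xpath('//*[p/text()="Footer data"]')]

print(via_dots, via_axis, via_predicate)
```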

You can find the whole specification here.
