6.1 What Is XPath?

XPath is a specification that allows you to address individual parts of an XML document, originally intended for use in the XSLT transformation language and the XPointer syntax for XML fragment identifiers. However, XPath is quite useful on its own, and is available for standalone use in .NET.

Although XSLT is covered in Chapter 7, XPointer is not implemented in the .NET Framework. Thus, XPointer falls outside of the range of this book. For more information on XPath, XPointer, and their relationship, see John Simpson's XPath & XPointer (O'Reilly).

XPath 1.0 became a formal recommendation of the W3C in November 1999; XPath 2.0 is a working draft, still evolving as of this writing. The official XPath recommendation is located on the web at http://www.w3.org/TR/xpath.

The essence of XPath is that you can select certain nodes from within an XML document through a simple XPath expression. In addition, XPath allows you to do some simple string, numeric, and Boolean data transformation on selected nodes. XPath expressions take the form of strings with a certain well-known syntax. This syntax is not explicitly XML itself; it is similar to filesystem pathnames and URLs, and this is where XPath gets its name.

In addition to addressing nodes by name, XPath syntax enables pattern matching, so that you can select individual nodes by their attribute or content values.

In this section, I'll discuss the structure and syntax of XPath expressions, and some of the functions built in to the specification.

6.1.1 Introduction to the XPath Specification

Just like DOM, XPath operates on a tree-based view of an XML document. The XPath tree is built of the same node types used in DOM, except that CDATA sections, entity references, and document type declarations are not directly addressable. Their content is, however; the net result is that you can navigate to a text node's content, but you cannot tell whether that content contains plain text, CDATA, expanded entity references, or some combination thereof. You cannot access document type declarations at all with XPath.

For this discussion, I'll return to the inventory example from Chapter 5. That example included an inventory database that looked similar to the one in Example 6-1; here I've added some additional products.

Example 6-1. Angus Hardware inventory database

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE inventory SYSTEM "inventory.dtd">
<inventory>
  <!-- Warehouse inventory for Angus Hardware -->
  <date year="2002" month="7" day="6" />
  <items>
    <item quantity="15" productCode="R-273"
          description="14.4 Volt Cordless Drill" unitCost="189.95" />
    <item quantity="23" productCode="1632S"
          description="12 Piece Drill Bit Set" unitCost="14.95" />
    <item quantity="10023" productCode="GN0250"
          description="1/4 inch Galvanized Steel Nails, 1/2 pound box" unitCost="4.95" />
    <item quantity="9887" productCode="GN0375"
          description="3/8 inch Galvanized Steel Nails, 1/2 pound box" unitCost="189.95" />
    <item quantity="8761" productCode="GN0500"
          description="1/2 inch Galvanized Steel Nails, 1/2 pound box" unitCost="4.95" />
    <item quantity="3441" productCode="GN0625"
          description="5/8 inch Galvanized Steel Nails, 1/2 pound box" unitCost="4.95" />
    <item quantity="9987" productCode="GN0750"
          description="3/4 inch Galvanized Steel Nails, 1/2 pound box" unitCost="4.95" />
    <item quantity="10002" productCode="GN0875"
          description="7/8 inch Galvanized Steel Nails, 1/2 pound box" unitCost="4.95" />
    <item quantity="596" productCode="GN1000"
          description="1 inch Galvanized Steel Nails, 1/2 pound box" unitCost="4.95" />
  </items>
</inventory>

6.1.1.1 Parts of an XPath expression

To introduce the proper terminology, each part of the XPath expression is called a location step. Each location step is made up of an axis, a node test, and zero or more predicates. Location steps are separated by the slash character (/).

The axis specifies the tree relationship between the nodes selected by the location step and the context node. Many axes have abbreviations which, while very convenient, are not always obvious to someone new to XPath. Table 6-1 shows the axes, their abbreviations, and brief descriptions of their meanings.

Table 6-1. Location step axes and their abbreviations

Axis                Abbreviation  Meaning
child               (default)     Contains the immediate children of the context node.
parent              ..            Contains the immediate parent of the context node.
self                .             Contains the context node itself.
attribute           @             Contains the attributes of the context node, if it is an element.
ancestor            (none)        Contains the parent of the context node, its parent, and so on, all the way up to the root node.
ancestor-or-self    (none)        Contains the context node in addition to all the nodes contained in the ancestor axis.
descendant          (none)        Contains the children of the context node, their children, and so on, all the way down to the lowest level comment, element, processing instruction, and text node. It does not include attributes or namespaces.
descendant-or-self  //            Contains the context node in addition to all the nodes contained in the descendant axis. (Use sparingly for performance reasons.)
preceding-sibling   (none)        Contains all children of the context node's parent node that appear before the context node.
following-sibling   (none)        Contains all children of the context node's parent node that appear after the context node.
preceding           (none)        Contains all nodes that appear before the context node and are not its ancestors.
following           (none)        Contains all nodes that appear after the context node and are not its descendants.
namespace           (none)        Contains the namespace nodes of the context node, if it is an element.

The node test specifies the type and name of the nodes selected by the location step. A node test can be a name, such as item, which selects the nodes with that name; an asterisk (*), which selects every node of the axis's principal type (elements, for most axes); or one of the node type tests: text( ), which selects text nodes; comment( ), which selects comments; processing-instruction( ), which selects processing instructions; and node( ), which selects nodes of any type. On the child axis, for example, comment( ) selects all the child nodes of the context node that are comments. The child axis is the default for any location step that does not have an explicit axis.

A predicate further refines the set of nodes selected by the location step. Predicates can include selecting a specific element by position, as well as functions like count( ). Predicates always appear in square brackets ([ ]).
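To make positional predicates and count( ) concrete, here is a minimal C# sketch. It assumes a trimmed copy of Example 6-1 held in a string, with the DOCTYPE left out so that no DTD file is needed. The class and method names are hypothetical, chosen only for illustration.

```csharp
using System;
using System.Xml;
using System.Xml.XPath;

public static class PredicateSketch
{
    // Trimmed, hypothetical copy of Example 6-1; the DOCTYPE is omitted
    // so that loading does not go looking for inventory.dtd.
    private const string InventoryXml = @"<inventory>
  <date year='2002' month='7' day='6' />
  <items>
    <item quantity='15' productCode='R-273' />
    <item quantity='23' productCode='1632S' />
    <item quantity='10023' productCode='GN0250' />
  </items>
</inventory>";

    private static XmlDocument Load()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(InventoryXml);
        return doc;
    }

    // Positional predicate: XPath positions start at 1, not 0.
    // Returns null when no item exists at that position.
    public static string ProductCodeAt(int position)
    {
        XmlNode item = Load().SelectSingleNode(
            "/inventory/items/item[" + position + "]");
        return item == null ? null : item.Attributes["productCode"].Value;
    }

    // count( ) yields an XPath number, which Evaluate returns as a double.
    public static double ItemCount()
    {
        XPathNavigator navigator = Load().CreateNavigator();
        return (double)navigator.Evaluate("count(/inventory/items/item)");
    }

    public static void Main()
    {
        Console.WriteLine(ProductCodeAt(2));
        Console.WriteLine(ItemCount());
    }
}
```

SelectNodes( ) and SelectSingleNode( ) can only return nodes, so an expression whose result is a number, string, or Boolean has to go through XPathNavigator.Evaluate( ) instead.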

The double slash (//) is an abbreviation for /descendant-or-self::node( )/. The XPath query //foo would return all elements named foo anywhere in the document. While this is a very powerful expression, it can also be very inefficient, as it may require the XPath processor to visit every node in the document looking for elements named foo. It should be used sparingly, and preferably within controlled contexts.

I'll show you some of these terms in their proper context as we go along.

6.1.1.2 Selecting elements

If you have an XML document such as the inventory database in Example 6-1, you might wish to select certain nodes from it. For example, you might want to know the date the inventory numbers were recorded. The following XPath expression would return the date element:

//child::date

The double colon (::) separates the axis from the node test. Since child is the default axis, this can also be expressed in the abbreviated syntax:

//date

Every XPath expression has a context node. The context node is the node from which the search begins. In most cases, an XPath implementation allows you to select the node you wish to use as the context node. However, you can explicitly indicate that the search is to begin from the root node of the document by beginning the expression with /. A leading double slash (//) also begins at the root node, but searches all of its descendants, so //date returns all elements in the document that have the name date. A single slash would not work here: /date selects only children of the root node, and the only element child of the root node is inventory.

The XPath recommendation does not define a standard way to set the XPath context node. In .NET, the XmlNode object's SelectNodes( ) method, which I introduced in Chapter 5, sets the context node to the XmlNode instance upon which you call the method.

For the inventory document example, this expression would return the element <date year="2002" month="7" day="6" />. If there are other elements elsewhere in the tree with the name date, each of them would be returned as well. You can make your search more specific by including only those elements with the name date that are children of the top-level inventory element, using this expression:

/child::inventory/child::date

And again, this can be expressed with the abbreviated syntax:

/inventory/date
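The difference between an absolute path and a path relative to the context node can be seen in a short C# sketch. It assumes a trimmed copy of Example 6-1 in a string (DOCTYPE omitted). The class name and the YearVia( ) helper are hypothetical.

```csharp
using System;
using System.Xml;

public static class ContextNodeSketch
{
    // Trimmed, hypothetical copy of Example 6-1 without the DOCTYPE.
    private const string InventoryXml = @"<inventory>
  <date year='2002' month='7' day='6' />
  <items>
    <item quantity='15' productCode='R-273' />
  </items>
</inventory>";

    public static XmlDocument Load()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(InventoryXml);
        return doc;
    }

    // Evaluates xpath with the given node as the context node and returns
    // the year attribute of the first match, or null if nothing matched.
    public static string YearVia(XmlNode context, string xpath)
    {
        XmlNode date = context.SelectSingleNode(xpath);
        return date == null ? null : date.Attributes["year"].Value;
    }

    public static void Main()
    {
        XmlDocument doc = Load();

        // Absolute path: starts at the root node, whatever the context node is.
        Console.WriteLine(YearVia(doc, "/inventory/date"));
        Console.WriteLine(YearVia(doc.DocumentElement, "/inventory/date"));

        // Relative path: starts at the context node, here the inventory element.
        Console.WriteLine(YearVia(doc.DocumentElement, "date"));
    }
}
```

All three calls reach the same date element; only the starting point of the search differs.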

In much the same vein, you could navigate to the items element with any of the following expressions; they can be considered equivalent if the context node is the root node of the document:

//child::inventory/child::items
//inventory/items
/inventory/items
inventory/items

The single leading slash (/), as explained previously, indicates that the context node is to be ignored and the search is to start at the root node. The double slash (//) has a slightly different meaning: at any point within the expression, it indicates that the search is to include the node reached so far as well as all its descendants; at the beginning of the expression, that node is the root node. The expression with no leading slash indicates that the search is relative to the context node.

// is actually just an abbreviation for /descendant-or-self::node( )/. So another equivalent to the expressions above would be:

/descendant-or-self::node( )/child::inventory/child::items

This expansion and replacement of axes really could go on forever.

Once you have retrieved the items element, you can make it the context node for your next XPath expression. You can then return the list of item elements with this expression:

item

You can then iterate through each of these item nodes, doing as you wish with them.
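As a rough C# sketch of that two-step approach, the following selects the items element and then uses it as the context node for the relative expression item. It assumes the sample string shown, a trimmed copy of Example 6-1. The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class ItemIterationSketch
{
    // Trimmed, hypothetical copy of Example 6-1 without the DOCTYPE.
    public const string Sample = @"<inventory>
  <date year='2002' month='7' day='6' />
  <items>
    <item quantity='15' productCode='R-273' />
    <item quantity='23' productCode='1632S' />
    <item quantity='10023' productCode='GN0250' />
  </items>
</inventory>";

    public static List<string> ProductCodes(string xml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);

        // Step 1: retrieve the items element with an absolute path.
        XmlNode items = doc.SelectSingleNode("/inventory/items");

        // Step 2: items is now the context node, so the relative
        // expression "item" selects its item children.
        List<string> codes = new List<string>();
        foreach (XmlNode item in items.SelectNodes("item"))
        {
            codes.Add(item.Attributes["productCode"].Value);
        }
        return codes;
    }

    public static void Main()
    {
        foreach (string code in ProductCodes(Sample))
        {
            Console.WriteLine(code);
        }
    }
}
```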

If you have an item element and wish to gather information about the inventory date, you can use the double period (..), which is an abbreviation for parent::node( ). This step selects the parent of the context node. So, to get the date element from an item element's context, you could use this expression:

../../date

The double period can be used anywhere in the expression. For example, you can combine some of the previous forms to return the date element in a fairly inefficient yet entirely legal way. This sort of construct really comes into its own when you start to build XPath expressions dynamically:

//item/../../date

It's interesting to note that although //item would select all the item elements within the document, //item/../../date returns only the one date element. This is because XPath removes duplicate nodes from the result set.

You can also select multiple elements at once, with the pipe character (|). The following expression selects both the date and item elements from the document:

//item|//date
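The parent step, the removal of duplicates, and the union operator can all be checked by counting the nodes each expression returns. This C# sketch assumes a trimmed copy of Example 6-1 with three item elements. The class and method names are hypothetical.

```csharp
using System;
using System.Xml;

public static class ParentAndUnionSketch
{
    // Trimmed, hypothetical copy of Example 6-1 without the DOCTYPE.
    private const string InventoryXml = @"<inventory>
  <date year='2002' month='7' day='6' />
  <items>
    <item quantity='15' productCode='R-273' />
    <item quantity='23' productCode='1632S' />
    <item quantity='10023' productCode='GN0250' />
  </items>
</inventory>";

    private static XmlDocument Load()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(InventoryXml);
        return doc;
    }

    // Number of nodes an expression selects, with the document as context.
    public static int Count(string xpath)
    {
        return Load().SelectNodes(xpath).Count;
    }

    // Starts from the first item element and walks up two levels to the date.
    public static string YearFromFirstItem()
    {
        XmlNode item = Load().SelectSingleNode("//item");
        return item.SelectSingleNode("../../date").Attributes["year"].Value;
    }

    public static void Main()
    {
        Console.WriteLine(YearFromFirstItem());
        Console.WriteLine(Count("//item"));             // every item element
        Console.WriteLine(Count("//item/../../date"));  // duplicates collapse
        Console.WriteLine(Count("//item|//date"));      // union of both sets
    }
}
```

Each of the three item elements leads to the same date element, but a node-set holds any given node only once.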

6.1.1.3 Selecting attributes

XPath defines a special character to select an attribute node. The at sign (@) axis indicates that the node to select is an attribute. @ is an abbreviation for attribute::. Attributes can be intermingled with other nodes in the XPath expression. Thus, the following expression selects the year attribute of the date element:

//inventory/date/@year

And again, although it is an odd and somewhat inefficient way to do it, you could select the month attribute from any element that has a year attribute with this expression:

//@year/../@month

You can also use wildcards for element and attribute names. An asterisk (*) matches all element nodes, and @* matches all attribute nodes. This expression returns all attributes for all elements:

//*/@*

Finally, the node( ) node test selects all nodes, of all types.

You may find it helpful to expand the axis abbreviations into their full axes as an aid to learning. For example, //inventory/date/@year is equivalent to /descendant-or-self::node( )/child::inventory/child::date/attribute::year, which, while specific, is not exactly terse.
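In .NET, an attribute selected this way comes back as an ordinary node whose Value property holds the attribute's text. The following C# sketch assumes a trimmed copy of Example 6-1 in which each item carries only two attributes. The class and method names are hypothetical.

```csharp
using System;
using System.Xml;

public static class AttributeSketch
{
    // Trimmed, hypothetical copy of Example 6-1 without the DOCTYPE:
    // 3 attributes on date plus 2 on each of the 3 item elements.
    private const string InventoryXml = @"<inventory>
  <date year='2002' month='7' day='6' />
  <items>
    <item quantity='15' productCode='R-273' />
    <item quantity='23' productCode='1632S' />
    <item quantity='10023' productCode='GN0250' />
  </items>
</inventory>";

    private static XmlDocument Load()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(InventoryXml);
        return doc;
    }

    // Value of the first node the expression selects, or null if none.
    public static string FirstValue(string xpath)
    {
        XmlNode node = Load().SelectSingleNode(xpath);
        return node == null ? null : node.Value;
    }

    // Number of nodes the expression selects.
    public static int Count(string xpath)
    {
        return Load().SelectNodes(xpath).Count;
    }

    public static void Main()
    {
        Console.WriteLine(FirstValue("//inventory/date/@year"));
        Console.WriteLine(FirstValue("//@year/../@month"));
        Console.WriteLine(Count("//*/@*"));
    }
}
```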

6.1.1.4 Selecting text, comments, and processing instructions

XPath also defines several node tests to select the other types of nodes. The first of these, text( ), selects any text node. The data returned will concatenate all text, whitespace, CDATA, and entity references into a continuous stream of characters, as long as there is no markup separating them:

//text( )

Contrary to the XPath 1.0 recommendation, in .NET's XPath implementation, a CDATA section interrupts a text node. The CDATA itself and any text following the CDATA will not be returned by text( ).

The comment( ) node test selects comments. Each comment is returned as a separate node, even if there is no text or markup between them:

//comment( )

As the name implies, the processing-instruction( ) node test selects processing instructions:

//processing-instruction( )

With all the expressions you've seen so far, you can move up or down the node hierarchy at will, by inserting the appropriate axis. For example, you can select all the attributes of the parent nodes of any processing instructions with this expression:

//processing-instruction( )/../@*
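Because the inventory document contains no text or processing instructions, the following C# sketch tries these expressions against a small made-up fragment instead. The note element, its contents, and the class and method names are all hypothetical and exist only for this illustration.

```csharp
using System;
using System.Xml;

public static class NodeTypeSketch
{
    // Hypothetical fragment: two adjacent comments, three runs of text
    // separated by markup, and one processing instruction.
    private const string NoteXml =
        "<note id='n1'><!-- first --><!-- second -->" +
        "Call <b>Angus</b> today<?audit checked?></note>";

    private static XmlDocument Load()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(NoteXml);
        return doc;
    }

    // Number of nodes the expression selects, with the document as context.
    public static int Count(string xpath)
    {
        return Load().SelectNodes(xpath).Count;
    }

    // Name (the target) of the first processing instruction in the document.
    public static string FirstInstructionTarget()
    {
        return Load().SelectSingleNode("//processing-instruction()").Name;
    }

    public static void Main()
    {
        Console.WriteLine(Count("//text()"));     // text nodes, split by markup
        Console.WriteLine(Count("//comment()"));  // one node per comment
        Console.WriteLine(FirstInstructionTarget());
        Console.WriteLine(Count("//processing-instruction()/../@*"));
    }
}
```

The two adjacent comments come back as separate nodes, and the last expression finds the id attribute of note, the parent of the processing instruction.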

6.1.1.5 Selecting nodes by value

However, there are times when selecting all the elements or attributes with a particular name is not enough. You may want to find all the elements with a particular attribute value. For this purpose, XPath defines predicates. The following expression selects any item elements that have a productCode attribute whose value is equal to GN0500:

//item[@productCode='GN0500']

You might also want to find all the items for which fewer than 10,000 units are in stock. The following XPath expression would discover that, and select their description attributes:

//item[@quantity<10000]/@description

XPath supports the comparison operators =, !=, <, >, <=, and >=, as well as the logical operators and and or. Most values are converted automatically to an appropriate numeric or Boolean value, if the operator requires that type.
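The following C# sketch runs value-based predicates like these against a trimmed copy of Example 6-1 containing four of its items. The class and method names are hypothetical, and the last expression in Main( ), which combines two comparisons with and, is my own illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class ValuePredicateSketch
{
    // Trimmed, hypothetical copy of Example 6-1 without the DOCTYPE.
    private const string InventoryXml = @"<inventory>
  <items>
    <item quantity='15' productCode='R-273' unitCost='189.95'
          description='14.4 Volt Cordless Drill' />
    <item quantity='23' productCode='1632S' unitCost='14.95'
          description='12 Piece Drill Bit Set' />
    <item quantity='10023' productCode='GN0250' unitCost='4.95'
          description='1/4 inch Galvanized Steel Nails, 1/2 pound box' />
    <item quantity='8761' productCode='GN0500' unitCost='4.95'
          description='1/2 inch Galvanized Steel Nails, 1/2 pound box' />
  </items>
</inventory>";

    // Values of all the nodes an expression selects, in document order.
    public static List<string> Values(string xpath)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(InventoryXml);

        List<string> values = new List<string>();
        foreach (XmlNode node in doc.SelectNodes(xpath))
        {
            values.Add(node.Value);
        }
        return values;
    }

    public static void Main()
    {
        string[] expressions = {
            "//item[@productCode='GN0500']/@description",
            "//item[@quantity<10000]/@description",
            // Attribute strings are converted to numbers for the comparisons.
            "//item[@quantity<10000 and @unitCost<5]/@productCode"
        };
        foreach (string xpath in expressions)
        {
            Console.WriteLine(xpath);
            foreach (string value in Values(xpath))
            {
                Console.WriteLine("  " + value);
            }
        }
    }
}
```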

Although there is a lot more included in the XPath recommendation, there is not room in this volume to list it all. If you're interested in learning more about XPath, I recommend XML In a Nutshell (O'Reilly). If you want to learn about XPath in an XSLT context, take a look at XSLT (O'Reilly).

6.1.2 When to Use XPath

You should use XPath when you have an XML node in memory and you wish to navigate directly to a particular child node. This presumes that you have either created or loaded an XmlDocument in memory. You can also load an XML document directly into an XPathDocument from a Stream, URL, TextReader, or XmlReader. This method obviates the need to create an XmlDocument at all, and is more efficient than the DOM, since the XPathDocument is a read-only representation of the XML document.
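A minimal C# sketch of the XPathDocument route, assuming the XML arrives through a TextReader (here a StringReader over a trimmed copy of Example 6-1). The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

public static class XPathDocumentSketch
{
    // Trimmed, hypothetical copy of Example 6-1 without the DOCTYPE.
    public const string Sample = @"<inventory>
  <date year='2002' month='7' day='6' />
  <items>
    <item quantity='15' productCode='R-273' />
    <item quantity='23' productCode='1632S' />
  </items>
</inventory>";

    public static List<string> ProductCodes(TextReader source)
    {
        // No XmlDocument is built; the XPathDocument is a read-only store.
        XPathDocument document = new XPathDocument(source);
        XPathNavigator navigator = document.CreateNavigator();

        // Select returns an iterator; Current is a navigator positioned
        // on each selected node in turn.
        XPathNodeIterator iterator =
            navigator.Select("/inventory/items/item/@productCode");

        List<string> codes = new List<string>();
        while (iterator.MoveNext())
        {
            codes.Add(iterator.Current.Value);
        }
        return codes;
    }

    public static void Main()
    {
        foreach (string code in ProductCodes(new StringReader(Sample)))
        {
            Console.WriteLine(code);
        }
    }
}
```

Unlike XmlNode.SelectNodes( ), the navigator hands back positions within the document rather than DOM nodes, which is part of why no XmlDocument is required.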

XPath is a good substitute for XmlReader when you have already read an entire document into memory, and the document is to be processed randomly. If you have an extremely large XML document, or you wish to access it strictly sequentially, however, there can be a performance advantage to writing an XmlReader client that handles parsing events. For example, if you are only interested in a certain node within the document, there is no need to load the entire document into memory; you should write an XmlReader client to handle the specific parsing event that indicates the node in question has been read, and skip the rest.
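The sequential alternative might be sketched in C# as follows: the reader stops at the first item element with the requested product code, and nothing past that point is parsed. This is only an illustration under stated assumptions. It uses the XmlReader.Create( ) factory method, which is newer than the .NET release this book covers, and the class and method names are hypothetical.

```csharp
using System;
using System.IO;
using System.Xml;

public static class ReaderSketch
{
    // Trimmed, hypothetical copy of Example 6-1 without the DOCTYPE.
    public const string Sample = @"<inventory>
  <items>
    <item quantity='15' productCode='R-273'
          description='14.4 Volt Cordless Drill' />
    <item quantity='23' productCode='1632S'
          description='12 Piece Drill Bit Set' />
  </items>
</inventory>";

    // Returns the description of the first item with the given product
    // code, or null if the reader reaches the end without finding one.
    public static string FindDescription(TextReader source, string productCode)
    {
        using (XmlReader reader = XmlReader.Create(source))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element
                    && reader.Name == "item"
                    && reader.GetAttribute("productCode") == productCode)
                {
                    // Found it; returning here skips the rest of the document.
                    return reader.GetAttribute("description");
                }
            }
        }
        return null;
    }

    public static void Main()
    {
        Console.WriteLine(FindDescription(new StringReader(Sample), "1632S"));
    }
}
```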
