PHP DomXPath: Read Complex XML files easily.

XPath makes it easy to traverse XML elements and attributes. For complex XML documents, using XPath can significantly reduce the amount of code you have to write.

A good XPath tutorial is available on W3Schools, and the reference for DomXPath is in the PHP manual.

XPath is useful when you need to extract a specific node by running a query, rather than walking through the entire XML yourself. Below I will explain how to use DomDocument and DomXPath to read XML. I will start with a simple XML file and then move on to a more complex one.

Here’s a basic XML called ‘test.xml’:

<?xml version="1.0" encoding="ISO-8859-1"?>
<library>
	<book isbn="781">
		<name>SCJP 1.5</name>
		<info><![CDATA[Sun Certified Java Programmer book]]></info>
	</book>
	<book isbn="980">
		<name>jQuery How To</name>
		<info><![CDATA[jQuery Reference Book]]></info>
	</book>
</library>

A few details first. Here are a couple of examples of DomXPath query syntax:

// get all book elements that have an info attribute and are children of library
query("//library/book[@info]")

// get all name elements that are children of book, which is a child of library (library is the root node)
query("/library/book/name");

// note: above I use a single slash before library to specify that it's the root node
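
To try these expressions without a file on disk, here is a minimal sketch that loads a trimmed copy of test.xml from a string with loadXML(). In test.xml, info is a child element rather than an attribute, so the predicate here tests the isbn attribute instead. The variable names ($doc, $xp and so on) are only illustrative.

```php
<?php
// A trimmed copy of test.xml, kept in a string so the sketch is self-contained.
$xml = <<<XML
<?xml version="1.0" encoding="ISO-8859-1"?>
<library>
	<book isbn="781">
		<name>SCJP 1.5</name>
	</book>
	<book isbn="980">
		<name>jQuery How To</name>
	</book>
</library>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xp = new DOMXPath($doc);

// absolute path: name elements under book under the root library element
$names = $xp->query('/library/book/name');

// attribute predicate: only book elements that carry an isbn attribute
$withIsbn = $xp->query('//library/book[@isbn]');

// attribute value predicate: the name of the book whose isbn equals 980
$one = $xp->query('//book[@isbn="980"]/name');

echo $names->length, "\n";
echo $withIsbn->length, "\n";
echo $one->item(0)->nodeValue, "\n";
```

The first two queries each match both books, and the last one narrows the result down to a single name element.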

For full specs, please check out W3C’s page on XPath syntax.

To register a namespace prefix with PHP's DomXPath, use the following:

$xpath->registerNamespace('prefix', 'namespaceURI');
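
Why this matters is easier to see on a tiny document. The sketch below assumes a made-up XML string with a default namespace (the URI http://example.com/ns/catalog and the element names are hypothetical). It runs the same path once without a prefix and once with a registered prefix.

```php
<?php
// Hypothetical document: the namespace URI and element names are made up.
$xml = '<catalog xmlns="http://example.com/ns/catalog">'
     . '<item>First</item><item>Second</item>'
     . '</catalog>';

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// Unprefixed names in XPath 1.0 refer to elements in no namespace,
// so this query does not match the namespaced elements.
$unprefixed = $xpath->query('/catalog/item');

// Bind an arbitrary prefix ("c") to the document's namespace URI
// and use that prefix on every step of the path.
$xpath->registerNamespace('c', 'http://example.com/ns/catalog');
$prefixed = $xpath->query('/c:catalog/c:item');

echo $unprefixed->length, "\n";
echo $prefixed->length, "\n";
```

The prefix is local to the query. It does not have to match any prefix used in the document, only the namespace URI has to match.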

Below is one way to parse it: first load the XML file into DomDocument, then initialize DomXPath, and then run the query method.

$dom = new DomDocument("1.0", "ISO-8859-1");
$dom->load('test.xml');
$xpath = new DomXPath($dom);

Let's say I want to get the names of all the books. Then I just do the following:

$bookList  = array();
$bookNodes = $xpath->query('//book/name'); // selects all name element
for($i=0;$i<$bookNodes->length;$i++) {
 $bookList[] = $bookNodes->item($i)->nodeValue;
}

// below is var_dump of bookList
array(2) {
  [0]=>
  string(8) "SCJP 1.5"
  [1]=>
  string(13) "jQuery How To"
}
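
The same idea extends to reading several values per book. As a rough sketch (not the only way to do it), the code below selects each book element first and then evaluates relative expressions against it, using the optional context node argument of evaluate(). The XML is the sample from above, loaded from a string so the snippet runs on its own. The shape of the $books array is my own choice.

```php
<?php
$xml = <<<XML
<library>
	<book isbn="781">
		<name>SCJP 1.5</name>
		<info><![CDATA[Sun Certified Java Programmer book]]></info>
	</book>
	<book isbn="980">
		<name>jQuery How To</name>
		<info><![CDATA[jQuery Reference Book]]></info>
	</book>
</library>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

$books = array();
foreach ($xpath->query('/library/book') as $book) {
	// relative expressions are evaluated against the $book context node;
	// string(...) makes evaluate() return a plain PHP string
	$books[$book->getAttribute('isbn')] = array(
		'name' => $xpath->evaluate('string(name)', $book),
		'info' => $xpath->evaluate('string(info)', $book),
	);
}

print_r($books);
```

Because every lookup is relative to one book node, a name can never end up paired with the info of a different book.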

OK, simple enough. The above example doesn’t demonstrate how XPath makes life easier, so let’s parse YouTube’s featured RSS playlist, which uses a namespace and has a whole lot of elements. To keep it simple, I am only going to fetch the titles of the recently featured videos and their corresponding URLs.

From the RSS file, it can be seen that the URL is stored in the “href” attribute of the ‘link’ element whose type attribute is text/html. Below is a snippet of the ‘link’ and title elements from the XML.


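<feed xmlns="http://www.w3.org/2005/Atom">
 ...
  <entry>
    <title type="text">YouTube Symphony Orchestra @ Carnegie Hall - Act One</title>
    ...
    <link type="text/html" href="http://www.youtube.com/watch?v=ueJcRmfweSM"/>
   ...
  </entry>
 ...
</feed>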

OK, so we need to get the following:

  1. The node value of the ‘title’ element that has the attribute type=’text’ and sits inside entry, which in turn sits inside feed.
  2. The value of the ‘href’ attribute of the ‘link’ element that has the attribute type=’text/html’ and sits inside entry, which in turn sits inside feed.

Below is the full code for reading YouTube’s RSS:

// initialize DomDocument and load the feed into it
$youTubeDom = new DomDocument();
$youTubeDom->load('http://gdata.youtube.com/feeds/api/standardfeeds/recently_featured');

// initialize a DomXPath object
$xPath = new DomXPath($youTubeDom);

// register the namespace used by YouTube (it's declared on the feed element)
$xPath->registerNamespace('yte', 'http://www.w3.org/2005/Atom');

// now run the 2 queries; add the prefix of that namespace because feed, entry etc. belong to it
$linkNodes  = $xPath->query("/yte:feed/yte:entry/yte:link[@type='text/html']");
$titleNodes = $xPath->query("/yte:feed/yte:entry/yte:title[@type='text']");

// pair each title with the link at the same index
// (this assumes every entry has exactly one link of type text/html)
$recentList = array();
for ($i = 0; $i < $titleNodes->length; $i++) {
	$recentList[$i] = array(
		'title' => $titleNodes->item($i)->nodeValue,
		'url'   => $linkNodes->item($i)->getAttribute('href')
	);
}

And that’s it. Here’s a var_dump of $recentList:


array(25) {
  [0]=>
  array(2) {
    ["title"]=>
    string(52) "YouTube Symphony Orchestra @ Carnegie Hall - Act One"
    ["url"]=>
    string(42) "http://www.youtube.com/watch?v=ueJcRmfweSM"
  }
  [1]=>
  array(2) {
    ["title"]=>
    string(38) ""The Internet Symphony" Global Mash Up"
    ["url"]=>
    string(42) "http://www.youtube.com/watch?v=oC4FAyg64OI"
  }
  [2]=>
  array(2) {
    ["title"]=>
    string(49) "Harmony: The Road to Carnegie Hall Teaser Trailer"
    ["url"]=>
    string(42) "http://m.youtube.com/details?v=oC4FAyg64OI"
  }
  [3]=>
  array(2) {
    ["title"]=>
    string(44) "The YouTube Symphony Orchestra Summit Begins"
    ["url"]=>
    string(42) "http://www.youtube.com/watch?v=wBZviTce94Q"
  }
  [4]=>
  array(2) {
    ["title"]=>
    string(47) "4/13@A.M. YouTubeSymphonyOrchestra Vlog by Eiko"
    ["url"]=>
    string(42) "http://m.youtube.com/details?v=wBZviTce94Q"
  }
  [5]=>
  array(2) {
    ["title"]=>
    string(50) ""Internet Symphony, Eroica" Rehearsal with Tan Dun"
    ["url"]=>
    string(42) "http://www.youtube.com/watch?v=lwVtmH9k-SI"
  }
  // ...... shortened .... 
  
}
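
One caveat: pairing titles and links by index relies on both node lists having the same length and order. The dump above contains m.youtube.com URLs that repeat the video id of the previous item, which suggests that some entries carry more than one text/html link, so the pairs can drift apart. A possible way around this, sketched below, is to select each entry first and run relative queries with that entry as the context node. The inline Atom sample, its titles and its example.com URLs are made up for illustration. Only the namespace URI and the query shapes come from the code above.

```php
<?php
// Made-up Atom sample: the second entry has two text/html links on purpose.
$atom = <<<XML
<feed xmlns="http://www.w3.org/2005/Atom">
	<entry>
		<title type="text">First video</title>
		<link type="text/html" href="http://example.com/watch?v=1"/>
	</entry>
	<entry>
		<title type="text">Second video</title>
		<link type="text/html" href="http://example.com/watch?v=2"/>
		<link type="text/html" href="http://m.example.com/details?v=2"/>
	</entry>
</feed>
XML;

$dom = new DOMDocument();
$dom->loadXML($atom);
$xPath = new DOMXPath($dom);
$xPath->registerNamespace('yte', 'http://www.w3.org/2005/Atom');

$recentList = array();
foreach ($xPath->query('/yte:feed/yte:entry') as $entry) {
	// relative queries: only this entry's title and its first text/html link
	$title = $xPath->query("yte:title[@type='text']", $entry)->item(0);
	$link  = $xPath->query("yte:link[@type='text/html']", $entry)->item(0);
	if ($title === null || $link === null) {
		continue; // skip entries that lack either node
	}
	$recentList[] = array(
		'title' => $title->nodeValue,
		'url'   => $link->getAttribute('href'),
	);
}

print_r($recentList);
```

Taking the first matching link per entry is a design choice of this sketch. If the feed marks the desktop link in some other way, the predicate would need to be adjusted accordingly.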

I hope this explains how XPath simplifies reading XML files.