XML, the Perl Way

Extending XML::XPath


Extending XML::XPath

I have always admired Matt Sergeant's XML::XPath module. XML::XPath is a pretty complete implementation of the W3C's XPath recommendation. You see, I am the author of XML::Twig, which implements, in a way that can only be described as dirty, a subset of XPath. Matt's XPath engine is written properly: it uses the BNF (Backus-Naur Form, a syntax to formally describe the... syntax of languages) of the language as the specification, its regexp-based tokenizer feeds an analyzer that builds an internal form of the expression, which is then applied to an XML document (or fragment). This is way cleaner, and more powerful, than the simple regexp-based, all-in-one analyzer that I use in XML::Twig. Incidently, this sad state of affair is mostly due the fact that I never set out to write a full XPath engine, but it just grew as users, including me, asked for more and more features of the language to be supported.

So when I decided to add a complete XPath engine to XML::Twig, good software engineering principle told me that I should have a look at XML::XPath, and try to re-use it as much as possible.

This article will show how I integrated the XML::XPath engine into XML::Twig to (at last!) give it full XPath support, then how I got carried away and added XPath support to XML::DOM too. It will also give you a blueprint on how to add XPath support to modules that manage trees. Maybe more important, the method I used can be used in other cases, to wrestle re-use out of code that might not have been designed for it.

The Players

XPATH (http://www.w3c.org/XPATH/) is a W3C recommendation for the query of XML documents. XPath expressions can be as simple as /doc/title, which selects all elements title right within the root of the document (the doc element), and as complex as //item/title[following-sibling::description[starts_with( string(), "2002")]], which selects all title elements within item elements, which are followed by a description element that starts with the string ``2002''.

is a module that builds a tree representing an XML document, using the DOM (Document Object Model, yet another W3C recommendation). It offers methods to select nodes in the tree using XPath, and methods to modify the tree and to output it (or parts of it) based on the DOM. It is a stable module, quite popular, although XML::LibXML is probably a better choice if you can install it: it is more powerful, offering DOM, XPath, RelaxNG, XInclude... support, and more efficient, if often more unstable than XML::XPath.

XML::Twig is a Perl module that does ``XML, The Perl Way''. It offers a comprehensive API on XML documents, and includes a (very!) crude XPath engine.

An other module on CPAN, Class::XPath, is designed as a general purpose XPath engine. Its documentation lists clearly the methods to implement to add XPath support to any code that manages trees. I chose not to use it though, because it's XPath engine is quite limited, even less powerful than XML::Twig's.

Let's do it

First Contact

The first step was to have a look at XML::XPath, figure out how it worked, and see how I could use it's XPath engine.

My main goal was to implement the findnodes, findvalue and findnodes_as_string methods, which return the result of an XPath query on a document in different forms.

One option was to rip the relevant code from the module, modify it to work with XML::Twig and include it. This is Open-Source remember, and the Artistic License allows this. I thought a better option, though was to use XML::XPath and inherit from it. This would be much cleaner, allow me to use future versions of the module, and might be easier if XML::XPath had been written to allow this. Additionaly it would mean that I could deflect some bug reports towards Matt instead of having to maintain my fork of the code...

So could this be done?

Looking at the code of the module, I quickly figured that it consisted in 4 big parts:

XML::XPath::PerlSAX and XML::XPath::XMLParser
These packages parse the XML and build a tree of nodes

XML::XPath::Node and XML::XPath::Node::*
These packages store the various types of nodes in the tree

XML::XPath::Parser
This package parses an XPath expression and stores it in a form that can later be evaluated

evaluation modules
Those packages are used to evaluate the parsed XPath expression, on the tree (or on a sub-tree) built by XML::XPath::XMLParser.

Logically the parsing of the XPath expression does not depend on the tree, so this block could probably stay as-is. XML::Twig provides the parsing block, and has its own tree model, which will replace the equivalent code in XML::XPath. So the real challenge would be to allow the evaluation of the query to work on XML::Twig nodes instead of on XML::XPath nodes.

Encouragingly, the main call to evaluate from XPath.pm looks like this:

   return $parsed_path->evaluate($context);

$context is an XML::Path::Node. It looked like the context was just passed to the method, so it might be possible to pass a different one, an XML::Twig one.

I then looked for the evaluate method in the module: a quick grep found it in the different packages:

  % grep 'sub evaluate' *.pm XPath/*.pm XPath/Node/*.pm
  XPath/Expr.pm:sub evaluate {
  XPath/Function.pm:sub evaluate {
  XPath/Literal.pm:sub evaluate {
  XPath/LocationPath.pm:sub evaluate {
  XPath/Number.pm:sub evaluate {
  XPath/Root.pm:sub evaluate {
  XPath/Step.pm:sub evaluate {
  XPath/Step.pm:sub evaluate_node {
  XPath/Variable.pm:sub evaluate {

Interestingly enough Step.pm seemed to be the only package one that deals with the nodes in the XML tree. The others either just loop through the steps (LocationPath.pm) or work on objects that are generated by XML::XPath during the evaluation of the query.

It looked like I had found the small part of the code that I needed to take a closer look at.

Trusting Matt not to change the name of the parameter gratuitously, I once again grep-ed for information:

  % grep 'context->' XPath/Step.pm
  #    warn "Node: ", $context->[node_name], "\n";
      my $parent = $context->getParentNode;
      $context = $context->getParentNode;
      foreach my $attrib (@{$context->getAttributes}) {
      foreach my $node (@{$context->getChildNodes}) {
      my @stack = $context->getChildNodes;
      my $parent = $context->getParentNode;
      while ($context = $context->getNextSibling) {
      while ($context = $context->getNextSibling) {
      return $results unless $context->isElementNode;
      foreach my $ns (@{$context->getNamespaces}) {
      my $parent = $context->getParentNode;
      my $parent = $context->getParentNode;
      while ($context = $context->getPreviousSibling) {
      while ($context = $context->getPreviousSibling) {

Bingo! This looked very much like the list of methods I would need to implement on my nodes to allow XML::XPath to evaluate the queries.

A little post-processing later I had a nice list:

  % grep '\$context->' XPath/Step.pm | perl -ln -e'print $1 if( m{$context->(\w+)}); ' | sort -u
  getAttributes
  getChildNodes
  getNamespaces
  getNextSibling
  getParentNode
  getPreviousSibling
  isElementNode

In addition methods were called on nodes themselves, which led to a second list of methods:

  % grep 'node->' XPath/Step.pm | perl -ln -e'print $1 if( m{node->(\w+)}); ' | sort -u
  getChildNodes
  getExpanded
  getLocalName
  getName
  getNamespace
  getTarget
  getValue
  isCommentNode
  isElementNode
  isPINode
  isTextNode

Most of them were quite straightforward to write, as they existed under a different name in XML::Twig:

getParentNode, getChildNodes, getNextSibling, getPreviousSibling, getName, getValue, isElementNode, isCommentNode, isPINode, isTextNode, and getTarget could all be aliased to the equivalent XML::Twig method.

getNamespaces, getExpanded and getLocalName, which all deal with namespaces, would have to be written, as XML::Twig's namespace support was not quite there. This indeed would be a nice by-product of the whole exercice, giving me a reason, and already-written tests, to improve XML::Twig.

Finally getAttributes showed the real problem: the difference between XML::XPath's and XML::Twig's tree models.

Dealing with Model Mismatches

XML::XPath uses the DOM model to store XML: a base class of XML::XPath::Node provides the basic functions on nodes, and is subclassed in XML::XPath::Node::Attribute, XML::XPath::Node::Comment, XML::XPath::Node::Element, XML::XPath::Node::Namespace, XML::XPath::Node::PI and XML::XPath::Node::Text. XML::Twig on the other hand uses mostly 2 classes: the entire document, XML::Twig, element and XML::Twig::Elt. Elements can be regular elements, comments, PI's or text, depending on their tag name. Attributes are simply attached to the elements, they are not objects themselves. The problem was worse because attributes are actually stored in a hash, and despite the module providing methods to access them, most of the code I have seen uses that fact: users write $elt->{atts}->{foo} instead of the cleaner $elt->att( 'foo'). So even though the XML::Twig API would have allowed me to change the implementation and promote attributes to full-fledged objects, doing so would have likely considerably annoyed most users.

A look at the way XML::XPath uses attributes showed that getAttributes returns an array (or a reference to an array) of attribute nodes, each attribute having a name and a value, accessible through the getName and getValue methods.

In short I could not create an attribute class, as it would break existing code.

Or couldn't I?

The fact that XML::Twig does not use a separate class for attributes for its own processing does not preclude the use of one just for XML::XPath benefit. So I could perfectly have the getAttributes method create attribute objects and return them.

Namespaces could be dealt with the same way: although they would be handled through methods in XML::Twig::Elt, the class that would interface with XML::XPath would create XML::Twig::XPath::Namespace objects and return them when needed.

First implementation

It was now time to try implementing this:

I created a file XPath.pm that held 2 packages sub-classing XML::Twigs', XML::Twig::XPath and XML::Twig::XPath::Elt, and the new packages XML::Twig::XPath::Attributes and XML::Twig::XPath::Namespace.

As a side note, I don't like the common Perl convention of having only one package per file, along with its documentation, as it makes navigating the docs harder than it should. I cannot count the number of times people have asked questions about XML::Parser that were plainly stated in the documentation for XML::Parser::Expat, that no one ever reads. So the documentation for XML::Twig::XPath ended up in XML::Twig.

XML::Twig::XPath subclasses XML::Twig. It mostly adds an XML::XPath object to the base object, that will be used to perform the queries, and uses XML::Twig::XPath::Elt to store the nodes of the tree. It also includes the methods I wanted to add to XML::Twig: findnodes, findnodes_as_string, findvalue, exists, find and matches. Each of them is just a call to the same method on the twig's XML::XPath object, with the path and the twig as arguments.

XML::Twig::XPath::Elt is similar: it subclasses XML::Twig::Elt and adds it the XML::XPath methods. It also includes the two methods getAttributes and getNamespace, which create respectively an array of XML::Twig::XPath::Attribute objects and an XML::Twig::XPath::Namespace objects.

Those objects are initialized from the data in the XML::Twig::Elt object, and are designed to mimic the corresponding objects in XML::XPath::Node::Attribute and XML::XPath::Node::Namespace

Testing

This looked OK, it even passed perl -c ;--)

The next step was testing (even though I like testing, I am not an eXtreme Programer, I code first, then test).

The thing is, I am not an expert in XPath. For sure I have read the recommendation, and I know how to write the queries I need, but writing comprehensive tests requires quite a bit more knowledge, not to mention efforts, than that. Especially when you are too lazy to write the XPath engine yourself in the first place!

The solution was obvious: re-use the tests from XML::XPath!

A quick look at the t/ directory showed that the tests on the query engine were neatly separated from the ones on the other functions of XML::XPath. The tests themselves were written in a very consistent way, that lead itself to automatic processing. Ah the joys of working on a module written by a good coder!

This lead me to writing this simple script, that would take an XML::XPath test and generate an XML::Twig::XPath one.

  #!/usr/bin/perl -p -w
  use strict;

  BEGIN
    { # add the shebang line, use strict and the CVS id line
      print qq{#!/usr/bin/perl -w\nuse strict;\n# \$Id\n};
    }

  s{^use XML::XPath;}{use XML::Twig::XPath;};          # change the module to use
  s{^my \$xp = XML::XPath->new\(ioref => \*DATA\);}    # change the object creation
   {my \$t= XML::Twig::XPath->new->parse( \\*DATA);};
  s{^ok\(\$xp\);}{ok(\$t);};                           # test the creation on $t, not $xp
  s{\$xp->}{\$t->}g;                                   # method calls on $xp are now method calls on $t

Note that I changed the main object from $xp to $t, and added use strict in order to get an error if $xp was used in any other context than testing its creation (ok($xp)) or to call methods on it.

Then of course the first make test spewed errors all over the place...

Adding the missing pieces

Actually it wasn't too bad. Once I had fixed a couple of syntax errors in the tests and added the getRootNode method that I had missed, the error message I got, for every single test, was:

Can't locate object method "get_global_pos" via package "XML::Twig::XPath::Elt" at /usr/lib/perl5/site_perl/5.8.1/XML/XPath/NodeSet.pm line 20.

What was this mysterious get_global_pos method?

A little investigation in NodeSet.pm revealed the ugly truth: a NodeSet is an object that stores an array of results, or partial results, for a query. It is created deep in the bowels of XML::XPath, in every place where results are generated.

XML::XPath always return results in the document order. In my experience users are very often disapointed when the results of a query are not in document order, so really this makes sense. It is even needed for some methods (findnodes_as_string and findvalue). Plus of course the XPath recommendation seems to require it...

The way NodeSets are sorted in XML::XPath is by using an internal property of the nodes, that gives their position in the document. That's what get_global_pos returns. This position is updated, and the document renumbered every time a new element is added.

The problem here is that XML::Twig does not have such a property. As it is oriented towards modifying XML documents, even huge ones, I feel that the renumbering phase would be too expensive to perform. It can sort elements though, using the cmp method.

So the solution looked easy: I needed to subclass XML::XPath::NodeSet and change its sort methods.

But I couldn't...

XML::XPath::NodeSet objects are created within XML::XPath sub classes, not by user code. XML::XPath is an object factory. Practically, this means that the module does not offer any way to subclasse XML::XPath::NodeSet. Interestingly enough XML::Twig used to have the same problem with XML::Twig::Elt objects: as they were created by XML::Twig during the parsing, it was impossible to use subclasses of XML::Twig::Elt to add custom methods. I solved this by adding an option to the constructor of XML::Twig, that lets the user specify an alternate class to use.

This could not be done in this case though, as the XML::XPath::NodeSet constructor is called from places in the code where the initial XML::XPath object is not even in scope. And anyway I did not want to change XML::XPath at all.

So the solution was to use the power of Perl to cheat...

I redefined the sort method directly in the XML::XPath::NodeSet package, from XML::Twig::XPath:

  BEGIN
  { package XML::XPath::NodeSet;
    no warnings;  # to avoid the "Subroutine sort redefined" message
    # replace the native sort routine by a Twig'd one
    sub sort 
      { my $self = CORE::shift; # XML::XPath::NodeSet has a shift method
        @$self = CORE::sort { $a->node_cmp( $b) } @$self;
        return $self;
      }
  }

Note that this removes the original sort in XML::XPath::NodeSet, so you can't use both XML::Twig::XPath and XML::XPath at the same time (you might not want to do this, but it would be convenient for me, to add more tests that would compare the results with the 2 modules). I also sent an email to Matt to tell him about that problem, and he agreed to add a way to set this method cleanly in the next version of XML::XPath.

Now I just had to write the node_cmp methods for the various types of nodes, XML::Twig::XPath, XML::Twig::Elt and XML::Twig::Attribute, which turned out to be a little tedious because each comparison had to test the type of the other node before using the already existing cmp.

Here is for example node_cmp for the XML::Twig::XPath::Attribute package:

  sub node_cmp($$) 
    { my( $a, $b)= @_;
      # $a is always an attribute, but $b can be an attribute, element or document
      if( UNIVERSAL::isa( $b, 'XML::Twig::XPath::Attribute')) 
        { # 2 attributes: compare their elements, then their name if on the same element
          return ($a->{elt}->cmp( $b->{elt}) ) || ($a->{name} cmp $b->{name});
        }
      elsif( UNIVERSAL::isa( $b, 'XML::Twig::XPath::Elt'))
        { # attribute <=> elt : compare the attribute->elt and the elt
          # if attribute->elt is the elt (cmp returns 0) then 1 (elt is before att)
          return ($a->{elt}->cmp( $b) ) || 1 ;
        }
      elsif( UNIVERSAL::isa( $b, 'XML::Twig::XPath'))
        { # attribute <=> document, document is before att
          return 1;
        }
      else
        { die "unknown node type ", ref( $b); }
    }

Once that was done, most tests started passing, with the exception of the namespace one, and the one processing xml:lang attribute.

Writing the namespace processing code was a bit of a pain, due to the complete lack of support for them both in XML::Twig and in my brain. It was very valuable nevertheless, as most of it ended up in the main XML::Twig module, adding some much needed features to the module. Reading the code for XML::XPath and the tests helped a lot here. XML::XPath ended up providing me with the specification for the support of namespaces in XML::Twig. It would have been much harder for me to figure out what to do on my own.

Wrapping it up

Then I added some more tests just to make sure that everything worked fine, this time re-using XML::Twig's original tests. This showed me that I was still lacking a to_number method for all of my new classes.

I proceeded to add some more tests of my own... these methods are really well tested now: HARNESS_PERL_SWITCHES=-MDevel::Cover make test gives me a coverage of 100.0% subroutines, statements and branches, and 92.3% conditions, I suspect the missing condition is a bug in Devel::Cover ;--).

The documentation for the new module was quite easy to write: a quick synopsys, a list of methods, and a reference to XML::Twig and to XML::XPath for aditional details!

Then it was time to release XML::Twig 3.12, which includes XML::Twig::XPath.

But Wait, There's More!

XML::DOM::XPath

Once XML::Twig::XPath was complete, I also realized that the old-but-still-fairly-popular XML::DOM could well benefit from an injection of XPath. So I set out to write XML::DOM::XPath, which adds exactly the same methods to XML::DOM, and will help those poor souls who still have to maintain code written with XML::DOM. This was easier in many ways than the previous task, as XML::DOM and XML::XPath use very similar models for the documents (both are DOM based), and even the methods names are mostly identical. It was a bit of a pain in places though, precisely for the same reason: XML::XPath needed to use methods that already existed in XML::DOM, but with a very different result (in fact usually a different result type). This had to be solved using dirty hacks to return different results depending on the caller. Once again the main problem came from the fact that lots of objects are created during the parsing, with no clean way to change the class of the objects being created (short of re-writing a whole lot of methods).

Overall this was quick and fun, I even got queries using namespaces to work (by grabbing code from XML::Twig this time, sorry, I can't help it), even though XML::DOM does not support them!

Conclusion

Re-using XML::XPath:

Re-using XML::XPath for a different type of tree structure can be very simple. The XML model is a bit complex, but most trees only use attributes or content, not both, and don't have the concept of namespaces, not to mention processing instructions and comments.

For a simple tree of named elements, without any text, all you need is a node class that offers the following methods:

  getName
  getValue
  getRootNode
  getParentNode
  getChildNodes
  getNextSibling
  getPreviousSibling
  isElementNode

If you need text leaves, a text class is in order, with the same methods (getValue returns the text itself), except that isTextNode replaces isElementNode.

Easy hey? If you want attributes, then you will need a getAttributes method for the node class, plus an attribute class that offers getName, getValue and isAttributeNode.

In addition all of these classes need a to_number method that creates a new XML::XPath::Number from the text of the node.

And I pity the fool who wants namespace support on non-XML trees!

Note that you don't need to have separate classes, XML::Twig itself uses a single one for elements and text (and comments and processing instructions), you just need isElementNode or isText to return the appropriate value.

Add to these a custom sort method, et voila! Actually the sort can be quite a pain to write, as we have seen above. If the tree is quasi-static (ie if it is not modified too much and re-numbering it when necessary is not prohibitive), using a global_pos method on nodes and not subclassing XML::XPath::NodeSet might make things easier.

Overall I spent a good day coding this, writing about 200 lines of code between the XML::Twig::XPath module and the original XML::Twig. This is certainly a lot less than I would have spend re-writing an XPath parser on my own.

Considerations on re-using an existing module

The whole exercice goes beyond allowing me to extend XML::Twig for cheap. It also shows a way to re-use the code of a module with as little effort as possible, using grep to isolate an API for the module that was not necessarily designed for re-use. It also shows how to reuse the tests, thus saving even more time.

I have to stress though that this works much better if the existing module is written cleanly, with good OO discipline, consistent naming of variables and good tests (thanks Matt!).

And of course reusing the existing tests does not mean that you do not need more tests.

Finally

As I was procrastinating^Hfinishing writing the docs for the new version of the module, I received an email asking me why $twig->findnodes( q{/Radio/Sender[Gueltigkeiten[Gueltigkeit[@Code='Y2003']] and Gebiete[Gebiet[@Code='GES']]]}) did not work...

I was quite pleased with the answer I could give!