XML, the Perl Way




  perl wtr2_sax_base


This code uses SAX to extract the data from the invoices. It parses the invoice and extracts the relevant data into a Perl data structure, which is then used to check the invoice and update the database.

The first problem to solve when using SAX is that the content of a single element can be split across several calls to the characters handler, so I needed to buffer the content. Luckily enough, Robin Berjon's XML::Filter::BufferText does just that!
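To make the buffering problem concrete, here is a minimal sketch that does by hand what a buffering filter saves you from writing. It is not the code of XML::Filter::BufferText or of wtr2_handler: the BufferingHandler package and the InvoiceNumber element are made up for illustration, and the SAX events are simulated by calling the handler methods directly, using the Name and Data keys that Perl SAX2 passes to handlers.

```perl
use strict;
use warnings;

# Hypothetical handler (not from the article): accumulates text until the
# element ends, because characters() may be called several times.
package BufferingHandler;

sub new {
    my ($class) = @_;
    return bless { buffer => '', texts => [] }, $class;
}

# A new element starts: begin with an empty buffer
# (good enough for leaf elements, which is all this sketch handles).
sub start_element {
    my ($self, $element) = @_;
    $self->{buffer} = '';
}

# Append each chunk instead of treating it as the whole content.
sub characters {
    my ($self, $chars) = @_;
    $self->{buffer} .= $chars->{Data};
}

# Only at the end of the element is the text known to be complete.
sub end_element {
    my ($self, $element) = @_;
    push @{ $self->{texts} }, [ $element->{Name}, $self->{buffer} ];
    $self->{buffer} = '';
}

package main;

# Simulate a parser delivering one text node in two chunks.
my $handler = BufferingHandler->new;
$handler->start_element({ Name => 'InvoiceNumber' });   # assumed element name
$handler->characters({ Data => 'INV-' });
$handler->characters({ Data => '0042' });
$handler->end_element({ Name => 'InvoiceNumber' });

my ($name, $text) = @{ $handler->{texts}[0] };
print "$name: $text\n";
```

Without the accumulation in characters, the handler would see two separate fragments and could store either one as if it were the full value.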

So I used a SAX machine (using the XML::SAX::Machines manpage) to pipe the two handlers together: first the XML::Filter::BufferText manpage, then my own handler, wtr2_handler. Note that SAX::Machines takes care of

wtr2_handler extracts all the information needed to check the invoice and then store it in the database. The resulting data (returned by the end_document handler) is then used by check_invoice and store_invoice.

As this is something that is likely to be quite common and as there are few SAX modules that do this, I decided to go generic: I created a small language to describe how to extract the data and store it in my custom data structure.

The idea is to give an element name (no namespaces are used in this DTD, so there is no need to get fancy) and associate an action with it. Actions can be associated with the start of an element or with its content. At the start of an element it is possible to store attributes or to create new sub-records, for repeatable data in the document, such as InvoiceRow.

The content of an element can be stored, either as top-level data, for non-repeatable data, or in a sub-record, for repeatable data.

The easiest way I found to parse these actions was to use the Getopt::Long manpage. Overall this is slightly overkill for this problem, but it could be reused in other cases, so I thought it would be worth showing here.
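As an illustration of the approach, here is a small sketch of option-style action strings parsed with GetOptionsFromString from Getopt::Long. It is not the actual language used in wtr2_sax_base: only --parent is named in this text, while the --store and --new-record options, the Amount element and the parse_action helper are assumptions made up for the example.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromString);

# Hypothetical rule table: element name => action string.
# Only --parent comes from the text; the other options are invented here.
my %rules = (
    InvoiceRow => '--new-record rows',                    # start a sub-record
    Amount     => '--store amount --parent InvoiceRow',   # store content in it
);

# Turn one action string into a hash of options.
sub parse_action {
    my ($action) = @_;
    my %opt;
    my ($ok, $rest) = GetOptionsFromString(
        $action, \%opt,
        'store=s',        # field name under which the content is stored
        'new-record=s',   # name of the list of sub-records to extend
        'parent=s',       # only apply when the parent element matches
    );
    die "bad action '$action'\n" unless $ok && !@$rest;
    return \%opt;
}

my %parsed = map { $_ => parse_action($rules{$_}) } keys %rules;

for my $element (sort keys %parsed) {
    my $opt = $parsed{$element};
    print "$element: ",
        join(', ', map { "$_=$opt->{$_}" } sort keys %$opt), "\n";
}
```

The handler can then look up the current element name in the parsed table and act on the options it finds, which keeps the extraction rules in data rather than in a chain of conditionals.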

To know, from within the characters handler, which element the parser is currently in, I used a stack of element names: the start_element handler pushes the current element name onto the stack and the end_element handler pops it. This is the only way to get access to the parent name, needed for the --parent option.
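The stack technique can be sketched as follows. This is an illustrative handler, not wtr2_handler itself: the StackHandler package and the Amount element are assumptions, and the SAX events are again simulated by direct method calls.

```perl
use strict;
use warnings;

# Hypothetical handler (not from the article) that tracks element nesting.
package StackHandler;

sub new {
    my ($class) = @_;
    return bless { stack => [], seen => [] }, $class;
}

# Entering an element: remember its name.
sub start_element {
    my ($self, $element) = @_;
    push @{ $self->{stack} }, $element->{Name};
}

# Leaving an element: forget it again.
sub end_element {
    my ($self, $element) = @_;
    pop @{ $self->{stack} };
}

# The characters event carries no element name, so read it off the stack.
sub characters {
    my ($self, $chars) = @_;
    my $stack   = $self->{stack};
    my $current = $stack->[-1];
    my $parent  = @$stack > 1 ? $stack->[-2] : undef;
    push @{ $self->{seen} },
        { element => $current, parent => $parent, text => $chars->{Data} };
}

package main;

# Simulated events for an Amount element (assumed name) inside InvoiceRow.
my $handler = StackHandler->new;
$handler->start_element({ Name => 'InvoiceRow' });
$handler->start_element({ Name => 'Amount' });
$handler->characters({ Data => '12.50' });
$handler->end_element({ Name => 'Amount' });
$handler->end_element({ Name => 'InvoiceRow' });

my $event = $handler->{seen}[0];
print "$event->{text} found in $event->{element}, child of $event->{parent}\n";
```

The top of the stack is the current element and the entry just below it is the parent, which is the piece of information a --parent condition needs.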

Overall the code was quite a pain to write, especially as the default parser, the XML::LibXML::SAX::Parser manpage, had a problem during my tests: once again I had upgraded libxml2 but not the Perl module. The hardest part was designing a way to express what I wanted to extract from the XML document and how to store it, without resorting to one of those long lists of ifs that make code such a pain to maintain.


Michel Rodriguez <mirod@xmltwig.com>


This code is Copyright (c) 2003 Michel Rodriguez. All rights reserved.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

Comments can be sent to mirod@xmltwig.com


XML::SAX, XML::SAX::Machines, XML::Filter::BufferText

Ways to Rome 2 - Kourallinen Dollareita: http://www.xmltwig.com/article/ways_to_rome_2/