NAME
wtr2_sax_base
SYNOPSIS
perl wtr2_sax_base
DESCRIPTION
This code uses SAX to extract the data from the invoices. It parses the invoice and extracts the relevant data into a Perl data structure, which is then used to check the invoice and update the database.
The first problem to solve when using SAX is that the content of an element can be split across several calls to the characters handler, so I needed to buffer the content. Luckily, Robin Berjon's XML::Filter::BufferText does just that!
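The buffering idea can be sketched in a few lines of plain Perl. This is only an illustration, not the code of XML::Filter::BufferText: the BufferingSketch package and its seen list are hypothetical, and the events are fired by hand instead of by a parser.

```perl
use strict;
use warnings;

# Hypothetical minimal handler, NOT the real XML::Filter::BufferText:
# it accumulates character data and flushes it at the next element boundary.
package BufferingSketch;

sub new { return bless { buffer => '', seen => [] }, shift }

# A parser may deliver "12.50" as "12" then ".50": just append each chunk.
sub characters {
    my ($self, $data) = @_;
    $self->{buffer} .= $data->{Data};
}

sub start_element { $_[0]->_flush }
sub end_element   { $_[0]->_flush }

# Record the complete text once, then reset the buffer.
sub _flush {
    my ($self) = @_;
    push @{ $self->{seen} }, $self->{buffer} if length $self->{buffer};
    $self->{buffer} = '';
}

package main;

my $h = BufferingSketch->new;
$h->start_element({ Name => 'Amount' });
$h->characters({ Data => '12' });      # first chunk
$h->characters({ Data => '.50' });     # second chunk of the same text
$h->end_element({ Name => 'Amount' });

print "$_\n" for @{ $h->{seen} };      # prints "12.50"
```

With the text delivered whole, a downstream handler can treat each characters call as the complete content of an element.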
So I used a SAX machine (see XML::SAX::Machines) to pipe the two handlers: first
XML::Filter::BufferText, then my own handler, wtr2_handler.
Note that SAX::Machines
takes care of
wtr2_handler extracts all the information needed to check the invoice and then
store it in the database. The resulting data (returned by the end_document
handler) is then used by check_invoice and store_invoice.
As this is something that is likely to be quite common and as there are few SAX modules that do this, I decided to go generic: I created a small language to describe how to extract the data and store it in my custom data structure.
The idea is to give an element name (no namespaces are used in this DTD, so
there is no need to get fancy) and associate an action with it. Actions can be
associated with the start of an element or with its content.
At the start of an element it is possible to store attributes or to create
new sub-records for repeatable data in the document, such as InvoiceRow.
The content of an element can be stored, either as top-level data, for non-repeatable data, or in a sub-record, for repeatable data.
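As an illustration of the distinction between top-level data and sub-records, here is a rough sketch of what such a structure could look like. The shape and every field name other than InvoiceRow are assumptions made for the example, not taken from the actual code.

```perl
use strict;
use warnings;

# Hypothetical shape of the extracted data; every field name except
# InvoiceRow is invented for illustration.
my %invoice = (
    number     => 'INV-001',    # top-level, non-repeatable data
    InvoiceRow => [],           # one sub-record per repeated element
);

# Start of a repeatable element: open a new sub-record.
push @{ $invoice{InvoiceRow} }, {};
# Content found inside it goes to the most recent sub-record.
$invoice{InvoiceRow}[-1]{amount} = '12.50';

# A second occurrence of the element gets its own sub-record.
push @{ $invoice{InvoiceRow} }, {};
$invoice{InvoiceRow}[-1]{amount} = '7.00';

printf "%s has %d rows\n", $invoice{number}, scalar @{ $invoice{InvoiceRow} };
```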
The easiest way I found to parse these actions was to use Getopt::Long. Overall this is slightly overkill for this problem, but it could be reused in other cases, so I thought it would be worth showing here.
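A minimal sketch of parsing an action string with Getopt::Long follows. Only --parent is mentioned in this document; the --store and --new-record options, the %action_for table and the parse_action helper are hypothetical names chosen for the example.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical action table: element name => action string.
# Only --parent is named in the text; the other options are made up.
my %action_for = (
    InvoiceRow => '--new-record',
    Amount     => '--store=amount --parent=InvoiceRow',
);

# Turn one action string into a hash of options.
sub parse_action {
    my ($action) = @_;
    my @args = split ' ', $action;    # naive split: no quoted values
    my %opt;
    GetOptionsFromArray(\@args, \%opt, 'new-record', 'store=s', 'parent=s')
        or die "cannot parse action '$action'\n";
    return \%opt;
}

my $opt = parse_action($action_for{Amount});
print "store as: $opt->{store}, only under: $opt->{parent}\n";
```

Reusing an option parser this way means a new kind of action only needs a new option specification, not a new hand-written parser.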
To know, from within the characters handler, which element the parser is in,
I used a stack of element names: the start_element handler pushes the
current element name onto the stack and the end_element handler pops it. This
is the only way to get access to the parent name, needed for the --parent
option.
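The stack technique can be sketched as follows. This is an illustration under assumptions, not the code of wtr2_handler: the StackSketch package and its log are hypothetical, and the SAX events are fired by hand rather than by a parser.

```perl
use strict;
use warnings;

# Hypothetical handler showing only the element-name stack.
package StackSketch;

sub new { return bless { stack => [], log => [] }, shift }

# Entering an element: remember its name.
sub start_element {
    my ($self, $elt) = @_;
    push @{ $self->{stack} }, $elt->{Name};
}

# Leaving an element: forget it.
sub end_element {
    my ($self) = @_;
    pop @{ $self->{stack} };
}

# The top of the stack is the current element, the one below is its parent.
sub characters {
    my ($self, $data) = @_;
    my $current = $self->{stack}[-1] // '';
    my $parent  = @{ $self->{stack} } > 1 ? $self->{stack}[-2] : '';
    push @{ $self->{log} }, "$parent/$current: $data->{Data}";
}

package main;

my $h = StackSketch->new;
$h->start_element({ Name => 'InvoiceRow' });
$h->start_element({ Name => 'Amount' });
$h->characters({ Data => '12.50' });
$h->end_element({ Name => 'Amount' });
$h->end_element({ Name => 'InvoiceRow' });

print "$_\n" for @{ $h->{log} };    # prints "InvoiceRow/Amount: 12.50"
```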
Overall the code was quite a pain to write, especially as the default parser,
XML::LibXML::SAX::Parser, had a problem during my tests because once again I had
upgraded libxml2 but not the Perl module. The hardest part was designing a
way to express what I wanted to extract from the XML document and how to store
it, without resorting to one of those long lists of ifs that I find make
code such a pain to maintain.
AUTHOR
Michel Rodriguez <mirod@xmltwig.com>
LICENSE
This code is Copyright (c) 2003 Michel Rodriguez. All rights reserved.
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
Comments can be sent to mirod@xmltwig.com
SEE ALSO
XML::SAX, XML::SAX::Machines, XML::Filter::BufferText
Ways to Rome 2 - Kourallinen Dollareita: http://www.xmltwig.com/article/ways_to_rome_2/