Ways to Rome 2 - Kourallinen Dollareita
last updated Thu Aug 21 18:16:07 2003
Introduction
This article follows the original Ways to Rome article. It shows how to process XML data using some of the numerous XML modules on CPAN.
The task for this article is to process data-oriented XML. Data-oriented, as opposed to text-oriented XML, is XML that essentially contains data, often extracted from a data base, or resulting from the serialization of an object, or just configuration data for a piece of software. A lot of XMl modules are specifically oriented towards processing data-oriented XML, which is typicaly easier to process than text-oriented XML (it is usually more regular, doesn't include mixed-content...).
Specification of the problem
Finvoice
The data to process will be XML invoices. Those invoices follow the Finvoice specification. Finvoice is the electronic invoice of the Finnish Bankers' Association. It is one of the few publically available standard way of exchanging invoices between a seller and a buyer. Visa also has published the Visa XML Invoice Specification which is quite a bit more complex than the Finvoice one, so I stuck to the simpler one. SAP also has a specification for invoices, but it does not seem to be public. Note that none of my emails asking for information and support to Finvoice, Visa and SAP was ever answered, so don't expect much support in the XML invoice World ;--(
The Finvoice DTD is quite comprehensive and look quite usable for general invoicing.
I found that it includes a couple of content model that in my experience are not optimal for processing, where elements that belong together are not wrapped in a containing element:
<! ELEMENT Finvoice (SellerPartyDetails, SellerOrganisationUnitNumber, SellerContactPersonName?, SellerCommunicationDetails?, SellerInformationDetails?, InvoiceRecipientPartyDetails?, BuyerPartyDetails, BuyerOrganisationUnitNumber?, BuyerContactPersonName?, BuyerCommunicationDetails?, DeliveryPartyDetails?, DeliveryDetails?, InvoiceDetails, PaymentStatusDetails?, PartialPaymentDetails*, VirtualBankBarcode?, InvoiceUrlText?, InvoiceRow+, SpecificationDetails?, EpiDetails)> |
I would much rather have this as:
<! ELEMENT Finvoice (SellerInfo, BuyerInfo, DeliveryInfo?, InvoiceDetails, PaymentDetails, InvoiceRows, SpecificationDetails?, EpiDetails)> |
with the sub-elements containing related data. I especially do not like the fact that InvoiceRow is repeated in the "main" Finvoice element, which I don't think is good practice.
Overall the Finvoice DTD is quite simple, most elements have non-contextual names: names that identify them uniquely without needing to know the context. For example there is a SellerStreetName and a BuyerStreetName. While this makes the DTD harder to maintain (if you change the address model you need to change it in several places), it also makes processing quite easier. The only, pretty annoying, exceptions to this rule are the CountryCode and CountryName, which are used in the SellerPostalAddressDetails and BuyerPostalAddressDetails. This is quite a pain because it forces the code to deal with the context. Using non-contextual element names is a valid design option, but not being consistent about it gives you the worst of both worlds: the DTD is hard to maintain and the code has to deal with the context.
In the end Finvoice is quite a convenient DTD and understanding it and using it was not too difficult.
It should be noted that, as for much "standard" DTDs in my experience, it is quite flexible. It allows for exemple invoice rows to be either highly structured (with a lot of optional elements) or... just plain text. This of course increases the complexity for the developer (but makes the DTD usable in the Real World).
A few words about the invoices
These invoice are an interesting class of document: they are clearly data-oriented: thhey have no mixed content and they are very structured. But on the other hand the order of elements is quite significant, at least for display purposes, which is a feature usually found in document-oriented XML. So as long as the task consists solely in checking them and extracting information from them they can be considered as strictly data-oriented, but if we need to modify them, they they should be dealt with as documents.
Additional characteristics of the invoices: they are quite small, and I would hope a single company wouldn't receive several thousands of then every minute. This allows us to load each invoice in memory, and not to have to worry too much about performance (especially as the cost of processing the XML is likely to be small compared to the time spend talking to the data base). This explains the lack of benchmark in this article (at least the initial version).
The task
The task that the various examples will perform is a typical application of what should be done with such data: process invoices, coming in separate files, check them (check that they reference an existing purchase order, that they are addressed to the right company...) and if they are ok store some of the data in a data base (created with DBI and DBD::SQLite for easy set-up). For extra credit the examples can output documents that do not pass the checks, in the original form plus the error messages as new XML elements at the beginning of the document. Note that this is quite arbitrary, and is there mostly to show how to create new elements with each module. A "real" application would likely use a different method to report errors.
Potential problems not dealt with in these examples are:
- Encoding problems: no matter what the encoding of the original documents is (it is usually ISO-8859-1, at least in the examples provided by Finvoice) they are stored as UTF-8 in the DB. This is actually a sane decision to make if your DB supports it, but it might be hard to do if you are dealing with a legacy system. In this case use of Text::Iconv or Encode should solve the problem
The Examples
XML::Simple
XML::Simple is extremely convenient for loading XML data into a Perl data structure, using XMLin
, and then forgetting that it was ever in XML.
The code was very easy to write. I used YAML during the debugging phase to dump the content of the $xml
structure, so it was really easy to see where was the data I needed.
Note the 2 options for XMLin
:
-
forcearray => [ qw(InvoiceRow)]
- this will cause
XMLin
to load the InvoiceRow's in an array, even if there is only one in the document (without that option a single InvoiceRow would be turned into a hash value instead of an array, which would have forced me to test it). -
forcecontent => 1
- despite the documentation for XML::Simple stating that this option is rarely used I found it made it easier and safer to write the code: all text content is stored in the
content
field of a hash. This makes accessing the values a little uglier (you have to add an extra->{content}
in the expression), but in fact it saved me treating differently elements that had attributes and elements which didn't, plus it's an extra security if new attributes are added to some elements of the DTD, or for optional attributes, that would cause the generated data structure to be different depending on them being present or not (for example inOrderedQuantity
theQuantityUnitCode
attribute is optional). If you use XML::Simple for data where the DTD is simpler and where attributes are always set then you might want not to use this option. In this case I found it convenient.
Finally note that XML::Simple cannot output a modified document, as it looses the order of elements in the document (once elements are assigne to a hash their order is lost), so $CAN_OUTPUT is set to 0 and the error messages are output to the console, rather than creating a new document.
This might or might not be a problem, depending on your specific case.
The complete example is in wtr2_simple
1 #!/usr/bin/perl -w 2 use strict; 3 4 use XML::Simple; 5 6 use FindBin qw($Bin); 7 use lib $Bin; 8 use wtr2_base; 9 10 my $DEBUG=0; 11 12 init_db(); 13 14 # XML::Simple cannot output properly the document 15 # the order of the elements will be lost 16 my $CAN_OUTPUT= 0; 17 18 my @files= @ARGV || (<$dir{invoices}/*.xml>); 19 20 foreach my $file (@files) 21 { my $xml= XMLin( $file, forcearray => [ qw(InvoiceRow)], forcecontent => 1); 22 23 my $errors= check_invoice( $xml); 24 25 if( !@$errors) 26 { store_invoice( $xml); } 27 else 28 { print "ERROR in $file\n ", join( "\n ", @$errors), "\n"; 29 if( $CAN_OUTPUT) 30 { my $rejected_file= rejected( $file); 31 print "adding errors in $rejected_file\n" if( $DEBUG); 32 add_errors( $xml, $errors); 33 output_doc_to_check( $rejected_file, $xml); 34 } 35 } 36 } 37 38 exit; 39 40 sub check_invoice 41 { my( $xml)= @_; 42 my $errors=[]; # array ref, holds the error messages 43 44 check_buyer( $xml->{BuyerPartyDetails}->{BuyerPartyIdentifier}->{content}, 45 $xml->{BuyerPartyDetails}->{BuyerOrganisationName}->{content}, 46 $errors 47 ); 48 check_po( $xml->{InvoiceDetails}->{OrderIdentifier}->{content}, $errors); 49 50 my @rows= @{$xml->{InvoiceRow}}; 51 52 reset_default_row_id(); 53 54 foreach my $row( @rows) 55 { # this does not cope well with broken row numbers 56 57 my $row_id= $row->{RowIdentifier}->{content} || default_row_id(); 58 59 print "checking row $row_id\n" if $DEBUG; 60 61 check_qtty( $row_id, 62 $row->{DeliveredQuantity}->{content}, 63 $row->{DeliveredQuantity}->{QuantityUnitCode}, 64 $row->{OrderedQuantity}->{content}, 65 $row->{OrderedQuantity}->{QuantityUnitCode}, 66 $errors 67 ); 68 } 69 70 return $errors; 71 } 72 73 sub store_invoice 74 { my( $xml)= @_; 75 print "storing invoice $xml->{InvoiceDetails}->{InvoiceNumber}->{content}\n"; 76 77 # build the various data structures 78 my $data; 79 80 my $invoice = $xml->{InvoiceDetails}; 81 $data->{invoice} = { number => $invoice->{InvoiceNumber}->{content}, 82 date => $invoice->{InvoiceDate}->{content}, 83 po => $invoice->{OrderIdentifier}->{content}, 84 amount_no_tax => $invoice->{InvoiceTotalVatExcludedAmount}->{content}, 85 tax => $invoice->{InvoiceTotalVatAmount}->{content}, 86 amount => $invoice->{InvoiceTotalVatIncludedAmount}->{content}, 87 payment_status => $xml->{PaymentStatusDetails}->{PaymentStatusCode}->{content}, 88 }; 89 90 my $seller = $xml->{SellerPartyDetails}; 91 $data->{seller} = { identifier => $seller->{SellerPartyIdentifier}->{content}, 92 name => $seller->{SellerOrganisationName}->{content}, 93 tax_code => $seller->{SellerOrganisationTaxCode}->{content}, 94 }; 95 96 my $address = $xml->{SellerPartyDetails}->{SellerPostalAddressDetails}; 97 $data->{address} = { street => $address->{SellerStreetName}->{content}, 98 town => $address->{SellerTownName}->{content}, 99 zip => $address->{SellerPostCodeIdentifier}->{content}, 100 country_code => $address->{CountryCode}->{content}, 101 po_box => $address->{SellerPostOfficeBoxIdentifier}->{content}, 102 }; 103 104 $data->{contact} = { name => $xml->{SellerContactPersonName}->{content}, 105 phone => $xml->{SellerCommunicationDetails}->{SellerPhoneNumberIdentifier}->{content}, 106 email => $xml->{SellerCommunicationDetails}->{SellerEmailaddressIdentifier}->{content}, 107 }; 108 109 $data->{invoicerow} ||= []; 110 reset_default_row_id(); 111 112 foreach my $invoicerow (@{$xml->{InvoiceRow}}) 113 { push @{$data->{invoicerow}}, 114 { row_id => $invoicerow->{RowIdentifier}->{content} || default_row_id(), 115 sku => $invoicerow->{ArticleIdentifier}->{content}, 116 name => $invoicerow->{ArticleName}->{content}, 117 qty => $invoicerow->{DeliveredQuantity}->{content}, 118 qty_unit => $invoicerow->{DeliveredQuantity}->{QuantityUnitCode}, 119 unit_price => $invoicerow->{UnitPriceAmount}->{content}, 120 amount_no_tax => $invoicerow->{RowVatExcludedAmount}->{content}, 121 tax => $invoicerow->{RowVatAmount}->{content}, 122 amount => $invoicerow->{RowAmount}->{content}, 123 } 124 } 125 126 127 store_all( $data); 128 } |
XML::Smart
I wrote this XML::SMART example after having written the XML::Simple one and I must say I was impressed by how compatible it was: it took me a good 10 minutes to modufy the wtr2_simple code to get this one: basically understanding that I needed to work with <$xml-
{Finvoice} >> instead of directly the original $xml
object, and removing all the extra ->{content}
calls as XML::Smart lets me happily get the content of an element using $row->{DeliveredQuantity}
and the value of an attribute with $row->{DeliveredQuantity}->{QuantityUnitCode}
. Very neat.
In 10 more minutes I could add the errors, by just unshifting the newly created container in the document object.
One big annoyance with this module: I had to remove the -w
switch as I kept getting warnings. This should be in the TODO list of the author (along with the unnecessary attacks against XML::Simple in the docs, the module can stand on its own and certainly does not need them).
The complete example is in wtr2_smart
1 #!/usr/bin/perl 2 use strict; 3 4 use XML::Smart; 5 6 use FindBin qw($Bin); 7 use lib $Bin; 8 use wtr2_base; 9 10 init_db(); 11 12 my $DEBUG=0; 13 14 my $CAN_OUTPUT= 1; 15 16 my @files= @ARGV || (<$dir{invoices}/*.xml>); 17 18 foreach my $file (@files) 19 { my $doc= XML::Smart->new( $file); 20 my $xml= $doc->{Finvoice}; 21 22 my $errors= check_invoice( $xml); 23 24 if( !@$errors) 25 { store_invoice( $xml); } 26 else 27 { print "ERROR in $file\n ", join( "\n ", @$errors), "\n"; 28 if( $CAN_OUTPUT) 29 { my $rejected_file= rejected( $file); 30 print "adding errors in $rejected_file\n" if( $DEBUG); 31 add_errors( $doc, $errors); 32 output_doc_to_check( $rejected_file, $doc); 33 } 34 } 35 } 36 37 exit; 38 39 sub check_invoice 40 { my( $xml)= @_; 41 my $errors=[]; # array ref, holds the error messages 42 43 check_buyer( $xml->{BuyerPartyDetails}->{BuyerPartyIdentifier}, 44 $xml->{BuyerPartyDetails}->{BuyerOrganisationName}, 45 $errors 46 ); 47 check_po( $xml->{InvoiceDetails}->{OrderIdentifier}, $errors); 48 49 my @rows= @{$xml->{InvoiceRow}}; 50 51 reset_default_row_id(); 52 53 foreach my $row( @rows) 54 { # this does not cope well with broken row numbers 55 56 my $row_id= $row->{RowIdentifier} || default_row_id(); 57 58 print "checking row $row_id\n" if $DEBUG; 59 60 check_qtty( $row_id, 61 $row->{DeliveredQuantity}, 62 $row->{DeliveredQuantity}->{QuantityUnitCode}, 63 $row->{OrderedQuantity}, 64 $row->{OrderedQuantity}->{QuantityUnitCode}, 65 $errors 66 ); 67 } 68 69 return $errors; 70 } 71 72 sub store_invoice 73 { my( $xml)= @_; 74 print "storing invoice $xml->{InvoiceDetails}->{InvoiceNumber}\n"; 75 76 # build the various data structures 77 my $data; 78 79 my $invoice = $xml->{InvoiceDetails}; 80 $data->{invoice} = { number => $invoice->{InvoiceNumber}, 81 date => $invoice->{InvoiceDate}, 82 po => $invoice->{OrderIdentifier}, 83 amount_no_tax => $invoice->{InvoiceTotalVatExcludedAmount}, 84 tax => $invoice->{InvoiceTotalVatAmount}, 85 amount => $invoice->{InvoiceTotalVatIncludedAmount}, 86 payment_status => $xml->{PaymentStatusDetails}->{PaymentStatusCode}, 87 }; 88 89 my $seller = $xml->{SellerPartyDetails}; 90 $data->{seller} = { identifier => $seller->{SellerPartyIdentifier}, 91 name => $seller->{SellerOrganisationName}, 92 tax_code => $seller->{SellerOrganisationTaxCode}, 93 }; 94 95 my $address = $xml->{SellerPartyDetails}->{SellerPostalAddressDetails}; 96 $data->{address} = { street => $address->{SellerStreetName}, 97 town => $address->{SellerTownName}, 98 zip => $address->{SellerPostCodeIdentifier}, 99 country_code => $address->{CountryCode}, 100 po_box => $address->{SellerPostOfficeBoxIdentifier}, 101 }; 102 103 $data->{contact} = { name => $xml->{SellerContactPersonName}, 104 phone => $xml->{SellerCommunicationDetails}->{SellerPhoneNumberIdentifier}, 105 email => $xml->{SellerCommunicationDetails}->{SellerEmailaddressIdentifier}, 106 }; 107 108 $data->{invoicerow} ||= []; 109 reset_default_row_id(); 110 111 foreach my $invoicerow (@{$xml->{InvoiceRow}}) 112 { push @{$data->{invoicerow}}, 113 { row_id => $invoicerow->{RowIdentifier} || default_row_id(), 114 sku => $invoicerow->{ArticleIdentifier}, 115 name => $invoicerow->{ArticleName}, 116 qty => $invoicerow->{DeliveredQuantity}, 117 qty_unit => $invoicerow->{DeliveredQuantity}->{QuantityUnitCode}, 118 unit_price => $invoicerow->{UnitPriceAmount}, 119 amount_no_tax => $invoicerow->{RowVatExcludedAmount}, 120 tax => $invoicerow->{RowVatAmount}, 121 amount => $invoicerow->{RowAmount}, 122 } 123 } 124 125 126 store_all( $data); 127 } 128 129 130 sub add_errors 131 { my( $doc, $error_messages)= @_; 132 my $errors= { error => [@$error_messages]}; 133 unshift @$doc, $errors; 134 return $doc; 135 } 136 137 sub output_doc_to_check 138 { my( $file, $doc)= @_; 139 open( FILE, ">$file") or die "cannot create file to check $file: $!"; 140 print FILE $doc->data; 141 close FILE; 142 } |
XML::Twig
An XML::Twig version based on the XML::Simple code
This code uses XML::Twig to process the invoices
As this code was written after the wtr2_simple
example, it was very, very easy to write. I blatantly cheated ;--): it uses XML::Twig simplify
method, which generates the same data structure as XML::Simple for a document (or an element) (available only in XML::Twig 3.10). So once the XML document is parsed in memory a call to that method gives me a data structure that can be processed with exactly the same code as in wtr2_simple
.
The original document is still in memory though, which makes it possible to update it with the error messages and output it.
Note that in this example the only modification of the original document is the addition of the error messages. If I had needed to change data within the document before outputing it (for example to link the error messages to the place where th error occurs), then I would have had to access it using XML::Twig methods, see wtr2_twig
.
The complete example is in wtr2_twig_simple
Here are the parts that differ from wtr2_simple:
... 20 { my $doc= XML::Twig->new( pretty_print => 'indented')->parsefile( $file); 21 my $xml= $doc->simplify( forcearray => [ qw(InvoiceRow)], forcecontent => 1); ... |
A "native" XML::Twig version (in full-tree mode)
This code uses XML::Twig to process the invoices. It uses the full-tree mode: load the entire XML document through the parsefile
method and then process it.
It uses mostly navigation to access the information, the first_child
and field
methods. New elements are created using the insert_new_elt
method.
This was easy to write, but you would expect so, as I wrote XML::Twig and I am obviously quite familiar with it ;--)
The complete example is in wtr2_twig
1 #!/usr/bin/perl -w 2 use strict; 3 4 use XML::Twig; 5 6 use FindBin qw($Bin); 7 use lib $Bin; 8 use wtr2_base; 9 10 init_db(); 11 12 my $DEBUG=0; 13 14 # XML::Twig can output the updated document, whith the error messages 15 my $CAN_OUTPUT= 1; 16 17 my @files= @ARGV || (<$dir{invoices}/*.xml>); 18 19 foreach my $file (@files) 20 { my $doc= XML::Twig->new( pretty_print => 'indented')->parsefile( $file); 21 22 my $errors= check_invoice( $doc); 23 24 if( !@$errors) 25 { store_invoice( $doc); } 26 else 27 { print "ERROR in $file\n ", join( "\n ", @$errors), "\n"; 28 if( $CAN_OUTPUT) 29 {my $rejected_file= rejected( $file); 30 print "adding errors in $rejected_file\n" if( $DEBUG); 31 add_errors( $doc, $errors); 32 output_doc_to_check( $rejected_file, $doc); 33 } 34 }; 35 } 36 37 exit; 38 39 sub check_invoice 40 { my( $doc)= @_; 41 my $root= $doc->root; 42 my $errors=[]; # array ref, holds the error messages 43 44 check_buyer( $root->first_child( 'BuyerPartyDetails')->field( 'BuyerPartyIdentifier'), 45 $root->first_child( 'BuyerPartyDetails')->field( 'BuyerOrganisationName'), 46 $errors 47 ); 48 check_po( $root->first_child( 'InvoiceDetails')->field( 'OrderIdentifier'), $errors); 49 50 my @rows= $root->children( 'InvoiceRow'); 51 52 reset_default_row_id(); 53 54 foreach my $row( @rows) 55 { # this does not cope well with broken row numbers 56 57 my $row_id= $row->field( 'RowIdentifier') || default_row_id(); 58 59 print "checking row $row_id\n" if $DEBUG; 60 61 my $DeliveredQuantity = $row->first_child( 'DeliveredQuantity'); 62 my $OrderedQuantity = $row->first_child( 'OrderedQuantity'); 63 my $delivered_qty = $DeliveredQuantity ? $DeliveredQuantity->text : 0; 64 my $delivered_unit = $DeliveredQuantity ? $DeliveredQuantity->att( 'QuantityUnitCode') : ''; 65 my $ordered_qty = $OrderedQuantity ? $OrderedQuantity->text : 0; 66 my $ordered_unit = $OrderedQuantity ? $OrderedQuantity->att( 'QuantityUnitCode') : ''; 67 68 check_qtty( $row_id, $delivered_qty, $delivered_unit, $ordered_qty, $ordered_unit, $errors); 69 } 70 71 return $errors; 72 } 73 74 sub store_invoice 75 { my( $doc)= @_; 76 my $root= $doc->root; 77 print "storing invoice " . $root->first_child( 'InvoiceDetails')->field('InvoiceNumber') . "\n"; 78 79 # build the various data structures 80 my $data; 81 82 my $invoice = $root->first_child( 'InvoiceDetails'); 83 $data->{invoice} = { number => $invoice->field( 'InvoiceNumber'), 84 date => $invoice->field( 'InvoiceDate'), 85 po => $invoice->field( 'OrderIdentifier'), 86 amount_no_tax => $invoice->field( 'InvoiceTotalVatExcludedAmount'), 87 tax => $invoice->field( 'InvoiceTotalVatAmount'), 88 amount => $invoice->field( 'InvoiceTotalVatIncludedAmount'), 89 payment_status => $root->first_child( 'PaymentStatusDetails') 90 ->field( 'PaymentStatusCode'), 91 }; 92 93 my $seller = $root->first_child( 'SellerPartyDetails'); 94 $data->{seller} = { identifier => $seller->field( 'SellerPartyIdentifier'), 95 name => $seller->field( 'SellerOrganisationName'), 96 tax_code => $seller->field( 'SellerOrganisationTaxCode'), 97 }; 98 99 my $address = $root->first_child( 'SellerPartyDetails') 100 ->first_child( 'SellerPostalAddressDetails'); 101 $data->{address} = { street => $address->field( 'SellerStreetName'), 102 town => $address->field( 'SellerTownName'), 103 zip => $address->field( 'SellerPostCodeIdentifier'), 104 country_code => $address->field( 'CountryCode'), 105 po_box => $address->field( 'SellerPostOfficeBoxIdentifier'), 106 }; 107 108 $data->{contact} = { name => $root->field( 'SellerContactPersonName'), 109 phone => $root->first_child( 'SellerCommunicationDetails') 110 ->field( 'SellerPhoneNumberIdentifier'), 111 email => $root->first_child( 'SellerCommunicationDetails') 112 ->field( 'SellerEmailaddressIdentifier'), 113 }; 114 115 $data->{invoicerow} ||= []; 116 reset_default_row_id(); 117 118 foreach my $invoicerow ($root->children( 'InvoiceRow')) 119 { my $DeliveredQuantity= $invoicerow->first_child( 'DeliveredQuantity'); 120 my $qty_unit= $DeliveredQuantity ? $DeliveredQuantity->att( 'QuantityUnitCode') : ''; 121 122 push @{$data->{invoicerow}}, 123 { row_id => $invoicerow->field( 'RowIdentifier') || default_row_id(), 124 sku => $invoicerow->field( 'ArticleIdentifier'), 125 name => $invoicerow->field( 'ArticleName'), 126 qty => $invoicerow->field( 'DeliveredQuantity'), 127 qty_unit => $qty_unit, 128 unit_price => $invoicerow->field( 'UnitPriceAmount'), 129 amount_no_tax => $invoicerow->field( 'RowVatExcludedAmount'), 130 tax => $invoicerow->field( 'RowVatAmount'), 131 amount => $invoicerow->field( 'RowAmount'), 132 } 133 } 134 135 136 store_all( $data); 137 } 138 139 sub add_errors 140 { my( $doc, $error_messages)= @_; 141 my $errors= $doc->root->insert_new_elt( first_child => 'errors'); 142 foreach my $message (@$error_messages) 143 { $errors->insert_new_elt( last_child => error => $message); } 144 return $doc; 145 } 146 147 sub output_doc_to_check 148 { my( $file, $doc)= @_; 149 open( FILE, ">$file") or die "cannot create file to check $file: $!"; 150 $doc->print( \*FILE); 151 close FILE; 152 } |
XML::DOM
This code uses XML::DOM to process the invoices
I have never liked the DOM. The Object Model is good, very complete and solid, but the API is very Java oriented (camels are good as Perl mascottes, not as veryLongMethodNames), and at least at level 1 (XML::DOM is a level 1 DOM implementation) quite weak and indeed dangerous.
Having gotten this out of my system...
Writing the DOM example wasn't particulary hard. It was long and quite painful, but more boring than difficult.
A minor annoyance, that could actually be a blessing in a different context: every time I needed to access an attribute for an optional element (DeliveredQuantity for example) I had to check the existence of the element, or calling getAttribute
on undef
(the non existent element) would cause the script to die
. XML::Simple and the likes, by contrast, would let me access non existent hash values in the Perl data structure without complaining. Having to check is a pain in a short script like this one, but could be very useful in a bigger project, as it is always better to have the code die with a bang than fail silently (and cause errors later).
I wrote a little layer on top of it, with the first_child, children and text functions, that basically ensure that when I get a child I get the proper one, and not an extra whitespace, comment or other, that would happen to be in the XML.
The complete example is in wtr2_dom
1 #!/usr/bin/perl -w 2 use strict; 3 4 use XML::DOM; 5 6 use FindBin qw($Bin); 7 use lib $Bin; 8 use wtr2_base; 9 10 init_db(); 11 12 my $DEBUG=0; 13 14 my $CAN_OUTPUT= 1; 15 16 my @files= @ARGV || (<$dir{invoices}/*.xml>); 17 18 foreach my $file (@files) 19 { my $parser = new XML::DOM::Parser; 20 my $doc = $parser->parsefile ( $file); 21 22 my $errors= check_invoice( $doc); 23 24 if( !@$errors) 25 { store_invoice( $doc); } 26 else 27 { print "ERROR in $file\n ", join( "\n ", @$errors), "\n"; 28 if( $CAN_OUTPUT) 29 { my $rejected_file= rejected( $file); 30 print "adding errors in $rejected_file\n" if( $DEBUG); 31 add_errors( $doc, $errors); 32 output_doc_to_check( $rejected_file, $doc); 33 } 34 }; 35 } 36 37 exit; 38 39 sub check_invoice 40 { my( $doc)= @_; 41 my $root= $doc->getDocumentElement; # so we really work with the root element 42 my $errors=[]; # array ref, holds the error messages 43 44 my $BuyerPartyDetails = first_child( $root, 'BuyerPartyDetails'); 45 my $BuyerPartyIdentifier = first_child( $BuyerPartyDetails, 'BuyerPartyIdentifier'); 46 my $BuyerOrganisationName = first_child( $BuyerPartyDetails, 'BuyerOrganisationName'); 47 check_buyer( text( $BuyerPartyIdentifier), 48 text( $BuyerOrganisationName), 49 $errors 50 ); 51 my $InvoiceDetails = first_child( $root, 'InvoiceDetails'); 52 my $OrderIdentifier = first_child( $InvoiceDetails, 'OrderIdentifier'); 53 check_po( text( $OrderIdentifier), $errors); 54 55 my @rows= children( $root, 'InvoiceRow'); 56 57 reset_default_row_id(); 58 59 foreach my $row ( @rows) 60 { # this does not cope well with broken row numbers 61 62 my $row_id= text( first_child( $row, 'RowIdentifier')) || default_row_id(); 63 64 print "checking row $row_id\n" if $DEBUG; 65 66 my $DeliveredQuantity= first_child( $row, 'DeliveredQuantity'); 67 my $OrderedQuantity = first_child( $row, 'OrderedQuantity'); 68 69 my $delivered_qty = text( $DeliveredQuantity); 70 my $delivered_unit = $DeliveredQuantity ? $DeliveredQuantity->getAttribute( 'QuantityUnitCode') : ''; 71 my $ordered_qty = text( $OrderedQuantity); 72 my $ordered_unit = $OrderedQuantity ? $OrderedQuantity->getAttribute( 'QuantityUnitCode') : ''; 73 74 check_qtty( $row_id, $delivered_qty, $delivered_unit, $ordered_qty, $ordered_unit, $errors); 75 } 76 77 return $errors; 78 } 79 80 sub store_invoice 81 { my( $doc)= @_; 82 my $root= $doc->getDocumentElement; # so we really work with the root element 83 84 print "storing invoice ", text( first_child( first_child( $root, 'InvoiceDetails'), 'InvoiceNumber')), "\n"; 85 86 # build the various data structures 87 my $data; 88 89 my $invoice = first_child( $root, 'InvoiceDetails'); 90 $data->{invoice} = { number => text( first_child( $invoice, 'InvoiceNumber')), 91 date => text( first_child( $invoice, 'InvoiceDate')), 92 po => text( first_child( $invoice, 'OrderIdentifier')), 93 amount_no_tax => text( first_child( $invoice, 'InvoiceTotalVatExcludedAmount')), 94 tax => text( first_child( $invoice, 'InvoiceTotalVatAmount')), 95 amount => text( first_child( $invoice, 'InvoiceTotalVatIncludedAmount')), 96 payment_status => text( first_child( first_child( $root, 'PaymentStatusDetails'), 'PaymentStatusCode')), 97 }; 98 99 my $seller = first_child( $root, 'SellerPartyDetails'); 100 $data->{seller} = { identifier => text( first_child( $seller, 'SellerPartyIdentifier')), 101 name => text( first_child( $seller, 'SellerOrganisationName')), 102 tax_code => text( first_child( $seller, 'SellerOrganisationTaxCode')), 103 }; 104 105 my $SellerPartyDetails = first_child( $root, 'SellerPartyDetails'); 106 my $address = first_child( $SellerPartyDetails, 'SellerPostalAddressDetails'); 107 $data->{address} = { street => text( first_child( $address, 'SellerStreetName')), 108 town => text( first_child( $address, 'SellerTownName')), 109 zip => text( first_child( $address, 'SellerPostCodeIdentifier')), 110 country_code => text( first_child( $address, 'CountryCode')), 111 po_box => text( first_child( $address, 'SellerPostOfficeBoxIdentifier')), 112 }; 113 114 my $contact = first_child( $root, 'SellerCommunicationDetails'); 115 $data->{contact} = { name => text( first_child( $root, 'SellerContactPersonName')), 116 phone => text( first_child( $contact, 'SellerPhoneNumberIdentifier')), 117 email => text( first_child( $contact, 'SellerEmailaddressIdentifier')), 118 }; 119 120 $data->{invoicerow} ||= []; 121 reset_default_row_id(); 122 123 foreach my $invoicerow ( children($root, 'InvoiceRow')) 124 { # need to check that the DeliveredQuantity element is present before getting its attribute 125 my $DeliveredQuantity= first_child( $invoicerow, 'DeliveredQuantity'); 126 my $qty_unit= $DeliveredQuantity ? $DeliveredQuantity->getAttribute( 'QuantityUnitCode') : ''; 127 128 push @{$data->{invoicerow}}, 129 { row_id => text( first_child( $invoicerow, 'RowIdentifier')) || default_row_id(), 130 sku => text( first_child( $invoicerow, 'ArticleIdentifier')), 131 name => text( first_child( $invoicerow, 'ArticleName')), 132 qty => text( $DeliveredQuantity), 133 qty_unit => $qty_unit, 134 unit_price => text( first_child( $invoicerow, 'UnitPriceAmount')), 135 amount_no_tax => text( first_child( $invoicerow, 'RowVatExcludedAmount')), 136 tax => text( first_child( $invoicerow, 'RowVatAmount')), 137 amount => text( first_child( $invoicerow, 'RowAmount')), 138 } 139 } 140 141 142 store_all( $data); 143 } 144 145 sub add_errors 146 { my( $doc, $error_messages)= @_; 147 my $root= $doc->getDocumentElement; 148 my $errors= $doc->createElement( 'errors'); 149 $root->insertBefore ( $errors, $root->getFirstChild); 150 foreach my $message (@$error_messages) 151 { my $error= $doc->createElement( 'error'); 152 $errors->appendChild( $error); 153 # those 2 lines could be replaced by $error->addText( $message) 154 # which is not in the DOM spec 155 my $text= $doc->createTextNode( $message); 156 $error->appendChild( $text); 157 } 158 return $doc; 159 } 160 161 sub output_doc_to_check 162 { my( $file, $doc)= @_; 163 open( FILE, ">$file") or die "cannot create file to check $file: $!"; 164 print FILE $doc->toString; 165 close FILE; 166 } 167 168 # this is very important,this ensures that we get the proper child 169 # never use XML::DOM's getFirstChild method directly! 170 sub first_child 171 { my( $node, $tag)= @_; 172 my $child= $node->getFirstChild or return undef; 173 while( $child && ($child->getNodeName ne $tag) ) 174 { $child= $child->getNextSibling; } 175 return $child; 176 } 177 178 sub children 179 { my( $node, $tag)= @_; 180 my @children; 181 my $child= first_child( $node, $tag) or return undef; 182 push @children, $child; 183 while( $child= $child->getNextSibling) 184 { push @children, $child if( $child->getNodeName eq $tag); } 185 return @children; 186 } 187 188 # node must include only text (and comments) 189 sub text 190 { my( $node)= @_; 191 unless( $node) { return undef; } 192 my $text=''; 193 foreach my $child ($node->getChildNodes) 194 { if( $child->getNodeName eq '#text') 195 { $text.= $child->getData; } 196 } 197 return $text; 198 } |
XML::EasyOBJ
This code uses XML::EasyOBJ to process the invoices
XML::EasyOBJ is built on top of XML::DOM, and mostly allows for easy navigation by letting you write $invoice= $doc->InvoiceDetails->InvoiceNumber
to get the invoice number element, and $invoice->getString
to get its value.
The fact that tag names (from the document) are used as method names in the code feels a little weird, and it can make name collision possible (XML::EasyOBJ lets you rename methods though, so you can deal with this problem). It makes it really easy to work with these invoices though, the code was easy to write and
The bad news here is that XML::EasyOBJ mostly helps you accessing the data. If you want to create new elements you have to use the DOM methods.
Overall XML::EasyOBJ feels like a very convenient layer on top of the DOM, which lets you easily navigate and access the data in the DOM (making easy things easy), and lets you use native DOM methods for advanced (making harder things possible).
The complete example is in wtr2_easyobj
1 #!/usr/bin/perl -w 2 use strict; 3 4 use XML::EasyOBJ; 5 6 use FindBin qw($Bin); 7 use lib $Bin; 8 use wtr2_base; 9 10 init_db(); 11 12 my $DEBUG=0; 13 14 my $CAN_OUTPUT= 0; 15 16 my @files= @ARGV || (<$dir{invoices}/*.xml>); 17 18 foreach my $file (@files) 19 { my $doc = XML::EasyOBJ->new( $file); 20 21 my $errors= check_invoice( $doc); 22 23 if( !@$errors) 24 { store_invoice( $doc); } 25 else 26 { print "ERROR in $file\n ", join( "\n ", @$errors), "\n"; 27 if( $CAN_OUTPUT) 28 { my $rejected_file= rejected( $file); 29 print "adding errors in $rejected_file\n" if( $DEBUG); 30 my $dom= $doc->getDomObj->getOwnerDocument; # we need to use the DOM for this 31 add_errors( $dom, $errors); 32 output_doc_to_check( $rejected_file, $dom); 33 } 34 }; 35 } 36 37 exit; 38 39 sub check_invoice 40 { my( $doc)= @_; 41 my $errors=[]; # array ref, holds the error messages 42 43 my $BuyerPartyDetails = $doc->BuyerPartyDetails; 44 my $BuyerPartyIdentifier = $BuyerPartyDetails->BuyerPartyIdentifier; 45 my $BuyerOrganisationName = $BuyerPartyDetails->BuyerOrganisationName; 46 check_buyer( $BuyerPartyIdentifier->getString, 47 $BuyerOrganisationName->getString, 48 $errors 49 ); 50 my $InvoiceDetails = $doc->InvoiceDetails; 51 my $OrderIdentifier = $InvoiceDetails->OrderIdentifier; 52 check_po( $OrderIdentifier->getString, $errors); 53 54 my @rows= $doc->InvoiceRow; 55 56 reset_default_row_id(); 57 58 foreach my $row ( @rows) 59 { # this does not cope well with broken row numbers 60 61 my $row_id= $row->RowIdentifier->getString || default_row_id(); 62 63 print "checking row $row_id\n" if $DEBUG; 64 65 my $DeliveredQuantity= $row->DeliveredQuantity; 66 my $OrderedQuantity = $row->OrderedQuantity; 67 68 my( $delivered_qty, $delivered_unit, $ordered_qty, $ordered_unit)= ('','','',''); 69 if( $DeliveredQuantity) 70 { $delivered_qty = $DeliveredQuantity->getString; 71 $delivered_unit = $DeliveredQuantity->getAttr( 'QuantityUnitCode'); 72 } 73 if( $OrderedQuantity) 74 { $ordered_qty = $OrderedQuantity->getString; 75 $ordered_unit = $OrderedQuantity->getAttr( 'QuantityUnitCode'); 76 } 77 78 check_qtty( $row_id, $delivered_qty, $delivered_unit, $ordered_qty, $ordered_unit, $errors); 79 } 80 81 return $errors; 82 } 83 84 sub store_invoice 85 { my( $doc)= @_; 86 87 print "storing invoice ", $doc->InvoiceDetails->InvoiceNumber->getString, "\n"; 88 89 # build the various data structures 90 my $data; 91 92 my $invoice = $doc->InvoiceDetails; 93 $data->{invoice} = { number => $invoice->InvoiceNumber->getString, 94 date => $invoice->InvoiceDate->getString, 95 po => $invoice->OrderIdentifier->getString, 96 amount_no_tax => $invoice->InvoiceTotalVatExcludedAmount->getString, 97 tax => $invoice->InvoiceTotalVatAmount->getString, 98 amount => $invoice->InvoiceTotalVatIncludedAmount->getString, 99 payment_status => $doc->PaymentStatusDetails->PaymentStatusCode->getString, 100 }; 101 102 my $seller = $doc->SellerPartyDetails; 103 $data->{seller} = { identifier => $seller->SellerPartyIdentifier->getString, 104 name => $seller->SellerOrganisationName->getString, 105 tax_code => $seller->SellerOrganisationTaxCode->getString, 106 }; 107 108 my $address = $doc->SellerPartyDetails->SellerPostalAddressDetails; 109 $data->{address} = { street => $address->SellerStreetName->getString, 110 town => $address->SellerTownName->getString, 111 zip => $address->SellerPostCodeIdentifier->getString, 112 country_code => $address->CountryCode->getString, 113 po_box => $address->SellerPostOfficeBoxIdentifier->getString, 114 }; 115 116 my $contact = $doc->SellerCommunicationDetails; 117 $data->{contact} = { name => $doc->SellerContactPersonName->getString, 118 phone => $contact->SellerPhoneNumberIdentifier->getString, 119 email => $contact->SellerEmailaddressIdentifier->getString, 120 }; 121 122 $data->{invoicerow} ||= []; 123 reset_default_row_id(); 124 125 foreach my $invoicerow ( $doc->InvoiceRow) 126 { # need to check that the DeliveredQuantity element is present before getting its attribute 127 my $DeliveredQuantity= $invoicerow->DeliveredQuantity; 128 my( $qty, $qty_unit)= ('',''); 129 if( $DeliveredQuantity) 130 { $qty = $DeliveredQuantity->getString; 131 $qty_unit = $DeliveredQuantity->getAttr( 'QuantityUnitCode'); 132 } 133 134 push @{$data->{invoicerow}}, 135 { row_id => $invoicerow->RowIdentifier->getString || default_row_id(), 136 sku => $invoicerow->ArticleIdentifier->getString, 137 name => $invoicerow->ArticleName->getString, 138 qty => $qty, 139 qty_unit => $qty_unit, 140 unit_price => $invoicerow->UnitPriceAmount->getString, 141 amount_no_tax => $invoicerow->RowVatExcludedAmount->getString, 142 tax => $invoicerow->RowVatAmount->getString, 143 amount => $invoicerow->RowAmount->getString, 144 } 145 } 146 147 148 store_all( $data); 149 } 150 151 sub add_errors 152 { my( $doc, $error_messages)= @_; 153 my $root= $doc->getDocumentElement; 154 my $errors= $doc->createElement( 'errors'); 155 $root->insertBefore ( $errors, $root->getFirstChild); 156 foreach my $message (@$error_messages) 157 { my $error= $doc->createElement( 'error'); 158 $errors->appendChild( $error); 159 # those 2 lines could be replaced by $error->addText( $message) 160 # which is not in the DOM spec 161 my $text= $doc->createTextNode( $message); 162 $error->appendChild( $text); 163 } 164 return $doc; 165 } 166 167 sub output_doc_to_check 168 { my( $file, $doc)= @_; 169 open( FILE, ">$file") or die "cannot create file to check $file: $!"; 170 print FILE $doc->toString; 171 close FILE; 172 } |
XML::LibXML
An XML::LibXML version based on the XML::DOM code
This code uses XML::LibXML to process the invoices
It is directly derived from the XML::DOM code in wtr2_dom. It only differs where XML::DOM and XML::LibXML DOM methods or constants have different names:
XML::DOM XML::LibXML
getNodeName nodeName method #text text constant returned by nodeName/getNodeName
This was very easy to write, but does not take advantage of XML::LibXML's best feature: its support for XPath. See wtr2_libxml
for a slightly different version, made a lot safer by using XPath queries instead of navigation methods (like getFirstChild
) to access the data.
This exemple is here just to show how easy it is to port code from XML::DOM (which IMHO should be deprecated) to XML::LibXML, and then take advantage of XML::LibXML more powerful features.
The complete example is in wtr2_libxml_dom
Here are the parts that differ from wtr2_dom:
... 19 { my $parser = new XML::LibXML; 20 my $doc = $parser->parse_file ( $file); ... 173 while( $child && ($child->nodeName ne $tag) ) ... 184 { push @children, $child if( $child->nodeName eq $tag); } ... 194 { if( $child->nodeName eq 'text') ... |
A "native" XML::LibXML version
This code uses XML::LibXML to process the invoices. It uses a lot of XML::LibXML specific methods instead of using the DOM. This makes for nicer and safer code. XML::LibXML is a Perl wrapper on top of the libxml2 library, an XML/XPath/DOM/RelaxNG/... library written by Daniel Veillard for the Gnome project.
Instead of using navigation methods (getFirstChild
) it relies mostly on findnodes
and findvalue
, which use XPath to select nodes to access.
The error message is built as text and then parsed using parse_xml_chunk
. While this might not be the best method I found it really nice to use.
Overall the code was quite easy to write, all the extra goodies provided by XML::LibXML compared to the DOM are really useful and make it much easier to write compact and safe code.
My main gripe with XML::LibXML is that it is often unstable, as the Perl module tries to keep up with the development of the librarie. The problem is that libxml2 is so widely used that I find that it is often upgraded by unrelated software, which can then cause trouble to code that uses XML::LibXML.
The complete example is in wtr2_libxml
1 #!/usr/bin/perl -w 2 use strict; 3 4 use XML::LibXML; 5 6 use FindBin qw($Bin); 7 use lib $Bin; 8 use wtr2_base; 9 10 init_db(); 11 12 my $DEBUG=0; 13 14 my $CAN_OUTPUT= 1; 15 16 my @files= @ARGV || (<$dir{invoices}/*.xml>); 17 18 foreach my $file (@files) 19 { my $parser = new XML::LibXML; 20 my $doc = $parser->parse_file ( $file); 21 22 my $errors= check_invoice( $doc); 23 24 if( !@$errors) 25 { store_invoice( $doc); } 26 else 27 { print "ERROR in $file\n ", join( "\n ", @$errors), "\n"; 28 if( $CAN_OUTPUT) 29 { my $rejected_file= rejected( $file); 30 print "adding errors in $rejected_file\n" if( $DEBUG); 31 add_errors( $parser, $doc, $errors); # the parser is passed so it can be re-used 32 output_doc_to_check( $rejected_file, $doc); 33 } 34 }; 35 } 36 37 exit; 38 39 sub check_invoice 40 { my( $doc)= @_; 41 my $errors=[]; # array ref, holds the error messages 42 43 check_buyer( $doc->findvalue( '/Finvoice/BuyerPartyDetails/BuyerPartyIdentifier'), 44 $doc->findvalue( '/Finvoice/BuyerPartyDetails/BuyerOrganisationName'), 45 $errors 46 ); 47 check_po( $doc->findvalue( '/Finvoice/InvoiceDetails/OrderIdentifier'), $errors); 48 49 my @rows= $doc->findnodes( '/Finvoice/InvoiceRow'); 50 51 reset_default_row_id(); 52 53 foreach my $row ( @rows) 54 { # this does not cope well with broken row numbers 55 56 my $row_id= $row->findvalue( 'RowIdentifier') || default_row_id(); 57 58 print "checking row $row_id\n" if $DEBUG; 59 60 my( $delivered_qty, $delivered_unit, $ordered_qty, $ordered_unit)= ( '','','',''); 61 62 if( my $DeliveredQuantity= $row->findnodes( 'DeliveredQuantity')->[0]) 63 { $delivered_qty = $DeliveredQuantity->textContent; 64 $delivered_unit = $DeliveredQuantity->getAttribute( 'QuantityUnitCode'); 65 } 66 if( my $OrderedQuantity= $row->findnodes( 'OrderedQuantity')->[0]) 67 { $ordered_qty = $OrderedQuantity->textContent; 68 $ordered_unit = $OrderedQuantity->getAttribute( 'QuantityUnitCode'); 69 } 70 71 check_qtty( $row_id, $delivered_qty, $delivered_unit, $ordered_qty, $ordered_unit, $errors); 72 } 73 74 return $errors; 75 } 76 77 sub store_invoice 78 { my( $doc)= @_; 79 80 my $invoice_number= $doc->findvalue( '/Finvoice/InvoiceDetails/InvoiceNumber'); 81 print "storing invoice $invoice_number\n"; 82 83 # build the various data structures 84 my $data; 85 86 my $invoice = $doc->findnodes( '/Finvoice/InvoiceDetails')->[0]; 87 $data->{invoice} = { number => $invoice->findvalue( 'InvoiceNumber'), 88 date => $invoice->findvalue( 'InvoiceDate'), 89 po => $invoice->findvalue( 'OrderIdentifier'), 90 amount_no_tax => $invoice->findvalue( 'InvoiceTotalVatExcludedAmount'), 91 tax => $invoice->findvalue( 'InvoiceTotalVatAmount'), 92 amount => $invoice->findvalue( 'InvoiceTotalVatIncludedAmount'), 93 payment_status => $doc->findvalue( '/Finvoice/PaymentStatusDetails/PaymentStatusCode'), 94 }; 95 96 my $seller = $doc->findnodes( '/Finvoice/SellerPartyDetails')->[0]; 97 $data->{seller} = { identifier => $seller->findvalue( 'SellerPartyIdentifier'), 98 name => $seller->findvalue( 'SellerOrganisationName'), 99 tax_code => $seller->findvalue( 'SellerOrganisationTaxCode'), 100 }; 101 102 my $address = $doc->findnodes( '/Finvoice/SellerPartyDetails/SellerPostalAddressDetails')->[0]; 103 $data->{address} = { street => $address->findvalue( 'SellerStreetName'), 104 town => $address->findvalue( 'SellerTownName'), 105 zip => $address->findvalue( 'SellerPostCodeIdentifier'), 106 country_code => $address->findvalue( 'CountryCode'), 107 po_box => $address->findvalue( 'SellerPostOfficeBoxIdentifier'), 108 }; 109 110 my $contact = $doc->findnodes( '/Finvoice/SellerCommunicationDetails')->[0]; 111 $data->{contact} = { name => $doc->findvalue( '/Finvoice/SellerContactPersonName'), 112 phone => $contact->findvalue( 'SellerPhoneNumberIdentifier'), 113 email => $contact->findvalue( 'SellerEmailaddressIdentifier'), 114 }; 115 116 $data->{invoicerow} ||= []; 117 reset_default_row_id(); 118 119 foreach my $invoicerow ( $doc->findnodes( '/Finvoice/InvoiceRow')) 120 { # need to check that the DeliveredQuantity element is present before getting its attribute 121 my $DeliveredQuantity= $invoicerow->getChildrenByTagName( 'DeliveredQuantity')->[0]; 122 my $qty = $DeliveredQuantity ? $DeliveredQuantity->textContent : ''; 123 my $qty_unit= $DeliveredQuantity ? $DeliveredQuantity->getAttribute( 'QuantityUnitCode') : ''; 124 125 push @{$data->{invoicerow}}, 126 { row_id => $invoicerow->findvalue( 'RowIdentifier') || default_row_id(), 127 sku => $invoicerow->findvalue( 'ArticleIdentifier'), 128 name => $invoicerow->findvalue( 'ArticleName'), 129 qty => $qty, 130 qty_unit => $qty_unit, 131 unit_price => $invoicerow->findvalue( 'UnitPriceAmount'), 132 amount_no_tax => $invoicerow->findvalue( 'RowVatExcludedAmount'), 133 tax => $invoicerow->findvalue( 'RowVatAmount'), 134 amount => $invoicerow->findvalue( 'RowAmount'), 135 } 136 } 137 138 store_all( $data); 139 } 140 141 sub add_errors 142 { my( $parser, $doc, $error_messages)= @_; 143 my $root= $doc->documentElement(); 144 # here I chose to build the error messages as text and then parse them 145 my $chunk= "\n <errors>\n " . join( "\n ", map { "<error>$_</error>" } @$error_messages) . "\n </errors>"; 146 my $errors= $parser->parse_xml_chunk( $chunk ); 147 $root->insertBefore ( $errors, $root->getFirstChild); 148 return $doc; 149 } 150 151 sub output_doc_to_check 152 { my( $file, $doc)= @_; 153 open( FILE, ">$file") or die "cannot create file to check $file: $!"; 154 print FILE $doc->toString; 155 close FILE; 156 } |
XML::XPath version
This code uses XML::XPath. Accessing the data, the check_invoice
and store_invoice
functions, is very similar to the code using XML::LibXML (not surprisingly as the 2 modules were both written by Matt Sergeant ;--) The only difference are the way to create the document object and the name of the method used to get the text of an element (string_value
instead of textContent
).
Creating the errors
element proved a little more challenging.
The complete example is in wtr2_xpath
1 #!/usr/bin/perl -w 2 use strict; 3 4 use XML::XPath; 5 use XML::XPath::XMLParser; 6 7 8 use FindBin qw($Bin); 9 use lib $Bin; 10 use wtr2_base; 11 12 init_db(); 13 14 my $DEBUG=0; 15 16 my $CAN_OUTPUT= 1; 17 18 my @files= @ARGV || (<$dir{invoices}/*.xml>); 19 20 my $doc; # note that putting the my _in_ the loop causes an Out of Memory 21 # error after just a few documents 22 23 foreach my $file (@files) 24 { $doc= XML::XPath->new( filename => $file); 25 26 my $errors= check_invoice( $doc); 27 28 if( !@$errors) 29 { store_invoice( $doc); } 30 else 31 { print "ERROR in $file\n ", join( "\n ", @$errors), "\n"; 32 if( $CAN_OUTPUT) 33 { my $rejected_file= rejected( $file); 34 print "adding errors in $rejected_file\n" if( $DEBUG); 35 add_errors( $doc, $errors); 36 output_doc_to_check( $rejected_file, $doc); 37 } 38 }; 39 } 40 41 exit; 42 43 sub check_invoice 44 { my( $doc)= @_; 45 my $errors=[]; # array ref, holds the error messages 46 47 check_buyer( $doc->findvalue( '/Finvoice/BuyerPartyDetails/BuyerPartyIdentifier'), 48 $doc->findvalue( '/Finvoice/BuyerPartyDetails/BuyerOrganisationName'), 49 $errors 50 ); 51 check_po( $doc->findvalue( '/Finvoice/InvoiceDetails/OrderIdentifier'), $errors); 52 53 my @rows= $doc->findnodes( '/Finvoice/InvoiceRow'); 54 55 reset_default_row_id(); 56 57 foreach my $row ( @rows) 58 { # this does not cope well with broken row numbers 59 60 my $row_id= $row->findvalue( 'RowIdentifier') || default_row_id(); 61 62 print "checking row $row_id\n" if $DEBUG; 63 64 my( $delivered_qty, $delivered_unit, $ordered_qty, $ordered_unit)= ( '','','',''); 65 66 if( my $DeliveredQuantity= $row->findnodes( 'DeliveredQuantity')->[0]) 67 { $delivered_qty = $DeliveredQuantity->string_value; 68 $delivered_unit = $DeliveredQuantity->getAttribute( 'QuantityUnitCode'); 69 } 70 if( my $OrderedQuantity= $row->findnodes( 'OrderedQuantity')->[0]) 71 { $ordered_qty = $OrderedQuantity->string_value; 72 $ordered_unit = $OrderedQuantity->getAttribute( 'QuantityUnitCode'); 73 } 74 75 check_qtty( $row_id, $delivered_qty, $delivered_unit, $ordered_qty, $ordered_unit, $errors); 76 } 77 78 return $errors; 79 } 80 81 sub store_invoice 82 { my( $doc)= @_; 83 84 my $invoice_number= $doc->findvalue( '/Finvoice/InvoiceDetails/InvoiceNumber'); 85 print "storing invoice $invoice_number\n"; 86 87 # build the various data structures 88 my $data; 89 90 my $invoice = $doc->findnodes( '/Finvoice/InvoiceDetails')->[0]; 91 $data->{invoice} = { number => $invoice->findvalue( 'InvoiceNumber'), 92 date => $invoice->findvalue( 'InvoiceDate'), 93 po => $invoice->findvalue( 'OrderIdentifier'), 94 amount_no_tax => $invoice->findvalue( 'InvoiceTotalVatExcludedAmount'), 95 tax => $invoice->findvalue( 'InvoiceTotalVatAmount'), 96 amount => $invoice->findvalue( 'InvoiceTotalVatIncludedAmount'), 97 payment_status => $doc->findvalue( '/Finvoice/PaymentStatusDetails/PaymentStatusCode'), 98 }; 99 100 my $seller = $doc->findnodes( '/Finvoice/SellerPartyDetails')->[0]; 101 $data->{seller} = { identifier => $seller->findvalue( 'SellerPartyIdentifier'), 102 name => $seller->findvalue( 'SellerOrganisationName'), 103 tax_code => $seller->findvalue( 'SellerOrganisationTaxCode'), 104 }; 105 106 my $address = $doc->findnodes( '/Finvoice/SellerPartyDetails/SellerPostalAddressDetails')->[0]; 107 $data->{address} = { street => $address->findvalue( 'SellerStreetName'), 108 town => $address->findvalue( 'SellerTownName'), 109 zip => $address->findvalue( 'SellerPostCodeIdentifier'), 110 country_code => $address->findvalue( 'CountryCode'), 111 po_box => $address->findvalue( 'SellerPostOfficeBoxIdentifier'), 112 }; 113 114 my $contact = $doc->findnodes( '/Finvoice/SellerCommunicationDetails')->[0]; 115 $data->{contact} = { name => $doc->findvalue( '/Finvoice/SellerContactPersonName'), 116 phone => $contact->findvalue( 'SellerPhoneNumberIdentifier'), 117 email => $contact->findvalue( 'SellerEmailaddressIdentifier'), 118 }; 119 120 $data->{invoicerow} ||= []; 121 reset_default_row_id(); 122 123 foreach my $invoicerow ( $doc->findnodes( '/Finvoice/InvoiceRow')) 124 { # need to check that the DeliveredQuantity element is present before getting its attribute 125 my $DeliveredQuantity= $invoicerow->findnodes( 'DeliveredQuantity')->[0]; 126 my $qty = $DeliveredQuantity ? $DeliveredQuantity->string_value : ''; 127 my $qty_unit= $DeliveredQuantity ? $DeliveredQuantity->getAttribute( 'QuantityUnitCode') : ''; 128 129 push @{$data->{invoicerow}}, 130 { row_id => $invoicerow->findvalue( 'RowIdentifier') || default_row_id(), 131 sku => $invoicerow->findvalue( 'ArticleIdentifier'), 132 name => $invoicerow->findvalue( 'ArticleName'), 133 qty => $qty, 134 qty_unit => $qty_unit, 135 unit_price => $invoicerow->findvalue( 'UnitPriceAmount'), 136 amount_no_tax => $invoicerow->findvalue( 'RowVatExcludedAmount'), 137 tax => $invoicerow->findvalue( 'RowVatAmount'), 138 amount => $invoicerow->findvalue( 'RowAmount'), 139 } 140 } 141 142 store_all( $data); 143 } 144 145 sub add_errors 146 { my( $doc, $error_messages)= @_; 147 my $errors= $doc->createNode( '/Finvoice/errors'); 148 foreach my $message (@$error_messages) 149 { my $error= XML::XPath::Node::Element->new( 'error'); 150 $errors->appendChild( $error); 151 $doc->setNodeText( '/Finvoice/errors/error[last()]' => $message, ); 152 } 153 return $doc; 154 } 155 156 sub output_doc_to_check 157 { my( $file, $doc)= @_; 158 open( FILE, ">$file") or die "cannot create file to check $file: $!"; 159 print FILE $doc->findnodes( '/')->[0]->toString; 160 close FILE; 161 } |
XML::PYX
An XML::PYX version based on the XML::Simple code
This example uses XML::PYX in a kinda devious way: it reimplements a simplified version of XML::Simple's XMLin
that will work for the class of documents we are processing. The code is then exactly the same as the one for wtr2_simple
.
Note that this version of XMLin does not offer any option besides a limited version of forcearray
. It will not for exemple use keyattr
.
I guess this was pretty easy to write as it worked the first time I run it, much to my surprise I should say ;--)
Generally It illustrates the fact that it often pays to write a layer on top of generic modules to adapt them to your specific needs (please, just don't release it on CPAN! Chances are that it is really specific to your problem or to the way you like to code, and it will only add to the already considerable confusion in the XML namespace).
The complete example is in wtr2_pyx_simple
Here are the call to XMLin and the XMLin sub
... 17 { # XMLin just re-implements a simplified version of XML::Simple's XMLin 18 # that will work for this specific class of document 19 my $xml= XMLin( $file, forcearray => [ qw(InvoiceRow)]); ... 126 sub XMLin 127 { my( $file, %options)=@_; 128 129 # get the 'forcearray' elements in a convenient hash 130 my %forcearray= map { $_ => 1} @{$options{forcearray}}; 131 132 my @current; # the stack of hashes or array in the data structure 133 # $current[0] is the overall structure (the root of the document), 134 # $current[1] is the first_level child currently open 135 # ... 136 # $current[-1] is the current element 137 open( IN, "pyx $file | ") or die "cannot open pyx $file: $!"; 138 while( <IN>) 139 { if( m{^\((.*)$}) # open element 140 { my $tag= $1; 141 my $current={}; # will be filled later; 142 if( @current) 143 { # normal (non-root) element 144 if( $forcearray{$1}) { $current[-1]->{$1} ||= []; push @{$current[-1]->{$1}}, $current; } 145 else { $current[-1]->{$1} = $current; } 146 } 147 push @current, $current; 148 } 149 elsif( m{^\)(.*)$}) # close element (except when only the root is left) 150 { pop @current unless( @current == 1); } 151 elsif( m{^-\\n$}) # empty line, skip 152 { } 153 elsif( m{^-(.*)$}) # content, assign to the current element content 154 { $current[-1]->{content}= $1; } 155 elsif( m{^A(\w*) (.*)$}) # attribute, assign to a field in the current element 156 { $current[-1]->{$1}= $2; } 157 } 158 159 # note that in case there is an error during the parsing it will show when 160 # closing the file, so you need to check the result of close 161 close IN or die "error processing pyx $file: $!"; 162 163 return $current[0]; 164 } ... |
A more advanced XML::PYX version, still based on the XML::Simple code
The previous code might seem naive (and useless), but it can easily be improved and thus get extra credits:
This example uses XML::PYX and the XMLin layer on top of it, but stores the initial PYX flow so it can later update it with the error messages.
The code is very similar to wtr2_pyx_simple, but the PYX flow if first stored in $pyx
, before being passed to XMLin. If errors are found, $pyx
is used to add the errors (add_errors looks for the end of the root (Finvoice
) start tag, builds the PYX flow for the errors
element and insert it after the root. The pyx flow is then output using pyxw
which writes it back as XML.
Creating the PYX for the error element is not really elegant, but it is not that much of a pain either. A simple layer a-la-XML::Writer could make it even easier, but wasn't deemed necessary here.
The complete example is in wtr2_pyx_simple_plus
Here are the interesing bits:
... 19 my $pyx= pyx_in( $file); 20 my $xml= XMLin( $pyx, forcearray => [ qw(InvoiceRow)]); ... 30 add_errors( $pyx, $errors); 31 output_doc_to_check( $rejected_file, $pyx); ... 127 sub add_errors 128 { my( $pyx, $error_messages)= @_; 129 # first get the root element 130 my $root_index=0; 131 while($_=$pyx->[$root_index]) 132 { if( m{^\(Finvoice$}) 133 { # found the root element 134 # now skip attributes 135 while( $pyx->[$root_index+1]=~ m{^A}) { $root_index++; } 136 last; 137 } 138 $root_index++; 139 } 140 141 my $lf= "-\\n\n"; # a line feed in pyx 142 # build the error messages 143 my $messages= $lf 144 . "- \n(errors\n" 145 . join( "", map { "$lf- \n(error\n-$_\n)error\n" } @$error_messages) 146 . $lf 147 . "- \n)errors\n"; 148 149 # now insert the messages 150 splice( @$pyx, $root_index+1, 0, $messages); 151 152 return $pyx; 153 } 154 155 sub output_doc_to_check 156 { my( $file, $pyx)= @_; 157 open( FILE, "| pyxw > $file") or die "cannot create file to check $file: $!"; 158 print FILE @$pyx; 159 close FILE; 160 } 161 162 163 sub pyx_in 164 { my( $file)= @_; 165 open( IN, "pyx $file | ") or die "cannot open pyx $file: $!"; 166 my @pyx= <IN>; 167 # note that in case there is an error during the parsing it will show when 168 # closing the file, so you need to check the result of close 169 close IN or die "error processing pyx $file: $!"; 170 return \@pyx; 171 } 172 ... |
XML::SAX::Base
This code uses SAX to extract the data from the invoices. It parses the invoice and extract the relevant data into a Perl data structure that is then used to check the invoice and update the data base.
The first problem to solve when using SAX is that the content of elements can be broken in different calls to the characters
handler. So I needed to buffer the content. Luckily enough, Robin Berjon's XML::Filter::BufferText
does just that!
So I used a SAX machine (using SAX::Machines) to pipe the 2 handlers, first XML::Filter::BufferText, then my own handler: wtr2_handler. Note that SAX::Machines
takes care of
wtr2_handler
extracts all the information needed to check the invoice, then store it in the data base. The resulting data (returned by the end_document handler) is then used by check_invoice and store_invoice.
As this is something that is likely to be quite common and as there are few SAX modules that do this, I decided to go generic: I created a small language to describe how to extract the data and store it in my custom data structure.
The idea is to give an element name (no namespaces are used in this DTD, so there is no need to get fancy) and associate an action to it. Actions can be associated with the start of an element or with its content. At the start of an element it is possible to store attributes or to create new sub-records, for repeatable data in the document, such as InvoiceRow
The content of an element can be stored, either as top-level data, for non-repeatable data, or in a sub-record, for repeatable data.
The easiest way I found to parse these actions was to use Getopt::Long Overall this is slightly overkill for this problem, but could be re-used in other cases, so I thought it would be worth it to show it here.
In order to know in which element the parser is from the characters
handler I used a stack of element names: the start_element
handler pushes the current element name on the stack and the end_element
handler pops it. This is the only way to get access to the parent name, needed for the --parent
option.
Overall the code was quite a pain to write, especially as the default parser, XML::LibXML::SAX::Parser had a problem during my tests, as once again I had upgraded libxml2
but not the Perl module. The hardest part was designing a way to express what I wanted to extract from the XML document and how to store it, without resorting with one of those long lists of if
s that I find make code such a pain to maintain.
The complete example is in wtr2_sax_base
1 #!/usr/bin/perl -w 2 use strict; 3 4 #use diagnostics; 5 6 use XML::Filter::BufferText; # to buffer all character events 7 use XML::SAX::Machines qw(:all); # to pipe the 2 SAX handlers 8 9 use FindBin qw($Bin); 10 use lib $Bin; 11 use wtr2_base; 12 13 my $DEBUG=0; 14 15 init_db(); 16 17 my $CAN_OUTPUT= 0; 18 19 my @files= @ARGV || (<$dir{invoices}/*.xml>); 20 21 foreach my $file (@files) 22 { my $handler= wtr2_handler->new(); 23 24 my $pipeline = Pipeline( XML::Filter::BufferText->new(), 25 $handler, 26 ); 27 my $data = $pipeline->parse_uri ( $file); 28 29 my $errors= check_invoice( $data); 30 31 if( !@$errors) 32 { store_invoice( $data); } 33 else 34 { print "ERROR in $file\n ", join( "\n ", @$errors), "\n"; 35 if( $CAN_OUTPUT) 36 { my $rejected_file= rejected( $file); 37 print "adding errors in $rejected_file\n" if( $DEBUG); 38 add_errors( $pipeline, $data, $errors); # the parser is passed so it can be re-used 39 output_doc_to_check( $rejected_file, $data); 40 } 41 }; 42 } 43 44 sub check_invoice 45 { my( $data)= @_; 46 my $errors=[]; # array ref, holds the error messages 47 48 check_buyer( $data->{BuyerPartyIdentifier}, 49 $data->{BuyerOrganisationName}, 50 $errors 51 ); 52 check_po( $data->{OrderIdentifier}, $errors); 53 54 my @rows= @{$data->{InvoiceRow}}; 55 56 reset_default_row_id(); 57 58 foreach my $row( @rows) 59 { # this does not cope well with broken row numbers 60 61 my $row_id= $row->{RowIdentifier} || default_row_id(); 62 63 print "checking row $row_id\n" if $DEBUG; 64 65 check_qtty( $row_id, 66 $row->{DeliveredQuantity}, 67 $row->{DeliveredQuantityUC}, 68 $row->{OrderedQuantity}, 69 $row->{OrderedQuantityUC}, 70 $errors 71 ); 72 } 73 74 return $errors; 75 } 76 77 78 sub store_invoice 79 { my( $xml_data)= @_; 80 print "storing invoice $xml_data->{InvoiceNumber}\n"; 81 82 # build the various data structures 83 my $data; 84 85 $data->{invoice} = { number => $xml_data->{InvoiceNumber}, 86 date => $xml_data->{InvoiceDate}, 87 po => $xml_data->{OrderIdentifier}, 88 amount_no_tax => $xml_data->{InvoiceTotalVatExcludedAmount}, 89 tax => $xml_data->{InvoiceTotalVatAmount}, 90 amount => $xml_data->{InvoiceTotalVatIncludedAmount}, 91 payment_status => $xml_data->{PaymentStatusCode}, 92 }; 93 94 $data->{seller} = { identifier => $xml_data->{SellerPartyIdentifier}, 95 name => $xml_data->{SellerOrganisationName}, 96 tax_code => $xml_data->{SellerOrganisationTaxCode}, 97 }; 98 99 $data->{address} = { street => $xml_data->{SellerStreetName}, 100 town => $xml_data->{SellerTownName}, 101 zip => $xml_data->{SellerPostCodeIdentifier}, 102 country_code => $xml_data->{CountryCode}, 103 po_box => $xml_data->{SellerPostOfficeBoxIdentifier}, 104 }; 105 106 $data->{contact} = { name => $xml_data->{SellerContactPersonName}, 107 phone => $xml_data->{SellerPhoneNumberIdentifier}, 108 email => $xml_data->{SellerEmailaddressIdentifier}, 109 }; 110 111 $data->{invoicerow} ||= []; 112 reset_default_row_id(); 113 114 foreach my $invoicerow (@{$xml_data->{InvoiceRow}}) 115 { push @{$data->{invoicerow}}, 116 { row_id => $invoicerow->{RowIdentifier} || default_row_id(), 117 sku => $invoicerow->{ArticleIdentifier}, 118 name => $invoicerow->{ArticleName}, 119 qty => $invoicerow->{DeliveredQuantity}, 120 qty_unit => $invoicerow->{DeliveredQuantityUC}, 121 unit_price => $invoicerow->{UnitPriceAmount}, 122 amount_no_tax => $invoicerow->{RowVatExcludedAmount}, 123 tax => $invoicerow->{RowVatAmount}, 124 amount => $invoicerow->{RowAmount}, 125 } 126 } 127 128 store_all( $data); 129 } 130 131 132 133 package wtr2_handler; 134 135 use base qw(XML::SAX::Base); 136 137 use Getopt::Long; # to process actions associated with SAX events 138 139 # all those could be stored as part of the parser object 140 # but IMHO this implies potential name collision 141 142 my( $content, $start); 143 144 BEGIN { 145 146 # declare which element content and attributes we want to store 147 # note that this works fine because elements are not re-used, the element 148 # name always gives enough information to figure out what to do with it 149 150 # note that the little languages used in the action part depends on the fact 151 # that - is not a valid characters at the start of an XML identifier 152 153 # actions for start_element handler 154 # 155 # actions format is: 156 # actions : action (';' action)* # only one used here 157 # action : command options+ 158 # command : ('--create' | '--store_att' <attribute name>) 159 # options : option+ 160 # option : --in <sub_record> # sub_record must be have been defined 161 # --as <field_name> # name of the field in the hash 162 # # (defaults to the attribute name) 163 # --parent <name> # trigger only when in proper parent 164 # --create only supports the -as option (not used) 165 166 $start = { # elements that create a new sub_record for repeated content 167 InvoiceRow => '--create', 168 # attributes to be stored 169 # format is #store_att 170 DeliveredQuantity => '--store_att QuantityUnitCode --in InvoiceRow --as DeliveredQuantityUC', 171 OrderedQuantity => '--store_att QuantityUnitCode --in InvoiceRow --as OrderedQuantityUC', 172 }; 173 174 175 # actions for characters handler 176 # actions format is: 177 # actions : action (';' action)* # only one used here 178 # action : command options+ 179 # command : ('--store') 180 # options : option+ 181 # option : --in <sub_record> # sub_record must be have been defined 182 # --as <field_name> # not used here 183 # --parent <name> # trigger only when in proper parent 184 185 $content= { # content that needs to be stored once per invoice 186 BuyerPartyIdentifier => '--store', 187 BuyerOrganisationName => '--store', 188 InvoiceNumber => '--store', 189 InvoiceDate => '--store', 190 OrderIdentifier => '--store', 191 InvoiceTotalVatExcludedAmount => '--store', 192 InvoiceTotalVatAmount => '--store', 193 InvoiceTotalVatIncludedAmount => '--store', 194 PaymentStatusCode => '--store', 195 SellerPartyIdentifier => '--store', 196 SellerOrganisationName => '--store', 197 SellerOrganisationTaxCode => '--store', 198 SellerStreetName => '--store', 199 SellerTownName => '--store', 200 SellerPostCodeIdentifier => '--store', 201 CountryCode => '--store --parent SellerPostalAddressDetails', 202 SellerPostOfficeBoxIdentifier => '--store', 203 SellerContactPersonName => '--store', 204 SellerPhoneNumberIdentifier => '--store', 205 SellerEmailaddressIdentifier => '--store', 206 # repeated content 207 RowIdentifier => '--store --in InvoiceRow', 208 ArticleIdentifier => '--store --in InvoiceRow', 209 ArticleName => '--store --in InvoiceRow', 210 DeliveredQuantity => '--store --in InvoiceRow', 211 OrderedQuantity => '--store --in InvoiceRow', 212 UnitPriceAmount => '--store --in InvoiceRow', 213 RowVatExcludedAmount => '--store --in InvoiceRow', 214 RowVatAmount => '--store --in InvoiceRow', 215 RowAmount => '--store --in InvoiceRow', 216 }; 217 218 } 219 220 # very complex new! it's a hash so we can add the state data needed 221 sub new 222 { my $class = shift; 223 my $self= bless {}, $class; 224 } 225 226 # reset data for each invoice 227 sub start_document 228 { $_[0]->{data} = {}; # stored data 229 $_[0]->{context} = []; # element name stack 230 } 231 232 # 233 sub start_element 234 { my( $p, $elt)= @_; 235 236 my $name= $elt->{Name}; 237 238 # store the context 239 push @{$p->{context}}, $name; 240 241 # process actions for the relevant elements 242 if( my $actions= $start->{$name}) 243 { my @actions= split /;/, $actions; # actions are ; separated (not used here) 244 foreach my $action (@actions) 245 { # use Getopt::Long to parse the action 246 local @ARGV= split /\s+/, $action; 247 my %options; 248 GetOptions( \%options, "create", "store_att=s", "in=s", "as=s", "parent=s"); 249 250 if( $options{parent} and ($p->{context}->[-2] ne $options{parent})) { next; } 251 252 # process each type of action 253 if( $options{create}) 254 { # create a new sub_record 255 if( $options{store_att}) { die "can't use --store_att and --create in $name => '$action'\n"; } 256 if( $options{in}) { die "can't use --in with --create in $name => '$action'\n"; } 257 $options{as} ||= $name; # the sub_record name defaults to the element name 258 $p->{data}->{$name} ||= []; 259 push @{$p->{data}->{$name}}, {}; 260 } 261 elsif( my $att= $options{store_att}) 262 { # store an attribute 263 my $att_clarkian= "{}".$att; # attributes are indexed using the clarkian notation 264 my $value= $elt->{Attributes}->{$att_clarkian}->{Value}; 265 store( $p->{data}, $options{in}, $options{as} || $name, $value); 266 } 267 else 268 { die "no valid start action found in $name => '$action'\n"; } 269 } 270 } 271 } 272 273 274 sub characters 275 { my( $p, $characters)= @_; 276 my $name= $p->{context}->[-1]; 277 if( my $actions= $content->{$name}) 278 { my @actions= split /;/, $actions; 279 foreach my $action (@actions) 280 { local @ARGV= split /\s+/, $action; 281 my %options; 282 GetOptions( \%options, "store", "in=s", "as=s", "parent=s"); 283 if( $options{parent} and ($p->{context}->[-2] ne $options{parent})) { next; } 284 if( $options{store}) 285 { store( $p->{data}, $options{in}, $options{as} || $name, $characters->{Data}); } 286 else 287 { die "no valid contentt action found in $name => '$action'\n"; } 288 } 289 } 290 } 291 292 sub end_element { pop @{$_[0]->{context}}; } 293 294 sub end_document { return $_[0]->{data}; } 295 296 sub store 297 { my( $data, $in, $as, $value)= @_; 298 if( my $sub_record= $in) 299 { # create it in a sub_record 300 $data->{$sub_record}->[-1]->{$as}= $value; 301 } 302 else 303 { # create at top-level 304 $data->{$as}= $value; 305 } 306 } 307 308 1; |
Conclusions
Overall the code was quite easy to write with most modules.
I was surprised by how much common code I could re-use from one example to the other, due to modules using standard API (DOM/SAX) and to a lot of modules being written by the same author (as with XML::XPath / XML::LibXML) (not to mention the author of the article blatantly cheating and stealing method names, if not code, from other modules to add them to XML::Twig ;--).
Modules that simply slurp the XML into a Perl data structure were the easiest to use, sometimes at the cost of making it harder to output a modified document. XML::Smart looked good here, but it is a very recent module and will need to be tested some more before I can recommend it. Tree-based modules were a little harder to use (why can't we call methods on undef! that would save us quite a few test), but allowed for more control over modified XML output. Streaming is not really adapted to this problem, and requires a the bigger layer on top of the basic modules, so I would not really advise using it here.
Benchmark
All modules handled the load pretty well, except for XML::XPath which gives seemingly random Out of memory! errors.
Tested on 206 invoices
Module | Version | Timing (benchmark) | Factor |
---|---|---|---|
XML::Simple | 2.08 | 7.43 | 100 |
XML::Smart | 1.3.1 | 9.71 | 130 |
XML::Twig | 3.10 | 8.74 | 117 |
XML::Twig (alt) | � | 10.96 | 147 |
XML::DOM | 1.42 | 8.97 | 120 |
XML::EasyOBJ | 1.12 | 8.35 | 112 |
XML::LibXML | 1.55 | 3.72 | 50 |
XML::LibXML (alt) | � | 4.53 | 60 |
XML::XPath | 1.13 | 24.86 | 334 |
XML::PYX | 0.07 | 21.41 | 288 |
XML::PYX (alt) | � | 22.53 | 303 |
XML::SAX::Base | 1.04 | 11.66 | 156 |
benchmark environment: perl 5.008 on linux - Thu Aug 21 17:11:39 2003
Running the examples
Requirements
You will need to install the following modules: DBI, DBD::SQLite, Memoize, plus of course the various XML modules required by each example.
Files:
A tar file with everything you need
- all files (.tar.gz): everything you need to run wtr2, stored in the proper directory. run_tests init will run all the tests, provided you have all the modules installed.
Finvoice files
See Finvoice: Technical files to download the complete package.
- Finvoice.dtd: the Finvoice DTD,
- Finvoice.xsl: the Finvoice XSLT stylesheet,
- example invoice 1: named basic_invoice.xml in Finvoice technical package,
- example invoice 2: named all_details.xml in Finvoice technical package
- example invoice 3
The data base definition and creation file
- invoice_db.def: used by the various examples to reset the DB
A common package
- wtr2_base.pm: contains common functions used by the various examples, see the html doc or the text doc
The various examples
- wtr2_simple: XML::Simple version (doc),
- wtr2_smart: XML::SMART version, very similar to the XML::Simple one, but more powerful (it outputs the document with the errors) (doc)
- wtr2_twig_simple: XML::Twig version derived from the XML::Simple one (doc)
- wtr2_twig: XML::Twig full-tree version (doc),
- wtr2_dom: XML::DOM version, (doc),
- wtr2_easyobj: XML::EasyOBJ version, (doc),
- wtr2_libxml_dom: XML::LibXML version using strictly the same code as the XML::DOM example. (doc),
- wtr2_libxml: XML::LibXML version using XML::LibXML specific methods (doc),
- wtr2_xpath: XML::XPath version (doc),
- wtr2_pyx_simple: XML::PYX version built using the code for XML::Simple, and re-implementing XMLin on top of PYX (doc, run with perl wtr2_pyx_simple invoice_?.xml)
- wtr2_sax_base: a basic SAX example (doc),
Templates for writing examples
- wtr2_template_tree: a template convenient for tree-oriented examples (doc),