XML, the Perl Way

An Introduction to XML

by Michel Rodriguez
Boardwatch Magazine

XML is the latest buzzword. "XML is the future of HTML," "You need XML to do e-commerce," "My boss wants me to use XML": everybody has heard those phrases, but not so many people have actually seen XML documents. So what is it? Where does it come from? Why is it important? What does it look like? What can you do with it? Any other three-letter acronym you should know of? What about four-letter acronyms? And can it brew coffee too? Those questions, most of them anyway, will be answered in this article.

What is XML and where does it come from?

XML (eXtensible Markup Language) is a W3C (www.w3c.org) recommendation that defines a format used to store, exchange, process or display data or documents. It comes from SGML and HTML. And yes, it uses pointy brackets! It is designed to be both human-readable and easy to process automatically.

XML is a simplified version of SGML (Standard Generalized Markup Language), which probably does not say much to anybody. In a nutshell, XML is a kind of HTML that allows any tag to be defined. That means more freedom, which, as usual, comes at the price of more responsibility. Tags are not limited to H1, H2, P, etc., anymore, but can include price, rating or favorite-movie tags. The added responsibility, of course, is that whoever or whatever is going to use a document needs to be instructed what to do with all those new tags. But more on that later; first enjoy your new freedom.

What does XML look like?

Take an HTML document, say a comparison of ISP Web-hosting plans that would look like:


<html>
<body>
  <h1>ISP's</h1>
  <h2>My Local ISP</h2>
  <table>
    <tr><th align=left>Plan</th><th>Size</th><th>Price</th><th>OS</th><th>CGI</th></tr>
    <tr><th>Basic</th><td>5Mo</td><td>$19.95</td><td>Linux</td><td>no</td></tr>
    <tr><th>Pro</th><td>1Go</td><td>$199</td><td>Linux</td><td>Perl</td></tr>
  </table>
  <hr />
  <h2>That Big ISP with Commercials on TV</h2>
  <table>
    <tr><th align=left>Plan</th><th>Size</th><th>Price</th><th>OS</th><th>CGI</th></tr>
    <tr><th>Basic</th><td>1Mo</td><td>$29.95</td><td>NT</td><td>no</td></tr>
    <tr><th>Pro</th><td>200Mo</td><td>$199</td><td>NT</td><td>VB</td></tr>
  </table>
</body>
</html>



So here is an XML document that could be used to store the same data:


<?xml version="1.0"?>
<doc>
  <h1>ISP's</h1>
  <plans>
    <isp>
      <name id="local">My Local ISP</name>
      <plan cat="basic"><size unit="Mo">5</size><price>19.95</price><os>Linux</os></plan>
      <plan cat="pro"><size unit="Go">1</size><price>199</price><os>Linux</os><cgi type="Perl"/></plan>
    </isp>
    <isp>
      <name id="big">That Big ISP with Commercials on TV</name>
      <plan cat="basic"><size unit="Mo">1</size><price>29.95</price><os>NT</os></plan>
      <plan cat="pro"><size unit="Mo">200</size><price>199</price><os>NT</os><cgi type="VB"/></plan>
    </isp>
  </plans>
</doc>


Simple, hey?

First, notice some of the most important aspects of the XML syntax:

Oh, and by the way, there is nothing preventing the use of good ole HTML tags in XML. Most of the time it will even make it easier to display, whether in native XML or after being converted to HTML; they just have to be written according to the rules of XML syntax.

Why is XML important?

XML is important because it allows users to use descriptive markup. Simply put, it makes it possible to describe what the data is instead of what it looks like. This means that, in the previous example, a price is a price, not just one more cell in a table, so when the XML document above is being processed it is easy for software to figure out that 199 is a price while 200 is a size in MB.
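As a rough illustration of that point, here is a minimal sketch in Python, using the xml.etree.ElementTree module from the standard library. It reads a trimmed-down copy of the ISP document and pulls out sizes and prices by element name rather than by table position. The function name describe_plans is only an illustrative choice, and the sample keeps a single ISP to stay short.

```python
import xml.etree.ElementTree as ET

# Trimmed-down, well-formed copy of the ISP document (one ISP only).
ISP_XML = """<?xml version="1.0"?>
<doc>
  <plans>
    <isp>
      <name id="big">That Big ISP with Commercials on TV</name>
      <plan cat="basic"><size unit="Mo">1</size><price>29.95</price></plan>
      <plan cat="pro"><size unit="Mo">200</size><price>199</price></plan>
    </isp>
  </plans>
</doc>
"""

def describe_plans(xml_text):
    """Return one (category, size, unit, price) tuple per plan element."""
    root = ET.fromstring(xml_text)
    rows = []
    for plan in root.iter("plan"):
        size = plan.find("size")
        rows.append((
            plan.get("cat"),                 # attribute on the plan element
            int(size.text),                  # a size, because the tag says so
            size.get("unit"),                # its unit, stored as an attribute
            float(plan.findtext("price")),   # a price, because the tag says so
        ))
    return rows

plans = describe_plans(ISP_XML)
for cat, size, unit, price in plans:
    print(f"{cat}: {size} {unit} for ${price:.2f}")
```

The code never has to guess which table cell holds what: the element names carry that information.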

It also makes it easy to deal with only some of the information in the document. If some users are just interested in the best hosting plan for each ISP and money is not an object, they can easily have software process only the plans where the attribute cat equals "pro" and not process the price.
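That kind of selective processing might look like the following sketch, again in Python with the standard xml.etree.ElementTree module. It keeps only the plans whose cat attribute equals "pro" and never looks at the price. The function name pro_plans is an illustrative assumption, and the sample document is a shortened copy of the one above.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the ISP document: two ISPs, two plans each.
ISP_XML = """<doc>
  <plans>
    <isp>
      <name id="local">My Local ISP</name>
      <plan cat="basic"><size unit="Mo">5</size><price>19.95</price></plan>
      <plan cat="pro"><size unit="Go">1</size><price>199</price></plan>
    </isp>
    <isp>
      <name id="big">That Big ISP with Commercials on TV</name>
      <plan cat="basic"><size unit="Mo">1</size><price>29.95</price></plan>
      <plan cat="pro"><size unit="Mo">200</size><price>199</price></plan>
    </isp>
  </plans>
</doc>"""

def pro_plans(xml_text):
    """Map each ISP name to the size of its "pro" plan, ignoring prices."""
    root = ET.fromstring(xml_text)
    result = {}
    for isp in root.iter("isp"):
        # Select the child plan whose cat attribute is "pro".
        pro = isp.find("plan[@cat='pro']")
        if pro is not None:
            size = pro.find("size")
            result[isp.findtext("name")] = f"{size.text} {size.get('unit')}"
    return result

best = pro_plans(ISP_XML)
for name, size in best.items():
    print(f"{name}: {size}")
```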

A related benefit is that a user receiving periodic updates of that document can deal easily with changes in its structure. If more information, such as the number of e-mail accounts customers get, is added, the user software, if written properly, will just ignore that new information. Nothing will break; the system will behave as if that new information were not there. Then later, when the user decides to take that information into account, he can upgrade the processing software and include the number of e-mail accounts in his data. Nifty, hey?
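Here is a small sketch of that behavior in Python (standard xml.etree.ElementTree module). The element name email-accounts and both function names are hypothetical, chosen only for the example. The first reader was written before the new element existed and simply never asks for it; the second is the later upgrade.

```python
import xml.etree.ElementTree as ET

# The plan as originally published.
OLD_PLAN = '<plan cat="pro"><size unit="Go">1</size><price>199</price></plan>'
# Hypothetical update: an <email-accounts> element has been added.
NEW_PLAN = ('<plan cat="pro"><size unit="Go">1</size><price>199</price>'
            '<email-accounts>10</email-accounts></plan>')

def read_plan(plan):
    """Original reader: asks only for what it knows about."""
    return {"cat": plan.get("cat"), "price": float(plan.findtext("price"))}

def read_plan_v2(plan):
    """Upgraded reader: also picks up the new element when it is present."""
    info = read_plan(plan)
    accounts = plan.findtext("email-accounts")
    info["email_accounts"] = None if accounts is None else int(accounts)
    return info

old_plan = ET.fromstring(OLD_PLAN)
new_plan = ET.fromstring(NEW_PLAN)

print(read_plan(old_plan))     # the old reader on the old document
print(read_plan(new_plan))     # same result: the extra element is ignored
print(read_plan_v2(new_plan))  # the upgraded reader sees the new data
```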

Basically, the document has gone from a file that can only be displayed on the Web to a piece of information that can be used for various purposes. By using XML, the "write once, use many times" motto can become reality.

Note that the underlying model for the data is a tree, as opposed to tables in the relational model.
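To make the tree visible, this short Python sketch (standard xml.etree.ElementTree module) walks a skeleton of the ISP document recursively and prints each element indented under its parent. The function name outline is an illustrative assumption.

```python
import xml.etree.ElementTree as ET

# Skeleton of the ISP document: structure only, one ISP and one plan.
SKELETON = """<doc>
  <plans>
    <isp>
      <name id="local">My Local ISP</name>
      <plan cat="basic"><size unit="Mo">5</size><price>19.95</price></plan>
    </isp>
  </plans>
</doc>"""

def outline(element, depth=0):
    """Return the tag names of element and its descendants, indented by depth."""
    lines = ["  " * depth + element.tag]
    for child in element:          # iterating an element yields its children
        lines.extend(outline(child, depth + 1))
    return lines

tree_lines = outline(ET.fromstring(SKELETON))
print("\n".join(tree_lines))
```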

What can XML be used for?

The short answer is "anything your boss wants".

The long answer is actually very long, so here is a summary:

More generally, XML can be used to exchange or store any structured data or documents in a neutral form. The data is structured into elements and attributes. XML is neutral as it is cross-platform and vendor independent.

Other three-letter acronyms (and more)

Using XML comes with an added burden, a.k.a. here's the bad news.

To harness the power of XML and impress your boss you will need style sheets (in XSL or CSS) and maybe a formal definition of the structure of documents, using DTDs or schemas (or schemata if you want to sound really sharp). Standard ways to process the data also come in handy, either using an implementation of the DOM API or an XSLT processor. And this is only the beginning.

First, have a look at style sheets: When using HTML, the tag names are fixed, and the browser knows what to do with them: An h1 tag will be displayed in bold, large font; a td tag is a table cell. In standard lingo the semantics of the tags ARE fixed.

XML does not specify how a tag should be displayed. It is up to the application processing the document to take care of this. So to display an isp element in bold and with a slightly larger font than regular text, quite like an h2 actually, the browser will need a style sheet, along with the document, that will give the proper display instructions to the browser.

The two most popular style languages are CSS (Cascading Style Sheets) and XSL (Extensible Stylesheet Language). XSL, though supported by fewer browsers at the moment, should be preferred: it is the more powerful of the two, it is the one undergoing the most development at the W3C, and most recent browsers support it (see XML tools below).

Of course, in order not to rely on viewers having the latest browsers, HTML can also be generated from XML, either on the fly or just in batch, using one of the processing tools described below.
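A batch conversion of that kind could be sketched as follows in Python, using only the standard library. This is one possible approach, not the only one; an XSLT processor would do the same job declaratively. The function name isp_to_html is an illustrative assumption, and the sample holds a single isp element.

```python
import xml.etree.ElementTree as ET
from html import escape

# One isp element taken from the ISP document.
ISP_XML = """<isp>
  <name id="local">My Local ISP</name>
  <plan cat="basic"><size unit="Mo">5</size><price>19.95</price><os>Linux</os></plan>
  <plan cat="pro"><size unit="Go">1</size><price>199</price><os>Linux</os><cgi type="Perl"/></plan>
</isp>"""

def isp_to_html(xml_text):
    """Render one isp element as an HTML heading followed by a table."""
    isp = ET.fromstring(xml_text)
    lines = [
        f"<h2>{escape(isp.findtext('name'))}</h2>",
        "<table>",
        "  <tr><th>Plan</th><th>Size</th><th>Price</th><th>OS</th><th>CGI</th></tr>",
    ]
    for plan in isp.findall("plan"):
        size = plan.find("size")
        cgi = plan.find("cgi")     # absent when the plan offers no CGI
        cells = [
            f"{size.text}{size.get('unit')}",
            f"${plan.findtext('price')}",
            plan.findtext("os"),
            cgi.get("type") if cgi is not None else "no",
        ]
        row = "".join(f"<td>{escape(cell)}</td>" for cell in cells)
        lines.append(f"  <tr><th>{escape(plan.get('cat').capitalize())}</th>{row}</tr>")
    lines.append("</table>")
    return "\n".join(lines)

page = isp_to_html(ISP_XML)
print(page)
```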

Another important point is that it might be appropriate to have a little more control over documents than just "every open tag should be properly closed," especially if they are exchanged between different organizations. It might be helpful to make sure that they still use the same tags as last week. That's when a DTD (document type definition) or an XML schema should be used.

DTDs come from the SGML heritage of XML, while XML schemas are a W3C recommendation still under development. Both allow defining precisely what tags can be used in documents, what their attributes are and how they should be nested. A piece of software named an XML parser can then check that documents conform to their DTD or schema. XML documents that simply follow the rules of XML syntax, with or without a DTD, are described as well-formed, while those that also conform to a DTD or schema are valid.
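The well-formedness half of that check is easy to try. The sketch below uses Python's standard xml.etree.ElementTree module, whose parser rejects anything that breaks the XML syntax rules. Note the limits of the example: the parsers shipped with Python's standard library are non-validating, so checking a document against a DTD or schema would need a separate validating parser, which is not shown here. The function name is_well_formed is an illustrative assumption.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True

GOOD = '<plan cat="pro"><price>199</price></plan>'
# Same fragment with the slash missing from the closing price tag.
BAD = '<plan cat="pro"><price>199<price></plan>'

print(is_well_formed(GOOD))
print(is_well_formed(BAD))
```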

And then XML documents can be processed, whether to convert them to HTML, include data out of a database, merge documents ... whatever, there will be plenty of opportunities for that. Two W3C recommendations deal with XML processing: The DOM (Document Object Model) defines a model and an API to interact with an XML document, while XSLT (XSL Transformations) is a complete transformation language aimed at pre-processing a document before displaying it.
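To give a feel for the DOM style of processing, here is a minimal sketch using xml.dom.minidom, the small DOM implementation in Python's standard library. It reads prices through DOM calls and then modifies the tree by adding an os element to each plan. The helper names plan_prices and add_os are illustrative assumptions.

```python
from xml.dom.minidom import parseString

doc = parseString(
    '<isp><name id="local">My Local ISP</name>'
    '<plan cat="basic"><price>19.95</price></plan>'
    '<plan cat="pro"><price>199</price></plan></isp>'
)

def plan_prices(document):
    """Map each plan category to the text of its price element."""
    prices = {}
    for plan in document.getElementsByTagName("plan"):
        price = plan.getElementsByTagName("price")[0]
        prices[plan.getAttribute("cat")] = price.firstChild.data
    return prices

def add_os(document, os_name):
    """Append an os element holding os_name to every plan."""
    for plan in document.getElementsByTagName("plan"):
        os_element = document.createElement("os")
        os_element.appendChild(document.createTextNode(os_name))
        plan.appendChild(os_element)

prices = plan_prices(doc)
add_os(doc, "Linux")
print(prices)
print(doc.documentElement.toxml())
```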

Just so you don't think that's all there is to it, please note that XSLT uses XPath (a way to refer to parts of an XML document) and that other useful concepts (or buzzwords) include XLink (advanced links), XQL (an XML query language), namespaces (a way to mix several, possibly off-the-shelf, DTDs in a single document) and character encodings, which are usually easy to deal with in English but can be pretty annoying with other languages.
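Namespaces are easier to grasp with a small example. The Python sketch below (standard xml.etree.ElementTree module) parses a document that mixes an HTML heading with ISP elements, each vocabulary in its own namespace. The isp namespace URI is made up for the example; the other one is the usual XHTML namespace. The paths passed to findtext and findall belong to the small XPath-like subset that ElementTree supports, not to full XPath.

```python
import xml.etree.ElementTree as ET

DOC = """<doc xmlns:h="http://www.w3.org/1999/xhtml"
     xmlns:isp="http://example.com/ns/isp">
  <h:h1>ISP's</h:h1>
  <isp:name>My Local ISP</isp:name>
  <isp:name>That Big ISP with Commercials on TV</isp:name>
</doc>"""

# Prefix-to-URI map used when searching; the isp URI is hypothetical.
NS = {
    "h": "http://www.w3.org/1999/xhtml",
    "isp": "http://example.com/ns/isp",
}

root = ET.fromstring(DOC)
title = root.findtext("h:h1", namespaces=NS)
names = [name.text for name in root.findall("isp:name", NS)]

print(title)
print(names)
print(root[0].tag)   # the parser stores the tag with its full namespace URI
```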

XML tools

Now let's be a little practical: Which tools can be used to create, edit, process and view XML files?

Text processors that can be used to create and edit XML include FrameMaker+SGML (very expensive), WordPerfect 9.0 ($149 everywhere) and, of course, EMACS and everybody's favorite, vi. After all, it's just text, isn't it? EMACS doesn't offer WYSIWYG display and vi won't even make sure a document is proper XML, but either might be all that's needed to write a simple configuration file.

To process XML there is a plethora of options, most of them free by the way: There are Java tools (from IBM, Sun and Microsoft), a host of Perl modules (www.cpan.org/modules/by-module/XML), plenty of Python libraries (www.python.org/topics/xml); there are even Apache modules to integrate XML processing within the Web server (xml.apache.org).

Just one word of advice: XML looks deceptively easy to process. It should not be dealt with using something like regular expressions. Please use a real XML parser. The parser is the piece of software that actually reads the XML and tells your application what is a tag, an attribute, text etc.; there are tons of them, free, in any major language.
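To see why, compare a naive regular expression with a real parser on a slightly tricky fragment. This is a sketch in Python with the standard re and xml.etree.ElementTree modules; the currency attribute and the comment are made up for the example. The regular expression picks up the price inside the comment and misses the real one, because the real one carries an attribute, while the parser skips the comment, finds the element and decodes the entity.

```python
import re
import xml.etree.ElementTree as ET

# Made-up fragment with a comment, an attribute on price and an entity.
PLAN_XML = """<plan cat="pro">
  <!-- <price>149</price> was last year's price -->
  <price currency="USD">199</price>
  <os>Linux &amp; BSD</os>
</plan>"""

# Naive approach: look for the literal tags.
regex_prices = re.findall(r"<price>(.*?)</price>", PLAN_XML)

# Parser approach: let the XML parser decide what is an element.
plan = ET.fromstring(PLAN_XML)
parser_prices = [price.text for price in plan.iter("price")]
os_name = plan.findtext("os")

print(regex_prices)    # fooled by the comment
print(parser_prices)   # the real price element
print(os_name)         # entity already decoded by the parser
```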

And finally, browsers that can display XML using XSL style sheets include Microsoft's Internet Explorer 5.0 and above, Netscape 6.0 (and Mozilla, of course) and Opera 4.0.

Can it brew coffee?

XML is a wonderful tool, and can be used to vastly enhance the value of information available both on the Web and on an intranet. It is not a magic wand, though, and it will not solve anything until the overall system is properly designed.

XML defines a format for storing and exchanging information, and it comes with a host of tools and associated standards that make it much easier to handle than any "home-brewed" format. That's it.

Without a sensible scheme (most likely described through DTDs or schemas) for documents and data, harmonization with other organizations, a clean design of the general architecture of the system, and proper attention paid to the details of how all the pieces are actually going to work together, the result will be either gigabytes of useless tag soup or days spent tagging the middle initial of each and every person who ever had a look at the system.

On the other hand, using XML may be as simple as just choosing it as the format to store some configuration files, exchange a couple of simple data tables or add a small number of custom elements to HTML files. It will provide a flexible way to store and retrieve data while getting to understand better the format and what can be done with it. There is no need to start with a full-blown revamping of a whole Web site involving converting every single piece of data it contains.
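A configuration file is probably the gentlest way to start. The sketch below, in Python with the standard xml.etree.ElementTree module, reads a tiny configuration document into a dictionary. The element names, attribute names and values are all invented for the example, as is the function name load_config.

```python
import xml.etree.ElementTree as ET

# Invented configuration format, for illustration only.
CONFIG_XML = """<config>
  <server host="localhost" port="8080"/>
  <option name="debug">yes</option>
  <option name="language">en</option>
</config>"""

def load_config(xml_text):
    """Return the settings found in xml_text as a flat dictionary."""
    root = ET.fromstring(xml_text)
    server = root.find("server")
    settings = {
        "host": server.get("host"),
        "port": int(server.get("port")),
    }
    # Each option element becomes one key/value pair.
    for option in root.findall("option"):
        settings[option.get("name")] = option.text
    return settings

settings = load_config(CONFIG_XML)
print(settings)
```

Adding a new option later means adding one line to the file, and older versions of the program will simply not ask for it.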


Note: this article was published in 2000 in Boardwatch magazine. More recent articles about XML and especially Perl & XML can be found on www.xmltwig.com