XML, the Perl Way


2. Introduction to XML

2.1 What is XML

XML could be described as "HTML on steroids". Or conversely as "SGML on Prozac".

XML is a markup language, just like HTML, and uses the same basic syntax: pointy brackets, attributes... It is just slightly more dictatorial than HTML: tags MUST be closed, and attribute values MUST be enclosed in quotes, either single or double.
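To see how strict these two rules are in practice, here is a minimal sketch using the XML parser from Python's standard library (chosen only for illustration; the sample strings and the helper name `is_well_formed` are made up for this example). The parser accepts both quote styles but rejects an unclosed tag or an unquoted attribute value.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical snippets, one per rule discussed above.
samples = {
    "closed tags, double quotes": '<p class="intro">Hello<br/></p>',
    "closed tags, single quotes": "<p class='intro'>Hello</p>",
    "unclosed <br> tag": '<p class="intro">Hello<br></p>',
    "unquoted attribute value": "<p class=intro>Hello</p>",
}

for label, text in samples.items():
    print(f"{label}: {is_well_formed(text)}")
```

The last two snippets would be perfectly acceptable to most HTML browsers, which is exactly the difference in strictness.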

In fact it is little more than comma-separated files, except that fields are somewhat documented (by the element name and by attributes) and that they can be nested, thus defining a tree structure instead of a table.
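The comparison with comma-separated files can be made concrete with a small sketch. The catalog data, the element names `catalog` and `item`, and the helper `csv_to_xml` are all invented for illustration. Each CSV row becomes an element, and each field becomes a child element named after its column, so the data documents itself.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical table: a header line followed by two records.
CSV_TEXT = "part,price\nbolt,0.25\nnut,0.10\n"

def csv_to_xml(text):
    """Turn each CSV row into an <item> whose children are named after the columns."""
    root = ET.Element("catalog")
    for row in csv.DictReader(io.StringIO(text)):
        item = ET.SubElement(root, "item")
        for name, value in row.items():
            ET.SubElement(item, name).text = value
    return root

root = csv_to_xml(CSV_TEXT)
print(ET.tostring(root, encoding="unicode"))
```

Unlike the table, the resulting tree could go on to nest further elements inside an `item`, which a flat CSV row cannot express.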

What XML brings is syntactic consistency, which allows the same tools to be used to process all XML files, and a host of associated standards for formatting, transformation, linking...

XML complexity stems from 2 main facts:

2.2 XML example

A simple example would be: simple_doc.xml.

<?xml version="1.0" ?>
<doc>
  <title>A simple XML document</title>
  <section>
    <title>Section 1</title>
    <p>This is the first paragraph of section 1</p>
    <p>And this is the second paragraph of section 1 with <b>bold</b> text</p>
    <empty desc="empty tag"/>
  </section>
  <section>
    <title>Section 2</title>
    <p>This is the first paragraph of section 2</p>
    <p>And this is the second paragraph of section 2</p>
  </section>
</doc>
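As a quick check that this document really defines a tree, the following sketch loads it and prints one indented line per element. It uses Python's standard-library parser purely for illustration; the helper name `outline` is an assumption of this example, not part of any library.

```python
import xml.etree.ElementTree as ET

# The example document from above, embedded as a string.
SIMPLE_DOC = """<?xml version="1.0" ?>
<doc>
  <title>A simple XML document</title>
  <section>
    <title>Section 1</title>
    <p>This is the first paragraph of section 1</p>
    <p>And this is the second paragraph of section 1 with <b>bold</b> text</p>
    <empty desc="empty tag"/>
  </section>
  <section>
    <title>Section 2</title>
    <p>This is the first paragraph of section 2</p>
    <p>And this is the second paragraph of section 2</p>
  </section>
</doc>
"""

def outline(elt, depth=0):
    """Return one indented line per element, in document order."""
    lines = ["  " * depth + elt.tag]
    for child in elt:
        lines.extend(outline(child, depth + 1))
    return lines

doc = ET.fromstring(SIMPLE_DOC)
print("\n".join(outline(doc)))

# Titles of the two sections, and the attribute of the empty element.
section_titles = [s.findtext("title") for s in doc.findall("section")]
empty_desc = doc.find("section/empty").get("desc")
print(section_titles, empty_desc)
```

The outline shows `b` nested inside a `p`, inside a `section`, inside `doc`: the nesting of the tags is the tree.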

2.3 Resources

The best resource on XML, and on SGML for that matter, is certainly Robin Cover's SGML/XML Web Page, which links to everything else anyway. XML.com and xmlhack are two good sites, respectively for detailed articles on XML and for the latest news on the topic.

2.4 XML used in this tutorial

Just a word on the XML I use in this tutorial.

XML is usually used for two purposes these days: either purely to store data that is exchanged between two pieces of software, or to store documents, possibly including data, that are destined to be printed or displayed on the web.

2.4.1 Data oriented XML

Data-oriented XML should be tagged according to a DTD that faithfully represents the data. We will see examples of that in the section about database integration.

2.4.2 Document oriented XML

For document-oriented XML, after using SGML and then XML for nearly 8 years, in all sorts of flavors and according to all sorts of DTDs, I have become a firm believer in what I'd call "HTML++". By this I mean that as much of the HTML DTD as possible should be used for text. There is really no need to redefine paragraphs, lists, code, headers, etc. Structuring elements can be added, such as sections, possibly typed ones: that's one +. Specific inline elements for domain-relevant data, such as part numbers and prices in a catalog or standard references in a standard, constitute the second +. Links can either use the familiar <a> tag or use different tags, possibly typed.
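A small sketch may help picture what "HTML++" buys you. The fragment below is invented for illustration: `partno`, `price`, the `type` and `currency` attributes, and the catalog data are hypothetical names, not taken from any real DTD. The text is marked up with ordinary HTML elements, a typed section wraps it (the first +), and domain-specific inline elements carry the data (the second +), so it can be pulled out without guessing at the prose.

```python
import xml.etree.ElementTree as ET

# Hypothetical "HTML++" catalog entry: HTML for text, plus a typed
# section and two domain-specific inline elements.
CATALOG_ENTRY = """<section type="product">
  <h2>Fasteners</h2>
  <p>The <partno>B-1042</partno> bolt costs <price currency="USD">0.25</price>
     and fits the <partno>N-0007</partno> nut.</p>
  <ul>
    <li>Sold in boxes of 100</li>
  </ul>
</section>
"""

section = ET.fromstring(CATALOG_ENTRY)

# The domain data is directly addressable by element name...
part_numbers = [e.text for e in section.iter("partno")]
prices = [(e.text, e.get("currency")) for e in section.iter("price")]

# ...while the paragraph still reads as plain text once tags are ignored.
plain_text = " ".join("".join(section.find("p").itertext()).split())

print(part_numbers)
print(prices)
print(plain_text)
```

Anything that already knows how to display `h2`, `p` and `ul` can render this fragment, and only the tools that care about part numbers need to know about the extra elements.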

XMLnews is a good example of such a DTD.

Starting from the XHTML DTD and adding the extra elements is definitely the easiest way to create that kind of DTD.

Although I did not use a DTD for this tutorial, one would look like:

  <!ELEMENT tutorial (h1, section+)>
  <!ELEMENT section (html_stuff)>
  <!-- html_stuff is just the usual HTML content, plus a couple of elements: -->
  <!-- a link to a resource, so they can be gathered -->
  <!ELEMENT resource EMPTY>
  <!ATTLIST resource refid IDREF #REQUIRED>
  <!-- a method from XML::Twig, so it can be linked to the doc -->
  <!ELEMENT method (#PCDATA)>
  <!ATTLIST method class CDATA #REQUIRED>
  <!-- a code example, contains the file name -->
  <!ELEMENT example (#PCDATA)>
  <!ATTLIST example desc CDATA #REQUIRED>
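To give an idea of why these three extra elements are worth having, here is a minimal sketch that gathers them from a made-up fragment. The fragment, the `refid` value `xml_com` and the variable names are assumptions of this example; only the element and attribute names come from the DTD sketch above. Note that Python's standard-library parser checks well-formedness only and does not validate against a DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment using the resource, method and example elements.
FRAGMENT = """<tutorial>
  <h1>XML, the Perl Way</h1>
  <section>
    <p>See <resource refid="xml_com"/> for detailed articles.</p>
    <p>Call <method class="XML::Twig">parse</method> to load a document.</p>
    <example desc="a simple document">simple_doc.xml</example>
  </section>
</tutorial>
"""

root = ET.fromstring(FRAGMENT)

# Every resource link, so they can be gathered into one list.
resources = [r.get("refid") for r in root.iter("resource")]

# Every method mention, as (class, method name) pairs, ready to be linked to the docs.
methods = [(m.get("class"), m.text) for m in root.iter("method")]

# Every code example: file name mapped to its description.
examples = {e.text: e.get("desc") for e in root.iter("example")}

print(resources)
print(methods)
print(examples)
```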
  
