XML, the Perl Way


OSCON Day 1

Arrival

I got here on Sunday night, checked in, and headed towards the main conference hotel, a 5 minute walk away. There I quickly spotted Haim, playing with his brand new Mac (shiny!), and soon Eric Cholet joined us, along with way too many mod_perl people (Stas, Geoff...) for me to feel comfortable ;--)

The general comment on Stas' and Eric's Book: it's BIG!

Once again this year it looks like everybody here has a Mac. As they all look the same, I wonder how many people will end up swapping them. As yet-another-side-note: Mac users seem to have problems with any device that does not conform to Apple's GUI standard: Eric could not figure out how to use his O'Reilly bag, for example. OTOH I had the worst time configuring my Linux laptop to connect to the O'Reilly wireless network instead of the hotel network (mostly because everybody else was making fun of me while I was fumbling with my config files ;--( ).

I got a kind of revenge later on when the wireless network went AWOL and O'Reilly started handing out flyers asking all Mac users to turn off Rendezvous because it could be the source of the problem... niark niark (although at this point I had no connection either so I wasn't laughing that hard). The latest rumour is that the problems were caused by a rogue Windows machine running a network with the same ID as the O'Reilly one though...

Making Programs Faster

MJD

Of course, this morning the tutorial I had signed up for, Advanced DBI, is cancelled, and my next choice, XSLT, is sold out... so I decide to sit in on MJD's "Making Programs Faster". MJD is always an entertaining speaker. Plus I am terrible at optimizing, the archetypal premature optimizer one might say.

MJD shows the Schwartzian transform (ST).

He then shows a simple case (from a post on the newsgroup) where it really doesn't make much sense to use it: lc is fast and can be used directly in a sort, so the overhead of the ST is not worth it. In this case the ST version is slower than the naive one.
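To make the comparison concrete, here is a minimal sketch of the two approaches on a case-insensitive sort. The word list is made up for illustration, and this is not MJD's actual benchmark, just the general shape of the two versions.

```perl
use strict;
use warnings;

my @words = qw(pear Apple banana cherry Date);    # sample data, an assumption

# Naive version: call lc directly inside the comparison block.
my @naive = sort { lc($a) cmp lc($b) } @words;

# Schwartzian transform: compute each key once, sort on the cached key,
# then throw the key away again.
my @st = map  { $_->[1] }
         sort { $a->[0] cmp $b->[0] }
         map  { [ lc($_), $_ ] } @words;

print "naive: @naive\n";
print "ST:    @st\n";
```

Both versions return the same order; the ST only pays off when computing the key is expensive compared to building and unpacking the temporary array references.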

Now he gives us some examples of I/O bound, CPU bound and memory bound code and adds some hints about how to optimize them: parallelize I/O bound code (or switch from CGI to mod_perl), optimize the code for CPU bound code, try reducing the memory used or buy more memory for memory bound code. The important thing is to figure out which factor impacts your code the most.

Tools for optimizing

Timing

The shell time is the easiest way to quickly figure out how much usr/CPU time a process takes. There are often 2 versions of time on a system: the built-in and one in /usr/bin/time or equivalent, with different output.

In Perl time() can be used; add use Time::HiRes qw(time); to get a time() with a resolution better than 1s.

The usual way he writes a benchmark: an empty loop, then the various options, so he has a base value to compare the different options to.
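A minimal sketch of that style of hand-rolled benchmark, assuming Time::HiRes is available (it ships with modern perls). The helper name time_loop, the sample string and the two operations being compared are my own choices for illustration, not code from the tutorial.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);    # replaces time() with a sub-second version

# Run $code $n times and return the elapsed wall-clock seconds.
sub time_loop {
    my ($n, $code) = @_;
    my $start = time;
    $code->() for 1 .. $n;
    return time - $start;
}

my $n    = 200_000;
my $text = 'Some Mixed Case Text';    # sample data, made up for this sketch

my %elapsed = (
    'empty' => time_loop($n, sub { }),                                  # baseline
    'lc'    => time_loop($n, sub { my $x = lc $text }),
    'tr_az' => time_loop($n, sub { (my $x = $text) =~ tr/A-Z/a-z/ }),
);

printf "%-6s %.4fs (%.4fs above the empty loop)\n",
    $_, $elapsed{$_}, $elapsed{$_} - $elapsed{empty}
    for qw(empty lc tr_az);
```

The empty loop gives the cost of the machinery itself, so the interesting number for each option is the difference from that baseline.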

He does not use Benchmark.pm: the results are different from his simple benchmarks, for some unknown reason, and they are not consistent if he repeats the test. He even gets results where the test takes... a negative time to complete! Plus the machinery for a benchmark should be as simple as possible, and you should be able to understand it.

Profiling

Profile your code before optimizing it, or you will probably optimize the wrong functions.

Use Devel::DProf:

  perl -d:DProf toto.pl <args> > /dev/null
  dprofpp

Send the output to /dev/null so output-related problems don't interfere.

This will give you a list of functions, and how much time the process spent in each one.

The usual rule here is 90-10: 10% of the code accounts for 90% of the run time. So focus on the 10%!

Devel::SmallProf gives you an even more detailed report, line by line.

Examples

Generic advice

A very important piece of advice: TEST THE OPTIMIZED VERSION! Make sure you don't break things in the process: you probably don't live in a world where you want to get the wrong answer as fast as possible.

Also think about the big picture, is the decrease in maintainability worth it? Always remember that hardware is cheap. We still live under the impression that hardware is expensive and precious. This is no longer true.

MJD has a cute cat! And nice jokes about him that he somehow manages to tie to the subject at hand.

Perception is important too: sometimes removing warning messages that say "this will take a long time" actually makes the users happier and stops them complaining, as in fact they don't notice the "long time"!

Optimize for the common case.

Speeding up a mailbox analyzer

In this case he found out that Mail::Header was taking a lot of time. So he replaced it with custom code. This is risky as mail protocols (like most protocols IMHO) are hard to get right.

As it turned out, when he checked, the output of the program had changed... for the better! There was a minor bug in Mail::Header that the simpler optimized code fixed!

So the optimized version is better, and 81% faster!

Yipee!

A look at the profiler results, followed by a detailed economic analysis, shows that he can stop here. That's actually a very interesting analysis: if he spends 20 minutes optimizing the next function, and actually gets it to improve its speed by 20%, he will need to run the code 25 million times to get a positive return on investment! I should do this more often, that would save me a lot of time.
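The arithmetic behind that kind of analysis is simple enough to sketch. The 20 minutes and the 25 million runs come from the talk; the saving per run (48 microseconds) is my own back-calculation from those two numbers, not a figure MJD gave, and break_even_runs is a name I made up.

```perl
use strict;
use warnings;

# Number of runs needed before the time saved equals the time invested.
sub break_even_runs {
    my ($seconds_invested, $seconds_saved_per_run) = @_;
    return $seconds_invested / $seconds_saved_per_run;
}

my $invested = 20 * 60;    # 20 minutes of programmer time, from the talk
my $saved    = 48e-6;      # ASSUMED saving per run, derived to match 25 million

printf "%.0f runs to break even\n", break_even_runs($invested, $saved);
```

Plugging in your own numbers before starting an optimization is a quick way to see whether it can ever pay for itself.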

Speeding up pod2man

This is useful because it is run quite often, any time you install a new module for example.

Here he finds out that he can optimize a POD tokenizer, by making it faster in the common case (I<text>) at the expense of the less common case I<< text >> (man is it a pain to type POD examples... in POD as I am doing right now! I even have to use the Z<> escape for the first time!)

To get a better idea of what's going on he needs more detailed output than provided by Devel::DProf.

Devel::SmallProf gives too much output, so he now teaches us how to write our own profiling module, using the hooks available for Devel:: modules: @{"::_<toto.pl"}, %DB::sub, DB::DB() and caller().

It is not that difficult actually!
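For the curious, here is a minimal sketch of such a module, using only DB::DB(), caller() and the @{"::_<file"} source arrays. It is not the module MJD wrote in class; the name Devel::LineCount and the top-ten report are my own assumptions. Saved as Devel/LineCount.pm somewhere in @INC, it would be run with perl -d:LineCount toto.pl.

```perl
package Devel::LineCount;    # hypothetical module name, not from the talk
use strict;
use warnings;

package DB;

our %hits;                   # "file:line" => number of times executed

# When perl runs with -d, it calls DB::DB() before each statement.
sub DB {
    my (undef, $file, $line) = caller;
    $hits{"$file:$line"}++;
}

# At exit, report the ten most executed lines along with their source text.
END {
    no strict 'refs';
    my @hot = sort { $hits{$b} <=> $hits{$a} } keys %hits;
    splice @hot, 10 if @hot > 10;
    for my $where (@hot) {
        my ($file, $line) = $where =~ /^(.*):(\d+)$/;
        my $src = ${"::_<$file"}[$line];    # source lines kept by perl under -d
        $src = '' unless defined $src;
        $src =~ s/\s+$//;
        printf STDERR "%6d  %s  %s\n", $hits{$where}, $where, $src;
    }
}

1;
```

This counts executions rather than measuring time, which is the simplest thing the hooks allow; a real profiler would also record timestamps in DB::DB().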

CGI (SOAP) application

A real-life example: speeding up a CGI application that was accessed in bursts (several hundred times per minute for a while, then nothing for a long time). It received XML data, parsed it, updated a DB, then signaled success or failure.

The solution: as the success depended only on whether the XML parsed or not, just parse the XML, save it to a file and return success/error. A batch process then reads the files and updates the DB, at its own pace. The client gets a return much faster, and the DB still gets updated.

Plus he got rid of a couple of modules that were used. He doesn't like modules that do very simple stuff... but with an OO interface (here the object was a string and the method used was 2 lines of code).

Blunders

Now some examples of failed attempts at optimization.

He starts with pseudo-hashes. That's too easy! ;--) It's quite a well-known story.

He actually gives a detailed, and interesting, explanation of how they work and why they are slower than regular hashes.

Then an example from the newsgroup where someone replaced an eval "$string" by an eval { $code }. This made the code _much_ faster! That would be because eval { $code } does not actually eval the code in $code.

Beware of benchmarks! Check that the different versions return the same result first, and then benchmark.
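The eval blunder is easy to reproduce. This is a small sketch with a made-up expression in $code, showing why the "optimized" version cannot be compared to the original at all:

```perl
use strict;
use warnings;

my $code = '2 + 3';    # a made-up snippet standing in for the real code

my $from_string = eval $code;        # compiles and runs the text held in $code
my $from_block  = eval { $code };    # only evaluates the variable: returns the string

print "string eval: $from_string\n";
print "block eval:  $from_block\n";
```

The block form never compiles anything, so of course it is faster: it does not do the job. Comparing the two return values before timing anything would have caught this.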

Then he shows how the old trick of pre-allocating arrays was found not to speed up execution. A benchmark showed that it actually slowed things down.

Darn! I thought I had guessed why the benchmark was wrong but no!

It's a classic: Perl (or the OS) already optimizes things, uses caches, pre-allocates arrays based on previous calls to a function... so your optimization might be redundant, and thus counter-productive.

He finishes with a similar example from Tie::File, where he wrote a really fancy caching algorithm for lines in the file... that turned out not to be used in practice.

The next example is from a thread on Perlmonks... with tons of silly advice on how to optimize a (rather simple) numerical problem ([id://134419]). His morals: don't micro-optimize, and there is plenty of crappy optimization advice.
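If you want to check the pre-allocation trick on your own perl, a sketch like the following will do. The array size, the function names and the repeat count are arbitrary choices of mine, and I make no claim about which version wins: that is exactly what the measurement is for.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

my $n = 200_000;    # array size, chosen arbitrarily for this sketch

# Let perl grow the array as needed.
sub fill_plain {
    my @a;
    $a[$_] = $_ * 2 for 0 .. $n - 1;
    return \@a;
}

# Pre-extend the array to its final size first.
sub fill_prealloc {
    my @a;
    $#a = $n - 1;
    $a[$_] = $_ * 2 for 0 .. $n - 1;
    return \@a;
}

# First make sure both versions build the same array...
my ($plain, $pre) = (fill_plain(), fill_prealloc());
die "results differ" unless "@$plain" eq "@$pre";

# ...and only then compare timings.
for my $version ([ plain => \&fill_plain ], [ prealloc => \&fill_prealloc ]) {
    my ($name, $code) = @$version;
    my $start = time;
    $code->() for 1 .. 5;
    printf "%-8s %.3fs\n", $name, time - $start;
}
```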

He finishes with some general advice that boils down to "THINK before you start optimizing!" (then think some more).

Basically optimizing is rarely worth it!

Conclusion

Overall a good tutorial stressing the dangers and potential pitfalls of optimization. As with most good tutorials, I liked the fact that it showed how to go about optimizing code, from non-optimized to the final version, through analysis, refinements, mistakes and knowing when to stop. I find that this is the most important thing you can get from such a class: seeing how the instructor's mind works.

Efficient SQL

by Greg Sabino Mullane

The tutorial will focus on PostgreSQL and how to make DB applications faster (do you see a trend in the tutorials I attend?)

PostgreSQL 7.4

SQL is usually the weakest link in the chain: when someone (usually not you) comes and tells you that your application is too slow, the SQL code is usually where you can optimize.

How to speed things up? 6 ways: hardware, OS, DBMS, Application, DB design and Query tuning.

Hardware and OS
RAM is the most important, fast disks can be useful too.
DBMS
By default some of the settings for PostgreSQL are set way too low (in order to run out-of-the-box on low-end machines), for example sort_mem and shared_buffers. See section 3.4.2 of the admin guide.
Application
Not that important unless there is a major flaw in the code.

He advises to keep the data as objects in Perl, with the SQL in a separate module, isolated from the main code. This way optimizing the SQL code is easier.

Try to leave as much as possible on the DB side, it will be faster and will make it easier to enforce constraints on the DB.

DB Design
He thinks that normalization is really important, and doesn't necessarily impact speed. That's what a database does: JOINs. Column order can impact the speed of the DB.
Query Tuning
That's what the tutorial is all about!

When thinking about optimizing, first figure out whether the problem is that all queries are slow (in which case you'd better look at the previous items) or whether some specific queries are too slow, in which case you can start working on them.

He now describes how a SQL query is parsed, optimized and executed by the DBMS.

EXPLAIN is used to show how the query will be parsed and executed (without running it), the number of estimated rows returned by each step and the estimated cost of each one.

EXPLAIN ANALYZE also runs the query and provides the actual time spent in each step and the number of rows returned by each step.

At this point the tutorial became quite boring for a while as he listed the various operators and their cost.

Indexes

Now we see how ANALYZE generates statistical data on the db (stored in the pg_statistic catalog). From there we can figure out which columns should be indexed: anytime a slow operation (typically a sequential scan) shows up we can add an index. The results are as spectacular as expected: from 27s to 5.5ms in the example shown.

PostgreSQL can build indexes on functions, e.g. on lower(column), to avoid having to recompute the function for each row.

A WHERE clause can be used to build partial indexes, which can be a lot more efficient than full indexes. For example NULL values can be excluded from the index. The results of ANALYZE are important to figure out whether it is worth narrowing down the index.

CLUSTER can also be used to physically move the data on disk, which increases the access speed.

Miscellaneous tidbits

Conclusion

Overall the tutorial was very deep and thorough, showing the process used to analyze and optimize SQL queries with PostgreSQL. It was also thoroughly boring at times: I found it hard to get excited by the fight to shave off a couple of milliseconds from a query (how to get from 6.07ms to 0.76ms in 5 painful^Heasy steps ;--) ).


OSCON Day 2

Morning Mess-up

I had breakfast with Antoine (Quint), who only has to write the slides for this afternoon's tutorial on SVG... I have no doubt he will pull it off though, he is quite famous for this kind of prowess.

There was a mess-up with my registration this morning, so no eXtreme Programming day for me... I'll attend the PAR tutorial and the second part of Damian's OO class this afternoon.

In other news the network is down... again! It makes you realize how much we depend on it, no email, no IRC, no use.perl.org :--(

PAR

Autrijus Tang

I missed the first part of the tutorial, due to the registration problem, so this report might end up being a little weird. The second part of the tutorial shows examples of using PAR.

PAR lets you package Perl applications and modules in a single par file. Binary (XS) parts can be included, for one or more architectures.

In Mozilla (and Netscape or Firebird) you can use jar:file:///home/mirod/foo.par!/MANIFEST to view the manifest, cool! META.yml holds the data. It is generated by Module::Build (as a side note I like Module::Build, if you write modules you should have a look at it).

Loading mod_perl based apps

PAR makes it really easy to install web-based software:

6 lines to install Slash! (instead of more than 6000 words in the Slash book)

 - get the par file
 - run slash.par
 - add a couple of lines to your httpd.conf
 - restart apache

Done!

To get PAR to work with Mason, you can use MasonX::Resolver::PAR.

Loading modules on demand

You can also make sure you are always using the latest version of a module. You need LWP installed, but that dependency will be removed in the future.

  use PAR;
  use lib 'http://my.org/par/DBI-latest.par'; # par is cached if you're off-line
  use DBI;                                    # loaded when updated

BTW DBI-latest.par can be automagically generated from CPAN, and if the tests fail the new module is not installed. Binaries can be downloaded too, without having to recompile them.

Binaries are not included on CPAN at this time, as including them would bloat CPAN (a lot!), so the exact distribution mechanism is not yet set. PAR is designed so it can choose the binary for the platform, from the testers, with a reputation system to choose the best one. Security is also taken into account through signatures.

The bottom line is that you don't need an installer for large-scale deployment, everything is downloaded on demand.

You can also use this to install modules on web hosts that do not allow module installation, provided they have LWP (Hey! Now we have yet another answer for all those people who want to solve a problem without using a module!)

Code Obfuscation

A hotly debated topic!

PAR will support pluggable input filters. There is already one, that strips POD. In any case it will be quite fragile: B::Deparse or B::Deobfuscate (created to piss-off the Stunnix guy!) will most likely defeat it.

So the hooks are provided by PAR, but it is not encouraged ;--)

Packing GUI applications

GUI apps usually use shared libraries. You can include the library in the PAR file (works for Tk and Qt and most likely for others). This way the application will use the specific version of the library sent with the PAR file. The user also doesn't have to install the library.

Module::Install

Utilities for module authors, so they can check the user config, get files through the network using the best possible way, ask for user input etc...

The necessary parts of Module::Install are shipped with the module, so the user doesn't even have to install it.

PAR uses Module::Install.

Platform-specific tips

To save bandwidth and disk space use UPX over zlib.

Conclusion

PAR looks very cool, the tutorial is very lively and enthusiastic about it.

State of the Onion

Larry is all over the place. I really can't do justice to his talk.

He warns us that he will have to find a proper job soon, with proper health coverage... but there will be an important announcement at the end of the talk...

Ponie!

Perl 5.10 on Parrot: Arthur Bergman will port Perl5 to Parrot, courtesy of Fotango! http://oscon.kwiki.org/index.cgi?Ponie opensource.fotango.com/ponie/

White Camel: Robert Spier, Jarkko Hietaniemi and Andreas Koenig (for CPAN).

State of the Snake

Guido is very boring


Wednesday

Keynotes: Tim O'Reilly

I must say I was a little foggy this morning (see picture of last night ;--( ) so I don't have much to say.

Lightning Talks

Stop using XML everywhere! Damn It!
See the slides
Robert Spier and Ask
A new perl.org! Cool!
Beecheek
A talk I already saw at YAPC, about a regular small company that switched to Open Source, notably replacing Access. A nice success story.
CPAN, The Next Generation
Autrijus Tang

Describes how CPAN works so far. The Mac has resolution problems, niark niark! Let's start again... So he shows us a huge graphic describing the really complex process needed to put a module on CPAN.

New tools available: CPANPLUS (replaces CPAN), Test::* (Simple, More...), Module::Build and Module::Install (replace ExtUtils::MakeMaker), Module::Release, Test::Reporter, rt.cpan.org, PAR, META.yml...

Cool! I did not know about Module::Release.

How to get hired
Andy Lester (petdance)

His advice: make sure you apply for jobs you will like, or you will hate your life later on. Ignore Monster.com et al., no one gets hired through them; perl jobs works though (he got his last 2 jobs this way). Dress properly, and do not be afraid of asking questions.

The Perl Date and Time Project
Dave Rolsky
Part-of-speech tagging
Aaron Coburn

He has problems with his Mac (niark...)

He describes software that analyzes English text and tags words with their role in the sentence, based on a Markov Model of the language. It's on CPAN.

    5 Damian Modules in 5 minutes
    Walt Mankowski

    5 modules from Damian that are actually useful:

      Lingua::EN::Inflect : convert singular to plural
      Switch              : at last the C switch in Perl... except this one
                            is on steroids
      Class::Multimethods : lets you do C++ style overloading
      Attribute::Types    : lets you type variables
      NEXT                : allows a class to redispatch methods (it does not
                            stop when the first appropriate method is found in
                            the inheritance tree, but lets you look for a method
                            by that name in an other part of this tree)
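Of the five, NEXT is the one whose description benefits most from an example. Here is a minimal sketch of the redispatch it describes, with class names (Left, Right, Both) that I made up; it assumes the NEXT module that ships with perl.

```perl
use strict;
use warnings;
use NEXT;

my @calls;    # records which hello() methods ran, in order

{
    package Left;
    # Pass the call on to the next hello() in the object's hierarchy.
    sub hello { my $self = shift; push @calls, 'Left'; $self->NEXT::hello() }
}
{
    package Right;
    sub hello { push @calls, 'Right' }
}
{
    package Both;
    our @ISA = ('Left', 'Right');
    sub new   { return bless {}, shift }
    sub hello { my $self = shift; push @calls, 'Both'; $self->NEXT::hello() }
}

Both->new->hello;
print "@calls\n";
```

With plain SUPER::hello, Left could never reach Right::hello, since Right is not one of Left's parents; NEXT keeps searching the hierarchy of the original object instead.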
    
    New Syntax for Links in POD
    Ronald Kimball

    YIPEE!!!!

    He proposes to have the target of the link marked using attributes:

      L< p<perlre> Perl Regular Expression>
      L< u<http://www.perl.com> Perl.com>
    

    Neat

    My Favourite CPAN modules
    A rap in Chinese by Autrijus Tang

    Draws a standing ovation from a stunned crowd!

    Saving the World with Inline.m
    Schuyler Erle

    How to use Inline to embed an ugly C library into Perl to project data in lat/long on a Dymaxion projection.

    Complexity Management
    A song by Piers Cawley

    The power (and danger!) of just.

    Great non-O'Reilly Books

      OO Perl (Damian Conway - Manning)
      Effective Perl Programming (Joseph Hall)
      Network Programming with Perl (Lincoln Stein)
      mod_perl Developer's Cookbook

      Code Complete (Steve McConnell)
      The Pragmatic Programmer (A Hunt and D Thomas)
      The Practice of System and Network Administration

      The Elements of Style (Strunk/White)
      7 habits of highly effective people (I don't know about this one)
      The Brand You 50 (Tom Peters)

      Shell::POSIX::Select
      Tim Maher

      A module that implements the shell select loop, except better. It allows you very simply to write little interactive tools.

      Allison's Restaurant
      A song by Allison Randal

      What's new in Parrot

      Dan Sugalski

      The most often used word in the talk: cool

      The JIT works, and is really really cool, IMCC works, C library call-outs work... without having to write ANY C code, objects are done (single inheritance so far), exceptions too.

      Little Languages in Parrot

      Acme

      Pretty pictures (a parrot, a cool Question logo), introduction to Parrot, presentation of all the potential little languages that could use Parrot (regexps, sed/awk, SQL), "real" languages that actually use it: brainfuck, Ook, befunge, Basic, lua.

      And now... Perl 5, with Ponie

      Clinton Pierce wrote a Basic on Parrot.

      Exceptions: not resumable by default but it can be done with continuations

      Threads are not implemented but the design takes them into account

      According to Dan, Leo is at least 3 people, maybe 4.

      No Z-machine yet though :--(


      Thursday

      45 minutes by MJD

      The Quilt project

      (for the French speakers a quilt is un "patchwork")

      MJD wrote a program to generate quilt patterns to impress his girlfriend. He eventually got married to her.

      The project was declared highly successful, actually the most successful he ever did.

      Text::Template

      Text::Template is stable, functional, and no new version needs to be released. So users complain and worry that it has been abandoned. Should he just update it every month? (He blames Microsoft for this deplorable state of mind).

      He also found out that people do not want to subclass the module and instead want him to add features to the module. He describes how to subclass it and urges not to be afraid of doing it.

      Getting help from strangers

      How to increase your chances of getting help when asking for it (from him or other strangers).

      Put your name in the email, give context, if it is important, explain why, be polite.

      How to progress

      (technically)

      • Read books that other people are not reading (so he cannot tell us which books to read or we would all be reading the same books ;--)
      • Read original source material (Einstein's book, Galileo's books)
      • Read actively, ask yourself questions while reading
      • Take notes

      NP-Complete Problems

      The Holy Grail of computing science is solving NP-Complete Problems (if you solve 1 you can solve all of them).

      Even if you cannot find the optimal solution to the problem that does not mean that you can't find a close enough solution. So don't give up.

      Intermission

      A (very short) song by MJD

      On Fish

      When people ask basic questions he tends to just give the answer, instead of answering perldoc foo (he gives a fish instead of teaching them how to fish). He explains why.

      Why Lisp is never going to win

      Basically because of the community attitude. He shows a very funny usenet post from

      A message for the Aliens

      He shows a real message sent to aliens. It is a REALLY weird message!

      Perl6 Design Philosophy

      Allison Randal

      Simple is better, but not all problems are simple (the simplest language is just no language, but that's not really useful).

      The waterbed theory: you will need some complexity, and if you simplify an aspect of the language, then an other aspect will have to be complex.

      The talk summarizes the design discussion in Perl6 Essentials which I was reading this morning.

      Template Toolkit 3

      Andy Wardley

      Lots of cool new syntax features, speed improvements.

      Andy plans to split up TT in 2: generic template processing tools in Template::Toolkit and Template::TT2, Template::TT3 (and even Template::Mason, Template::HTML).

      He wants the toolkit to be more modular, so you can swap out some parts, replace them, or use equivalent parts from another toolkit. This will also make debugging easier. It will also allow for the creation of custom tags, processed by the templating system.

      You could also mix templating systems (have some Mason templates, some custom ones and some TT3 ones).

      Randal throws in the idea that TT4 could compile to Parrot bytecode, which would open the door to interesting extensions.

      TT3 will be available... real soon now! Hopefully by the end of the year.

      TIPS for Learning XSLT

      Adam "Ziggy" Turoff

      XSLT IS a programming language. And a weird one at that. It actually includes 3 languages: XPath, XSLT and the output language (HTML, or RSS or whatever).

      It encourages incremental development: change -> test. Start small and grow the program from there. Use XPath as much as possible. XPath is extremely powerful. It appears in match templates, in select expressions, in attribute values (att="{count(ancestor::*)}").

      Use the default behavior (visit children for all elements and emit text or attribute nodes).

      If multiple templates match, the last one is used: place more specific matches after generic ones.

      You can loop using recursion... but try to avoid it. Iterate over lists of nodes returned by XPath.

      Re-use is done with importing and including, which are slightly different: include overrides the local definitions, while import doesn't.

      Look at the Docbook stylesheets

      Use empty template rules to remove elements

      Push vs Pull

      push
      process current element, then let templates apply to children. This is rule based programming.
      pull
      grab current element and the relevant children, process. This is closer to procedural programming.

      Choose wisely which one to use. Roughly pull is often good for data (very structured) while push is better for documents.

      Cool, that's one distinction that I have always found very important, but that I rarely see mentioned, at least that clearly.

      5 things we do wrong with XML

      RJ Ray

      This sounds like a talk for me!

      This talk is in honor of XML's 5th Anniversary.

      People are too quick to use XML

      Often people just want to be buzzword compliant.

      People are too slow to use it

      Sometimes XML IS the right solution. Then don't hesitate.

      These things should have schemas

      Software change logs, diffs, cooking recipes

      Then my laptop died... no more battery

      Actually the talk was not great, I think it lacked focus and I did not agree with some of the opinions of the author.

      Writing about Perl

      Randal Schwartz and Tom Phoenix

      To really learn the subject answer questions, on Usenet or Perlmonks. Teaching is great because of the immediate feedback.

      Authoring formats: pod, FrameMaker, troff, LaTeX (in increasing order of ugliness)

      Writing a book takes a LONG time: a whole day of work per page. It doesn't really pay either (less than minimum wage). The book's main point is to raise your profile.

      Write the introduction last (when you know what the rest of the book is).


      Friday

      Perl6::Rules

      Damian Conway

      No, Perl6::Rules is not just a statement, it is also a Perl5 module.

      It is the successor of Parse::RecDescent, allowing the use of Perl6 regular expressions in Perl5.

      The more I hear about them, the more I think that Perl6 new regexps (renamed rules) are amazingly powerful. I will not go over the complete syntax here.

      Damian shows us the test suite for the module.