Perl-XML Frequently Asked Questions

This work is distributed under the terms of Perl's Artistic License. Distribution and modification is allowed provided all of the original copyright notices are preserved.

All code examples in these files are hereby placed into the public domain. You are permitted and encouraged to use this code in your own programs for fun or for profit as you see fit.

Last updated: March 18, 2008

Abstract

This document aims to provide answers to questions that crop up regularly on the 'perl-xml' mailing list. In particular it addresses the most common question for beginners - "Where do I start?"

The official home for this document on the web is: http://perl-xml.sourceforge.net/faq/. The official source for this document is in CVS on SourceForge at http://perl-xml.cvs.sourceforge.net/perl-xml/perl-xml-faq/


1. Tutorial and Reference Sources
1.1. Where can I get a gentle introduction to XML and Perl?
1.2. Where can I find an XML tutorial?
1.3. Where can I find reference documentation for the various XML Modules?
2. Selecting a Parser Module
2.1. Don't select a parser module.
2.2. The Quick Answer
2.3. Tree versus stream parsers
2.4. Pros and cons of the tree style
2.5. Pros and cons of the stream style
2.6. How to choose a parser module
2.7. Rolling your own parser
3. CPAN Modules
3.1. XML::Parser
3.2. XML::LibXML
3.3. XML::XPath
3.4. XML::DOM
3.5. XML::Simple
3.6. XML::Twig
3.7. Win32::OLE and MSXML.DLL
3.8. XML::PYX
3.9. XML::SAX
3.10. XML::SAX::Expat
3.11. XML::SAX::ExpatXS
3.12. XML::SAX::Writer
3.13. XML::SAX::Machines
3.14. XML::XPathScript
3.15. How can I install XML::Parser under Windows?
3.16. How can I install other binary modules under Windows?
3.17. What if a module is not available in PPM format?
3.18. "could not find ParserDetails.ini"
4. XSLT Support
4.1. XML::LibXSLT
4.2. XML::Sablotron
4.3. XML::XSLT
4.4. XML::Filter::XSLT
4.5. AxKit
5. Encodings
5.1. Why do we need encodings?
5.2. What is UTF-8?
5.3. What can I do with a UTF-8 string?
5.4. What can Perl do with a UTF-8 string?
5.5. What can Perl 5.8 do with a UTF-8 string?
5.6. How can I convert from UTF-8 to another encoding?
5.7. What does 'use utf8;' do?
5.8. What are some commonly encountered problems with encodings?
6. Validation
6.1. DTD Validation Using XML::Checker
6.2. DTD Validation Using XML::LibXML
6.3. W3C Schema Validation With XML::LibXML
6.4. W3C Schema Validation With XML::Xerces
6.5. W3C Schema Validation With XML::Validator::Schema
6.6. Simple XML Validation with Perl
6.7. XML::Schematron
7. Common Coding Problems
7.1. How should I handle errors?
7.2. Why is my character data split into multiple events?
7.3. How can I split a huge XML file into smaller chunks
8. Common XML Problems
8.1. 'xml processing instruction not at start of external entity'
8.2. 'junk after document element'
8.3. 'not well-formed (invalid token)'
8.4. 'undefined entity'
8.5. 'reference to invalid character number'
8.6. Embedding Arbitrary Text in XML
8.7. Using XPath with Namespaces
9. Miscellaneous
9.1. Is there a mailing list for Perl and XML?
9.2. How do I unsubscribe from the perl-xml mailing list?
9.3. What happened to Enno?

1. Tutorial and Reference Sources

1.1. Where can I get a gentle introduction to XML and Perl?

Kip Hampton has written a number of articles on the subject of Perl and XML, which are available at www.xml.com.

1.2. Where can I find an XML tutorial?

For the official executive summary, try the W3C's XML in 10 points.

If you're a complete XML newbie and struggling with jargon like 'element', 'entity', 'DTD', 'well-formed' etc., you could try XML101.com. The site has no Perl content and a strong Microsoft/IE bias, but you can come back here when you're finished :-)

On the other hand if you've worked with XML a bit and think you pretty much know it, the tutorial at skew.org will test the boundaries of your knowledge.

Another great source of information is the XML FAQ.

1.3. Where can I find reference documentation for the various XML Modules?

The reference documentation is embedded in the Perl code of each module in 'POD' (Plain Old Documentation) format. There are a number of ways you might gain access to this documentation:

  • The perldoc command will locate the module file, extract the POD text and format it for reading on screen. For example, if you want to read the documentation for the XML::Parser module, you would type: perldoc XML::Parser

  • Some Perl distributions (notably ActiveState Perl) include the documentation in HTML format. Under Windows, you should find this under: Start->Programs->ActiveState ActivePerl. If your distribution does not include the HTML files, you can create them using pod2html.

  • HTML documentation for various Perl modules is also provided on various Internet sites. You can try searching for XML on Perldoc.com or on search.cpan.org for a list of XML documentation.

  • If all else fails, you can locate the module and open it directly in a text editor. Once again, using XML::Parser as an example, you would look for a file called Parser.pm in a directory called XML under lib. Once you have opened the file you can search for '=head' to locate POD.

2. Selecting a Parser Module

2.1. Don't select a parser module.

If you want to use Perl to solve a specific problem, it's possible that someone has already solved it and published their module on CPAN. This will allow you to ignore the details of the XML layer and start working at a higher level. Here's a random selection of CPAN modules which work with XML data but provide a higher level API:

  • If you want to use XML to transmit data across a network to use or provide 'web services', take a good look at SOAP::Lite (forget about the 'Lite' moniker, this is a serious piece of work).

  • Perhaps you've played around with the Glade GUI builder and discovered it uses XML to store the interface definitions. The Gtk2::GladeXML module already knows how to read those files and turn them into a working GUI with only a few lines of Perl code (see this article for an intro).

  • Maybe you've had the brilliant idea that you could serialise your Perl objects to XML format to support your new killer RPC over SMTP protocol. Before you start coding, try installing SPOPS (Simple Perl Object Persistence with Security); you might find it already does exactly what you need. (To go one step further, the aforementioned SOAP::Lite supports SMTP as a transport.)

There are dozens of other examples of existing Perl modules which work with XML data in domain-specific formats and allow you to get on with the job of using that data. Remember, search.cpan.org is your friend.

2.2. The Quick Answer

For general purpose XML processing with Perl, XML::LibXML is usually the best choice. It is stable, fast and powerful. To make the most of the module you need to learn and use XPath expressions. The documentation for XML::LibXML is its biggest weakness.

Other modules may be better suited to particular niches - as discussed below.
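
As a rough illustration of the XPath-driven style mentioned above, here is a minimal sketch. The sample document and its element names are invented for the example, and it assumes a reasonably recent XML::LibXML (one that provides load_xml):

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample document, invented for this sketch.
my $xml = <<'XML';
<library>
  <book id="b1"><title>First Book</title><year>2000</year></book>
  <book id="b2"><title>Second Book</title><year>2002</year></book>
</library>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# findnodes() returns the nodes matched by an XPath expression.
my @titles = map { $_->textContent } $doc->findnodes('/library/book/title');

# findvalue() returns the string value of whatever the expression selects.
my $recent_id = $doc->findvalue('/library/book[year > 2001]/@id');

print "$_\n" for @titles;
print "Published after 2001: $recent_id\n";
```

Most day-to-day work with the module follows this pattern: parse once, then pull out what you need with XPath expressions.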

2.3. Tree versus stream parsers

If you really do need to work with data in XML format, you need to select a parser module and there are many to choose from. Most modules can be classified as using either a 'tree' or a 'stream' model.

A tree based parser will typically parse the whole XML document and return you a data structure made up of 'nodes' representing elements, attributes, text content and other components of the document.

A stream based parser on the other hand, sends the data to your program in a stream of 'events' as the XML is parsed.

A tree based module will typically provide you with an API of functions for searching and manipulating the tree. The Document Object Model (DOM) is a standard API implemented by a number of modules. Other modules use non-standard APIs to take advantage of the many conveniences available to Perl programmers.

To use a stream based module, you typically write some handler or 'callback' functions and register them with the parser. As each component of the XML document is read in and recognised, the parser will call the appropriate handler function to give you the data. SAX (the Simple API for XML) is a standard object-oriented API implemented by all the stream-based parsers (except parsers written before SAX existed).
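
To make the 'handler' idea concrete, here is a minimal sketch of a SAX handler that counts how often each element name occurs. The ElementCounter package name is made up for the example, and it assumes XML::SAX is installed with at least one parser registered:

```perl
use strict;
use warnings;
use XML::SAX::ParserFactory;

# A handler class; the package name ElementCounter is hypothetical.
{
    package ElementCounter;
    use base 'XML::SAX::Base';

    # Called by the parser each time a start tag is recognised.
    sub start_element {
        my ($self, $element) = @_;
        $self->{count}{ $element->{Name} }++;
        $self->SUPER::start_element($element);
    }
}

my $handler = ElementCounter->new;
my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);
$parser->parse_string('<list><item>a</item><item>b</item></list>');

for my $name (sort keys %{ $handler->{count} }) {
    print "$name: $handler->{count}{$name}\n";
}
```

Note that the code never holds the whole document; it only sees one event at a time and keeps whatever state it chooses to.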

2.4. Pros and cons of the tree style

Programmers new to XML may find it easier to get started with a tree based parser - with one method call your document is parsed and available to your code.

Portability may be an issue for tree style code. Even the modules which support a DOM API differ enough that you will generally have to change your code if you need to switch to another parser module. The DOM itself is language-neutral, which may be an advantage if you're coming from C or Java but Perl programmers may find some of the constructs clumsy.

The memory requirements of a tree based parser can be surprisingly high. Because each node in the tree needs to keep track of links to ancestor, sibling and child nodes, the memory required to build a tree can easily reach 10-30 times the size of the source document. You probably don't need to worry about that though unless your documents are multi-megabytes (or you're running on lower spec hardware).

Some of the DOM modules support XPath - a powerful expression language for selecting nodes to extract data from your tree. The full power of XPath simply cannot be supported by stream based parsers since they only hold a portion of the document in memory.

2.5. Pros and cons of the stream style

Stream based parsers can (but don't always) offer better memory efficiency than tree based parsers, since the whole document does not have to be parsed before you can work with it.

SAX parsers also score well for portability. If you use the SAX API with one parser module, you can almost certainly swap to another SAX parser module without changing a line of your code.

The SAX approach encourages a very modular coding style. You can chain SAX handlers together to form a processing pipeline - similar in spirit to a Unix command pipeline. Each link (or 'filter') in the chain performs a well-defined function. The individual components tend to have a high degree of reusability.
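
A pipeline of this kind might be sketched as follows. The UpperCaseText filter is a made-up example, and the sketch assumes XML::SAX::Writer is installed to act as the last link in the chain:

```perl
use strict;
use warnings;
use XML::SAX::ParserFactory;
use XML::SAX::Writer;

# A filter that upper-cases text content; the package name is hypothetical.
{
    package UpperCaseText;
    use base 'XML::SAX::Base';

    sub characters {
        my ($self, $data) = @_;
        # Pass a modified copy of the event on to the next handler in the chain.
        $self->SUPER::characters({ %$data, Data => uc $data->{Data} });
    }
}

# Build the chain back to front: parser -> filter -> writer.
my $output = '';
my $writer = XML::SAX::Writer->new(Output => \$output);
my $filter = UpperCaseText->new(Handler => $writer);
my $parser = XML::SAX::ParserFactory->parser(Handler => $filter);

$parser->parse_string('<greeting>hello world</greeting>');
print "$output\n";
```

Because the filter only overrides the one event it cares about, every other event passes through untouched, which is what makes such components easy to reuse.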

SAX also has applications which are not tied to XML. Modules exist that can generate SAX event streams from the results of database queries or the contents of spreadsheet files. Downstream filter modules neither know nor care whether the original source data was in XML format.

One 'gotcha' with the stream based approach is that you can't be sure that a document is error-free until the end of the parse. For this reason, you may want to confirm that a document is well-formed before you pass it through your SAX pipeline.

2.6. How to choose a parser module

Choice is a good thing - there are many parser modules to choose from simply because no one solution will be appropriate in all cases. Get stuck in: if you discover you made the wrong choice, it's probably not going to be hard to change and you'll have some experience on which to base your next choice. The following advice is one person's view - your mileage may vary:

First of all, make sure you have XML::Parser installed - but don't plan to use it. Other modules provide layers on top of XML::Parser - use them. Further justification for this apparently contradictory advice can be found in the XML::Parser description below.

If your needs are simple, try XML::Simple. It's loosely classified as a tree based parser although the 'tree' is really just nested Perl hashes and arrays. You may need to swot up on Perl references (see: perldoc perlreftut) to take advantage of this module.

If you're looking for a more powerful tree based approach, try XML::LibXML for a standards compliant DOM or XML::Twig for a more 'Perl-like' API. Both of these modules support XPath.

If you've decided to use a stream based approach, head directly for SAX. The XML::SAX distribution includes a base class you can use for your filters as well as a very portable parser module written entirely in Perl (XML::SAX::PurePerl). You will probably also want to install XML::SAX::Expat which uses the same C-based parser library ('expat' by James Clark) as XML::Parser, for faster parsing.

Finally, the latest trendy buzzword in Java and C# circles is 'pull' parsing (see www.xmlpull.org). Unlike SAX, which 'pushes' events at your code, the pull paradigm allows your code to ask for the next bit when it's ready. This approach is reputed to allow you to structure your code more around the data rather than around the API. Eric Bohlman's XML::TokeParser offers a simple but powerful pull-based API on top of XML::Parser. There are currently no Perl implementations of the XMLPULL API.

2.7. Rolling your own parser

You may be tempted to develop your own Perl code for parsing XML. After all, XML is text and Perl is a great language for working with text. But before you go too far down that track, here are some points to consider:

  • Smart people don't. (Actually a number of really smart people have - that's why there's a range of existing parsers to choose from.)

  • It's harder than you think. The first major hurdle is encodings. Then you'll have to handle DTDs - even if you're not doing validation. The feature list will also need to include numeric and named entities, CDATA sections, processing instructions and well-formedness checks. You probably should support namespaces too.

  • If you haven't done all that, it's not XML. It might work for that subset of XML that you deem to be important, but if you can't exchange documents with other parties, what's the point?

  • Even if it works it will be slow.

If none of the existing modules have an API that suits your needs, write your own wrapper module to extend the one that comes closest.

3. CPAN Modules

This section attempts to summarise the most commonly used XML modules available on CPAN. Many of the modules require that you have certain libraries installed and have a compiler available to build the Perl wrapper for the libraries (binary builds are available for some platforms). Except where noted, the parsers are non-validating.

You can find more in-depth comparisons of the modules and example source code in the Ways to Rome articles maintained by Michel Rodriguez.

3.1. XML::Parser

XML::Parser is a Perl wrapper around James Clark's 'expat' library - an XML parser written in C. This module was originally written by Larry Wall and maintenance was taken over by Clark Cooper. It is fast, complete, widely used and reliable. XML::Parser offers both tree and stream interfaces. The stream interface is not SAX, so don't use it for any new code. The tree interfaces are not a lot of fun to work with either. They're non-standard (no DOM support), not OO and offer no real API. The reason you might want XML::Parser is because it provides a solid base which is used by other modules with better APIs.

Before you rush off and try to install XML::Parser, make sure that you haven't got it already - it comes standard with ActiveState's Perl on all supported platforms. If you do need to install it, you'll need to install the expat library first (which you can get from expat.sourceforge.net) and you will need a compiler.

Most of the documentation you need for the XML::Parser API can be accessed using perldoc XML::Parser, however some of the low-level functionality is documented in perldoc XML::Parser::Expat.
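
To see why the native tree interface is described above as offering 'no real API', here is a small sketch of what the 'Tree' style hands back. It is meant as an illustration rather than a recommendation, and the sample document is invented:

```perl
use strict;
use warnings;
use XML::Parser;
use Data::Dumper;

my $parser = XML::Parser->new(Style => 'Tree');
my $tree   = $parser->parse('<note id="n1">Hello <b>world</b></note>');

# The result is a nested array: [ tag, [ \%attributes, 0, text, tag, [...], ... ] ]
# where a pseudo-tag of 0 means "the next item is text".
my ($root_tag, $content) = @$tree;
my $attributes = $content->[0];

print "root element: $root_tag\n";
print "id attribute: $attributes->{id}\n";
print "first text:   $content->[2]\n";

# Dump the whole structure to see how children are interleaved.
$Data::Dumper::Indent = 1;
print Dumper($tree);
```

Everything is reached by array position, so any searching or navigation has to be written by hand.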

3.2. XML::LibXML

XML::LibXML provides a Perl wrapper around the GNOME Project's libxml2 library. This module was originally written by Matt Sergeant and Christian Glahn and is now actively maintained by Petr Pajas. It is very fast, complete and stable. It can run in validating or non-validating modes and offers a DOM with XPath support. The DOM and associated memory management is implemented in C which offers significant performance advantages over DOM trees built from Perl datatypes. The XML::LibXML::SAX::Builder module allows a libxml2 DOM to be constructed from SAX events. XML::LibXML::SAX is a SAX parser based on the libxml2 library.

XML::LibXML can also be used to parse HTML files into DOM structures - which is especially useful when converting other formats to XML or using XPath to 'scrape' data from web pages.
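
A minimal 'scraping' sketch might look like this. The HTML fragment and URLs are invented, and it assumes a version of XML::LibXML that provides load_html:

```perl
use strict;
use warnings;
use XML::LibXML;

# A made-up fragment of deliberately sloppy HTML (the <p> is never closed).
my $html = <<'HTML';
<html><body>
<p>Links:
<a href="http://example.com/one">One</a>
<a href="http://example.com/two">Two</a>
</body></html>
HTML

# recover => 1 asks libxml2 to carry on past markup errors;
# suppress_errors => 1 keeps it from printing warnings about them.
my $doc = XML::LibXML->load_html(
    string          => $html,
    recover         => 1,
    suppress_errors => 1,
);

# Select the href attribute of every <a> element.
my @links = map { $_->value } $doc->findnodes('//a/@href');
print "$_\n" for @links;
```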

The libxml2 library is not part of the XML::LibXML distribution. Precompiled distributions of the libxml2 library and the XML::LibXML Perl wrapper are available for most operating systems. The library is a standard package in most Linux distributions; it can be compiled on numerous other platforms; and it is bundled with PPM packages of XML::LibXML for Windows.

For early access to upcoming features such as W3C Schema and RelaxNG validation, you can access the CVS version of XML::LibXML at:

cvs -d :pserver:anonymous@axkit.org:/home/cvs co XML-LibXML
      
3.3. XML::XPath

Matt Sergeant's XML::XPath module was the first Perl DOM implementation to support XPath. It has largely been supplanted by XML::LibXML which is better maintained and more powerful.

3.4. XML::DOM

Enno Derksen's XML::DOM implements the W3C DOM Level 1 tree structure and API. It should not be your first choice of DOM module however, since it lacks XPath and namespace support and it is significantly slower than libxml.

TJ Mather is currently the maintainer of this package.

3.5. XML::Simple

Grant McLean's XML::Simple was originally designed for working with configuration files in XML format but it can be used for more general purpose XML work too. The 'simple tree' data structure is nothing more than standard Perl hashrefs and arrays - there is no API for finding or transforming nodes. This module is not suitable for working with 'mixed content'. XML::Simple has its own frequently asked questions document.
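
The 'hashrefs and arrays' structure is easiest to appreciate with a small sketch. The configuration document here is invented, and the ForceArray and KeyAttr settings are just one reasonable choice for it:

```perl
use strict;
use warnings;
use XML::Simple qw(:strict);

my $xml = <<'XML';
<config debug="0">
  <server name="alpha" port="8080"/>
  <server name="beta"  port="8081"/>
</config>
XML

# In strict mode, ForceArray and KeyAttr must be given explicitly.
my $config = XMLin($xml, ForceArray => ['server'], KeyAttr => []);

# $config is now a plain hashref; 'server' holds an arrayref of hashrefs.
for my $server (@{ $config->{server} }) {
    print "$server->{name} listens on port $server->{port}\n";
}
```

Forcing 'server' to always be an array means the loop keeps working when a file happens to contain only one server.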

Although XML::Simple uses a tree-style, the module also supports building the tree from SAX events or using a simple Perl data structure to drive a SAX pipeline.

If you are using XML::Simple, you should read "Does your XML::Simple code pass the strict test?" for a discussion of common pitfalls and ways to avoid them.

If you are becoming frustrated by the limitations of XML::Simple, see: "Stepping up from XML::Simple to XML::LibXML".

3.6. XML::Twig

Although DOM modules can be very convenient, they can also be memory hungry. If you need to work with very large documents, you may find XML::Twig by Michel Rodriguez to be a good solution. Rather than parsing your whole document and returning one large 'tree', this module allows you to define elements which can be parsed as discrete units and passed to your code as 'twigs' (small branches of a tree). You can also define whether the 'uninteresting bits' between the twigs should be discarded or streamed to STDOUT as they are processed.

Another advantage of XML::Twig is that it is not constrained by the tyranny of DOM compliance. Instead, it offers a number of conveniences to help the experienced Perl programmer feel right at home. XML::Twig also supports XPath expressions. The module's official home page is http://www.xmltwig.com/.
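
The 'twig' approach described above might be sketched like this. The order data is invented; in real use the document would normally be a large file read with parsefile rather than a string:

```perl
use strict;
use warnings;
use XML::Twig;

my $xml = <<'XML';
<orders>
  <order id="1"><total>10.50</total></order>
  <order id="2"><total>4.25</total></order>
</orders>
XML

my $sum = 0;
my $twig = XML::Twig->new(
    twig_handlers => {
        # Called once each <order> element has been completely parsed.
        order => sub {
            my ($t, $order) = @_;
            $sum += $order->first_child_text('total');
            $t->purge;    # release the memory used by what has been parsed so far
        },
    },
);
$twig->parse($xml);

printf "Total of all orders: %.2f\n", $sum;
```

Because each order is purged after it has been handled, memory use stays roughly constant however many orders the document contains.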

3.7. Win32::OLE and MSXML.DLL

If you're using a Windows platform and you're not worried about portability, Microsoft's MSXML provides a DOM implementation with optional validation and support for both XPath and XSLT. MSXML is a COM component and as such can be accessed from Perl using Win32::OLE. Unfortunately this means you can't get at the documentation with the usual perldoc command, so here's a code snippet to get you started:

use Win32::OLE qw(in with);

my $xml_file  = 'your file name here';
my $node_name = 'element name or XPath expression';

# Create an MSXML DOM document object via COM.
my $dom = Win32::OLE->new('MSXML2.DOMDocument.3.0') or die "new() failed";

# Load synchronously and without validation.
$dom->{async} = "False";
$dom->{validateOnParse} = "False";
$dom->Load($xml_file) or die "Parse failed";

# Print the text of every node matched by the expression.
my $node_list = $dom->selectNodes($node_name);
foreach my $node (in $node_list) {
  print $node->{Text}, "\n";
}
	

Shawn Ribordy has written an article about using MSXML from Perl at www.perl.com. You can find reference documentation for the MSXML API on MSDN.

3.8. XML::PYX

Although written in Perl, Matt Sergeant's XML::PYX is really designed for working with XML files using shell command pipelines. The PYX notation allows you to apply commands like grep and sed to specific parts of the XML document (eg: element names, attribute values, text content). For example, this one-liner provides a report of how many times each type of element is used in a document:

pyx doc.xml | sed -n 's/^(//p' | sort | uniq -c
      

This one creates a copy of an XML document with all attributes stripped out:

pyx doc1.xml | grep -v '^A' | pyxw > doc2.xml
      

And this one spell checks the text content of a document skipping over markup text such as element names and attributes:

pyx talk.xml | sed -ne 's/^-//p' | ispell -l | sort -u
      

Eric Bohlman's XML::TiePYX is an alternative Perl PYX implementation which employs tied filehandles. One of the aims of the design was to work around limitations of the Win9x architecture which doesn't really support pipes. Using this module you can print PYX format data to a filehandle and well-formed XML will actually be written. Conversely, you can read from an XML file and <FILEHANDLE> will return PYX data.

Sean McGrath has written an article introducing PYX on XML.com. PYX can be addictive - especially if you're an awk or sed wizard, but if you find you're using Perl in your pipelines you should consider switching to SAX.
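
If the PYX notation used by the one-liners above is unfamiliar, the following sketch may help. It uses only core Perl and a hand-written list of PYX lines (standing in for the output of the pyx command) to do the same job as the first one-liner:

```perl
use strict;
use warnings;

# PYX for <doc id="1"><p>Hello</p><p>World</p></doc>, one event per line:
#   (name        start tag        -text   character data
#   Aname value  attribute        )name   end tag
my @pyx = (
    '(doc',
    'Aid 1',
    '(p',
    '-Hello',
    ')p',
    '(p',
    '-World',
    ')p',
    ')doc',
);

# Count start-tag lines per element name, as the sed/sort/uniq pipeline does.
my %count;
for my $line (@pyx) {
    $count{$1}++ if $line =~ /^\((.+)/;
}

print "$count{$_} $_\n" for sort keys %count;
```

Since the first character of every line identifies the kind of event, line-oriented tools can pick out exactly the parts of the document they care about.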

3.9. XML::SAX

The XML::SAX distribution includes a number of things you're likely to need if you're working with SAX.

XML::SAX::ParserFactory is used to get a parser object without having to know which parsers are installed.

Any SAX filters you develop should inherit from XML::SAX::Base. This will save you time as well as ensuring your filters 'play nicely' with other SAX components.

XML::SAX::PurePerl is a SAX parser module written entirely in Perl. Its performance isn't great but if you don't have a compiler on your system it may be your only option.

Authors of the Perl SAX spec and the modules which implement it include Ken MacLeod, Matt Sergeant, Kip Hampton and Robin Berjon.

3.10. XML::SAX::Expat

This module implements a SAX2 interface around the expat library, so it's fast, stable and complete. XML::SAX::Expat doesn't link expat directly but through XML::Parser. Hence, this module requires XML::Parser, and doesn't compile any XS code on installation. If you have XML::SAX and XML::Parser installed, you'll want to install this module to improve on the speed and encoding support offered by XML::SAX::PurePerl.

Robin Berjon is the author of this module although he claims to have stolen the code from Ken MacLeod.

3.11. XML::SAX::ExpatXS

This module, analogous to XML::SAX::Expat, implements a SAX2 interface around the expat library but it links expat directly. This is why XML::SAX::ExpatXS is even faster than XML::SAX::Expat. There is no dependence on XML::Parser but either some XS code must be compiled or a binary package installed.

This module was started by Matt Sergeant and completed by Petr Cimprich.

3.12. XML::SAX::Writer

The XML::SAX::Writer module is used to generate XML output from a SAX2 pipeline. This module can receive pluggable consumer and encoder objects. The default encoder is based on Text::Iconv.

XML::SAX::Writer was created by Robin Berjon.

3.13. XML::SAX::Machines

Once you start accumulating SAX filter modules and fitting them together into SAX pipelines, you may find writing the glue code is a little tedious and error prone. Barrie Slaymaker found this and built XML::SAX::Machines to solve the problem so you don't have to. Using this module, you can reduce this code:

use XML::SAX::ParserFactory;
use MyFilter::One;
use MyFilter::Two;
use MyFilter::Three;
use XML::SAX::Writer;

my $writer  = XML::SAX::Writer->new(Output => $output_file);
my $filter3 = MyFilter::Three->new(Handler => $writer);
my $filter2 = MyFilter::Two->new(Handler => $filter3);
my $filter1 = MyFilter::One->new(Handler => $filter2);
my $parser  = XML::SAX::ParserFactory->parser(Handler => $filter1);

$parser->parse_uri($input_file);
	

to this:

use XML::SAX::Machines qw( Pipeline );

# Class names are quoted so the code also compiles under 'use strict'.
Pipeline(
  'MyFilter::One' => 'MyFilter::Two' => 'MyFilter::Three' => ">$output_file"
)->parse_uri($input_file);
	

There are lots of other goodies as well, some of which are covered in articles by Kip Hampton.

3.14. XML::XPathScript

XPathScript is a stylesheet language comparable to XSLT, for transforming XML from one format to another (possibly HTML, but XPathScript also shines for non-XML-like output).

Like XSLT, XPathScript offers a dialect to mix verbatim portions of documents and code. Also like XSLT, it leverages the powerful "templates/apply-templates" and "cascading stylesheets" design patterns, which greatly simplify the design of stylesheets for programmers. The availability of the XPath query language inside stylesheets promotes the use of a purely document-dependent, side-effect-free coding style. But unlike XSLT, which uses its own dedicated control language with an XML-compliant syntax, XPathScript uses Perl, which is terse and highly extendable.

As of version 0.13 of XML::XPathScript, the module can use either XML::LibXML or XML::XPath as its parsing engine. Transformations can be performed either using a shell-based script or, in a web environment, within AxKit.

3.15. How can I install XML::Parser under Windows?

The ActiveState Perl distribution includes many CPAN modules in addition to the core Perl module set. XML::Parser is one of these additional modules, so you've already got it.

3.16. How can I install other binary modules under Windows?

ActiveState Perl includes the 'Perl Package Manager' (PPM) utility for installing modules. You use PPM from a command window (DOS prompt) like this:

C:\> ppm
ppm> install XML::Twig
	

You must be connected to the Internet to use PPM as it connects to ActiveState's web site to download the packages you select. Refer to the HTML documentation which accompanies ActiveState Perl for instructions on using PPM through a firewall.

One disadvantage of PPM is that you can only install packages that have been packaged in PPM format. You don't have to wait for ActiveState to package modules though, as you can tell PPM to use other package repositories. For example, many of the XML modules are available through Randy Kobes' archive, which can be accessed like this:

C:\> ppm
ppm> repository add RK http://theoryx5.uwinnipeg.ca/cgi-bin/ppmserver?urn:/PPMServer58
ppm> set save
ppm> install XML::LibXML
	

The 'set save' command means that PPM will remember this additional repository should you need to install another module later.

The location specified above assumes you're using a Perl 5.8 build. If you're still running Perl 5.6, use this command instead:

ppm> set repository RK http://theoryx5.uwinnipeg.ca/cgi-bin/ppmserver?urn:/PPMServer
	
3.17. What if a module is not available in PPM format?

Many of the CPAN modules are written entirely in Perl and don't require a compiler, so you can use the CPAN.pm module/shell which comes with Perl. The only thing you need is nmake - a Windows version of make which you can download from:

http://download.microsoft.com/download/vc15/Patch/1.52/W95/EN-US/Nmake15.exe

It's a self extracting archive so run it and move the resulting files into your windows (or winnt) directory.

Then go to a command window (DOS prompt) and run:

perl -MCPAN -e shell
	

The first time you run the CPAN shell, you will be asked a number of questions by the automatic configuration process. Accepting the default is generally pretty safe. You'll be asked where various programs are on your system (eg: gzip, tar, ftp etc). Don't worry if you don't have them since CPAN.pm will use the Compress::Zlib, Archive::Tar and Net::FTP modules if they are installed - and they are part of the ActiveState Perl distribution. Also don't worry if you make a mistake, you can repeat the configuration process at any time by typing this command at the 'cpan>' prompt:

o conf init
	

If you're behind a firewall, when you're asked for an FTP or HTTP proxy enter its URL like this:

http://your.proxy.address:port/
	

You can probably use http:// for both FTP and HTTP (depending on your proxy).

After you've selected a CPAN archive near you, you will finally get a 'cpan>' prompt. Then you can type:

install XML::SAX
	

and sit back while CPAN.pm downloads, unpacks, tests and installs all the relevant code in all the right places.

3.18. "could not find ParserDetails.ini"

A number of people have reported encountering the error "could not find ParserDetails.ini in ..." when installing or attempting to use XML::SAX. ParserDetails.ini is used by XML::SAX::ParserFactory to determine which SAX parser modules are installed. It should be created by the XML::SAX installation script and should be updated automatically by the install script for each SAX parser module.

  • If you are installing XML::SAX manually you must run Makefile.PL. Unpacking the tarball and copying the files into your Perl lib directory will not work.

  • During the initial installation, if you are asked whether ParserDetails.ini should be updated, always say yes. If you say no, the file will not be created.

  • If you are using ActivePerl, the following command should resolve the problem:

    ppm install http://theoryx5.uwinnipeg.ca/ppms/XML-SAX.ppd
    	

Once you have successfully installed XML::SAX, you should consider installing a module such as XML::SAX::Expat or XML::LibXML to replace the slower pure-Perl parser bundled with XML::SAX.

If you are packaging XML::SAX in an alternative distribution format (such as RPM), your post-install script should check if ParserDetails.ini exists and if it doesn't, run this command:

perl -MXML::SAX -e "XML::SAX->add_parser(q(XML::SAX::PurePerl))->save_parsers()"
      

Don't unconditionally run this command, or users who re-install XML::SAX may find that any fast SAX parser they have installed will be replaced as the default by the pure-Perl parser.
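
If you want to check what ParserDetails.ini currently contains without hunting for the file, a short script along these lines should do it. This is only a diagnostic sketch and it assumes XML::SAX itself loads successfully:

```perl
use strict;
use warnings;
use XML::SAX;
use XML::SAX::ParserFactory;

# parsers() returns a reference to the list of parsers read from ParserDetails.ini.
my $parsers = XML::SAX->parsers;
print "Registered: $_->{Name}\n" for @$parsers;

# The factory picks a parser from that list when asked for one.
my $parser = XML::SAX::ParserFactory->parser;
print "Default parser class: ", ref($parser), "\n";
```

If only XML::SAX::PurePerl is listed after you have installed a faster parser, that parser's install script probably did not get to update the file.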

4. XSLT Support

This section attempts to summarise the current state of Perl-related XSLT solutions.

If you're looking for an introduction to XSLT, take a look at Chapter 17 of the XML Bible.

4.1. XML::LibXSLT

Matt Sergeant's XML::LibXSLT is a Perl wrapper for the GNOME project's libxslt library. The XSLT implementation is almost complete and the project is under active development. The library is written in C and uses libxml2 for XML parsing. Matt's testing found that it was about twice as fast as Sablotron.
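
A transformation with XML::LibXSLT typically takes only a few lines. In this sketch both the source document and the stylesheet are invented, and are supplied as strings to keep the example self-contained:

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXSLT;

my $source = XML::LibXML->load_xml(string => '<names><name>World</name></names>');

my $style_doc = XML::LibXML->load_xml(string => <<'XSL');
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="names/name">Hello, <xsl:value-of select="."/>!</xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
XSL

# Compile the stylesheet once; it can then be applied to many documents.
my $xslt       = XML::LibXSLT->new;
my $stylesheet = $xslt->parse_stylesheet($style_doc);
my $result     = $stylesheet->transform($source);

# output_string() serialises the result according to <xsl:output>.
my $text = $stylesheet->output_string($result);
print "$text\n";
```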

4.2. XML::Sablotron

Sablotron is an XML toolkit implementing XSLT 1.0, DOM Level 2 and XPath 1.0. It is written in C++ and developed as an open source project by Ginger Alliance. Since the XSLT engine is written in C++ and uses expat for XML parsing, it's pretty quick. XML::Sablotron is a Perl module which provides full access to the Sablotron API (including a DOM with XPath support).

4.3. XML::XSLT

This module aims to implement XSLT in Perl, so as long as you have XML::Parser working you won't need to compile anything to install it. The implementation is not complete, but work is continuing and you can join the fun at the project's SourceForge page. The XML::XSLT distribution includes a script you can use from the command line like this:

xslt-parser -s toc-links.xsl perl-xml-faq.xml > toc.html
	

Egon Willighagen has written An Introduction to Perl's XML::XSLT module at linuxfocus.org.

4.4. XML::Filter::XSLT

Matt Sergeant has also written XML::Filter::XSLT which allows you to do XSLT transformations in a SAX pipeline. It currently requires XML::LibXSLT but it is intended to work with other XSLT processors in the future. In case you're wondering, yes it does build a DOM of your complete document which it transforms and then serialises back to SAX events. For this reason, it might not be appropriate for multi-gigabyte documents.

4.5. AxKit

If you're doing a lot of XML transformations (particularly for web-based clients), you should take a long hard look at AxKit. AxKit is a Perl-based (actually mod_perl-based) XML Application server for Apache. Here are some of AxKit's key features:

  • Data can come from XML or any SAX data source (such as a database query using XML::Generator::DBI)

  • stylesheets can be selected based on just about anything (file suffix, UserAgent, QueryString, cookies, phase of the moon ...)

  • transformations can be specified using a variety of languages including XSLT (LibXSLT or Sablotron), XPathScript (a Perl-based transformation language) and XSP (a tag-based language)

  • output formats can be anything you want (including HTML, WAP, PDF etc)

  • caching of transformed documents can be handled automatically or using your own custom scheme

5. Encodings

5.1. Why do we need encodings?

Text documents have long been encoded in ASCII - a 7 bit code comprising 128 unique characters of which 32 are reserved for non-printable control functions. The remaining 96 characters are barely sufficient for variants of English, less than ideal for Western European languages and totally inadequate for just about anything else. 'Point solutions' have been applied with different 8 or 16 bit codes used in different regions. One obvious limitation of such solutions is that a single document cannot contain text in two or more languages from different regions. The recognised solution to these problems is Unicode/ISO-10646 - two separate but compatible standards which aim to provide standard character mappings for all the world's languages in a single character set.

All XML parsers are required to implement Unicode, but we can't get away from the fact that most electronic documents in existence today are not in Unicode. Even documents that have been produced recently are unlikely to be Unicode. Therefore, XML parsers are also able to work with non-Unicode documents - as long as each document contains an encoding declaration which the parser can use to map characters to the Unicode character set.

One particularly smart thing the Unicode designers did was to make the first 128 characters of Unicode the same as ASCII, so pure ASCII documents are also Unicode. The 'Latin 1' character set (ISO-8859-1) is a popular 8 bit code which adds a further 96 printable characters to ASCII and is commonly used in Western Europe. The extra 96 characters are also mapped to the identical character numbers in Unicode (ie: ASCII is a subset of ISO-8859-1 which is a subset of Unicode). Note however that although Unicode provides a number of ways to encode characters above 0x7F, none are quite the same as ISO-8859-1.
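
As a rough illustration of that last point, the following sketch uses the core Encode module (Perl 5.8 or later) to show that e-acute has the same character number in Latin 1 and in Unicode, yet is stored as different bytes in ISO-8859-1 and UTF-8. The variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+00E9 (e-acute) has the same character number in Latin 1 and Unicode
my $char = chr(0xE9);

my $latin1_bytes = encode('iso-8859-1', $char);   # one byte
my $utf8_bytes   = encode('UTF-8',      $char);   # two bytes

# Print each byte sequence as hex
printf "Latin 1: %s\n", join ' ', map { sprintf '%02X', ord($_) } split //, $latin1_bytes;
printf "UTF-8:   %s\n", join ' ', map { sprintf '%02X', ord($_) } split //, $utf8_bytes;
```

The Latin 1 form is the single byte E9, while the UTF-8 form is the pair C3 A9, so a program reading one as if it were the other will see the wrong characters.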

Further reference material on encodings can be found at The ISO 8859 Alphabet Soup.

5.2. What is UTF-8?

Since Unicode supports character positions higher than 255, a representation of those characters will obviously require more than one 8-bit byte. There is more than one system for representing Unicode characters as byte sequences. UTF-8 is one such system. It uses a variable number of bytes (from 1 to 4 according to RFC3629) to represent each character. This means that the most common characters (ie: 7 bit ASCII) only require one byte.

In UTF-8 encoded data, the most significant bit of each byte will be 0 for single byte characters and 1 for each byte of a multibyte character. This is obviously not compatible with 8-bit codes such as Latin1 in which all characters are 8 bits and all characters beyond 127 have the high bit set. Parsers assume their data is UTF-8 unless another encoding is declared, so if you feed Latin1 data into an XML parser without declaring an encoding, the parser will most likely choke on the first character greater than 0x7F. If you are interested in the gory details, read on...

The number of leading 1 bits in the first byte of a multi-byte sequence is equal to the total number of bytes. Each of the follow-on bytes will have the first bit set to 1 and the second to zero. All remaining bits (shown as 'x' below) are used to represent the character number.

1 byte character 0xxxxxxx
2 byte character 110xxxxx 10xxxxxx
3 byte character 1110xxxx 10xxxxxx 10xxxxxx
4 byte character 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
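
To make the bit layout concrete, here is a minimal sketch that builds the UTF-8 bytes for a character number by hand and compares the result with Perl's own encoder. The utf8_bytes() routine is purely illustrative (in real code you would simply call Encode::encode), and it makes no attempt to reject surrogates or out-of-range values.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Build the UTF-8 byte sequence for a character number by hand,
# following the 1, 2, 3 and 4 byte bit patterns.
sub utf8_bytes {
    my ($cp) = @_;
    return chr($cp) if $cp < 0x80;                        # 0xxxxxxx
    return pack('C2', 0xC0 | ($cp >> 6),
                      0x80 | ($cp & 0x3F)) if $cp < 0x800;
    return pack('C3', 0xE0 | ($cp >> 12),
                      0x80 | (($cp >> 6) & 0x3F),
                      0x80 | ($cp & 0x3F)) if $cp < 0x10000;
    return pack('C4', 0xF0 | ($cp >> 18),
                      0x80 | (($cp >> 12) & 0x3F),
                      0x80 | (($cp >> 6) & 0x3F),
                      0x80 | ($cp & 0x3F));
}

# Compare against Perl's encoder for one character of each length
for my $cp (0x41, 0xE9, 0x20AC, 0x1F600) {
    my $by_hand = utf8_bytes($cp);
    my $by_perl = encode('UTF-8', chr($cp));
    printf "U+%04X -> %s (%s)\n", $cp,
        join(' ', map { sprintf '%08b', ord($_) } split //, $by_hand),
        $by_hand eq $by_perl ? 'matches Encode' : 'MISMATCH';
}
```

Printing the bytes in binary makes the leading '110', '1110' and '11110' markers and the '10' continuation markers easy to spot.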
	

UTF-16 encoding is an alternative byte representation of Unicode which for most cases amounts to a fixed-width 16 bit code. ASCII and Latin1 characters (the first 256 characters) are represented as normal but with a preceding 0x00 byte. Although UTF-16 is conceptually simpler than UTF-8 (and is the native encoding used by Java), two major drawbacks mean it is not the preferred format for C or Perl:

  • ASCII text doubles in size when encoded in UTF-16.

  • The 0x00 bytes are not compatible with C's null terminated strings.
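
Both drawbacks are easy to see with the core Encode module (Perl 5.8 or later). This sketch assumes big-endian UTF-16 without a byte order mark, which is what the 'UTF-16BE' encoding name produces:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = 'Hello';

my $utf8  = encode('UTF-8',    $text);
my $utf16 = encode('UTF-16BE', $text);   # big-endian, no byte order mark

printf "UTF-8:  %d bytes\n", length $utf8;
printf "UTF-16: %d bytes\n", length $utf16;

# Count the 0x00 bytes that would confuse C's null terminated strings
my $nul_count = () = $utf16 =~ /\0/g;
print "NUL bytes in UTF-16 form: $nul_count\n";
```

The five ASCII characters occupy five bytes in UTF-8 and ten in UTF-16, with every second byte being 0x00.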

For more information, visit the UTF-8 and Unicode FAQ for Unix/Linux.

5.3. What can I do with a UTF-8 string?

You could obviously convert a UTF-8 encoded string to some other encoding, but before we get on to that, let's look at what you can do with it in its 'natural state'.

If you wish to display the string in a web browser, no conversion is necessary. Modern browsers can understand UTF-8 directly, as can be seen on this page on the kermit project web site (some characters in the page will not display correctly without the correct fonts installed but that's a font issue rather than an encoding issue).

To use UTF-8 encoded HTML, simply append this 'charset' modifier to your Content-type header:

Content-type: text/html; charset=utf-8
      

Or if you can't control the headers, include this meta tag in the document:

<meta http-equiv="Content-type" content="text/html; charset=utf-8">
      

For a 'low tech' alternative, you might find that UTF-8 text is readable on your display if you simply print it to STDOUT. XFree86 version 4.0 supports Unicode fonts and Xterm supports UTF-8 multibyte characters (assuming your locale is set correctly).

A growing number of editors support UTF-8 files and you can even write your Perl scripts in UTF-8 (since 5.6) using your native language for identifier names.

For more information, you may wish to visit the Perl, Unicode and i18N FAQ.

5.4. What can Perl do with a UTF-8 string?

Perl versions prior to 5.6 had no knowledge of UTF-8 encoded characters. You can still work with UTF-8 data in these older Perl versions but you'll probably need the help of a module like Unicode::String to deal with the non-ASCII characters.

The built-in functions in Perl 5.6 and later are UTF-8 aware so for example length will return the number of characters rather than the number of bytes in a string, and ord can return values greater than 255. The regular expression engine will also correctly match against multi-byte characters and character classes have been extended to include Unicode properties and block ranges.

None of this added functionality comes at the expense of support for binary data. Perl's internal SV data structure (used to represent scalar values) includes a flag to indicate whether the string value is UTF-8 encoded. If this flag is not set, byte semantics will be used by all functions that operate on the string, eg: length will return the number of bytes regardless of the bit patterns in the data.

You can include Unicode characters in your string literals using the hex character number in an extended \x sequence. For example, this declaration includes the Euro symbol:

my $price_label = "\x{20AC}9.99";
	

length reports that this string contains 5 characters and under 'use bytes' it will report a length of 7 bytes.
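
If you want to check those numbers yourself, a short sketch along these lines should do (Perl 5.8 or later). As well as 'use bytes', it shows the byte count obtained by explicitly encoding the string with the Encode module, which is often the less surprising way to ask the question:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $price_label = "\x{20AC}9.99";

my $chars      = length $price_label;                    # character count
my $raw_bytes  = do { use bytes; length $price_label };  # internal byte count
my $utf8_bytes = length encode('UTF-8', $price_label);   # size once encoded

print "characters: $chars, under 'use bytes': $raw_bytes, encoded: $utf8_bytes\n";
```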

This regular expression will match a string which starts with the Euro symbol:

my $euro = "\x{20AC}";
/^$euro/ && print;
	

And here's a regular expression that will match any upper case character - not just A-Z, but any character with the Unicode upper case property:

/\p{IsUpper}/
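
As a small test of the upper case property, this sketch tries the pattern against ASCII letters and against the Greek capital and small sigma (chosen arbitrarily as non-Latin examples):

```perl
use strict;
use warnings;

my $sigma_upper = "\x{3A3}";   # GREEK CAPITAL LETTER SIGMA
my $sigma_lower = "\x{3C3}";   # GREEK SMALL LETTER SIGMA

for my $char ('A', 'a', $sigma_upper, $sigma_lower) {
    printf "U+%04X %s\n", ord($char),
        $char =~ /\p{IsUpper}/ ? 'is upper case' : 'is not upper case';
}
```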
	

You can read more in perldoc perlunicode and perldoc utf8.

5.5. What can Perl 5.8 do with a UTF-8 string?

The Unicode support in Perl 5.6 had a number of omissions and bugs. Many of the shortcomings were fixed in Perl 5.8 and 5.8.1. One major leap forward in 5.8 was the move to Perl IO and 'layers' which allows translations to take place transparently as file handles are read from or written to. A built-in layer called ':encoding' will automatically translate data to UTF-8 as it is read, or to some other encoding as it is written. For example, given a UTF-8 string, this code will write it out to a file as ISO-8859-1:

open my $fh, '>:encoding(iso-8859-1)', $path or die "open($path): $!";
print $fh $utf_string;
      

The new core module 'Encode' can be used to translate between encodings (but since that usually only makes sense during IO, you might as well just use layers) and also provides the 'is_utf8' function for accessing the UTF-8 flag on a string.
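
For the occasions when you do need to convert in memory rather than during IO, a minimal sketch of the Encode functions might look like this. The byte string is invented for the example and stands in for data read from a file or socket without any layer applied:

```perl
use strict;
use warnings;
use Encode qw(decode encode is_utf8);

# Three bytes as they might arrive from outside: the Euro sign in UTF-8
my $bytes = "\xE2\x82\xAC";

my $string = decode('UTF-8', $bytes);    # bytes -> one Perl character
printf "%d byte(s) decoded to %d character(s), U+%04X\n",
    length $bytes, length $string, ord $string;

print "UTF-8 flag is ", (is_utf8($string) ? "set" : "not set"), "\n";

# Going the other way: characters -> bytes in a chosen encoding
my $latin1 = encode('iso-8859-1', "S\x{E3}o Paulo");
printf "Latin 1 form is %d bytes long\n", length $latin1;
```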

5.6. How can I convert from UTF-8 to another encoding?

This being Perl, there's more than one way to do it ...



XML numeric character entities

If you are outputting XML, but for some reason do not wish to use UTF-8 (perhaps your editor does not support it), you can convert all characters beyond position 127 to numeric entities with a regular expression like this:

use utf8;   # Only needed for 5.6, not 5.8 or later

s/([\x{80}-\x{FFFF}])/'&#' . ord($1) . ';'/gse;
	

Andreas Koenig has supplied an alternative regular expression:

s/([^\x20-\x7F])/'&#' . ord($1) . ';'/gse;
	

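This version has three advantages: it does not require 'use utf8' with Perl 5.6, it does not require a version of Perl which recognises \x{NN}, and it handles characters outside the 0x80-0xFFFF range. Note that it also converts characters below 0x20, including tabs and newlines.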

Even if you are outputting Latin1, you will need to use a technique like this for all characters beyond position 255 (eg: the Euro symbol) since there is no other way to represent them in Latin1.

This technique can be used for the character content of elements and attribute values. It cannot be used for the element or attribute names since the result would not be well-formed XML.

Note

Remember, in XML the number in a numeric character entity represents the Unicode character position regardless of the document encoding.
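
Wrapped up as a subroutine, the technique might look like the sketch below. The name to_numeric_entities() is made up for this example, and the character class is a variation on the two expressions above that leaves all ASCII characters (including tabs and newlines) untouched:

```perl
use strict;
use warnings;

# Replace every character above position 127 with a numeric character
# entity. to_numeric_entities() is an illustrative name, not a module API.
sub to_numeric_entities {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;
    return $text;
}

my $ascii_only = to_numeric_entities("Price: \x{20AC}9.99 in S\x{E3}o Paulo");
print "$ascii_only\n";
```

The output contains only ASCII characters, so it can be written out under any ASCII-compatible encoding declaration.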



Perl 5.8 IO layers

You can specify an encoding translation layer as you open a file like this:

open my $fh, '>:encoding(iso-8859-1)', $path or die "open($path): $!";
print $fh $utf_string;
      

You can also push an encoding layer onto an already open filehandle like this:

binmode(STDOUT, ':encoding(windows-1250)');
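
Layers work for reading as well as writing. The following sketch avoids touching the disk by using an in-memory filehandle (a reference to a scalar in place of a filename), which is an assumption made purely to keep the example self-contained:

```perl
use strict;
use warnings;

my $string = "S\x{E3}o Paulo";   # 9 characters

# Write through an encoding layer into an in-memory "file"
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer or die "open for write: $!";
print $out $string;
close $out;
printf "wrote %d bytes\n", length $buffer;

# Read the bytes back through the same layer to recover the characters
open my $in, '<:encoding(UTF-8)', \$buffer or die "open for read: $!";
my $round_trip = <$in>;
close $in;
printf "read %d characters\n", length $round_trip;
```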
      



Perl 5.6 tr/// operator (deprecated)

Perl 5.6 offers a way of converting between UTF-8 and Latin1 8 bit byte strings using the 'tr' operator. This will no longer work in Perl 5.8 and later. To quote from the 5.8 release notes:

 

"The tr///C and tr///U features have been removed and will not return; the interface was a mistake. Sorry about that."
  -- perldelta

Just to make quite sure that you know exactly which code to avoid using, here is an example of translating from UTF-8 ('U') to 8 bit Latin1 ('C'):

$string =~ tr/\0-\x{FF}//UC;      # Don't do this
	



Perl 5.6 and later: pack()/unpack()

All versions of Perl from 5.6 on support 'U' and 'C' in pack/unpack templates. You can use this to split a UTF-8 string into characters and reassemble them into a Latin1-style byte string. For example:

use utf8;  # Not required with 5.8 or later

my $u_city = "S\x{E3}o Paulo";
my $l_city = pack("C*", unpack('U*', $u_city));
	

The first assignment creates a string 9 characters long (10 bytes long when encoded as UTF-8). The second assignment creates a Latin-1 encoded version of the string.



Unicode::String

The Unicode::String module pre-dates Perl 5.6 and works with older and newer Perl versions. You turn your string into a Unicode::String object (which is represented internally in UTF-16) and then call methods on the object to convert it to alternative encodings. For example:

use Unicode::String;

$ustr = Unicode::String::utf8($string);
$latin1 = $ustr->latin1();
	



Text::Iconv

The Text::Iconv module is a wrapper around the iconv library common on Unix systems (and available for Windows too). You can use this module to create a converter object which maps from one encoding to another and then pass the object a string to convert:

use Text::Iconv;

$converter = Text::Iconv->new('UTF-8', 'ISO8859-1');
print $converter->convert($string);
	

The biggest hurdle with using Text::Iconv is knowing which conversions the iconv library on your system can handle (the module offers no way to list supported encodings). Another problem is that even if your system supports the encoding you require, it may give it a non-standard name. For example, the code above worked on both Linux and Solaris 8.0 but Solaris 2.6 required '8859-1' and Win32 required 'iso-8859-1'.



XML::SAX::Writer

If you're using SAX to generate or transform XML, you can tell XML::SAX::Writer which output encoding to use like this:

my $writer  = XML::SAX::Writer->new(EncodeTo => 'ISO8859-1');
	

Internally, XML::SAX::Writer uses Text::Iconv to do the conversion so the same caveats about portability of encoding names apply here too.

5.7. What does 'use utf8;' do?

In Perl 5.8 and later, the sole use of the 'use utf8;' pragma is to tell Perl that your script is written in UTF-8 (ie: any non-ASCII or multibyte characters should be interpreted as UTF-8). So if your code is plain ASCII, you don't need the pragma.

The original UTF8 support in Perl 5.6 required the pragma to enable wide character support for builtin functions (such as length) and the regular expression engine. This is no longer necessary in 5.8 since Perl automatically uses character rather than byte semantics with strings that have the utf8 flag set.

You can find out more about how Unicode handling changed in Perl 5.8 from the perl58delta.pod file that ships with Perl.
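
The effect of the pragma on string literals can be seen with a sketch like this one. It assumes the script file itself is saved as UTF-8, since the point is how Perl reads the Euro symbol typed directly into the source:

```perl
use strict;
use warnings;

# This file must be saved as UTF-8 for the demonstration to work.
my $with_pragma    = do { use utf8; length "€" };   # literal read as characters
my $without_pragma = do { no utf8;  length "€" };   # literal read as bytes

print "with 'use utf8': $with_pragma, without: $without_pragma\n";
```

With the pragma in effect the literal is one character long. Without it, Perl sees the three separate bytes that make up the UTF-8 encoding of the Euro symbol.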

5.8. What are some commonly encountered problems with encodings?



Surprise - it's UTF-8!

By far the most common problem people have is that they didn't expect the parsing process to translate their data into UTF-8. Whether this is an actual problem or merely a perceived problem is the subject of some debate. Sure, you may need to change the encoding when you output your data, but doing all your processing in Unicode will lead to less pain in the long term. The preceding section gives you plenty of ways to deal with UTF-8.



Windows code pages

Many Windows users assume that since they use Latin1 characters they should specify an encoding of 'iso-8859-1'. It's more likely however that their characters are encoded using Microsoft's code page 1252. This is an extension to ISO-Latin1 which replaces some of the control characters with printable characters. For strict Latin1 text it shouldn't matter, but if your text contains 'smart quotes', daggers, bullet characters, the Trade Mark or the Euro symbols it's not iso-8859-1. XML::Parser versions 2.32 and later include a CP1252 mapping which can be used with documents bearing this declaration:

<?xml version='1.0' encoding='WINDOWS-1252' ?>
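
Outside of a parser, the core Encode module (Perl 5.8 or later) also knows this code page under the name 'cp1252'. The sketch below uses an invented byte string to show where the 'smart' characters end up in Unicode:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Bytes as typed on Windows: "smart" double quotes around a word, then a Euro sign
my $bytes = "\x93Hi\x94 \x80";

my $text = decode('cp1252', $bytes);

# Show the Unicode character numbers that the bytes map to
print join(' ', map { sprintf 'U+%04X', ord($_) } split //, $text), "\n";
```

The quotes map to U+201C and U+201D and the Euro to U+20AC, none of which exist in ISO-8859-1.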
	



Web Forms

If your Perl script accepts text from a web form, you are at the mercy of the client browser as to what encoding is used - if you save the data to an XML file, random high characters in the data may cause the file to not be 'well-formed'. A common convention is for browsers to look at the encoding on the page which contains the form and to translate data into that encoding before posting. You declare an encoding by using a 'charset' parameter on the Content-type declaration, either in the header:

print CGI->header('text/html; charset=utf-8');
	

or in a meta tag in the document itself:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
	

If you find you've received characters in the range 0x80-0x9F, they are unlikely to be ISO Latin1. This commonly results from users preparing text in Microsoft Word and copying/pasting it into a web form. If they have the 'smart quotes' option enabled, the text may contain WinLatin1 characters. The following routine can be used to 'sanitise' the data by replacing 'smart' characters with their common ASCII equivalents and discarding other troublesome characters.

sub sanitise  {
  my $string = shift;

  $string =~ tr/\x91\x92\x93\x94\x96\x97/''""\-\-/;
  $string =~ s/\x85/.../sg;
  $string =~ tr/\x80-\x9F//d;

  return($string);
}
	

Note: It might be safer to simply reject any input with characters in the above range since it implies the browser ignored your charset declaration and guessing the encoding is risky at best.
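
A check for the 'reject it' approach could be as simple as the following sketch. The routine name is hypothetical, and the test is meant to be applied to the raw bytes received from the browser before any decoding takes place:

```perl
use strict;
use warnings;

# Hypothetical helper: true if the text contains any character in the
# 0x80-0x9F range, which ISO Latin1 reserves for control codes.
sub has_winlatin1_chars {
    my ($text) = @_;
    return $text =~ /[\x80-\x9F]/ ? 1 : 0;
}

for my $input ("plain text", "\x93quoted\x94") {
    print has_winlatin1_chars($input) ? "reject\n" : "accept\n";
}
```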

6. Validation

The XML Recommendation says that an XML document is 'valid' if it has an associated document type declaration and if the document complies with the constraints expressed in it.

At the time the recommendation was written, the SGML Document Type Definition (DTD) was the established method for defining a class of documents; and validation was the process of confirming that a document conformed to its declared DTD.

These days, there are a number of alternatives to the DTD and the term validation has assumed a broader meaning than simply DTD conformance. The most visible alternative to the DTD is the W3C's own XML Schema. RELAX NG is a popular alternative developed by OASIS.

If you design your own class of XML document, you are perfectly free to select whichever system for defining and validating document conformance suits you best. You may even choose to develop your own system. The following paragraphs describe Perl tools to consider.

6.1. DTD Validation Using XML::Checker

Enno Derksen's XML::Checker module implements DTD validation in Perl on top of XML::Parser. The recommended way to use XML::Checker is via one of the two convenience modules included in the distribution:

XML::DOM::ValParser can be used anywhere you would use XML::DOM. It works the same way except that it performs DTD validation.

XML::Checker::Parser can be used anywhere you would use XML::Parser. It works the same way except that it performs DTD validation.

Here's a short example to get you going. Here's a DTD saved in a file called /opt/xml/xcard.dtd:

<!ELEMENT xcard (firstname,lastname,email?)>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT lastname (#PCDATA)>
<!ELEMENT email (#PCDATA)>
	

Here's an XML document that refers to the DTD:

<?xml version="1.0" ?>
<!DOCTYPE xcard SYSTEM "file:/opt/xml/xcard.dtd" >
<xcard>
<firstname>Joe</firstname>
<lastname>Bloggs</lastname>
<email>joe@bloggs.com</email>
</xcard> 
	

And here's a code snippet to validate the document:

use XML::Checker::Parser;

my $xp = XML::Checker::Parser->new( Handlers => { } );

eval {
  $xp->parsefile($xml_file);
};
if ($@) {
  # ... your error handling code here ...
  print "$xml_file failed validation!\n";
  die "$@";
}
print "$xml_file passed validation\n";
	

You can play around with adding and removing elements from the document to get an idea of what happens when validation errors occur. You'll also want to refer to the documentation for the 'SkipExternalDTD' option for more robust handling of external DTDs.

6.2. DTD Validation Using XML::LibXML

The libxml2 library supports DTD validation although this is turned off by default in XML::LibXML. Once you have created an XML::LibXML parser object, you can enable validation like this:

$parser->validation(1);
	

Validation using XML::LibXML is much faster than with XML::Checker but if you want to know why a document fails validation you'll find that XML::LibXML's diagnostic messages are not as helpful.

The libxml2 distribution (which underlies XML::LibXML) includes a command line validation tool, written in C, called xmllint. You can use it like this:

xmllint --valid --noout filename.xml
	
6.3. W3C Schema Validation With XML::LibXML

XML::LibXML provides undocumented support for validating against a W3C schema. Here's an example of how you might use it (contributed by Dan Horne):

use XML::LibXML;

my $schema_file = 'po.xsd';
my $document    = 'po.xml';

my $schema = XML::LibXML::Schema->new(location => $schema_file);

my $parser = XML::LibXML->new;
my $doc    = $parser->parse_file($document);

eval { $schema->validate($doc) };
die $@ if $@;

print "$document validated successfully\n";
	

The referenced XSD schema file and sample XML document can be downloaded from the W3C Schema Primer.

The xmllint command line validator included in the libxml2 distribution can also do W3C schema validation. You can use it like this:

xmllint --noout --schema po.xsd po.xml
	
6.4. W3C Schema Validation With XML::Xerces

XML::Xerces provides a wrapper around the Apache project's Xerces parser library. Installing Xerces can be challenging and the documentation for the Perl API is not great, but it's the most complete offering for Schema validation from Perl.

6.5. W3C Schema Validation With XML::Validator::Schema

Sam Tregar's XML::Validator::Schema allows you to validate XML documents against a W3C XML Schema. It does not implement the full W3C XML Schema recommendation, but a useful subset. Here's an example:

use XML::SAX::ParserFactory;
use XML::Validator::Schema;

my $schema_file = 'po.xsd';
my $document    = 'po.xml';

my $validator = XML::Validator::Schema->new(file => $schema_file);

my $parser = XML::SAX::ParserFactory->parser(Handler => $validator);

eval { $parser->parse_uri($document); };
die $@ if $@;

print "$document validated successfully\n";
	
6.6. Simple XML Validation with Perl

Kip Hampton has written an article describing how a combination of Perl and XPath can provide a quick, lightweight solution for validating documents.

6.7. XML::Schematron

Kip has also written the XML::Schematron module which can be used with either XML::XPath or XML::Sablotron to implement validation based on Rick Jelliffe's Schematron.

7. Common Coding Problems

7.1. How should I handle errors?

Most of the Perl parsing tools will simply call die if they encounter an error (eg: an XML file which is not well-formed). You can trap these failures using eval. Once the eval has completed, you can check the special variable '$@'. This will contain the text of the error message if a failure occurred or will be an empty string otherwise. For example:

use XML::Simple;

my $ref = eval {
  XMLin('<bad>not well formed');
};

if($@) {
  print "An error occurred: $@";
}
else {
  print "It worked!";
}
	

Don't forget the semi-colon after the code block passed to eval.

The '$@' variable contains the scalar value which was passed to die. In some cases, this value will be a reference to an 'exception' object. For simple error handling, you can still just print the value out in a double quoted string, but for more complex handling you might want to check the type of the exception or in some cases call methods on it to get the error code etc. XML::SAX::Exception is one such implementation.
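
The sketch below shows one way of telling an exception object from a plain string. My::ParseError is a stand-in class invented for the example (a real parser such as one built on XML::SAX::Exception supplies its own class and methods), and describe_error() is likewise just an illustrative name:

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# A stand-in exception class for this sketch; real parsers supply their own
{
    package My::ParseError;
    sub new     { my ($class, %args) = @_; return bless {%args}, $class }
    sub message { $_[0]{message} }
    sub line    { $_[0]{line} }
}

sub describe_error {
    my ($code) = @_;
    my $ok = eval { $code->(); 1 };
    return 'no error' if $ok;

    my $err = $@;
    if (blessed($err) && $err->isa('My::ParseError')) {
        return sprintf 'parse error at line %d: %s', $err->line, $err->message;
    }
    chomp(my $text = "$err");
    return "plain error: $text";
}

print describe_error(sub { die My::ParseError->new(message => 'mismatched tag', line => 3) }), "\n";
print describe_error(sub { die "something else went wrong\n" }), "\n";
print describe_error(sub { 1 }), "\n";
```

Ending the eval block with a true value and testing the result of eval, rather than testing '$@' alone, is a common defensive habit: it still works if something clears '$@' between the failure and your check.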

7.2. Why is my character data split into multiple events?

If you parsed this XML file ...

<menu>Bubble &amp; Squeak</menu>
      

... with this code ...

use XML::Parser;

my $xp = XML::Parser->new(Handlers => { Char => \&char_handler });

$xp->parsefile('menu.xml');

sub char_handler {
  my($xp, $data) = @_;
  print "Character data: '$data'\n";
}
      

... you might expect this output ...

Character data: 'Bubble & Squeak'
      

... in fact you'd probably get this ...

Character data: 'Bubble '
Character data: '&'
Character data: ' Squeak'
      

The reason is that parsers are not required to give you all of an element's character data in one chunk. The number of characters you get in each chunk may depend on the parser's internal buffer sizes, newline characters in the data, or (as in our example) embedded entities. It doesn't really matter what causes the data to be split - you just have to be prepared to handle it.

The usual approach is to accumulate data in the character event and defer processing it until the end element event. Here's a sample implementation using XML::Parser:

use XML::Parser;

my $xp = XML::Parser->new(Handlers => { Start => \&start_handler,
                                        Char  => \&char_handler,
                                        End   => \&end_handler  });
$xp->parsefile('menu.xml');

sub start_handler {
  my($xp) = @_;
  $xp->{cdata_buffer} = '';
}

sub char_handler {
  my($xp, $data) = @_;
  $xp->{cdata_buffer} .= $data;
}

sub end_handler {
  my($xp) = @_;
  print "Character data: '$xp->{cdata_buffer}'\n";
}
      

Of course you probably won't be coding to the XML::Parser API. It's more likely you'll be using SAX. In which case, the answer is much simpler. Just include Robin Berjon's XML::Filter::BufferText in your pipeline and stop worrying.

7.3. How can I split a huge XML file into smaller chunks?

When your document is too large to slurp into memory, the DOM, XPath and XSLT tools can't really help you. You could write your own SAX filter fairly easily, but Michel Rodriguez has written a general solution so you don't have to. You'll find it (the xml_split tool) bundled with XML::Twig from version 3.16.

8. Common XML Problems

The error messages and questions listed in this section are not really Perl-specific problems, but they are commonly encountered by people new to XML:

8.1. 'xml processing instruction not at start of external entity'

If you include an XML declaration, it must be the very first thing in the document - it cannot even be preceded by whitespace or blank lines. For example, this would be 'well formed' XML as long as the '<' and the '?' are the first and second characters in the file:

<?xml version='1.0' standalone='yes'?>
<doc>
  <title>Test Document</title>
</doc>
      
8.2. 'junk after document element'

A well formed XML document can contain only one root element. So, for example, this would be well formed:

<para>Paragraph 1</para>
      

while this would not:

 
<para>Paragraph 1</para>
<para>Paragraph 2</para>
      
8.3. 'not well-formed (invalid token)'

There are a number of causes of this error, here are some common ones:



Unquoted attributes

All attribute values must always be quoted in XML. For example, this would be well formed:

<item name="widget"></item>
      

while this would not:

<item name=widget></item>
      



Bad encoding declaration

An incorrect or missing encoding declaration can cause this. By default the encoding is assumed to be UTF-8 so if your data is (say) ISO-8859-1 encoded then you must include an encoding declaration. For example:

<?xml version='1.0' encoding='iso-8859-1'?>
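
If you are not sure whether a file really is UTF-8, you can test it before handing it to a parser. This sketch uses the core Encode module (Perl 5.8 or later) with a made-up byte string. The decodes_as() helper is hypothetical, and a positive answer for a single-byte encoding such as ISO-8859-1 proves very little since almost any byte sequence is legal there:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $latin1_bytes = "S\xE3o Paulo";   # ISO-8859-1 encoded, 0xE3 is a-tilde

# Hypothetical helper: does this byte string decode cleanly in the given encoding?
sub decodes_as {
    my ($encoding, $bytes) = @_;
    # FB_CROAK makes decode() die on malformed input; it may also modify
    # its argument, which is why we work on the copy in $bytes.
    return eval { decode($encoding, $bytes, Encode::FB_CROAK); 1 } ? 1 : 0;
}

print "valid as UTF-8:      ", decodes_as('UTF-8',      $latin1_bytes), "\n";
print "valid as iso-8859-1: ", decodes_as('iso-8859-1', $latin1_bytes), "\n";
```

The UTF-8 test fails because 0xE3 announces a three byte character but is followed by a plain ASCII 'o', which is much the same reason the parser reports an invalid token.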
      
8.4. 'undefined entity'

XML only pre-defines the following named character entities:

&lt;    <
&gt;    >
&amp;   &
&quot;  "
&apos;  '
	

If your XML includes HTML-style named character entities (eg: &nbsp; or &uuml;) you have two choices:

You could replace the named entities with numeric entities. For example the non-breaking space character is at position 160 (hex A0) so you could represent it with: &#160; (or &#x00A0;). Similarly, you could represent a lower case U-umlaut as &#252; (or &#x00FC;).

Alternatively, you could define your own named character entities in your DTD or in an 'internal subset' of a DTD. For example:

<!DOCTYPE doc [
    <!ENTITY eacute "&#233;" >
    <!ENTITY euro   "&#8364;" >
]>
<doc>Combien avez-vous pay&eacute;? 125 &euro;</doc>
	

You can find the definitions for HTML Latin 1 character entities on the W3C Site.

You can include all of these character entities in your DTD, so that you won't have to worry about them any more:

<!DOCTYPE doc [
<!ENTITY % HTMLlat1 PUBLIC
        "-//W3C//ENTITIES Latin 1 for XHTML//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent">
%HTMLlat1;

<!ENTITY % HTMLspecial PUBLIC
        "-//W3C//ENTITIES Special for XHTML//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent">
%HTMLspecial;

<!ENTITY % HTMLsymbol PUBLIC
        "-//W3C//ENTITIES Symbols for XHTML//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent">
%HTMLsymbol;
]>
        
8.5. 'reference to invalid character number'

The XML spec defines legal characters as tab (0x09), carriage return (0x0D), line feed (0x0A) and the legal graphic characters of Unicode. This specifically excludes control characters, so this would not be well-formed:

<char>&#3;</char>
      

There really is no easy or standard way to include control characters in XML - binary data must be encoded (for example using MIME::Base64).
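
When generating XML, it can help to check text for illegal characters before writing it out. The following sketch expresses the 'Char' production from the XML 1.0 Recommendation as a Perl character class; the routine name is invented for the example:

```perl
use strict;
use warnings;

# The Char production from the XML 1.0 Recommendation, as a character class
my $legal_xml_chars =
    qr/\A[\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]*\z/;

# Hypothetical helper name, used only in this sketch
sub is_legal_xml_text {
    my ($text) = @_;
    return $text =~ $legal_xml_chars ? 1 : 0;
}

print is_legal_xml_text("tab\tand newline\n are fine"), "\n";
print is_legal_xml_text("form feed \x0C is not"),       "\n";
```

Note that the check applies to decoded character strings, not to raw bytes in some encoding.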

8.6. Embedding Arbitrary Text in XML

Any text you include in your XML documents must not contain the angle brackets or ampersand characters in an unescaped form. Manually escaping these characters can be tedious when you want to include a block of program code or HTML. You can use a CDATA section to indicate to the parser that the text within it should not be parsed for markup. For example, this XML document ...

<code><![CDATA[
  if($qty < 1) {
    print "<p>Invalid quantity!</p>";
  }
]]></code>
	

is equivalent to this document ...

<code>
  if($qty &lt; 1) {
    print &quot;&lt;p&gt;Invalid quantity!&lt;/p&gt;&quot;;
  }
</code>
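
If you are generating XML from a program and would rather escape the special characters than use a CDATA section, a minimal sketch might look like this. The escape_xml_text() routine is illustrative only; most XML writer modules will do this job for you:

```perl
use strict;
use warnings;

my %escape_for = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');

# Illustrative helper, not taken from any module
sub escape_xml_text {
    my ($text) = @_;
    $text =~ s/([&<>"])/$escape_for{$1}/g;
    return $text;
}

print escape_xml_text('print "<p>Invalid quantity!</p>";'), "\n";
```

All four characters are handled in a single pass so that the ampersands introduced by one replacement are not escaped again by another.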
	

When you parse a document, your code has no way of knowing if a particular piece of text came from a CDATA section - and you probably shouldn't care.

Note

CDATA is for character data - not binary data. If you need to include binary data in your document, you should encode it (perhaps using MIME::Base64) when you write the document and decode it during parsing.
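
Here is a rough sketch of that round trip using the core MIME::Base64 module. The element name 'data' is invented, and the regular expression is only there to keep the example self-contained; a real program would get the text content from its parser:

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64 decode_base64);

my $binary = "\x03\x00\xFF raw bytes";

# An empty second argument suppresses the line breaks encode_base64 adds by default
my $encoded = encode_base64($binary, '');
my $element = "<data>$encoded</data>";
print "$element\n";

# On the way back in, take the element's text content and decode it
my ($content) = $element =~ m{<data>(.*)</data>};
my $decoded   = decode_base64($content);
print "round trip ", ($decoded eq $binary ? "succeeded" : "failed"), "\n";
```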

8.7. Using XPath with Namespaces

People often experience difficulty getting their XPath expressions to match when they first use XML::LibXML to process an XML document containing namespaces. For example, consider this XHTML document with an embedded SVG section:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title>Sample Document</title>
</head>
<body>

<h1>An HTML Heading</h1>

<s:svg xmlns:s="http://www.w3.org/2000/svg" width="300" height="200">
  <s:rect style="fill: #eeeeee; stroke: #000000; stroke-width: 1;"
    width="80" height="30" x="60" y="50" />
  <s:text style="font-size: 12px; fill: #000066; font-family: sans-serif;"
    x="70" y="70">Label One</s:text>
</s:svg>

</body>
</html>
      

The elements in the SVG section each use the namespace prefix 's' which is bound to the URI 'http://www.w3.org/2000/svg'. The prefix 's' is completely arbitrary and is merely a mechanism for associating the URI with the elements. As a programmer, you will perform matches against namespace URIs not prefixes.

The elements in the XHTML wrapper do not have namespace prefixes, but are bound to the URI 'http://www.w3.org/1999/xhtml' by way of the default namespace declaration on the opening <html> tag.

You might expect that you could match all the 'h1' elements using this XPath expression ...

//h1
      

... however, that won't work since the namespace URI is effectively part of the name of the element you're trying to match.

One approach would be to fashion an XPath query which ignored the namespace portion of element names and matched only on the 'local name' portion. For example:

//*[local-name() = 'h1']
      

A better approach is to match the namespace portion as well. To achieve that, the first step is to use XML::LibXML::XPathContext to declare a namespace prefix. Then, the prefix can be used in the XPath expression:

my $parser = XML::LibXML->new();
my $doc    = $parser->parse_file('sample.xhtml');

my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs(xhtml => 'http://www.w3.org/1999/xhtml');

foreach my $node ($xpc->findnodes('//xhtml:h1')) {
  print $node->to_literal, "\n";
}
      

The same technique can be used to match 'text' elements in the SVG section:

$xpc->registerNs(svg => 'http://www.w3.org/2000/svg');
foreach my $node ($xpc->findnodes('//svg:text')) {
  print $node->to_literal, "\n";
}
      

Note

The XML::LibXML::XPathContext module has been included in the XML::LibXML distribution since version 1.61. Prior to that it was in its own separate distribution on CPAN.

9. Miscellaneous

9.1. Is there a mailing list for Perl and XML?

Yes, the perl-xml mailing list is kindly hosted by ActiveState. The list info page has links to the searchable list archive as well as a form for subscribing. The list has moderate traffic levels (no messages some days, a dozen messages on a busy day) with a knowledgeable and helpful band of subscribers.

9.2. How do I unsubscribe from the perl-xml mailing list?

The list info page links through to an unsubscribe function. Every message sent to the list also includes an 'unsubscribe' link which makes it all the more mystifying that this really is a frequently asked question.

9.3. What happened to Enno?

This is one of the great mysteries of Perl/XML and no answer is available here.

Enno Derksen wrote a number of XML related Perl modules (including XML::DOM and XML::Checker::Parser) which were packaged into a distribution called 'libxml-enno'. No one has heard from Enno in quite a while and TJ Mather has assumed the role of maintainer for some of the modules. Many of us have benefitted from his work so if you're out there Enno - thanks.

Corrections, Contributions and Acknowledgements

This document is a 'work in progress'. A number of questions are still being worked on and will be added when they are complete.

If you wish to report an error or contribute information for inclusion in this document, please email the author at: .

I wish to gratefully acknowledge the assistance of the community of subscribers to the 'perl-xml' mailing list. Their knowledge and advice has been invaluable in preparing this document.