Copyright © 2002 - 2008 Grant McLean
This work is distributed under the terms of Perl's Artistic License. Distribution and modification are allowed provided all of the original copyright notices are preserved.
All code examples in these files are hereby placed into the public domain. You are permitted and encouraged to use this code in your own programs for fun or for profit as you see fit.
Last updated: March 18, 2008
Abstract
This document aims to provide answers to questions that crop up regularly on the 'perl-xml' mailing list. In particular it addresses the most common question for beginners - "Where do I start?"
The official home for this document on the web is: http://perl-xml.sourceforge.net/faq/. The official source for this document is in CVS on SourceForge at http://perl-xml.cvs.sourceforge.net/perl-xml/perl-xml-faq/
1. Tutorial and Reference Sources
1.1. Where can I get a gentle introduction to XML and Perl?
Kip Hampton has written a number of articles on the subject of Perl and XML, which are available at www.xml.com.
1.2. Where can I find an XML tutorial?
For the official executive summary, try the W3C's XML in 10 points. If you're a complete XML newbie and struggling with jargon like 'element', 'entity', 'DTD', 'well formed' etc, you could try XML101.com. The site has no Perl content and a strong Microsoft/IE bias but you can come back here when you're finished :-) On the other hand if you've worked with XML a bit and think you pretty much know it, the tutorial at skew.org will test the boundaries of your knowledge. Another great source of information is the XML FAQ.
1.3. Where can I find reference documentation for the various XML Modules?
The reference documentation is embedded in the Perl code of each module in 'POD' (Plain Old Documentation) format. There are a number of ways you might gain access to this documentation: the 'perldoc' utility that ships with Perl (eg: type 'perldoc XML::Simple' at a command prompt), the HTML documentation installed with ActiveState Perl, or the rendered POD for any module on search.cpan.org.
2. Selecting a Parser Module
2.1. Don't select a parser module.
If you want to use Perl to solve a specific problem, it's possible that someone has already solved it and published their module on CPAN. This will allow you to ignore the details of the XML layer and start working at a higher level - modules like XML::RSS and SOAP::Lite, for example, work with XML data but present you with a domain-specific API.
There are dozens of other examples of existing Perl modules which work with XML data in domain-specific formats and allow you to get on with the job of using that data. Remember, search.cpan.org is your friend.
2.2. The Quick Answer
For general purpose XML processing with Perl, use XML::LibXML. Other modules may be better suited to particular niches - as discussed below.
2.3. Tree versus stream parsers
If you really do need to work with data in XML format, you need to select a parser module and there are many to choose from. Most modules can be classified as using either a 'tree' or a 'stream' model.

A tree based parser will typically parse the whole XML document and return you a data structure made up of 'nodes' representing elements, attributes, text content and other components of the document. A tree based module will typically provide you with an API of functions for searching and manipulating the tree. The Document Object Model (DOM) is a standard API implemented by a number of modules. Other modules use non-standard APIs to take advantage of the many conveniences available to Perl programmers.

A stream based parser, on the other hand, sends the data to your program in a stream of 'events' as the XML is parsed. To use a stream based module, you typically write some handler or 'callback' functions and register them with the parser. As each component of the XML document is read in and recognised, the parser will call the appropriate handler function to give you the data. SAX (the Simple API for XML) is a standard object-oriented API implemented by all the stream-based parsers (except parsers written before SAX existed).
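To make the contrast concrete, here is a minimal sketch of both styles using modules covered later in this FAQ (the 'menu.xml' filename and 'item' element are invented for illustration):

  # Tree style: parse the whole document, then query the in-memory tree
  use XML::LibXML;

  my $doc = XML::LibXML->new->parse_file('menu.xml');
  foreach my $item ($doc->findnodes('//item')) {
      print $item->textContent, "\n";
  }

  # Stream style: register a handler and process events as the parse proceeds
  package MyHandler;
  use base 'XML::SAX::Base';

  sub characters {
      my ($self, $data) = @_;
      print $data->{Data}, "\n";
  }

  package main;
  use XML::SAX::ParserFactory;

  XML::SAX::ParserFactory->parser(Handler => MyHandler->new)->parse_uri('menu.xml');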
2.4. Pros and cons of the tree style
Programmers new to XML may find it easier to get started with a tree based parser - with one method call your document is parsed and available to your code.

Portability may be an issue for tree style code. Even the modules which support a DOM API differ enough that you will generally have to change your code if you need to switch to another parser module. The DOM itself is language-neutral, which may be an advantage if you're coming from C or Java but Perl programmers may find some of the constructs clumsy.

The memory requirements of a tree based parser can be surprisingly high. Because each node in the tree needs to keep track of links to ancestor, sibling and child nodes, the memory required to build a tree can easily reach 10-30 times the size of the source document. You probably don't need to worry about that though unless your documents are multi-megabytes (or you're running on lower spec hardware).

Some of the DOM modules support XPath - a powerful expression language for selecting nodes to extract data from your tree. The full power of XPath simply cannot be supported by stream based parsers since they only hold a portion of the document in memory.
2.5. Pros and cons of the stream style
Stream based parsers can (but don't always) offer better memory efficiency than tree based parsers, since the whole document does not have to be parsed before you can work with it.

SAX parsers also score well for portability. If you use the SAX API with one parser module, you can almost certainly swap to another SAX parser module without changing a line of your code.

The SAX approach encourages a very modular coding style. You can chain SAX handlers together to form a processing pipeline - similar in spirit to a Unix command pipeline. Each link (or 'filter') in the chain performs a well-defined function. The individual components tend to have a high degree of reusability.

SAX also has applications which are not tied to XML. Modules exist that can generate SAX event streams from the results of database queries or the contents of spreadsheet files. Downstream filter modules neither know nor care whether the original source data was in XML format.

One 'gotcha' with the stream based approach is that you can't be sure that a document is error-free until the end of the parse. For this reason, you may want to confirm that a document is well-formed before you pass it through your SAX pipeline.
2.6. How to choose a parser module
Choice is a good thing - there are many parser modules to choose from simply because no one solution will be appropriate in all cases. Get stuck in; if you should discover you made the wrong choice, it's probably not going to be hard to change and you'll have some experience on which to base your next choice. The following advice is one person's view - your mileage may vary:

First of all, make sure you have an up-to-date XML::LibXML installed. If your needs are simple, try XML::Simple. If you're looking for a more powerful tree based approach, try XML::LibXML itself or XML::Twig.

If you've decided to use a stream based approach, head directly for SAX. The XML::SAX distribution is the place to start - see the SAX sections below.

Finally, the latest trendy buzzword in Java and C# circles is 'pull' parsing (see www.xmlpull.org). Unlike SAX, which 'pushes' events at your code, the pull paradigm allows your code to ask for the next bit when it's ready. This approach is reputed to allow you to structure your code more around the data rather than around the API. Eric Bohlman's XML::TokeParser offers a pull-style interface for Perl.
2.7. Rolling your own parser
You may be tempted to develop your own Perl code for parsing XML. After all, XML is text and Perl is a great language for working with text. But before you go too far down that track, consider how much of the XML spec a quick home-grown 'parser' will quietly ignore: well-formedness checking, character encodings, entity references, CDATA sections and namespaces all take real work to handle correctly, and the existing parser modules have already done that work (and had the bugs shaken out) for you.
If none of the existing modules have an API that suits your needs, write your own wrapper module to extend the one that comes closest.
3. CPAN Modules
This section attempts to summarise the most commonly used XML modules available on CPAN. Many of the modules require that you have certain libraries installed and have a compiler available to build the Perl wrapper for the libraries (binary builds are available for some platforms). Except where noted, the parsers are non-validating. You can find more in-depth comparisons of the modules and example source code in the Ways to Rome articles maintained by Michel Rodriguez.
3.1. XML::Parser
XML::Parser is a Perl wrapper around James Clark's 'expat' C library and was for many years the parser underneath most of the other XML modules on CPAN. Before you rush off and try to install XML::Parser, check whether you already have it - many Perl distributions (including ActiveState Perl) bundle it.
Most of the documentation you need for the event-handler ('Handlers') interface is in the XML::Parser POD itself, with the lower-level details covered in XML::Parser::Expat.
3.2. XML::LibXML
XML::LibXML is a Perl interface to the GNOME project's libxml2 library. It provides a fast, standards-compliant DOM with very good XPath support, and it can also be used as a SAX parser. Although libxml2 is developed under the GNOME umbrella, the library is not part of the GNOME desktop itself - it has no GUI dependencies and prebuilt packages are available for most platforms.
For early access to upcoming features such as W3C Schema and RelaxNG validation, you can access the CVS version of XML::LibXML like this:

  cvs -d :pserver:anonymous@axkit.org:/home/cvs co XML-LibXML
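Here is a minimal sketch of typical XML::LibXML usage (the document content, element and attribute names are invented for illustration):

  use XML::LibXML;

  my $parser = XML::LibXML->new;
  my $doc    = $parser->parse_string('<menu><item name="Bubble and Squeak"/></menu>');

  my ($item) = $doc->findnodes('/menu/item');
  $item->setAttribute(price => '4.50');            # modify the tree in place
  $item->appendTextChild(comment => 'popular');    # add a child element with text content

  print $doc->toString(1);                         # serialise the tree, with indenting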
3.3. XML::XPath
Matt Sergeant's XML::XPath module provides a full XPath 1.0 implementation on top of a document tree built with XML::Parser, so you can locate the nodes you're interested in using XPath expressions rather than by walking the tree yourself.
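A minimal sketch of XML::XPath usage (the filename, element and attribute names are invented for illustration):

  use XML::XPath;

  my $xp = XML::XPath->new(filename => 'menu.xml');

  foreach my $node ($xp->findnodes('//item')) {
      print $xp->findvalue('@name', $node), "\n";   # attribute value, relative to each node
  }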
3.4. XML::DOM
Enno Derksen's XML::DOM module implements the W3C's DOM Level 1 API on top of XML::Parser. TJ Mather is currently the maintainer of this package.
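A minimal sketch of reading a document with XML::DOM (the filename, element and attribute names are invented for illustration):

  use XML::DOM;

  my $parser = XML::DOM::Parser->new;
  my $doc    = $parser->parsefile('menu.xml');

  my $items = $doc->getElementsByTagName('item');
  for my $i (0 .. $items->getLength - 1) {
      print $items->item($i)->getAttribute('name'), "\n";
  }

  $doc->dispose;   # XML::DOM trees contain circular references, so free them explicitly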
3.5. XML::Simple
Grant McLean's XML::Simple module provides a simple API for slurping an XML document into a nested Perl data structure (hashes and arrays) and for writing such a structure back out as XML.
Although XML::Simple is very easy to get started with, it is really only suited to small, simple documents of the 'configuration file' variety: it provides no access to things like comments or processing instructions, and it does not handle 'mixed content' (elements containing both text and child elements) well.
If you are using XML::Simple with XML::SAX installed, make sure you also install a fast SAX parser such as XML::SAX::ExpatXS or XML::LibXML, otherwise parsing will fall back to the slow pure-Perl XML::SAX::PurePerl parser.
If you are becoming frustrated by the limitations of XML::Simple, that's a good sign you've outgrown it - consider moving on to XML::LibXML or XML::Twig.
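A minimal sketch of reading a configuration-style document with XML::Simple (the filename, element and attribute names are invented for illustration):

  use XML::Simple;

  my $config = XMLin('config.xml', ForceArray => ['server'], KeyAttr => []);

  foreach my $server (@{ $config->{server} }) {
      print "$server->{name}: $server->{ip}\n";
  }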
3.6. XML::Twig
Although DOM modules can be very convenient, they can also be memory hungry. If you need to work with very large documents, you may find Michel Rodriguez's XML::Twig a better fit. It lets you declare handlers for selected elements ('twigs'): each matching subtree is built in memory, handed to your handler and can then be discarded ('purged' or 'flushed'), so the whole document never has to be held in memory at once.
Another advantage of XML::Twig is its rich, Perl-friendly API, which includes support for a useful subset of XPath expressions.
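A minimal sketch of processing a large document chunk-by-chunk with XML::Twig (the filename and element names are invented for illustration):

  use XML::Twig;

  my $twig = XML::Twig->new(
      twig_handlers => {
          order => sub {
              my ($twig, $order) = @_;
              print $order->first_child_text('total'), "\n";
              $twig->purge;    # free the memory used by this chunk before parsing continues
          },
      },
  );

  $twig->parsefile('orders.xml');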
3.7. Win32::OLE and MSXML.DLL
If you're using a Windows platform and you're not worried about portability, Microsoft's MSXML provides a DOM implementation with optional validation and support for both XPath and XSLT. MSXML is a COM component and as such can be accessed from Perl using Win32::OLE:

  use Win32::OLE qw(in with);

  my $xml_file  = 'your file name here';
  my $node_name = 'element name or XPath expression';

  my $dom = Win32::OLE->new('MSXML2.DOMDocument.3.0') or die "new() failed";
  $dom->{async} = "False";
  $dom->{validateOnParse} = "False";
  $dom->Load($xml_file) or die "Parse failed";

  my $node_list = $dom->selectNodes($node_name);
  foreach my $node (in $node_list) {
      print $node->{Text}, "\n";
  }

Shawn Ribordy has written an article about using MSXML from Perl at www.perl.com. You can find reference documentation for the MSXML API on MSDN.
3.8. XML::PYX
Although written in Perl, Matt Sergeant's XML::PYX is not really a Perl API at all. It provides command-line tools ('pyx' and 'pyxw') for converting XML to and from the line-oriented PYX notation, so that you can manipulate XML with the standard Unix line-oriented tools. For example, this pipeline counts how many times each element type appears in a document:

  pyx doc.xml | sed -n 's/^(//p' | sort | uniq -c

This one creates a copy of an XML document with all attributes stripped out:

  pyx doc1.xml | grep -v '^A' | pyxw > doc2.xml

And this one spell checks the text content of a document skipping over markup text such as element names and attributes:

  pyx talk.xml | sed -ne 's/^-//p' | ispell -l | sort -u

Eric Bohlman's XML::TiePYX module lets you read and write PYX-format data from within Perl via a tied filehandle.

Sean McGrath has written an article introducing PYX on XML.com. PYX can be addictive - especially if you're an awk or sed wizard, but if you find you're using Perl in your pipelines you should consider switching to SAX.
3.9. XML::SAX
The XML::SAX distribution is the starting point for stream-based parsing in Perl. It defines the Perl SAX2 API and includes XML::SAX::ParserFactory (for locating an installed SAX parser), XML::SAX::Base (a base class for parsers, filters and handlers) and XML::SAX::PurePerl - a slow but dependable parser with no compiled dependencies.

Any SAX filters you develop should inherit from XML::SAX::Base. Your filter then only needs to implement the event methods it actually cares about; XML::SAX::Base will automatically pass all other events through to the next handler in the chain.

Authors of the Perl SAX spec and the modules which implement it include Ken MacLeod, Matt Sergeant, Kip Hampton and Robin Berjon.
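As a minimal sketch of the approach (the filter name and input document are invented for illustration), here is a filter that upper-cases character data and lets XML::SAX::Base pass every other event through untouched:

  package MyFilter::UpperCase;
  use base 'XML::SAX::Base';

  sub characters {
      my ($self, $data) = @_;
      $data->{Data} = uc $data->{Data};
      $self->SUPER::characters($data);    # forward the modified event downstream
  }

  package main;
  use XML::SAX::ParserFactory;
  use XML::SAX::Writer;

  my $output;
  my $writer = XML::SAX::Writer->new(Output => \$output);
  my $filter = MyFilter::UpperCase->new(Handler => $writer);
  my $parser = XML::SAX::ParserFactory->parser(Handler => $filter);

  $parser->parse_string('<doc>hello world</doc>');
  print $output, "\n";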
3.10. XML::SAX::Expat
This module implements a SAX2 interface around the expat library, so it's fast, stable and complete. XML::SAX::Expat doesn't link against expat directly but goes through XML::Parser, so you'll need that installed too.
Robin Berjon is the author of this module although he claims to have stolen the code from Ken MacLeod.
3.11. XML::SAX::ExpatXS
This module, analogous to XML::SAX::Expat, provides a SAX2 interface to the expat library. Unlike XML::SAX::Expat though, it links against expat directly (via XS) rather than going through XML::Parser, which makes it noticeably faster.
This module was started by Matt Sergeant and completed by Petr Cimprich.
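If you have more than one SAX parser installed, XML::SAX::ParserFactory will normally choose one for you. A minimal sketch of forcing a particular choice (assuming XML::SAX::ExpatXS is installed; $my_handler stands in for whatever handler object you are using):

  use XML::SAX::ParserFactory;

  $XML::SAX::ParserPackage = 'XML::SAX::ExpatXS';   # override the factory's default choice

  my $parser = XML::SAX::ParserFactory->parser(Handler => $my_handler);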
3.12. XML::SAX::Writer
The XML::SAX::Writer module sits at the end of a SAX pipeline and turns the stream of SAX events back into an XML document. Output can be directed to a file, a filehandle, a scalar reference or another consumer object, and the module can re-encode the output as it is written (see the 'EncodeTo' option mentioned in the encodings section below).
3.13. XML::SAX::Machines
Once you start accumulating SAX filter modules and fitting them together into SAX pipelines, you may find writing the glue code is a little tedious and error prone. Barrie Slaymaker found this and built XML::SAX::Machines, which reduces this:

  use XML::SAX::ParserFactory;
  use MyFilter::One;
  use MyFilter::Two;
  use MyFilter::Three;
  use XML::SAX::Writer;

  my $writer  = XML::SAX::Writer->new(Output => $output_file);
  my $filter3 = MyFilter::Three->new(Handler => $writer);
  my $filter2 = MyFilter::Two->new(Handler => $filter3);
  my $filter1 = MyFilter::One->new(Handler => $filter2);
  my $parser  = XML::SAX::ParserFactory->parser(Handler => $filter1);

  $parser->parse_uri($input_file);

to this:

  use XML::SAX::Machines qw( Pipeline );

  Pipeline(
      MyFilter::One => MyFilter::Two => MyFilter::Three => ">$output_file"
  )->parse_uri($input_file);

There are lots of other goodies as well. You can read about some of them in Kip Hampton's XML::SAX::Machines articles on XML.com.
3.14. XML::XPathScript
XPathScript is a stylesheet language comparable to XSLT, for transforming XML from one format to another (possibly HTML, but XPathScript also shines for non-XML-like output). Like XSLT, XPathScript offers a dialect to mix verbatim portions of documents and code. Also like XSLT, it leverages the powerful "templates/apply-templates" and "cascading stylesheets" design patterns, which greatly simplify the design of stylesheets for programmers. The availability of the XPath query language inside stylesheets promotes the use of a purely document-dependent, side-effect-free coding style. But unlike XSLT, which uses its own dedicated control language with an XML-compliant syntax, XPathScript uses Perl, which is terse and highly extendable.
As of version 0.13 of XML::XPathScript, the module can use either XML::XPath or XML::LibXML as its underlying XPath engine.
3.15. How can I install XML::Parser under Windows?
The ActiveState Perl distribution includes many CPAN modules in addition to the core Perl module set. XML::Parser is one of the bundled modules, so if you're using ActiveState Perl it should already be installed and no further action is needed.
3.16. How can I install other binary modules under Windows?
ActiveState Perl includes the 'Perl Package Manager' (PPM) utility for installing modules. You use PPM from a command window (DOS prompt) like this:

  C:\> ppm
  ppm> install XML::Twig

You must be connected to the Internet to use PPM as it connects to ActiveState's web site to download the packages you select. Refer to the HTML documentation which accompanies ActiveState Perl for instructions on using PPM through a firewall.

One disadvantage of PPM is that you can only install packages that have been packaged in PPM format. You don't have to wait for ActiveState to package modules though, as you can tell PPM to use other package repositories. For example, many of the XML modules are available through Randy Kobes' archive, which can be accessed like this:

  C:\> ppm
  ppm> repository add RK http://theoryx5.uwinnipeg.ca/cgi-bin/ppmserver?urn:/PPMServer58
  ppm> set save
  ppm> install XML::LibXML

The 'set save' command means that PPM will remember this additional repository should you need to install another module later. The location specified above assumes you're using a Perl 5.8 build. If you're still running Perl 5.6, use this command instead:

  ppm> set repository RK http://theoryx5.uwinnipeg.ca/cgi-bin/ppmserver?urn:/PPMServer
3.17. What if a module is not available in PPM format?
Many of the CPAN modules are written entirely in Perl and don't require a compiler, so you can use the CPAN shell to download and install them directly from a CPAN mirror. The only extra tool you need is Microsoft's 'nmake' utility, which you can download from:

  http://download.microsoft.com/download/vc15/Patch/1.52/W95/EN-US/Nmake15.exe

It's a self extracting archive so run it and move the resulting files into your windows (or winnt) directory. Then go to a command window (DOS prompt) and run:

  perl -MCPAN -e shell

The first time you run the CPAN shell, you will be asked a number of questions by the automatic configuration process. Accepting the default is generally pretty safe. You'll be asked where various programs are on your system (eg: gzip, tar, ftp etc). Don't worry if you don't have them since CPAN.pm will use the Perl equivalents where it can. If you want to re-run the configuration later, type this at the 'cpan>' prompt:

  o conf init

If you're behind a firewall, when you're asked for an FTP or HTTP proxy enter its URL like this:

  http://your.proxy.address:port/

You can probably use http:// for both FTP and HTTP (depending on your proxy). After you've selected a CPAN archive near you, you will finally get a 'cpan>' prompt. Then you can type:

  install XML::SAX

and sit back while CPAN.pm downloads, unpacks, tests and installs all the relevant code in all the right places.
3.18. "could not find ParserDetails.ini"
A number of people have reported encountering the error "could not find ParserDetails.ini in ..." when installing or attempting to use XML::SAX. The ParserDetails.ini file is what XML::SAX::ParserFactory uses to keep track of which SAX parser modules are installed; it is normally created when XML::SAX itself is installed, so the error usually means that XML::SAX was installed in a way that skipped its standard install steps (for example, by just copying the files into place).

Once you have successfully installed XML::SAX, any SAX parser module you install afterwards will register itself by adding an entry to ParserDetails.ini.

If you are packaging XML::SAX for a binary distribution (a PPM package, for example), your post-install script should check whether ParserDetails.ini already exists and, only if it does not, run:

  perl -MXML::SAX -e "XML::SAX->add_parser(q(XML::SAX::PurePerl))->save_parsers()"

Don't unconditionally run this command, or users who re-install XML::SAX will have their ParserDetails.ini overwritten and lose any other parsers they had registered.
4. XSLT Support
This section attempts to summarise the current state of Perl-related XSLT solutions. If you're looking for an introduction to XSLT, take a look at Chapter 17 of the XML Bible.
4.1. XML::LibXSLT
Matt Sergeant's XML::LibXSLT module is a Perl wrapper around the GNOME project's libxslt library - a fast, complete XSLT 1.0 implementation. It works hand-in-hand with XML::LibXML: you parse the source document and the stylesheet with XML::LibXML and hand the resulting objects to XML::LibXSLT for the transformation.
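A minimal sketch of a transformation with XML::LibXSLT (the two filenames are invented for illustration):

  use XML::LibXML;
  use XML::LibXSLT;

  my $parser     = XML::LibXML->new;
  my $xslt       = XML::LibXSLT->new;

  my $source     = $parser->parse_file('source.xml');
  my $style_doc  = $parser->parse_file('style.xsl');
  my $stylesheet = $xslt->parse_stylesheet($style_doc);

  my $results    = $stylesheet->transform($source);
  print $stylesheet->output_string($results);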
4.2. XML::Sablotron
Sablotron is an XML toolkit implementing XSLT 1.0, DOM Level 2 and XPath 1.0. It is written in C++ and developed as an open source project by Ginger Alliance. Since the XSLT engine is written in C++ (and uses expat for parsing), it is fast; the XML::Sablotron module provides the Perl interface to it.
4.3. XML::XSLT
This module aims to implement XSLT in Perl, so as long as you have its prerequisite modules installed you don't need a separate XSLT library. Only a subset of the XSLT specification is implemented, but it may well be enough for your needs. The distribution includes an 'xslt-parser' command-line script which you can use like this:

  xslt-parser -s toc-links.xsl perl-xml-faq.xml > toc.html

Egon Willighagen has written An Introduction to Perl's XML::XSLT module at linuxfocus.org.
4.4. XML::Filter::XSLT
Matt has also written XML::Filter::XSLT, which packages an XSLT transformation (performed by XML::LibXSLT) as a SAX filter, so a transform can be just another stage in a SAX pipeline.
4.5. AxKit
If you're doing a lot of XML transformations (particularly for web-based clients), you should take a long hard look at AxKit. AxKit is a Perl-based (actually mod_perl-based) XML Application server for Apache. Its key features include support for multiple stylesheet languages (XSLT, XPathScript and XSP), chained transformation pipelines, and caching of transformed output so repeat requests are served quickly.
5. Encodings
5.1. Why do we need encodings?
Text documents have long been encoded in ASCII - a 7 bit code comprising 128 unique characters of which 32 are reserved for non-printable control functions. The remaining 96 characters are barely sufficient for variants of English, less than ideal for western european languages and totally inadequate for just about anything else. 'Point solutions' have been applied with different 8 or 16 bit codes used in different regions. One obvious limitation of such solutions is that a single document cannot contain text in two or more languages from different regions.

The recognised solution to these problems is Unicode/ISO-10646 - two separate but compatible standards which aim to provide standard character mappings for all the world's languages in a single character set. All XML parsers are required to implement Unicode, but we can't get away from the fact that most electronic documents in existence today are not in Unicode. Even documents that have been produced recently are unlikely to be Unicode. Therefore, XML parsers are also able to work with non-Unicode documents - as long as each document contains an encoding declaration which the parser can use to map characters to the Unicode character set.

One particularly smart thing the Unicode designers did was to make the first 128 characters of Unicode the same as ASCII, so pure ASCII documents are also Unicode. The 'Latin 1' character set (ISO-8859-1) is a popular 8 bit code which adds a further 96 printable characters to ASCII and is commonly used in Western Europe. The extra 96 characters are also mapped to the identical character numbers in Unicode (ie: ASCII is a subset of ISO-8859-1 which is a subset of Unicode). Note however that although Unicode provides a number of ways to encode characters above 0x7F, none are quite the same as ISO-8859-1.

Further reference material on encodings can be found at The ISO 8859 Alphabet Soup.
5.2. What is UTF-8?
Since Unicode supports character positions higher than 256, a representation of those characters will obviously require more than one 8-bit byte. There is more than one system for representing Unicode characters as byte sequences. UTF-8 is one such system. It uses a variable number of bytes (from 1 to 4 according to RFC3629) to represent each character. This means that the most common characters (ie: 7 bit ASCII) only require one byte.

In UTF-8 encoded data, the most significant bit of each byte will be 0 for single byte characters and 1 for each byte of a multibyte character. This is obviously not compatible with 8-bit codes such as Latin1 in which all characters are 8 bits and all characters beyond 127 have the high bit set. Parsers assume their data is UTF-8 unless another encoding is declared, so if you feed Latin1 data into an XML parser without declaring an encoding, the parser will most likely choke on the first character greater than 0x7F.

If you are interested in the gory details, read on... The number of leading 1 bits in the first byte of a multi-byte sequence is equal to the total number of bytes. Each of the follow-on bytes will have the first bit set to 1 and the second to zero. All remaining bits (shown as 'x' below) are used to represent the character number.

  1 byte character   0xxxxxxx
  2 byte character   110xxxxx 10xxxxxx
  3 byte character   1110xxxx 10xxxxxx 10xxxxxx
  4 byte character   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

UTF-16 encoding is an alternative byte representation of Unicode which for most cases amounts to a fixed-width 16 bit code. ASCII and Latin1 characters (the first 256 characters) are represented as normal but with a preceding 0x00 byte. Although UTF-16 is conceptually simpler than UTF-8 (and is the native encoding used by Java), two major drawbacks mean it is not the preferred format for C or Perl: the embedded null bytes play havoc with conventional C string handling, and a plain ASCII document doubles in size when encoded as UTF-16.

For more information, visit the UTF-8 and Unicode FAQ for Unix/Linux.
5.3. What can I do with a UTF-8 string?
You could obviously convert a UTF-8 encoded string to some other encoding, but before we get on to that, let's look at what you can do with it in its 'natural state'.

If you wish to display the string in a web browser, no conversion is necessary. Modern browsers can understand UTF-8 directly, as can be seen on this page on the kermit project web site (some characters in the page will not display correctly without the correct fonts installed but that's a font issue rather than an encoding issue). To use UTF-8 encoded HTML, simply append this 'charset' modifier to your Content-type header:

  Content-type: text/html; charset=utf-8

Or if you can't control the headers, include this meta tag in the document:

  <meta http-equiv="Content-type" content="text/html; charset=utf-8">

For a 'low tech' alternative, you might find that UTF-8 text is readable on your display if you simply print it to STDOUT. XFree86 version 4.0 supports Unicode fonts and Xterm supports UTF-8 multibyte characters (assuming your locale is set correctly). A growing number of editors support UTF-8 files and you can even write your Perl scripts in UTF-8 (since 5.6) using your native language for identifier names.

For more information, you may wish to visit the Perl, Unicode and i18N FAQ.
5.4. What can Perl do with a UTF-8 string?
Perl versions prior to 5.6 had no knowledge of UTF-8 encoded characters. You can still work with UTF-8 data in these older Perl versions but you'll probably need the help of a module like Unicode::String.

The built-in functions in Perl 5.6 and later are UTF-8 aware, so for example length() will return the number of characters in a UTF-8 string rather than the number of bytes, and regular expressions will match whole characters rather than individual bytes.

None of this added functionality comes at the expense of support for binary data. Perl's internal SV data structure (used to represent scalar values) includes a flag to indicate whether the string value is UTF-8 encoded. If this flag is not set, byte semantics will be used by all functions that operate on the string, eg: length() will return a count of bytes.

You can include Unicode characters in your string literals using the hex character number in an extended \x sequence. For example, this declaration includes the Euro symbol:

  my $price_label = "\x{20AC}9.99";

This regular expression will match a string which starts with the Euro symbol:

  my $euro = "\x{20AC}";

  /^$euro/ && print;

And here's a regular expression that will match any upper case character - not just A-Z, but any character with the Unicode upper case property:

  /\p{IsUpper}/

You can read more in perldoc perlunicode and perldoc utf8.
5.5. What can Perl 5.8 do with a UTF-8 string?
The Unicode support in Perl 5.6 had a number of omissions and bugs. Many of the shortcomings were fixed in Perl 5.8 and 5.8.1. One major leap forward in 5.8 was the move to Perl IO and 'layers' which allows translations to take place transparently as file handles are read from or written to. A built-in layer called ':encoding' will automatically translate data to UTF-8 as it is read, or to some other encoding as it is written. For example, given a UTF-8 string, this code will write it out to a file as ISO-8859-1:

  open my $fh, '>:encoding(iso-8859-1)', $path or die "open($path): $!";
  print $fh $utf_string;

The new core module 'Encode' can be used to translate between encodings (but since that usually only makes sense during IO, you might as well just use layers) and also provides the 'is_utf8' function for accessing the UTF-8 flag on a string.
5.6. How can I convert from UTF-8 to another encoding?
This being Perl, there's more than one way to do it ...
One approach is to use a regular expression to replace each non-ASCII character with a numeric character entity, giving pure-ASCII output that any parser or editor can cope with:

  use utf8; # Only needed for 5.6, not 5.8 or later

  s/([\x{80}-\x{FFFF}])/'&#' . ord($1) . ';'/gse;

Andreas Koenig has supplied an alternative regular expression:

  s/([^\x20-\x7F])/'&#' . ord($1) . ';'/gse;

This version does not require 'use utf8' with Perl 5.6; does not require a version of Perl which recognises \x{NN}; and handles characters outside the 0x80-0xFFFF range. Even if you are outputting Latin1, you will need to use a technique like this for all characters beyond position 255 (eg: the Euro symbol) since there is no other way to represent them in Latin1.

This technique can be used for the character content of elements and attribute values. It cannot be used for the element or attribute names since the result would not be well-formed XML.

Note: Remember, in XML the number in a numeric character entity represents the Unicode character position regardless of the document encoding.
If you're running Perl 5.8 or later, you can simply use an ':encoding' layer on the output filehandle and let Perl do the conversion as it writes:

  open my $fh, '>:encoding(iso-8859-1)', $path or die "open($path): $!";
  print $fh $utf_string;

You can also push an encoding layer onto an already open filehandle like this:

  binmode(STDOUT, ':encoding(windows-1250)');
One technique you may see in older documentation uses tr/// with the 'U' and 'C' options; this was a short-lived Perl 5.6 feature which has since been removed, so avoid it. Just to make quite sure that you know exactly which code to avoid using, here is an example of translating from UTF-8 ('U') to 8 bit Latin1 ('C'):

  $string =~ tr/\0-\x{FF}//UC; # Don't do this
Another approach is to use pack and unpack to convert between Unicode character numbers ('U') and bytes ('C'):

  use utf8; # Not required with 5.8 or later

  my $u_city = "S\x{E3}o Paulo";
  my $l_city = pack("C*", unpack('U*', $u_city));

The first assignment creates a UTF-8 string 9 characters long (but 10 bytes long). The second assignment creates a Latin-1 encoded version of the string.
The Unicode::String module from CPAN can also perform the conversion:

  use Unicode::String;

  $ustr = Unicode::String::utf8($string);
  $latin1 = $ustr->latin1();
The Text::Iconv module provides a Perl interface to the 'iconv' library and can convert between a wide range of encodings:

  use Text::Iconv;

  $converter = Text::Iconv->new('UTF-8', 'ISO8859-1');
  print $converter->convert($string);

The biggest hurdle with using Text::Iconv may be getting it installed: it needs the underlying iconv C library, which is standard on most modern Unix-like systems but may have to be installed separately on other platforms.
If your output is being generated by XML::SAX::Writer, you can simply ask it to encode the output for you:

  my $writer = XML::SAX::Writer->new(EncodeTo => 'ISO8859-1');

Internally, XML::SAX::Writer hands the conversion off to an encoding module such as Text::Iconv, so the same range of target encodings is available.
5.7. What does 'use utf8;' do?
In Perl 5.8 and later, the sole use of the 'use utf8;' pragma is to tell Perl that your script is written in UTF-8 (ie: any non-ASCII or multibyte characters should be interpreted as UTF-8). So if your code is plain ASCII, you don't need the pragma.

The original UTF8 support in Perl 5.6 required the pragma to enable wide character support for builtin functions (such as length) and the regular expression engine. This is no longer necessary in 5.8 since Perl automatically uses character rather than byte semantics with strings that have the utf8 flag set.

You can find out more about how Unicode handling changed in Perl 5.8 from the perl58delta.pod file that ships with Perl.
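For example, a script saved in UTF-8 might look like this (a tiny sketch; the string is just an illustration):

  use utf8;                        # the source code of this script is UTF-8 encoded

  my $greeting = "Grüß Gott";      # non-ASCII literal, interpreted as characters
  print length($greeting), "\n";   # prints 9 (characters), not 11 (bytes)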
5.8. What are some commonly encountered problems with encodings?
One common problem is text prepared on Windows that contains 'smart quotes' and similar characters from the windows-1252 code page, but is labelled (or assumed to be) ISO-8859-1 or UTF-8. If a document really does use the Windows code page, say so in its XML declaration:

  <?xml version='1.0' encoding='WINDOWS-1252' ?>
Another common problem is text submitted through web forms. If you want browsers to send you UTF-8 data, say so by including a charset in the Content-Type header:

  print CGI->header('text/html; charset=utf-8');

or in a meta tag in the document itself:

  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

If you find you've received characters in the range 0x80-0x9F, they are unlikely to be ISO Latin1. This commonly results from users preparing text in Microsoft Word and copying/pasting it into a web form. If they have the 'smart quotes' option enabled, the text may contain WinLatin1 characters. The following routine can be used to 'sanitise' the data by replacing 'smart' characters with their common ASCII equivalents and discarding other troublesome characters:

  sub sanitise {
      my $string = shift;

      $string =~ tr/\x91\x92\x93\x94\x96\x97/''""\-\-/;
      $string =~ s/\x85/.../sg;
      $string =~ tr/\x80-\x9F//d;

      return($string);
  }

Note: It might be safer to simply reject any input with characters in the above range since it implies the browser ignored your charset declaration and guessing the encoding is risky at best.
6. Validation
The XML Recommendation says that an XML document is 'valid' if it has an associated document type declaration and if the document complies with the constraints expressed in it. At the time the recommendation was written, the SGML Document Type Definition (DTD) was the established method for defining a class of documents; and validation was the process of confirming that a document conformed to its declared DTD.

These days, there are a number of alternatives to the DTD and the term validation has assumed a broader meaning than simply DTD conformance. The most visible alternative to the DTD is the W3C's own XML Schema. Relax NG is a popular alternative developed by OASIS.

If you design your own class of XML document, you are perfectly free to select whichever system for defining and validating document conformance suits you best. You may even choose to develop your own system. The following paragraphs describe Perl tools to consider.
6.1. DTD Validation Using XML::Checker
Enno Derksen's XML::Checker module implements DTD validation in Perl on top of XML::Parser. Here's a short example to get you going. First, a DTD saved in a file called xcard.dtd:

  <!ELEMENT xcard (firstname,lastname,email?)>
  <!ELEMENT firstname (#PCDATA)>
  <!ELEMENT lastname (#PCDATA)>
  <!ELEMENT email (#PCDATA)>

Here's an XML document that refers to the DTD:

  <?xml version="1.0" ?>
  <!DOCTYPE xcard SYSTEM "file:/opt/xml/xcard.dtd" >
  <xcard>
    <firstname>Joe</firstname>
    <lastname>Bloggs</lastname>
    <email>joe@bloggs.com</email>
  </xcard>

And here's a code snippet to validate the document:

  use XML::Checker::Parser;

  my $xp = new XML::Checker::Parser ( Handlers => { } );

  eval {
    $xp->parsefile($xml_file);
  };
  if ($@) {
    # ... your error handling code here ...
    print "$xml_file failed validation!\n";
    die "$@";
  }
  print "$xml_file passed validation\n";

You can play around with adding and removing elements from the document to get an idea of what happens when validation errors occur. You'll also want to refer to the documentation for the 'SkipExternalDTD' option for more robust handling of external DTDs.
6.2. DTD Validation Using XML::LibXML
The XML::LibXML parser can validate a document against its DTD as it is parsed - simply enable the validation option before calling one of the parse methods:

  $parser->validation(1);

Validation using XML::LibXML is fast - much faster than a pure-Perl validator such as XML::Checker.

The libxml2 distribution (which underlies XML::LibXML) also includes the 'xmllint' command-line tool, which can validate a document against its DTD like this:

  xmllint --valid --noout filename.xml
6.3. W3C Schema Validation With XML::LibXML
Recent versions of XML::LibXML (built against a sufficiently recent libxml2) can validate a document against a W3C Schema like this:

  use XML::LibXML;

  my $schema_file = 'po.xsd';
  my $document    = 'po.xml';

  my $schema = XML::LibXML::Schema->new(location => $schema_file);

  my $parser = XML::LibXML->new;
  my $doc    = $parser->parse_file($document);

  eval { $schema->validate($doc) };
  die $@ if $@;

  print "$document validated successfully\n";

The referenced XSD schema file and sample XML document can be downloaded from the W3C Schema Primer. The xmllint command line validator included in the libxml2 distribution can also do W3C schema validation. You can use it like this:

  xmllint --noout --schema po.xsd po.xml
6.4. W3C Schema Validation With XML::Xerces
XML::Xerces is a Perl interface to the Apache Xerces C++ parser, which includes support for validating documents against a W3C Schema. You will need the Xerces C++ libraries installed before building the Perl module.
6.5. W3C Schema Validation With XML::Validator::Schema
Sam Tregar's XML::Validator::Schema implements a useful subset of the W3C Schema specification in pure Perl, packaged as a SAX filter:

  use XML::SAX::ParserFactory;
  use XML::Validator::Schema;

  my $schema_file = 'po.xsd';
  my $document    = 'po.xml';

  my $validator = XML::Validator::Schema->new(file => $schema_file);
  my $parser    = XML::SAX::ParserFactory->parser(Handler => $validator);

  eval { $parser->parse_uri($document); };
  die $@ if $@;

  print "$document validated successfully\n";
6.6. Simple XML Validation with Perl
Kip Hampton has written an article describing how a combination of Perl and XPath can provide a quick, lightweight solution for validating documents.
6.7. XML::Schematron
Kip has also written the XML::Schematron modules, which let you validate documents against Schematron schemas - a rule-based validation language in which the constraints are expressed as XPath assertions.
7. Common Coding Problems
7.1. How should I handle errors?
Most of the Perl parsing tools will simply call die() if they encounter an error - such as XML which is not well formed. If you don't want your program to fall over in that situation, wrap the call to the parser in an eval block, like this:

  use XML::Simple;

  my $ref = eval { XMLin('<bad>not well formed'); };

  if($@) {
    print "An error occurred: $@";
  }
  else {
    print "It worked!";
  }

Don't forget the semi-colon after the code block passed to eval. The '$@' variable contains the scalar value which was passed to die() - typically an error message describing what was wrong and where.
7.2. Why is my character data split into multiple events?
If you parsed this XML file ...

  <menu>Bubble &amp; Squeak</menu>

... with this code ...

  use XML::Parser;

  my $xp = new XML::Parser(Handlers => { Char => \&char_handler });

  $xp->parsefile('menu.xml');

  sub char_handler {
    my($xp, $data) = @_;

    print "Character data: '$data'\n";
  }

... you might expect this output ...

  Character data: 'Bubble & Squeak'

... in fact you'd probably get this ...

  Character data: 'Bubble '
  Character data: '&'
  Character data: ' Squeak'

The reason is that parsers are not required to give you all of an element's character data in one chunk. The number of characters you get in each chunk may depend on the parser's internal buffer sizes, newline characters in the data, or (as in our example) embedded entities. It doesn't really matter what causes the data to be split - you just have to be prepared to handle it.

The usual approach is to accumulate data in the character event and defer processing it until the end element event. Here's a sample implementation using XML::Parser:

  use XML::Parser;

  my $xp = new XML::Parser(Handlers => {
    Start => \&start_handler,
    Char  => \&char_handler,
    End   => \&end_handler
  });

  $xp->parsefile('menu.xml');

  sub start_handler {
    my($xp) = @_;

    $xp->{cdata_buffer} = '';
  }

  sub char_handler {
    my($xp, $data) = @_;

    $xp->{cdata_buffer} .= $data;
  }

  sub end_handler {
    my($xp) = @_;

    print "Character data: '$xp->{cdata_buffer}'\n";
  }

Of course you probably won't be coding to the XML::Parser API directly, but the same technique applies whichever parser or API you use.
7.3. How can I split a huge XML file into smaller chunks?
When your document is too large to slurp into memory, the DOM, XPath and XSLT tools can't really help you. You could write your own SAX filter fairly easily, but Michel Rodriguez has written a general solution so you don't have to. You'll find it bundled with XML::Twig from version 3.16.
8. Common XML Problems
The error messages and questions listed in this section are not really Perl-specific problems, but they are commonly encountered by people new to XML:
8.1. 'xml processing instruction not at start of external entity'
If you include an XML declaration, it must be the very first thing in the document - it cannot even be preceded by whitespace or blank lines. For example, this would be 'well formed' XML as long as the '<' and the '?' are the first and second characters in the file:

  <?xml version='1.0' standalone='yes'?>
  <doc>
    <title>Test Document</title>
  </doc>
8.2. 'junk after document element'
A well formed XML document can contain only one root element. So, for example, this would be well formed:

  <para>Paragraph 1</para>

while this would not:

  <para>Paragraph 1</para>
  <para>Paragraph 2</para>
8.3. 'not well-formed (invalid token)'
There are a number of causes of this error; here are some common ones:

All attribute values must be enclosed in quotes. For example, this would be well formed:

  <item name="widget"></item>

while this would not:

  <item name=widget></item>

The document may contain characters outside the ASCII range (accented characters, for example) without declaring the encoding it actually uses, in which case the parser assumes UTF-8 and chokes on the first byte sequence that isn't valid UTF-8. If your document is in Latin1, say so in the XML declaration:

  <?xml version='1.0' encoding='iso-8859-1'?>
8.4. 'undefined entity'
XML only pre-defines the following named character entities:

  &lt;    <
  &gt;    >
  &amp;   &
  &quot;  "
  &apos;  '

If your XML includes HTML-style named character entities (eg: &nbsp; or &uuml;) you have two choices:

You could replace the named entities with numeric entities. For example the non-breaking space character is at position 160 (hex A0) so you could represent it with: &#160; (or &#xA0;). Similarly, you could represent a lower case u-umlaut as &#252; (or &#xFC;).

Alternatively, you could define your own named character entities in your DTD or in an 'internal subset' of a DTD. For example:

  <!DOCTYPE doc [
    <!ENTITY eacute "&#233;" >
    <!ENTITY euro   "&#8364;" >
  ]>
  <doc>Combien avez-vous pay&eacute;? 125 &euro;</doc>

You can find the definitions for HTML Latin 1 character entities on the W3C Site. You can include all these character entities into your DTD, so that you won't have to worry about it anymore:

  <!DOCTYPE doc [
    <!ENTITY % HTMLlat1 PUBLIC
     "-//W3C//ENTITIES Latin 1 for XHTML//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent">
    %HTMLlat1;
    <!ENTITY % HTMLspecial PUBLIC
     "-//W3C//ENTITIES Special for XHTML//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent">
    %HTMLspecial;
    <!ENTITY % HTMLsymbol PUBLIC
     "-//W3C//ENTITIES Symbols for XHTML//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent">
    %HTMLsymbol;
  ]>
8.5. 'reference to invalid character number'
The XML spec defines legal characters as tab (0x09), carriage return (0x0D), line feed (0x0A) and the legal graphic characters of Unicode. This specifically excludes control characters, so a reference like this would not be well-formed:

  <char>&#4;</char>

There really is no easy or standard way to include control characters in XML - binary data must be encoded (for example using MIME::Base64) before it can be placed in a document.
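For example, here is a minimal sketch of wrapping a binary file in Base64 before embedding it in a document (the filename and element name are invented for illustration):

  use MIME::Base64;

  open my $fh, '<', 'logo.png' or die "open: $!";
  binmode $fh;
  my $data = do { local $/; <$fh> };    # slurp the raw bytes

  print "<logo encoding='base64'>\n", encode_base64($data), "</logo>\n";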
8.6. Embedding Arbitrary Text in XML
Any text you include in your XML documents must not contain an unescaped '<' or '&' character. Manually escaping these characters can be tedious when you want to include a block of program code or HTML. You can use a CDATA section to indicate to the parser that the text within it should not be parsed for markup. For example, this XML document ...

  <code><![CDATA[
    if($qty < 1) {
      print "<p>Invalid quantity!</p>";
    }
  ]]></code>

is equivalent to this document ...

  <code>
    if($qty &lt; 1) {
      print "&lt;p&gt;Invalid quantity!&lt;/p&gt;";
    }
  </code>

When you parse a document, your code has no way of knowing if a particular piece of text came from a CDATA section - and you probably shouldn't care.

Note: CDATA is for character data - not binary data. If you need to include binary data in your document, you should encode it (perhaps using MIME::Base64) and decode it again after parsing.
8.7. Using XPath with Namespaces
People often experience difficulty getting their XPath expressions to match when they first use documents with namespaces. Consider this sample XHTML document with a fragment of embedded SVG:

  <?xml version="1.0" encoding="UTF-8"?>
  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
  <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
      <title>Sample Document</title>
    </head>
    <body>
      <h1>An HTML Heading</h1>
      <s:svg xmlns:s="http://www.w3.org/2000/svg" width="300" height="200">
        <s:rect style="fill: #eeeeee; stroke: #000000; stroke-width: 1;"
                width="80" height="30" x="60" y="50" />
        <s:text style="font-size: 12px; fill: #000066; font-family: sans-serif;"
                x="70" y="70">Label One</s:text>
      </s:svg>
    </body>
  </html>

The elements in the SVG section each use the namespace prefix 's' which is bound to the URI 'http://www.w3.org/2000/svg'. The prefix 's' is completely arbitrary and is merely a mechanism for associating the URI with the elements. As a programmer, you will perform matches against namespace URIs not prefixes. The elements in the XHTML wrapper do not have namespace prefixes, but are bound to the URI 'http://www.w3.org/1999/xhtml' by way of the default namespace declaration on the opening <html> tag.

You might expect that you could match all the 'h1' elements using this XPath expression ...

  //h1

... however, that won't work since the namespace URI is effectively part of the name of the element you're trying to match. One approach would be to fashion an XPath query which ignored the namespace portion of element names and matched only on the 'local name' portion. For example:

  //*[local-name() = 'h1']

A better approach is to match the namespace portion as well. To achieve that, the first step is to use XML::LibXML::XPathContext to register a prefix of your own choosing for each namespace URI, and then use those prefixes in your XPath expressions:

  my $parser = XML::LibXML->new();
  my $doc    = $parser->parse_file('sample.xhtml');

  my $xpc = XML::LibXML::XPathContext->new($doc);
  $xpc->registerNs(xhtml => 'http://www.w3.org/1999/xhtml');

  foreach my $node ($xpc->findnodes('//xhtml:h1')) {
    print $node->to_literal, "\n";
  }

The same technique can be used to match 'text' elements in the SVG section:

  $xpc->registerNs(svg => 'http://www.w3.org/2000/svg');

  foreach my $node ($xpc->findnodes('//svg:text')) {
    print $node->to_literal, "\n";
  }

Note: The XML::LibXML::XPathContext module was once distributed separately, but current releases of XML::LibXML include it.
9. Miscellaneous
9.1. Is there a mailing list for Perl and XML?
Yes, the perl-xml mailing list is kindly hosted by ActiveState. The list info page has links to the searchable list archive as well as a form for subscribing. The list has moderate traffic levels (no messages some days, a dozen messages on a busy day) with a knowledgeable and helpful band of subscribers.
9.2. How do I unsubscribe from the perl-xml mailing list?
The list info page links through to an unsubscribe function. Every message sent to the list also includes an 'unsubscribe' link which makes it all the more mystifying that this really is a frequently asked question.
9.3. What happened to Enno?
This is one of the great mysteries of Perl/XML and no answer is available here. Enno Derksen wrote a number of XML related Perl modules (including XML::DOM and XML::Checker) before disappearing from the Perl/XML scene; his modules have since been taken over by other maintainers (TJ Mather now maintains XML::DOM).
This document is a 'work in progress'. A number of questions are still being worked on and will be added when they are complete.
If you wish to report an error or contribute information for inclusion in this document, please email the author at: <grantm@cpan.org>.
I wish to gratefully acknowledge the assistance of the community of subscribers to the 'perl-xml' mailing list. Their knowledge and advice have been invaluable in preparing this document.