Perl SAX 2.1 Binding

Perl SAX (Simple API for XML) is a common interface for XML parsers. It lets application writers use XML parsers without depending on which parser is actually used.

This document describes the version of SAX used by Perl modules. The original version of SAX 2.0, for Java, is described at http://sax.sourceforge.net/.

There are two basic interfaces in the Perl version of SAX: the parser interface and the handler interface. The parser interface creates new parser instances, starts parsing, and provides additional information to handlers on request. The handler interface is used to receive parse events from the parser. This pattern is also commonly called "Producer and Consumer" or "Generator and Sink".

All handler methods take a single argument: a hash reference. Hash values are Unicode strings (scalars with the UTF-8 flag on).

Note that the parser doesn't have to be an XML parser. All it needs to do is provide a stream of events to the handler as if it were parsing XML; the actual data from which the events are generated can be anything: a Perl object, a CSV file, a database table, and so on.
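As an illustration of a non-XML event source, here is a minimal sketch of a generator that turns a list of strings into SAX events. The package names RowGenerator and EventLogger, the parse_rows() method, and the element names rows and row are hypothetical and not part of Perl SAX; the sketch also assumes that elements outside any namespace carry empty strings in their namespace properties.

```perl
use strict;
use warnings;

# Hypothetical event generator: "parses" a list of strings, not XML.
package RowGenerator;

sub new {
    my ($class, %opts) = @_;
    return bless { Handler => $opts{Handler} }, $class;
}

# Build an element hash. Assumption: elements outside any namespace
# use empty strings for Prefix and NamespaceURI.
sub _element {
    my ($name, %extra) = @_;
    return {
        Name         => $name,
        LocalName    => $name,
        Prefix       => '',
        NamespaceURI => '',
        %extra,
    };
}

# Hypothetical entry point; emits one <row> element per string.
sub parse_rows {
    my ($self, @rows) = @_;
    my $h = $self->{Handler};

    $h->start_document({});
    $h->start_element(_element('rows', Attributes => {}));
    for my $row (@rows) {
        $h->start_element(_element('row', Attributes => {}));
        $h->characters({ Data => $row });
        $h->end_element(_element('row'));
    }
    $h->end_element(_element('rows'));
    return $h->end_document({});
}

# Hypothetical handler that records every event it receives.
package EventLogger;

sub new { my ($class) = @_; return bless { log => [] }, $class }

sub _log {
    my ($self, $line) = @_;
    push @{ $self->{log} }, $line;
    return;
}

sub start_document { my ($self) = @_;      $self->_log('start_document') }
sub end_document   { my ($self) = @_;      $self->_log('end_document') }
sub start_element  { my ($self, $el) = @_; $self->_log("start $el->{Name}") }
sub end_element    { my ($self, $el) = @_; $self->_log("end $el->{Name}") }
sub characters     { my ($self, $c) = @_;  $self->_log("text $c->{Data}") }

package main;

my $logger = EventLogger->new;
RowGenerator->new(Handler => $logger)->parse_rows('alpha', 'beta');
print "$_\n" for @{ $logger->{log} };
```

The handler cannot tell that no XML was involved: it sees the same sequence of events an XML parser would produce for an equivalent document.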

SAX is typically used like this:

    my $handler = MyHandler->new();
    my $parser = AnySAXParser->new( Handler => $handler );
    $parser->parse_uri($uri);

Handlers are typically written like this:

    package MyHandler;

    use strict;
    use warnings;

    sub new {
        my $type = shift;
        return bless {}, $type;
    }

    sub start_element {
        my ($self, $element) = @_;

        print "Starting element $element->{Name}\n";
    }

    sub end_element {
        my ($self, $element) = @_;

        print "Ending element $element->{Name}\n";
    }

    sub characters {
        my ($self, $characters) = @_;

        print "characters: $characters->{Data}\n";
    }

    1;

Basic SAX Parser

These methods and options are the most commonly used with SAX parsers and event generators.

Applications may not invoke a parse() method again while a parse is in progress (they should create a new SAX parser instead for each nested XML document). Once a parse is complete, an application may reuse the same parser object, possibly with a different input source.

During the parse, the parser provides information about the XML document through the registered event handlers. An event for which the handler's class defines no corresponding method is simply not delivered, so a handler receives only the events it is interested in.
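One way an event generator might honor this rule is to check for the method with can() before each call. The following is a sketch only: the ElementCounter package and the dispatch() helper are hypothetical names, and real parsers may implement the check differently.

```perl
use strict;
use warnings;

# Hypothetical handler that only implements start_element().
package ElementCounter;

sub new { my ($class) = @_; return bless { count => 0 }, $class }

sub start_element {
    my ($self, $element) = @_;
    $self->{count}++;
    return;
}

package main;

# Hypothetical helper: deliver an event only if the handler has a
# method of that name; otherwise skip the event silently.
sub dispatch {
    my ($handler, $event, $data) = @_;
    my $method = $handler->can($event) or return;
    return $handler->$method($data);
}

my %doc = (Name => 'doc', LocalName => 'doc', Prefix => '', NamespaceURI => '');
my @events = (
    [ start_document => {} ],
    [ start_element  => { %doc, Attributes => {} } ],
    [ characters     => { Data => 'hi' } ],
    [ end_element    => { %doc } ],
    [ end_document   => {} ],
);

my $counter = ElementCounter->new;
my @skipped;
for my $event (@events) {
    my ($name, $data) = @$event;
    push @skipped, $name unless $counter->can($name);
    dispatch($counter, $name, $data);
}

print "elements seen: $counter->{count}\n";
print "skipped: @skipped\n";
```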

If you generate SAX events yourself, the data passed to each handler method must include all the properties defined in this document for that event, unless otherwise specified.

parse([options])
This generic method dispatches to one of the following methods, depending on the Source option. options can be a list of option/value pairs or a hash reference. Options include Handler, features and properties, and advanced SAX parser options.

parse_uri(uri [, options])
Parses the XML instance identified by uri (a system identifier). options are the same as for parse(). parse_uri() returns the result of calling the end_document() handler. The options supported by parse_uri() may vary slightly if what is being "parsed" isn't XML.

parse_file(stream [, options])
Parses the XML instance in the already opened stream. The stream can be a file handle, a glob reference or an object of an IO::Handle subclass. In addition to streams, parse_file() also accepts system paths (as parse_uri() does), to prevent possible confusion caused by the name of this method. options are the same as for parse(). parse_file() returns the result of calling the end_document() handler.

parse_string(string [, options])
Parses the XML instance in string. options are the same as for parse(). parse_string() returns the result of calling the end_document() handler.

Handler
The default handler object to receive all events from the parser. Applications may change Handler in the middle of the parse and the SAX parser will begin using the new handler immediately. The Advanced SAX document lists a number of more specialized handlers that can be used should you wish to dispatch different types of events to different objects.

Basic SAX Handler

These methods are the most commonly used by SAX handlers.

start_document(document)
Receive notification of the beginning of a document.

The SAX parser will invoke this method only once, before any other methods (except for set_document_locator() in advanced SAX handlers). No properties are defined for this event (document is an empty hash reference).

end_document(document)
Receive notification of the end of a document.

The SAX parser will invoke this method only once, and it will be the last method invoked during the parse. The parser shall not invoke this method until it has either abandoned parsing (because of an unrecoverable error) or reached the end of input.

No properties are defined for this event (document is an empty hash reference).

The return value of end_document() is returned by the parser's parse() methods.
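The sketch below shows why this return value is useful: a handler can build a result during the parse and hand it back through end_document(). The ElementTally package and the fake_parse() function are hypothetical; fake_parse() merely stands in for a parser's parse method by making the same calls in the same order.

```perl
use strict;
use warnings;

# Hypothetical handler: tallies element names during the parse.
package ElementTally;

sub new { my ($class) = @_; return bless { seen => {} }, $class }

# Reset state so the handler can be reused for another document.
sub start_document {
    my ($self, $document) = @_;
    $self->{seen} = {};
    return;
}

sub start_element {
    my ($self, $element) = @_;
    $self->{seen}{ $element->{Name} }++;
    return;
}

# The value returned here is what the caller of the parse gets back.
sub end_document {
    my ($self, $document) = @_;
    return { %{ $self->{seen} } };
}

package main;

# Stand-in for a parser's parse method (hypothetical): replays a fixed
# event stream and returns the result of end_document().
sub fake_parse {
    my ($handler, @names) = @_;
    $handler->start_document({});
    for my $name (@names) {
        my %el = (Name => $name, LocalName => $name,
                  Prefix => '', NamespaceURI => '');
        $handler->start_element({ %el, Attributes => {} });
        $handler->end_element({ %el }) if $handler->can('end_element');
    }
    return $handler->end_document({});
}

my $tally = fake_parse(ElementTally->new, qw(p p em));
print join(', ', map { "$_=$tally->{$_}" } sort keys %$tally), "\n";
```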

start_element(element)
Receive notification of the start of an element.

The SAX parser will invoke this method at the beginning of every element in the XML document; there will be a corresponding end_element() event for every start_element() event (even when the element is empty). All of the element's content will be reported, in order, before the corresponding end_element() event.

element is a hash reference with these properties:

Name           The element type name (including prefix).
Attributes     The attributes attached to the element, if any.
NamespaceURI   The namespace of this element.
Prefix         The namespace prefix used on this element.
LocalName      The local name of this element.

Attributes is a reference to a hash keyed by JClark namespace notation. That is, the keys are of the form "{NamespaceURI}LocalName". If the attribute has no NamespaceURI, the key is simply "{}LocalName". Each attribute is a hash reference with these properties:

Name           The attribute name (including prefix).
Value          The normalized value of the attribute.
NamespaceURI   The namespace of this attribute.
Prefix         The namespace prefix used on this attribute.
LocalName      The local name of this attribute.

Turning namespace processing off changes the values in the element and attribute hashes. See Namespace Processing for details.
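To make the attribute key format concrete, here is a small sketch that looks up attribute values in an element hash of the shape described above. The attr_value() helper, the ex prefix and the http://example.com/ns namespace are hypothetical; the sketch assumes namespace processing is on and that an attribute without a namespace carries empty strings for NamespaceURI and Prefix.

```perl
use strict;
use warnings;

# Hand-built element hash of the shape passed to start_element().
# Assumption: attributes in no namespace use empty strings.
my $element = {
    Name         => 'ex:item',
    LocalName    => 'item',
    Prefix       => 'ex',
    NamespaceURI => 'http://example.com/ns',
    Attributes   => {
        '{}id' => {
            Name         => 'id',
            Value        => '42',
            NamespaceURI => '',
            Prefix       => '',
            LocalName    => 'id',
        },
        '{http://example.com/ns}lang' => {
            Name         => 'ex:lang',
            Value        => 'en',
            NamespaceURI => 'http://example.com/ns',
            Prefix       => 'ex',
            LocalName    => 'lang',
        },
    },
};

# Hypothetical helper: build the "{NamespaceURI}LocalName" key and
# return the attribute's Value, or nothing if it is absent.
sub attr_value {
    my ($el, $ns, $local) = @_;
    my $key  = '{' . (defined $ns ? $ns : '') . '}' . $local;
    my $attr = $el->{Attributes}{$key} or return;
    return $attr->{Value};
}

print "id:   ", attr_value($element, undef, 'id'), "\n";
print "lang: ", attr_value($element, 'http://example.com/ns', 'lang'), "\n";
```

Because the key includes the namespace URI rather than the prefix, the lookup keeps working when a document binds the same namespace to a different prefix.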

end_element(element)
Receive notification of the end of an element.

The SAX parser will invoke this method at the end of every element in the XML document; there will be a corresponding start_element() event for every end_element() event (even when the element is empty).

element is a hash reference with these properties:

Name           The element type name (including prefix).

If namespace processing is turned on (which is the default), these properties are also available:

NamespaceURI   The namespace of this element.
Prefix         The namespace prefix used on this element.
LocalName      The local name of this element.

characters(characters)
Receive notification of character data.

The SAX parser will call this method to report each chunk of character data. SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks; however, all of the characters in any single event must come from the same external entity, so that the Locator provides useful information.

characters is a hash reference with this property:

Data           The characters from the XML document.
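Since a parser may split contiguous text into several characters() events, a handler that needs the complete text of an element has to join the chunks itself. The following is a sketch of one way to do that, using a stack of buffers; the TextCollector package is a hypothetical name, and the events are issued by hand in place of a parser.

```perl
use strict;
use warnings;

# Hypothetical handler: joins character chunks per element.
package TextCollector;

sub new {
    my ($class) = @_;
    return bless { stack => [], text => {} }, $class;
}

sub start_element {
    my ($self, $element) = @_;
    push @{ $self->{stack} }, '';          # fresh buffer for this element
    return;
}

sub characters {
    my ($self, $characters) = @_;
    return unless @{ $self->{stack} };     # no element is open
    $self->{stack}[-1] .= $characters->{Data};
    return;
}

sub end_element {
    my ($self, $element) = @_;
    # Store the joined text under the element name.
    $self->{text}{ $element->{Name} } = pop @{ $self->{stack} };
    return;
}

package main;

my $collector = TextCollector->new;
my %title = (Name => 'title', LocalName => 'title',
             Prefix => '', NamespaceURI => '');

# Simulate a parser that reports one text node in three chunks.
$collector->start_element({ %title, Attributes => {} });
$collector->characters({ Data => $_ }) for 'Hello', ', ', 'world';
$collector->end_element(\%title);

print "title: $collector->{text}{title}\n";
```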

ignorable_whitespace(characters)
Receive notification of ignorable whitespace in element content.

Validating parsers must use this method to report each chunk of ignorable whitespace (see the W3C XML 1.0 recommendation, section 2.10); non-validating parsers may also use this method if they are capable of parsing and using content models.

SAX parsers may return all contiguous whitespace in a single chunk, or they may split it into several chunks; however, all of the characters in any single event must come from the same external entity, so that the Locator provides useful information.

characters is a hash reference with this property:

Data           The whitespace characters from the XML document.

Namespace Processing

All Perl SAX 2.1 parsers must support namespace processing. This is signaled by having the feature "http://xml.org/sax/features/namespaces" enabled by default. Perl SAX 2.1 parsers may allow users to turn this feature off explicitly, but they are not required to provide this option.

If namespace processing is turned off, element and attribute hashes (see start_element) still contain all keys, but the values of NamespaceURI, Prefix and LocalName are undef. Attribute keys are prefixed with {} (for example, {}pfx:lname). Namespace declarations are treated as ordinary attributes, and the start_prefix_mapping and end_prefix_mapping methods are never called.

Exceptions

Conformant XML parsers are required to abort processing when well-formedness or validation errors occur. In Perl, SAX parsers use die() to signal these errors. To catch these errors and prevent them from killing your program, use eval{}:

    eval { $parser->parse_uri($uri) };
    if ($@) {
        # handle error
    }

Exceptions can also be thrown when setting features or properties on the SAX parser (see advanced SAX below).

Exception values ($@) in SAX are hash references blessed into the package that defines their type, and have the following properties:

Message        A detail message for this exception.
Exception      The embedded exception, or undef if there is none.

If the exception is raised due to parse errors, these properties are also available:

ColumnNumber   The column number of the end of the text where the exception occurred.
LineNumber     The line number of the end of the text where the exception occurred.
PublicId       The public identifier of the entity where the exception occurred.
SystemId       The system identifier of the entity where the exception occurred.
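The sketch below shows one way to inspect such an exception after eval{}. The My::ParseError package, its throw() method and the failing_parse() function are hypothetical stand-ins for a real parser and its exception class; only the property names come from the list above.

```perl
use strict;
use warnings;

# Hypothetical exception class: a blessed hash with the SAX properties.
package My::ParseError;

sub throw {
    my ($class, %args) = @_;
    my $self = { Exception => undef, %args };
    die bless $self, $class;
}

package main;

use Scalar::Util qw(blessed);

# Stand-in for a parse call that hits a well-formedness error.
sub failing_parse {
    My::ParseError->throw(
        Message      => 'mismatched tag',
        LineNumber   => 3,
        ColumnNumber => 7,
        PublicId     => undef,
        SystemId     => 'example.xml',
    );
}

my $report;
eval { failing_parse(); 1 } or do {
    my $err = $@;    # copy at once; later code may overwrite $@
    if (blessed($err) && $err->isa('My::ParseError')) {
        $report = sprintf '%s at line %d, column %d of %s',
            @{$err}{qw(Message LineNumber ColumnNumber SystemId)};
    }
    else {
        $report = "unexpected error: $err";
    }
};

print "$report\n";
```

Checking the class with blessed() and isa() before reading the properties keeps the handler safe when something else inside the eval dies with a plain string.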


Advanced SAX