Perl SAX (Simple API for XML) is a common parser interface for XML parsers. It allows application writers to write applications that use XML parsers, but are independent of which parser is actually used.
This document describes the version of SAX used by Perl modules. The original version of SAX 2.0, for Java, is described at http://sax.sourceforge.net/.
There are two basic interfaces in the Perl version of SAX, the parser interface and the handler interface. The parser interface creates new parser instances, starts parsing, and provides additional information to handlers on request. The handler interface is used to receive parse events from the parser. This pattern is also commonly called "Producer and Consumer" or "Generator and Sink".
All handler methods take a single argument: a hash reference. Hash values are Unicode strings (scalars with the UTF-8 flag on).
Note that the parser doesn't have to be an XML parser; all it needs to do is provide a stream of events to the handler as if it were parsing XML. The actual data from which the events are generated can be anything: a Perl object, a CSV file, a database table, and so on.
SAX is typically used like this:
    my $handler = MyHandler->new();
    my $parser = AnySAXParser->new( Handler => $handler );
    $parser->parse_uri($uri);
Handlers are typically written like this:
    package MyHandler;

    sub new {
        my $type = shift;
        return bless {}, $type;
    }

    sub start_element {
        my ($self, $element) = @_;
        print "Starting element $element->{Name}\n";
    }

    sub end_element {
        my ($self, $element) = @_;
        print "Ending element $element->{Name}\n";
    }

    sub characters {
        my ($self, $characters) = @_;
        print "characters: $characters->{Data}\n";
    }

    1;
These methods and options are the most commonly used with SAX parsers and event generators.
Applications may not invoke a parse() method again while a parse is in progress (they should create a new SAX parser instead for each nested XML document). Once a parse is complete, an application may reuse the same parser object, possibly with a different input source.
During the parse, the parser will provide information about the XML document through the registered event handlers. Note that an event that hasn't been registered (i.e., one that has no corresponding method in the handler's class) will not be delivered. This lets a handler receive only the events it is interested in.
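The selective delivery described above can be sketched as follows. This is a minimal illustration, not how any particular parser is implemented: the ElementCounter class and the send_event() helper are hypothetical names, and the use of can() to detect implemented events is an assumption about how a generator might do it. Empty strings stand in for "no namespace" in the element hash.

```perl
use strict;
use warnings;

# Hypothetical handler that implements start_element() and nothing else.
package ElementCounter;

sub new { my $class = shift; return bless { count => 0 }, $class; }

sub start_element {
    my ($self, $element) = @_;
    $self->{count}++;
    return;
}

package main;

# Hypothetical helper: deliver an event only if the handler has a method for it.
sub send_event {
    my ($handler, $event, $data) = @_;
    my $method = $handler->can($event) or return;   # unimplemented event: skip
    return $handler->$method($data);
}

my $counter = ElementCounter->new;
my $para    = { Name => 'para', NamespaceURI => '', Prefix => '', LocalName => 'para' };

send_event($counter, 'start_document', {});
send_event($counter, 'start_element',  { %$para, Attributes => {} });
send_event($counter, 'characters',     { Data => 'hello' });
send_event($counter, 'end_element',    $para);
send_event($counter, 'end_document',   {});

print "elements seen: $counter->{count}\n";
```

Only start_element() reaches the handler; the other four events are silently skipped because ElementCounter defines no methods for them.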
If you generate SAX events, you must pass data to handler methods with all the properties defined in this document, unless otherwise specified.
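As a rough sketch of an event generator that is not an XML parser, the following turns a list of Perl hashes into SAX events and fills in every property defined for each event. RowGenerator, RecordingHandler, the generate() method and the element names are all hypothetical; using empty strings for "no namespace" is likewise an assumption of this example.

```perl
use strict;
use warnings;

# Hypothetical handler that records a short description of each event.
package RecordingHandler;

sub new { my $class = shift; return bless { events => [] }, $class; }

sub start_document { my ($self, $doc) = @_; push @{ $self->{events} }, 'start_document'; return; }
sub start_element  { my ($self, $el)  = @_; push @{ $self->{events} }, "start:$el->{Name}"; return; }
sub characters     { my ($self, $ch)  = @_; push @{ $self->{events} }, "chars:$ch->{Data}"; return; }
sub end_element    { my ($self, $el)  = @_; push @{ $self->{events} }, "end:$el->{Name}"; return; }
sub end_document   { my ($self, $doc) = @_; return scalar @{ $self->{events} }; }

# Hypothetical generator: walks an array of hashes instead of parsing XML.
package RowGenerator;

sub new {
    my ($class, %args) = @_;
    return bless { Handler => $args{Handler} }, $class;
}

# Data for end_element(): Name, NamespaceURI, Prefix, LocalName.
sub _end_data {
    my ($name) = @_;
    return { Name => $name, NamespaceURI => '', Prefix => '', LocalName => $name };
}

# Data for start_element(): the same properties plus Attributes.
sub _start_data {
    my ($name) = @_;
    my $data = _end_data($name);
    $data->{Attributes} = {};
    return $data;
}

sub generate {
    my ($self, $rows) = @_;
    my $h = $self->{Handler};

    $h->start_document({});
    $h->start_element(_start_data('rows'));
    for my $row (@$rows) {
        $h->start_element(_start_data('row'));
        for my $field (sort keys %$row) {
            $h->start_element(_start_data($field));
            $h->characters({ Data => $row->{$field} });
            $h->end_element(_end_data($field));
        }
        $h->end_element(_end_data('row'));
    }
    $h->end_element(_end_data('rows'));

    # Hand back whatever end_document() returns, as a parse() method would.
    return $h->end_document({});
}

package main;

my $handler   = RecordingHandler->new;
my $generator = RowGenerator->new(Handler => $handler);
my $count     = $generator->generate([ { id => 1, name => 'Ada' } ]);

print "$_\n" for @{ $handler->{events} };
print "events recorded: $count\n";
```

The handler cannot tell that the events came from a Perl data structure rather than an XML document, which is the point of the interface.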
These methods are the most commonly used by SAX handlers.
The SAX parser will invoke start_document() only once, before any other methods (except for set_document_locator() in advanced SAX handlers). No properties are defined for this event (document is an empty hash reference).
The SAX parser will invoke end_document() only once, and it will be the last method invoked during the parse. The parser shall not invoke this method until it has either abandoned parsing (because of an unrecoverable error) or reached the end of input.
No properties are defined for this event (document is an empty hash reference).
The return value of end_document() is returned by the parser's parse() methods.
The parser will invoke start_element() at the beginning of every element in the XML document; there will be a corresponding end_element() event for every start_element() event (even when the element is empty). All of the element's content will be reported, in order, before the corresponding end_element() event.
element is a hash reference with these properties:
    Name          The element type name (including prefix).
    Attributes    The attributes attached to the element, if any.
    NamespaceURI  The namespace of this element.
    Prefix        The namespace prefix used on this element.
    LocalName     The local name of this element.
Attributes is a reference to a hash keyed by JClark namespace notation. That is, the keys are of the form "{NamespaceURI}LocalName". If the attribute has no NamespaceURI, then the key is simply "{}LocalName". Each attribute is a hash reference with these properties:
    Name          The attribute name (including prefix).
    Value         The normalized value of the attribute.
    NamespaceURI  The namespace of this attribute.
    Prefix        The namespace prefix used on this attribute.
    LocalName     The local name of this attribute.
The values of element and attribute hashes can be changed by turning the namespace processing off. See Namespace Processing for details.
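To make the JClark key format concrete, here is a small sketch that builds an element hash by hand and looks attributes up by namespace and local name. The jclark_key() and split_jclark_key() helpers are hypothetical, the data is illustrative rather than the output of a real parser, and empty strings are assumed for "no namespace".

```perl
use strict;
use warnings;

# Hypothetical helper: build a "{NamespaceURI}LocalName" key.
sub jclark_key {
    my ($uri, $local) = @_;
    return '{' . $uri . '}' . $local;
}

# Hypothetical helper: split a key back into (NamespaceURI, LocalName).
sub split_jclark_key {
    my ($key) = @_;
    return $key =~ /^\{([^}]*)\}(.*)$/s;
}

my $xlink = 'http://www.w3.org/1999/xlink';

# Hand-written element data in the shape start_element() receives.
my $element = {
    Name         => 'doc',
    NamespaceURI => '',
    Prefix       => '',
    LocalName    => 'doc',
    Attributes   => {
        jclark_key('', 'id') => {
            Name => 'id', Value => 'a1',
            NamespaceURI => '', Prefix => '', LocalName => 'id',
        },
        jclark_key($xlink, 'href') => {
            Name => 'xl:href', Value => 'next.xml',
            NamespaceURI => $xlink, Prefix => 'xl', LocalName => 'href',
        },
    },
};

# Look up by namespace and local name, independent of the prefix used.
my $href = $element->{Attributes}{ jclark_key($xlink, 'href') }{Value};
my $id   = $element->{Attributes}{ jclark_key('', 'id') }{Value};
print "href=$href id=$id\n";

for my $key (sort keys %{ $element->{Attributes} }) {
    my ($uri, $local) = split_jclark_key($key);
    print "key $key => namespace '$uri', local name '$local'\n";
}
```

Keying on the namespace URI rather than the prefix means a lookup keeps working when a document binds the same namespace to a different prefix.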
The SAX parser will invoke end_element() at the end of every element in the XML document; there will be a corresponding start_element() event for every end_element() event (even when the element is empty).
element is a hash reference with these properties:
    Name          The element type name (including prefix).
If namespace processing is turned on (which is the default), these properties are also available:
    NamespaceURI  The namespace of this element.
    Prefix        The namespace prefix used on this element.
    LocalName     The local name of this element.
The parser will call characters() to report each chunk of character data. SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks (however, all of the characters in any single event must come from the same external entity so that the Locator provides useful information).
characters is a hash reference with this property:
Data The characters from the XML document.
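Because character data may arrive in several chunks, a handler that needs the complete text of an element should accumulate Data across characters() events rather than assume one call per text node. The sketch below shows one way to do that; TextCollector is a hypothetical class, the events are fed in by hand, and the approach as written only suits elements that contain text and no child elements.

```perl
use strict;
use warnings;

# Hypothetical handler that joins character chunks per element.
package TextCollector;

sub new { my $class = shift; return bless { buffer => '', texts => [] }, $class; }

sub start_element {
    my ($self, $element) = @_;
    $self->{buffer} = '';                          # start a fresh buffer
    return;
}

sub characters {
    my ($self, $characters) = @_;
    $self->{buffer} .= $characters->{Data};        # append, never overwrite
    return;
}

sub end_element {
    my ($self, $element) = @_;
    push @{ $self->{texts} }, "$element->{Name}=$self->{buffer}";
    $self->{buffer} = '';
    return;
}

package main;

my $collector = TextCollector->new;
my $title = { Name => 'title', NamespaceURI => '', Prefix => '', LocalName => 'title' };

# Simulate a parser that splits one text node into three chunks.
$collector->start_element({ %$title, Attributes => {} });
$collector->characters({ Data => 'Perl ' });
$collector->characters({ Data => '& ' });
$collector->characters({ Data => 'SAX' });
$collector->end_element($title);

print "$_\n" for @{ $collector->{texts} };
```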
Validating parsers must use ignorable_whitespace() to report each chunk of ignorable whitespace (see the W3C XML 1.0 recommendation, section 2.10); non-validating parsers may also use this method if they are capable of parsing and using content models.
SAX parsers may return all contiguous whitespace in a single chunk, or they may split it into several chunks; however, all of the characters in any single event must come from the same external entity, so that the Locator provides useful information.
characters is a hash reference with this property:
Data The whitespace characters from the XML document.
All Perl SAX 2.1 parsers must support namespace processing. This is signaled by having the feature "http://xml.org/sax/features/namespaces" set on by default. Perl SAX 2.1 parsers may allow users to turn this feature off explicitly. However, the parsers are not required to provide this option.
If the namespace processing is turned off, element and attribute hashes (see start_element) still contain all keys. Values of NamespaceURI, Prefix and LocalName are undef. Attributes keys are prefixed with {} (for example {}pfx:lname). Namespace declarations are treated as common attributes. start_prefix_mapping and end_prefix_mapping methods are never called.
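The difference between the two modes can be illustrated with a small helper that builds one entry of the Attributes hash either way. This is only a sketch of the rules stated above: attribute_entry() and its named arguments are hypothetical, the namespace URI is a made-up example, and the use of an empty string for a missing namespace in namespace-aware mode is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: return (key, attribute hash) for one attribute.
# Named arguments: QName, Value, NamespaceURI, namespaces (true or false).
sub attribute_entry {
    my (%arg) = @_;
    my ($prefix, $local) = $arg{QName} =~ /^(?:([^:]+):)?(.+)$/;

    if ($arg{namespaces}) {
        my $uri = defined $arg{NamespaceURI} ? $arg{NamespaceURI} : '';
        return (
            '{' . $uri . '}' . $local,
            {
                Name         => $arg{QName},
                Value        => $arg{Value},
                NamespaceURI => $uri,
                Prefix       => defined $prefix ? $prefix : '',
                LocalName    => $local,
            },
        );
    }

    # Namespace processing off: all keys present, namespace values undef,
    # and the hash key is "{}" followed by the full prefixed name.
    return (
        '{}' . $arg{QName},
        {
            Name         => $arg{QName},
            Value        => $arg{Value},
            NamespaceURI => undef,
            Prefix       => undef,
            LocalName    => undef,
        },
    );
}

my %common = (QName => 'pfx:lname', Value => 'v', NamespaceURI => 'urn:example');

my ($on_key,  $on_attr)  = attribute_entry(%common, namespaces => 1);
my ($off_key, $off_attr) = attribute_entry(%common, namespaces => 0);

print "namespaces on:  $on_key\n";
print "namespaces off: $off_key\n";
```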
Conformant XML parsers are required to abort processing when well-formedness or validation errors occur. In Perl, SAX parsers use die() to signal these errors. To catch these errors and prevent them from killing your program, use eval{}:
    eval { $parser->parse($uri) };
    if ($@) {
        # handle error
    }
Exceptions can also be thrown when setting features or properties on the SAX parser (see advanced SAX below).
Exception values ($@) in SAX are hash references blessed into the package that defines their type, and have the following properties:
    Message       A detail message for this exception.
    Exception     The embedded exception, or undef if there is none.
If the exception is raised due to parse errors, these properties are also available:
    ColumnNumber  The column number of the end of the text where the exception occurred.
    LineNumber    The line number of the end of the text where the exception occurred.
    PublicId      The public identifier of the entity where the exception occurred.
    SystemId      The system identifier of the entity where the exception occurred.
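The following sketch shows how such an exception might be raised and inspected. My::ParseException and fake_parse() are hypothetical stand-ins (real parsers define their own exception packages), and the message, position and system identifier are invented for illustration.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Hypothetical exception class: a blessed hash carrying the properties above.
package My::ParseException;

sub new {
    my ($class, %props) = @_;
    return bless { %props }, $class;
}

package main;

# Stand-in for $parser->parse(...): signals a parse error with die().
sub fake_parse {
    die My::ParseException->new(
        Message      => 'mismatched tag',
        Exception    => undef,
        ColumnNumber => 7,
        LineNumber   => 3,
        PublicId     => undef,
        SystemId     => 'file:///tmp/example.xml',   # illustrative value
    );
}

my $report;
eval { fake_parse(); 1 } or do {
    my $err = $@;    # copy at once, before anything can overwrite $@
    if (blessed($err) && $err->isa('My::ParseException')) {
        $report = "$err->{Message} at line $err->{LineNumber}, "
                . "column $err->{ColumnNumber}";
    }
    else {
        $report = "unexpected error: $err";
    }
};

print "$report\n";
```

Checking the class of the exception before reading its properties keeps plain string errors from other code from being mistaken for SAX exceptions.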