Advanced SAX

The classes, methods, and features described below are not needed by most applications and can be ignored by most users. If, however, you find that Basic SAX does not give you the granularity you expect, this is the place to look for more. Advanced SAX is not advanced in the sense of being harder or requiring better programming skills. It is simply more complete, and has been separated out to keep Basic SAX simple in terms of the number of events one has to deal with.

SAX Parsers

SAX supports several classes of event handlers: content handlers, declaration handlers, DTD handlers, error handlers, entity resolvers, and other extensions. For each class of events, a separate handler can be used to handle those events. If a handler is not defined for a class of events, then the default handler, Handler, is used. Each of these handlers is described in the sections below. Applications may change an event handler in the middle of the parse and the SAX parser will begin using the new handler immediately.

SAX's basic interface defines methods for parsing system identifiers (URIs), open files, and strings. Behind the scenes, though, SAX uses a Source hash reference that contains that information, plus encoding, system and public identifiers if available. These are described below under the Source option.

SAX parsers accept all features as options to the parse() methods and on the parser's constructor. Features are described in the next section.

parse(options)
Parses the XML instance identified by the Source option. options can be a list of option, value pairs (a hash) or a hash reference. parse() returns the result of calling the end_document() handler.

ContentHandler
Object to receive document content events. The ContentHandler, with additional events defined below, is the class of events described in Basic SAX Handler. If the application does not register a content handler or content event handlers on the default handler, content events reported by the SAX parser will be silently ignored.

DTDHandler
Object to receive basic DTD events. If the application does not register a DTD handler or DTD event handlers on the default handler, DTD events reported by the SAX parser will be silently ignored.

EntityResolver
Object to resolve external entities. If the application does not register an entity resolver or entity events on the default handler, the SAX parser will perform its own default resolution.

ErrorHandler
Object to receive error-message events. If the application does not register an error handler or error event handlers on the default handler, all error events reported by the SAX parser will be silently ignored; however, normal processing may not continue. It is highly recommended that all SAX applications implement an error handler to avoid unexpected bugs.

LexicalHandler
Object to receive lexical events. If the application does not register a lexical handler or lexical event handlers on the default handler, lexical events reported by the SAX parser will be silently ignored.

DeclHandler
Object to receive information about DTD declarations. If the application does not register a declaration handler or declaration event handlers on the default handler, declaration events reported by the SAX parser will be silently ignored.

Source
A hash reference containing information about the XML instance to be parsed. See Input Sources below. Note that Source cannot be changed during the parse.

Features
A hash reference containing Feature information, as described below. Features can be set at runtime, but not reliably by writing to the Features hash directly: doing so gives the parser no chance to inspect the new value, so it cannot report errors or reject Features that it does not support. Use the set_feature() method instead.
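
The option handling described above can be sketched with a toy driver. This is only an illustration under stated assumptions: ToyDriver and CountingHandler are hypothetical names, no XML is actually parsed, and the driver simply replays a fixed sequence of events to show that options may arrive as a list or a hash reference, that parse() options are merged with constructor options, and that parse() returns whatever end_document() returns.

```perl
use strict;
use warnings;

# Hypothetical content handler: counts elements and returns the count
# from end_document(), which parse() hands back to the caller.
{
    package CountingHandler;
    sub new            { return bless { count => 0 }, shift }
    sub start_document { }
    sub start_element  { $_[0]{count}++ }
    sub end_element    { }
    sub end_document   { return $_[0]{count} }
}

# Hypothetical driver: it does not parse XML, it only replays a fixed
# event sequence to show how options are merged and events dispatched.
{
    package ToyDriver;
    sub new {
        my $class = shift;
        my %opts  = @_ == 1 ? %{ $_[0] } : @_;    # hash reference or list
        return bless {%opts}, $class;
    }
    sub parse {
        my $self = shift;
        my %opts = @_ == 1 ? %{ $_[0] } : @_;
        my %run  = (%$self, %opts);               # parse() options win
        my $h    = $run{ContentHandler} || $run{Handler}
            or return;                            # no handler: events ignored
        $h->start_document({});
        $h->start_element({ Name => $_ }) for qw(doc para);
        $h->end_element({ Name => $_ })   for qw(para doc);
        return $h->end_document({});
    }
}

my $driver    = ToyDriver->new(ContentHandler => CountingHandler->new);
my $from_list = $driver->parse(Source => { String => '<doc><para/></doc>' });
my $from_hash = $driver->parse({ ContentHandler => CountingHandler->new });
print "list form: $from_list, hash form: $from_hash\n";
```

A real driver would read the Source option and generate events from the document; the merging and dispatch logic is the only part this sketch tries to show.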

Features

Features are as defined in SAX2: Features and Properties, but not of course limited to those. You may add your own Features. Also, Java has an artificial distinction between Features and Properties which is unnecessary. In Perl, both have been merged under the same name.

Features can be passed as options when creating a parser or calling a parse() method. They may also be set using the set_feature() method.

    # Pass Features to the constructor...
    $parser = AnySAXParser->new(
        Features => {
            'http://xml.org/sax/features/namespaces' => 0,
        },
    );
    # ...or to a parse() call...
    $parser->parse(
        Features => {
            'http://xml.org/sax/features/namespaces' => 0,
        },
    );
    # ...or set and query them at runtime.
    $parser->set_feature('http://xml.org/sax/properties/xml-string', 1);
    $string = $parser->get_feature('http://xml.org/sax/properties/xml-string');

Perl SAX supports Features described in the SAX spec as well as new ones defined for Perl. The features taken over from Java SAX have URLs starting with http://xml.org/sax/features/ and Perl-specific features start with http://xmlns.perl.org/sax/. Note that features are things that are supposed to be turned on, and thus should normally be off by default, especially if the parser doesn't support turning them off. There may be, for good reasons, exceptions to this rule, for example the http://xml.org/sax/features/namespaces feature.

These are some features Perl SAX drivers may support:

http://xml.org/sax/features/namespaces
A value of 1 indicates namespace URIs and unprefixed local names for element and attribute names will be available. This feature is on by default and a number of parsers may not be able to turn it off. Thus, a parser claiming to support this feature (and all SAX2 parsers must support it) may in fact only support turning it on. See Namespace Processing for more details.
http://xml.org/sax/features/namespace-prefixes
This feature is ignored as Perl SAX parsers always provide both the raw tag name in Name and the namespace names in NamespaceURI, LocalName, and Prefix.
http://xmlns.perl.org/sax/xmlns-uris
This feature controls how the parser treats namespace declaration attributes. When set on, xmlns and xmlns:* attributes are put into namespaces in the traditional Perl SAX way: xmlns attributes are in no namespace while xmlns:* attributes are in the http://www.w3.org/2000/xmlns/ namespace. This feature is set to 1 by default.
http://xml.org/sax/features/xmlns-uris
This feature applies if and only if the http://xmlns.perl.org/sax/xmlns-uris feature is off. Then, it controls whether the parser treats all namespace declaration attributes as being in the http://www.w3.org/2000/xmlns/ namespace. By default, Perl SAX conforms to the original "Namespaces in XML" Recommendation, which explicitly states that such attributes are not in any namespace. Setting this optional flag to 1 makes the Perl SAX events conform to a later backwards-incompatible revision of that recommendation, placing those attributes in a namespace.
http://xmlns.perl.org/sax/version-2.1
This read-only feature is used to tell if a driver supports version 2.1 of the Perl SAX interface. Only drivers supporting the version of Perl SAX described in this document have this feature set on.
http://xmlns.perl.org/sax/lexicalHandler
Tells whether a driver supports the optional lexical handler. Parsers may not be able to set this feature.
http://xmlns.perl.org/sax/declHandler
Tells whether a driver supports the optional declaration handler. Parsers may not be able to set this feature.

The following methods are used to get and set features:

get_feature(name)
Look up the value of a feature.

The feature name is any fully-qualified URI. It is possible for a SAX parser to recognize a feature name but to be unable to return its value; this is especially true in the case of an adapter for a SAX1 Parser, which has no way of knowing whether the underlying parser is validating, for example.

Some feature values may be available only in specific contexts, such as before, during, or after a parse.

get_feature() returns the value of the feature, which is usually either a boolean or an object, and will throw XML::SAX::Exception::NotRecognized when the SAX parser does not recognize the feature name and XML::SAX::Exception::NotSupported when the SAX parser recognizes the feature name but cannot determine its value at this time.

set_feature(name, value)
Set the state of a feature.

The feature name is any fully-qualified URI. It is possible for a SAX parser to recognize a feature name but to be unable to set its value; this is especially true in the case of an adapter for a SAX1 Parser, which has no way of affecting whether the underlying parser is validating, for example.

Some feature values may be immutable or mutable only in specific contexts, such as before, during, or after a parse.

set_feature() will throw XML::SAX::Exception::NotRecognized when the SAX parser does not recognize the feature name and XML::SAX::Exception::NotSupported when the SAX parser recognizes the feature name but cannot set the requested value.

get_features()
Look up all Features that this parser claims to support.

This method returns a hash whose keys are the Features which the parser claims to support. The values of the hash are currently unspecified, though they may be given a meaning later. This method is meant to be inherited so that Features supported by the base parser class (XML::SAX::Base) are declared to be supported by subclasses.

Calling this method is probably only moderately useful to end users. It is mostly meant for use by XML::SAX, so that it can query parsers for Feature support and return an appropriate parser depending on the Features that are required.
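
The contract of these three methods can be sketched as follows. This is a minimal illustration, not how any particular driver is implemented: ToyParser and the two ToyException classes are hypothetical stand-ins (the real exception classes are the XML::SAX::Exception ones named above), and the parser is assumed to support only the namespaces feature and only in its "on" state.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the NotRecognized and NotSupported exceptions.
{ package ToyException::NotRecognized; sub new { bless { Message => $_[1] }, $_[0] } }
{ package ToyException::NotSupported;  sub new { bless { Message => $_[1] }, $_[0] } }

{
    package ToyParser;
    my $NS = 'http://xml.org/sax/features/namespaces';

    sub new {
        my ($class, %opts) = @_;
        my $self = bless { Features => { $NS => 1 } }, $class;
        # Route constructor Features through set_feature() so that
        # unsupported values are caught instead of silently stored.
        my $f = $opts{Features} || {};
        $self->set_feature($_, $f->{$_}) for sort keys %$f;
        return $self;
    }
    sub get_features { return ($NS => 1) }    # claimed Features; values unspecified
    sub get_feature {
        my ($self, $name) = @_;
        my %known = $self->get_features;
        die ToyException::NotRecognized->new($name) unless exists $known{$name};
        return $self->{Features}{$name};
    }
    sub set_feature {
        my ($self, $name, $value) = @_;
        my %known = $self->get_features;
        die ToyException::NotRecognized->new($name) unless exists $known{$name};
        # This toy parser can only keep namespace processing on.
        die ToyException::NotSupported->new($name) unless $value;
        $self->{Features}{$name} = $value;
        return;
    }
}

my $parser = ToyParser->new;
my $ns     = 'http://xml.org/sax/features/namespaces';
my @outcome;
for my $try ([$ns, 1], [$ns, 0], ['http://example.org/unknown', 1]) {
    my ($name, $value) = @$try;
    if (eval { $parser->set_feature($name, $value); 1 }) {
        push @outcome, 'set';
    }
    else {
        push @outcome, ref $@;    # class of the exception that was thrown
    }
}
print join(', ', @outcome), "\n";
```

Because the constructor goes through set_feature(), a caller who asks for an unsupported value finds out immediately rather than at parse time.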

Input Sources

Input sources may be provided to parser objects or are returned by entity resolvers. An input source is a hash reference with these properties:

PublicId
The public identifier of this input source.

The public identifier is always optional: if the application writer includes one, it will be provided as part of the location information.

SystemId
The system identifier (URI) of this input source.

The system identifier is optional if there is a byte stream, a character stream or a string, but it is still useful to provide one, since the application can use it to resolve relative URIs and can include it in error messages and warnings (the parser will attempt to open a connection to the URI only if there is no byte stream, character stream or string specified).

If the application knows the character encoding of the object pointed to by the system identifier, it can register the encoding using the Encoding property.

String
The character or byte string of this input source.

The SAX parser will ignore this if there is also a byte stream or a character stream specified, but it will use the string in preference to opening a URI connection itself.

If the UTF-8 flag of the string is turned on, the effect is as if the Encoding property is set to UTF-8.

ByteStream
The byte stream for this input source.

The SAX parser will ignore this if there is also a character stream specified, but it will use a byte stream in preference to using a string or opening a URI connection itself.

If the application knows the character encoding of the byte stream, it should set the Encoding property.

CharacterStream
The character stream for this input source.

If there is a character stream specified, the SAX parser will ignore any byte stream or string and will not attempt to open a URI connection to the system identifier.

Note: A CharacterStream is a file-handle that does not need any encoding translation done on it. This is implemented as a regular file-handle and only works under Perl 5.7.2 or higher using PerlIO. To get a single character, or a number of characters, from it, use the Perl core read() function. To get a single byte (or a number of bytes) from it, you can use sysread().

Encoding
The character encoding, if known.

The encoding must be a string acceptable for an XML encoding declaration (see section 4.3.3 of the XML 1.0 recommendation). If provided, the value of the Encoding property takes priority over the encoding specified in the XML declaration.

This property has no effect when the application provides a character stream.
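
The precedence rules spread across the properties above can be collected into one small routine. This is a sketch only: choose_input() and effective_encoding() are hypothetical helper names, not part of the Perl SAX interface, and the handling of the UTF-8 flag on strings is left out.

```perl
use strict;
use warnings;

# Return the name of the property a parser would read from, following
# the precedence described above, or undef if the source has no input.
sub choose_input {
    my ($source) = @_;
    for my $key (qw(CharacterStream ByteStream String SystemId)) {
        return $key if defined $source->{$key};
    }
    return;
}

# Encoding has no effect for character streams, which need no decoding.
sub effective_encoding {
    my ($source) = @_;
    return if (choose_input($source) // '') eq 'CharacterStream';
    return $source->{Encoding};
}

my $source = {
    SystemId => 'http://example.org/doc.xml',    # kept for error messages
    String   => '<doc/>',
    Encoding => 'ISO-8859-1',
};
my $chosen   = choose_input($source);
my $encoding = effective_encoding($source);
print "reads from $chosen as $encoding\n";
```

Here the string wins over the system identifier, so no URI connection would be opened, while the SystemId remains available for resolving relative URIs and for messages.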

SAX Handlers

SAX supports several classes of event handlers: content handlers, declaration handlers, DTD handlers, error handlers, entity resolvers, and other extensions. This section defines each of these classes of events.

Content Events

This is the main interface that most SAX applications implement: if the application needs to be informed of basic parsing events, it implements this interface and registers an instance with the SAX parser using the ContentHandler property. The parser uses the instance to report basic document-related events like the start and end of elements and character data.

The order of events in this interface is very important, and mirrors the order of information in the document itself. For example, all of an element's content (character data, processing instructions, and/or subelements) will appear, in order, between the start_element event and the corresponding end_element event.

set_document_locator(locator)
Receive an object for locating the origin of SAX document events.

A SAX parser is strongly encouraged (though not absolutely required) to supply a locator: if it does so, it must supply the locator to the application by invoking this method before invoking any of the other methods in the ContentHandler interface.

If possible, a Perl SAX driver should provide the line and column position of the last character of the text associated with the current document event. The first line is line 1; the first column in each line is column 1.

The locator allows the application to determine the end position of any document-related event, even if the parser is not reporting an error. Typically, the application will use this information for reporting its own errors (such as character content that does not match an application's business rules). The information provided by the locator is probably not sufficient for use with a search engine.

Note that the locator will provide correct information only during the invocation of the events in this interface. The application should not attempt to use it at any other time.

The locator is a hash reference with these properties:

ColumnNumber The column number of the end of the current event text.
LineNumber The line number of the end of the current event text.
PublicId The public identifier of the current entity.
SystemId The system identifier of the current entity.
Encoding The character encoding the parser uses to decode the current entity, or undef if the parser does not decode it (because the entity was provided as a character stream).
XMLVersion The XML version of the current entity.
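
A handler might use the locator to report its own errors roughly as follows. This sketch makes some assumptions: PositionReporter is a hypothetical handler, the "must be a number" check is an invented business rule, and the driver is assumed to keep updating the same locator hash it passed to set_document_locator(). The driver itself is simulated by hand.

```perl
use strict;
use warnings;

{
    package PositionReporter;
    sub new { return bless { messages => [] }, shift }

    # Keep the reference: the driver is assumed to update this same hash.
    sub set_document_locator {
        my ($self, $locator) = @_;
        $self->{locator} = $locator;
        return;
    }
    sub characters {
        my ($self, $chars) = @_;
        return if $chars->{Data} =~ /^\d+$/;    # hypothetical business rule
        my $loc = $self->{locator} or return;   # locators are optional
        push @{ $self->{messages} },
            sprintf '%s:%d:%d: expected a number, got "%s"',
            $loc->{SystemId}, $loc->{LineNumber}, $loc->{ColumnNumber},
            $chars->{Data};
        return;
    }
}

# Stand-in for a driver: supply the locator first, then update it
# before each event is delivered.
my $locator = { SystemId => 'order.xml', LineNumber => 1, ColumnNumber => 1 };
my $handler = PositionReporter->new;
$handler->set_document_locator($locator);

@$locator{qw(LineNumber ColumnNumber)} = (3, 17);
$handler->characters({ Data => '42' });
@$locator{qw(LineNumber ColumnNumber)} = (4, 20);
$handler->characters({ Data => 'many' });

print "$_\n" for @{ $handler->{messages} };
```

The handler reads the locator only while it is inside an event callback, which is the only time its contents are guaranteed to be meaningful.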

start_prefix_mapping(mapping)
Begin the scope of a prefix-URI Namespace mapping.

The information from this event is not necessary for normal Namespace processing: the SAX XML reader will automatically replace prefixes for element and attribute names when the "http://xml.org/sax/features/namespaces" feature is true (the default).

There are cases, however, when applications need to use prefixes in character data or in attribute values, where they cannot safely be expanded automatically; the start/end_prefix_mapping event supplies the information to the application to expand prefixes in those contexts itself, if necessary.

Note that start/end_prefix_mapping() events are not guaranteed to be properly nested relative to each other: all start_prefix_mapping() events will occur before the corresponding start_element() event, and all end_prefix_mapping() events will occur after the corresponding end_element() event, but their order is not guaranteed.

mapping is a hash reference with these properties:

Prefix The Namespace prefix being declared.
NamespaceURI The Namespace URI the prefix is mapped to.

end_prefix_mapping(mapping)
End the scope of a prefix-URI mapping.

See start_prefix_mapping() for details. This event will always occur after the corresponding end_element event, but the order of end_prefix_mapping events is not otherwise guaranteed.

mapping is a hash reference with this property:

Prefix The Namespace prefix that was being mapped.
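
As an illustration of expanding prefixes in attribute values, the sketch below keeps a stack of URIs per prefix and uses it to expand a qualified name found in a type attribute. QNameExpander is a hypothetical handler, the type attribute is invented for the example, and the Attributes hash is assumed to be keyed by "{NamespaceURI}LocalName" as in the Basic SAX element events. The events are fired by hand rather than by a parser.

```perl
use strict;
use warnings;

{
    package QNameExpander;
    sub new { return bless { map => {}, types => [] }, shift }

    # Each prefix keeps a stack of URIs so that inner declarations can
    # shadow outer ones and be undone again.
    sub start_prefix_mapping {
        my ($self, $m) = @_;
        push @{ $self->{map}{ $m->{Prefix} } }, $m->{NamespaceURI};
        return;
    }
    sub end_prefix_mapping {
        my ($self, $m) = @_;
        pop @{ $self->{map}{ $m->{Prefix} } };
        return;
    }
    # Expand a "prefix:local" string into "{uri}local".
    sub expand {
        my ($self, $qname) = @_;
        my ($prefix, $local) = $qname =~ /^(?:([^:]+):)?(.+)$/;
        my $stack = $self->{map}{ $prefix // '' } || [];
        my $uri   = @$stack ? $stack->[-1] : '';
        return "{$uri}$local";
    }
    sub start_element {
        my ($self, $el) = @_;
        my $type = $el->{Attributes}{'{}type'} or return;
        push @{ $self->{types} }, $self->expand($type->{Value});
        return;
    }
}

my $h = QNameExpander->new;
$h->start_prefix_mapping({
    Prefix       => 'xs',
    NamespaceURI => 'http://www.w3.org/2001/XMLSchema',
});
$h->start_element({
    Name       => 'item',
    Attributes => { '{}type' => { Name => 'type', Value => 'xs:string' } },
});
$h->end_prefix_mapping({ Prefix => 'xs' });
print "$_\n" for @{ $h->{types} };
```

Since the mapping events bracket the element events, the prefix is in scope when start_element() runs and out of scope again once end_prefix_mapping() has been seen.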

processing_instruction(pi)
Receive notification of a processing instruction.

The Parser will invoke this method once for each processing instruction found: note that processing instructions may occur before or after the main document element.

A SAX parser should never report an XML declaration (XML 1.0, section 2.8) or a text declaration (XML 1.0, section 4.3.1) using this method.

pi is a hash reference with these properties:

Target The processing instruction target.
Data The processing instruction data, or undef if none was supplied.

skipped_entity(entity)
Receive notification of a skipped entity.

The Parser will invoke this method once for each entity skipped. Non-validating processors may skip entities if they have not seen the declarations (because, for example, the entity was declared in an external DTD subset). All processors may skip external entities, depending on the values of the "http://xml.org/sax/features/external-general-entities" and the "http://xml.org/sax/features/external-parameter-entities" Features.

entity is a hash reference with these properties:

Name The name of the skipped entity. If it is a parameter entity, the name will begin with '%'.

Declaration Events

This is an optional extension handler for SAX2 to provide information about DTD declarations in an XML document. XML readers are not required to support this handler. Readers supporting this handler have the http://xml.org/sax/handlers/DeclHandler feature set to 1. Otherwise, this feature has value 0.

Note that data-related DTD declarations (unparsed entities and notations) are already reported through the DTDHandler interface.

If you are using the declaration handler together with a lexical handler, all of the events will occur between the start_dtd and the end_dtd events.

To set a DeclHandler for an XML reader, use the DeclHandler option of the reader constructor or parse()/parse_x() method. If the reader does not support declaration events, it ignores the DeclHandler option.

element_decl(element)
Report an element type declaration.

The content model will consist of the string "EMPTY", the string "ANY", or a parenthesised group, optionally followed by an occurrence indicator. The model will be normalized so that all whitespace is removed, and will include the enclosing parentheses.

element is a hash reference with these properties:

Name The element type name.
Model The content model as a normalized string.

attribute_decl(attribute)
Report an attribute type declaration.

Only the effective (first) declaration for an attribute will be reported. The type will be one of the strings "CDATA", "ID", "IDREF", "IDREFS", "NMTOKEN", "NMTOKENS", "ENTITY", "ENTITIES", or "NOTATION", or a parenthesized token group with the separator "|" and all whitespace removed.

attribute is a hash reference with these properties:

eName The name of the associated element.
aName The name of the attribute.
Type A string representing the attribute type.
Mode A string representing the attribute defaulting mode ("#IMPLIED", "#REQUIRED", or "#FIXED") or undef if none of these applies.
Value A string representing the attribute's default value, or undef if there is none.

internal_entity_decl(entity)
Report an internal entity declaration.

Only the effective (first) declaration for each entity will be reported.

entity is a hash reference with these properties:

Name The name of the entity. If it is a parameter entity, the name will begin with '%'.
Value The replacement text of the entity.

external_entity_decl(entity)
Report a parsed external entity declaration.

Only the effective (first) declaration for each entity will be reported.

entity is a hash reference with these properties:

Name The name of the entity. If it is a parameter entity, the name will begin with '%'.
PublicId The public identifier of the entity, or undef if none was declared.
SystemId The system identifier of the entity.
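
A declaration handler typically just records what it is told. The sketch below, with a hypothetical DeclCollector class and hand-fired events standing in for a driver, stores element, attribute and entity declarations in hashes for later lookup.

```perl
use strict;
use warnings;

{
    package DeclCollector;
    sub new {
        return bless { elements => {}, attributes => {}, entities => {} }, shift;
    }
    sub element_decl {
        my ($self, $el) = @_;
        $self->{elements}{ $el->{Name} } = $el->{Model};
        return;
    }
    sub attribute_decl {
        my ($self, $att) = @_;
        $self->{attributes}{ $att->{eName} }{ $att->{aName} } = {
            Type  => $att->{Type},
            Mode  => $att->{Mode},
            Value => $att->{Value},
        };
        return;
    }
    sub internal_entity_decl {
        my ($self, $ent) = @_;
        $self->{entities}{ $ent->{Name} } = { Value => $ent->{Value} };
        return;
    }
    sub external_entity_decl {
        my ($self, $ent) = @_;
        $self->{entities}{ $ent->{Name} } = {
            PublicId => $ent->{PublicId},
            SystemId => $ent->{SystemId},
        };
        return;
    }
}

# Events a driver might report for:
#   <!ELEMENT note (to, body?)>
#   <!ATTLIST note lang CDATA #IMPLIED>
#   <!ENTITY copy "(c)">
my $decls = DeclCollector->new;
$decls->element_decl({ Name => 'note', Model => '(to,body?)' });
$decls->attribute_decl({
    eName => 'note', aName => 'lang',
    Type  => 'CDATA', Mode => '#IMPLIED', Value => undef,
});
$decls->internal_entity_decl({ Name => 'copy', Value => '(c)' });

printf "%s %s\n", $_, $decls->{elements}{$_}
    for sort keys %{ $decls->{elements} };
```

Note that the content model arrives already normalized, with the whitespace from the declaration removed.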

DTD Events

If a SAX application needs information about notations and unparsed entities, then the application implements this interface. The parser uses the instance to report notation and unparsed entity declarations to the application.

The SAX parser may report these events in any order, regardless of the order in which the notations and unparsed entities were declared; however, all DTD events must be reported after the document handler's start_document() event, and before the first start_element() event.

It is up to the application to store the information for future use (perhaps in a hash table or object tree). If the application encounters attributes of type "NOTATION", "ENTITY", or "ENTITIES", it can use the information that it obtained through this interface to find the entity and/or notation corresponding with the attribute value.

notation_decl(notation)
Receive notification of a notation declaration event.

It is up to the application to record the notation for later reference, if necessary.

If a system identifier is present, and it is a URL, the SAX parser must resolve it fully before passing it to the application.

notation is a hash reference with these properties:

Name The notation name.
PublicId The public identifier of the notation, or undef if none was declared.
SystemId The system identifier of the notation, or undef if none was declared.

unparsed_entity_decl(entity)
Receive notification of an unparsed entity declaration event.

Note that the notation name corresponds to a notation reported by the notation_decl() event. It is up to the application to record the entity for later reference, if necessary.

If the system identifier is a URL, the parser must resolve it fully before passing it to the application.

entity is a hash reference with these properties:

Name The unparsed entity's name.
PublicId The public identifier of the entity, or undef if none was declared.
SystemId The system identifier of the entity.
Notation The name of the associated notation.
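
Storing the DTD events for later use might look like the sketch below. DTDIndex and its lookup() method are hypothetical, and the events are fired by hand. Because events may arrive in any order, the notation is only looked up when an attribute value is resolved, not when the entity is declared.

```perl
use strict;
use warnings;

{
    package DTDIndex;
    sub new { return bless { notations => {}, entities => {} }, shift }

    sub notation_decl {
        my ($self, $notation) = @_;
        $self->{notations}{ $notation->{Name} } = $notation;
        return;
    }
    sub unparsed_entity_decl {
        my ($self, $entity) = @_;
        $self->{entities}{ $entity->{Name} } = $entity;
        return;
    }
    # Resolve the value of an attribute of type ENTITY to the entity's
    # system identifier and that of its notation.
    sub lookup {
        my ($self, $entity_name) = @_;
        my $entity   = $self->{entities}{$entity_name} or return;
        my $notation = $self->{notations}{ $entity->{Notation} } || {};
        return {
            SystemId         => $entity->{SystemId},
            NotationSystemId => $notation->{SystemId},
        };
    }
}

# Events for:
#   <!NOTATION gif SYSTEM "http://example.org/viewers/gif">
#   <!ENTITY logo SYSTEM "http://example.org/logo.gif" NDATA gif>
# delivered here with the entity first, since order is not guaranteed.
my $index = DTDIndex->new;
$index->unparsed_entity_decl({
    Name     => 'logo',
    PublicId => undef,
    SystemId => 'http://example.org/logo.gif',
    Notation => 'gif',
});
$index->notation_decl({
    Name     => 'gif',
    PublicId => undef,
    SystemId => 'http://example.org/viewers/gif',
});

my $found = $index->lookup('logo');
print "$found->{SystemId} handled by $found->{NotationSystemId}\n";
```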

Entity Resolver

If a SAX application needs to implement customized handling for external entities, it must implement this interface.

The parser will then allow the application to intercept any external entities (including the external DTD subset and external parameter entities, if any) before including them.

Many SAX applications will not need to implement this interface, but it will be especially useful for applications that build XML documents from databases or other specialised input sources, or for applications that use URI types that are either not URLs, or that have schemes unknown to the parser.

resolve_entity(entity)
Allow the application to resolve external entities.

The Parser will call this method before opening any external entity except the top-level document entity (including the external DTD subset, external entities referenced within the DTD, and external entities referenced within the document element): the application may request that the parser resolve the entity itself, that it use an alternative URI, or that it use an entirely different input source.

This method returns an Input Source hash reference.

Application writers can use this method to redirect external system identifiers to secure and/or local URIs, to look up public identifiers in a catalogue, or to read an entity from a database or other input source (including, for example, a dialog box).

If the system identifier is a URL, the SAX parser must resolve it fully before reporting it to the application.

entity is a hash reference with these properties:

PublicId The public identifier of the entity being referenced, or undef if none was declared.
SystemId The system identifier of the entity being referenced.
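
One common use, looking up public identifiers in a catalogue, can be sketched like this. CatalogResolver is a hypothetical class and the identifiers are made up. Returning an Input Source that carries the original identifiers when there is no catalogue entry is an assumption of this sketch about how to let the parser proceed as it otherwise would.

```perl
use strict;
use warnings;

{
    package CatalogResolver;
    sub new {
        my ($class, %catalog) = @_;
        return bless { catalog => {%catalog} }, $class;
    }
    # Returns an Input Source hash reference, as described above.
    sub resolve_entity {
        my ($self, $entity) = @_;
        my $public = $entity->{PublicId};
        if (defined $public and exists $self->{catalog}{$public}) {
            # Redirect to the local copy named in the catalogue.
            return { PublicId => $public, SystemId => $self->{catalog}{$public} };
        }
        # No catalogue entry: hand back the identifiers unchanged.
        return { PublicId => $public, SystemId => $entity->{SystemId} };
    }
}

my $resolver = CatalogResolver->new(
    '-//EXAMPLE//DTD Note 1.0//EN' => 'file:///usr/share/xml/note.dtd',
);
my $local = $resolver->resolve_entity({
    PublicId => '-//EXAMPLE//DTD Note 1.0//EN',
    SystemId => 'http://example.org/note.dtd',
});
my $other = $resolver->resolve_entity({
    PublicId => undef,
    SystemId => 'http://example.org/other.ent',
});
print "$local->{SystemId}\n$other->{SystemId}\n";
```

A resolver could equally return a source with a String or ByteStream property, for example to serve entities from a database.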

Error Events

If a SAX application needs to implement customized error handling, it must implement this interface. The parser will then report all errors and warnings through this interface.

The parser shall use this interface to report errors instead of, or in addition to, throwing an exception: for errors and warnings the recommended approach is to let the application throw its own exceptions and not to throw them in the parser. For fatal errors, however, it is not uncommon for the parser to throw an exception after having reported the error, as a fatal error makes any continuation of parsing impossible.

All error handlers receive a hash reference, exception, with the properties defined in Exceptions.

warning(exception)
Receive notification of a warning.

SAX parsers will use this method to report conditions that are not errors or fatal errors as defined by the XML 1.0 recommendation. The default behavior is to take no action.

The SAX parser must continue to provide normal parsing events after invoking this method: it should still be possible for the application to process the document through to the end.

error(exception)
Receive notification of a recoverable error.

This corresponds to the definition of "error" in section 1.2 of the W3C XML 1.0 Recommendation. For example, a validating parser would use this callback to report the violation of a validity constraint. The default behavior is to take no action.

The SAX parser must continue to provide normal parsing events after invoking this method: it should still be possible for the application to process the document through to the end. If the application cannot do so, then the parser should report a fatal error even if the XML 1.0 recommendation does not require it to do so.

fatal_error(exception)
Receive notification of a non-recoverable error.

This corresponds to the definition of "fatal error" in section 1.2 of the W3C XML 1.0 Recommendation. For example, a parser would use this callback to report the violation of a well-formedness constraint.

The application must assume that the document is unusable after the parser has invoked this method, and should continue (if at all) only for the sake of collecting additional error messages: in fact, SAX parsers are free to stop reporting any other events once this method has been invoked.
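
An error handler following this division of labour might collect warnings and recoverable errors and throw only on fatal errors. In the sketch below, StrictErrors is a hypothetical class, the events are fired by hand, and the exception hash is assumed to carry Message and LineNumber properties.

```perl
use strict;
use warnings;

{
    package StrictErrors;
    sub new { return bless { warnings => [], errors => [] }, shift }

    # Warnings and recoverable errors are recorded; parsing goes on.
    sub warning {
        my ($self, $exception) = @_;
        push @{ $self->{warnings} }, $exception->{Message};
        return;
    }
    sub error {
        my ($self, $exception) = @_;
        push @{ $self->{errors} }, $exception->{Message};
        return;
    }
    # The document is unusable after a fatal error, so give up.
    sub fatal_error {
        my ($self, $exception) = @_;
        die sprintf "fatal: %s at line %s\n",
            $exception->{Message}, $exception->{LineNumber} // '?';
    }
}

my $errors = StrictErrors->new;
$errors->warning({ Message => 'entity declared twice', LineNumber => 2 });
$errors->error({ Message => 'element "x" not declared', LineNumber => 5 });
my $survived = eval {
    $errors->fatal_error({ Message => 'mismatched tag', LineNumber => 9 });
    1;
};
my $fatal = $@;

printf "warnings: %d, errors: %d\n",
    scalar @{ $errors->{warnings} }, scalar @{ $errors->{errors} };
print $fatal unless $survived;
```

An application that wants validity errors to stop processing too could simply make error() die in the same way.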

Lexical Events

This is an optional extension handler for SAX2 to provide lexical information about an XML document, such as comments and CDATA section boundaries; XML readers are not required to support this handler. Readers supporting this handler have the http://xml.org/sax/handlers/LexicalHandler feature set to 1. Otherwise, this feature has value 0.

The events in the lexical handler apply to the entire document, not just to the document element, and all lexical handler events must appear between the content handler's start_document() and end_document() events.

To set a LexicalHandler for an XML reader, use the LexicalHandler option of the reader constructor or parse()/parse_x() method. If the reader does not support lexical events, it ignores the LexicalHandler option.

start_dtd(dtd)
Report the start of DTD declarations, if any.

Any declarations are assumed to be in the internal subset unless otherwise indicated by a start_entity event.

Note that the start/end_dtd() events will appear within the start/end_document() events from Content Handler and before the first start_element() event.

dtd is a hash reference with these properties:

Name The document type name.
PublicId The declared public identifier for the external DTD subset, or undef if none was declared.
SystemId The declared system identifier for the external DTD subset, or undef if none was declared.

end_dtd(dtd)
Report the end of DTD declarations.

No properties are defined for this event (dtd is empty).

start_entity(entity)
Report the beginning of an entity in content.

NOTE: entity references in attribute values -- and the start and end of the document entity -- are never reported.

The start and end of the external DTD subset are reported using the pseudo-name "[dtd]". All other events must be properly nested within start/end entity events.

Note that skipped entities will be reported through the skipped_entity() event, which is part of the ContentHandler interface.

entity is a hash reference with these properties:

Name The name of the entity. If it is a parameter entity, the name will begin with '%'.

end_entity(entity)
Report the end of an entity.

entity is a hash reference with these properties:

Name The name of the entity that is ending.

start_cdata(cdata)
Report the start of a CDATA section.

The contents of the CDATA section will be reported through the regular characters event.

No properties are defined for this event (cdata is empty).

end_cdata(cdata)
Report the end of a CDATA section.

No properties are defined for this event (cdata is empty).

comment(comment)
Report an XML comment anywhere in the document.

This callback will be used for comments inside or outside the document element, including comments in the external DTD subset (if read).

comment is a hash reference with these properties:

Data The comment characters.
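
Lexical events are usually combined with content events. The sketch below uses a hypothetical LexicalTracker object, registered in the imagined setup as both content and lexical handler, to collect comments and to separate CDATA text from ordinary character data. The events are fired by hand in the order a driver would report them for a small document.

```perl
use strict;
use warnings;

{
    package LexicalTracker;
    sub new {
        return bless { in_cdata => 0, comments => [], cdata => '', text => '' }, shift;
    }
    sub start_cdata { $_[0]{in_cdata} = 1; return }
    sub end_cdata   { $_[0]{in_cdata} = 0; return }
    sub comment {
        my ($self, $comment) = @_;
        push @{ $self->{comments} }, $comment->{Data};
        return;
    }
    # CDATA contents arrive through the regular characters event; the
    # start/end_cdata events only mark the boundaries.
    sub characters {
        my ($self, $chars) = @_;
        my $slot = $self->{in_cdata} ? 'cdata' : 'text';
        $self->{$slot} .= $chars->{Data};
        return;
    }
}

# Events for: <doc><!-- draft -->plain <![CDATA[a < b]]></doc>
my $lex = LexicalTracker->new;
$lex->comment({ Data => ' draft ' });
$lex->characters({ Data => 'plain ' });
$lex->start_cdata({});
$lex->characters({ Data => 'a < b' });
$lex->end_cdata({});

print "text: '$lex->{text}', cdata: '$lex->{cdata}'\n";
```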

SAX Filters

An XML filter is like an XML event generator, except that it obtains its events from another XML event generator rather than a primary source like an XML document or database. Filters can modify a stream of events as they pass on to the final application.

Parent
The parent reader.

This Feature allows the application to link the filter to a parent event generator (which may be another filter).

See the XML::SAX::Base module for more on filters. It is meant to be used as a base class for filters and drivers, and makes them much easier to implement.
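
To show the idea without relying on XML::SAX::Base, here is a hand-written sketch of a filter that upper-cases element names before passing events on. UppercaseNames and Recorder are hypothetical classes, the downstream handler is assumed to be stored under a Handler key, and the Parent link to an upstream generator is omitted because the events are fired by hand.

```perl
use strict;
use warnings;

# Hypothetical filter: modifies element events, passes text through.
{
    package UppercaseNames;
    sub new {
        my ($class, %opts) = @_;
        return bless { Handler => $opts{Handler} }, $class;
    }
    sub start_element {
        my ($self, $el) = @_;
        # Copy the event data rather than modifying the caller's hash.
        return $self->{Handler}->start_element({ %$el, Name => uc $el->{Name} });
    }
    sub end_element {
        my ($self, $el) = @_;
        return $self->{Handler}->end_element({ %$el, Name => uc $el->{Name} });
    }
    sub characters {
        my ($self, $chars) = @_;
        return $self->{Handler}->characters($chars);
    }
}

# Hypothetical final handler: records what it receives.
{
    package Recorder;
    sub new { return bless { log => [] }, shift }
    sub start_element { my ($s, $el) = @_; push @{ $s->{log} }, "<$el->{Name}>";  return }
    sub end_element   { my ($s, $el) = @_; push @{ $s->{log} }, "</$el->{Name}>"; return }
    sub characters    { my ($s, $ch) = @_; push @{ $s->{log} }, $ch->{Data};      return }
}

my $sink   = Recorder->new;
my $filter = UppercaseNames->new(Handler => $sink);
$filter->start_element({ Name => 'doc', Attributes => {} });
$filter->characters({ Data => 'hi' });
$filter->end_element({ Name => 'doc' });
print join('', @{ $sink->{log} }), "\n";
```

With XML::SAX::Base as the base class, the pass-through methods would not need to be written by hand; only the events the filter changes would be overridden.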

Java Compatibility

The Perl SAX 2.1 binding differs from the Java binding in these ways:

If compatibility is a problem for you, consider writing a Filter that converts from this style to the one you want. It is likely that such a Filter will be available from CPAN in the not too distant future.