Perl SAX 2.1 Changes and Issues

Development of the Perl SAX 2.1 API is in progress. Comments should be sent to the Perl-XML mailing list (perl-xml@listserv.ActiveState.com).

Open issues: none

Changes from Perl SAX 2.0

Change C1
XMLVersion and Encoding fields added to document locator (as in Locator2 interface of SAX 2.0 Ext. 1.1)

Change C2 [resolves issue I4]
The definition of parse() has been unified in the Basic and Advanced documents. parse_uri() added. The new definition matches the current XML::SAX::Base implementation (as of XML::SAX v0.12).

Change C3 [resolves issue I6]
Changes in attribute_decl(): ValueDefault renamed to Mode. The new name is less confusing and corresponds to the SAX Java API.

Change C4 [resolves issue I12]
The following text has been added to define what the document locator is supposed to return: "If possible, a Perl SAX driver should provide the line and column position of the last character of the text associated with the current document event. The first line is line 1; the first column in each line is column 1."
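As an illustration of the 1-based counting described above, the following is a minimal sketch of how a position could be derived from a character offset. The helper name line_and_column() and the idea of working from a 0-based offset are assumptions made for this example; they are not part of the spec.

```perl
use strict;
use warnings;

# Hypothetical helper: given the document text and the 0-based offset of a
# character, return its 1-based line and column.
sub line_and_column {
    my ($text, $offset) = @_;
    my $before  = substr($text, 0, $offset);
    my $line    = 1 + ($before =~ tr/\n//);   # newlines seen so far, plus 1
    my $last_nl = rindex($before, "\n");      # -1 when on the first line
    my $column  = $offset - $last_nl;
    return ($line, $column);
}

# Offset 7 is the closing '>' of "<b/>" on the second line.
my ($line, $column) = line_and_column("<a>\n<b/>", 7);
print "line $line, column $column\n";
```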

Change C5 [resolves issue I11]
The spec defines that a stream argument passed to the parse_file() method can be a file handle, a glob reference, or an IO::Handle subclass.

Change C6 [resolves issue I7]
New section "Namespace Processing" has been added. It describes the behavior of a parser when NS processing is turned off: Element/attribute hash keys are always present, NamespaceURI, Prefix and LocalName are undef. Attribute keys are prefixed with {}. NS declarations are treated as common attributes. start_prefix_mapping and end_prefix_mapping are never called.
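A minimal sketch of the element data a driver might pass to start_element() with NS processing turned off. The helper element_without_ns() and the sample names are hypothetical; the Name and Value keys are assumed to carry the qualified name and the attribute value, as in Perl SAX 2.0.

```perl
use strict;
use warnings;

# Hypothetical helper: build element data for a parser with NS processing off.
sub element_without_ns {
    my ($name, %raw_attrs) = @_;
    my %attributes;
    for my $attr_name (keys %raw_attrs) {
        # Keys carry the {} prefix; NS-related fields are present but undef.
        $attributes{"{}$attr_name"} = {
            Name         => $attr_name,
            Value        => $raw_attrs{$attr_name},
            NamespaceURI => undef,
            Prefix       => undef,
            LocalName    => undef,
        };
    }
    return {
        Name         => $name,
        NamespaceURI => undef,
        Prefix       => undef,
        LocalName    => undef,
        Attributes   => \%attributes,
    };
}

# The xmlns:pfx declaration is reported as a common attribute.
my $element = element_without_ns(
    'pfx:doc',
    'xmlns:pfx' => 'http://example.org/ns',
    'pfx:id'    => '42',
);
print join(', ', sort keys %{ $element->{Attributes} }), "\n";
```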

Change C7 [resolves issue I8]
The spec defines explicitly that values of callback argument hashrefs are Unicode strings (scalars with UTF-8 flag on).
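The difference between a byte string and a Unicode string with the UTF-8 flag on can be sketched as follows. This only illustrates the flag using the core Encode module and utf8::is_utf8(); the sample data is made up, and how a particular driver produces its strings is not covered here.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "caf\xC3\xA9";            # UTF-8 encoded octets, flag off
my $chars = decode('UTF-8', $bytes);  # character string, flag on

printf "bytes: flag=%d length=%d\n", (utf8::is_utf8($bytes) ? 1 : 0), length $bytes;
printf "chars: flag=%d length=%d\n", (utf8::is_utf8($chars) ? 1 : 0), length $chars;
```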

Change C8 [resolves issue I1]
The section "Features" defines a read-only 'http://xmlns.perl.org/sax/version-2.1' feature which returns 1 for a driver supporting Perl SAX 2.1.
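An application could test for this feature roughly as sketched below. The helper supports_sax_21() and the two Sketch::* classes are hypothetical stand-ins written for this example; only the get_feature() method name and the feature URI come from Perl SAX. The eval is there because a driver may throw an exception for a feature it does not recognize.

```perl
use strict;
use warnings;

my $VERSION_FEATURE = 'http://xmlns.perl.org/sax/version-2.1';

# Returns 1 if the driver reports the 2.1 feature, 0 otherwise
# (including when get_feature() throws).
sub supports_sax_21 {
    my ($driver) = @_;
    my $value = eval { $driver->get_feature($VERSION_FEATURE) };
    return $value ? 1 : 0;
}

# Hypothetical stand-in drivers, for demonstration only.
package Sketch::Driver21 {
    sub new { return bless {}, shift }
    sub get_feature {
        my ($self, $name) = @_;
        return 1 if $name eq 'http://xmlns.perl.org/sax/version-2.1';
        die "Feature '$name' not recognized\n";
    }
}

package Sketch::Driver20 {
    sub new { return bless {}, shift }
    sub get_feature {
        my ($self, $name) = @_;
        die "Feature '$name' not recognized\n";
    }
}

print supports_sax_21(Sketch::Driver21->new), "\n";
print supports_sax_21(Sketch::Driver20->new), "\n";
```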

Change C9 [resolves issue I2]
LexicalHandler and DeclHandler are set using the parser options with the same names. Two read-only features (http://xmlns.perl.org/sax/lexicalHandler, declHandler) return 0 or 1 to indicate whether the parser supports these two interfaces.

Change C10 [resolves issue I13]
The spec states explicitly that the value of input source Encoding property has a higher priority than encoding specified in an XML declaration.

Change C11 [resolves issue I9]
The following text has been added to the InputSource definition: "String - The character or byte string of this input source. The SAX parser will ignore this if there is also a byte stream or a character stream specified, but it will use the string in preference to opening a URI connection itself. If the UTF-8 flag of the string is turned on, the effect is as if the Encoding property is set to UTF-8." The order in which the properties are checked agrees with the current XML::SAX::Base implementation (v0.12): CharacterStream, ByteStream, String, SystemId.
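The checking order can be sketched as a simple lookup over the input source hash. The helper select_source_property() is hypothetical and does not reproduce the XML::SAX::Base code; it only shows the stated precedence.

```perl
use strict;
use warnings;

# Hypothetical helper: return the name of the input source property that
# would be used, following the order given above.
sub select_source_property {
    my ($source) = @_;
    for my $property (qw(CharacterStream ByteStream String SystemId)) {
        return $property if defined $source->{$property};
    }
    return;    # nothing usable was supplied
}

# String wins over SystemId, so no URI connection would be opened.
my $chosen = select_source_property({
    String   => '<doc/>',
    SystemId => 'doc.xml',
});
print "$chosen\n";
```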

Change C12 [resolves issue I10]
In addition to streams, the parse_file() method also accepts system paths to prevent possible confusion arising from the name of this method.
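A minimal sketch of how the argument kinds accepted by parse_file() might be told apart, using only core modules. The helper classify_parse_file_arg() and its return labels are assumptions made for this example; an actual driver may dispatch differently.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed reftype);
use IO::Handle;

# Hypothetical helper: classify what was passed to parse_file().
sub classify_parse_file_arg {
    my ($arg) = @_;
    return 'IO::Handle object' if blessed($arg) && $arg->isa('IO::Handle');
    return 'glob reference'    if ref($arg) && reftype($arg) eq 'GLOB';
    return 'file handle'       if ref(\$arg) eq 'GLOB';    # bare glob such as *STDIN
    return 'system path'       if defined($arg) && !ref($arg);
    return 'unsupported';
}

print classify_parse_file_arg(IO::Handle->new), "\n";
print classify_parse_file_arg(\*STDIN), "\n";
print classify_parse_file_arg(*STDIN), "\n";
print classify_parse_file_arg('doc.xml'), "\n";
```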

Change C13
The Features section has changed. All features have values of 1 or 0; the section lists some common features.

Issues

Issue I1 status: closed, resolution: applied [resolved as change C8]
A parser should advertise the SAX version it supports. There can be a new method ($parser->get_sax_version()) or a read-only feature (http://xmlns.perl.org/sax/version). This feature should also be introduced to Perl SAX 2.0 retrospectively to distinguish between 1.0, 2.0 and 2.1 drivers.
Suggestion: the read-only feature.

Issue I2 status: closed, resolution: applied [resolved as change C9]
Currently, the "http://xml.org/sax/handlers/LexicalHandler" feature on the parser needs to be set to the object that is to receive lexical events. If the reader does not support lexical events, it will throw an XML::SAX::Exception::NotRecognized or an XML::SAX::Exception::NotSupported when you attempt to register the handler. DeclHandler works in the same way. In practice this is only theory: XML::SAX::Base does not currently implement it.
This approach is very different from the common PerlSAX mechanism: look for a specific handler, then look for a handler method on the default handler, and ignore the callback when neither is found. It would be more 'perlish' to apply this simple mechanism to LexicalHandler and DeclHandler too. If we want these two to be extension handlers (compliant 2.1 parsers are not required to support them), there could be read-only features to let apps know whether extension handlers are supported or not (http://xmlns.perl.org/sax/LexicalHandler, DeclHandler).
Suggestion: LexicalHandler and DeclHandler are set using the parser options with the same names. Two read-only features (http://xmlns.perl.org/sax/LexicalHandler, DeclHandler) return 0 or 1 to indicate whether the parser supports these two interfaces.

Issue I3 status: closed, resolution: denied
SAX 2.0 Ext. 1.1 has a new Attributes2 interface which extends attributes with new properties (Declared, Specified) to distinguish between attributes specified in an XML document and those declared in a DTD. This could be introduced into Perl SAX 2.1 as an optional extension (advertised by a feature).
Suggestion: not to apply.

Issue I4 status: closed, resolution: applied [resolved as change C2]
The parse() method is defined in different ways in the Basic and Advanced documents. Proposed solution: to add an explicit parse_uri() method, parse() would call either parse_uri(), parse_string(), or parse_stream() based on InputSource.

Issue I5 status: closed, resolution: denied
All hash references could be replaced with blessed classes. The benefits of such a change need to be clarified.
Suggestion: not to apply.

Issue I6 status: closed, resolution: applied [resolved as change C3]
Changes in attribute_decl(): eName, aName, Type, Mode (was ValueDefault), Value

Issue I7 status: closed, resolution: applied [resolved as change C6]
The effect of turning off namespace processing is unclear in Perl SAX. The spec should state that all namespace-related processing is skipped, and no namespace-related information is made available.
Suggestion: All node keys are always present, NamespaceURI, Prefix and LocalName are undef. Attribute keys are prefixed with {} (for example {}pfx:lname). NS declarations are treated as common attributes.

Issue I8 status: closed, resolution: applied [resolved as change C7]
Perl SAX should require explicitly all event data to be Unicode strings (to have the UTF-8 flag on).

Issue I9 status: closed, resolution: applied [resolved as change C11]
Input sources don't have a String property defined, though the parse_string() method exists and uses it. The current XML::SAX::Base version (0.12) already implements the String property. The properties are checked in this order: CharacterStream, ByteStream, String, SystemId.
Suggested solution: To add the following paragraph:
String - The character or byte string for this input source. If there is a string specified, the SAX parser will ignore any byte or character stream and will not attempt to open a URI connection to the system identifier. If the UTF-8 flag of the string is turned on, the effect is as if the Encoding property is set to UTF-8.
The order of properties to be checked has to be determined.

Issue I10 status: closed, resolution: applied [resolved as change C12]
parse_file() is meant to accept streams in Perl SAX, while other modules (such as XML::LibXML and XML::Parser) accept system paths for this method.
Suggested solution: To change parse_file() so that it accepts a system path in addition to the currently supported types.

Issue I11 status: closed, resolution: applied [resolved as change C5]
The specification should state explicitly what is meant by "streams" and which types are supported for parse_file(): file handles, glob references, IO::Handle subclasses, ...
Suggested solution: To support all of the above mentioned types.

Issue I12 status: closed, resolution: applied [resolved as change C4]
The spec should be more explicit about what a document locator is supposed to return. It could for example read:
If possible, a Perl SAX driver should provide the line and column position of the first character after the text associated with the document event. The first line is line 1; the first column in each line is column 1.

Issue I13 status: closed, resolution: applied [resolved as change C10]
The value of the Encoding property of Input Sources should be able to override the encoding value specified in the XML declaration. If so, the spec should state this explicitly.
Suggestion: to add the following text: If available, the value of Encoding property has a higher priority than encoding specified in an XML declaration.

Issue I14 status: closed, resolution: denied
Parsers report an error when the parse() method is called and no input source (Source option) is provided, regardless of the method's argument.
Suggestion: not to apply. It will be sufficient to provide a better error message from XML::SAX::Base (e.g. to suggest that one may want to call parse_uri() or parse_string() instead of parse()).