Quoting Sean Champ (gimmal@gmail.com):
> I've noticed that the function CXML::P/DOCUMENT calls the function CXML::P/DOCTYPE-DECL whether or not CXML::P/DOCUMENT has received a non-nil VALIDATE argument. I'd thought to try and make a patch about that, myself, but I'm not sure how to address the matter.
This seems to be a bit of an FAQ lately.
> If the input stream is not being validated, the contents of the doctype decl should not matter, for anything -- need not be put, in any part, as an argument to CXML::XSTREAM-OPEN-EXTID. Yet, when the parser is validating, the text of the doctype decl. still must be 'skipped' by the parser.
The DTD is parsed so that entity references can be resolved.
You cannot skip the doctypedecl entirely: The internal subset must always be processed.[1]
> Looking at CXML::P/DOCTYPE-DECL, I'm not sure how to make the parser skip the text of the decl, or what it could possibly return when skipping it. I could appreciate advice on the matter.
It is true that we could skip the external subset.
The XML spec allows non-validating parsers to report but not resolve entity references. ("Note that non-validating processors are not obligated to to [sic] read and process entity declarations occurring in parameter entities or in the external subset [...]"[2] And later, "For example, a non-validating processor may fail to [...] include the replacement text of internal entities"[3]).
That would allow a change like this:
  * Add a new keyword argument to the parser, perhaps called RESOLVE-ENTITY-REFERENCES, defaulting to T.
  * NIL allowed only if VALIDATE is also NIL.
  * If NIL, skip parsing of the external subset and of external entities.
  * Invent a new SAX event, perhaps called SAX:GENERAL-ENTITY-REFERENCE to report such entity references instead of resolving them.
  * In the DOM builder, construct an EntityReference accordingly (assuming it is OK to create EntityReference nodes without children just because we do not -have- those children; see below).
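To make the idea concrete, here is a toy model of how character data might be reported under such a flag. It is only a sketch and not CXML code: SCAN-CHARACTER-DATA, the event keywords, and the alist of entities are hypothetical names made up for illustration, and it ignores character references and the predefined entities.

```lisp
;; Toy model only; nothing here is part of CXML. All names are hypothetical.

(defun scan-character-data (text entities
                            &key validate (resolve-entity-references t))
  "Split TEXT at &name; references and return a list of SAX-like events.
ENTITIES is an alist mapping entity names to replacement text.
When RESOLVE-ENTITY-REFERENCES is true, references are replaced by their
text and an undeclared entity is an error.  When it is NIL, each reference
is reported as (:general-entity-reference NAME) and left unresolved."
  ;; Mirrors the rule that NIL is allowed only if VALIDATE is also NIL.
  (when (and validate (not resolve-entity-references))
    (error "RESOLVE-ENTITY-REFERENCES may be NIL only if VALIDATE is NIL."))
  (let ((events '())
        (start 0))
    (flet ((emit-characters (string)
             ;; Empty runs are dropped; adjacent runs are not merged.
             (when (plusp (length string))
               (push (list :characters string) events))))
      (loop
        (let* ((amp (position #\& text :start start))
               (semi (and amp (position #\; text :start amp))))
          (unless semi
            ;; No further complete reference: flush the rest and stop.
            (emit-characters (subseq text start))
            (return))
          (emit-characters (subseq text start amp))
          (let ((name (subseq text (1+ amp) semi)))
            (if resolve-entity-references
                (let ((entry (assoc name entities :test #'string=)))
                  (unless entry
                    (error "Undeclared entity: ~A" name))
                  (emit-characters (cdr entry)))
                (push (list :general-entity-reference name) events)))
          (setf start (1+ semi)))))
    (nreverse events)))

;; Resolving (the default):
;; (scan-character-data "Hello, &who;!" '(("who" . "world")))
;; => ((:CHARACTERS "Hello, ") (:CHARACTERS "world") (:CHARACTERS "!"))
;;
;; Reporting only, with no declarations available:
;; (scan-character-data "Hello, &who;!" '() :resolve-entity-references nil)
;; => ((:CHARACTERS "Hello, ") (:GENERAL-ENTITY-REFERENCE "who") (:CHARACTERS "!"))
```

For element content this is straightforward, since a handler can simply receive one more kind of event between character runs.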
The big, bad problem with this, however:
  * Extending the SAX event START-ELEMENT so that attribute values can contain unresolved entity references is not so attractive.
  * Even if we worked around that, when an Attr node has an EntityReference with unknown content, what is supposed to happen when reading the attribute value? According to the DOM spec, the attribute value is constructed by resolving entity references. Are we supposed to signal an error then?
  * And that's only the user-visible side of this problem; internally, the parser assumes that attribute values are strings, too.
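The attribute value question can be illustrated with another toy model. Again, this is not CXML or DOM code: ATTRIBUTE-VALUE, the list representation of an Attr node's children, and the condition type are all hypothetical. Signalling an error is only one of the possible answers, shown here because it is the one raised above.

```lisp
;; Toy model only; all names and the data layout are hypothetical.

(define-condition unresolved-entity-reference (error)
  ((name :initarg :name :reader unresolved-entity-name))
  (:report (lambda (condition stream)
             (format stream
                     "Entity reference &~A; has no known replacement text."
                     (unresolved-entity-name condition)))))

(defun attribute-value (children)
  "Compute an attribute value from CHILDREN, a list whose elements are
either strings (text) or lists (:entity-reference NAME REPLACEMENT),
where REPLACEMENT is a string, or NIL if the entity was never read."
  (with-output-to-string (out)
    (dolist (child children)
      (etypecase child
        (string (write-string child out))
        (cons
         (destructuring-bind (tag name replacement) child
           (assert (eq tag :entity-reference))
           (if replacement
               (write-string replacement out)
               ;; Nothing sensible to return: the value is unknowable.
               (error 'unresolved-entity-reference :name name))))))))

;; (attribute-value '("a" (:entity-reference "x" "b") "c"))  => "abc"
;; (attribute-value '("a" (:entity-reference "x" nil) "c"))
;;   signals UNRESOLVED-ENTITY-REFERENCE
```

Every caller that reads an attribute value would then have to be prepared for that condition, which is a large part of why this looks unattractive.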
So, I am not at all convinced this would be worth it. And, as explained above, I do not see how to make it work with DOM or even SAX.
Most people asking about this so far were just too lazy to type "apt-get install w3c-dtd-xhtml" anyway...
It would, however, be interesting to learn about other standards-conforming XML parsers with support for SAX and/or DOM that do anything like this. Pointers appreciated.
David

----
[1] Well, up to the first reference to an external parameter entity.
[2] http://www.w3.org/TR/REC-xml/#wf-entdeclared
[3] http://www.w3.org/TR/REC-xml/#safe-behavior