Hello,
I've noticed that the function CXML::P/DOCUMENT calls the function CXML::P/DOCTYPE-DECL whether or not CXML::P/DOCUMENT has received a non-nil VALIDATE argument. I'd thought to try and make a patch about that, myself, but I'm not sure how to address the matter.
If the input stream is not being validated, the contents of the doctype decl should not matter, for anything -- need not be put, in any part, as an argument to CXML::XSTREAM-OPEN-EXTID. Yet, when the parser is validating, the text of the doctype decl. still must be 'skipped' by the parser.
Looking at CXML::P/DOCTYPE-DECL, I'm not sure how to make the parser skip the text of the decl, or what it could possibly return when skipping it. I could appreciate advice on the matter.
Thank you
-- Sean Champ
Quoting Sean Champ (gimmal@gmail.com):
> I've noticed that the function CXML::P/DOCUMENT calls the function CXML::P/DOCTYPE-DECL whether or not CXML::P/DOCUMENT has received a non-nil VALIDATE argument. I'd thought to try and make a patch about that, myself, but I'm not sure how to address the matter.
This seems to be a bit of an FAQ lately.
> If the input stream is not being validated, the contents of the doctype decl should not matter, for anything -- need not be put, in any part, as an argument to CXML::XSTREAM-OPEN-EXTID. Yet, when the parser is validating, the text of the doctype decl. still must be 'skipped' by the parser.
The DTD is parsed so that entity references can be resolved.
You cannot skip the doctypedecl entirely: The internal subset must always be processed.[1]
> Looking at CXML::P/DOCTYPE-DECL, I'm not sure how to make the parser skip the text of the decl, or what it could possibly return when skipping it. I could appreciate advice on the matter.
It is true that we could skip the external subset.
The XML spec allows non-validating parsers to report but not resolve entity references. ("Note that non-validating processors are not obligated to to [sic] read and process entity declarations occurring in parameter entities or in the external subset [...]"[2] And later, "For example, a non-validating processor may fail to [...] include the replacement text of internal entities"[3]).
That would allow a change like this:
* Add a new keyword argument to the parser, perhaps called RESOLVE-ENTITY-REFERENCES, defaulting to T.
* NIL allowed only if VALIDATE is also NIL.
* If NIL, skip parsing of the external subset and of external entities.
* Invent a new SAX event, perhaps called SAX:GENERAL-ENTITY-REFERENCE to report such entity references instead of resolving them.
* In the DOM builder, construct an EntityReference accordingly (assuming it is OK to create EntityReference nodes without children just because we do not -have- those children; see below).
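As a rough illustration of the event proposed in this list, and nothing more: the generic function GENERAL-ENTITY-REFERENCE, the RECORDING-HANDLER class, and the REPORT-OR-RESOLVE helper below are hypothetical names invented for this sketch. None of them is part of cxml or its SAX package, and the entity table is a plain alist standing in for whatever the parser keeps internally.

```lisp
;; Hypothetical sketch; none of these names exist in cxml.
(defgeneric general-entity-reference (handler name)
  (:documentation "Report an unresolved reference to the general entity NAME.")
  (:method ((handler t) name)
    ;; Default: handlers that do not care about the event ignore it.
    (declare (ignorable handler name))
    nil))

(defclass recording-handler ()
  ((events :initform '() :accessor handler-events))
  (:documentation "Toy handler that records events, most recent first."))

(defmethod general-entity-reference ((handler recording-handler) name)
  (push (list :entity-reference name) (handler-events handler)))

(defun report-or-resolve (handler name table resolve-entity-references)
  "Return the replacement text of NAME from the alist TABLE when resolving.
Otherwise signal the hypothetical event on HANDLER and return NIL."
  (if resolve-entity-references
      (cdr (assoc name table :test #'string=))
      (progn
        (general-entity-reference handler name)
        nil)))
```

With RESOLVE-ENTITY-REFERENCES true, the caller gets replacement text as today; with NIL, the handler only learns that a reference was seen. The sketch says nothing about attribute values, which is where the real difficulty lies.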
The big, bad problem with this, however:
* Extending the SAX event START-ELEMENT so that attribute values can contain unresolved entity references is not so attractive.
* Even if we worked around that, when an Attr node has an EntityReference with unknown content, what is supposed to happen when reading the attribute value? According to the DOM spec, the attribute value is constructed by resolving entity references. Are we supposed to signal an error then?
* And that's only the user-visible side of this problem; internally, the parser assumes that attribute values are strings, too.
So, I am not at all convinced this would be worth it. And, as explained above, I do not see how to make it work with DOM or even SAX.
Most people asking about this so far were just too lazy to type "apt-get install w3c-dtd-xhtml" anyway...
It would, however, be interesting to learn about other standards-conforming XML parsers with support for SAX and/or DOM that do anything like this. Pointers appreciated.
David
----
[1] Well, up to the first reference to an external parameter entity.
[2] http://www.w3.org/TR/REC-xml/#wf-entdeclared
[3] http://www.w3.org/TR/REC-xml/#safe-behavior
On 09-09-06, David Lichteblau wrote:
> Quoting Sean Champ (gimmal@gmail.com):
> > If the input stream is not being validated, the contents of the doctype decl should not matter, for anything -- need not be put, in any part, as an argument to CXML::XSTREAM-OPEN-EXTID. Yet, when the parser is validating, the text of the doctype decl. still must be 'skipped' by the parser.
> The DTD is parsed so that entity references can be resolved.
You know, I should apologize for my initial proposal, which now appears somewhat naive. I had not recalled that entity declarations might be found in the content of a DTD.
> You cannot skip the doctypedecl entirely: The internal subset must always be processed.[1]
> > Looking at CXML::P/DOCTYPE-DECL, I'm not sure how to make the parser skip the text of the decl, or what it could possibly return when skipping it. I could appreciate advice on the matter.
> It is true that we could skip the external subset.
> The XML spec allows non-validating parsers to report but not resolve entity references. ("Note that non-validating processors are not obligated to to [sic] read and process entity declarations occurring in parameter entities or in the external subset [...]"[2] And later, "For example, a non-validating processor may fail to [...] include the replacement text of internal entities"[3]).
> That would allow a change like this:
> - Add a new keyword argument to the parser, perhaps called RESOLVE-ENTITY-REFERENCES, defaulting to T.
> - NIL allowed only if VALIDATE is also NIL.
> - If NIL, skip parsing of the external subset and of external entities.
> - Invent a new SAX event, perhaps called SAX:GENERAL-ENTITY-REFERENCE to report such entity references instead of resolving them.
> - In the DOM builder, construct an EntityReference accordingly (assuming it is OK to create EntityReference nodes without children just because we do not -have- those children; see below).
> The big, bad problem with this, however:
> - Extending the SAX event START-ELEMENT so that attribute values can contain unresolved entity references is not so attractive.
Looking at the source of the method SAX:START-ELEMENT (DOM-BUILDER T T T T), it looks like an attribute's value is carried in directly from each SAX::ATTRIBUTE instance.
Looking at CXML::P/ATT-VALUE, and then along the call tree from that function, I would ask: could the non-dereferencing of entity references be handled in CXML::READ-ATT-VALUE, and/or maybe in CXML::RECURSE-ON-ENTITY? Would either of those be a good place to start?
I'll admit, I'm still largely unfamiliar with the CXML codebase, and with how CXML operates.
> - Even if we worked around that, when an Attr node has an EntityReference with unknown content, what is supposed to happen when reading the attribute value? According to the DOM spec, the attribute value is constructed by resolving entity references. Are we supposed to signal an error then?
I had expected that the DOM spec would *not* require dereferencing of entity refs at the time DOM objects are constructed. Granted, I have not been very familiar with the strictures of DOM.
I had expected that an entity ref would be decoded as some sort of object representing an entity ref, and that the object would then be stored somehow -- in a sequence, between two strings, for instance.
Yet, indeed, what I expected does not match what they have specified.
From http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/core.html#ID-2216624... regarding the interface 'Attr':
value of type DOMString
On retrieval, the value of the attribute is returned as a string. Character and general entity references are replaced with their values. See also the method getAttribute on the Element interface.
On setting, this creates a Text node with the unparsed contents of the string, i.e. any characters that an XML processor would recognize as markup are instead treated as literal text. See also the method Element.setAttribute().
Some specialized implementations, such as some [SVG 1.1] implementations, may do normalization automatically, even after mutation; in such case, the value on retrieval may differ from the value on setting.
May I say, that is a silly body of requirements. An entity reference anywhere in an input 'infoset' should be decoded as an entity reference, and need be dereferenced only when the thing must be rendered as something other than its source text.
They require a DOM parser to lose meaningful information that is represented in the input document -- losing the fact that "this was an entity reference" by replacing the thing at parse time.
I cannot take that as reasonable. Furthermore, it is inconsistent if they do not require entity refs to be dereferenced in the core XML spec, but then require otherwise in DOM.
I would note that this may be an issue worth putting to one of the W3C XML-related lists, whichever list is most appropriate for it. I cannot be certain it would not be futile for me to mention it, however. I wonder whether I would only observe some people sticking their fingers in their ears, and some of those making annoying comments. I'm afraid that the common interest might be best served if I do not take part in a W3C XML-related mailing list.
> - And that's only the user-visible side of this problem, internally the parser assumes that attribute values are strings, too.
If I were certain how, I would be glad to try to address it, but what about this matter of what the DOM spec specifies?
Per this proposal, the type of the 'value' property on the 'Attr' interface would no longer be "DOMString". In fact, it would then cease to be 100% compatible with DOM.
Given that DOM appears to include no VECTOR type, I suppose the type of the 'Attr.value' property -- if kept parallel with DOM -- would have to be revised to NodeList. That could then amount to a bunch of malarkey in the system: a *bunch* of single-element vectors of strings.
Beyond DOM, I would propose that the type of an attribute's 'value' slot could be specified like so: (OR STRING (VECTOR (OR STRING DOM:ENTITY-REFERENCE)))
A value of that type could be coerced to a value of type DOMString, or to a value of type NodeList, if and as required -- perhaps using something of a generalization of CL:COERCE, viz.
    (defgeneric coerce* (instance type)
      (:method ((instance t) (type symbol))
        (coerce* instance (find-class type)))
      #+(or CMU SBCL)
      (:method ((instance t) (type t))
        (coerce* instance
                 #+CMU (kernel::specifier-type type)
                 #+SBCL (sb-kernel::specifier-type type)))
      ; ...
      )
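To make the proposed value type concrete, here is a minimal sketch of it in portable Common Lisp, under stated assumptions. The ENTITY-REF structure is a hypothetical stand-in for DOM:ENTITY-REFERENCE, the ATTRIBUTE-VALUE type and the ATTRIBUTE-VALUE-STRING function are invented names, and the entity table is a plain alist. It also shows the open question raised earlier in the thread: flattening to a string has to signal an error when an entity's replacement text is unknown.

```lisp
;; Hypothetical sketch: ENTITY-REF stands in for DOM:ENTITY-REFERENCE.
(defstruct entity-ref
  (name "" :type string))

(defun attribute-value-p (object)
  "True if OBJECT is a string, or a vector of strings and ENTITY-REFs."
  (or (stringp object)
      (and (vectorp object)
           (every (lambda (part) (typep part '(or string entity-ref)))
                  object))))

(deftype attribute-value ()
  '(satisfies attribute-value-p))

(defun attribute-value-string (value entities)
  "Flatten VALUE to a string, looking entity names up in the alist ENTITIES.
Signals an error for an entity with no known replacement text."
  (check-type value attribute-value)
  (if (stringp value)
      value
      (with-output-to-string (out)
        (loop for part across value
              do (write-string
                  (if (stringp part)
                      part
                      (let ((entry (assoc (entity-ref-name part) entities
                                          :test #'string=)))
                        (unless entry
                          (error "Unresolved entity reference: &~A;"
                                 (entity-ref-name part)))
                        (cdr entry)))
                  out)))))
```

For example, (attribute-value-string (vector "a " (make-entity-ref :name "amp") " b") '(("amp" . "&"))) flattens the mixed vector to a single string, while a plain string passes through unchanged.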
> So, I am not at all convinced this would be worth it. And, as explained above, I do not see how to make it work with DOM or even SAX.
How I could regard it as being worth it:
1) to make an *accurate* rendering of an input document, as might be done via a DOM node that is represented via a CLIM presentation method
2) to be able to store a DOM node as an *accurate* representation of what was in the input document, without any entity refs dereferenced.
It appears that it cannot be made to fit with DOM as DOM has been defined, up through Level 3. I wonder if they might welcome it if the inconsistency about the Attr.value slot were brought to their attention.
I cannot find a definition of the SAX API, even at what is supposed to be its official homepage, http://www.saxproject.org/ . If they have not defined an event-signaling function to be called on encountering an entity ref, perhaps it has been an oversight in design that they might want to address, also.
As for how it might be implemented in CXML, besides modifying CXML::RECURSE-ON-ENTITY -- if I have found the right function for it, there -- perhaps the parameters of the matter could be represented as slot values on an instance of a modified DOM-builder class.
Perhaps slots might be defined on RUNE-DOM::DOM-BUILDER like so:
1) validation policy (slot validate-p ?)
2) entity-dereferencing policy (slot dereference-entity-p ?)
3) whitespace-normalization policy (slot normalize-whitespace-p ?)
4) namespace-handling policy (???-namespace-???-p ??) *
5) documentary-schema ? (slot documentary-schema -- could reference a DTD or an XSD, if not either a Relax NG XML or Relax NG non-XML thing)
6) catalogue (?) (initform: CXML:*CATALOG* )
7) entity-resolver (entity-resolver ?) (nil or function?)
8) internal-subset handling policy (allow-internal-subset-p ?)
9) recoder ? I am not aware of what effects depend on the RECODE argument to CXML::P/DOCUMENT, so I cannot map that to a slot here.
* At present, I'm not aware of what namespace handling is done, as with a CXML::NAMESPACE-NORMALIZER, that would not be handled directly by an instance of the class RUNE-DOM::DOM-BUILDER. So, I cannot map that to a slot, either.
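A few of the slots listed above might be sketched as follows. This is only an illustration under assumptions: PARSER-POLICY is a hypothetical standalone class, not a change to RUNE-DOM::DOM-BUILDER, and the slot and reader names simply follow the suggestions in the list. The catalog, schema, namespace and recoder items are left out because their intended behaviour is not settled in this message. The check in INITIALIZE-INSTANCE encodes the constraint stated earlier in the thread, that entity references may be left unresolved only when the parser is not validating.

```lisp
;; Hypothetical sketch: PARSER-POLICY and its slot names are illustrative only.
(defclass parser-policy ()
  ((validate-p :initarg :validate-p :initform nil :reader validate-p)
   (dereference-entity-p :initarg :dereference-entity-p :initform t
                         :reader dereference-entity-p)
   (normalize-whitespace-p :initarg :normalize-whitespace-p :initform nil
                           :reader normalize-whitespace-p)
   (entity-resolver :initarg :entity-resolver :initform nil
                    :type (or null function)
                    :reader entity-resolver)
   (allow-internal-subset-p :initarg :allow-internal-subset-p :initform t
                            :reader allow-internal-subset-p))
  (:documentation "Parsing options gathered in one object instead of keyword args."))

(defmethod initialize-instance :after ((policy parser-policy) &key)
  ;; Unresolved entity references are only acceptable when not validating.
  (when (and (validate-p policy)
             (not (dereference-entity-p policy)))
    (error "A validating parse requires entity references to be dereferenced.")))
```

Keeping the policy in its own class would let a handler either inherit from it or hold an instance of it; which of the two fits CXML better is not something this sketch decides.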
With the class DOM-BUILDER revised like so, the keyword args could be removed from CXML::P/DOCUMENT, which would then extract the values of those args from the 'handler' object.
If a modification of CXML::RECURSE-ON-ENTITY would be part of it: I am not sure how the entity-deref policy would best be passed through to that function; it might be done with an additional argument, if not with a lexically scoped variable.
The entity-deref policy flag would have to be provided through CXML::READ-ATT-VALUE (and anything else calling RECURSE-ON-ENTITY), and so, firstly, from within CXML::P/ATT-VALUE. The value could be made available to that function with a new, lexically scoped variable, I suppose, bound within a form in CXML::P/DOCUMENT, whatever name is selected for the variable.
Either way, I'm not sure how the parsing would need to be revised for it, as in CXML::RECURSE-ON-ENTITY and CXML::READ-ATT-VALUE; I hope I would be able to figure it out.
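One way to pass such a flag down the call tree without adding an argument to every function is a special (dynamically bound) variable, assuming that is what is wanted here; a purely lexical binding in P/DOCUMENT would not be visible inside separately defined functions. The sketch below is self-contained and hypothetical: *DEREFERENCE-ENTITIES-P*, READ-REFERENCE and PARSE-WITH-POLICY are invented names that merely stand in for the roles of the real parser functions.

```lisp
;; Hypothetical sketch; these names are not cxml's.
(defvar *dereference-entities-p* t
  "When true, entity references are replaced by their replacement text.")

(defun read-reference (name entities)
  "Stand-in for a low-level routine in the role of RECURSE-ON-ENTITY.
ENTITIES is an alist mapping entity names to replacement text."
  (if *dereference-entities-p*
      (or (cdr (assoc name entities :test #'string=))
          (error "Undeclared entity: ~A" name))
      ;; Policy says do not dereference: hand back a marker instead.
      (list :entity-reference name)))

(defun parse-with-policy (name entities &key (dereference-entities-p t))
  "Stand-in for an entry point in the role of P/DOCUMENT.
Binds the policy once; callees see it without extra arguments."
  (let ((*dereference-entities-p* dereference-entities-p))
    (read-reference name entities)))
```

The binding is undone when PARSE-WITH-POLICY returns, so nested or later parses are unaffected by the choice made for one document.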
Certainly, some of the parser functions would have to be revised per the proposed change -- namely, anything parsing an attribute value. If it would be helpful, then once I have a definite start on it, I'd be willing to fetch the W3C's tests, run through 'em, and revise the changed code until I could be sure it works.
The slot-value type for SAX::ATTRIBUTE-VALUE would be, essentially, changed -- from, say, rod-or-string (?) to (vector (or rod-or-string entity-ref))
> Most people asking about this so far were just too lazy to type "apt-get install w3c-dtd-xhtml" anyway...
I was trying to parse a document using the XBEL DTD, and to parse it without validation. When I was doing so, I noticed that the HTTP URI in the DTD decl was not handled. I thought that the decl might then be skipped entirely.
Granted, I do have an XBEL DTD locally, but
- not everyone does
- my Galeon bookmarks file would not be valid against the XBEL DTD, anyway
- I had not realized that entity defs may occur in external DTD subsets
- I thought that I could propose that the DTD decl might be skipped (without my having to modify /etc/xml/catalog first)
For what it's worth, it appears that -- in a Debian environment -- the shell command `update-xmlcatalog' is the right command to use for modifying /etc/xml/catalog or a similar file. That shell command is used in the w3c-dtd-xhtml package's postinst script. Now I know that much.
If it would be preferred, I could go over this message body again to make a really formal design proposal on the matter. I've been using DocBook a lot; it should be simple to make a refentry page about the proposal and submit it via this mailing list.
Danke
-- Sean Champ
> [1] Well, up to the first reference to an external parameter entity.
> [2] http://www.w3.org/TR/REC-xml/#wf-entdeclared
> [3] http://www.w3.org/TR/REC-xml/#safe-behavior
Quoting Sean Champ (gimmal@gmail.com):
> Granted, I do have an XBEL DTD locally, but
> - not everyone does
> [...]
> - I thought that I could propose that the DTD decl might be skipped (without my having to modify /etc/xml/catalog first)
I am not familiar with XBEL, but reading over its DTD it does not seem to define general entities, so in this case you could just parse with
    :entity-resolver (lambda (p s) (make-concatenated-stream))
which pretends that the external subset and all external entities are empty.
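Spelled out as a complete call, the suggestion might look like the sketch below. It assumes cxml is already loaded (for instance with Quicklisp, which is an assumption of this example and not something the thread mentions) and uses CXML:PARSE-FILE with a DOM builder from CXML-DOM:MAKE-DOM-BUILDER. The function names EMPTY-ENTITY-RESOLVER and PARSE-IGNORING-EXTERNAL-ENTITIES are made up here, and the resolver is the lambda from the message with its two unused arguments declared ignored.

```lisp
;; Assumes cxml is already loaded, e.g. via (ql:quickload "cxml").
(defun empty-entity-resolver (public-id system-id)
  "Pretend that every external entity, including the external subset, is empty."
  (declare (ignore public-id system-id))
  (make-concatenated-stream))

(defun parse-ignoring-external-entities (pathname)
  "Parse the file at PATHNAME into a DOM document.
External entities are never fetched; the resolver supplies empty streams."
  (cxml:parse-file pathname
                   (cxml-dom:make-dom-builder)
                   :entity-resolver #'empty-entity-resolver))
```

A call such as (parse-ignoring-external-entities "bookmarks.xbel"), with a file name chosen only for illustration, would then parse without touching the network or the catalog. This is only safe for documents that, like XBEL ones by the reading above, do not rely on general entities declared in the external subset.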
(Another answer for the FAQ list cxml doesn't have.)
David
On 09-09-06, David Lichteblau wrote:
> Quoting Sean Champ (gimmal@gmail.com):
> > Granted, I do have an XBEL DTD locally, but
> > - not everyone does
> > [...]
> > - I thought that I could propose that the DTD decl might be skipped (without my having to modify /etc/xml/catalog first)
> I am not familiar with XBEL, but reading over its DTD it does not seem to define general entities, so in this case you could just parse with
>     :entity-resolver (lambda (p s) (make-concatenated-stream))
> which pretends that the external subset and all external entities are empty.
Ok. Thank you.
Should I take it, then, that the proposals I mentioned are wholly on me to carry out?
-- Sean Champ