On Feb 6, 2009, at 10:21, David Lichteblau wrote:
Quoting Marco Antoniotti (marcoxa@cs.nyu.edu):
I get all my actual RUNE-DOM::ELEMENTs interleaved with "bogus" TEXT elements containing just #\Newline and #\Tab (or more #\Tab). This is obviously an artifact of parsing. (See the attached figure from a 15-minute CXML browser I whipped up.)
What I do not know is (1) whether this makes sense or not, or (2) whether it is dependent on my platform (LWM).
Certainly -- XML preserves whitespace in character data, except for CRLF to LF normalization.
There are no universally correct rules for whitespace normalization in XML, and in general, any change to whitespace could change the meaning of the document.
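To see that these text nodes are not specific to one parser or platform, here is a minimal sketch using Python's standard xml.dom.minidom rather than CXML. The tiny document and the helper name child_summary are made up for illustration; the point is only that a conforming DOM parser reports the indentation between elements as ordinary text nodes.

```python
from xml.dom import minidom

# A small, made-up document indented with newlines and tabs.
DOC = "<sbml>\n\t<model>\n\t\t<listOfCompartments/>\n\t</model>\n</sbml>"


def child_summary(node):
    """Return (kind, value) pairs for the direct children of a DOM node."""
    out = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            out.append(("text", child.data))
        elif child.nodeType == child.ELEMENT_NODE:
            out.append(("element", child.tagName))
    return out


root = minidom.parseString(DOC).documentElement
summary = child_summary(root)
print(summary)
```

The element child is surrounded by text nodes that hold nothing but the newline and tab characters used for indentation.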
One rule that is relatively common is to consider whitespace insignificant in "element content", i.e. in places where no non-whitespace text nodes are allowed by the DTD.
This rule is implemented by CXML:MAKE-WHITESPACE-NORMALIZER (see http://common-lisp.net/project/cxml/sax.html#misc), which may be helpful in your case.
However, note that the limitation to element content means that you actually need to write or find a DTD that matches your document. Without a DTD, this approach doesn't work.
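The element-content rule can be sketched roughly as follows. This is not how CXML implements it: the example uses Python's xml.dom.minidom, and the set ELEMENT_CONTENT is a hypothetical stand-in for what a real DTD would declare (the element names in it are assumptions, not taken from the SBML DTD). It only shows why the rule needs that declaration: whitespace is dropped inside elements known to contain only child elements, and left alone everywhere else.

```python
from xml.dom import minidom

XML_WS = " \t\r\n"  # the four whitespace characters defined by XML 1.0

# Hypothetical: elements that a DTD would declare as having element-only
# content. Without a DTD there is no way to know this set.
ELEMENT_CONTENT = {"sbml", "model"}


def normalize(element, element_content=ELEMENT_CONTENT):
    """Drop whitespace-only text nodes, but only in element content."""
    strip_here = element.tagName in element_content
    for child in list(element.childNodes):
        if child.nodeType == child.ELEMENT_NODE:
            normalize(child, element_content)
        elif (strip_here
              and child.nodeType == child.TEXT_NODE
              and child.data.strip(XML_WS) == ""):
            element.removeChild(child)
    return element


doc = minidom.parseString(
    "<sbml>\n\t<model>\n\t\t<notes> </notes>\n\t</model>\n</sbml>")
result = normalize(doc.documentElement).toxml()
print(result)
```

The single space inside notes survives because notes is not listed as element content, while the indentation inside sbml and model is removed.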
Ok. I think I understand this. I'll try the CXML:MAKE-WHITESPACE-NORMALIZER (I need to understand how to use it first).
However, let me ask you this too. The SBML XML files start like this:
<?xml version="1.0" encoding="UTF-8"?>
<!-- Created by Gepasi 3.30 on March 17, 2003, 12:57 -->
<sbml xmlns="http://www.sbml.org/sbml/level1" level="1" version="1">
Am I correct in assuming that CXML would be able to forgo the DTD if it could access the www.sbml.org site and find a DTD or an XSD there?
Pardon the naïveté of my questions, but I really do not know enough about XML.
Cheers -- Marco
Other approaches are to use HTML rules for whitespace normalization (which are a bit more tricky to get right though, and cxml does not provide a ready-to-use function for this) or to discard all whitespace. It really depends on the schema and application.
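As an illustration of the "discard all whitespace" option, here is a minimal sketch, again in Python's xml.dom.minidom rather than CXML. The function name strip_whitespace_nodes is made up for this example. It removes every text node that consists only of XML whitespace, with no DTD involved, which is only safe when the application never treats whitespace-only text as meaningful.

```python
from xml.dom import minidom

XML_WS = " \t\r\n"  # the four whitespace characters defined by XML 1.0


def strip_whitespace_nodes(element):
    """Recursively remove all whitespace-only text nodes below element."""
    for child in list(element.childNodes):
        if child.nodeType == child.TEXT_NODE:
            if child.data.strip(XML_WS) == "":
                element.removeChild(child)
        elif child.nodeType == child.ELEMENT_NODE:
            strip_whitespace_nodes(child)
    return element


doc = minidom.parseString("<a>\n\t<b> keep me </b>\n\t<c/>\n</a>")
stripped = strip_whitespace_nodes(doc.documentElement).toxml()
print(stripped)
```

Text that contains any non-whitespace character is left untouched, including its leading and trailing spaces; only nodes made up entirely of whitespace are dropped.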
(Note that we would like to have some support for this in cxml, because whitespace rules also matter for indentation, and at some point we would like to have more flexible/correct/useful indentation modes in our serializer. Whitespace stripping could be considered as a form of indentation, in the sense that it is a "removal of all indentation". But so far, I haven't found the time to implement anything in this direction.)
d.
-- Marco Antoniotti