Quoting Marco Antoniotti (marcoxa@cs.nyu.edu):
I get all my actual RUNE-DOM::ELEMENTs interleaved with "bogus" TEXT
elements containing just #\Newline and #\Tab (or more #\Tab).
This is obviously an artifact of parsing. (See attached figure from a
15-minute CXML browser I whipped up.)
What I do not know is (1) whether this makes sense or not, or (2)
whether it is dependent on my platform (LWM).
Certainly -- XML preserves whitespace in character data, except for CRLF
to LF normalization.
There are no universally correct rules for whitespace normalization in
XML, and in general, any change to whitespace could change the meaning
of the document.
One rule that is relatively common is to consider whitespace
insignificant in "element content", i.e. in places where no
non-whitespace text nodes are allowed by the DTD.
This rule is implemented by CXML:MAKE-WHITESPACE-NORMALIZER
(see http://common-lisp.net/project/cxml/sax.html#misc), which may be
helpful in your case.
However, note that the limitation to element content means that you
actually need to write or find a DTD that matches your document.
Without a DTD, this approach doesn't work.
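
As a rough, untested sketch of how that might look: the file names
"example.dtd" and "example.xml" below are placeholders of mine, not
anything from your setup, and I am assuming the DTD lives in a separate
file that describes your document. The normalizer sits between the
parser and the DOM builder and drops whitespace-only character data in
elements that the DTD declares to have element content.

```lisp
;; Sketch only: "example.dtd" and "example.xml" are hypothetical file names.
(asdf:load-system :cxml)

;; Parse the DTD once, so it can be handed to the normalizer explicitly.
(defparameter *dtd*
  (cxml:parse-dtd-file "example.dtd"))

;; Parser -> whitespace normalizer -> DOM builder.
;; The normalizer filters the SAX events before the DOM builder sees them.
(defparameter *document*
  (cxml:parse-file "example.xml"
                   (cxml:make-whitespace-normalizer
                    (cxml-dom:make-dom-builder)
                    *dtd*)))
```

Elements with mixed content, or without a declaration in the DTD, should
be left untouched, so whitespace inside them will still show up as TEXT
nodes.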