Re: [cxml-devel] Parsing yields TEXT elements with #\Newline #\Tab

1 Dec 2009

      On Dec 1, 2009, at 15:55 , David Lichteblau wrote:
...
Quoting Marco Antoniotti (marcoxa@cs.nyu.edu):
...
2 - since this is not CXML default behavior, is there a way to get
CXML to do the "obvious" thing?
There is no single obvious thing.  You need to define which kind of
whitespace stripping you want.
AFAIU the DTD specifies how to deal with whitespaces.  The examples in  
the documentation seem to say that.
...
a. Strip all text nodes, including those that have non-whitespace in
    them?
b. Strip all text nodes that are made up of whitespace exclusively?
c. Take text nodes that have non-whitespace and whitespace, and remove
    the whitespace from them while keeping the non-whitespace?
d. Same as c, but "compress" such whitespace rather than removing it
    entirely?
e. Choose between c and d depending on what the parent element is?
f. Do b only depending on what the parent element is?
Case study:
- XSLT basically does b, with a couple of customization features.
- HTML does e
- the DTD-based thing is f
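
To make option b concrete, here is a minimal sketch in Python using the standard library's xml.dom.minidom. It is not cxml code. The function name strip_whitespace_only_text and the sample document are made up for illustration. It works after the fact on an already-built tree and removes only text nodes consisting entirely of XML whitespace (space, tab, CR, LF), leaving mixed text untouched.

```python
from xml.dom import Node, minidom

XML_WHITESPACE = " \t\r\n"

def strip_whitespace_only_text(node):
    """Option b: remove text nodes made up exclusively of whitespace.

    Hypothetical helper for illustration; walks the tree recursively and
    leaves text nodes containing any non-whitespace character untouched.
    """
    # Iterate over a copy, since we mutate childNodes while walking.
    for child in list(node.childNodes):
        if child.nodeType == Node.TEXT_NODE:
            if child.data.strip(XML_WHITESPACE) == "":
                node.removeChild(child)
                child.unlink()
        else:
            strip_whitespace_only_text(child)
    return node

# Made-up sample input, indented with newlines and tabs.
doc = minidom.parseString(
    '<sbml level="1">\n\t<model name="m">\n\t\t'
    '<notes>keep  this</notes>\n\t</model>\n</sbml>'
)
strip_whitespace_only_text(doc.documentElement)
result = doc.documentElement.toxml()
```

Note that the inner whitespace of "keep  this" survives: compressing or removing it would be option c or d, not b.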
...
I know that I could possibly remove the TEXT elements by hand, after
having built the internal structure; but it does not feel right.
There are two technical approaches to normalize whitespace with cxml's
APIs:
 - Do it on the fly, either in a SAX handler or a KLACKS source
 - Do it after the fact in the object model or application
The DTD-based thing is implemented as a SAX handler (first approach),
see cxml/xml/space-normalizer.lisp
XSLT-style normalization is available in Xuriella XSLT, implemented
using STP; see the function STRIP-STYLESHEET in xuriella/space.lisp.
Note that both implementation types I listed above are done entirely in
user code.  You don't need to change cxml to implement yet another
variety of whitespace stripping.
Just copy&paste the code and change it to suit your needs -- or rewrite
it.  STRIP-STYLESHEET is a total of 23 lines of code long, I think.
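
The first approach (doing it on the fly in a SAX handler) can be sketched in a language-neutral way with Python's standard xml.sax module. This is not the code in cxml/xml/space-normalizer.lisp, and it implements the simpler "drop whitespace-only text" policy, not the DTD-based one. The class name WhitespaceDroppingHandler and its events list are assumptions made for illustration. Text is buffered until the next element boundary, because a SAX parser may deliver one text node in several chunks.

```python
import xml.sax
from xml.sax.handler import ContentHandler

class WhitespaceDroppingHandler(ContentHandler):
    """Hypothetical handler: records events, dropping whitespace-only text."""

    def __init__(self):
        super().__init__()
        self.events = []     # recorded (kind, value) pairs
        self._buffer = []    # text chunks seen since the last element boundary

    def _flush(self):
        # Join buffered chunks into one text node; keep it only if it
        # contains at least one non-whitespace character.
        text = "".join(self._buffer)
        self._buffer = []
        if text.strip(" \t\r\n"):
            self.events.append(("text", text))

    def startElement(self, name, attrs):
        self._flush()
        self.events.append(("start", name))

    def endElement(self, name):
        self._flush()
        self.events.append(("end", name))

    def characters(self, content):
        self._buffer.append(content)

handler = WhitespaceDroppingHandler()
xml.sax.parseString(b"<a>\n\t<b>x y</b>\n</a>", handler)
events = handler.events
```

A real normalizer would forward the surviving events to a downstream handler (for example a tree builder) instead of collecting them in a list.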
Ok, that is a lot of work on my part AFAIAC.  I think I understand the  
mechanics of what you are saying, but you are not answering my question.

I gave you the first two lines of the SBML document.  SBML comes with an  
XSchema definition.  I am assuming that having the xsd will be  
equivalent to having the DTD (I think I am right on this) and  
therefore have the correct indication about what is what and how it  
should be parsed.

<?xml version="1.0" encoding="UTF-8"?>
<sbml xmlns="http://www.sbml.org/sbml/level1" level="1" version="1">
...
</sbml>

is what I have.

Can CXML be coerced into accessing the xmlns="http://www.sbml.org/sbml/level1"
(with DRAKMA), understanding it and using it or not?  (Thus -
hopefully - stripping the TEXT elements automatically?)

If yes, how?

IMHO, it would be quite a plus to be able to deal with a case like  
this automatically (i.e., SBML) without much user intervention,  
especially as a post-processing step.

Cheers

--
Marco Antoniotti