DTD (Document Type Definition) was the original validation system for HTML and XML. It lets you specify where significant text may appear and which elements and attributes can be used together. That was fine for HTML, because you aren't doing anything more than showing text to a user, and it isn't all that difficult to implement.
However, XML (unlike HTML) was intended for inter-application data transfer. Developers using XML this way soon ran into problems when they wanted to say that a particular bit of text in the document is a number, a date, etc., which a DTD cannot express.
There were also issues with XML documents that needed to contain elements from a variety of schema sources. If you wanted to use two DTDs together, you had unpleasant problems whenever they both defined an element with the same name.
XML Schema was one of the validation alternatives created to bridge this gap, and it became popular enough to be integrated into a number of XML libraries. XML Schema is, in effect, a superset of DTD, but it uses a very different syntax for its files, so a DTD is not a valid XML Schema, and an XML Schema is not a valid DTD.
All of the major Java XML bindings will validate against XML Schema, so they will let you ignore irrelevant whitespace, if you decide to go that way.
Looking at http://sbml.org/Software/libSBML, they claim to offer Lisp (among other languages) bindings to their library, which probably makes more sense than using CXML (or any other generic XML library) in this case.
Finally, the question that made me join the list!
There are two technical approaches to normalize whitespace with cxml's APIs:
- Do it on the fly, either in a SAX handler or a KLACKS source
- Do it after the fact in the object model or application
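To make the "on the fly in a SAX handler" option concrete, here is a minimal sketch. It uses Python's standard-library `xml.sax` rather than cxml, so none of these names are cxml API; `WhitespaceDropper`, `Collector` and `parse_without_whitespace` are hypothetical names made up for the illustration. The idea is a filter that sits between the parser and the real handler, buffers text between element events, and forwards a run only if it contains something other than whitespace.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase


class WhitespaceDropper(XMLFilterBase):
    """SAX filter that swallows text runs made up only of whitespace.

    Assumes the parser runs in its default (non-namespace) mode, so only
    startElement/endElement need to flush the buffered text.
    """

    def __init__(self, parent=None):
        super().__init__(parent)
        self._buffer = []

    def _flush(self):
        # The parser may deliver one text run in several chunks, so the
        # decision is made on the joined run, not on each chunk.
        text = "".join(self._buffer)
        self._buffer = []
        if text.strip():
            super().characters(text)

    def characters(self, content):
        self._buffer.append(content)

    def ignorableWhitespace(self, chars):
        pass  # never forwarded

    def startElement(self, name, attrs):
        self._flush()
        super().startElement(name, attrs)

    def endElement(self, name):
        self._flush()
        super().endElement(name)


class Collector(xml.sax.ContentHandler):
    """Downstream handler that just records the events it receives."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        self.events.append(("text", content))


def parse_without_whitespace(xml_text):
    reader = WhitespaceDropper(xml.sax.make_parser())
    collector = Collector()
    reader.setContentHandler(collector)
    reader.parse(io.BytesIO(xml_text.encode("utf-8")))
    return collector.events
```

Without a DTD or schema the filter cannot know whether a whitespace-only run is meaningful, so this treats every such run as noise. That is a safe assumption only for data-oriented documents without mixed content.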
I'm still new to Lisp, but I can't see how to make a KLACKS source that removes whitespace. The docs mention bridging KLACKS and SAX, but the functions they mention appear only to let you send KLACKS events to SAX handlers, and not the other way around?
My problem is a large XML database (owned by another application) that I want to pull data out of (and not just once). Building the DOM tree is kinda slow, and while CXML does better than Firefox (which exhausts its address space before it manages to render the file), it's less than ideal, particularly when there are large numbers of whitespace nodes that I immediately want to skip past (which complicates my KLACKS-based recursive descent parser).
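For what it's worth, the shape I am after looks roughly like the sketch below. It is written against Python's standard-library `xml.dom.pulldom` as a stand-in pull parser, not against KLACKS, and `significant_events`, `parse_element` and `parse_document` are hypothetical names for this illustration. The pull source is wrapped in a generator that merges adjacent text chunks and drops whitespace-only runs, so the recursive descent code never sees them.

```python
import io
from xml.dom import pulldom


def significant_events(xml_bytes):
    """Yield (kind, value) pull events, skipping whitespace-only text runs."""
    pending = []  # text chunks seen since the last non-text event

    for event, node in pulldom.parse(io.BytesIO(xml_bytes)):
        if event == pulldom.CHARACTERS:
            pending.append(node.data)
            continue
        # Any non-text event ends the current text run.
        text = "".join(pending)
        pending = []
        if text.strip():
            yield ("text", text)
        if event == pulldom.START_ELEMENT:
            yield ("start", node.tagName)
        elif event == pulldom.END_ELEMENT:
            yield ("end", node.tagName)


def parse_element(events, name):
    """Recursive descent, entered after ("start", name) has been consumed."""
    children = []
    for kind, value in events:
        if kind == "start":
            children.append(parse_element(events, value))
        elif kind == "text":
            children.append(value)
        elif kind == "end":
            return (name, children)
    raise ValueError("input ended inside <%s>" % name)


def parse_document(xml_bytes):
    events = significant_events(xml_bytes)
    for kind, value in events:
        if kind == "start":
            return parse_element(events, value)
    raise ValueError("no root element found")
```

The descent functions share one generator, so each recursive call simply continues consuming where its caller left off, and nothing beyond the current text run is held in memory by the wrapper itself.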
On Dec 1, 2009, at 16:48 , David Lichteblau wrote:
Quoting Marco Antoniotti (marcoxa at cs.nyu.edu):
Ok, that is a lot of work on my part AFAIAC. I think I understand the mechanics of what you are saying, but you are not answering my question.
I gave you the first two lines of the SBML document. SBML comes with an XSchema definition. I am assuming that having the xsd will be equivalent to having the DTD (I think I am right on this) and therefore have the correct indication about what is what and how it should be parsed.
In general, you cannot translate XML Schema into a DTD.
Ok. It just strengthens my hunch about XML being quite messy.
<?xml version="1.0" encoding="UTF-8"?>
<sbml xmlns="http://www.sbml.org/sbml/level1" level="1" version="1"> ... </sbml>
is what I have.
Can CXML be coerced into accessing the xmlns="http://www.sbml.org/sbml/level1" (with DRAKMA), understanding it and using it or not? (Thus - hopefully - stripping the TEXT elements automatically?)
If yes, how?
No, cxml does not do that. It has no support for XML Schema at all.
Ok. But what about Xerces or the Java libraries; would they do it?
Cheers
-- Marco