[cxml-devel] Parsing yields TEXT elements with #\Newline #\Tab

Marco Antoniotti

5 Feb 2009

9:50 p.m.

Hi,

I have been (successfully) parsing some SBML files with CXML on LWM. However, I noticed the following effect: I get all my actual RUNE-DOM::ELEMENTs interleaved with "bogus" TEXT elements containing just #\Newline and #\Tab (or more #\Tab). This is obviously an artifact of parsing. (See the attached figure from a 15-minute CXML browser I whipped up.)

What I do not know is (1) whether this makes sense or not, or (2) whether it is dependent on my platform (LWM).

Does this make any sense to you? Note that I was observing the same effect with CL-XML.

Cheers

-- Marco Antoniotti

Attachments:

cxml-browser.tiff (image/tiff — 125.4 KB)

David Lichteblau

6 Feb 2009

9:21 a.m.

Quoting Marco Antoniotti (marcoxa@cs.nyu.edu):

...

I get all my actual RUNE-DOM::ELEMENTs interleaved with "bogus" TEXT elements containing just #\Newline and #\Tab (or more #\Tab). This is obviously an artifact of parsing. (See attached figure from a 15 minutes CXML browser I whipped up)

What I do not know it's (1) whether this makes sense or not, or (2) whether it is dependent on my platform (LWM).

Certainly -- XML preserves whitespace in character data, except for CRLF to LF normalization.

There are no universally correct rules for whitespace normalization in XML, and in general, any change to whitespace could change the meaning of the document.

One rule that is relatively common is to consider whitespace insignificant in "element content", i.e. in places where no non-whitespace text nodes are allowed by the DTD.

This rule is implemented by CXML:MAKE-WHITESPACE-NORMALIZER (see http://common-lisp.net/project/cxml/sax.html#misc), which may be helpful in your case.

However, note that the limitation to element content means that you actually need to write or find a DTD that matches your document. Without a DTD, this approach doesn't work.

Other approaches are to use HTML rules for whitespace normalization (which are trickier to get right, though, and cxml does not provide a ready-to-use function for this) or to discard all whitespace. It really depends on the schema and application.

(Note that we would like to have some support for this in cxml, because whitespace rules also matter for indentation, and at some point we would like to have more flexible/correct/useful indentation modes in our serializer. Whitespace stripping could be considered as a form of indentation, in the sense that it is a "removal of all indentation". But so far, I haven't found the time to implement anything in this direction.)

d.
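The effect is not specific to CXML or to one platform: any DOM builder that follows the XML rules keeps the indentation as text nodes. As a rough illustration only, here is a sketch using Python's standard library rather than cxml. The sample document and the helper name `strip_blank_text` are made up for this example; it shows the whitespace-only text nodes around an element and then applies the "discard whitespace-only text" approach after parsing.

```python
from xml.dom import minidom

# Made-up SBML-like sample, indented with newlines and tabs.
SAMPLE = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<sbml xmlns="http://www.sbml.org/sbml/level1" level="1" version="1">\n'
    b'\t<model name="demo"/>\n'
    b'</sbml>\n'
)


def strip_blank_text(node):
    """Remove, in place, every text node made up only of XML whitespace."""
    for child in list(node.childNodes):  # iterate over a copy, we mutate the tree
        if child.nodeType == child.TEXT_NODE and child.data.strip(" \t\r\n") == "":
            node.removeChild(child)
        else:
            strip_blank_text(child)


root = minidom.parseString(SAMPLE).documentElement
before = [child.nodeName for child in root.childNodes]
strip_blank_text(root)
after = [child.nodeName for child in root.childNodes]

print(before)  # ['#text', 'model', '#text']
print(after)   # ['model']
```

This blanket rule is only safe when the document has no mixed content where whitespace carries meaning, which is the caveat made above about changes to whitespace.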

Marco Antoniotti

3:34 p.m.

On Feb 6, 2009, at 10:21 , David Lichteblau wrote:

...

This rule is implemented by CXML:MAKE-WHITESPACE-NORMALIZER (see http://common-lisp.net/project/cxml/sax.html#misc), which may be helpful in your case.

However, note that the limitation to element content means that you actually need to write or find a DTD that matches your document. Without a DTD, this approach doesn't work.

Ok. I think I understand this. I'll try CXML:MAKE-WHITESPACE-NORMALIZER (I need to understand how to use it first).

However, let me ask you this too. The SBML XML files start like this:

<?xml version="1.0" encoding="UTF-8"?>
<sbml xmlns="http://www.sbml.org/sbml/level1" level="1" version="1">

Am I correct in assuming that CXML would be able to forgo the DTD if it could access the www.sbml.org site and find a DTD or an XSD there?

Pardon the naïveté of my questions, but I really do not know enough about XML.

Cheers

-- Marco

Marco Antoniotti

25 Nov 2009

1:09 p.m.

Hi,

months ago I posted this problem and I still do not have a solution for it.

Note that I do not have a DTD for the file I want to parse (and I neither want to nor can write one).

The SBML XML files I want to read start like this:

<?xml version="1.0" encoding="UTF-8"?>
<sbml xmlns="http://www.sbml.org/sbml/level1" level="1" version="1">

Am I correct in assuming that CXML would be able to forgo the DTD if it could access the www.sbml.org site and find a DTD or an XSD there?

Note that my problem is to get rid of the spurious TEXT elements.

http://www.sbml.org/sbml/level1 points to an XSD file.

Cheers

-- Marco

On Feb 6, 2009, at 16:34, Marco Antoniotti wrote:

...

Marco Antoniotti

30 Nov 2009

11:53 p.m.

Well... Any ideas on this, or is the list dead?

Cheers

Marco

On Nov 25, 2009, at 14:09, Marco Antoniotti wrote:

...

Cyrus Harmon

1 Dec 2009

5:39 a.m.

Marco,

Maybe the problem isn't the liveness of the list (although, yes, it isn't particularly active), but rather the phrasing of your question:

"Am I correct in assuming that CXML would be able to forgo the DTD if it could access the www.sbml.org site and find a DTD or a XSD there?"

What are you suggesting CXML should do by ignoring (if that's what you mean by forgo) a DTD that it may or may not find at some site?

Do you have a URL to an SBML file with which the folks at home can try to convince CXML to forgo the DTD (whatever that means)?

thanks,

Cyrus

On Nov 30, 2009, at 3:53 PM, Marco Antoniotti wrote:

...

Marco Antoniotti

2:38 p.m.

Dear Cyrus,

On Dec 1, 2009, at 06:39, Cyrus Harmon wrote:

...

you are right. I am not very well versed in XML stuff, so my question is not really well phrased.

Let's recapitulate: AFAIU, there is a way to tell CXML to use a particular DTD in order to make CXML:MAKE-WHITESPACE-NORMALIZER work as expected, i.e., drop the extra TEXT elements corresponding to newlines and indentation. There is also a way to tell CXML to use a "resolver" that downloads the DTD from the net using DRAKMA (or another HTTP client library).

Now, in the case of SBML I do not have a DTD. I have a file that starts as

<?xml version="1.0" encoding="UTF-8"?>
<sbml xmlns="http://www.sbml.org/sbml/level1" level="1" version="1">

AFAIU, using the xmlns="http://www.sbml.org/sbml/level1" should make it possible for CXML to get that document (which eventually is a .xsd), maybe using DRAKMA, and therefore make it parse the rest of the file, automagically dropping the TEXT elements. In this sense this would make CXML "ignore" the DTD.

So my questions are:

1 - Is this a correct assumption?

2 - Since this is not CXML's default behavior, is there a way to get CXML to do the "obvious" thing?

I know that I could possibly remove the TEXT elements by hand, after having built the internal structure; but it does not feel right.

If reading an XSchema file results in a DTD, then CXML would not really "ignore it". I am just wondering whether it is possible and where exactly to look in the code base.

Thanks

-- Marco

David Lichteblau

2:55 p.m.

Quoting Marco Antoniotti (marcoxa@cs.nyu.edu):

...

2 - since this is not CXML default behavior, is there a way to get CXML to do the "obvious" thing?

There is no single obvious thing. You need to define which kind of whitespace stripping you want.

a. Strip all text nodes, including those that have non-whitespace in them?

b. Strip all text nodes that are made up of whitespace exclusively?

c. Take text nodes that have non-whitespace and whitespace, and remove the whitespace from them while keeping the non-whitespace?

d. Same as c, but "compress" such whitespace rather than removing it entirely?

e. Choose between c and d depending on what the parent element is?

f. Do b only depending on what the parent element is?

Case study:

- XSLT basically does b, with a couple of customization features.

- HTML does e

- the DTD-based thing is f
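To make options b, c and d concrete, they can be written as small functions over the text of a single node. This is a sketch in Python, not cxml code: the function names are hypothetical, and it reads option c as "delete every whitespace character", which is only one possible interpretation of the wording above. Whitespace here means the four characters XML itself treats as whitespace (space, tab, CR, LF).

```python
import re

XML_WS = " \t\r\n"                    # the characters XML 1.0 counts as whitespace
XML_WS_RUN = re.compile(r"[ \t\r\n]+")


def option_b(text):
    """Drop the node (return None) only if it is whitespace from start to end."""
    return None if text.strip(XML_WS) == "" else text


def option_c(text):
    """Remove every whitespace character and keep the rest (assumed reading of c)."""
    return XML_WS_RUN.sub("", text)


def option_d(text):
    """Compress each run of whitespace into a single space."""
    return XML_WS_RUN.sub(" ", text)


print(option_b("\n\t"))           # None
print(repr(option_c(" a  b\n")))  # 'ab'
print(repr(option_d(" a  b\n")))  # ' a b '
```

Options e and f would wrap these in a check on the parent element, which is exactly the information a DTD (or a hand-written list of element names) has to supply.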

...

I know that I could possibly remove the TEXT elements by hand, after having built the internal structure; but it does not feel right.

There are two technical approaches to normalize whitespace with cxml's APIs:

- Do it on the fly, either in a SAX handler or a KLACKS source

- Do it after the fact in the object model or application

The DTD-based thing is implemented as a SAX handler (first approach), see cxml/xml/space-normalizer.lisp

XSLT-style normalization is available in Xuriella XSLT, implemented using STP; see the function STRIP-STYLESHEET in xuriella/space.lisp.

Note that both implementation types I listed above are done entirely in user code. You don't need to change cxml to implement yet another variety of whitespace stripping.

Just copy&paste the code and change it to suit your needs -- or rewrite it. STRIP-STYLESHEET is a total of 23 lines of code long, I think.

d.
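The "on the fly" approach can be sketched as a SAX handler that sits in front of the real handler and swallows whitespace-only text (option b). The example below uses Python's `xml.sax` purely as an illustration; it is not the code in space-normalizer.lisp, and the class names `BlankTextFilter` and `EventRecorder` are invented for the sketch. Character data is buffered until the next element boundary because a SAX parser may deliver one run of text in several chunks.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class BlankTextFilter(ContentHandler):
    """Forward events to `downstream`, dropping whitespace-only text."""

    def __init__(self, downstream):
        super().__init__()
        self._downstream = downstream
        self._pending = []  # text chunks seen since the last element event

    def _flush(self):
        text = "".join(self._pending)
        self._pending.clear()
        if text.strip(" \t\r\n"):  # keep only text with real content
            self._downstream.characters(text)

    def characters(self, content):
        self._pending.append(content)

    def startElement(self, name, attrs):
        self._flush()
        self._downstream.startElement(name, attrs)

    def endElement(self, name):
        self._flush()
        self._downstream.endElement(name)


class EventRecorder(ContentHandler):
    """Stand-in for the real consumer: records the events it receives."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        self.events.append(("text", content))


recorder = EventRecorder()
xml.sax.parseString(b"<a>\n\t<b>x</b>\n</a>", BlankTextFilter(recorder))
print(recorder.events)
# [('start', 'a'), ('start', 'b'), ('text', 'x'), ('end', 'b'), ('end', 'a')]
```

The "after the fact" approach is the same test applied while walking a finished tree, so which one to use is mostly a question of whether building the tree first is affordable.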

Marco Antoniotti

3:36 p.m.

On Dec 1, 2009, at 15:55 , David Lichteblau wrote:

...

Quoting Marco Antoniotti (marcoxa@cs.nyu.edu):

...
2 - since this is not CXML default behavior, is there a way to get CXML to do the "obvious" thing?

There is no single obvious thing. You need to define which kind of whitespace stripping you want.

AFAIU, the DTD specifies how to deal with whitespace. The examples in the documentation seem to say that.

...

Just copy&paste the code and change it to suit your needs -- or rewrite it. STRIP-STYLESHEET is a total of 23 lines of code long, I think.

Ok, that is a lot of work on my part AFAIAC. I think I understand the mechanics of what you are saying, but you are not answering my question.

I gave you the first two lines of the SBML document. SBML comes with an XSchema definition. I am assuming that having the XSD will be equivalent to having the DTD (I think I am right on this) and that it therefore gives the correct indication about what is what and how it should be parsed.

<?xml version="1.0" encoding="UTF-8"?>
<sbml xmlns="http://www.sbml.org/sbml/level1" level="1" version="1">
...
</sbml>

is what I have.

Can CXML be coerced into accessing the xmlns="http://www.sbml.org/sbml/level1" (with DRAKMA), understanding it and using it or not? (Thus - hopefully - stripping the TEXT elements automatically?)

If yes, how?

IMHO, it would be quite a plus to be able to deal with a case like this (i.e., SBML) automatically, without much user intervention, especially as a post-processing step.

Cheers

-- Marco Antoniotti

Marco Antoniotti

3:38 p.m.

On Dec 1, 2009, at 16:36, Marco Antoniotti wrote:

...
Just copy&paste the code and change it to suit your needs -- or rewrite it. STRIP-STYLESHEET is a total of 23 lines of code long, I think.

Ok, that is a lot of work on my part AFAIAC.

Sorry.... forgot a smiley :) Cheers -- Marco Antoniotti

David Lichteblau

3:48 p.m.

Quoting Marco Antoniotti (marcoxa@cs.nyu.edu):

...

Ok, that is a lot of work on my part AFAIAC. I think I understand the mechanics of what you are saying, but you are not answering my question.

I gave you the first two lines of theSBML document. SBML comes with a XSchema definition. I am assuming that having the xsd will be equivalent to having the DTD (I think I am right on this) and therefore have the correct indication about what is what and how it should be parsed.

In general, you cannot translate XML Schema into a DTD.

...

<?xml version="1.0" encoding="UTF-8"?> <sbml xmlns="http://www.sbml.org/sbml/level1" level="1" version="1"> ... </sbml>

is what I have.

Can XML be coerced into accessing the xmlns="http://www.sbml.org/sbml/level1" (with DRAKMA), understanding it and using it or not? (Thus - hopefully - stripping the TEXT elements automatically?)

If yes, how?

No, cxml does not do that. It has no support for XML Schema at all.

d.

Marco Antoniotti

9:47 p.m.

On Dec 1, 2009, at 16:48 , David Lichteblau wrote:

...

In general, you cannot translate XML Schema into a DTD.

Ok. It just strengthens my hunch about XML being quite messy.

...

No, cxml does not do that. It has no support for XML Schema at all.

Ok. But what about xerces or the Java libraries; would they do it? Cheers -- Marco

Peter Stirling

5 Dec 2009

4:30 p.m.

DTD (Document Type Definition) was the original validation system for HTML & XML. It allows you to specify where you can put significant text, and the kinds of elements and attributes that can be used together. This was fine for HTML, because you aren't doing anything more than showing text to a user, and it isn't all that difficult to implement.

However, XML (unlike HTML) was intended for inter-application data transfer. Developers using XML this way soon had problems where you want to say that a particular bit of text in the document is a number, or date, etc.

There were also issues with making XML documents that wanted to contain elements from a variety of schema sources. If you had 2 DTDs that you wanted to use together, then you would have unpleasant issues if they contained an element in common.

XML Schema was one of the validation alternatives that was created in order to bridge this gap, and it became popular enough to be integrated in a number of the XML libraries. XML Schema is a superset of DTD, in effect, but it uses a very different syntax for its files, so a DTD is not a valid XML Schema, and an XML Schema is not a valid DTD.

All of the major Java XML bindings will validate with XML Schema, so they will allow you to ignore irrelevant whitespace, if you decide to go that way.

Looking at http://sbml.org/Software/libSBML, they claim to offer Lisp (among other languages) bindings to their library, which probably makes more sense than using CXML (or any other generic XML library) in this case.

Finally, the question that made me join the list!

...

There are two technical approaches to normalize whitespace with cxml's APIs: - Do it on the fly, either in a SAX handler or a KLACKS source - Do it after the fact in the object model or application

I'm still new with Lisp, but I can't see how to make a KLACKS source that will remove whitespace. The docs mention bridging KLACKS and SAX, but the functions they mention only appear to allow you to send KLACKS events to SAX handlers, and not the other way around?

My problem is a large XML database (owned by another application) that I want to pull data out of (and not just once). Building the DOM tree is kinda slow, and while CXML does better than Firefox (address space exhaustion before it manages to render the file), it's less than ideal, particularly when there are large numbers of whitespace nodes that I immediately want to skip past (which is complicating my KLACKS-based recursive descent parser).
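One way to picture a whitespace-skipping pull source is as a thin wrapper around the event stream that the recursive descent parser reads from. The sketch below uses Python's `xml.dom.pulldom` as a stand-in for a pull API; it is not KLACKS code and does not answer how to build such a source in cxml. The names `skip_blank_text` and `element_and_text_events` are hypothetical.

```python
from xml.dom import pulldom


def skip_blank_text(events):
    """Yield pull events unchanged, except whitespace-only character events."""
    for event, node in events:
        if event == pulldom.CHARACTERS and node.data.strip(" \t\r\n") == "":
            continue  # indentation between elements: never shown to the consumer
        yield event, node


def element_and_text_events(xml_text):
    """Tiny consumer: list the element starts and text that survive the wrapper."""
    seen = []
    for event, node in skip_blank_text(pulldom.parseString(xml_text)):
        if event == pulldom.START_ELEMENT:
            seen.append(("start", node.tagName))
        elif event == pulldom.CHARACTERS:
            seen.append(("text", node.data))
    return seen


print(element_and_text_events("<db>\n\t<row>42</row>\n</db>"))
# [('start', 'db'), ('start', 'row'), ('text', '42')]
```

Because the wrapper is itself just an event source, the parser on top of it does not need any whitespace checks of its own. One caveat: a pull parser may split a single run of text into several events, so in mixed content a whitespace-only chunk could belong to meaningful text; the sketch assumes data-style documents where that does not happen.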

...

On Dec 1, 2009, at 16:48 , David Lichteblau wrote:

...
Quoting Marco Antoniotti (marcoxa at cs.nyu.edu):

...
Ok, that is a lot of work on my part AFAIAC. I think I understand the mechanics of what you are saying, but you are not answering my question.

I gave you the first two lines of theSBML document. SBML comes with a XSchema definition. I am assuming that having the xsd will be equivalent to having the DTD (I think I am right on this) and therefore have the correct indication about what is what and how it should be parsed.

In general, you cannot translate XML Schema into a DTD.

Ok. It just strenghten my hunches about XML being quite messy.

...
...
<?xml version="1.0" encoding="UTF-8"?> <sbml xmlns="http://www.sbml.org/sbml/level1" level="1" version="1"> ... </sbml>

is what I have.

Can XML be coerced into accessing the xmlns="http://www.sbml.org/sbml/level1" (with DRAKMA), understanding it and using it or not? (Thus - hopefully - stripping the TEXT elements automatically?)

If yes, how?

No, cxml does not do that. It has no support for XML Schema at all.

Ok. But what about xerces or the Java libraries; would they do it?

Cheers

-- Marco

Marco Antoniotti

7 Dec

9:21 a.m.

On Dec 5, 2009, at 17:30 , Peter Stirling wrote:

...

DTD (Document Type Definition) was the original validation system for HTML & XML; it allows you to specify where you can put significant text, and the kinds of elements and attributes that can be used together. This was fine for HTML, because you aren't doing anything more than showing text to a user, and it isn't all that difficult to implement.

However, XML (unlike HTML) was intended for inter-application data transfer. Developers using XML this way soon ran into problems when they wanted to say that a particular bit of text in the document is a number, a date, etc.

There were also issues with making XML documents that needed to contain elements from a variety of schema sources. If you had two DTDs that you wanted to use together, you would have unpleasant issues if they had an element in common.

XML Schema was one of the validation alternatives created to bridge this gap, and it became popular enough to be integrated into a number of the XML libraries. XML Schema is, in effect, a superset of DTD, but it uses a very different syntax for its files, so a DTD is not a valid XML Schema, and an XML Schema is not a valid DTD.

All of the major Java XML bindings will validate with XML Schema, so they will allow you to ignore irrelevant whitespace, if you decide to go that way.

Which raises the question of having XML Schema incorporated into CXML. I have gotten the feeling that this is not easy to set up, although it seems to be TRT. It is just sad that it is that way.

...

Looking at http://sbml.org/Software/libSBML they claim to offer Lisp (among other languages) bindings to their library, which probably makes more sense than using CXML (or any other generic XML library) in this case.

I did the first CL parsing of SBML using CL-XML (CL-SBML on common-lisp.net), and at that point I had to do some post-processing massaging. Next I tried to make a CXML port, and I stumbled upon the same problem, which is what made me ask these questions.

There is another SBML CL binding which uses libSBML via FFI. Maybe that is the way to go, as libSBML does some extra validity checking. Being where we are (a CL related list) I'd say that that sucks a bit. :)

Cheers

-- Marco


-- Marco Antoniotti
