I have been using closure-html and cxml to extract data from web pages. Up until now it has worked quite well, but I have come across a web page that has quite extraordinarily malformed html markup.
I am looking for suggestions on how to continue parsing even when attribute names are malformed. Here is a self-contained example containing some of the offending markup:
(require :closure-html)
(require :cxml-stp)
(defvar *html-str*)
(setf *html-str* "<html> <body> <table> <tr><td vertical-align:=\"\" ;=\"\"></td></tr> </table> </body> </html> ")
(chtml:parse *html-str* (cxml-stp:make-builder))
This example gives CXML:WELL-FORMEDNESS-VIOLATION. I guess I could write a custom SAX handler to skip attributes with these errors, but is there a simpler solution? Should the chtml parser be patched to handle this case? Are there any other ways to make parsing more robust? Or is this markup beyond hope of repair?
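For reference, the custom SAX handler I have in mind would look roughly like the sketch below. It is not tested against the real page. The names plausible-attribute-name-p and attribute-filter are made up for illustration, and the name check is a deliberately conservative ASCII-only approximation, not the XML Name production. It also assumes that attribute names arrive as Lisp strings and that the error is signalled by the builder rather than inside chtml itself.

```lisp
(require :closure-html)
(require :cxml-stp)

;; Hypothetical helper: conservative, ASCII-only approximation of a
;; colon-free XML name. The real Name production allows more characters.
(defun plausible-attribute-name-p (name)
  (and (stringp name)
       (plusp (length name))
       (let ((c0 (char name 0)))
         (or (char= c0 #\_)
             (and (alpha-char-p c0) (< (char-code c0) 128))))
       (every (lambda (c)
                (and (< (char-code c) 128)
                     (or (alphanumericp c) (find c "-_."))))
              name)))

;; Hypothetical filter class: sits in front of the real handler and
;; forwards every event unchanged, except that START-ELEMENT drops
;; attributes whose qname fails the check above.
(defclass attribute-filter (cxml:sax-proxy) ())

(defmethod sax:start-element ((handler attribute-filter)
                              uri lname qname attributes)
  (call-next-method handler uri lname qname
                    (remove-if-not #'plausible-attribute-name-p
                                   attributes
                                   :key #'sax:attribute-qname)))

;; Same markup as in the example above, with the filter chained in
;; front of the STP builder.
(defparameter *bad-html*
  "<html><body><table><tr><td vertical-align:=\"\" ;=\"\"></td></tr></table></body></html>")

(chtml:parse *bad-html*
             (make-instance 'attribute-filter
                            :chained-handler (cxml-stp:make-builder)))
```

This silently discards the offending attributes, which is fine for my data extraction but may not be what everyone wants.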
Regards,
Russell
Hi,
Quoting Russell Kliese (russell@kliese.id.au):
> I have been using closure-html and cxml to extract data from web pages. Up until now it has worked quite well, but I have come across a web page that has quite extraordinarily malformed html markup.
> I am looking for suggestions on how to continue parsing even when attribute names are malformed. Here is a self-contained example containing some of the offending markup:
> [...]
> This example gives CXML:WELL-FORMEDNESS-VIOLATION. I guess I could write a custom SAX handler to skip attributes with these errors, but is there a simpler solution? Should the chtml parser be patched to handle this case? Are there any other ways to make parsing more robust? Or is this markup beyond hope of repair?
Thanks for the report.
It should be the chtml parser's responsibility to clean up data sufficiently that it would be well-formed as XML, and use of an STP builder as in your code is a pretty good test case to ensure that this would be the case.
When Closure was written, I think a lot of effort was put into making it correct errors, but that logic is just not 100% complete. We'd need a set of good test cases.
Unfortunately I don't have a ready-to-use patch at this point; do you have a suggestion? There are probably a few minor dependency issues, in that some utility functions needed to determine whether data is OK are currently part of the XML parser, which the HTML parser does not depend on in terms of ASDF. But I'm sure those issues would be solvable.
d.
Hi David,
Thanks for your reply.
It sounds like I would need to do some careful reading of the HTML and XML specs in order to come up with a patch to get parsing right.
It may be possible, however, to solve the problem without having to perfect the parser. I found that HTML Tidy can deal with the malformed attributes if you use the option that removes proprietary attributes. This way only a set of known-good attributes is passed to the SAX builder. Would this be appropriate for Closure HTML?
If you care to look at Tidy (I was using HTML Tidy for Linux released on 25 March 2009), here is an example invocation:
echo '<html> <body> <table> <tr><td vertical-align:="" ;=""></td></tr> </table> </body> </html> ' | tidy -q -asxml --force-output yes --drop-proprietary-attributes yes
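The same whitelist idea could presumably be applied on the Lisp side. Here is a minimal sketch of what I mean, with attributes modelled as plain (name . value) pairs rather than real SAX attribute objects. The names *known-attributes* and keep-known-attributes are made up for illustration, and the list is heavily abbreviated; a real one would have to cover every attribute of the targeted HTML version.

```lisp
;; Hypothetical, abbreviated whitelist of attribute names.
(defparameter *known-attributes*
  '("align" "bgcolor" "class" "colspan" "height" "href" "id" "rowspan"
    "src" "style" "title" "valign" "width"))

;; Hypothetical helper: keep only the (name . value) pairs whose name is
;; on the whitelist. HTML attribute names are case-insensitive, hence
;; STRING-EQUAL.
(defun keep-known-attributes (attributes)
  (remove-if-not (lambda (name)
                   (member name *known-attributes* :test #'string-equal))
                 attributes
                 :key #'car))

;; The two malformed names from the example are dropped.
(keep-known-attributes
 '(("vertical-align:" . "") (";" . "") ("class" . "x")))
;; => (("class" . "x"))
```

Unlike a syntactic check, this also drops well-formed but unknown attributes, so it is the more aggressive of the two options.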
Regards,
Russell
On 28/10/11 23:13, David Lichteblau wrote:
> Quoting Russell Kliese (russell@kliese.id.au):
>> I am looking for suggestions on how to continue parsing even when attribute names are malformed.
> When Closure was written, I think a lot of effort was put into making it correct errors, but that logic is just not 100% complete. We'd need a set of good test cases.
> Unfortunately I don't have a ready-to-use patch at this point; do you have a suggestion?
Hi,
Quoting Russell Kliese (russell@kliese.id.au):
> It may be possible, however, to solve the problem without having to perfect the parser. I found that HTML Tidy can deal with the malformed attributes if you use the option that removes proprietary attributes. This way only a set of known-good attributes is passed to the SAX builder. Would this be appropriate for Closure HTML?
I've been using HTML Tidy for several years to clean up HTML. It's a very extensive solution, and was helpful to us at the time. However, I had to modify its code to fix some of its problems. The basic issue seems to be that Tidy is too aggressive: it does not agree with the more gentle cleanup performed by actual browsers (and as specified through HTML5). So if you want to pass W3C validators for HTML 4 Strict, then Tidy will indeed give output which achieves that goal. But that output won't actually look like the input when rendered by an actual browser...
None of that is an excuse for Closure HTML to be too lenient though; lack of time has so far prevented me from migrating my HTML Tidy-based codebase to Closure HTML, which would entail working through tests for proper HTML5-style cleanup.
d.