Hi,
Quoting Russell Kliese (russell@kliese.id.au):
I have been using closure-html and cxml to extract data from web pages. Up until now it has worked quite well, but I have come across a web page with extraordinarily malformed HTML markup.
I am looking for suggestions on how to continue parsing even when attribute names are malformed. Here is a self-contained example containing some of the offending markup:
[...]
This example gives CXML:WELL-FORMEDNESS-VIOLATION. I guess I could write a custom SAX handler to skip attributes with these errors, but is there a simpler solution? Should the chtml parser be patched to handle this case? Are there any other ways to make parsing more robust? Or is this markup beyond hope of repair?
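For what it's worth, the custom-SAX-handler workaround you mention could look roughly like the sketch below: a filtering handler, interposed between the HTML parser and the STP builder, that drops attributes whose names are malformed before they reach the builder. This is untested and makes some assumptions: that cxml's documented BROADCAST-HANDLER class (with its :HANDLERS initarg) and the SAX:ATTRIBUTE-QNAME accessor behave as described in the cxml manual, and that a crude name check is good enough for your data. *BROKEN-HTML* is a placeholder for your input, and ROUGHLY-VALID-NAME-P is only a loose approximation of the real XML Name production, not the actual well-formedness check.

```lisp
;; Sketch of a SAX filter that discards attributes with malformed names.
(defclass attribute-cleaner (cxml:broadcast-handler) ())

(defun roughly-valid-name-p (name)
  ;; Approximation only: letters, digits, and . - _ : allowed, and the
  ;; name must not start with a digit.  The real XML Name production is
  ;; more permissive (and more complicated).
  (and (plusp (length name))
       (not (digit-char-p (char name 0)))
       (every (lambda (c)
                (or (alphanumericp c)
                    (find c ".-_:")))
              name)))

(defmethod sax:start-element ((handler attribute-cleaner)
                              namespace-uri local-name qname attributes)
  ;; Forward the event with the bad attributes removed.
  (call-next-method handler namespace-uri local-name qname
                    (remove-if-not
                     (lambda (a)
                       (roughly-valid-name-p (sax:attribute-qname a)))
                     attributes)))

;; Usage: interpose the cleaner between chtml and the STP builder.
;; *BROKEN-HTML* is a placeholder for the offending document.
(chtml:parse *broken-html*
             (make-instance 'attribute-cleaner
                            :handlers (list (cxml-stp:make-builder))))
```

This only papers over malformed attribute names; other kinds of breakage would still signal, so it's a stopgap rather than a fix for the parser itself.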
thanks for the report.
It should be the chtml parser's responsibility to clean up the data sufficiently that it is well-formed as XML; using an STP builder, as in your code, is a pretty good test case to verify that this is so.
When Closure was written, I think a lot of effort went into making it correct such errors, but that logic is just not 100% complete. We'd need a good set of test cases.
Unfortunately I don't have a ready-to-use patch at this point; do you have a suggestion? There are probably some small dependency issues, in that some utility functions needed to determine whether data is OK are currently part of the XML parser, on which the HTML parser does not depend in terms of ASDF. But I'm sure those issues would be solvable.
d.