Hi,
Quoting Russell Kliese (russell@kliese.id.au):
I have been using closure-html and cxml to extract data from web pages. Up until now it has worked quite well, but I have come across a web page with extraordinarily malformed HTML markup.
I am looking for suggestions on how to continue parsing even when attribute names are malformed. Here is a self-contained example containing some of the offending markup:
[...]
This example gives CXML:WELL-FORMEDNESS-VIOLATION. I guess I could write a custom SAX handler to skip attributes with these errors, but is there a simpler solution? Should the chtml parser be patched to handle this case? Are there any other ways to make parsing more robust? Or is this markup beyond hope of repair?
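For what it's worth, the custom-SAX-handler workaround you mention could look roughly like the sketch below: a filtering handler, interposed between the HTML parser and the STP builder, that drops attributes whose names are malformed before they reach the builder. This is untested and makes some assumptions: that cxml's documented BROADCAST-HANDLER class (with its :HANDLERS initarg) and the SAX:ATTRIBUTE-QNAME accessor behave as described in the cxml manual, and that a crude name check is good enough for your data. *BROKEN-HTML* is a placeholder for your input, and ROUGHLY-VALID-NAME-P is only a loose approximation of the real XML Name production, not the actual well-formedness check.

```lisp
;; Sketch of a SAX filter that discards attributes with malformed names.
(defclass attribute-cleaner (cxml:broadcast-handler) ())

(defun roughly-valid-name-p (name)
  ;; Approximation only: letters, digits, and . - _ : allowed, and the
  ;; name must not start with a digit.  The real XML Name production is
  ;; more permissive (and more complicated).
  (and (plusp (length name))
       (not (digit-char-p (char name 0)))
       (every (lambda (c)
                (or (alphanumericp c)
                    (find c ".-_:")))
              name)))

(defmethod sax:start-element ((handler attribute-cleaner)
                              namespace-uri local-name qname attributes)
  ;; Forward the event with the bad attributes removed.
  (call-next-method handler namespace-uri local-name qname
                    (remove-if-not
                     (lambda (a)
                       (roughly-valid-name-p (sax:attribute-qname a)))
                     attributes)))

;; Usage: interpose the cleaner between chtml and the STP builder.
;; *BROKEN-HTML* is a placeholder for the offending document.
(chtml:parse *broken-html*
             (make-instance 'attribute-cleaner
                            :handlers (list (cxml-stp:make-builder))))
```

This only papers over malformed attribute names; other kinds of breakage would still signal, so it's a stopgap rather than a fix for the parser itself.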
thanks for the report.
It should be the chtml parser's responsibility to clean up the data sufficiently that it is well-formed as XML; using an STP builder, as in your code, is a pretty good test case to verify that this is so.
When Closure was written, I think a lot of effort went into making it correct such errors, but that logic is just not 100% complete. We'd need a good set of test cases.
Unfortunately I don't have a ready-to-use patch at this point; do you have a suggestion? There are probably some small dependency issues, in that some utility functions needed to determine whether data is OK are currently part of the XML parser, on which the HTML parser does not depend in terms of ASDF. But I'm sure those issues would be solvable.
d.