I have been using closure-html and cxml to extract data from web pages. Until now this has worked quite well, but I have come across a web page with extraordinarily malformed HTML markup.
I am looking for suggestions on how to continue parsing even when attribute names are malformed. Here is a self-contained example containing some of the offending markup:
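    (require :closure-html)
    (require :cxml-stp)

    ;; The <td> below carries two malformed attribute names: "vertical-align:" and ";".
    ;; Inner double quotes are escaped so the string literal reads correctly.
    (defvar *html-str*)
    (setf *html-str*
          "<html> <body> <table> <tr><td vertical-align:=\"\" ;=\"\"></td></tr> </table> </body> </html> ")

    (chtml:parse *html-str* (cxml-stp:make-builder))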
This example signals CXML:WELL-FORMEDNESS-VIOLATION. I guess I could write a custom SAX handler to skip attributes with these errors, but is there a simpler solution? Should the chtml parser be patched to handle this case? Are there any other ways to make parsing more robust? Or is this markup beyond hope of repair?
Regards,
Russell