closure-devel

closure-devel@common-lisp.net

September 2012

1 participant
2 discussions

[closure-devel] Dealing with HTML in a non-UTF-8 encoding
by Elias Mårtenson 11 Sep '12

In my application I have to deal with all sorts of garbage that poses as HTML. In this particular case, I had an email that was encoded in Windows-874. The document had the following header:

    <html xmlns:v="urn:schemas-microsoft-com:vml"
          xmlns:o="urn:schemas-microsoft-com:office:office"
          xmlns:w="urn:schemas-microsoft-com:office:word"
          xmlns:m="http://schemas.microsoft.com/office/2004/12/omml"
          xmlns="http://www.w3.org/TR/REC-html40"><head><meta http-equiv=Content-Type content="text/html; charset=windows-874"><meta name=Generator content="Microsoft Word 12 (filtered medium)">

The document then contained some text in Thai. When attempting to parse this document, Closure-HTML gave me the following error:

    Corrupted UTF-8 input (initial byte was #b10111110)

It's pretty clear that this error message is expected, since the input isn't actually in UTF-8. It seems as though Closure-HTML is making an incorrect assumption here.

Now, the proper way to handle this case would be to detect the "content" attribute in the "meta" tag and reparse the text using the correct encoding. Is this something that can be easily done? I haven't looked at the necessary code changes to do this myself (yet).

Regards,
Elias
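The detect-and-reparse idea described above can be sketched independently of Closure-HTML. The following is a minimal illustration in Python, not the library's behaviour or a proposed patch: the function names `sniff_charset` and `decode_html`, the 1024-byte scan limit, and the alias table are all assumptions made for the example. It scans the start of the raw bytes for a `charset=` declaration, maps the label to a codec, and falls back to UTF-8 when nothing usable is found.

```python
import codecs
import re

# Matches charset=... near the start of the document; the value may be quoted
# or bare, as in the Outlook-generated header quoted in the message.
_CHARSET_RE = re.compile(rb"""charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE)

# Hypothetical alias table for labels that the codec registry may not know
# under that exact spelling.
_ALIASES = {"windows-874": "cp874"}


def sniff_charset(raw: bytes, default: str = "utf-8", limit: int = 1024) -> str:
    """Return a codec name declared in the first `limit` bytes, else `default`."""
    match = _CHARSET_RE.search(raw[:limit])
    if match is None:
        return default
    label = match.group(1).decode("ascii").lower()
    label = _ALIASES.get(label, label)
    try:
        # Normalise to the registry's canonical name; unknown labels fall back.
        return codecs.lookup(label).name
    except LookupError:
        return default


def decode_html(raw: bytes) -> str:
    """Decode `raw` with the declared charset, replacing undecodable bytes."""
    return raw.decode(sniff_charset(raw), errors="replace")
```

The decoded text could then be handed to whatever parser is in use. Scanning only a bounded prefix keeps the check cheap, at the cost of missing a declaration that appears unusually late in the document.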

[closure-devel] Parsing Outlook HTML emails
by Elias Mårtenson 11 Sep '12

I am currently faced with the task of parsing HTML emails generated by Outlook. My frustrations with that thing could fill an entire email of their own, so I won't go into them.

Anyway, one thing it keeps doing is creating lots of non-standard tags of the form <o:p></o:p> and the like. The problem is that when Closure-HTML parses these, they end up like this: "#BAD TAGp>".

I worked around the problem by adding the following check to the function NAME-RUNE-P: (rune= char #/:). This includes the colon as a valid character in a node name, and thus will cause such nodes to be ignored in the generated output.

Would it be reasonable to include this fix in an update to Closure-HTML?

Regards,
Elias
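To illustrate why accepting the colon changes how a tag such as <o:p> is read, here is a small Python sketch of a name-character predicate. It does not reproduce Closure-HTML's NAME-RUNE-P, whose actual definition is not shown in this message. The names `is_name_char` and `read_tag_name` and the exact set of accepted characters are assumptions made for the example.

```python
def is_name_char(ch: str, allow_colon: bool = True) -> bool:
    """Return True if the single character `ch` may appear in a tag name."""
    # Assumed baseline set: letters, digits, and a few punctuation characters.
    if ch.isalnum() or ch in "-_.":
        return True
    # The workaround: additionally treat the colon as a name character.
    return allow_colon and ch == ":"


def read_tag_name(text: str, start: int, allow_colon: bool = True) -> str:
    """Read a tag name from `text`, beginning at index `start`."""
    end = start
    while end < len(text) and is_name_char(text[end], allow_colon):
        end += 1
    return text[start:end]
```

With the colon accepted, `read_tag_name("<o:p></o:p>", 1)` reads the whole name `o:p`. With `allow_colon=False` it stops after `o`, leaving `:p>` behind for the caller to deal with.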
