[closure-devel] Dealing with HTML in a non-UTF-8 encoding
In my application I have to deal with all sorts of garbage that poses as HTML. In this particular case, I had an email that was encoded in Windows-874. The document had the following header:

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m=" http://schemas.microsoft.com/office/2004/12/omml" xmlns=" http://www.w3.org/TR/REC-html40"><head><meta http-equiv=Content-Type content="text/html; charset=windows-874"><meta name=Generator content="Microsoft Word 12 (filtered medium)">

The document then contained some text in Thai. When attempting to parse this document, Closure-HTML gave me the following error:

Corrupted UTF-8 input (initial byte was #b10111110)

The error itself is expected, since the input isn't actually in UTF-8. The problem is that Closure-HTML seems to assume UTF-8 regardless of what the document declares.

The proper way to handle this case would be to detect the charset given in the "content" attribute of the "meta" tag and reparse the text using the correct encoding. Is this something that can be easily done? I haven't looked at the necessary code changes to do this myself (yet).

Regards,
Elias
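To make the "detect the charset, then reparse" idea concrete, here is a minimal sketch in Python. It is an illustration only, not Closure-HTML code: the names `sniff_charset` and `decode_html` are hypothetical, and it assumes the declaration appears within the first few kilobytes and is ASCII-compatible. It looks for a charset in a meta tag, maps a "windows-NNN" label to Python's "cpNNN" codec name if the label is not recognised directly, and falls back to UTF-8 otherwise.

```python
import codecs
import re

# Matches charset=<label> inside a <meta ...> tag, quoted or unquoted.
_META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)",
    re.IGNORECASE,
)


def sniff_charset(raw: bytes, default: str = "utf-8", scan_limit: int = 4096) -> str:
    """Return a Python codec name for the charset declared in a <meta> tag.

    Only the first `scan_limit` bytes are scanned (an assumption of this
    sketch). If nothing usable is declared, `default` is returned.
    """
    match = _META_CHARSET.search(raw[:scan_limit])
    if match is None:
        return default
    label = match.group(1).decode("ascii").lower()
    candidates = [label]
    if label.startswith("windows-"):
        # Python names these code pages "cpNNN", e.g. cp874.
        candidates.append("cp" + label[len("windows-"):])
    for candidate in candidates:
        try:
            return codecs.lookup(candidate).name
        except LookupError:
            continue
    return default


def decode_html(raw: bytes) -> str:
    """Decode raw HTML bytes using the sniffed charset.

    Undecodable bytes are replaced rather than raising an error.
    """
    return raw.decode(sniff_charset(raw), errors="replace")
```

With a header like the one above, `sniff_charset` would pick the cp874 codec, and the decoded string could then be handed to the parser instead of the raw bytes.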
Participants (1): Elias Mårtenson