In my application I have to deal with all sorts of garbage that poses as HTML. In this particular case, I had an email that was encoded in Windows-874. The document had the following header:
<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m=" http://schemas.microsoft.com/office/2004/12/omml" xmlns=" http://www.w3.org/TR/REC-html40"><head><meta http-equiv=Content-Type *content="text/html; charset=windows-874"*><meta name=Generator content="Microsoft Word 12 (filtered medium)">
The document then contained some text in Thai. When attempting to parse this document, Closure-HTML gave me the following error:
Corrupted UTF-8 input (initial byte was #b10111110)
It's pretty clear that this error message is expected, since the input isn't actually in UTF-8. It seems as though Closure-HTML is incorrectly assuming that the input is UTF-8. Now, the proper way to handle this case would be to detect the charset declared in the "content" attribute of the "meta" tag and reparse the text using the correct encoding.
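To illustrate the kind of detect-and-reparse step I have in mind, here is a rough sketch in Python. It is not Closure-HTML code and says nothing about how the library is structured internally. The names sniff_charset, decode_html and ALIASES are made up for this example, the alias table is an assumption (I am not sure every codec registry knows the label "windows-874"), and the 1024-byte scan limit is an arbitrary choice.

```python
import re

# Assumption: map charset labels the codec registry may not recognise
# onto names it does. "cp874" is Python's codec for Windows-874.
ALIASES = {"windows-874": "cp874"}

# Look for charset=... inside a <meta ...> tag, working on raw bytes so
# that no encoding has to be assumed before the declaration is found.
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)

def sniff_charset(raw, scan_limit=1024):
    """Return the charset declared in a <meta> tag, or None if absent."""
    match = META_CHARSET.search(raw[:scan_limit])
    if match is None:
        return None
    label = match.group(1).decode("ascii").lower()
    return ALIASES.get(label, label)

def decode_html(raw, default="utf-8"):
    """Decode raw HTML bytes using the declared charset if there is one."""
    encoding = sniff_charset(raw) or default
    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong declaration: fall back rather than fail.
        return raw.decode(default, errors="replace")
```

With something like this, the byte #b10111110 (0xBE) from the error message would be decoded as a Thai letter instead of being rejected as a malformed UTF-8 sequence. A real parser would presumably do the sniffing while tokenizing and restart with the new encoding, rather than using a regular expression.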
Is this something that can be easily done? I haven't looked at the necessary code changes to do this myself (yet).
Regards, Elias