In my application I have to deal with all sorts of garbage that poses as HTML. In this particular case, I had an email that was encoded in Windows-874. The document had the following header:

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><meta http-equiv=Content-Type content="text/html; charset=windows-874"><meta name=Generator content="Microsoft Word 12 (filtered medium)">

The document then contained some text in Thai. When attempting to parse this document, Closure-HTML gave me the following error:

Corrupted UTF-8 input (initial byte was #b10111110)

The error itself is expected, since the input isn't actually UTF-8. It seems that Closure-HTML assumes UTF-8 regardless of what the document declares. The proper way to handle this case would be to detect the charset given in the "content" attribute of the "meta" tag and reparse the text using that encoding.
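
To illustrate what I mean, here is a rough sketch of the sniff-then-decode idea. It is written in Python rather than Lisp and is not Closure-HTML code; the names sniff_charset and decode_html are made up for this example, and the 4096-byte scan window and the UTF-8 fallback are my own assumptions. It only looks for a charset inside a meta tag near the start of the input and is not a full implementation of the HTML encoding-sniffing rules.

```python
import codecs
import re

# Matches charset=... inside a <meta ...> tag, with or without quotes.
META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def sniff_charset(raw: bytes, default: str = "utf-8") -> str:
    """Return the codec name declared in a meta tag, or `default`."""
    # Only scan the beginning of the document (assumed window size).
    match = META_CHARSET.search(raw[:4096])
    if match is None:
        return default
    label = match.group(1).decode("ascii").lower()
    # Map labels such as "windows-874" to Python's "cp874" naming.
    if label.startswith("windows-"):
        label = "cp" + label[len("windows-"):]
    try:
        return codecs.lookup(label).name
    except LookupError:
        # Unknown or bogus label: fall back instead of failing.
        return default


def decode_html(raw: bytes) -> str:
    """Decode raw HTML bytes using the sniffed charset."""
    return raw.decode(sniff_charset(raw), errors="replace")
```

With something like this in front of the parser, the bytes would be decoded as Windows-874 before parsing instead of being rejected as corrupted UTF-8.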

Is this something that can be easily done? I haven't looked at the necessary code changes to do this myself (yet).

Regards,
Elias