[closure-devel] Dealing with HTML in a non-UTF-8 encoding

11 Sep 2012

In my application I have to deal with all sorts of garbage that poses as
HTML. In this particular case, I had an email that was encoded in
Windows-874. The document had the following header:

<html xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns:m="http://schemas.microsoft.com/office/2004/12/omml"
xmlns="http://www.w3.org/TR/REC-html40"><head><meta http-equiv=Content-Type
*content="text/html; charset=windows-874"*><meta name=Generator
content="Microsoft Word 12 (filtered medium)">

The document then contained some text in Thai. When attempting to parse
this document, Closure-HTML gave me the following error:

Corrupted UTF-8 input (initial byte was #b10111110)

It's pretty clear that this error message is expected, since the input
isn't actually in UTF-8. It seems as though Closure-HTML is making an
incorrect assumption here. Now, the proper way to handle this case would be
to detect the charset declared in the "content" attribute of the "meta"
tag and reparse the text using the correct encoding.
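
To make the idea concrete, here is a rough sketch in Python rather than in
Closure-HTML itself (I have not checked how this would fit into its parser).
The names sniff_charset, decode_html and ALIASES are made up for
illustration, and the 4096-byte scan limit is an arbitrary assumption. It
looks for a charset declaration in a "meta" tag, falls back to UTF-8 when
none is found, and decodes the bytes accordingly:

```python
import re

# Matches charset=... inside a <meta ...> tag; tolerant of optional quotes,
# whitespace and line breaks inside the tag.
META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([A-Za-z0-9._:-]+)", re.IGNORECASE
)

# Illustrative alias table: Python names this codec "cp874", so the declared
# name is mapped explicitly instead of relying on an alias being registered.
ALIASES = {"windows-874": "cp874"}


def sniff_charset(raw: bytes, default: str = "utf-8") -> str:
    """Return the charset declared in a <meta> tag, or `default` if none."""
    match = META_CHARSET.search(raw[:4096])  # only scan the start of the input
    if match is None:
        return default
    name = match.group(1).decode("ascii").lower()
    return ALIASES.get(name, name)


def decode_html(raw: bytes) -> str:
    """Decode `raw` with the declared charset instead of assuming UTF-8."""
    return raw.decode(sniff_charset(raw))


# A cut-down stand-in for the email: 0xBE (#b10111110) is a UTF-8
# continuation byte, so it cannot start a character in UTF-8, but it is
# an ordinary Thai letter (U+0E1E) in Windows-874.
sample = (
    b'<html><head><meta http-equiv=Content-Type content="text/html;\n'
    b'charset=windows-874"></head><body>\xbe</body></html>'
)

try:
    sample.decode("utf-8")
except UnicodeDecodeError as err:
    print("UTF-8 fails:", err.reason)

print(sniff_charset(sample))
print("\u0e1e" in decode_html(sample))
```

A real implementation would also have to decide what to do when the declared
charset is unknown or simply wrong, which this sketch ignores.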

Is this something that can be easily done? I haven't looked at
the necessary code changes to do this myself (yet).

Regards,
Elias

Elias Mårtenson
