Hi,
I'm using chtml for a simple experimental web crawler. I'm occasionally getting this error (Slime output):
0: (RUNES-ENCODING::XERROR "Corrupted UTF-8 input (initial byte was #b~8,'0B)" 255)
1: (#<STANDARD-METHOD RUNES-ENCODING:DECODE-SEQUENCE ((EQL :UTF-8) T T T T T ...)> :UTF-8 #(255 216 255 0 0 0 ...) 0 3 #(65535 0 0 0 0 0 ...) 0 8191 NIL)
2: (NIL #<Unknown Arguments>)
3: (#<STANDARD-METHOD RUNES::XSTREAM-UNDERFLOW (RUNES:XSTREAM)> #<RUNES:XSTREAM NIL>)
4: (SGML::READ-TOKEN #<RUNES:XSTREAM NIL> #<SGML::DTD (:PUBLIC "-//W3C//DTD HTML 4.0 Frameset//EN")>)
5: (SGML::READ-TOKEN* #<RUNES:XSTREAM NIL> #<SGML::DTD (:PUBLIC "-//W3C//DTD HTML 4.0 Frameset//EN")>)
6: (SGML:SGML-PARSE #<SGML::DTD (:PUBLIC "-//W3C//DTD HTML 4.0 Frameset//EN")> #<RUNES:XSTREAM NIL>)
7: (CLOSURE-HTML::PARSE-XSTREAM #<RUNES:XSTREAM NIL> #<CLOSURE-HTML:LHTML-BUILDER #x3020032CBF3D>)
Choosing the restart continuation seems to get past it, but I'd like to understand what's going on and how to automatically detect and work around it.
Any input appreciated.
Thanks,
Patrick
On Oct 16, 2013, at 17:12, Patrick May patrick.may@mac.com wrote:
> Hi,
> I'm using chtml for a simple experimental web crawler. I'm occasionally getting this error (Slime output):
> 0: (RUNES-ENCODING::XERROR "Corrupted UTF-8 input (initial byte was #b~8,'0B)" 255)
> 1: (#<STANDARD-METHOD RUNES-ENCODING:DECODE-SEQUENCE ((EQL :UTF-8) T T T T T ...)> :UTF-8 #(255 216 255 0 0 0 ...) 0 3 #(65535 0 0 0 0 0 ...) 0 8191 NIL)
> 2: (NIL #<Unknown Arguments>)
> 3: (#<STANDARD-METHOD RUNES::XSTREAM-UNDERFLOW (RUNES:XSTREAM)> #<RUNES:XSTREAM NIL>)
> 4: (SGML::READ-TOKEN #<RUNES:XSTREAM NIL> #<SGML::DTD (:PUBLIC "-//W3C//DTD HTML 4.0 Frameset//EN")>)
> 5: (SGML::READ-TOKEN* #<RUNES:XSTREAM NIL> #<SGML::DTD (:PUBLIC "-//W3C//DTD HTML 4.0 Frameset//EN")>)
> 6: (SGML:SGML-PARSE #<SGML::DTD (:PUBLIC "-//W3C//DTD HTML 4.0 Frameset//EN")> #<RUNES:XSTREAM NIL>)
> 7: (CLOSURE-HTML::PARSE-XSTREAM #<RUNES:XSTREAM NIL> #<CLOSURE-HTML:LHTML-BUILDER #x3020032CBF3D>)
> Choosing the restart continuation seems to get past it, but I'd like to understand what's going on and how to automatically detect and work around it.
> Any input appreciated.
How sure are you that the input is actually in UTF-8 format? What does the "restart" do?
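One quick way to answer that question is to look at the first few bytes of the response. The decoder's input buffer in frame 1 of the trace starts with 255 216 255, which is how a JPEG file begins, so the body may not be text at all. The sketch below is only an illustration in plain Common Lisp: the names `*image-signatures*`, `starts-with-bytes-p` and `sniff-image-type` are made up for this example, and it assumes the body is available as a vector of octets.

```lisp
;; Hypothetical helper names; nothing here comes from chtml or Drakma.
(defparameter *image-signatures*
  '((:jpeg 255 216 255)                  ; FF D8 FF
    (:png 137 80 78 71 13 10 26 10)      ; 89 "PNG" CR LF 1A LF
    (:gif 71 73 70 56))                  ; "GIF8"
  "Leading bytes (magic numbers) of a few common image formats.")

(defun starts-with-bytes-p (octets signature)
  "True if the vector OCTETS begins with the bytes in the list SIGNATURE."
  (and (>= (length octets) (length signature))
       (every #'= signature octets)))

(defun sniff-image-type (octets)
  "Return :JPEG, :PNG or :GIF if OCTETS starts with that format's
magic number, otherwise NIL."
  (loop for (type . signature) in *image-signatures*
        when (starts-with-bytes-p octets signature)
          return type))

;; The buffer shown in frame 1 of the trace:
(sniff-image-type #(255 216 255 0 0 0))
```

A non-NIL result means the document is an image and there is no point handing it to an HTML parser.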
On Oct 16, 2013, at 11:35 AM, Raymond Wiker rwiker@gmail.com wrote:
> How sure are you that the input is actually in UTF-8 format? What does the "restart" do?
Thanks for the quick reply.
I'm crawling websites in the wild, so it's quite likely the error is legitimate. I call chtml like this:
(chtml:parse (drakma:http-request url :user-agent :chrome) (chtml:make-lhtml-builder))
I've only been playing with this for a couple of days, so I'm not sure what other options I have for the builder or if I should just catch the error and skip that url.
Restart appears to skip processing that call and move on to the next url.
Thanks again,
Patrick
On Oct 16, 2013, at 17:39, Patrick May patrick.may@mac.com wrote:
> Thanks for the quick reply.
> I'm crawling websites in the wild, so it's quite likely the error is legitimate. I call chtml like this:
> (chtml:parse (drakma:http-request url :user-agent :chrome) (chtml:make-lhtml-builder))
> I've only been playing with this for a couple of days, so I'm not sure what other options I have for the builder or if I should just catch the error and skip that url.
> Restart appears to skip processing that call and move on to the next url.
Try evaluating just
(drakma:http-request url :user-agent :chrome)
separately, for one of the URLs that fail. That will give you the document, plus a number of other values. The third value, the headers, may contain information about content type and encoding. Compare this with what is actually returned.
Other things to try:
1) add :want-stream t to the call to drakma:http-request. This may give chtml a better chance to pick up the correct encoding.
2) add :force-binary t to the call to drakma:http-request.
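As a rough illustration of the header check suggested above, the sketch below tests whether a response declares an HTML body. It is plain Common Lisp and assumes the headers arrive as an alist of (keyword . string) pairs, which is how Drakma returns them as far as I know. The function name `html-content-type-p` is made up for this example.

```lisp
;; Hypothetical helper; assumes HEADERS looks like
;; ((:content-type . "text/html; charset=utf-8") ...).
(defun html-content-type-p (headers)
  "True if the alist HEADERS declares an HTML or XHTML body."
  (let ((content-type (cdr (assoc :content-type headers))))
    (and (stringp content-type)
         (or (search "text/html" content-type :test #'char-equal)
             (search "application/xhtml+xml" content-type
                     :test #'char-equal))
         t)))

;; With Drakma loaded, the headers are the third return value:
;; (multiple-value-bind (body status headers)
;;     (drakma:http-request url :user-agent :chrome)
;;   (declare (ignorable body status))
;;   (html-content-type-p headers))
```

Servers in the wild do mislabel content, so this is a cheap first filter and not a guarantee.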
On Oct 16, 2013, at 11:55 AM, Raymond Wiker rwiker@gmail.com wrote:
> Try evaluating just
> (drakma:http-request url :user-agent :chrome)
> separately, for one of the urls that fail. That will give you the document, plus a number of other values. One of them, the "headers" (value 3) may contain information about content type and encoding. Compare this with what is actually returned.
> Other things to try:
> 1) add :want-stream t to the call to drakma:http-request. This may give chtml a better chance to pick up the correct encoding.
> 2) add :force-binary t to the call to drakma:http-request.
I just wrapped it in a handler-case and logged the bad URLs. Turns out I was chasing links to PNG files (in <a> tags, not <img>).
Thanks for the input.
Regards,
Patrick
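The actual code behind that fix isn't shown in the thread. A minimal sketch of the handler-case approach might look like the following, assuming the decoding failure is signalled as a subclass of `error`. The names `*bad-urls*` and `call-logging-failures` are made up for this example, and the fetch-and-parse step is passed in as a function so the block itself needs nothing beyond standard Common Lisp.

```lisp
;; Hypothetical names; a sketch, not the code used in the thread.
(defvar *bad-urls* '()
  "Failed URLs, newest first, stored as (url . condition) pairs.")

(defun call-logging-failures (url thunk)
  "Call THUNK with no arguments and return its value. If it signals
an ERROR, record URL with the condition in *BAD-URLS* and return NIL."
  (handler-case (funcall thunk)
    (error (condition)
      (push (cons url condition) *bad-urls*)
      nil)))

;; With Drakma and Closure HTML loaded, a crawler step could be:
;; (call-logging-failures
;;  url
;;  (lambda ()
;;    (chtml:parse (drakma:http-request url :user-agent :chrome)
;;                 (chtml:make-lhtml-builder))))
```

Catching every `error` also hides network failures and genuine bugs, so it is worth reading through `*bad-urls*` after a run, which is how the image links showed up here.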