On Oct 16, 2013, at 11:55 AM, Raymond Wiker rwiker@gmail.com wrote:
On Oct 16, 2013, at 17:39 , Patrick May patrick.may@mac.com wrote:
Thanks for the quick reply.
I'm crawling websites in the wild, so it's quite likely the error is legitimate. I call chtml like this:
(chtml:parse (drakma:http-request url :user-agent :chrome) (chtml:make-lhtml-builder))
I've only been playing with this for a couple of days, so I'm not sure what other options I have for the builder or if I should just catch the error and skip that url.
Restart appears to skip processing that call and move on to the next url.
Try evaluating just
(drakma:http-request url :user-agent :chrome)
separately, for one of the urls that fail. That will give you the document, plus a number of other values. One of them, the headers (the third return value), may contain information about content type and encoding. Compare this with what is actually returned.
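A minimal sketch of that inspection, assuming Quicklisp is available to load Drakma. The function name describe-response is made up for illustration; drakma:http-request and drakma:header-value are the library's own.

```lisp
;; Assumption: Quicklisp is installed. Evaluate form by form at the REPL.
(ql:quickload :drakma)

(defun describe-response (url)
  "Fetch URL and print the status code, the declared Content-Type,
and the Lisp type of the returned body."
  ;; http-request returns the body, the status code and the headers
  ;; as its first three values; the remaining values are ignored here.
  (multiple-value-bind (body status headers)
      (drakma:http-request url :user-agent :chrome)
    (format t "~&Status:       ~A~%" status)
    (format t "Content-Type: ~A~%"
            (drakma:header-value :content-type headers))
    ;; A string means Drakma decoded the body as text; an octet vector
    ;; means it treated the body as binary data.
    (format t "Body type:    ~A~%" (type-of body))
    (values)))
```

Calling (describe-response url) on one of the failing urls shows whether the declared content type matches what actually came back.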
Other things to try:
- Add :want-stream t to the call to drakma:http-request. This may give chtml a better chance to pick up the correct encoding.
- Add :force-binary t to the call to drakma:http-request.
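As a rough sketch, the two variants could look like this, assuming Drakma and Closure HTML are already loaded. The function names parse-from-stream and parse-from-octets are hypothetical, and whether either variant avoids the error depends on the page.

```lisp
(defun parse-from-stream (url)
  "Hand chtml the response stream instead of a decoded string."
  (let ((stream (drakma:http-request url
                                     :user-agent :chrome
                                     :want-stream t)))
    ;; With :want-stream t the first return value is the stream itself,
    ;; so it has to be closed once parsing is done.
    (unwind-protect
         (chtml:parse stream (chtml:make-lhtml-builder))
      (close stream))))

(defun parse-from-octets (url)
  "Hand chtml the raw octets so that Drakma does no decoding at all."
  (chtml:parse (drakma:http-request url
                                    :user-agent :chrome
                                    :force-binary t)
               (chtml:make-lhtml-builder)))
```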
I just wrapped it in a handler-case and logged the bad URLs. Turns out I was chasing links to PNG files (in <a> tags, not <img>).
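Roughly, something along these lines. This is a sketch rather than the exact code: the names *bad-urls*, call-logging-errors and html-content-type-p are made up for illustration, and the content-type check is an extra idea for skipping PNG links before parsing, assuming the headers come back as an alist keyed by keywords, as Drakma returns them.

```lisp
(defvar *bad-urls* '()
  "URLs whose fetch or parse signalled an error, most recent first.")

(defun call-logging-errors (url thunk)
  "Call THUNK; on any error, record URL, log the condition, return NIL."
  (handler-case (funcall thunk)
    (error (condition)
      (push url *bad-urls*)
      (format *error-output* "~&Skipping ~A: ~A~%" url condition)
      nil)))

(defun html-content-type-p (headers)
  "True if the :CONTENT-TYPE entry of the HEADERS alist starts with
text/html, compared case-insensitively."
  (let ((content-type (cdr (assoc :content-type headers))))
    (and (stringp content-type)
         (>= (length content-type) 9)
         (string-equal "text/html" content-type :end2 9))))

;; Intended use with the original call (needs Drakma and Closure HTML):
;; (call-logging-errors url
;;   (lambda ()
;;     (chtml:parse (drakma:http-request url :user-agent :chrome)
;;                  (chtml:make-lhtml-builder))))
```

Catching error is deliberately broad here, since the goal is only to keep the crawl moving and review the logged urls afterwards.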
Thanks for the input.
Regards,
Patrick