I have what will probably end up being an obvious and foolish question.
When I run (http-request "http://popurls.com/") I get an error from flexi-streams that says:
Unexpected value #x20 in UTF-8 sequence. [Condition of type FLEXI-STREAM-ENCODING-ERROR]
What's going on? Do I need to set external-format-in ?
An abbreviated stack trace is below.
Cheers, Chris Dean
1: (METHOD STREAM:STREAM-READ-CHAR (FLEXI-STREAMS::FLEXI-UTF-8-INPUT-STREAM)) (#<FLEXI-STREAMS::FLEXI-BINARY-UTF-8-IO-STREAM 200C2FA7>)
   Locals:
     STREAM = #<FLEXI-STREAMS::FLEXI-BINARY-UTF-8-IO-STREAM 200C2FA7>
     CLOS::.ISL. = #(#(NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL ...) #(FLEXI-STREAMS::LAST-CHAR-CODE FLEXI-STREAMS::LAST-OCTET) 126 0)
     CLOS::.PV. = #(5 6)
     FLEXI-STREAMS::FIRST-OCTET-SEEN = T
     OCTET = 194
     FLEXI-STREAMS::START = 2
     COUNT = 1
     DBG::EXTRA-VALS = :DONT-KNOW
     FLEXI-STREAMS::RESULT = 2
     DBG::|repeat-counter-| = 0
     OCTET = 32
2: (METHOD TRIVIAL-GRAY-STREAMS:STREAM-READ-SEQUENCE (FLEXI-INPUT-STREAM T T T)) (#<FLEXI-STREAMS::FLEXI-BINARY-UTF-8-IO-STREAM 200C2FA7> ...
3: CLOS::GENERIC-FUNCTION-DISCRIMINATOR NIL
4: DRAKMA::READ-BODY (#<FLEXI-STREAMS::FLEXI-BINARY-UTF-8-IO-STREAM 200C2FA7>
                      ((:DATE . "Mon, 29 Jan 2007 23:02:31 GMT")
                       (:SERVER . "Apache")
                       (:EXPIRES . "Mon, 26 Jul 1997 05:00:00 GMT")
                       (:CACHE-CONTROL . "no-store, no-cache, must-revalidate,post-check=0, pre-check=0")
                       (:PRAGMA . "no-cache")
                       (:CONNECTION . "close")
                       (:TRANSFER-ENCODING . "chunked")
                       (:CONTENT-TYPE . "text/html; charset=UTF-8"))
                      T
                      #<FLEXI-STREAMS::EXTERNAL-FORMAT (:UTF-8 :EOL-STYLE :LF) 20106C9B>)
   Locals:
     STREAM = #<FLEXI-STREAMS::FLEXI-BINARY-UTF-8-IO-STREAM 200C2FA7>
     DRAKMA::HEADERS = ((:DATE . "Mon, 29 Jan 2007 23:02:31 GMT")
                        (:SERVER . "Apache")
                        (:EXPIRES . "Mon, 26 Jul 1997 05:00:00 GMT")
                        (:CACHE-CONTROL . "no-store, no-cache, must-revalidate,post-check=0, pre-check=0")
                        (:PRAGMA . "no-cache")
                        (:CONNECTION . "close")
                        (:TRANSFER-ENCODING . "chunked")
                        (:CONTENT-TYPE . "text/html; charset=UTF-8"))
     DRAKMA::MUST-CLOSE = T
     DRAKMA::TEXTP = #<FLEXI-STREAMS::EXTERNAL-FORMAT (:UTF-8 :EOL-STYLE :LF) 20106C9B>
     DRAKMA::CONTENT-LENGTH = NIL
     DRAKMA::ELEMENT-TYPE = LISPWORKS:SIMPLE-CHAR
     DRAKMA::CHUNKEDP = T
     DRAKMA::BUFFER = ...
     DRAKMA::RESULT = ...
     DRAKMA::INDEX = 49152
     DRAKMA::POS = 8192
On Mon, 29 Jan 2007 15:23:58 -0800, Chris Dean <ctdean@sokitomi.com> wrote:
> I have what will probably end up being an obvious and foolish question.
No, that's not a foolish question. It's just that the website you're trying to visit has errors - see below.
> When I run (http-request "http://popurls.com/") I get an error from flexi-streams that says:
> Unexpected value #x20 in UTF-8 sequence. [Condition of type FLEXI-STREAM-ENCODING-ERROR]
> What's going on? Do I need to set external-format-in ?
According to
http://validator.w3.org/check?uri=http%3A%2F%2Fpopurls.com%2F
the website claims to be encoded as UTF-8 but contains octet sequences that are illegal in UTF-8. And that's why you get errors - Drakma looks at the headers sent by the server, believes what the server says, and tries to decode the body accordingly. You can work around this by using the FORCE-BINARY keyword argument, but then you end up with a bunch of octets...
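As a rough sketch of that workaround: fetch the body as octets and decode it yourself with an encoding that cannot fail. The function names DECODE-OCTETS-LENIENTLY and FETCH-AS-LATIN-1 are made up for this illustration, and loading via Quicklisp is an assumption about your setup. Latin-1 maps every octet to exactly one character, so nothing is signalled, but genuine multi-byte UTF-8 characters will come out as two or more garbage characters.

```lisp
;; Assumption: Quicklisp is installed. Any other way of loading
;; Drakma and FLEXI-STREAMS works just as well.
(ql:quickload '(:drakma :flexi-streams))

(defun decode-octets-leniently (octets)
  "Hypothetical helper: decode OCTETS as Latin-1.
Every octet corresponds to one Latin-1 character, so this cannot
signal an encoding error."
  (flex:octets-to-string octets :external-format :latin-1))

(defun fetch-as-latin-1 (url)
  "Hypothetical helper: fetch URL without letting Drakma decode the body."
  ;; :FORCE-BINARY T makes Drakma return the body as a vector of octets,
  ;; regardless of what the Content-Type header claims.
  (decode-octets-leniently (drakma:http-request url :force-binary t)))
```

Calling (fetch-as-latin-1 "http://popurls.com/") should then return a string instead of signalling an error, at the price of mangling the characters that really were valid UTF-8.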
You should probably ask the operators of popurls.com to fix their site.
Cheers, Edi.
Edi Weitz <edi@agharta.de> writes:
> On Mon, 29 Jan 2007 15:23:58 -0800, Chris Dean <ctdean@sokitomi.com> wrote:
> According to
> http://validator.w3.org/check?uri=http%3A%2F%2Fpopurls.com%2F
> the website claims to be encoded as UTF-8 but contains octet sequences that are illegal in UTF-8. And that's why you get errors -
That makes sense, and I'm glad to know that the error is on their end.
> You should probably ask the operators of popurls.com to fix their site.
I certainly will do that, but I now have a larger problem. The problem is that I regularly download web pages and many of them are poorly formed. I'd like my software to be permissive and return something reasonable.
Drakma is nicely designed and I'd like to keep using it. If I were to add this "feature" of less-strict UTF-8 where should I do that?
I could modify (define-char-reader (stream flexi-utf-8-input-stream) ...) in some clever way I suppose.
Cheers, Chris Dean
On Mon, 29 Jan 2007 18:20:17 -0800, Chris Dean <ctdean@sokitomi.com> wrote:
> The problem is that I regularly download web pages and many of them are poorly formed. I'd like my software to be permissive and return something reasonable.
Sure, I agree.
> Drakma is nicely designed and I'd like to keep using it. If I were to add this "feature" of less-strict UTF-8 where should I do that?
> I could modify (define-char-reader (stream flexi-utf-8-input-stream) ...) in some clever way I suppose.
My hope is that FLEXI-STREAMS is already "flexible" enough to deal with this:
CL-USER 22 > (drakma:http-request "http://zappa.agharta.de/test.html")

Error: Unexpected value #xF6 in UTF-8 sequence.
  1 (abort) Return to level 0.
  2 Return to top loop level 0.

Type :b for backtrace, :c <option number> to proceed, or :? for other options

CL-USER 23 : 1 > :a

CL-USER 24 > (defun use-replacement-char (condition)
               (declare (ignore condition))
               (use-value #.(code-char 65533)))
USE-REPLACEMENT-CHAR

CL-USER 25 > (let ((flex:*provide-use-value-restart* t))
               (handler-bind ((flex:flexi-stream-encoding-error #'use-replacement-char))
                 (drakma:http-request "http://zappa.agharta.de/test.html")))
"<html>
<body>
This is not really UTF-8: ��
</body>
</html>
"
200
((:DATE . "Tue, 30 Jan 2007 07:47:59 GMT") (:SERVER . "Apache") (:CONNECTION . "close") (:TRANSFER-ENCODING . "chunked") (:CONTENT-TYPE . "text/html; charset=utf-8"))
#<URI http://zappa.agharta.de/test.html>
#<FLEXI-STREAMS::FLEXI-BINARY-UTF-8-IO-STREAM 226B80FB>
T

CL-USER 26 > (let ((flex:*provide-use-value-restart* t)
                   (flex:*substitution-char* #\?))
               (drakma:http-request "http://zappa.agharta.de/test.html"))
"<html>
<body>
This is not really UTF-8: ??
</body>
</html>
"
200
((:DATE . "Tue, 30 Jan 2007 07:50:30 GMT") (:SERVER . "Apache") (:CONNECTION . "close") (:TRANSFER-ENCODING . "chunked") (:CONTENT-TYPE . "text/html; charset=utf-8"))
#<URI http://zappa.agharta.de/test.html>
#<FLEXI-STREAMS::FLEXI-BINARY-UTF-8-IO-STREAM 2263F957>
T
http://weitz.de/flexi-streams/#*provide-use-value-restart*
http://weitz.de/flexi-streams/#*substitution-char*
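If you fetch many pages, the two bindings can be wrapped once in a small helper. This is only a sketch: LENIENT-HTTP-REQUEST is a made-up name, loading via Quicklisp is an assumption, and *PROVIDE-USE-VALUE-RESTART* is looked up at run time so the code still loads in case the installed FLEXI-STREAMS does not define that variable.

```lisp
;; Assumption: Quicklisp is installed. Any other way of loading
;; Drakma and FLEXI-STREAMS works just as well.
(ql:quickload '(:drakma :flexi-streams))

(defun lenient-http-request (url &key (substitute (code-char 65533)))
  "Hypothetical wrapper: like DRAKMA:HTTP-REQUEST on URL, but octets that
cannot be decoded are replaced by the character SUBSTITUTE
(default: U+FFFD, the Unicode replacement character)."
  ;; Look the variable up by name instead of writing it literally, so the
  ;; reader does not fail if this FLEXI-STREAMS has no such symbol.
  (let ((restart-flag (find-symbol "*PROVIDE-USE-VALUE-RESTART*"
                                   :flexi-streams)))
    ;; PROGV binds the flag dynamically to T if it exists and binds
    ;; nothing otherwise.
    (progv (if restart-flag (list restart-flag) '())
           (if restart-flag (list t) '())
      (let ((flex:*substitution-char* substitute))
        ;; All of HTTP-REQUEST's return values (body, status code,
        ;; headers, ...) are passed through unchanged.
        (drakma:http-request url)))))
```

(lenient-http-request "http://zappa.agharta.de/test.html" :substitute #\?) is meant to behave like the last REPL form above. Extra Drakma keyword arguments are not forwarded in this sketch.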
Does that help?
Cheers, Edi.
Edi Weitz <edi@agharta.de> writes:
> http://weitz.de/flexi-streams/#*provide-use-value-restart*
> http://weitz.de/flexi-streams/#*substitution-char*
Great! Just what I needed. I guess I should have remembered that part of the docs.
Cheers, Chris Dean