I was playing with drakma and had it drop into the debugger when retrieving a commercial page. It looks like it might be a bug in flexi-streams, but I don't know how to isolate the input more specifically than what came up here:
Unexpected value #xA0 at start of UTF-8 sequence.
   [Condition of type FLEXI-STREAMS:FLEXI-STREAM-ENCODING-ERROR]

Restarts:
  0: [ABORT] Abort SLIME compilation.
  1: [ABORT] Return to SLIME's top level.
  2: [TERMINATE-THREAD] Terminate this thread (#<THREAD "worker" {CDD6631}>)

Backtrace:
  0: (FLEXI-STREAMS::SIGNAL-ENCODING-ERROR #<FLEXI-STREAMS::FLEXI-UTF-8-IO-STREAM {12CF8379}> "Unexpected value #x~X at start of UTF-8 sequence." 160)
  1: (FLEXI-STREAMS::SIGNAL-ENCODING-ERROR #<FLEXI-STREAMS::FLEXI-UTF-8-IO-STREAM {12CF8379}> "Unexpected value #x~X at start of UTF-8 sequence.")
  2: ((FLET #:BODY-FN327))
  3: ((SB-PCL::FAST-METHOD STREAM-READ-CHAR (FLEXI-STREAMS::FLEXI-UTF-8-INPUT-STREAM)) #<unavailable argument> #<unavailable argument> #<unavailable argument>)
  4: ((SB-PCL::FAST-METHOD TRIVIAL-GRAY-STREAMS:STREAM-READ-SEQUENCE (FLEXI-STREAMS:FLEXI-INPUT-STREAM #1="#<...>" . #1#)) #<unused argument> #<unused argument> #<unavailable argument> #<unavailable argument> #<unavailable argument> #<unavailable argument>)
  5: (READ-SEQUENCE "y make a difference this holiday season. Our gift ideas
are unique and of high quality.<br/><br/></p>
<p><a href="http://www.1giftidea.com/%5C" target="_blank" title="Christmas Gift Ideas">Gift ideas for every occasion, Christmas, Birthday, Mother's day...</a><br/>
Gift ideas for every occasion, Christmas, Birthday, Mothers day, Graduation, Fathers day, Anniversary, Wedding, & Baby Shower.<br/><br/></p>
<p><a href="http://www.mixedblessing.com/%5C" target="_blank" title="Mixed Blessing">Hanukkah card, Christmas gift idea and Holiday greeting cards from MixedBlessing</a><br/>
Greeting Cards for Interfaith and Multicultures from MixedBlesing. Hanukkah cards, Holiday cards, Christmas Gift Ideas, Holiday Gifts and more.. Find great gifts now!<br/><br/></p>
..)
  6: (DRAKMA::READ-BODY #<FLEXI-STREAMS::FLEXI-UTF-8-IO-STREAM {12CF8379}> ((:DATE . "Sat, 24 Feb 2007 06:30:03 GMT") (:SERVER . "Apache/2.0.46 (Red Hat)") (:SET-COOKIE . "GS_UUID=24.18.193.65.1172298603635841; path=/,PHPSESSID=e009a521cb2bf134a00df925e4f4d510; path=/,cart_hash=e009a521cb2bf134a00df925e4f4d510; expires=Tuesday, 27-Feb-07 06:30:03 GMT; path=/") (:X-POWERED-BY . "PHP/4.4.0") (:EXPIRES . "Thu, 19 Nov 1981 08:52:00 GMT") (:CACHE-CONTROL . "no-store, no-cache, must-revalidate, post-check=0, pre-check=0") ..))
  7: ((LABELS DRAKMA::FINISH-REQUEST) NIL NIL)
  8: (HTTP-REQUEST #<URI http://www.gifttree.com/Christmas/Christmas-gift-idea.html> :PROXY NIL)
  9: (RETRIEVE-URI "http://www.gifttree.com/Christmas/Christmas-gift-idea.html" NIL)
 10: (WALK-SITE "http://www.gifttree.com/Christmas/Christmas-gift-idea.html" #<unavailable argument> #<unavailable argument> #<unavailable argument> #<unavailable argument> #<unavailable argument> #<unavailable argument>)
 11: (SB-FASL::FOP-FUNCALL)
 12: (SB-FASL::LOAD-FASL-GROUP #<SB-SYS:FD-STREAM for "file /tmp/fileIQGlqR.fasl" {CDF1089}>)
 13: (SB-FASL::LOAD-AS-FASL #<SB-SYS:FD-STREAM for "file /tmp/fileIQGlqR.fasl" {CDF1089}> NIL #<unavailable argument>)
 14: (SB-FASL::INTERNAL-LOAD #P"/tmp/fileIQGlqR.fasl" #P"/tmp/fileIQGlqR.fasl" :ERROR NIL NIL :BINARY NIL)
 15: (SB-FASL::INTERNAL-LOAD #P"/tmp/fileIQGlqR.fasl" #P"/tmp/fileIQGlqR.fasl" :ERROR NIL NIL NIL :DEFAULT)
 16: (LOAD #P"/tmp/fileIQGlqR.fasl")
 17: ((LAMBDA (STRING &KEY #1="#<...>" . #1#)) "(print (walk-site \"http://www.gifttree.com\")) " :BUFFER "seo.lisp" :POSITION 27060 :DIRECTORY #<unused argument>)
 18: ((LAMBDA ()))
 --more--
--Jeff
On Sat, 24 Feb 2007 09:07:25 -0800, Jeffrey Cunningham jeffrey@cunningham.net wrote:
I was playing with drakma and had it drop into the debugger when retrieving a commercial page. It looks like it might be a bug in flexi-streams, but I don't know how to isolate the input more specifically than what came up here:
Unexpected value #xA0 at start of UTF-8 sequence.
My guess is that the website sends wrong content-type headers. (Or, in other words, it claims to send UTF-8 but it doesn't.) This is not unusual. See the mailing list archive of the last weeks for similar problems and for workarounds.
If you still think this is a bug in FLEXI-STREAMS, send a simple, reproducible test case and point out where in the sequence of characters FLEXI-STREAMS thinks it's not UTF-8 anymore although it is.
Thanks, Edi.
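To make the error message concrete, here is a minimal sketch in plain Common Lisp of why #xA0 cannot begin a UTF-8 sequence. The function name utf-8-sequence-length is hypothetical and is not part of FLEXI-STREAMS; it only classifies a leading octet according to the UTF-8 bit patterns.

```lisp
;; Illustrative helper, not FLEXI-STREAMS code.
;; Returns the number of octets in a UTF-8 sequence that starts with OCTET,
;; or NIL if OCTET can never start a sequence.
(defun utf-8-sequence-length (octet)
  (cond ((< octet #x80) 1)     ; 0xxxxxxx: plain ASCII
        ((< octet #xC2) nil)   ; #x80-#xBF are continuation octets; #xC0, #xC1 are overlong
        ((< octet #xE0) 2)     ; 110xxxxx
        ((< octet #xF0) 3)     ; 1110xxxx
        ((< octet #xF5) 4)     ; 11110xxx, up to U+10FFFF
        (t nil)))              ; #xF5-#xFF are never valid

(utf-8-sequence-length #xA0)   ; => NIL
(utf-8-sequence-length #xC2)   ; => 2
```

In ISO-8859-1 the single octet #xA0 is a no-break space, whereas UTF-8 encodes that character as the two octets #xC2 #xA0. A lone #xA0 in a body declared as UTF-8 is therefore consistent with a Latin-1 page served under the wrong charset.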
On Sat Feb 24, 2007 at 09:47:15PM +0100, Edi Weitz wrote:
My guess is that the website sends wrong content-type headers. (Or, in other words, it claims to send UTF-8 but it doesn't.) This is not unusual. See the mailing list archive of the last weeks for similar problems and for workarounds.
If you still think this is a bug in FLEXI-STREAMS, send a simple, reproducible test case and point out where in the sequence of characters FLEXI-STREAMS thinks it's not UTF-8 anymore although it is.
I believe you are right - incorrectly identified content-type. This gets it to work:
(setf flexi-streams::*SUBSTITUTION-CHAR* (code-char #xA0))
(setf flexi-streams::*PROVIDE-USE-VALUE-RESTART* t)
(http-request "http://www.gifttree.com/Christmas/Christmas-gift-idea.html")
And I read about the performance hit associated with setting this up as a default. But it seems like it raises some issues - at least for what I'm doing, which is trying to automate updating information about some sites I have no control over. In this case I set it to make a substitution for the 'bad' character. Is it possible for there to be more than one? If so, how could that be handled?
And more generally, should there not be a way to set drakma so it may take a performance hit but is guaranteed not to die on any html that is thrown at it?
Thanks,
--Jeff
On Sat, 24 Feb 2007 16:39:54 -0800, Jeffrey Cunningham jeffrey@cunningham.net wrote:
In this case I set it to make a substitution for the 'bad' character. Is it possible for there to be more than one?
Not yet. See current discussion on the FLEXI-STREAMS mailing list.
And more generally, should there not be a way to set drakma so it may take a performance hit but is guaranteed not to die on any html that is thrown at it?
It's not dying, it just signals an error.
And, no, I don't think there's a way to provide meaningful results and at the same time to be prepared to accept whatever bogus data or headers the server chooses to send. If you find something like that, send patches, but it sounds like magic (or at least very good AI) to me.
As for dealing with wrong character encodings, there are already ways to deal with that. You cited one yourself. Another one would be to read everything as binary data (and then to decode it yourself if needed).
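The "read everything as binary and decode it yourself" route can be sketched as follows. The helper decode-latin-1 is hypothetical and uses only standard Common Lisp; it assumes an implementation whose character codes are Unicode code points (SBCL, for example). The Drakma call in the comment is untested here and assumes http-request's :force-binary keyword.

```lisp
;; Decode a vector of octets as ISO-8859-1 (Latin-1).
;; Every octet maps to the character with the same code, so no octet
;; value can cause an encoding error.
(defun decode-latin-1 (octets)
  (map 'string #'code-char octets))

;; Possible use with Drakma (not run here; needs network access):
;;   (decode-latin-1
;;    (drakma:http-request "http://bad-host/bad-page.html" :force-binary t))

;; "Gift", a no-break space (#xA0), then "ideas".
(decode-latin-1 #(71 105 102 116 #xA0 105 100 101 97 115))
```

Decoding as Latin-1 never fails, but it only gives the right text if the page really is Latin-1. FLEXI-STREAMS also provides octets-to-string, which accepts an :external-format argument, for other encodings.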
On Sun Feb 25, 2007 at 11:25:04AM +0100, Edi Weitz wrote:
On Sat, 24 Feb 2007 16:39:54 -0800, Jeffrey Cunningham jeffrey@cunningham.net wrote:
In this case I set it to make a substitution for the 'bad' character. Is it possible for there to be more than one?
Not yet. See current discussion on the FLEXI-STREAMS mailing list.
And more generally, should there not be a way to set drakma so it may take a performance hit but is guaranteed not to die on any html that is thrown at it?
It's not dying, it just signals an error.
And, no, I don't think there's a way to provide meaningful results and at the same time to be prepared to accept whatever bogus data or headers the server chooses to send. If you find something like that, send patches, but it sounds like magic (or at least very good AI) to me.
I guess I disagree.
If I try to access a page like that using links, lynx, wget, mozilla, firefox, or any other html parsing entity I can think of, they don't stop functioning, signal an error, or whatever you want to call it. They give me their best approximation of the content. Seems like that ought to be the goal here, or at least a possibility.
In an automated process, signaling an error means that processing has stopped (or 'died'). The source of the error signal may be in flexi-streams (I have read the discussions on that list), but it's drakma that has to deal with its consequences.
How do the above mentioned applications manage this problem? Certainly not by magic. And I doubt the AI in links or lynx is very sophisticated.
--Jeff
Hi, Jeff.
"Signaling an error" in this case means that processing can continue.
(setq *provide-use-value-restart* t)
(handler-bind ((flexi-stream-encoding-error
                (lambda (condition)
                  (declare (ignore condition))
                  (use-value #\?))))
  (drakma:http-request "http://bad-host/bad-page.html"))
This is an example from the FLEXI-STREAMS documentation.
You can easily get "the best approximation of the content" using drakma, but with more control. So it is unclear to me what problems you have.
-Anton
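The mechanism behind that handler can be tried without any network access. The following is a minimal sketch in standard Common Lisp. The condition bad-octet and the function decode-ascii are toy stand-ins invented for illustration, not FLEXI-STREAMS internals. The decoder signals an error for every octet above 127 and offers a USE-VALUE restart, and the handler picks the replacement character.

```lisp
;; Toy condition, standing in for an encoding error.
(define-condition bad-octet (error)
  ((octet :initarg :octet :reader bad-octet-octet))
  (:report (lambda (condition stream)
             (format stream "Unexpected value #x~X."
                     (bad-octet-octet condition)))))

;; Toy decoder: ASCII only. For any other octet it signals BAD-OCTET
;; and establishes a USE-VALUE restart that supplies a replacement.
(defun decode-ascii (octets)
  (map 'string
       (lambda (octet)
         (if (< octet 128)
             (code-char octet)
             (restart-case (error 'bad-octet :octet octet)
               (use-value (replacement) replacement))))
       octets))

;; The handler runs at the point of the error and resumes decoding there.
(handler-bind ((bad-octet (lambda (condition)
                            (declare (ignore condition))
                            (use-value #\?))))
  (decode-ascii #(97 #xA0 98 #xA0 99)))
;; => "a?b?c"
```

The handler stays in effect for the whole call, so it deals with every bad octet in the input, not only the first one.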
drakma-devel mailing list drakma-devel@common-lisp.net http://common-lisp.net/cgi-bin/mailman/listinfo/drakma-devel
On Sun Feb 25, 2007 at 07:00:03PM +0200, Anton Vodonosov wrote:
Hi, Jeff.
"Signaling an error" in this case means that processing can continue.
(setq *provide-use-value-restart* t)
(handler-bind ((flexi-stream-encoding-error
                (lambda (condition)
                  (declare (ignore condition))
                  (use-value #\?))))
  (drakma:http-request "http://bad-host/bad-page.html"))
This is an example from the FLEXI-STREAMS documentation.
You can easily get "the best approximation of the content" using drakma, but with more control. So it is unclear to me what problems you have.
-Anton
Hi Anton,
Thanks for the help. Will the example above work for any bad character, or only the one set by
(setf flexi-streams::*SUBSTITUTION-CHAR* (code-char #xA0))
The only example I've run across is the site I mentioned, but it seems like the possibilities for bad html are endless.
--Jeff
You misunderstand the meaning of *substitution-char*. This is the character that will be used as a substitution for all badly encoded characters.
Thus, this example is equivalent to
(setq flexi-streams::*provide-use-value-restart* t)
(setf flexi-streams::*SUBSTITUTION-CHAR* #\?)
You will have ? instead of any wrong character.
I.e. you can use whichever mechanism you like: *substitution-char* for most cases, or the use-value restart if you want more control (for example, to use ? as the substitution for even wrong byte sequences and * for odd ones, to count encoding errors, to log them to a file, or something similar).
Read the docs, http://weitz.de/flexi-streams/
-Anton
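A minimal sketch of the "more control" case, again in standard Common Lisp with toy stand-ins (bad-octet, decode-ascii and decode-counting-errors are hypothetical names, not FLEXI-STREAMS code). It counts the errors and alternates the substitute by occurrence, which is one possible reading of the even/odd example above.

```lisp
;; Toy condition and decoder, standing in for a real decoding stream.
(define-condition bad-octet (error) ())

(defun decode-ascii (octets)
  (map 'string
       (lambda (octet)
         (if (< octet 128)
             (code-char octet)
             (restart-case (error 'bad-octet)
               (use-value (replacement) replacement))))
       octets))

;; Returns two values: the decoded string and the number of bad octets.
(defun decode-counting-errors (octets)
  (let ((errors 0))
    (values
     (handler-bind ((bad-octet
                      (lambda (condition)
                        (declare (ignore condition))
                        ;; Count the error, then substitute * for the
                        ;; 1st, 3rd, ... and ? for the 2nd, 4th, ...
                        (use-value (if (evenp (incf errors)) #\? #\*)))))
       (decode-ascii octets))
     errors)))

(decode-counting-errors #(97 #xA0 98 #xA1 99))
;; => "a*b?c", 2
```

The same handler could write each error to a log instead of, or in addition to, counting it.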
On Sun Feb 25, 2007 at 07:43:26PM +0200, Anton Vodonosov wrote:
You misunderstand the meaning of *substitution-char*. This is the character that will be used as a substitution for all badly encoded characters.
Thus, this example is equivalent to
(setq flexi-streams::*provide-use-value-restart* t)
(setf flexi-streams::*SUBSTITUTION-CHAR* #\?)
You will have ? instead of any wrong character.
I.e. you can use whichever mechanism you like: *substitution-char* for most cases, or the use-value restart if you want more control (for example, to use ? as the substitution for even wrong byte sequences and * for odd ones, to count encoding errors, to log them to a file, or something similar).
You're right, Anton, I did misunderstand the meaning. Thank you for clearing that up.
--Jeff
On Sun, 25 Feb 2007 09:23:45 -0800, Jeffrey Cunningham jeffrey@cunningham.net wrote:
The only example I've run across is the site I mentioned, but it seems like the possibilities for bad html are endless.
The problems you've encountered have nothing to do with bad HTML at all, and Drakma doesn't try to parse HTML. I think you're a bit confused.
Cheers, Edi.
Jeffrey Cunningham wrote on 25/02/2007 at 08:26:
If you find something like that, send patches, but it sounds like magic (or at least very good AI) to me.
I guess I disagree.
If I try to access a page like that using links, lynx, wget, mozilla, firefox, or any other html parsing entity I can think of, they don't stop functioning, signal an error, or whatever you want to call it. They give me their best approximation of the content. Seems like that ought to be the goal here, or at least a possibility.
AFAICS, those browsers just substitute bad bytes with a single substitution glyph. My Firefox uses a white question mark in a black diamond.
You can already achieve that with flexi-streams, IIUC.
Quickly, Pierre
On Sun, 25 Feb 2007 08:26:45 -0800, Jeffrey Cunningham jeffrey@cunningham.net wrote:
If I try to access a page like that using links, lynx, wget, mozilla, firefox, or any other html parsing entity I can think of, they don't stop functioning, signal an error, or whatever you want to call it. They give me their best approximation of the content. Seems like that ought to be the goal here, or at least a possibility.
In an automated process, signaling an error means that processing has stopped (or 'died'). The source of the error signal may be in flexi-streams (I have read the discussions on that list), but it's drakma that has to deal with its consequences.
You are missing two crucial points:
1. The applications you listed are just that - monolithic applications. You either use them for what they are intended or you leave them alone. They'd better be as permissive as possible.
Drakma, OTOH, is a library - a tool or building block used by programmers to build applications. It should do what it advertises to do correctly - not more and not less. And if that's not what the programmer expected, he can tweak it as much as he wants. (That doesn't imply that he modifies the library itself, but as Drakma is open source he can do even that, if deemed necessary.)
2. In Common Lisp, signalling an error doesn't mean that processing has stopped. If that is news to you, you might want to read, for example, the chapter about conditions and restarts in Peter Seibel's book.
How do the above mentioned applications manage this problem? Certainly not by magic.
In this specific case, they're usually doing it the same way you can do it with Drakma and FLEXI-STREAMS - they insert some kind of replacement character. I don't see where the problem is.
Cheers, Edi.
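The second point, that signalling an error need not stop processing, can be seen in a small standard Common Lisp sketch. The names bad-octet and check-octets are hypothetical and chosen only for illustration. The same function is called twice: once under HANDLER-CASE, which unwinds the stack and abandons the work, and once under HANDLER-BIND, which invokes a restart so that the loop carries on.

```lisp
(define-condition bad-octet (error) ())

;; Walks over OCTETS and returns how many it processed. For an octet
;; above 127 it signals BAD-OCTET with a CONTINUE restart available.
(defun check-octets (octets)
  (let ((processed 0))
    (dolist (octet octets processed)
      (when (> octet 127)
        (cerror "Carry on with the next octet." 'bad-octet))
      (incf processed))))

;; HANDLER-CASE unwinds: the loop is abandoned at the first bad octet.
(handler-case (check-octets '(97 #xA0 98))
  (bad-octet () :gave-up))
;; => :GAVE-UP

;; HANDLER-BIND does not unwind: the handler invokes the CONTINUE
;; restart and the loop resumes right after the point of the error.
(handler-bind ((bad-octet #'continue))
  (check-octets '(97 #xA0 98)))
;; => 3
```

Whether processing stops is decided by the handler that the calling code installs, not by the code that signals the error.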