Hi!
I have a problem with drakma and character encoding. My goal is to make a small web utility which first GETs some content from a certain page on my wiki, then optionally prepends stuff to this content, and then POSTs the new content back. It works as long as only ASCII characters are involved but fails when I use characters from the higher part of Latin-1, in my case the Swedish character "ä".
Below I have simplified the code so that it just GETs the old content and POSTs it back. What one sees is that the Swedish character is correctly handled in the GET, I see the "ä" in its full glory, but after having posted it back, the content gets corrupted and the next time I run the function I get an error because of the strange character.
Ok, enough blabbing, here is the code:
;;;;; code starts here
(defvar *boundary* "-------------------------1852275791466338532535335716")

(defconstant +crlf+ #.(format nil "~C~C" #\Return #\Linefeed))

;; Format one multipart/form-data field (the inner quotes must be escaped).
(defun format-field (name value)
  (format nil "--~a~aContent-Disposition: form-data; name=\"~a\"~a~a~a~a"
          *boundary* +crlf+ name +crlf+ +crlf+ value +crlf+))

(defun foo ()
  (let* ((old-content (drakma:http-request
                       "http://klibb.com/cgi-bin/wiki.pl?action=browse;id=2007-05-31;raw=1"))
         (cookie-jar (make-instance 'drakma:cookie-jar))
         (new-content (concatenate 'string
                                   (format-field "title" "2007-05-31")
                                   (format-field "text" old-content)
                                   (format-field "recent_edit" "on")
                                   (format-field "username" "MathiasDahl")
                                   "--" *boundary* "--" +crlf+)))
    (format t "Old content: ~a" old-content)
    ;; The wiki password is passed as a cookie.
    (setf (drakma:cookie-jar-cookies cookie-jar)
          (list (make-instance 'drakma:cookie
                               :name "pwd"
                               :value "editeramera"
                               :expires (+ (get-universal-time) 36000)
                               :domain "klibb.com")))
    (format t "New content: ~a" new-content)
    (drakma:http-request "http://klibb.com/cgi-bin/wiki.pl"
                         :method :post
                         :cookie-jar cookie-jar
                         :content-type (format nil "multipart/form-data; boundary=~a" *boundary*)
                         :content new-content)))
;;;;; code ends here
Again, the code is simplified, some parts are hardcoded etc, but the above is enough to recreate the problem. Note that after running the code one time, you cannot test it again, because the content on the page is now changed.
Here is what I get after running the function the first time:
====
* (foo)
Old content: blä
New content: ---------------------------1852275791466338532535335716
Content-Disposition: form-data; name="title"

2007-05-31
---------------------------1852275791466338532535335716
Content-Disposition: form-data; name="text"

blä

---------------------------1852275791466338532535335716
Content-Disposition: form-data; name="recent_edit"

on
---------------------------1852275791466338532535335716
Content-Disposition: form-data; name="username"

MathiasDahl
---------------------------1852275791466338532535335716--
NIL
302
((:DATE . "Sat, 02 Jun 2007 09:30:53 GMT")
 (:SERVER . "Apache/2.2.3 (Mandriva Linux/PREFORK-1mdv2007.0)")
 (:SET-COOKIE . "MuuWiki=username%1EMathiasDahl; path=/; expires=Mon, 01-Jun-2009 09:30:53 GMT")
 (:LOCATION . "http://klibb.com/cgi-bin/wiki.pl/2007-05-31")
 (:CONTENT-LENGTH . "0")
 (:CONNECTION . "close")
 (:CONTENT-TYPE . "application/x-perl"))
#<PURI:URI http://klibb.com/cgi-bin/wiki.pl>
#<FLEXI-STREAMS:FLEXI-IO-STREAM {C5728E1}>
T
====
As you can see, all looks well; the old content ("blä") looks like it should, and the new content looks the same (it's the data in the form field "text"). However, when I now run the function again, I get this:
====
* (foo)

debugger invoked on a FLEXI-STREAMS:FLEXI-STREAM-ENCODING-ERROR in thread #<THREAD "initial thread" {AC14469}>:
  Unexpected value #xA in UTF-8 sequence.

Type HELP for debugger help, or (SB-EXT:QUIT) to exit from SBCL.

restarts (invokable by number or by possibly-abbreviated name):
  0: [USE-VALUE] Specify a character to be used instead.
  1: [ABORT    ] Exit debugger, returning to top level.

(FLEXI-STREAMS::SIGNAL-ENCODING-ERROR #<FLEXI-STREAMS:FLEXI-IO-STREAM {C5B37D1}> "Unexpected value #x~X in UTF-8 sequence." 10)
====
It fails because that "ä" is now something else.
When I do the same thing from a browser, i.e. POST the page again and again, I don't see any problems. I have done some network sniffing with Wireshark, and what I can see is that when the browser POSTs the content, the "ä" is correctly encoded in UTF-8 as xC3 xA4. In the POST done by drakma, the character is sent as the single octet xE4 (which IS the Unicode code point, and also the Latin-1 encoding, but not the UTF-8 encoding, if I understand things correctly).
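To make the difference between the two encodings concrete, here is a minimal sketch that prints the octets for "ä" in both. It assumes SBCL, since sb-ext:string-to-octets is SBCL-specific; the names octets-of and *a-umlaut* are made up for this illustration.

```lisp
;; Sketch: compare the octets for "ä" under Latin-1 and UTF-8.
;; Assumes SBCL; sb-ext:string-to-octets is not portable Common Lisp.
(defun octets-of (string external-format)
  "Return the octets of STRING in EXTERNAL-FORMAT as a list."
  (coerce (sb-ext:string-to-octets string :external-format external-format)
          'list))

;; Build "ä" from its code point so the source file's own encoding
;; does not matter.
(defparameter *a-umlaut* (string (code-char #xE4)))

(format t "Latin-1: ~{x~2,'0X~^ ~}~%" (octets-of *a-umlaut* :latin-1))
(format t "UTF-8:   ~{x~2,'0X~^ ~}~%" (octets-of *a-umlaut* :utf-8))
```

If I read this right, the first line should show the single octet I see in drakma's POST and the second the two octets the browser sends.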
At first I tried to include the encoding in Content-Type, but when I saw that it did not make any difference and also saw that Firefox does not include it, I removed it. Oh, and I should show this as well:
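Since the header alone does not change the bytes on the wire, one thing that might be worth trying is to encode the body as UTF-8 octets myself before sending it. This is only a sketch under assumptions: it uses SBCL's sb-ext:string-to-octets, the helper name utf-8-body is made up, and whether my drakma version accepts a vector of octets as :content (or an :external-format-out keyword) is something I have not verified.

```lisp
;; Sketch only. Assumes SBCL (sb-ext:string-to-octets is SBCL-specific).
(defun utf-8-body (string)
  "Encode STRING as a vector of UTF-8 octets."
  (sb-ext:string-to-octets string :external-format :utf-8))

;; The octet count differs from the character count once "ä" is
;; involved, which also matters for Content-Length.
(defparameter *sample* (format nil "bl~C" (code-char #xE4)))

(format t "~D characters, ~D octets~%"
        (length *sample*)
        (length (utf-8-body *sample*)))

;; Untested assumption: hand the octets to drakma instead of the string,
;; for example
;;   (drakma:http-request "http://klibb.com/cgi-bin/wiki.pl"
;;                        :method :post
;;                        :content (utf-8-body new-content) ...)
;; or, if the installed drakma supports it, keep the string and pass
;;   :external-format-out :utf-8
```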
* (sb-impl::default-external-format)
:UTF-8
Just so that we are clear that I DO see the content correctly and UTF-8 is used.
I also tried with a version where I even hardcoded the content to be sent to be "blä", and that gives the same problem. Maybe I should have shortened the code above to that, but what I wanted to show was that the same content I can GET nicely enough cannot be POSTed without problems.
Any ideas on how I can continue debugging this? I feel kind of lost. It is frustrating to get stuck on a problem like this when I have got the other logic to work, GETting and POSTing and so on...
I am running this in SBCL 1.0 under Mandriva GNU/Linux.
Thanks!
/Mathias