[Please use the mailing list - see Cc (and register first).]
Hi Will!
On Wed, 21 Sep 2005 23:52:41 -0700, Will <will@glozer.net> wrote:
CafeSpot came across a problem in the URL-DECODE function of TBNL: it doesn't decode UTF-8 encoded URLs correctly. I see there was a thread on this in July, http://common-lisp.net/pipermail/tbnl-devel/2005-July/000358.html, but apparently no resolution. Enclosed is a new version of the function. I'm a Lisp newbie, so it may not be ideal =)
This particular function only works in Allegro, but it would work in any Lisp that has a function to convert a UTF-8 encoded octet array to a string. I believe SBCL has a similar OCTETS-TO-STRING function; I didn't see anything really obvious for LispWorks, though. At the moment I only have ACL.
(defun url-decode (string)
  (let ((string-length (length string)))
    (flet ((parse-hex-escape (start)
             (if (<= (+ start 3) string-length)
                 (parse-integer string :start (+ start 1) :end (+ start 3) :radix 16)
                 (error "invalid hex encoding in string '~A'" string))))
      (let ((vector (make-array string-length
                                :adjustable t
                                :element-type '(unsigned-byte 8)
                                :fill-pointer 0)))
        (loop for i below string-length
              for char = (aref string i)
              do (vector-push-extend
                  (case char
                    ((#\+) (char-code #\Space))
                    ((#\%) (parse-hex-escape (prog1 i (incf i 2))))
                    (otherwise (char-code char)))
                  vector))
        #+allegro
        (excl:octets-to-string vector :external-format :utf-8)))))
Thanks for that. I admit that the current version of URL-DECODE is not ideal but your version will break existing code. Note that browsers will use different URL encodings based on the charset of the HTML document they're responding to. For example, if the charset is ISO-8859-1 (which AFAIK is the default charset for Apache) the string "äöü" (that's umlaut a, umlaut o, umlaut u in case it doesn't make it through email) will be sent as
%E4%F6%FC
which the version of URL-DECODE above won't decode correctly - it'll expect
%C3%A4%C3%B6%C3%BC
instead. Unfortunately, the browsers don't tell you which charset they're using... :(
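To make the difference concrete, here is a minimal sketch in portable Common Lisp that percent-encodes the same three characters under both charsets. It is illustrative only and not TBNL code: the names LATIN-1-OCTETS, UTF-8-OCTETS, PERCENT-ENCODE-OCTETS and *UMLAUTS* are made up for this example, the UTF-8 encoder only covers code points below #x800, and the whole thing assumes CHAR-CODE returns Unicode code points (true on most current implementations, but not guaranteed by the standard).

```lisp
;;; Illustrative sketch only; none of these names are part of TBNL.
;;; Assumption: CHAR-CODE / CODE-CHAR work on Unicode code points.

(defparameter *umlauts* (map 'string #'code-char '(#xE4 #xF6 #xFC))
  "The string of umlaut a, umlaut o, umlaut u, built from code points so
that it does not depend on the encoding of this source file.")

(defun latin-1-octets (string)
  "Encode STRING as ISO-8859-1: one octet per character, codes below 256 only."
  (map 'list
       (lambda (char)
         (let ((code (char-code char)))
           (if (< code 256)
               code
               (error "~S is not representable in ISO-8859-1" char))))
       string))

(defun utf-8-octets (string)
  "Encode STRING as UTF-8. For brevity, handles code points below #x800 only."
  (loop for char across string
        for code = (char-code char)
        append (cond ((< code #x80)
                      (list code))
                     ((< code #x800)
                      ;; Two octets: 110xxxxx 10xxxxxx
                      (list (logior #xC0 (ash code -6))
                            (logior #x80 (logand code #x3F))))
                     (t
                      (error "code point ~X needs more than two octets" code)))))

(defun percent-encode-octets (octets)
  "Write every octet as %XX, without special-casing unreserved characters."
  (format nil "~{%~2,'0X~}" octets))

(percent-encode-octets (latin-1-octets *umlauts*)) ; => "%E4%F6%FC"
(percent-encode-octets (utf-8-octets *umlauts*))   ; => "%C3%A4%C3%B6%C3%BC"
```

Both results are perfectly valid URL encodings of the same string, and nothing in the encoded text itself says which charset was used to produce it.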
The right way to do it would be to add a second optional argument for the charset to URL-DECODE and make the default value user-configurable on a per-request basis. Does that sound OK? I'll probably add something like this in the next few days.
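As a rough idea of what that could look like, here is a hedged sketch in portable Common Lisp. It is not the TBNL implementation: URL-DECODE*, URL-OCTETS, DECODE-UTF-8, *DEFAULT-URL-CHARSET* and the charset keywords :LATIN-1 and :UTF-8 are hypothetical names chosen for this example. The UTF-8 decoder is hand-rolled and minimal (it does not reject overlong forms), and the code again assumes that CODE-CHAR accepts Unicode code points.

```lisp
;;; Hypothetical sketch; these names are not TBNL's API.
;;; Assumption: CODE-CHAR works on Unicode code points.

(defvar *default-url-charset* :latin-1
  "Charset assumed when URL-DECODE* is called without one.
Meant to be rebound with LET around the handling of a single request.")

(defun url-octets (string)
  "Turn a URL-encoded STRING into a list of octets: + becomes 32, %XX becomes XX."
  (loop with end = (length string)
        with i = 0
        while (< i end)
        collect (let ((char (char string i)))
                  (case char
                    (#\+ (incf i) 32)
                    (#\% (unless (<= (+ i 3) end)
                           (error "invalid hex encoding in string ~S" string))
                         (prog1 (parse-integer string :start (+ i 1)
                                                      :end (+ i 3)
                                                      :radix 16)
                           (incf i 3)))
                    (t (incf i) (char-code char))))))

(defun decode-utf-8 (octets)
  "Decode a list of UTF-8 OCTETS into a string. Minimal: checks lead and
continuation octets but does not reject overlong sequences."
  (with-output-to-string (out)
    (loop while octets
          do (let* ((lead (pop octets))
                    ;; Number of continuation octets that follow LEAD.
                    (extra (cond ((< lead #x80) 0)
                                 ((= (logand lead #xE0) #xC0) 1)
                                 ((= (logand lead #xF0) #xE0) 2)
                                 ((= (logand lead #xF8) #xF0) 3)
                                 (t (error "invalid UTF-8 lead octet ~2,'0X" lead))))
                    ;; Payload bits of LEAD: all 7 for ASCII, else 5, 4 or 3.
                    (code (if (zerop extra)
                              lead
                              (logand lead (ash #x3F (- extra))))))
               (loop repeat extra
                     do (let ((next (pop octets)))
                          (unless (and next (= (logand next #xC0) #x80))
                            (error "invalid UTF-8 continuation octet"))
                          (setf code (logior (ash code 6) (logand next #x3F)))))
               (write-char (code-char code) out)))))

(defun url-decode* (string &optional (charset *default-url-charset*))
  "Decode the URL-encoded STRING, interpreting the resulting octets as CHARSET."
  (let ((octets (url-octets string)))
    (ecase charset
      (:latin-1 (map 'string #'code-char octets))
      (:utf-8 (decode-utf-8 octets)))))
```

With these definitions, existing callers keep the old Latin-1 behaviour, while a handler that knows its pages are served as UTF-8 can either pass the charset explicitly or rebind the default for the duration of the request:

```lisp
(url-decode* "%E4%F6%FC")                    ; => "äöü"
(url-decode* "%C3%A4%C3%B6%C3%BC" :utf-8)    ; => "äöü"
(let ((*default-url-charset* :utf-8))
  (url-decode* "a+b%C3%A4"))                 ; => "a bä"
```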
Cheers, Edi.
PS: For LispWorks use EXTERNAL-FORMAT:DECODE-EXTERNAL-STRING and EXTERNAL-FORMAT:ENCODE-LISP-STRING but see the recent discussion on the LW mailing list w.r.t. delivered applications: