Hi!
On Thu, 28 Jul 2005 03:11:52 +0400, Ivan Shvedunov <ivan4th@gmail.com> wrote:
Well, I've promised this patch somewhat earlier, but I didn't have time to complete it...
Thanks for the patch. See my comments below.
I've discovered several problems with TBNL's handling of UTF-8. Namely, there was a problem with url-decode in util.lisp which was turning UTF-8 urlencoded strings into something incomprehensible,
Note that you're calling COERCE twice in your version of URL-DECODE.
and also there was a problem with Content-Length in modlisp.lisp which was causing UTF-8 content to be truncated.
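To illustrate the URL-DECODE issue, here is a minimal sketch of decoding percent-escapes to octets first and only then decoding the octets as UTF-8. The name URL-DECODE-UTF-8 is hypothetical and this is not the code from the patch or from util.lisp. It assumes SBCL (for SB-EXT:OCTETS-TO-STRING), well-formed escapes, and plain ASCII for all unescaped characters.

```lisp
;; Sketch only: URL-DECODE-UTF-8 is a hypothetical name, not TBNL's URL-DECODE.
;; Assumes SBCL with Unicode support and well-formed %XX escapes.
(defun url-decode-utf-8 (string)
  "Turn percent-escapes in STRING into octets, then decode the octets as UTF-8."
  (let ((octets (make-array (length string)
                            :element-type '(unsigned-byte 8)
                            :fill-pointer 0)))
    (loop with i = 0
          while (< i (length string))
          do (let ((char (char string i)))
               (cond ((char= char #\%)
                      ;; %XX: two hex digits stand for one octet
                      (vector-push (parse-integer string
                                                  :start (+ i 1)
                                                  :end (+ i 3)
                                                  :radix 16)
                                   octets)
                      (incf i 3))
                     ((char= char #\+)
                      ;; + stands for a space in form data
                      (vector-push 32 octets)
                      (incf i))
                     (t
                      ;; assumption: unescaped characters are plain ASCII
                      (vector-push (char-code char) octets)
                      (incf i)))))
    ;; Decode the whole octet sequence at once, so that multi-octet
    ;; UTF-8 characters are reassembled correctly.
    (sb-ext:octets-to-string
     (coerce octets '(simple-array (unsigned-byte 8) (*)))
     :external-format :utf-8)))
```

With this, (url-decode-utf-8 "caf%C3%A9") returns a four-character string, whereas turning each escaped octet into its own character would give five.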
The attached patch works only with SBCL. I mean that it shouldn't break other Lisps, but proper Unicode handling is implemented only for SBCL. I've tried to make it work with the Allegro demo and LispWorks Personal Edition, but with no luck. For Allegro, the problem is that the sockets used to talk to mod_lisp are set to latin-1 encoding for some reason; most likely KMRCL needs to be fixed a bit, but unfortunately I just have no time to complete this. As for LispWorks, I just don't know how to turn a string into a series of octets and vice versa using the current encoding, i.e. I didn't find anything like the Allegro/SBCL octets-to-string/string-to-octets there.
The file test/test.lisp demonstrates the usage of
external-format:encode-lisp-string
for LispWorks. See also
http://thread.gmane.org/gmane.lisp.lispworks.general/3481
Concerning the implementation: I've introduced a :tbnl-unicode feature that is set for supported Unicode-aware Lisps in specials.lisp (I'm setting it for Allegro and SBCL, though it doesn't help Allegro much).
My main concern is that at the moment the external format is kind of hard-coded into TBNL (or relying on some global setting), so if for example you use UTF-8 you can't serve binary content like JPGs anymore. Wouldn't it be better if content were always sent as a sequence of octets? (That would also solve the AllegroCL problem you mention above.)
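A minimal sketch of the "always send octets" idea, assuming SBCL and UTF-8 for text: strings are encoded once, binary content passes through untouched, and Content-Length is taken from the octet vector rather than from the string. BODY-OCTETS and BODY-CONTENT-LENGTH are hypothetical names, not functions that exist in TBNL.

```lisp
;; Sketch only: both functions are hypothetical, not part of TBNL.
;; Assumes SBCL (SB-EXT:STRING-TO-OCTETS) and UTF-8 for text bodies.
(defun body-octets (body)
  "Return BODY as a vector of octets.
Strings are encoded as UTF-8; anything else is assumed to be an octet
vector already (e.g. the contents of a JPG) and is returned unchanged."
  (if (stringp body)
      (sb-ext:string-to-octets body :external-format :utf-8)
      body))

(defun body-content-length (body)
  "Number of octets that will go over the wire for BODY."
  (length (body-octets body)))
```

For the string "café", LENGTH returns 4 but BODY-CONTENT-LENGTH returns 5, which is the kind of mismatch that leads to truncated UTF-8 content when the character count is sent as Content-Length.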
Also I've added supporting funcs, bytes-to-string and string-to-bytes (defined only when #+tbnl-unicode) that do the dirty job of string conversion.
I'd prefer if they were called "octets" and not "bytes" because a byte doesn't necessarily have 8 bits. They should also be exported from the TBNL package, shouldn't they?
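For concreteness, a sketch of what such a pair of exported conversion helpers might look like on SBCL. The package name OCTETS-SKETCH and the UTF-8 default are assumptions made for illustration; this is not the code from the patch, and other Lisps would need their own definitions.

```lisp
;; Sketch only: OCTETS-SKETCH is a hypothetical package, not TBNL.
;; A separate package is used so the names don't clash with SB-EXT's.
(defpackage :octets-sketch
  (:use :common-lisp)
  (:export #:string-to-octets #:octets-to-string))

(in-package :octets-sketch)

;; Only SBCL is covered here.
#+sbcl
(defun string-to-octets (string &key (external-format :utf-8))
  "Encode STRING into a vector of octets using EXTERNAL-FORMAT."
  (sb-ext:string-to-octets string :external-format external-format))

#+sbcl
(defun octets-to-string (octets &key (external-format :utf-8))
  "Decode the octet vector OCTETS into a string using EXTERNAL-FORMAT."
  (sb-ext:octets-to-string octets :external-format external-format))
```

A round trip such as (octets-sketch:octets-to-string (octets-sketch:string-to-octets s)) should give back a string equal to s for any s that UTF-8 can represent.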
Thanks, Edi.