Re: [tbnl-devel] UTF-8 problems -- patch

28 Jul 2005

      Hi!

On Thu, 28 Jul 2005 03:11:52 +0400, Ivan Shvedunov <ivan4th@gmail.com> wrote:
...
> Well, I've promised this patch somewhat earlier, but I didn't have
> time to complete it...
Thanks for the patch.  See my comments below.
...
> I've discovered several problems with TBNL's handling of
> UTF-8. Namely, there was a problem with url-decode in util.lisp
> which was turning UTF-8 urlencoded strings into something
> incomprehensible,
Note that you're calling COERCE twice in your version of URL-DECODE.
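For illustration only (this is not the code from the patch): one way
to keep the conversion simple is to collect the raw octets first and
decode them as UTF-8 in a single final step.  The sketch below assumes
a Lisp where CHAR-CODE and CODE-CHAR work on Unicode code points.  The
names PERCENT-DECODE, DECODE-UTF-8 and URL-DECODE-UTF-8 are
hypothetical, and malformed input is not validated.

```lisp
(defun percent-decode (string)
  "Return the octets encoded by the URL-encoded STRING.
Assumes STRING contains only ASCII characters."
  (let ((octets (make-array (length string)
                            :element-type '(unsigned-byte 8)
                            :fill-pointer 0))
        (i 0))
    (loop while (< i (length string))
          do (let ((char (char string i)))
               (case char
                 (#\%
                  ;; "%XX" stands for one octet given as two hex digits.
                  (vector-push (parse-integer string
                                              :start (+ i 1)
                                              :end (+ i 3)
                                              :radix 16)
                               octets)
                  (incf i 3))
                 (#\+
                  ;; "+" stands for a space in form-encoded data.
                  (vector-push 32 octets)
                  (incf i))
                 (t
                  (vector-push (char-code char) octets)
                  (incf i)))))
    octets))

(defun decode-utf-8 (octets)
  "Decode the vector OCTETS as UTF-8 and return a string.
Only a sketch: continuation octets and overlong forms are not checked."
  (with-output-to-string (out)
    (let ((i 0))
      (loop while (< i (length octets))
            do (let* ((lead (aref octets i))
                      ;; Number of continuation octets after LEAD.
                      (extra (cond ((< lead #x80) 0)
                                   ((= (logand lead #xE0) #xC0) 1)
                                   ((= (logand lead #xF0) #xE0) 2)
                                   ((= (logand lead #xF8) #xF0) 3)
                                   (t (error "Invalid lead octet #x~X"
                                             lead))))
                      ;; Payload bits carried by the lead octet.
                      (code (if (zerop extra)
                                lead
                                (logand lead
                                        (ash #x7F (- (1+ extra)))))))
                 ;; Each continuation octet contributes six more bits.
                 (loop for k from 1 to extra
                       do (setf code
                                (logior (ash code 6)
                                        (logand (aref octets (+ i k))
                                                #x3F))))
                 (write-char (code-char code) out)
                 (incf i (1+ extra)))))))

(defun url-decode-utf-8 (string)
  "Decode the URL-encoded STRING, treating its octets as UTF-8."
  (decode-utf-8 (percent-decode string)))
```

With these definitions, (url-decode-utf-8 "caf%C3%A9") yields a string
of four characters whose last character has code #xE9.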
...
> and also there was a problem with Content-Length in modlisp.lisp
> which was causing UTF-8 content to be truncated.
>
> The attached patch works only with SBCL. I mean that it shouldn't
> break other Lisps, but proper unicode handling is implemented only
> for SBCL. I've tried to make it work with Allegro demo/LispWorks
> Personal Edition, but with no luck. Well, concerning Allegro, the
> problem here is that sockets that are used to talk to mod_lisp are
> set to latin-1 encoding for some reason, most likely KMRCL needs to
> be fixed a bit, again, unfortunately I just have no time to
> complete this. As for LispWorks, I just don't know how to turn a
> string into a series of octets and vice versa using current encoding -
> i.e. I didn't find something like Allegro/SBCL
> octets-to-string/string-to-octets there.
The file test/test.lisp demonstrates the usage of

  external-format:encode-lisp-string

for LispWorks.  See also

  <http://thread.gmane.org/gmane.lisp.lispworks.general/3481>
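On the Content-Length truncation mentioned above: the header has to
count octets, not characters, so taking LENGTH of the string comes up
short as soon as a character needs more than one octet.  A minimal
sketch, assuming the body is sent as UTF-8 and that CHAR-CODE returns
Unicode code points; UTF-8-OCTET-LENGTH is a hypothetical name, not
something in TBNL:

```lisp
(defun utf-8-octet-length (string)
  "Return the number of octets STRING occupies when encoded as UTF-8."
  (loop for char across string
        for code = (char-code char)
        ;; UTF-8 uses one to four octets depending on the code point.
        sum (cond ((< code #x80) 1)
                  ((< code #x800) 2)
                  ((< code #x10000) 3)
                  (t 4))))
```

For the four-character string made of "caf" followed by the character
with code #xE9, this returns 5, because that last character takes two
octets in UTF-8.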
...
> Concerning implementation - I've introduced :tbnl-unicode feature
> that is set for supported Unicode-aware Lisps in specials.lisp (I'm
> setting it for Allegro and SBCL, though it doesn't help Allegro
> much).
My main concern is that at the moment the external format is kind of
hard-coded into TBNL (or relying on some global setting), so if for
example you use UTF-8 you can't serve binary content like JPGs
anymore.  Wouldn't it be better if content were always sent as a
sequence of octets?  (That would also solve the AllegroCL problem you
mention above.)
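To make the idea concrete, here is a rough sketch of what "always
octets" could look like.  It is not TBNL code: STRING-TO-OCTETS,
BODY-OCTETS and WRITE-BODY are hypothetical names, UTF-8 is assumed as
the encoding for strings, and the conversion is only filled in for SBCL
via SB-EXT:STRING-TO-OCTETS.

```lisp
(defun string-to-octets (string)
  "Hypothetical wrapper: encode STRING as UTF-8 octets."
  #+sbcl (sb-ext:string-to-octets string :external-format :utf-8)
  #-sbcl (error "No STRING-TO-OCTETS for this Lisp in this sketch."))

(defgeneric body-octets (body)
  (:documentation
   "Return BODY as a vector of octets for a binary stream."))

(defmethod body-octets ((body string))
  ;; Text is encoded exactly once, right before it is sent.
  (string-to-octets body))

(defmethod body-octets ((body vector))
  ;; Assumed to hold octets already, e.g. a JPG read from disk.
  body)

(defun write-body (body stream)
  "Write BODY to the binary STREAM; return the number of octets written.
The return value is what Content-Length would have to announce."
  (let ((octets (body-octets body)))
    (write-sequence octets stream)
    (length octets)))
```

Because STRING is a more specific class than VECTOR, strings go
through the encoder while octet vectors pass through untouched, so
text and binary content share one output path.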
...
> Also I've added supporting funcs, bytes-to-string and
> string-to-bytes (defined only when #+tbnl-unicode) that do the dirty
> job of string conversion.
I'd prefer if they were called "octets" and not "bytes" because a byte
doesn't necessarily have 8 bits.  They should also be exported from
the TBNL package, shouldn't they?

Thanks,
Edi.

Edi Weitz