[tbnl-devel] UTF-8 problems -- patch

Ivan Shvedunov

27 Jul 2005

11:11 p.m.

Hello.

Well, I promised this patch somewhat earlier, but I didn't have time to complete it...

I've discovered several problems with TBNL's handling of UTF-8. Namely, there was a problem with url-decode in util.lisp, which was turning UTF-8 urlencoded strings into something incomprehensible, and there was also a problem with Content-Length in modlisp.lisp, which was causing UTF-8 content to be truncated.

The attached patch works only with SBCL. I mean that it shouldn't break other Lisps, but proper Unicode handling is implemented only for SBCL. I've tried to make it work with Allegro demo/LispWorks Personal Edition, but with no luck. Concerning Allegro, the problem is that the sockets used to talk to mod_lisp are set to latin-1 encoding for some reason; most likely KMRCL needs to be fixed a bit, but again, unfortunately, I just have no time to complete this. As for LispWorks, I just don't know how to turn a string into a series of octets and vice versa using the current encoding, i.e. I didn't find anything like the Allegro/SBCL octets-to-string/string-to-octets there.

Concerning the implementation: I've introduced a :tbnl-unicode feature that is set for supported Unicode-aware Lisps in specials.lisp (I'm setting it for Allegro and SBCL, though it doesn't help Allegro much). I've also added supporting functions, bytes-to-string and string-to-bytes (defined only when #+tbnl-unicode), that do the dirty job of string conversion.

Ivan
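To illustrate the url-decode problem described above, here is a minimal sketch and not the code from the attached patch. The names url-decode-to-octets and url-decode-utf-8 are hypothetical. It assumes the urlencoded input is plain ASCII, and it only decodes UTF-8 properly on SBCL; elsewhere it falls back to treating each octet as one character code. The idea is to collect the raw octets first and decode them as a whole, since a single non-ASCII character spans several %XX escapes.

```lisp
;; Hypothetical helpers for illustration; not TBNL's actual URL-DECODE.
(defun url-decode-to-octets (string)
  "Percent-decode STRING into a vector of octets without interpreting them.
Assumes STRING contains only ASCII characters and well-formed %XX escapes."
  (let ((octets (make-array (length string)
                            :element-type '(unsigned-byte 8)
                            :fill-pointer 0))
        (i 0))
    (loop while (< i (length string))
          do (let ((char (char string i)))
               (case char
                 (#\% ; "%XX" stands for one raw octet
                  (vector-push (parse-integer string
                                              :start (+ i 1)
                                              :end (+ i 3)
                                              :radix 16)
                               octets)
                  (incf i 3))
                 (#\+ ; "+" stands for a space in form data
                  (vector-push 32 octets)
                  (incf i))
                 (t
                  (vector-push (char-code char) octets)
                  (incf i)))))
    (coerce octets '(simple-array (unsigned-byte 8) (*)))))

(defun url-decode-utf-8 (string)
  "Decode the octets of STRING as UTF-8 on SBCL; latin-1 style fallback elsewhere."
  #+sbcl (sb-ext:octets-to-string (url-decode-to-octets string)
                                  :external-format :utf-8)
  #-sbcl (map 'string #'code-char (url-decode-to-octets string)))
```

Decoding each %XX escape into its own character instead is what produces the "incomprehensible" result: the two octets of "%D0%9F" would become two unrelated characters rather than one.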

Attachments:

tbnl-0.5.5-unicode.patch (text/x-patch — 3.8 KB)


Edi Weitz

28 Jul 2005

5:37 p.m.

Hi!

On Thu, 28 Jul 2005 03:11:52 +0400, Ivan Shvedunov <ivan4th@gmail.com> wrote:

> Well, I promised this patch somewhat earlier, but I didn't have time to complete it...

Thanks for the patch. See my comments below.

> I've discovered several problems with TBNL's handling of UTF-8. Namely, there was a problem with url-decode in util.lisp, which was turning UTF-8 urlencoded strings into something incomprehensible,

Note that you're calling COERCE twice in your version of URL-DECODE.

> and there was also a problem with Content-Length in modlisp.lisp, which was causing UTF-8 content to be truncated.
>
> The attached patch works only with SBCL. I mean that it shouldn't break other Lisps, but proper Unicode handling is implemented only for SBCL. I've tried to make it work with Allegro demo/LispWorks Personal Edition, but with no luck. Concerning Allegro, the problem is that the sockets used to talk to mod_lisp are set to latin-1 encoding for some reason; most likely KMRCL needs to be fixed a bit, but again, unfortunately, I just have no time to complete this. As for LispWorks, I just don't know how to turn a string into a series of octets and vice versa using the current encoding, i.e. I didn't find anything like the Allegro/SBCL octets-to-string/string-to-octets there.

The file test/test.lisp demonstrates the usage of external-format:encode-lisp-string for LispWorks. See also <http://thread.gmane.org/gmane.lisp.lispworks.general/3481>

> Concerning the implementation: I've introduced a :tbnl-unicode feature that is set for supported Unicode-aware Lisps in specials.lisp (I'm setting it for Allegro and SBCL, though it doesn't help Allegro much).

My main concern is that at the moment the external format is kind of hard-coded into TBNL (or relying on some global setting), so if for example you use UTF-8 you can't serve binary content like JPGs anymore. Wouldn't it be better if content were always sent as a sequence of octets? (That would also solve the AllegroCL problem you mention above.)
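A rough sketch of the "always octets" idea, under stated assumptions: utf-8-encode and content-length-header are hypothetical names, not TBNL functions, and the encoder assumes that char-code returns Unicode code points (as on SBCL). Once the body exists as an octet vector, Content-Length is simply the length of that vector, and the same path can carry binary data such as JPGs unchanged.

```lisp
;; Hypothetical helpers for illustration; not part of TBNL.
(defun utf-8-encode (string)
  "Return the UTF-8 encoding of STRING as a vector of octets.
Assumes CHAR-CODE yields Unicode code points."
  (let ((out (make-array 0 :element-type '(unsigned-byte 8)
                           :adjustable t :fill-pointer 0)))
    (flet ((tail (code shift) ; continuation octet: 10xxxxxx
             (vector-push-extend
              (logior #x80 (logand (ash code (- shift)) #x3F)) out)))
      (loop for char across string
            for code = (char-code char)
            do (cond ((< code #x80)
                      (vector-push-extend code out))
                     ((< code #x800)
                      (vector-push-extend (logior #xC0 (ash code -6)) out)
                      (tail code 0))
                     ((< code #x10000)
                      (vector-push-extend (logior #xE0 (ash code -12)) out)
                      (tail code 6)
                      (tail code 0))
                     (t
                      (vector-push-extend (logior #xF0 (ash code -18)) out)
                      (tail code 12)
                      (tail code 6)
                      (tail code 0)))))
    out))

(defun content-length-header (octets)
  "Build the header line from the octet count, not the character count."
  (format nil "Content-Length: ~D" (length octets)))
```

With this approach a string of two characters, one of them non-ASCII, reports a length based on its encoded octets rather than 2, which is the mismatch that would otherwise truncate UTF-8 content.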

> I've also added supporting functions, bytes-to-string and string-to-bytes (defined only when #+tbnl-unicode), that do the dirty job of string conversion.

I'd prefer if they were called "bytes" and not "octets" because a byte doesn't necessarily have 8 bits. They should also be exported from the TBNL package, shouldn't they?

Thanks,
Edi.

Ivan Shvedunov

6:03 p.m.

Hi.

On 7/28/05, Edi Weitz <edi@agharta.de> wrote:

> Hi!
>
> On Thu, 28 Jul 2005 03:11:52 +0400, Ivan Shvedunov <ivan4th@gmail.com> wrote:
>
>> Well, I promised this patch somewhat earlier, but I didn't have time to complete it...
>
> Thanks for the patch. See my comments below.

You're welcome :)

>> I've discovered several problems with TBNL's handling of UTF-8. Namely, there was a problem with url-decode in util.lisp, which was turning UTF-8 urlencoded strings into something incomprehensible,
>
> Note that you're calling COERCE twice in your version of URL-DECODE.

Well, I hope that (coerce bytes '(vector (unsigned-byte 8))) in bytes-to-string doesn't add much overhead when bytes are already '(vector (unsigned-byte 8)), but it allows one to pass just a vector of numbers there without making Lisp complain about it.
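A small sketch of why this COERCE should be cheap in the common case. As far as I know, the standard says COERCE returns the object itself when it is already of the requested type, so no copy is made for a vector that is already specialized on (unsigned-byte 8). The helper name coerce-copies-p is hypothetical and exists only for this illustration.

```lisp
;; Hypothetical helper: true when COERCE had to build a fresh vector.
(defun coerce-copies-p (vector)
  (not (eq vector (coerce vector '(vector (unsigned-byte 8))))))

;; Already specialized on octets: expected to come back as the same object.
(defparameter *octets*
  (make-array 3 :element-type '(unsigned-byte 8)
                :initial-contents '(72 105 33)))

;; A general vector that merely holds small integers: must be converted.
(defparameter *numbers* (vector 72 105 33))

(list (coerce-copies-p *octets*)
      (coerce-copies-p *numbers*))
```

So the extra call costs a type check for octet vectors and only allocates when a plain vector of numbers is passed in.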

>> and there was also a problem with Content-Length in modlisp.lisp, which was causing UTF-8 content to be truncated.
>>
>> The attached patch works only with SBCL. I mean that it shouldn't break other Lisps, but proper Unicode handling is implemented only for SBCL. I've tried to make it work with Allegro demo/LispWorks Personal Edition, but with no luck. Concerning Allegro, the problem is that the sockets used to talk to mod_lisp are set to latin-1 encoding for some reason; most likely KMRCL needs to be fixed a bit, but again, unfortunately, I just have no time to complete this. As for LispWorks, I just don't know how to turn a string into a series of octets and vice versa using the current encoding, i.e. I didn't find anything like the Allegro/SBCL octets-to-string/string-to-octets there.
>
> The file test/test.lisp demonstrates the usage of external-format:encode-lisp-string for LispWorks. See also <http://thread.gmane.org/gmane.lisp.lispworks.general/3481>

Thanks for the pointer, I'll look at it.

>> Concerning the implementation: I've introduced a :tbnl-unicode feature that is set for supported Unicode-aware Lisps in specials.lisp (I'm setting it for Allegro and SBCL, though it doesn't help Allegro much).
>
> My main concern is that at the moment the external format is kind of hard-coded into TBNL (or relying on some global setting), so if for example you use UTF-8 you can't serve binary content like JPGs anymore. Wouldn't it be better if content were always sent as a sequence of octets? (That would also solve the AllegroCL problem you mention above.)

I think this will be DEFINITELY better. I just haven't studied TBNL sources enough and don't know whether this will require a lot of changes. Well, it's possible to make simple versions of bytes-to-string and string-to-bytes funcs for non-Unicode lisps (utilizing char-code/code-char) and then convert the code to binary output mode.
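The simple fallback versions mentioned here could look roughly like the following sketch. The names simple-string-to-bytes and simple-bytes-to-string are hypothetical (chosen so as not to suggest this is the patch's code), and the conversion is a plain one-octet-per-character mapping, so it only works for characters whose code is below 256.

```lisp
;; Hypothetical fallbacks for Lisps without Unicode conversion functions.
(defun simple-string-to-bytes (string)
  "Map each character to its code. Only meaningful for codes below 256."
  (map '(vector (unsigned-byte 8)) #'char-code string))

(defun simple-bytes-to-string (bytes)
  "Map each octet back to the character with that code."
  (map 'string #'code-char bytes))
```

For pure ASCII or latin-1 text the two functions round-trip; anything outside that range needs a real encoder such as the implementation-specific ones discussed in this thread.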

>> I've also added supporting functions, bytes-to-string and string-to-bytes (defined only when #+tbnl-unicode), that do the dirty job of string conversion.
>
> I'd prefer if they were called "bytes" and not "octets" because a byte doesn't necessarily have 8 bits.

? They _are_ called "bytes"...

> They should also be exported from the TBNL package, shouldn't they?

Yes, I think they can be useful. I'll try to build a more elaborate patch, but probably this will happen no earlier than next week.

Ivan.

Edi Weitz

6:18 p.m.

On Thu, 28 Jul 2005 22:03:56 +0400, Ivan Shvedunov <ivan4th@gmail.com> wrote:

> Well, I hope that (coerce bytes '(vector (unsigned-byte 8))) in bytes-to-string doesn't add much overhead when bytes are already '(vector (unsigned-byte 8)),

Me too... :)

> but it allows one to pass just a vector of numbers there without making Lisp complain about it.

OK.

>> I'd prefer if they were called "bytes" and not "octets" because a byte doesn't necessarily have 8 bits.
>
> ? They _are_ called "bytes"...

I meant: I'd prefer if they were called "octets" and not "bytes," sorry. <http://en.wikipedia.org/wiki/Byte>

> I'll try to build a more elaborate patch, but probably this will happen no earlier than next week.

Thanks! Cheers, Edi.
