[tbnl-devel] Re: New version 0.8.7 (Was: Content length with multibyte character encodings)

29 Nov 2005


      ...
On Tue, 29 Nov 2005 00:18:08 +0200, Ignas Mikalajunas <ignas.mikalajunas@gmail.com> wrote:
...
Content length is calculated by calling (length content) which
produces wrong results with unicode characters in the string. Piso
on #lisp proposed a solution - using (length (string-to-octets
string :external-format :utf-8)) which translates to just (length
(string-to-octets string :external-format)) in the code.
I won't do that because it's most likely a terrible performance hog if
you convert each page to octets by default (assuming that most users
already send octets).
Sorry, I was not aware of that. If I understand you correctly, the right
way is to convert all of my pages (they are all UTF-8) to octets
before sending them to tbnl?
...
I also don't understand why
(length (string-to-octets string :external-format :utf-8))
translates to
(length (string-to-octets string :external-format))
Because the first one is cl-user:string-to-octets and the second one is
tbnl:string-to-octets.
...
...
because with the current setup, browsers that strictly adhere to the
content-length (IE 6.0, Opera) would trim 1 character of the
response's body for each UTF-8 character in it.
Nope, that's not how UTF-8 works.
What I meant was:
(length "ąčęė") returns 4
though (length (string-to-octets "ąčęė")) is 8.
Which means that tbnl would try to fit an 8-octet body into a content
length of 4, and IE/Opera would display that as "ąč". That's how it
works on SBCL.
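
To make the mismatch above concrete, here is a minimal sketch for SBCL.
It uses sb-ext:string-to-octets and sb-ext:octets-to-string, which are
SBCL-specific. The names *body*, octet-length and truncate-to-octets are
made up for this illustration and are not part of tbnl.

```lisp
;; Sketch only, assuming SBCL. Helper names are hypothetical, not tbnl's.

;; The four-character string from the message ("ąčęė"), built from code
;; points so the result does not depend on the source file encoding.
(defparameter *body*
  (map 'string #'code-char '(#x105 #x10D #x119 #x117)))

(defun octet-length (string)
  "Number of octets STRING occupies when encoded as UTF-8."
  (length (sb-ext:string-to-octets string :external-format :utf-8)))

(defun truncate-to-octets (string n)
  "Decode only the first N octets of STRING's UTF-8 encoding, which is
what a client sees if it stops reading after N octets.
N is assumed to fall on a character boundary."
  (let ((octets (sb-ext:string-to-octets string :external-format :utf-8)))
    (sb-ext:octets-to-string (subseq octets 0 (min n (length octets)))
                             :external-format :utf-8)))

(length *body*)                              ; => 4 (characters)
(octet-length *body*)                        ; => 8 (octets)
(truncate-to-octets *body* (length *body*))  ; => the first two characters
```

Each of these characters takes two octets in UTF-8, so a content length
taken from the character count covers only half of the body.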

  Ignas