April 2009 - babel-devel@common-lisp.net - mailman3.common-lisp.net

[babel-devel] Changes
by Dan Weinreb 25 Apr '09


These are the changes I had to make in tests.lisp in order to get the tests to pass in the latest ITA version of Clozure Common Lisp (formerly known as OpenMCL).

CCL does not support having a character with code #\udcf0. The reader signals a condition if it sees this. Unfortunately, using #-ccl does not seem to solve the problem, presumably since the #- macro is working by calling "read" and it is not suppressing unhandled conditions, or something like that. It might be hard to fix that in a robust way.

In order to make progress, I had to just comment these out. I do not suggest merging that into the official sources, but it would be very nice if we could find a way to write tests.lisp in such a way that these tests would apply when the characters are supported, and not when they are not.

The (or (code-char ..) ...) change, on the other hand, I think should be made in the official sources. The Hyperspec says clearly that code-char is allowed to return nil.

What do you think?

-- Dan

Index: trunk/qres/lisp/libs/babel/tests/tests.lisp
===================================================================
--- trunk/qres/lisp/libs/babel/tests/tests.lisp (revision 249746)
+++ trunk/qres/lisp/libs/babel/tests/tests.lisp (revision 262389)
@@ -259,22 +259,25 @@
   #(97 98 99))

-(defstest utf-8b.1
-    (string-to-octets (coerce #(#\a #\b #\udcf0) 'unicode-string)
-                      :encoding :utf-8b)
-  #(97 98 #xf0))
-
-(defstest utf-8b.2
-    (octets-to-string (ub8v 97 98 #xcd) :encoding :utf-8b)
-  #(#\a #\b #\udccd))
-
-(defstest utf-8b.3
-    (octets-to-string (ub8v 97 #xf0 #xf1 #xff #x01) :encoding :utf-8b)
-  #(#\a #\udcf0 #\udcf1 #\udcff #\udc01))
-
-(deftest utf-8b.4 ()
-  (let* ((octets (coerce (loop repeat 8192 collect (random (+ #x82)))
-                         '(array (unsigned-byte 8) (*))))
-         (string (octets-to-string octets :encoding :utf-8b)))
-    (is (equalp octets (string-to-octets string :encoding :utf-8b)))))
+;; CCL does not suppport Unicode characters between d800 and e000.
+;(defstest utf-8b.1
+;    (string-to-octets (coerce #(#\a #\b #\udcf0) 'unicode-string)
+;                      :encoding :utf-8b)
+;  #(97 98 #xf0))
+
+;; CCL does not suppport Unicode characters between d800 and e000.
+;(defstest utf-8b.2
+;    (octets-to-string (ub8v 97 98 #xcd) :encoding :utf-8b)
+;  #(#\a #\b #\udccd))
+
+;; CCL does not suppport Unicode characters between d800 and e000.
+;(defstest utf-8b.3
+;    (octets-to-string (ub8v 97 #xf0 #xf1 #xff #x01) :encoding :utf-8b)
+;  #(#\a #\udcf0 #\udcf1 #\udcff #\udc01))
+
+;(deftest utf-8b.4 ()
+;  (let* ((octets (coerce (loop repeat 8192 collect (random (+ #x82)))
+;                         '(array (unsigned-byte 8) (*))))
+;         (string (octets-to-string octets :encoding :utf-8b)))
+;    (is (equalp octets (string-to-octets string :encoding :utf-8b)))))

 ;;; The following tests have been adapted from SBCL's
@@ -338,5 +341,6 @@
     (let ((string (make-string unicode-char-code-limit)))
       (dotimes (i unicode-char-code-limit)
-        (setf (char string i) (code-char i)))
+        ;; CCL does not suppport Unicode characters between d800 and e000.
+        (setf (char string i) (or (code-char i) #\a)))
       (let ((string2 (octets-to-string (string-to-octets string :encoding enc
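One possible way to make such tests conditional, sketched under assumptions: build the test characters from integer codes at run time instead of writing #\udcf0 literals, so the reader never sees a character it might reject. The helper names code-point-char and string-from-codes are hypothetical and not part of Babel; the #\a fallback simply mirrors the patch above.

```lisp
;;; Sketch only: these helper names are hypothetical, not part of Babel.

(defun code-point-char (code &optional (fallback #\a))
  "Return the character for CODE, or FALLBACK when this Lisp has none.
CODE-CHAR may return NIL, and CODE may be at or above CHAR-CODE-LIMIT."
  (or (and (< code char-code-limit) (code-char code))
      fallback))

(defun string-from-codes (&rest codes)
  "Return a string made from CODES, or NIL if any code has no character here."
  (let ((chars (mapcar (lambda (code)
                         ;; Guard the range first: CODE-CHAR is only
                         ;; defined for codes below CHAR-CODE-LIMIT.
                         (and (< code char-code-limit) (code-char code)))
                       codes)))
    (and (every #'characterp chars)
         (coerce chars 'string))))

(defun surrogates-representable-p ()
  "True when this Lisp has a character object for code point #xDCF0."
  (not (null (string-from-codes #xdcf0))))
```

A test such as utf-8b.1 could then call (string-from-codes 97 98 #xdcf0) and skip its check when the result is NIL, rather than being commented out for every implementation.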


Re: [babel-devel] Changes
by Luís Oliveira 14 Apr '09


On Tue, Apr 14, 2009 at 10:17 AM, David Lichteblau <david(a)lichteblau.com> wrote:
> In Allegro CL and LispWorks, the situation is very different. They use
> UTF-16 to represent Lisp strings in memory, so surrogates aren't just
> forbidden in Lisp strings, user code actually needs to work with
> surrogates to be able to use all of Unicode.

My understanding was that they use UCS-2, i.e., they are limited to the BMP. AFAICT, their external formats don't produce surrogates in Lisp strings. (ACL doesn't. Didn't test Lispworks but its documentation specifically mentions the BMP and UCS-2.) They don't seem to have functions to deal with surrogates either.

> SBCL has 21 bit characters like CCL and currently has characters for the
> surrogate code points. But I am not aware of any consensus that this is
> the right thing to do. Personally I think it's a bug and SBCL should be
> changed to do it like CCL.

I don't feel that strongly about either option. (I feel it's more important that the various Lisps agree on one of them.) That said, so far I haven't heard any compelling arguments in favor of having CODE-CHAR return NIL for such code points.

> As far as I understand, the only Lisp with 21 bit characters whose
> author thinks that SBCL's behaviour is correct is ECL, but I failed to
> understand the reasoning behind that when it was being discussed
> on comp.lang.lisp.

[Assuming you're referring to the "Unicode and Common Lisp" thread.
<ab100561-286c-4570-aabc-72fd877f22ae(a)v18g2000pro.googlegroups.com>
<http://groups.google.com/group/comp.lang.lisp/browse_thread/thread/97ff103a…>]

I couldn't find where Juanjo argued for or against CODE-CHAR returning NIL for surrogates. His (somewhat unrelated) main point, IIUC, is that you shouldn't try to support full Unicode when all you have is 16-bit characters, and should instead restrict Unicode handling to the BMP (U+0000 through U+FFFF).

> (As a side note, I find it a huge hassle to write code portable between
> the Lisp implementations with Unicode support. For CXML, I needed read
> time conditionals checking for UTF-16 Lisps. And it still doesn't
> actually work, because most of the other free libraries like Babel,
> CL-Unicode, and in turn, CL-PPCRE, expect 21 bit characters and are
> effectively broken on Allegro and LispWorks.)

I would argue that you're trying too hard. Would you support UTF-8 on CMUCL too? (Yes, I am aware that Unicode support for CMUCL is imminent. Using actual UTF-16... With surrogates... *sigh*) Just ignore/replace code points equal to or greater than CHAR-CODE-LIMIT. Does that mean CXML won't pass the test suites for Allegro and Lispworks? So be it. If enough Allegro or Lispworks customers have the need to deal with characters outside the BMP, they'll complain and it'll be fixed. Duane Rettig hints in that c.l.l thread that Allegro might support full 21-bit Unicode characters in the future.

(But perhaps I'm being too optimistic. Perhaps you really really need to use CXML+Lispworks/Allegro along with characters outside the BMP? Then ignore what I said above, I guess.)

My plan for Babel was not to assume 21-bit characters, but to punt on characters above the CHAR-CODE-LIMIT. (It doesn't do that properly at the moment. That's definitely a bug that will have to be fixed.)

Would you argue that it'd be better for Babel to instead use UTF-16 on Lispworks/Allegro? (Not a rhetorical question; if that turns out to be a good idea I'd change Babel in that direction.) What about UTF-8 for Lisps with 8-bit characters? I suspect that restricting oneself to a subset of Unicode is more robust and more manageable for portable programs.

> While I have no ideas regarding UTF-8b, I think it worth pointing out
> that for the important use case of file names, there is a different way of
> achieving a round-trip involving file names in "might be UTF-8" format.
>
> The idea is to interpret invalid UTF-8 bytes in Latin 1, but prefix them
> with the code point 0.
>
> On encoding back to a file name, such null characters would be stripped
> again.

That sounds like a worse hack than UTF-8B, because if you convert such a string into another encoding you'll get bogus characters with no indication of error instead of, say, replacement characters. (That seems to be a big advantage of representing invalid bytes as invalid characters. Doesn't that make sense?)

--
Luís Oliveira
http://student.dei.uc.pt/~lmoliv/
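The point about UTF-8B keeping invalid bytes recognizable can be illustrated with a small sketch on integer code points, so that it does not depend on whether a given Lisp has surrogate characters. It assumes the usual UTF-8B convention that an undecodable octet in the range #x80 to #xFF becomes the code point #xDC00 plus the octet. All function names are hypothetical and are not Babel's API.

```lisp
;;; Sketch only: hypothetical names, operating on integer code points.

(defun utf-8b-escape (octet)
  "Map an undecodable OCTET (#x80 to #xFF) to a low-surrogate code point."
  (check-type octet (integer #x80 #xff))
  (+ #xdc00 octet))

(defun utf-8b-escaped-p (code)
  "True if CODE is a code point produced by UTF-8B-ESCAPE."
  (<= #xdc80 code #xdcff))

(defun utf-8b-unescape (code)
  "Recover the original octet from an escaped code point."
  (assert (utf-8b-escaped-p code))
  (- code #xdc00))

(defun sanitize-code-points (codes)
  "Replace escaped octets with U+FFFD, as a converter to a different
encoding might do once it notices them."
  (mapcar (lambda (code)
            (if (utf-8b-escaped-p code) #xfffd code))
          codes))
```

Because the escaped code points lie in the surrogate range, a later conversion step can still tell them apart from real text; with the NUL-plus-Latin-1 scheme the same octet would turn into two ordinary-looking code points.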


[babel-devel] Unicode issues, esp security
by Dan Weinreb 13 Apr '09


Luis,

From two Unicode experts I have consulted come the following comments:

See: http://www.unicode.org/reports/tr36/

Cases like this, in which an illegal sequence is explicitly transformed into another illegal sequence, would meet with a lot of resistance from folks who care about security. It's important not to do anything outside the definition.

Your objection to CODE-CHAR returning NIL is incompatible with the Unicode concept of "Noncharacters". See the Unicode report section 16.7.

-- Dan
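As a reading aid for the terminology in this thread, the sketch below expresses two sets of code points as predicates, following the Unicode Standard's definitions: surrogate code points are U+D800 through U+DFFF, and noncharacters are U+FDD0 through U+FDEF plus the last two code points of each plane. The function names are hypothetical, and the code assumes its argument is an integer between 0 and #x10FFFF.

```lisp
;;; Sketch only: hypothetical names; CODE is assumed to be in 0..#x10FFFF.

(defun surrogate-code-point-p (code)
  "True for code points reserved for UTF-16 surrogates."
  (<= #xd800 code #xdfff))

(defun noncharacter-code-point-p (code)
  "True for the 66 code points designated as noncharacters."
  (or (<= #xfdd0 code #xfdef)
      ;; U+nFFFE and U+nFFFF at the end of each of the 17 planes.
      (>= (logand code #xffff) #xfffe)))
```

A portability layer could consult predicates like these before deciding whether to call CODE-CHAR at all.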
