
These are the changes I had to make in tests.lisp in order to get the tests to pass, in the latest ITA version of Clozure Common Lisp (formerly known as OpenMCL).

CCL does not support having a character with code #\udcf0. The reader signals a condition if it sees this. Unfortunately, using #-ccl does not seem to solve the problem, presumably since the #- macro is working by calling "read" and it is not suppressing unhandled conditions, or something like that. It might be hard to fix that in a robust way. In order to make progress, I had to just comment these out. I do not suggest merging that into the official sources, but it would be very nice if we could find a way to write tests.lisp in such a way that these tests would apply when the characters are supported, and not when they are not.

The (or (code-char ..) ...) change, on the other hand, I think should be made in the official sources. The Hyperspec says clearly that code-char is allowed to return nil.

What do you think?

-- Dan

Index: trunk/qres/lisp/libs/babel/tests/tests.lisp
===================================================================
--- trunk/qres/lisp/libs/babel/tests/tests.lisp (revision 249746)
+++ trunk/qres/lisp/libs/babel/tests/tests.lisp (revision 262389)
@@ -259,22 +259,25 @@
     #(97 98 99))
 
-(defstest utf-8b.1
-    (string-to-octets (coerce #(#\a #\b #\udcf0) 'unicode-string)
-                      :encoding :utf-8b)
-  #(97 98 #xf0))
-
-(defstest utf-8b.2
-    (octets-to-string (ub8v 97 98 #xcd) :encoding :utf-8b)
-  #(#\a #\b #\udccd))
-
-(defstest utf-8b.3
-    (octets-to-string (ub8v 97 #xf0 #xf1 #xff #x01) :encoding :utf-8b)
-  #(#\a #\udcf0 #\udcf1 #\udcff #\udc01))
-
-(deftest utf-8b.4 ()
-  (let* ((octets (coerce (loop repeat 8192 collect (random (+ #x82)))
-                         '(array (unsigned-byte 8) (*))))
-         (string (octets-to-string octets :encoding :utf-8b)))
-    (is (equalp octets (string-to-octets string :encoding :utf-8b)))))
+;; CCL does not suppport Unicode characters between d800 and e000.
+;(defstest utf-8b.1
+;    (string-to-octets (coerce #(#\a #\b #\udcf0) 'unicode-string)
+;                      :encoding :utf-8b)
+;  #(97 98 #xf0))
+
+;; CCL does not suppport Unicode characters between d800 and e000.
+;(defstest utf-8b.2
+;    (octets-to-string (ub8v 97 98 #xcd) :encoding :utf-8b)
+;  #(#\a #\b #\udccd))
+
+;; CCL does not suppport Unicode characters between d800 and e000.
+;(defstest utf-8b.3
+;    (octets-to-string (ub8v 97 #xf0 #xf1 #xff #x01) :encoding :utf-8b)
+;  #(#\a #\udcf0 #\udcf1 #\udcff #\udc01))
+
+;(deftest utf-8b.4 ()
+;  (let* ((octets (coerce (loop repeat 8192 collect (random (+ #x82)))
+;                         '(array (unsigned-byte 8) (*))))
+;         (string (octets-to-string octets :encoding :utf-8b)))
+;    (is (equalp octets (string-to-octets string :encoding :utf-8b)))))
 
 ;;; The following tests have been adapted from SBCL's
@@ -338,5 +341,6 @@
   (let ((string (make-string unicode-char-code-limit)))
     (dotimes (i unicode-char-code-limit)
-      (setf (char string i) (code-char i)))
+      ;; CCL does not suppport Unicode characters between d800 and e000.
+      (setf (char string i) (or (code-char i) #\a)))
     (let ((string2 (octets-to-string (string-to-octets string :encoding enc

[Sending a copy to the openmcl-devel mailing list.] On Wed, Apr 8, 2009 at 9:51 PM, Dan Weinreb <dlw@itasoftware.com> wrote:
CCL does not support having a character with code #\udcf0. The reader signals a condition if it sees this. Unfortunately, using #-ccl does not seem to solve the problem, presumably since the #- macro is working by calling "read" and it is not suppressing unhandled conditions, or something like that. It might be hard to fix that in a robust way.
Interesting. It seems that #-ccl works fine for CCL's #\ but not for Babel's #\, which is defined in babel/src/sharp-backslash.lisp and is what we're using within the test suite. That is of course my fault. I now see in CLHS that *READ-SUPPRESS* should be honoured by each reader and I had missed that. What's the rationale behind not supporting the surrogate area (D800–DFFF)? I can see how that might make sense in that Unicode states that this area does not have any character assignments. But, FWIW, the other three Lisps with full unicode support that I'm familiar with -- SBCL, CLISP and ECL -- handle this area just fine. The disadvantage of not handling this area is that we can't implement the UTF-8B encoding. What's the advantage?
In order to make progress, I had to just comment these out. I do not suggest merging that into the official sources, but it would be very nice if we could find a way to write tests.lisp in such a way that these tests would apply when the characters are supported, and not when they are not.
I'll fix the #\ reader macro and that should take care of that annoyance. (For some reason, in my system, tests.lisp appears to load fine with some old CCL 1.2 snapshot.)
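A minimal sketch of the kind of dispatch function that honours *READ-SUPPRESS* might look like the following; the function name and token handling here are illustrative, not Babel's actual sharp-backslash.lisp:

;; Illustrative #\ dispatch function that honours *READ-SUPPRESS*.
;; Not Babel's actual implementation; names and details are made up.
(defun sharp-backslash-reader (stream subchar numarg)
  (declare (ignore subchar numarg))
  (let ((token (with-output-to-string (s)
                 ;; #\ always consumes at least one character.
                 (write-char (read-char stream t nil t) s)
                 (loop for c = (peek-char nil stream nil nil t)
                       while (and c (or (alphanumericp c) (char= c #\_)))
                       do (write-char (read-char stream t nil t) s)))))
    (cond (*read-suppress*
           ;; Under #+/#- suppression, just consume the token and return
           ;; NIL instead of trying to construct a character.
           nil)
          ((and (> (length token) 1)
                (char-equal (char token 0) #\u)
                (every (lambda (c) (digit-char-p c 16)) (subseq token 1)))
           ;; #\uXXXX syntax: build the character from its hex code.
           (or (code-char (parse-integer token :start 1 :radix 16))
               (error "~A does not denote a supported character." token)))
          ((= (length token) 1)
           (char token 0))
          (t
           ;; Named characters such as #\Space fall through to NAME-CHAR.
           (or (name-char token)
               (error "Unknown character name: ~A" token))))))

Such a function would be installed with (set-dispatch-macro-character #\# #\\ #'sharp-backslash-reader readtable) in the readtable the test suite uses.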
The (or (code-char ..) ...) change, on the other hand, I think should be made in the official sources. The Hyperspec says clearly that code-char is allowed to return nil.
I see. For our purposes, though, it seems that if CODE-CHAR returns NIL, we should signal a test failure immediately. -- Luís Oliveira http://student.dei.uc.pt/~lmoliv/

Luís Oliveira wrote:
I see. For our purposes, though, it seems that if CODE-CHAR returns NIL, we should signal a test failure immediately.
I don't understand why. If code-char is allowed to return nil, explicitly, in the CL standard, why consider that to be a babel test failure? Shouldn't it be possible to run the regression test under CCL and have it succeed if babel does not have bugs? -- Dan

On Fri, Apr 10, 2009 at 11:56 PM, Dan Weinreb <dlw@itasoftware.com> wrote:
I don't understand why. If code-char is allowed to return nil, explicitly, in the CL standard, why consider that to be a babel test failure?
Suppose (code-char 237) returned NIL instead of #\í. That's allowed by the CL standard, but I'm positive some Babel test should fail because of that. One might argue that Babel's expectation of being able to encode every code point as a character is not reasonable, but that's the current expectation and the test suite reflects that. (And it passes in all Lisps except CCL.) If it helps, we can split such a test away from the roundtrip test, though, and mark it as an expected failure on CCL, for example. -- Luís Oliveira http://student.dei.uc.pt/~lmoliv/

On Sat, 11 Apr 2009, Luís Oliveira wrote:
On Fri, Apr 10, 2009 at 11:56 PM, Dan Weinreb <dlw@itasoftware.com> wrote:
I don't understand why. If code-char is allowed to return nil, explicitly, in the CL standard, why consider that to be a babel test failure?
Suppose (code-char 237) returned NIL instead of #\í. That's allowed by the CL standard, but I'm positive some Babel test should fail because of that.
Assuming that the implementation in question used Unicode (or some subset of it) and that CHAR-CODE-LIMIT was > 237, it's hard to see how this case (where a character is associated with a code in Unicode) is analogous to the case that we're discussing (where Unicode says that no character is or ever can be associated with a particular code.)

The spec does quite clearly say that CODE-CHAR is allowed to return NIL if no character with the specified code attribute exists or can be created. CCL's implementation of CODE-CHAR returns NIL in many (unfortunately not all) cases where the Unicode standard says that no character corresponds to its code argument; other implementations currently do not return NIL in this case. There are a variety of arguments in favor of and against either behavior, ANSI CL allows either behavior, and code can't portably assume either behavior.

I believe that it's preferable for CODE-CHAR to return NIL in cases where it can reliably and efficiently detect that its argument doesn't denote a character, and CCL does this. Other implementations behave differently, and there may be reasons that I can't think of for finding that behavior preferable.

I'm not really sure that I understand the point of this email thread and I'm sure that I must have missed some context, but some part of it seems to be an attempt to convince me (or someone) that CODE-CHAR should never return NIL because of some combination of:

- in other implementations, it never returns NIL
- there is some otherwise useful code which fails (or its test suite fails) because it assumes that CODE-CHAR always returns a non-NIL value.

If I understand this much correctly, then I can only say that I didn't personally find these arguments persuasive when I was trying to decide how CODE-CHAR should behave in CCL a few years ago and don't find them persuasive now. If there were a lot of otherwise useful code out there that made the same non-portable assumption and if it was really hard to write character-encoding utilities without assuming that all codes between 0 and CHAR-CODE-LIMIT denote characters, then I'd be less dismissive of this than I'm being. As it is, I'm sorry that I can't say anything more constructive than "I hope that you or someone will have the opportunity to change your code to remove non-portable assumptions that make it less useful with CCL than it would otherwise be."

If the point of this email thread is something else ... well, I'm sorry to have missed that point and will try to say something more responsive if/when I understand what that point is.

Gary Byers wrote:
I'm not really sure that I understand the point of this email thread and I'm sure that I must have missed some context, but some part of it seems to be an attempt to convince me (or someone) that CODE-CHAR should never return NIL

Sorry, Gary. The context is the babel test suite. It failed on CCL because it was depending on code-char never returning nil, and also because it includes uses of #\u with values that are not characters in CCL.
I was making some performance improvements in babel and wanted to make sure the test suite still passed, and I ran into this problem. I had to comment out the #\u's (#-ccl didn't work because they're using their own #\ reader) and modify the test using code-char to ignore cases where it returns nil.

-- Dan

--
________________________________________
Daniel Weinreb
http://danweinreb.org/blog/
Discussion about the future of Lisp: ilc2009.scheming.org

On Sun, Apr 12, 2009 at 9:10 AM, Gary Byers <gb@clozure.com> wrote:
Suppose (code-char 237) returned NIL instead of #\í. That's allowed by the CL standard, but I'm positive some Babel test should fail because of that.
Assuming that the implementation in question used Unicode (or some subset of it) and that CHAR-CODE-LIMIT was > 237, it's hard to see how this case (where a character is associated with a code in Unicode) is analogous to the case that we're discussing (where Unicode says that no character is or ever can be associated with a particular code.)
It's analogous because, in both cases, Babel is expecting CODE-CHAR to return non-NIL. In both cases, if CODE-CHAR returns NIL, code will break (e.g. the UTF-8B decoder). And, to be clear, the code breaks not because of the assumption per se, but because it really needs/wants to use some of those character codes.
The spec does quite clearly say that CODE-CHAR is allowed to return NIL if no character with the specified code attribute exists or can be created. CCL's implementation of CODE-CHAR returns NIL in many (unfortunately not all) cases where the Unicode standard says that no character corresponds to its code argument; other implementations currently do not return NIL in this case. There are a variety of arguments in favor of and against either behavior, ANSI CL allows either behavior, and code can't portably assume either behavior.
Again, you might argue that Babel's expectation is wrong and you might be right. But that's the current expectation and Babel's test suite should reflect that. There are a couple of other non-portable assumptions that Babel makes; e.g. it expects char codes to be Unicode or a subset thereof.
I believe that it's preferable for CODE-CHAR to return NIL in cases where it can reliably and efficiently detect that its argument doesn't denote a character, and CCL does this. Other implementations behave differently, and there may be reasons that I can't think of for finding that behavior preferable.
The main advantage seems to be the ability to deal with mis-encoded text non-destructively. (Through UTF-8B, UTF-32, or some other encoding.) But perhaps that is a bad idea altogether?
I'm not really sure that I understand the point of this email thread and I'm sure that I must have missed some context, but some part of it seems to be an attempt to convince me (or someone) that CODE-CHAR should never return NIL because of some combination of:
- in other implementations, it never returns NIL
- there is some otherwise useful code which fails (or its test suite fails) because it assumes that CODE-CHAR always returns a non-NIL value.
I'm sorry. The lack of context was entirely my fault. I should have described what was going on when I added openmcl-devel to the Cc list. Let me try to sum things up.

Babel is a charset encoding/decoding library. One of its main goals is to provide consistent behaviour across the Lisps it supports, particularly with regard to error handling. I believe it has largely succeeded in accomplishing said goal; this problem is the first inconsistency that I know of, which is why I thought I should present this issue to the openmcl-devel list. I suppose I was indeed trying to get the CCL developers to change CCL's behaviour (or accept patches in that direction) in the hopes of providing consistent behaviour for Babel users. I guess I'll have to instead add a note to Babel's documentation saying something like "UTF-8B does not work on Clozure CL". It's unfortunate, but not that big a deal, really.
If I understand this much correctly, then I can only say that I didn't personally find these arguments persuasive when I was trying to decide how CODE-CHAR should behave in CCL a few years ago and don't find them persuasive now.
Fair enough. I don't have any more arguments. (Though, I might stress again that the main problem is not that we assume that CODE-CHAR always returns non-NIL, it's that we really do want to use some character codes that CCL forbids.)
If there were a lot of otherwise useful code out there that made the same non-portable assumption and if it was really hard to write character-encoding utilities without assuming that all codes between 0 and CHAR-CODE-LIMIT denote characters, then I'd be less dismissive of this than I'm being. As it is, I'm sorry that I can't say anything more constructive than "I hope that you or someone will have the opportunity to change your code to remove non-portable assumptions that make it less useful with CCL than it would otherwise be."
Again, I'm curious how UTF-8B might be implemented when CODE-CHAR returns NIL for #xDC80 through #xDCFF. -- Luís Oliveira http://student.dei.uc.pt/~lmoliv/

On Sun, 12 Apr 2009, Luís Oliveira wrote:
Again, I'm curious how UTF-8B might be implemented when CODE-CHAR returns NIL for #xDC80 through #xDCFF.
Let's assume that we have something that reads a sequence of 1 or more UTF-8-encoded bytes from a stream (and that we have variants that do the same for bytes in foreign memory, a lisp vector, etc.) If it gets an EOF while trying to read the first byte of a sequence, it returns NIL; otherwise, it returns an unsigned integer less than #x110000. If it can tell that a sequence is malformed (overlong, whatever), it returns the CHAR-CODE of the Unicode replacement character (#xfffd), it does not reject encoded values that correspond to UTF-16 surrogate pairs or other non-character code points.

(defun read-utf-8-code (stream)
  "Try to read 1 or more octets from stream. Return NIL if EOF is
encountered when reading the first octet, otherwise, return an
unsigned integer less than #x110000. If a malformed UTF-8 sequence is
detected, return the character code of #\Replacement_Character;
otherwise, return encoded value."
  (let* ((b0 (read-byte stream nil nil)))
    (when b0
      (if (< b0 #x80)
          b0
          (if (< b0 #xc2)
              (char-code #\Replacement_Character)
              (let* ((b1 (read-byte stream nil nil)))
                (if (null b1)
                    ;; [Lots of other details to get right, not shown]
                    )))))))

This (or something very much like it) has to exist in order to support UTF-8; the elided details are surprisingly complicated (if we want to reject malformed sequences.)

I wasn't able to find a formal definition of UTF-8B anywhere; the informal descriptions that I saw suggested that it's a way of embedding binary data in UTF-8-encoded character data, with the binary data encoded in the low 8 bits of 16-bit codes whose high 8 bits contained #xdc. If the binary data is in fact embedded in the low 7 bits of codes in the range #xdc80-#xdc8f or something else, then the following parameters would need to change:

(defparameter *utf-8b-binary-data-byte* (byte 8 0))
(defparameter *utf-8b-binary-marker-byte* (byte 13 8))
(defparameter *utf-8b-binary-marker-value* #xdc)

PROCESS-BINARY and PROCESS-CHARACTER do whatever it is that you want to do with a byte of binary data or a character. A real decoder might want to take these functions - or a single function that processed either a byte or character - as arguments. This is just #\Replacement_Character in CCL:

(defparameter *replacement-character* (code-char #xfffd))

(defun decode-utf-8b-stream (stream)
  (do* ((code (read-utf-8-code stream) (read-utf-8-code stream)))
       ((null code))                    ; eof
    (if (eql *utf-8b-binary-marker-value*
             (ldb *utf-8b-binary-marker-byte* code))
        (process-binary (ldb *utf-8b-binary-data-byte* code))
        (process-character (or (code-char code) *replacement-character*)))))

Isn't that the basic idea, whether the details/parameters are right or not ?

On Sun, Apr 12, 2009 at 9:35 PM, Gary Byers <gb@clozure.com> wrote:
I wasn't able to find a formal definition of UTF-8B anywhere; the informal descriptions that I saw suggested that it's a way of embedding binary data in UTF-8-encoded character data, with the binary data encoded in the low 8 bits of 16-bit codes whose high 8 bits contained #xdc.
IIUC, UTF-8B is meant as a way of converting random bytes that are *probably* in UTF-8 format into a Unicode string in such a way that it's possible to reconstruct the original byte sequence later on. The "spec" for UTF-8B is in this email message from Markus Kuhn: <http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html> which I should have mentioned when I first brought this up. (Sorry, again.)
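A rough sketch of the mapping Kuhn describes: each undecodable byte 0xXY becomes the otherwise-unused code point U+DCXY, and only such code points are translated back to raw bytes on output. The function names below are made up for illustration:

;; Sketch of the UTF-8B escape mapping.  Bytes >= #x80 that cannot be
;; decoded as part of a valid UTF-8 sequence are represented by the low
;; surrogates U+DC80..U+DCFF, one code point per byte, so the original
;; byte sequence can be reconstructed losslessly.
(defun utf-8b-escape-byte (octet)
  ;; e.g. the stray byte #xF0 becomes the code point #xDCF0
  (+ #xDC00 octet))

(defun utf-8b-unescape-code (code)
  ;; Return the original byte for an escaped code point, or NIL if the
  ;; code point represents ordinary character data.
  (when (<= #xDC80 code #xDCFF)
    (- code #xDC00)))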
(defun decode-utf-8b-stream (stream)
  (do* ((code (read-utf-8-code stream) (read-utf-8-code stream)))
       ((null code))                    ; eof
    (if (eql *utf-8b-binary-marker-value*
             (ldb *utf-8b-binary-marker-byte* code))
        (process-binary (ldb *utf-8b-binary-data-byte* code))
        (process-character (or (code-char code) *replacement-character*)))))
Isn't that the basic idea, whether the details/parameters are right or not ?
That would work. But it certainly seems much more convenient to use Lisp strings directly. I'll try to illustrate that with a concrete example. These days, unix pathnames seem to be often encoded in UTF-8 but IIUC they can really be any random sequence of bytes -- or at least that seems to be the case on Linux. Suppose I was implementing a directory browser in Lisp. If I could use UTF-8B to convert unix pathnames into Lisp strings, it'd be straightforward to use Lisp pathnames, pass them around, manipulate them with the standard string and pathname functions, and still be able to access the respective files through syscalls later on. In this scenario, my program wouldn't have trouble handling badly formed UTF-8 or other binary junk. The same applies to environment variables, command line arguments, and so on. Does any of that make sense? -- Luís Oliveira http://student.dei.uc.pt/~lmoliv/
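To make that scenario concrete, here is a hedged sketch of the round trip using Babel's OCTETS-TO-STRING and STRING-TO-OCTETS as exercised by the tests above; the byte vector simply stands in for a filename obtained from a syscall:

;; Round-tripping a possibly mis-encoded unix filename through a Lisp
;; string.  This only works on Lisps whose CODE-CHAR accepts the
;; #xDC80-#xDCFF code points.
(let* ((raw (coerce #(102 111 111 #xff) '(vector (unsigned-byte 8))))
       ;; "foo" followed by a byte that is not valid UTF-8
       (name (babel:octets-to-string raw :encoding :utf-8b))
       ;; ... manipulate NAME with the usual string functions ...
       (bytes (babel:string-to-octets name :encoding :utf-8b)))
  ;; BYTES is EQUALP to RAW, so the original file can still be reached
  ;; through syscalls that expect the exact byte sequence.
  (equalp raw bytes))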

On Sun, Apr 12, 2009 at 9:10 AM, Gary Byers <gb@clozure.com> wrote:
If I understand this much correctly, then I can only say that I didn't personally find these arguments persuasive when I was trying to decide how CODE-CHAR should behave in CCL a few years ago and don't find them persuasive now.
It seems the discussion has run out of steam. Just to conclude it, I should ask: is it still the case that UTF-8B is not an argument compelling enough to make you consider a patch changing CODE-CHAR's behaviour, as well as the various encode- and decode-functions? (Such a patch would change CODE-CHAR to accept any code point, and deal with invalid code points explicitly in the UTF encoders and decoders.) -- Luís Oliveira http://student.dei.uc.pt/~lmoliv/

On Tue, 21 Apr 2009, Luís Oliveira wrote:
On Sun, Apr 12, 2009 at 9:10 AM, Gary Byers <gb@clozure.com> wrote:
If I understand this much correctly, then I can only say that I didn't personally find these arguments persuasive when I was trying to decide how CODE-CHAR should behave in CCL a few years ago and don't find them persuasive now.
It seems the discussion has run out of steam. Just to conclude it, I should ask: is it still the case that UTF-8B is not an argument compelling enough to make you consider a patch changing CODE-CHAR's behaviour, as well as the various encode- and decode-functions? (Such a patch would change CODE-CHAR to accept any code point, and deal with invalid code points explicitly in the UTF encoders and decoders.)
Yes, that is still the case.

Table 2-3 (in Section 2.4) of the Unicode spec describes how various classes of code points do and do not map to abstract characters in Unicode, and I think that it's undesirable for CODE-CHAR in a CL implementation that purports to use Unicode as its internal encoding to return a character object for codes that that table says do not denote a Unicode character. CCL's CODE-CHAR returns NIL for surrogates and (in recent versions) a couple of permanent noncharacter codes.

As I've said, I'd feel better about it if CCL's CODE-CHAR returned NIL for all 66 permanent-noncharacter codes, and if it cost nothing (in terms of time or space), I think that it'd be desirable for CODE-CHAR to return NIL for codes that're reserved as of the current version of the Unicode standard (or whatever version the lisp uses.) In the latter case, you may be able to get away with treating reserved codes as if they denoted defined characters - you wouldn't have the same issues with UTF-encoding them as would exist for surrogates, for instance - but you can't meaningfully treat a "reserved character" as if it was a defined character:

? (upper-case-p #\A)
=> T (in Unicode 5.1 and all prior and future versions)

? (upper-case-p (code-char #xd0000))
=> unknown; as of Unicode 5.1, there's no such character

I think that it'd be more consistent to say "AFAIK, there's no such character" than it would be to claim that there is and that it is or is not an upper-case character. Since CODE-CHAR is sometimes on or near a critical performance path, it's not clear that making it 100% accurate is worth whatever that would cost in terms of time/space. It's clear to me that catching and rejecting surrogate code points as non-characters is worth the extra effort.
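For concreteness, the kind of range check involved can be sketched as follows; this is an illustration of the behaviour described in the thread, not CCL's actual source:

;; UTF-16 surrogate code points (#xD800-#xDFFF) never denote Unicode
;; characters, so a CODE-CHAR that rejects them only needs a range test.
(defun surrogate-code-point-p (code)
  (<= #xD800 code #xDFFF))

;; Observed behaviour, per this thread:
;;   CCL:              (code-char #xDCF0) => NIL
;;   SBCL, CLISP, ECL: (code-char #xDCF0) => a character object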

Hello again, On Wed, Apr 8, 2009 at 9:51 PM, Dan Weinreb <dlw@itasoftware.com> wrote:
CCL does not support having a character with code #\udcf0. The reader signals a condition if it sees this. Unfortunately, using #-ccl does not seem to solve the problem, presumably since the #- macro is working by calling "read" and it is not suppressing unhandled conditions, or something like that. It might be hard to fix that in a robust way.
As I've mentioned before, this was a bug in Babel's #\ reader. I've pushed a fix to the repository along with a regression test. I've also disabled the problematic UTF-8B tests using #-ccl.
The (or (code-char ..) ...) change, on the other hand, I think should be made in the official sources. The Hyperspec says clearly that code-char is allowed to return nil.
I've changed TEST-UNICODE-ROUNDTRIP not to try and encode non-characters. HTH. -- Luís Oliveira http://student.dei.uc.pt/~lmoliv/
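A sketch of the idea behind that change: only code points for which CODE-CHAR returns a character take part in the roundtrip. This is an illustration of the approach, not the actual TEST-UNICODE-ROUNDTRIP code:

;; Collect only the characters this Lisp can actually represent, so the
;; roundtrip test skips code points that CODE-CHAR rejects (such as
;; surrogates on CCL) instead of failing on them.
(defun encodable-characters (limit)
  (loop for i below limit
        for char = (code-char i)
        when char collect char))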
participants (4)
- Dan Weinreb
- Daniel Weinreb
- Gary Byers
- Luís Oliveira