On Sun, Apr 12, 2009 at 9:10 AM, Gary Byers <gb@clozure.com> wrote:
>> Suppose (code-char 237) returned NIL instead of #\í. That's allowed by the CL standard, but I'm positive some Babel test would fail because of that.
> Assuming that the implementation in question used Unicode (or some subset of it) and that CHAR-CODE-LIMIT was > 237, it's hard to see how this case (where a character is associated with a code in Unicode) is analogous to the case that we're discussing (where Unicode says that no character is or ever can be associated with a particular code).
It's analogous because, in both cases, Babel is expecting CODE-CHAR to return non-NIL. In both cases, if CODE-CHAR returns NIL, code will break (e.g. the UTF-8B decoder). And, to be clear, the code breaks not because of the assumption per se, but because it really needs/wants to use some of those character codes.
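To make the failure mode concrete, here's a sketch of the relevant step in a UTF-8B decoder (simplified; this is not Babel's actual code, and the function name is made up):

    ;; UTF-8B smuggles each invalid input byte B through as the lone
    ;; surrogate code point #xDC00 + B, which is where CODE-CHAR comes in.
    (defun utf-8b-escape-byte (byte)
      ;; On implementations whose CODE-CHAR rejects surrogate code
      ;; points -- CCL among them -- this returns NIL and the decoder
      ;; has nothing to put in the output string.
      (code-char (+ #xDC00 byte)))

    (utf-8b-escape-byte #x80) ; => a character on most Lisps, NIL on CCL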
> The spec does quite clearly say that CODE-CHAR is allowed to return NIL if no character with the specified code attribute exists or can be created. CCL's implementation of CODE-CHAR returns NIL in many (unfortunately not all) cases where the Unicode standard says that no character corresponds to its code argument; other implementations currently do not return NIL in this case. There are a variety of arguments in favor of and against either behavior, ANSI CL allows either behavior, and code can't portably assume either behavior.
Again, you might argue that Babel's expectation is wrong, and you might be right. But that's the current expectation, and Babel's test suite should reflect that. There are a couple of other non-portable assumptions that Babel makes; for example, it expects char codes to be Unicode code points (or a subset thereof).
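To illustrate that assumption (a hypothetical check, not something Babel actually runs):

    ;; On a Unicode-based Lisp, CHAR-CODE yields the Unicode code point:
    (char-code #\í) ; => 237 (U+00ED)
    ;; ... and CODE-CHAR is its inverse:
    (code-char 237) ; => #\í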
> I believe that it's preferable for CODE-CHAR to return NIL in cases where it can reliably and efficiently detect that its argument doesn't denote a character, and CCL does this. Other implementations behave differently, and there may be reasons that I can't think of for finding that behavior preferable.
The main advantage seems to be the ability to deal with mis-encoded text non-destructively (through UTF-8B, UTF-32, or some other encoding). But perhaps that is a bad idea altogether?
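By non-destructive I mean a round trip like this one (a sketch using Babel's usual entry points; on CCL, the decoding step is where things break):

    ;; Malformed UTF-8: #xC3 is a lead byte with no continuation byte.
    (let* ((bytes (make-array 2 :element-type '(unsigned-byte 8)
                                :initial-contents '(#xC3 #x28)))
           ;; UTF-8B decodes the invalid #xC3 as the lone surrogate
           ;; #xDCC3 instead of signalling an error or substituting
           ;; U+FFFD, and decodes #x28 normally as #\(.
           (string (babel:octets-to-string bytes :encoding :utf-8b)))
      ;; Re-encoding recovers the original, mis-encoded bytes exactly.
      (equalp (babel:string-to-octets string :encoding :utf-8b) bytes))
    ;; => T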
> I'm not really sure that I understand the point of this email thread and I'm sure that I must have missed some context, but some part of it seems to be an attempt to convince me (or someone) that CODE-CHAR should never return NIL because of some combination of:
> - in other implementations, it never returns NIL
> - there is some otherwise useful code which fails (or its test suite fails) because it assumes that CODE-CHAR always returns a non-NIL value.
I'm sorry; the lack of context was entirely my fault. I should have described what was going on when I added openmcl-devel to the Cc list.
Let me try to sum things up. Babel is a charset encoding/decoding library. One of its main goals is to provide consistent behaviour across the Lisps it supports, particularly with regard to error handling. I believe it has largely succeeded in accomplishing that goal; this problem is the first inconsistency that I know of.
That is why I thought I should present this issue to the openmcl-devel list. I suppose I was indeed trying to get the CCL developers to change this behaviour (or to accept patches in that direction) in the hope of providing consistent behaviour for Babel users. I guess I'll instead have to add a note to Babel's documentation saying something like "UTF-8B does not work on Clozure CL". It's unfortunate, but not that big a deal, really.
> If I understand this much correctly, then I can only say that I didn't personally find these arguments persuasive when I was trying to decide how CODE-CHAR should behave in CCL a few years ago and don't find them persuasive now.
Fair enough. I don't have any more arguments. (Though I might stress again that the main problem is not that we assume CODE-CHAR always returns non-NIL; it's that we really do want to use some character codes that CCL forbids.)
> If there were a lot of otherwise useful code out there that made the same non-portable assumption and if it was really hard to write character-encoding utilities without assuming that all codes between 0 and CHAR-CODE-LIMIT denote characters, then I'd be less dismissive of this than I'm being. As it is, I'm sorry that I can't say anything more constructive than "I hope that you or someone will have the opportunity to change your code to remove non-portable assumptions that make it less useful with CCL than it would otherwise be."
Again, I'm curious how UTF-8B might be implemented when CODE-CHAR returns NIL for #xDC80 through #xDCFF.
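Concretely, the missing primitive is this single call (which works on the other Unicode-based Lisps Babel supports):

    ;; The decoder needs a character whose code is #xDC00 + invalid-byte,
    ;; so that the encoder can later recover the byte with CHAR-CODE.
    (code-char #xDC80) ; => NIL on CCL, a character on most other Lisps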