On Sun, Apr 12, 2009 at 9:10 AM, Gary Byers <gb@clozure.com> wrote:
>> Suppose (code-char 237) returned NIL instead of #\í. That's allowed by the CL standard, but I'm positive some Babel test would fail because of that.
> Assuming that the implementation in question used Unicode (or some subset of it) and that CHAR-CODE-LIMIT was > 237, it's hard to see how this case (where a character is associated with a code in Unicode) is analogous to the case that we're discussing (where Unicode says that no character is or ever can be associated with a particular code).
It's analogous because, in both cases, Babel is expecting CODE-CHAR to return non-NIL. In both cases, if CODE-CHAR returns NIL, code will break (e.g. the UTF-8B decoder). And, to be clear, the code breaks not because of the assumption per se, but because it really needs/wants to use some of those character codes.
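To make the failure mode concrete, here's a sketch of the relevant step in a UTF-8B decoder (simplified; this is not Babel's actual code, and the function name is made up):

    ;; UTF-8B smuggles each invalid input byte B through as the lone
    ;; surrogate code point #xDC00 + B, which is where CODE-CHAR comes in.
    (defun utf-8b-escape-byte (byte)
      ;; On implementations whose CODE-CHAR rejects surrogate code
      ;; points -- CCL among them -- this returns NIL and the decoder
      ;; has nothing to put in the output string.
      (code-char (+ #xDC00 byte)))

    (utf-8b-escape-byte #x80) ; => a character on most Lisps, NIL on CCL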
> The spec does quite clearly say that CODE-CHAR is allowed to return NIL if no character with the specified code attribute exists or can be created. CCL's implementation of CODE-CHAR returns NIL in many (unfortunately not all) cases where the Unicode standard says that no character corresponds to its code argument; other implementations currently do not return NIL in this case. There are a variety of arguments in favor of and against either behavior, ANSI CL allows either behavior, and code can't portably assume either behavior.
Again, you might argue that Babel's expectation is wrong, and you might be right. But that's the current expectation, and Babel's test suite should reflect that. There are a couple of other non-portable assumptions that Babel makes; for example, it expects char codes to be Unicode code points (or a subset thereof).
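To illustrate that assumption (a hypothetical check, not something Babel actually runs):

    ;; On a Unicode-based Lisp, CHAR-CODE yields the Unicode code point:
    (char-code #\í) ; => 237 (U+00ED)
    ;; ... and CODE-CHAR is its inverse:
    (code-char 237) ; => #\í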
> I believe that it's preferable for CODE-CHAR to return NIL in cases where it can reliably and efficiently detect that its argument doesn't denote a character, and CCL does this. Other implementations behave differently, and there may be reasons that I can't think of for finding that behavior preferable.
The main advantage seems to be the ability to deal with mis-encoded text non-destructively (through UTF-8B, UTF-32, or some other encoding). But perhaps that is a bad idea altogether?
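By non-destructive I mean a round trip like this one (a sketch using Babel's usual entry points; on CCL, the decoding step is where things break):

    ;; Malformed UTF-8: #xC3 is a lead byte with no continuation byte.
    (let* ((bytes (make-array 2 :element-type '(unsigned-byte 8)
                                :initial-contents '(#xC3 #x28)))
           ;; UTF-8B decodes the invalid #xC3 as the lone surrogate
           ;; #xDCC3 instead of signalling an error or substituting
           ;; U+FFFD, and decodes #x28 normally as #\(.
           (string (babel:octets-to-string bytes :encoding :utf-8b)))
      ;; Re-encoding recovers the original, mis-encoded bytes exactly.
      (equalp (babel:string-to-octets string :encoding :utf-8b) bytes))
    ;; => T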
> I'm not really sure that I understand the point of this email thread and I'm sure that I must have missed some context, but some part of it seems to be an attempt to convince me (or someone) that CODE-CHAR should never return NIL because of some combination of:
> - in other implementations, it never returns NIL
> - there is some otherwise useful code which fails (or its test suite fails) because it assumes that CODE-CHAR always returns a non-NIL value.
I'm sorry; the lack of context was entirely my fault. I should have described what was going on when I added openmcl-devel to the Cc list.
Let me try to sum things up. Babel is a charset encoding/decoding library. One of its main goals is to provide consistent behaviour across the Lisps it supports, particularly with regard to error handling. I believe it has largely succeeded in accomplishing that goal; this problem is the first inconsistency that I know of.
That is why I thought I should present this issue to the openmcl-devel list. I suppose I was indeed trying to get the CCL developers to change this behaviour (or to accept patches in that direction) in the hope of providing consistent behaviour for Babel users. I guess I'll instead have to add a note to Babel's documentation saying something like "UTF-8B does not work on Clozure CL". It's unfortunate, but not that big a deal, really.
> If I understand this much correctly, then I can only say that I didn't personally find these arguments persuasive when I was trying to decide how CODE-CHAR should behave in CCL a few years ago and don't find them persuasive now.
Fair enough. I don't have any more arguments. (Though I might stress again that the main problem is not that we assume CODE-CHAR always returns non-NIL; it's that we really do want to use some character codes that CCL forbids.)
> If there were a lot of otherwise useful code out there that made the same non-portable assumption and if it was really hard to write character-encoding utilities without assuming that all codes between 0 and CHAR-CODE-LIMIT denote characters, then I'd be less dismissive of this than I'm being. As it is, I'm sorry that I can't say anything more constructive than "I hope that you or someone will have the opportunity to change your code to remove non-portable assumptions that make it less useful with CCL than it would otherwise be."
Again, I'm curious how UTF-8B might be implemented when CODE-CHAR returns NIL for #xDC80 through #xDCFF.
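Concretely, the missing primitive is this single call (which works on the other Unicode-based Lisps Babel supports):

    ;; The decoder needs a character whose code is #xDC00 + invalid-byte,
    ;; so that the encoder can later recover the byte with CHAR-CODE.
    (code-char #xDC80) ; => NIL on CCL, a character on most other Lisps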