On Tue, 21 Apr 2009, Luís Oliveira wrote:
On Sun, Apr 12, 2009 at 9:10 AM, Gary Byers gb@clozure.com wrote:
If I understand this much correctly, then I can only say that I didn't personally find these arguments persuasive when I was trying to decide how CODE-CHAR should behave in CCL a few years ago and don't find them persuasive now.
It seems the discussion has run out of steam. Just to conclude it, I should ask: is it still the case that UTF-8B is not an argument compelling enough to make you consider a patch changing CODE-CHAR's behaviour, as well as the various encode- and decode-functions? (Such a patch would change CODE-CHAR to accept any code point, and deal with invalid code points explicitly in the UTF encoders and decoders.)
Yes, that is still the case.
Table 2-3 (in Section 2-4) in the Unicode spec describes how various classes of code points do and do not map to abstract characters in Unicode, and I think that it's undesirable for CODE-CHAR in a CL implementation that purports to use Unicode as its internal encoding to return a character object for codes that that table says do not denote a Unicode character. CCL's CODE-CHAR returns NIL for surrogates and (in recent versions) a couple of permanent noncharacter codes. As I've said, I'd feel better about it if CCL's CODE-CHAR returned NIL for all 66 permanent noncharacter codes, and if it cost nothing (in terms of time or space), I think that it'd be desirable for CODE-CHAR to return NIL for codes that are reserved as of the current version of the Unicode standard (or whatever version the lisp uses.) In the latter case, you may be able to get away with treating reserved codes as if they denoted defined characters - you wouldn't have the same issues with UTF-encoding them as would exist for surrogates, for instance - but you can't meaningfully treat a "reserved character" as if it were a defined character:
? (upper-case-p #\A) => T (in Unicode 5.1 and all prior and future versions)
? (upper-case-p (code-char #xd0000)) => unknown; as of Unicode 5.1, there's no such character
I think that it'd be more consistent to say "AFAIK, there's no such character" than it would be to claim that there is and that it is or is not an upper-case character. Since CODE-CHAR is sometimes on or near a critical performance path, it's not clear that making it 100% accurate is worth whatever that would cost in terms of time/space. It's clear to me that catching and rejecting surrogate code points as non-characters is worth the extra effort.
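To make the two cheaply detectable classes concrete, here is a minimal sketch in portable Common Lisp of the checks discussed above. It is not CCL's actual CODE-CHAR; the names SURROGATE-CODE-P, NONCHARACTER-CODE-P and STRICT-CODE-CHAR are hypothetical and exist only for illustration. The ranges are the ones the Unicode standard defines: surrogates are U+D800..U+DFFF, and the 66 permanent noncharacters are U+FDD0..U+FDEF plus the last two code points of each of the 17 planes. Reserved (unassigned) codes are deliberately not handled, since that would need a table tied to a specific Unicode version.

```lisp
;; Hypothetical helpers, for illustration only; not CCL's implementation.

(defun surrogate-code-p (code)
  "True if CODE lies in the surrogate range U+D800..U+DFFF."
  (<= #xD800 code #xDFFF))

(defun noncharacter-code-p (code)
  "True if CODE is one of the 66 permanent Unicode noncharacters."
  (or (<= #xFDD0 code #xFDEF)                  ; 32 codes
      (and (<= 0 code #x10FFFF)
           (>= (logand code #xFFFF) #xFFFE)))) ; 2 codes x 17 planes = 34

(defun strict-code-char (code)
  "Like CODE-CHAR, but returns NIL for surrogates and noncharacters
as well as for anything outside the Unicode code space."
  (and (integerp code)
       (<= 0 code #x10FFFF)
       (< code char-code-limit)
       (not (surrogate-code-p code))
       (not (noncharacter-code-p code))
       (code-char code)))
```

? (strict-code-char #x41) => #\A
? (strict-code-char #xD800) => NIL
? (strict-code-char #xFFFE) => NIL

Both checks are a handful of integer comparisons, which is why they are plausible even near a critical performance path; a version-accurate check for reserved codes would not be.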
-- Luís Oliveira http://student.dei.uc.pt/~lmoliv/