Dan Weinreb <dlw@itasoftware.com> writes:
Thanks for that link.
Cases like this, in which an illegal sequence is explicitly transformed into another illegal sequence, would meet with a lot of resistance from folks who care about security.
Assuming you're referring to UTF-8B, it should be pointed out (as James already did) that it's not specified by Unicode, and I would add that it certainly isn't a general-purpose encoding.
James also points out that UTF-8B in fact follows the guidelines put forward by TR36. Not that surprising since UTF-8B was, after all, proposed by a Unicode expert.
It's important not to do anything outside the definition. Your objection to CODE-CHAR returning NIL is incompatible with the Unicode concept of "Noncharacters". See the Unicode report section 16.7.
Well, that section says that the "Unicode Standard sets aside 66 noncharacter code points", and proceeds to specify them. CCL's CODE-CHAR returns *non-NIL* for all of those codes -- at least in the oldish version I have installed. A few comments about that:
1. Though Gary has hinted that he would like CCL to return NIL for these codes, it's probably a good thing that CODE-CHAR currently returns non-NIL for noncharacters. In the next paragraph of that section, the standard says that "applications are free to use any of these noncharacter code points internally".
2. Surrogate code points are not "noncharacters". The extra code points used by UTF-8B to represent invalid bytes are a subset of the surrogate code points. This distinction is probably not very useful, though.
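To make the distinction in the two points above concrete, here is a rough sketch in Python rather than Lisp. It enumerates the 66 noncharacters and compares them with the code points produced by Python's "surrogateescape" error handler, which to my knowledge uses the same mapping as UTF-8B (undecodable bytes 0x80..0xFF become U+DC80..U+DCFF). The helper names is_noncharacter and is_surrogate are mine, for illustration only; this says nothing about how CCL implements CODE-CHAR.

```python
def is_noncharacter(cp: int) -> bool:
    """True for the 66 code points Unicode sets aside as noncharacters."""
    # U+FDD0..U+FDEF, plus the last two code points of each of the 17 planes.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_surrogate(cp: int) -> bool:
    """True for the surrogate code points U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


# Enumerate every noncharacter in the Unicode code space.
noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]

# Decode a byte string containing two bytes that are invalid in UTF-8.
raw = b"ok\xff\x80"
text = raw.decode("utf-8", errors="surrogateescape")

# Collect the code points that stand in for the invalid bytes.
escapes = [ord(ch) for ch in text if is_surrogate(ord(ch))]

# Encoding with the same handler restores the original bytes.
roundtrip = text.encode("utf-8", errors="surrogateescape")

print(len(noncharacters))
print([hex(cp) for cp in escapes])
print(roundtrip == raw)
```

The escape code points all fall inside the low-surrogate range and none of them is a noncharacter, so the two sets are disjoint.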