This started with Douglas Miles remarking that two tests in the ANSI test suite have been failing ever since we increased CHAR-CODE-LIMIT to #x10000. That change was associated with our intended support for Unicode.
As it turns out, Unicode defines characters which conflict with a requirement from the CLHS: the CLHS requires that characters which have 'case' be defined in exact pairs. This means that the functions char-upcase and char-downcase can always be used to retrieve a character's "other case" equivalent.
However, Unicode has characters which don't have that property: for example, LATIN SMALL LETTER DOTLESS I maps to LATIN CAPITAL LETTER I, just as LATIN SMALL LETTER I does. Obviously, the capital can't map back to both. This is where the issue with CLHS compliance emerges.
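To make the asymmetry concrete, here is a minimal sketch in Python rather than Lisp. It assumes Python's str.upper and str.lower can stand in for a Unicode-following char-upcase and char-downcase; the helper name round_trips is made up for this illustration.

```python
# Illustration only: Python's str.upper / str.lower stand in for a
# Unicode-following char-upcase / char-downcase.
DOTLESS_I = "\u0131"  # LATIN SMALL LETTER DOTLESS I
SMALL_I = "i"         # LATIN SMALL LETTER I


def round_trips(ch):
    """Return True if downcasing the upcased character gives it back."""
    return ch.upper().lower() == ch


for ch in (SMALL_I, DOTLESS_I):
    up = ch.upper()
    print(f"U+{ord(ch):04X} -> U+{ord(up):04X}, round-trips: {round_trips(ch)}")
```

Both characters upcase to U+0049, but only the ordinary small i comes back from the round trip, which is exactly the pairing the CLHS asks for and Unicode doesn't provide here.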
Three possible solutions have come up:
1a. Be CLHS compliant, but not Unicode compliant.
1b. Same as (1a), but provide specific Unicode up/down casing functions.
2. Be Unicode compliant, but not CLHS compliant.
SBCL (and, going by Sam Steingold's remark, CLISP too) chooses 1a: I haven't found a function in it to do Unicode up/down casing, and it defines the uppercase of the dotless i to be itself (caseless). This solution results in CLHS compliance but, in my opinion, isn't the solution of least surprise: if you decide to upcase a string without in-depth awareness of the issue, you're suddenly faced with a string which is upcased except for a number of characters.
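The difference between the options can be sketched as follows, again in Python. This is only a model under stated assumptions, not how SBCL, CLISP or ABCL actually implement casing: strict_char_upcase is a hypothetical stand-in for a 1a-style char-upcase that treats any character without an exact case pair as caseless, and unicode_string_upcase is a hypothetical stand-in for the separate function option 1b would add.

```python
def strict_char_upcase(ch):
    """Hypothetical 1a-style upcase: only map characters whose case
    mapping round-trips; everything else is treated as caseless."""
    up = ch.upper()
    if len(up) == 1 and up.lower() == ch:
        return up
    return ch


def strict_string_upcase(s):
    """Upcase a string character by character using the strict rule."""
    return "".join(strict_char_upcase(c) for c in s)


def unicode_string_upcase(s):
    """Hypothetical 1b-style extra function: follow Unicode case mappings."""
    return s.upper()


WORD = "k\u0131rm\u0131z\u0131"  # a word containing three dotless i's

print(strict_string_upcase(WORD))   # dotless i's are left as they are
print(unicode_string_upcase(WORD))  # dotless i's become capital I
```

Under the strict rule the result mixes upper case letters with untouched dotless i's, which is the surprise described above; the Unicode-following function gives a fully upcased string.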
I would propose that we diverge from the CLHS on this issue, and document that we do: we follow Unicode, and the upper case version of the dotless i is just the capital I. From a user perspective, this seems like the solution of least surprise: I would expect people who use characters which can't be round-tripped to know about that.
Sam suggests there's an issue with symbol I/O. He's probably referring to readtable-case and *print-case*. My reasoning here too is that people using characters like these in their symbols can be expected to be familiar with both the casing behaviour of the Common Lisp reader/printer and the behaviour of their letters in such circumstances: if a string were uppercased in a certain way, wouldn't it be extremely weird if your symbols weren't too, given that your Common Lisp claims Unicode support?
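If that is indeed the concern, the effect can be modelled roughly like this. The two functions are hypothetical and only imitate a reader that upcases symbol names and a printer that downcases them, both using Unicode case mappings; they are not taken from any implementation.

```python
def read_symbol_name(token):
    """Model of a reader that folds symbol names to upper case."""
    return token.upper()


def print_symbol_name(name):
    """Model of a printer that prints symbol names in lower case."""
    return name.lower()


typed = "\u0131"                       # symbol typed as a single dotless i
interned = read_symbol_name(typed)     # name stored after reading
printed = print_symbol_name(interned)  # name as printed back

print(f"typed U+{ord(typed):04X}, interned U+{ord(interned):04X}, "
      f"printed U+{ord(printed):04X}")
```

In this model the symbol is interned under the same name as one typed with an ordinary i, and it prints back with the ordinary i. That is consistent with how strings would be upcased under the proposal, but it is something users of such characters would need to be aware of.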
So, my proposal here is to diverge. What are your opinions?
Bye,
Erik.
I believe compliance trumps everything else; that's how ABCL has distinguished itself from any other JVM-based implementation of CL.
From this point of view, 1b seems the most reasonable.
There are two types of users:
- a programmer writing code, and he/she shouldn't be surprised by anything;
- a client of the programmer, and he/she shouldn't see any difference one way or the other.
Date: Sun, 4 Apr 2010 14:38:40 +0200
From: ehuels@gmail.com
To: armedbear-devel@common-lisp.net
Subject: [armedbear-devel] Unicode support vs spec conformance
armedbear-devel mailing list armedbear-devel@common-lisp.net http://common-lisp.net/cgi-bin/mailman/listinfo/armedbear-devel