On Tue, Apr 14, 2009 at 10:17 AM, David Lichteblau david@lichteblau.com wrote:
> In Allegro CL and LispWorks, the situation is very different. They use UTF-16 to represent Lisp strings in memory, so surrogates aren't forbidden in Lisp strings; on the contrary, user code actually needs to work with them to be able to use all of Unicode.
My understanding was that they use UCS-2, i.e., they are limited to the BMP. AFAICT, their external formats don't produce surrogates in Lisp strings. (ACL doesn't. I didn't test LispWorks, but its documentation specifically mentions the BMP and UCS-2.) They don't seem to have functions to deal with surrogates either.
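For reference, the BMP limitation is easy to see at the REPL. This is just a sketch; U+10400 is an arbitrary non-BMP code point:

  ;; Is this Lisp limited to the BMP?  On the BMP-only Lisps
  ;; CHAR-CODE-LIMIT is at most #x10000; on SBCL/CCL it is #x110000
  ;; and CODE-CHAR succeeds for supplementary-plane code points.
  (if (> char-code-limit #x10000)
      (code-char #x10400)   ; U+10400, DESERET CAPITAL LETTER LONG I
      :bmp-only)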
> SBCL has 21-bit characters like CCL and currently has characters for the surrogate code points. But I am not aware of any consensus that this is the right thing to do. Personally I think it's a bug and SBCL should be changed to do it like CCL.
I don't feel that strongly about either option. (I feel it's more important that the various Lisps agree on one of them.) That said, so far I haven't heard any compelling arguments in favor of having CODE-CHAR return NIL for such code points.
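To make the difference concrete, this is the behaviour in question (a sketch; the printed representation will vary by implementation):

  ;; U+D800 is a high-surrogate code point.
  (code-char #xD800)
  ;; SBCL: returns a character object (printed as something like #\UD800)
  ;; CCL:  returns NIL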
> As far as I understand, the only Lisp with 21-bit characters whose author thinks that SBCL's behaviour is correct is ECL, but I failed to understand the reasoning behind that when it was being discussed on comp.lang.lisp.
[Assuming you're referring to the "Unicode and Common Lisp" thread: <ab100561-286c-4570-aabc-72fd877f22ae@v18g2000pro.googlegroups.com>, http://groups.google.com/group/comp.lang.lisp/browse_thread/thread/97ff103aee76ada2]
I couldn't find where Juanjo argued for or against CODE-CHAR returning NIL for surrogates. His (somewhat unrelated) main point, IIUC, is that you shouldn't try to support full Unicode when all you have is 16-bit characters, and should instead restrict Unicode handling to the BMP (U+0000 through U+FFFF).
> (As a side note, I find it a huge hassle to write code portable between the Lisp implementations with Unicode support. For CXML, I needed read-time conditionals checking for UTF-16 Lisps. And it still doesn't actually work, because most of the other free libraries like Babel, CL-Unicode, and, in turn, CL-PPCRE, expect 21-bit characters and are effectively broken on Allegro and LispWorks.)
I would argue that you're trying too hard. Would you support UTF-8 on CMUCL too? (Yes, I am aware that Unicode support for CMUCL is imminent. Using actual UTF-16... with surrogates... *sigh*) Just ignore/replace code points equal to or greater than CHAR-CODE-LIMIT.
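Concretely, I mean something along these lines (a hypothetical helper, not actual CXML or Babel code):

  ;; Map any code point this Lisp can't represent -- and any surrogate --
  ;; to U+FFFD REPLACEMENT CHARACTER, falling back to #\? on 8-bit Lisps.
  (defun substitute-unrepresentable (code)
    (if (or (>= code char-code-limit)
            (<= #xD800 code #xDFFF))
        (if (> char-code-limit #xFFFD) (code-char #xFFFD) #\?)
        (or (code-char code) #\?)))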
Does that mean CXML won't pass the test suites on Allegro and LispWorks? So be it. If enough Allegro or LispWorks customers need to deal with characters outside the BMP, they'll complain and it'll be fixed. Duane Rettig hints in that c.l.l thread that Allegro might support full 21-bit Unicode characters in the future. (But perhaps I'm being too optimistic. Perhaps you really, really need to use CXML+LispWorks/Allegro along with characters outside the BMP? Then ignore what I said above, I guess.)
My plan for Babel was not to assume 21-bit characters, but to punt on code points at or above CHAR-CODE-LIMIT. (It doesn't do that properly at the moment; that's definitely a bug that will have to be fixed.)
Would you argue that it'd be better for Babel to instead use UTF-16 on LispWorks/Allegro? (Not a rhetorical question; if that turns out to be a good idea, I'd change Babel in that direction.) What about UTF-8 for Lisps with 8-bit characters? I suspect that restricting oneself to a subset of Unicode is more robust and more manageable for portable programs.
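For what it's worth, using UTF-16 internally would mean the decoders emit surrogate pairs themselves; the arithmetic would be something like this sketch:

  ;; Split a supplementary-plane code point (#x10000..#x10FFFF) into a
  ;; UTF-16 surrogate pair.
  (defun surrogate-pair (code)
    (let ((v (- code #x10000)))
      (values (+ #xD800 (ldb (byte 10 10) v))    ; high (leading) surrogate
              (+ #xDC00 (ldb (byte 10  0) v))))) ; low (trailing) surrogate

  ;; (surrogate-pair #x10400) => #xD801, #xDC00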
> While I have no ideas regarding UTF-8b, I think it's worth pointing out that for the important use case of file names, there is a different way of achieving a round-trip involving file names in "might be UTF-8" format.
> The idea is to interpret invalid UTF-8 bytes as Latin-1, but prefix them with the code point 0.
> On encoding back to a file name, such null characters would be stripped again.
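If I understand the proposal correctly, it amounts to something like this (a hypothetical sketch, not code from CXML or anywhere else):

  ;; Decoding: an invalid byte B becomes the two characters #\Nul and
  ;; (CODE-CHAR B), i.e. B read as Latin-1 preceded by a NUL marker.
  (defun escape-invalid-byte (byte)
    (list (code-char 0) (code-char byte)))

  ;; Encoding back to a file name: drop each NUL marker and emit the
  ;; character following it as a raw byte again.  (Characters that were
  ;; decoded normally would go through the regular UTF-8 encoder.)
  (defun collect-escaped-bytes (string)
    (loop with i = 0
          while (< i (length string))
          if (char= (char string i) (code-char 0))
            collect (char-code (char string (1+ i))) and do (incf i 2)
          else do (incf i)))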
That sounds like a worse hack than UTF-8b, because if you convert such a string into another encoding you'll get bogus characters with no indication of error instead of, say, replacement characters. (That seems to be a big advantage of representing invalid bytes as invalid characters. Doesn't that make sense?)