Hello,
Does anyone have a global view of the status of Unicode support in the main CL implementations (SBCL, CMU-CL, ECL, CCL, CLISP, ABCL, ACL, LispWorks)? By that, I mean mostly being able to encode source files in UTF-8, with the following concerns:
- Do I need to do something special at the implementation level?
- Do I need to do something special at the ASDF system level?
- Could this break the code of people using my libraries?
Thank you.
Excellent question. Time for a CDR?
MA
pro mailing list pro@common-lisp.net http://lists.common-lisp.net/cgi-bin/mailman/listinfo/pro
-- Marco Antoniotti, Associate Professor tel. +39 - 02 64 48 79 01 DISCo, Università Milano Bicocca U14 2043 http://bimib.disco.unimib.it Viale Sarca 336 I-20126 Milan (MI) ITALY
Please note that I am not checking my Spam-box anymore. Please do not forward this email without asking me first.
On Wed, Sep 26, 2012 at 3:46 AM, Didier Verna didier@lrde.epita.fr wrote:
Does anyone have a global view of the status of Unicode support in the main CL implementations (SBCL, CMU-CL, ECL, CCL, CLISP, ABCL, ACL, LispWorks)? By that, I mean mostly being able to encode source files in UTF-8, with the following concerns:
- Do I need to do something special at the implementation level?
- Do I need to do something special at the ASDF system level?
- Could this break the code of people using my libraries?
So far as I can tell, all these implementations support Unicode, though some of them can be explicitly compiled without it.
ASDF has supported Unicode since release 2.21 (April 2012). The recommended, backwards-compatible incantation (in your defsystem, or on any specific component) is: #+asdf-unicode :encoding #+asdf-unicode :utf-8.
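For illustration, here is a minimal sketch of where that incantation would sit in a system definition. The system name and file names are hypothetical; only the two feature-guarded forms come from the message above.

```lisp
;; my-library.asd -- hypothetical system definition, for illustration only.
(asdf:defsystem :my-library
  :description "Example system whose source files are encoded in UTF-8."
  ;; On an ASDF without the :asdf-unicode feature, the reader skips both
  ;; guarded forms, so the definition still loads there.
  #+asdf-unicode :encoding #+asdf-unicode :utf-8
  :components ((:file "package")
               (:file "main" :depends-on ("package"))))
```

Because each of the two forms carries its own #+asdf-unicode guard, an older ASDF sees neither the keyword nor its value, and the rest of the property list stays balanced.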
I intend to make UTF-8 the default eventually, but when we last checked (in April this year), that would have broken 7 files across all of Quicklisp, and their authors never replied about fixing them.
Also, if you (asdf:load-system :asdf-encodings) explicitly and early, you can use :encoding :latin1, or :encoding :euc-jp, or whichever encoding your implementation supports (asdf-encodings at this time won't transcode things for you).
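A sketch of that second setup, assuming asdf-encodings is installed where ASDF can find it. The system and file names are hypothetical, and the per-file override relies on the earlier remark that :encoding may be given on any specific component.

```lisp
;; Load the extension before ASDF processes the system definition.
;; Assumption: asdf-encodings is installed and visible to ASDF.
(asdf:load-system :asdf-encodings)

(asdf:defsystem :my-legacy-library      ; hypothetical name
  :encoding :utf-8                       ; default for every component
  :components ((:file "package")
               ;; Hypothetical file that is still stored as Latin-1.
               (:file "old-tables" :encoding :latin1
                                   :depends-on ("package"))))
```

Since nothing is transcoded, each file has to be stored on disk in the encoding declared for it.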
An example system that explicitly uses this UTF-8 support is lambda-reader, which I published earlier this year (last edited in April also), based on an initial implementation by Brian Mastenbrook.
—♯ƒ • François-René ÐVB Rideau •Reflection&Cybernethics• http://fare.tunes.org To send men to the firing squad, judicial proof is unnecessary... These procedures are an archaic bourgeois detail. This is a revolution! And a revolutionary must become a cold killing machine motivated by pure hate. — Che Guevara
I think it might be worthwhile to look at Unicode beyond just seeing if files can be encoded as UTF-8.
The concept of "Unicode support" is pretty loaded. What does it mean? Does Unicode support mean that one can operate on strings stored in a particular fashion? Does it mean functions like LENGTH handle combining characters correctly (e.g., any character followed by a combining circumflex: does that have length 1 or 2)? Do the printers support things like right-to-left printing?
See http://stackoverflow.com/a/6163129 for details on why Unicode support isn't a simple concept.
Cheers,
Robert Smith
On 26 September 2012 20:23, Robert Smith quad@symbo1ics.com wrote:
I think it might be worthwhile to look at Unicode beyond just seeing if files can be encoded as UTF-8.
The concept of "Unicode support" is pretty loaded. What does it mean? Does Unicode support mean that one can operate on strings stored in a particular fashion? Does it mean functions like LENGTH handle combining characters correctly (e.g., any character followed by a combining circumflex: does that have length 1 or 2)? Do the printers support things like right-to-left printing?
I think the CL standard is pretty clear on what LENGTH does -- Unicode doesn't come into it, /unless/ you happen to be on an implementation that supports custom sequence types and have defined one that understands combining characters.
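A small sketch of that point, assuming an implementation whose CHAR-CODE-LIMIT is above #x302, so that U+0302 (COMBINING CIRCUMFLEX ACCENT) is a character. The function name is made up for illustration.

```lisp
(defun combining-example ()
  "Return a string made of #\\e followed by U+0302, and its LENGTH.
Returns NIL when the implementation has no character with code #x302."
  (let ((circumflex (and (> char-code-limit #x302)
                         (code-char #x302))))
    (when circumflex
      (let ((s (coerce (list #\e circumflex) 'string)))
        ;; LENGTH counts the elements of the sequence, i.e. characters,
        ;; not what a reader would perceive as one accented letter.
        (values s (length s))))))
```

On such an implementation the second value is 2, since the string has two elements, even though a terminal would typically render them as a single glyph.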
The only place where standard really hooks into Unicode is external formats. Most (all?) of the tricky unicode stuff should IMO be separate functions, instead of introducing subtleties to standard ones.
I think some crucial questions are:
* What is CHAR-CODE-LIMIT?
* Are there holes in the char-code range?
* Which external formats are supported?
* Can strings contain arbitrary codepoints, or only things that represent fully-fledged characters? (Can UTF-8b be supported?)
* Can users define new external formats?
* Are multiple line-ending conventions supported?
* BOM?
* Are the character names there?
* Is the Unicode database, which the implementation needs to have anyway, accessible via a documented API?
* Is everything that should be O(1) O(1), or are some things O(N) with Unicode?
* Are there multiple string representations? (E.g., one for the 0-255 range, one for the full char-code range.)
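A few of these questions can be probed from portable code. The sketch below is only a starting point: the function name and the property keys are made up, and external formats, BOM handling and the Unicode database have no portable API, so they are not covered.

```lisp
(defun unicode-report ()
  "Return a property list describing a few Unicode-related implementation traits."
  (list :char-code-limit char-code-limit
        ;; A hole in the char-code range: some implementations return NIL
        ;; for surrogate code points such as #xD800.
        :surrogate-is-character
        (and (> char-code-limit #xD800)
             (characterp (code-char #xD800)))
        ;; Name of U+00E9, if the implementation has a name for it.
        :name-of-e-acute
        (and (> char-code-limit #xE9)
             (let ((c (code-char #xE9)))
               (and c (char-name c))))
        ;; True when BASE-CHAR is a proper subtype of CHARACTER, which
        ;; hints at more than one string representation.
        :distinct-base-strings
        (not (subtypep 'character 'base-char))))
```

Running (unicode-report) in each implementation of interest and comparing the results gives a rough first view. Whether everything that should be O(1) really is needs benchmarking rather than introspection.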
Cheers,
-- Nikodemus