#65: UTF-32 strings support
------------------------+---------------------------------------------------
 Reporter:  ehuelsmann  |       Owner:  somebody
     Type:  defect      |      Status:  new
 Priority:  major       |   Milestone:
Component:  other       |     Version:
 Keywords:              |
------------------------+---------------------------------------------------
ABCL uses Java char[]s to represent its strings. However, the char type can only represent values in the BMP (Basic Multilingual Plane), because only the BMP can be represented using 16 bits.
For supplementary characters (all Unicode characters outside the BMP), Java uses a pair of surrogate code units (UTF-16), so one such character occupies two char elements.
Common Lisp programs don't expect this and need each element of a string to be a complete character.
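
To illustrate the mismatch described above, here is a minimal Java sketch. The class and method names (SurrogateDemo, charLength, codePointLength) are made up for this example and are not part of ABCL; only standard java.lang calls are used. It shows that a single supplementary character takes two char elements but counts as one code point.

```java
class SurrogateDemo {
    // U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP.
    static final String CLEF = new String(Character.toChars(0x1D11E));

    // Number of 16-bit char elements (UTF-16 code units).
    static int charLength(String s) {
        return s.length();
    }

    // Number of Unicode code points, which is what a Lisp program
    // would expect (length string) to report.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(charLength(CLEF));                           // 2
        System.out.println(codePointLength(CLEF));                      // 1
        System.out.println(Character.isHighSurrogate(CLEF.charAt(0)));  // true
        System.out.println(Character.isLowSurrogate(CLEF.charAt(1)));   // true
        System.out.println(Integer.toHexString(CLEF.codePointAt(0)));   // 1d11e
    }
}
```

A string backed by char[] therefore reports a length of 2 for this one character, and indexing element 0 yields a lone high surrogate rather than a complete character.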
Changes (by ehuelsmann):
  * owner:  somebody => nobody
  * component:  other => libraries
Comment(by ehuelsmann):
Relevant in this discussion is the article about [http://java.sun.com/developer/technicalArticles/Intl/Supplementary/ supplementary characters (code points > #xFFFF) in Java].
Changes (by ehuelsmann):
* milestone: => unscheduled
Comment:
Not planned.
Comment(by mevenson):
I think it is possible to use FLEXI-STREAMS to handle UTF-32 strings.
Changes (by mevenson):
  * version:  => 1.1.0
  * milestone:  unscheduled => 1.2.0
Comment(by ehuelsmann):
On #lisp, pjb writes on this subject:
... you must be careful that in most CL implementations, characters are unicode characters (not even code-points in a number of implementations!), and therefore we are talking of real strings of characters (32-bit each usually), not vector of utf-8 bytes. (For some things, you may need to deal with vectors of bytes instead of strings, and there, lisp macros and reader macros can come handy to ease manipulations of those vectors of bytes that usually represent ASCII or UTF-8 encoded characters).
Where I ask:[[BR]] pjb: how's that possible? Some far-east "characters" will consist of multiple code points, with up to 6 or 7 "modifier" code points; how can all that fit into 32-bits, if each code point is 21-bit in itself?
and pjb answers:[[BR]] ehu: that's what I mean, some implementation may choose to represent those characters as a pointer to a sequence of code points.
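
The distinction between a character and a code point raised in this exchange can be sketched in Java as follows. The class name CombiningDemo and its helpers are hypothetical, chosen only for this example; the sample string is a simple Latin case rather than one of the far-east sequences mentioned above. It shows one user-perceived character made of a base code point plus a combining mark, so even a 32-bit-per-element string needs two elements for it.

```java
class CombiningDemo {
    // "e" followed by U+0301 COMBINING ACUTE ACCENT:
    // rendered as a single character, stored as two code points.
    static final String E_ACUTE = "e\u0301";

    // All code points of the string, one int (up to 21 significant bits) each.
    static int[] codePoints(String s) {
        return s.codePoints().toArray();
    }

    // True if the code point is a non-spacing mark that modifies
    // the preceding base character.
    static boolean isCombiningMark(int codePoint) {
        return Character.getType(codePoint) == Character.NON_SPACING_MARK;
    }

    public static void main(String[] args) {
        int[] cps = codePoints(E_ACUTE);
        System.out.println(cps.length);               // 2
        System.out.println(isCombiningMark(cps[0]));  // false
        System.out.println(isCombiningMark(cps[1]));  // true
        for (int cp : cps) {
            System.out.printf("U+%04X %s%n", cp, Character.getName(cp));
        }
    }
}
```

An implementation whose character type covers such a whole sequence would, as pjb suggests, have to store something like a reference to the sequence of code points rather than a single fixed-width value.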