On 26 September 2012 20:23, Robert Smith quad@symbo1ics.com wrote:
I think it might be worthwhile to look at unicode beyond just seeing if files can be encoded as utf8.
The concept of "unicode support" is pretty loaded. What does it mean? Does unicode support mean that one can operate on strings stored in a particular fashion? Does it mean functions like LENGTH handle overlaying characters correctly (e.g., any character plus a circumflex overlaying character... does that have length 1 or 2?)? Do the printers support stuff like right-to-left printing?
I think the CL standard is pretty clear on what LENGTH does -- Unicode doesn't come into it, /unless/ you happen to be on an implementation that supports custom sequence types and have defined one that understands combining characters.
The only place where the standard really hooks into Unicode is external formats. Most (all?) of the tricky Unicode stuff should IMO be separate functions, instead of introducing subtleties to the standard ones.
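To make the LENGTH point concrete, here is a minimal sketch. It assumes an implementation where character codes are Unicode code points and CHAR-CODE-LIMIT is above #x302 (otherwise CODE-CHAR may return NIL and the COERCE will fail). The names CIRCUMFLEXED-E, COMBINING-DIACRITICAL-P and APPROXIMATE-CLUSTER-COUNT are made up for illustration, and the combining test only looks at the U+0300..U+036F block; a real version would consult the Unicode database.

```lisp
;; Assumption: char codes are Unicode code points and CHAR-CODE-LIMIT > #x302.
(defun circumflexed-e ()
  "Return a two-character string: #\\e followed by U+0302 COMBINING CIRCUMFLEX ACCENT."
  (coerce (list #\e (code-char #x302)) 'string))

;; Hypothetical helper: a crude stand-in for a real Unicode property lookup.
(defun combining-diacritical-p (char)
  "True if CHAR lies in the Combining Diacritical Marks block (U+0300..U+036F)."
  (<= #x300 (char-code char) #x36F))

;; Hypothetical helper: the kind of separate function that could sit
;; next to LENGTH without changing what LENGTH means.
(defun approximate-cluster-count (string)
  "Count the characters of STRING that are not combining diacritical marks."
  (count-if-not #'combining-diacritical-p string))

(length (circumflexed-e))                    ; 2: LENGTH counts characters
(approximate-cluster-count (circumflexed-e)) ; 1: base character plus its mark
```

LENGTH stays a plain element count, and anything that cares about combining characters goes through a separately named function.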
I think some crucial questions are:
* What is CHAR-CODE-LIMIT?
* Are there holes in the char-code range?
* Which external formats are supported?
* Can strings contain arbitrary codepoints, or only things that represent fully-fledged characters? (Can UTF-8b be supported?)
* Can users define new external formats?
* Are multiple line-ending conventions supported?
* Is the BOM (byte order mark) handled?
* Are the character names there?
* Is the Unicode database, which the implementation needs to have anyway, accessible via a documented API?
* Is everything that should be O(1) O(1), or are some things O(N) with Unicode?
* Are there multiple string representations? (E.g. one for the 0-255 range, one for the full code-char range.)
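A few of these questions can be probed from the REPL with standard functions only. The following is a rough sketch, not a complete survey: FIRST-CHAR-CODE-HOLE and DESCRIBE-CHARACTER-SUPPORT are hypothetical names, and the results are implementation-dependent. External formats, line endings and the BOM are not covered, since those need implementation-specific checks.

```lisp
;; Hypothetical helper: scans the whole char-code range, so it takes a
;; moment on implementations with a large CHAR-CODE-LIMIT.
(defun first-char-code-hole (&optional (limit char-code-limit))
  "Return the lowest code below LIMIT for which CODE-CHAR returns NIL,
or NIL if every code in the range maps to a character."
  (loop for code from 0 below limit
        unless (code-char code) return code))

;; Hypothetical helper: gathers a few answers into a property list.
(defun describe-character-support ()
  "Return a plist describing some aspects of the character support."
  (let ((base (upgraded-array-element-type 'base-char))
        (full (upgraded-array-element-type 'character)))
    (list :char-code-limit char-code-limit
          :first-hole (first-char-code-hole)
          ;; CHAR-NAME may return NIL if the implementation has no name.
          :name-of-a (char-name #\a)
          :base-string-element-type base
          :string-element-type full
          ;; Equal element types suggest a single string representation.
          :multiple-representations-p (not (equal base full)))))

(describe-character-support)
```

This says nothing about whether operations such as CHAR or AREF are O(1) on strings; that still has to be checked in the implementation's documentation or by measurement.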
Cheers,
-- Nikodemus