Re: [pro] [Q] unicode support

29 Sep 2012

      On 26 September 2012 20:23, Robert Smith <quad@symbo1ics.com> wrote:
...
I think it might be worthwhile to look at unicode beyond just seeing
if files can be encoded as utf8.
...
The concept of "unicode support" is pretty loaded. What does it mean?
Does unicode support mean that one can operate on strings stored in a
particular fashion? Does it mean functions like LENGTH handle
overlaying characters correctly (e.g., any character plus a circumflex
overlaying character... does that have length 1 or 2?)? Do the
printers support stuff like right-to-left printing?

I think the CL standard is pretty clear on what LENGTH does -- Unicode
doesn't come into it, /unless/ you happen to be on an implementation
that supports custom sequence types and defined one that understands
combining characters.
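
To make the LENGTH question concrete, here is a small sketch in Python rather than CL (chosen only for illustration; Python's built-in len counts code points, much as LENGTH counts characters in a Unicode-aware implementation). The helper names are hypothetical, and skipping combining marks is only an approximation of "user-perceived characters", not full grapheme cluster segmentation.

```python
import unicodedata

def codepoint_length(s: str) -> int:
    """Number of code points, the analogue of a plain LENGTH call."""
    return len(s)

def length_ignoring_combining(s: str) -> int:
    """Hypothetical separate function: count only code points whose
    canonical combining class is 0, i.e. skip combining marks.
    This is an approximation, not UAX #29 grapheme segmentation."""
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)

# "e" followed by U+0302 COMBINING CIRCUMFLEX ACCENT
decomposed = "e\u0302"
# NFC normalization composes the pair into the single code point U+00EA
composed = unicodedata.normalize("NFC", decomposed)

print(codepoint_length(decomposed))           # counts both code points
print(length_ignoring_combining(decomposed))  # counts the base letter only
print(codepoint_length(composed))             # one precomposed code point
```

Keeping the combining-aware count in its own function, instead of changing what the standard length operation returns, matches the separation argued for in this message.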

The only place where the standard really hooks into Unicode is external
formats. Most (all?) of the tricky unicode stuff should IMO be
separate functions, instead of introducing subtleties to standard
ones.

I think some crucial questions are:

* What is CHAR-CODE-LIMIT?

* Are there holes in the char-code range?

* Which external formats are supported?

* Can strings contain arbitrary codepoints, or only things that
represent fully-fledged characters? (Can UTF-8b be supported?)

* Can users define new external formats?

* Are multiple line-ending conventions supported?

* BOM?

* Are the character names there?

* Is the unicode database the implementation needs to have anyways
accessible via a documented API?

* Is everything that should be O(1) O(1), or are some things O(N) with Unicode?

* Are there multiple string representations? (E.g., one for 0-255 range,
one for full code-char range.)
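
A few of the questions above can be probed mechanically. The sketch below does so in Python, purely as an illustration of the kind of checks meant; it says nothing about any particular CL implementation. All function names are hypothetical. The `surrogateescape` error handler is used as a stand-in for the UTF-8b idea (round-tripping bytes that are not valid UTF-8), on the assumption that it is close enough in spirit for the purpose.

```python
import codecs
import unicodedata

# Unicode code points run from 0 through 0x10FFFF, so the exclusive
# upper bound (the analogue of CHAR-CODE-LIMIT) is 0x110000.
CODE_LIMIT = 0x110000

def is_surrogate(code: int) -> bool:
    """Surrogates (U+D800..U+DFFF) are the classic 'hole': code points
    that are not characters in their own right."""
    return 0xD800 <= code <= 0xDFFF

def can_encode_utf8(code: int) -> bool:
    """Can a string holding this single code point be written as strict UTF-8?"""
    try:
        chr(code).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False

def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 byte order mark, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

def roundtrip_arbitrary_bytes(data: bytes) -> bytes:
    """Decode and re-encode bytes that may not be valid UTF-8, preserving
    them via lone surrogates (assumed here to approximate UTF-8b)."""
    text = data.decode("utf-8", errors="surrogateescape")
    return text.encode("utf-8", errors="surrogateescape")

# Character names and the reverse lookup from the Unicode database
name = unicodedata.name("\u0302")
by_name = unicodedata.lookup("LATIN SMALL LETTER E WITH CIRCUMFLEX")
```

In this sketch a string can hold a lone surrogate, but strict UTF-8 output rejects it, which is exactly the "arbitrary codepoints vs. fully-fledged characters" distinction raised in the list.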

Cheers,

 -- Nikodemus

Nikodemus Siivola