On 10/1/10 6:35 AM, Helmut Eller wrote:
- Raymond Toy [2010-10-01 10:20] writes:
What is the length of *s* or (prin1-to-string *s*) now? Should it be 3 not 4?
Good question. The answer now is 4, not 3. There are 4 code units in the string, so that is the length. Length would be really slow if it had to scan the whole string looking for surrogate pairs and counting them as one instead of two.
Is that the reason for the problem? Confusion between emacs and lisp on the length of the string? It does appear that the string only has 3 characters, as displayed by emacs.
Very likely. Emacs uses something like utf-8 internally and counts code points, not code units (except for line endings, which is probably a different issue).
Doesn't acl have this problem too? It also uses 16-bit strings like cmucl.
Allegro has no lisp:codepoint function and (code-char #x10000) returns nil.
The lisp:codepoint function was just a convenience for creating the necessary surrogate pair.
In Java, strings have a length method which returns code units and a codePointCount method for the other use. Maybe CMUCL has something like that and we should use it in SWANK.
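To make the Java comparison concrete, here is a minimal sketch of the two counts. The class name and the sample string are made up for illustration; the sample is not the actual *s* from this thread, just two BMP characters around U+10000 encoded as a surrogate pair. Only String.length() and String.codePointCount(int, int) are real Java API here.

```java
// Illustrative sketch only: the class name and sample string are assumptions,
// not anything taken from SWANK or CMUCL.
final class CodeUnitsVsCodePoints {
    // 'a', then U+10000 written as a UTF-16 surrogate pair, then 'b'.
    static final String SAMPLE = "a\uD800\uDC00b";

    // Number of 16-bit code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a valid surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println("code units:  " + codeUnits(SAMPLE));
        System.out.println("code points: " + codePoints(SAMPLE));
    }
}
```

For this sample the first count is 4 and the second is 3, which is the same kind of mismatch as the one between the Lisp length and what Emacs displays.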
CMUCL doesn't currently have a codePointCount function, but that's easy enough to add if slime wants it. Here's one:
(defun codepoint-count (string)
  "Return the number of code points in the string. The string MUST be a valid UTF-16 string."
  (do ((len (length string))
       (index 0 (1+ index))
       (count 0 (1+ count)))
      ((>= index len)
       count)
    (multiple-value-bind (codepoint wide)
        (lisp:codepoint string index)
      (declare (ignore codepoint))
      (when wide
        (incf index)))))
Ray