Re: [slime-devel] CMUCL unicode strings breaks slime

1 Oct 2010


On 10/1/10 11:45 AM, Helmut Eller wrote:
...
* Raymond Toy [2010-10-01 15:18] writes:
...
CMUCL doesn't currently have a codePointCount function, but that's easy
enough to add if slime wants it.  Here's one:
(defun codepoint-count (string)
  "Return the number of code points in the string.  The string MUST be
  a valid UTF-16 string."
  (do ((len (length string))
       (index 0 (1+ index))
       (count 0 (1+ count)))
      ((>= index len)
       count)
    (multiple-value-bind (codepoint wide)
        (lisp:codepoint string index)
      (declare (ignore codepoint))
      (when wide (incf index)))))
I hope this is faster than it looks :-).
I doubt it. :-)  But if we assume the string is a valid UTF-16 string,
then this code could be made faster: lisp:codepoint does some extra
checks in case the index points at a trailing surrogate.  We proceed
in sequence, so we never land on a trailing surrogate.
(Actually, this code is probably buggy if the string is not a valid
UTF-16 string.)
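
A minimal sketch of that shortcut, not a proposed patch: if the input is
valid UTF-16, every trailing surrogate belongs to a pair, so the code point
count is just the number of code units that are not trailing surrogates.
The names trailing-surrogate-p and count-code-points are made up for this
illustration, and the function takes a sequence of integer code units
rather than a CMUCL string so that it runs in any Common Lisp.  Whether it
is actually faster than the version above has not been measured.

```lisp
;; Sketch only: assumes UNITS is a valid UTF-16 sequence of integer
;; code units.  Both function names are hypothetical.
(defun trailing-surrogate-p (unit)
  "True if UNIT is in the trailing (low) surrogate range."
  (<= #xDC00 unit #xDFFF))

(defun count-code-points (units)
  "Return the number of code points encoded by the code units in UNITS.
Each surrogate pair contributes one leading and one trailing unit, so
skipping the trailing units counts every pair exactly once."
  (count-if-not #'trailing-surrogate-p units))
```

For a string whose characters are UTF-16 code units, it could be called as
(count-code-points (map 'vector #'char-code string)).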
...
What does read-sequence do if the input stream contains surrogate pairs?
Swank uses code like
(let* ((buffer (make-string length))
       (count (read-sequence buffer stream)))
  buffer)
where length is the number of code points as computed by Emacs.
If read-sequence also works on code units then we can't send surrogate
pairs from Emacs -> Lisp.
Oh, that's a problem.  In the example, length is 3, but the string
actually has 4 code units, so read-sequence only reads 3 code units,
completely missing the last code unit.
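
One possible way around the mismatch, sketched under assumptions: read
code point by code point instead of asking for a fixed number of code
units, and pull one extra unit whenever a leading surrogate shows up.
read-code-points, leading-surrogate-p and stream-unit-reader are
hypothetical names, not Swank or CMUCL functions, and the input is assumed
to be valid UTF-16.  The reader takes a function that yields integer code
units so the sketch does not depend on how an implementation represents
surrogates as characters.

```lisp
;; Sketch only: hypothetical helpers, assuming valid UTF-16 input.
(defun leading-surrogate-p (unit)
  "True if UNIT is in the leading (high) surrogate range."
  (<= #xD800 unit #xDBFF))

(defun read-code-points (n next-unit)
  "Return a list of the code units that make up the next N code points.
NEXT-UNIT is a function of no arguments that returns the next UTF-16
code unit as an integer, or NIL at end of input."
  (let ((units '()))
    (dotimes (i n)
      (declare (ignorable i))
      (let ((unit (funcall next-unit)))
        ;; Stop early if the input runs out.
        (when (null unit) (return))
        (push unit units)
        ;; A leading surrogate and the unit after it form one code point.
        (when (leading-surrogate-p unit)
          (let ((trail (funcall next-unit)))
            (when trail (push trail units))))))
    (nreverse units)))

(defun stream-unit-reader (stream)
  "Adapt a character STREAM whose characters are UTF-16 code units."
  (lambda ()
    (let ((ch (read-char stream nil nil)))
      (and ch (char-code ch)))))
```

With a length of 3 and the four code units a, leading surrogate, trailing
surrogate, b, this returns all four units, where a plain 3-unit
read-sequence would stop before the b.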

Ray