This has been happening for some time, and it's annoying enough that I want to fix it. With CMUCL 20b and slime 2010-09-20, try the following:
(defvar *s* (make-string 4))
*s*
(setf (lisp:codepoint *s* 0) #x10000)
Up to now, everything is ok. Now print the string:
*s*
At this point, the string is displayed, with a rectangular box for the codepoint #x10000 followed by two ^@ for the two null characters. (Recall that unicode strings in cmucl are utf-16 strings, so the first two elements of *s* are the surrogate pair for #x10000.)
I get an error from xemacs about itimer:
itimer "run-at-time<2>" signaled: (wrong-type-argument listp write-string) Wrong type argument: listp, :write-string
And *slime-events* buffer is kind of garbled:
(:emacs-rex (swank:listener-eval "(print *s*)\n") "COMMON-LISP-USER" :repl-thread 18) (:write-string "\n"𐀀
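For readers who want to check the surrogate-pair arithmetic mentioned above, here is a minimal sketch in portable Common Lisp. The helper name UTF16-SURROGATES is hypothetical (it is not part of CMUCL or SLIME); it only illustrates how a code point above #xFFFF maps to two UTF-16 code units.

```lisp
;; Sketch only: UTF16-SURROGATES is an illustrative name, not a CMUCL API.
(defun utf16-surrogates (code-point)
  "Return the high and low surrogate code units for CODE-POINT as two values."
  (assert (<= #x10000 code-point #x10FFFF))
  (let ((offset (- code-point #x10000)))
    (values (+ #xD800 (ash offset -10))          ; top 10 bits of the offset
            (+ #xDC00 (logand offset #x3FF)))))  ; low 10 bits of the offset

(multiple-value-list (utf16-surrogates #x10000))
;; => (55296 56320), i.e. #xD800 and #xDC00
```

So the first two elements of the 4-element string in the report would hold these two code units, and the remaining two stay null.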
* Raymond Toy [2010-10-01 00:11] writes:
This has been happening for some time, and it's annoying enough that I want to fix it. With CMUCL 20b and slime 2010-09-20, try the following:
(defvar *s* (make-string 4))
*s*
(setf (lisp:codepoint *s* 0) #x10000)
Up to now, everything is ok. Now print the string:
*s*
At this point, the string is displayed, with a rectangular box for the codepoint #x10000 followed by two ^@ for the two null characters. (Recall that unicode strings in cmucl are utf-16 strings, so the first two elements of *s* are the surrogate pair for #x10000.)
What is the length of *s* or (prin1-to-string *s*) now? Should it be 3 not 4?
Helmut
On 10/1/10 1:46 AM, Helmut Eller wrote:
- Raymond Toy [2010-10-01 00:11] writes:
This has been happening for some time, and it's annoying enough that I want to fix it. With CMUCL 20b and slime 2010-09-20, try the following:
(defvar *s* (make-string 4))
*s*
(setf (lisp:codepoint *s* 0) #x10000)
Up to now, everything is ok. Now print the string:
*s*
At this point, the string is displayed, with a rectangular box for the codepoint #x10000 followed by two ^@ for the two null characters. (Recall that unicode strings in cmucl are utf-16 strings, so the first two elements of *s* are the surrogate pair for #x10000.)
What is the length of *s* or (prin1-to-string *s*) now? Should it be 3 not 4?
Good question. The answer now is 4, not 3. There are 4 code units in the string, so that is the length. Length would be really slow if it had to scan the whole string looking for surrogate pairs and counting them as one instead of two.
Is that the reason for the problem? Confusion between emacs and lisp on the length of the string? It does appear that the string only has 3 characters, as displayed by emacs.
Doesn't acl have this problem too? It also uses 16-bit strings like cmucl.
Ray
* Raymond Toy [2010-10-01 10:20] writes:
What is the length of *s* or (prin1-to-string *s*) now? Should it be 3 not 4?
Good question. The answer now is 4, not 3. There are 4 code units in the string, so that is the length. Length would be really slow if it had to scan the whole string looking for surrogate pairs and counting them as one instead of two.
Is that the reason for the problem? Confusion between emacs and lisp on the length of the string? It does appear that the string only has 3 characters, as displayed by emacs.
Very likely. Emacs uses something like utf-8 internally and counts code points, not code units (except for line endings, which is probably a different issue).
Doesn't acl have this problem too? It also uses 16-bit strings like cmucl.
Allegro has no lisp:codepoint function and (code-char #x10000) returns nil. Similar situation in ABCL just that it returns #\null.
In Java, strings have a length method which returns code units and a codePointCount method for the other use. Maybe CMUCL has something like that and we should use it in SWANK.
Helmut
On 10/1/10 6:35 AM, Helmut Eller wrote:
- Raymond Toy [2010-10-01 10:20] writes:
What is the length of *s* or (prin1-to-string *s*) now? Should it be 3 not 4?
Good question. The answer now is 4, not 3. There are 4 code units in the string, so that is the length. Length would be really slow if it had to scan the whole string looking for surrogate pairs and counting them as one instead of two.
Is that the reason for the problem? Confusion between emacs and lisp on the length of the string? It does appear that the string only has 3 characters, as displayed by emacs.
Very likely. Emacs uses something like utf-8 internally and counts code points, not code units (except for line endings, which is probably a different issue).
Doesn't acl have this problem too? It also uses 16-bit strings like cmucl.
Allegro has no lisp:codepoint function and (code-char #x10000) returns nil.
The lisp:codepoint function was just a convenience for creating the necessary surrogate pair.
In Java, strings have a length method which returns code units and a codePointCount method for the other use. Maybe CMUCL has something like that and we should use it in SWANK.
CMUCL doesn't currently have a codePointCount function, but that's easy enough to add if slime wants it. Here's one:
(defun codepoint-count (string)
  "Return the number of code points in the string. The string MUST be
  a valid UTF-16 string."
  (do ((len (length string))
       (index 0 (1+ index))
       (count 0 (1+ count)))
      ((>= index len)
       count)
    (multiple-value-bind (codepoint wide)
        (lisp:codepoint string index)
      (declare (ignore codepoint))
      (when wide (incf index)))))
Ray
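The function above depends on CMUCL's lisp:codepoint. For trying the same idea in a Lisp without that function, here is a rough portable sketch that works on a vector of integer code units rather than on a string. COUNT-CODE-POINTS is a hypothetical name, and like the original it assumes the input is valid UTF-16.

```lisp
;; Sketch only: operates on a vector of UTF-16 code units (integers),
;; so it does not depend on how a given Lisp represents strings.
(defun count-code-points (units)
  "Count the code points in UNITS, a vector of valid UTF-16 code units."
  (let ((count 0)
        (index 0)
        (len (length units)))
    (loop while (< index len)
          do (let ((unit (aref units index)))
               ;; A high (leading) surrogate starts a two-unit code point.
               (incf index (if (<= #xD800 unit #xDBFF) 2 1))
               (incf count)))
    count))

(count-code-points #(#xD800 #xDC00 0 0)) ; => 3
```

The example vector mirrors the string from the original report: 4 code units, 3 code points.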
* Raymond Toy [2010-10-01 15:18] writes:
CMUCL doesn't currently have a codePointCount function, but that's easy enough to add if slime wants it. Here's one:
(defun codepoint-count (string)
  "Return the number of code points in the string. The string MUST be
  a valid UTF-16 string."
  (do ((len (length string))
       (index 0 (1+ index))
       (count 0 (1+ count)))
      ((>= index len)
       count)
    (multiple-value-bind (codepoint wide)
        (lisp:codepoint string index)
      (declare (ignore codepoint))
      (when wide (incf index)))))
I hope this is faster than it looks :-).
What does read-sequence do if the input stream contains surrogate pairs? Swank uses code like
(let* ((buffer (make-string length))
       (count (read-sequence buffer stream)))
  buffer)
where length is the number of code points as computed by Emacs. If read-sequence also works on code units then we can't send surrogate pairs from Emacs -> Lisp.
Helmut
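The mismatch described above can be simulated without a socket. The sketch below is an illustration only: READ-UNITS is a hypothetical stand-in for filling a LENGTH-long buffer with read-sequence, and the payload is modeled as a vector of UTF-16 code units.

```lisp
;; Sketch only: models READ-SEQUENCE into a buffer of COUNT elements
;; when the receiver counts code units but the sender counted code points.
(defun read-units (units count)
  "Return the first COUNT code units of UNITS and the number left unread."
  (let ((buffer (subseq units 0 (min count (length units)))))
    (values buffer (- (length units) (length buffer)))))

;; Code point #x10000 followed by Q and R: 3 code points, 4 code units.
;; Emacs would announce a length of 3.
(multiple-value-list (read-units #(#xD800 #xDC00 81 82) 3))
;; => (#(55296 56320 81) 1)
```

The one unread code unit stays in the stream, where it would be misread as the start of the next message.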
On 10/1/10 11:45 AM, Helmut Eller wrote:
- Raymond Toy [2010-10-01 15:18] writes:
CMUCL doesn't currently have a codePointCount function, but that's easy enough to add if slime wants it. Here's one:
(defun codepoint-count (string)
  "Return the number of code points in the string. The string MUST be
  a valid UTF-16 string."
  (do ((len (length string))
       (index 0 (1+ index))
       (count 0 (1+ count)))
      ((>= index len)
       count)
    (multiple-value-bind (codepoint wide)
        (lisp:codepoint string index)
      (declare (ignore codepoint))
      (when wide (incf index)))))
I hope this is faster than it looks :-).
I doubt it. :-) But if we assume the string is a valid UTF-16 string, then this code could be made faster since lisp:codepoint does some extra checks since the index could be pointing at the trailing surrogate. We proceed in sequence, so we're never at the trailing surrogate. (Actually, this code is probably buggy if the string is not a valid utf-16 string.)
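One possible way to skip the per-index checks, under the same assumption of valid UTF-16, is to count only the code units that are not trailing surrogates. This is an alternative technique, not what lisp:codepoint does, and COUNT-CODE-POINTS-FAST is a hypothetical name; the sketch works on a vector of integer code units.

```lisp
;; Sketch only: in valid UTF-16 every code point contributes exactly one
;; code unit that is NOT a low (trailing) surrogate, so counting those
;; units gives the code point count in a single pass.
(defun count-code-points-fast (units)
  "Count the code points in UNITS, assuming valid UTF-16 code units."
  (count-if (lambda (unit) (not (<= #xDC00 unit #xDFFF))) units))

(count-code-points-fast #(#xD800 #xDC00 81 82 83)) ; => 4
```

On malformed input (for example a lone trailing surrogate) this undercounts, so it shares the validity caveat noted above.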
What does read-sequence do if the input stream contains surrogate pairs? Swank uses code like
(let* ((buffer (make-string length))
       (count (read-sequence buffer stream)))
  buffer)
where length is the number of code points as computed by Emacs. If read-sequence also works on code units then we can't send surrogate pairs from Emacs -> Lisp.
Oh, that's a problem. In the example, length is 3, but the string actually has 4 code units, so read-sequence only reads 3 code units, completely missing the last code unit.
Ray
* Raymond Toy [2010-10-01 19:49] writes:
Oh, that's a problem. In the example, length is 3, but the string actually has 4 code units, so read-sequence only reads 3 code units, completely missing the last code unit.
I think we have the following options:
1) Don't support code points beyond 16 bits. Clean and easy.
2) Introduce variants of length and read-sequence that use the same notion of character as Emacs. Kinda messy and probably slow, but relatively easy.
3) Switch from character streams to binary streams so that we can use byte counts instead of character counts. This has several advantages:
   - surrogate pairs are no problem
   - don't need flexi-streams for Lispworks
   - it would be easier to switch encoding after connecting
   - read/write-sequence is probably faster on byte streams
   disadvantages:
   - more consing, and Emacs's GC isn't that good
   - need a string-to/from-bytearray function for every backend
   - breaks third party backends
Helmut
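To illustrate why option 3 sidesteps the surrogate question, here is a rough sketch of byte-count framing. It is not SLIME code: UTF8-OCTETS and BYTE-COUNT-HEADER are hypothetical helpers, the input is a list of code points rather than a string, and the 6-digit hex header simply reuses the "~6,'0x" format that write-message uses today.

```lisp
;; Sketch only: a hand-written UTF-8 encoder so the example does not
;; depend on any implementation-specific string-to-octets function.
(defun utf8-octets (code-points)
  "Encode the list CODE-POINTS as a list of UTF-8 octets."
  (loop for cp in code-points
        append (cond ((< cp #x80) (list cp))
                     ((< cp #x800)
                      (list (logior #xC0 (ash cp -6))
                            (logior #x80 (logand cp #x3F))))
                     ((< cp #x10000)
                      (list (logior #xE0 (ash cp -12))
                            (logior #x80 (logand (ash cp -6) #x3F))
                            (logior #x80 (logand cp #x3F))))
                     (t
                      (list (logior #xF0 (ash cp -18))
                            (logior #x80 (logand (ash cp -12) #x3F))
                            (logior #x80 (logand (ash cp -6) #x3F))
                            (logior #x80 (logand cp #x3F)))))))

(defun byte-count-header (code-points)
  "Return a 6-digit hex header giving the UTF-8 byte length of CODE-POINTS."
  (format nil "~6,'0x" (length (utf8-octets code-points))))

(utf8-octets '(#x10000))                 ; => (240 144 128 128)
(byte-count-header '(#x10000 81 82 83))  ; => "000007"
```

Both sides count the same 7 octets regardless of whether their strings use code units or code points internally.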
On Sat, 02 Oct 2010 09:45:30 +0200 Helmut Eller heller@common-lisp.net wrote:
Honestly I don't know which of your three solutions would be best, but only a question about solution 1:
- Don't support code points beyond 16 bits. Clean and easy.
What would happen when a CL implementation internally using UTF-32 and supporting 24-bit code-points would issue a valid 24-bit character? Would this only affect REPL input, or display?
Thanks,
* Matthew Mondor [2010-10-02 09:06] writes:
- Don't support code points beyond 16 bits. Clean and easy.
What would happen when a CL implementation internally using UTF-32 and supporting 24-bit code-points would issue a valid 24-bit character? Would this only affect REPL input, or display?
It would work as it does now. As long as Emacs and Lisp agree on the number of characters there's no problem.
Helmut
On 10/2/10 3:45 AM, Helmut Eller wrote:
- Raymond Toy [2010-10-01 19:49] writes:
Oh, that's a problem. In the example, length is 3, but the string actually has 4 code units, so read-sequence only reads 3 code units, completely missing the last code unit.
I think we have the following options:
- Don't support code points beyond 16 bits. Clean and easy.
Yes. I only ever use codepoints outside the BMP when testing unicode. But it is annoying that slime breaks.
- Introduce variants of length and read-sequence that use the same notion of character as Emacs. Kinda messy and probably slow, but relatively easy.
I don't know slime internals, but wouldn't you only need a special version of length and read-sequence for cmucl with unicode? The normal length/read-sequence would be fine for everyone else.
- Switch from character streams to binary streams so that we can use byte counts instead of character counts. This has several advantages:
- surrogate pairs are no problem
- don't need flexi-streams for Lispworks
Why does Lispworks need flexi-streams? Does this have to do with using read-byte on character streams or read-char on binary streams?
- it would be easier to switch encoding after connecting
- read/write-sequence is probably faster on byte streams
disadvantages:
- more consing, and Emacs's GC isn't that good
- need a string-to/from-bytearray function for every backend
Doesn't every backend already have such a function? Of course, someone has to hook that up, but at least it doesn't have to be written from scratch.
- breaks third party backends
Sounds like a show stopper to me.
Ray
* Raymond Toy [2010-10-02 16:30] writes:
- Introduce variants of length and read-sequence that use the same notion of character as Emacs. Kinda messy and probably slow, but relatively easy.
I don't know slime internals, but wouldn't you only need a special version of length and read-sequence for cmucl with unicode? The normal length/read-sequence would be fine for everyone else.
Well, I wrote the string with prin1 to a file with CMUCL and read it back with
(with-open-file (f "/tmp/x.utf8" :external-format :utf-8)
  (length (read f)))
in other implementations. The results are:
CMUCL, Allegro: 4
SBCL, CCL, CLISP: 3
ABCL: 6
Lispworks didn't want to read the file and ECL didn't accept utf-8 (probably pilot error).
ABCL probably ignores the external format argument, but if they use a Java-like representation then it would be 4.
Anyway CMUCL and Allegro have the same problem.
- Switch from character streams to binary streams so that we can use byte counts instead of character counts. This has several advantages:
- surrogate pairs are no problem
- don't need flexi-streams for Lispworks
Why does Lispworks need flexi-streams?
Because Lispworks' socket interface has no (documented) option to specify an encoding. You can write your own stream classes and flexi-streams does that (presumably on top of binary streams).
Does this have to do with using read-byte on character streams or read-char on binary streams?
No. At least SLIME doesn't need that.
- it would be easier to switch encoding after connecting
- read/write-sequence is probably faster on byte streams
disadvantages:
- more consing, and Emacs's GC isn't that good
- need a string-to/from-bytearray function for every backend
Doesn't every backend already have such a function? Of course, someone has to hook that up, but at least it doesn't have to be written from scratch.
Yes, most Lisps will have something like that but we still need to figure out how it's called for each one.
- breaks third party backends
Sounds like a show stopper to me.
That's the least of my worries.
Helmut
On 10/2/10 3:45 AM, Helmut Eller wrote:
- Raymond Toy [2010-10-01 19:49] writes:
Oh, that's a problem. In the example, length is 3, but the string actually has 4 code units, so read-sequence only reads 3 code units, completely missing the last code unit.
I think we have the following options:
Do you have a preference for any of the options (besides option 1)? I'd like to make this work, because it's really annoying when slime crashes. I usually remember not to do these things, but when an error is thrown and slime brings up the debugger and displays the string in the backtrace, slime crashes, just when I really needed to know what happened.
Ray
* Raymond Toy [2010-10-06 01:07] writes:
On 10/2/10 3:45 AM, Helmut Eller wrote:
- Raymond Toy [2010-10-01 19:49] writes:
Oh, that's a problem. In the example, length is 3, but the string actually has 4 code units, so read-sequence only reads 3 code units, completely missing the last code unit.
I think we have the following options:
Do you have a preference for any of the options (besides option 1)? I'd like to make this work, because it's really annoying when slime crashes. I usually remember not to do these things, but when an error is thrown and slime brings up the debugger and displays the string in the backtrace, slime crashes, just when I really needed to know what happened.
[Ideally the different Lisp implementations should have the same notion of "character". That CMUCL thinks of characters as Unicode code units while SBCL uses code points is IMO an unfortunate development. In Scheme (R6RS) they say that a Scheme character should correspond to one Unicode scalar value, which seems to be the ranges [0, #xD7FF] and [#xE000, #x10FFFF]. Java and .NET use code units. It would not be the worst idea to adopt one standard; the earlier we do that the less it costs.]
For now option 2) is probably the simplest.
In the long run, byte streams would be more flexible. In theory we could use something like HTTP chunking, if it's worth the complexity.
Helmut
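For the reading side of option 2, the variant of read-sequence would need to consume a given number of code points rather than code units. The sketch below shows the counting logic only, on a vector of integer code units; READ-CODE-POINTS is a hypothetical name and the input is assumed to be valid UTF-16.

```lisp
;; Sketch only: advance over COUNT code points, taking two code units
;; whenever a high (leading) surrogate is seen.
(defun read-code-points (units start count)
  "Return the index just past COUNT code points of UNITS beginning at START."
  (let ((index start)
        (len (length units)))
    (dotimes (i count index)
      (when (>= index len)
        (error "Input ended after ~D of ~D code points." i count))
      (incf index (if (<= #xD800 (aref units index) #xDBFF) 2 1)))))

;; 3 code points announced by Emacs, 4 code units actually consumed.
(read-code-points #(#xD800 #xDC00 81 82) 0 3) ; => 4
```

A real implementation would read from the stream instead of indexing a vector, and would have to decide what to do with unpaired surrogates.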
On 10/6/10 3:34 AM, Helmut Eller wrote:
- Raymond Toy [2010-10-06 01:07] writes:
On 10/2/10 3:45 AM, Helmut Eller wrote:
- Raymond Toy [2010-10-01 19:49] writes:
Oh, that's a problem. In the example, length is 3, but the string actually has 4 code units, so read-sequence only reads 3 code units, completely missing the last code unit.
I think we have the following options:
Do you have a preference for any of the options (besides option 1)? I'd like to make this work, because it's really annoying when slime crashes. I usually remember not to do these things, but when an error is thrown and slime brings up the debugger and displays the string in the backtrace, slime crashes, just when I really needed to know what happened.
[Ideally the different Lisp implementations should have the same notion of "character". That CMUCL thinks of characters as Unicode code units while SBCL uses code points is IMO an unfortunate development. In Scheme (R6RS) they say that a Scheme character should correspond to one Unicode scalar value, which seems to be the ranges [0, #xD7FF] and [#xE000, #x10FFFF]. Java and .NET use code units. It would not be the worst idea to adopt one standard; the earlier we do that the less it costs.]
It was a tradeoff between space usage (16-bit strings vs 32-bit strings), compiler complexity (managing 8-bit and 32-bit strings) and user complexity (base-strings vs strings).
For now option 2) is probably the simplest.
Ok. Can you give some hints on where to start looking at this?
In the long run, byte streams would be more flexible. In theory we could use something like HTTP chunking, if it's worth the complexity.
If you ever start working on this approach, let me know and I'll try to help out.
Ray
On 10/6/10 11:51 AM, Helmut Eller wrote:
- Raymond Toy [2010-10-06 12:27] writes:
For now option 2) is probably the simplest.
Ok. Can you give some hints on where to start looking at this?
read-message and write-message in swank-rpc.lisp.
The following change seems to work. I can now create a string with characters outside the bmp and slime doesn't die.
CL-USER> (map 'string #'code-char '(55296 56320 81 82 83))
"𐀀QRS"
I don't know what the "proper" way to integrate this would be, though. I'll need help with that.
Ray
#+cmu
(defun codepoint-length (string)
  "Return the number of code points in the string. The string MUST be
  a valid UTF-16 string."
  (do ((len (length string))
       (index 0 (1+ index))
       (count 0 (1+ count)))
      ((>= index len)
       count)
    (multiple-value-bind (codepoint wide)
        (lisp:codepoint string index)
      (declare (ignore codepoint))
      (when wide (incf index)))))

#-cmu
(defun codepoint-length (string)
  (length string))

(defun write-message (message package stream)
  (let* ((string (prin1-to-string-for-emacs message package))
         (length (codepoint-length string)))
    (let ((*print-pretty* nil))
      (format stream "~6,'0x" length))
    (write-string string stream)
    (finish-output stream)))
On 10/6/10 6:59 PM, Raymond Toy wrote:
On 10/6/10 11:51 AM, Helmut Eller wrote:
- Raymond Toy [2010-10-06 12:27] writes:
For now option 2) is probably the simplest.
Ok. Can you give some hints on where to start looking at this?
read-message and write-message in swank-rpc.lisp.
The following change seems to work. I can now create a string with characters outside the bmp and slime doesn't die.
CL-USER> (map 'string #'code-char '(55296 56320 81 82 83))
"𐀀QRS"
I don't know what the "proper" way to integrate this would be, though. I'll need help with that.
Here is an updated change. It uses definterface/defimplementation to implement this.
Is this the right way to do this? I tested this with cmucl and ccl, and slime does the right thing with cmucl without breaking ccl.
If this is not the right way, please let me know. Otherwise, I'll check this in soon.
Ray
Index: ChangeLog
===================================================================
RCS file: /project/slime/cvsroot/slime/ChangeLog,v
retrieving revision 1.2142
diff -u -r1.2142 ChangeLog
--- ChangeLog 20 Sep 2010 16:09:13 -0000 1.2142
+++ ChangeLog 9 Oct 2010 16:54:52 -0000
@@ -1,3 +1,16 @@
+2010-10-09 Raymond Toy toy.raymond@gmail.com
+
+ * swank-cmucl.lisp (codepoint-length): Implement codepoint-length
+ to return the number of codepoints in cmucl's utf-16 strings.
+
+ * swank-backend.lisp (:swank-backend): Export codepoint-length.
+ (codepoint-length): definterface codepoint-length. Default is to
+ use LENGTH.
+
+ * swank-rpc.lisp (write-message): Call
+ swank-backend:codepoint-length to get the correct length for
+ emacs.
+
 2010-09-20 Stas Boukarev stassats@gmail.com
 
 * swank-cmucl.lisp (character-completion-set): Implement. Requires
Index: swank-rpc.lisp
===================================================================
RCS file: /project/slime/cvsroot/slime/swank-rpc.lisp,v
retrieving revision 1.6
diff -u -r1.6 swank-rpc.lisp
--- swank-rpc.lisp 14 Apr 2010 17:51:30 -0000 1.6
+++ swank-rpc.lisp 9 Oct 2010 16:48:13 -0000
@@ -92,7 +92,7 @@
 
 (defun write-message (message package stream)
   (let* ((string (prin1-to-string-for-emacs message package))
-         (length (length string)))
+         (length (swank-backend:codepoint-length string)))
     (let ((*print-pretty* nil))
       (format stream "~6,'0x" length))
     (write-string string stream)
Index: swank-backend.lisp
===================================================================
RCS file: /project/slime/cvsroot/slime/swank-backend.lisp,v
retrieving revision 1.201
diff -u -r1.201 swank-backend.lisp
--- swank-backend.lisp 18 Sep 2010 09:34:05 -0000 1.201
+++ swank-backend.lisp 9 Oct 2010 16:49:04 -0000
@@ -43,7 +43,8 @@
            #:emacs-inspect
            #:label-value-line
            #:label-value-line*
-           #:with-symbol))
+           #:with-symbol)
+  (:export #:codepoint-length))
 
 (defpackage :swank-mop
   (:use)
@@ -1317,3 +1318,12 @@
   "Request saving a heap image to the file FILENAME.
 RESTART-FUNCTION, if non-nil, should be called when the image is loaded.
 COMPLETION-FUNCTION, if non-nil, should be called after saving the image.")
+
+;;; Codepoint length
+
+(definterface codepoint-length (string)
+  "Return the number of codepoints in the string. With some Lisps
+  like cmucl, LENGTH returns the number of UTF-16 code units, but
+  other Lisps return the number of codeponts. The slime protocol
+  wants string lengths in terms of codepoints."
+  (length string))
Index: swank-cmucl.lisp
===================================================================
RCS file: /project/slime/cvsroot/slime/swank-cmucl.lisp,v
retrieving revision 1.231
diff -u -r1.231 swank-cmucl.lisp
--- swank-cmucl.lisp 20 Sep 2010 16:09:13 -0000 1.231
+++ swank-cmucl.lisp 9 Oct 2010 16:50:49 -0000
@@ -2576,3 +2576,16 @@
     (loop for n in names
           when (funcall matchp prefix n)
           collect n)))
+
+(defimplementation codepoint-length (string)
+  "Return the number of code points in the string. The string MUST be
+  a valid UTF-16 string."
+  (do ((len (length string))
+       (index 0 (1+ index))
+       (count 0 (1+ count)))
+      ((>= index len)
+       count)
+    (multiple-value-bind (codepoint wide)
+        (lisp:codepoint string index)
+      (declare (ignore codepoint))
+      (when wide (incf index)))))
* Raymond Toy [2010-10-09 16:59] writes:
If this is not the right way, please let me know. Otherwise, I'll check this in soon.
definterface automatically exports the symbol so it's not necessary to mention it in the defpackage form. In doc-strings, the first line should be a single sentence (as in Emacs, so that apropos can show a readable summary). And there's a typo "codeponts". Rest is fine.
Helmut
On 10/9/10 1:08 PM, Helmut Eller wrote:
- Raymond Toy [2010-10-09 16:59] writes:
If this is not the right way, please let me know. Otherwise, I'll check this in soon.
definterface automatically exports the symbol so it's not necessary to mention it in the defpackage form. In doc-strings, the first line should be a single sentence (as in Emacs, so that apropos can show a readable summary). And there's a typo "codeponts". Rest is fine.
Thanks for the tips. I'll fix these up, run a few tests, and check it in if all goes well.
Thanks for helping me solve this problem with cmucl and slime.
Ray
On 10/9/10 1:08 PM, Helmut Eller wrote:
- Raymond Toy [2010-10-09 16:59] writes:
If this is not the right way, please let me know. Otherwise, I'll check this in soon.
definterface automatically exports the symbol so it's not necessary to mention it in the defpackage form. In doc-strings, the first line should be a single sentence (as in Emacs, so that apropos can show a readable summary). And there's a typo "codeponts". Rest is fine.
I fixed these and tried to do a commit. CVS says I can't because '"commit" requires write access to the repository'
My power is gone! Well, I guess I never really had it. :-)
Ray
* Raymond Toy [2010-10-09 18:45] writes:
I fixed these and tried to do a commit. CVS says I can't because '"commit" requires write access to the repository'
My power is gone! Well, I guess I never really had it. :-)
Are you sure that your CVS/Root points to :ext:rtoy@common-lisp.net:/project/slime/cvsroot and not the public repo?
Helmut
On 10/9/10 3:04 PM, Helmut Eller wrote:
- Raymond Toy [2010-10-09 18:45] writes:
I fixed these and tried to do a commit. CVS says I can't because '"commit" requires write access to the repository'
My power is gone! Well, I guess I never really had it. :-)
Are you sure that your CVS/Root points to :ext:rtoy@common-lisp.net:/project/slime/cvsroot and not the public repo?
Oops. That was it. Changes committed.
Thanks,
Ray