I just committed changes to the wire format. The new format is defined in terms of bytes rather than characters.
Counting characters was problematic, especially with Lisps that use UTF-16 internally (Allegro, CMUCL, JVM-based Lisps). Emacs counts the length of strings in Unicode code points, while in UTF-16 a single code point may occupy either one or two indexes (code units), so CL:LENGTH may return something different from what Emacs expects. For the same reason we can't use READ-SEQUENCE to read a specified number of code points.
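To illustrate the mismatch (a minimal example, not taken from the actual code):

  (length "𝕊")  ; the single code point U+1D54A, which lies outside the BMP
  ;; => 1 on Lisps whose strings hold code points (e.g. SBCL)
  ;; => 2 on Lisps whose strings hold UTF-16 code units,
  ;;    because the character is stored as a surrogate pair

Emacs counts 1 code point here, so a length computed with CL:LENGTH on a UTF-16 Lisp would be off by one.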
The new format looks like this:
| byte0 | 3-byte length | ... payload ... |
The 3-byte length header specifies the length of the payload in bytes. The payload is an s-exp encoded as UTF-8 text. byte0 is currently always 0; other values are reserved for future use. Robert Brown said he'd like to use compression, so byte0 might come in handy.
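For illustration only, writing and reading such a frame might look like the sketch below. The helper names STRING-TO-UTF8 and UTF8-TO-STRING are made up, and the byte order shown (high byte first) is an assumption, not necessarily what the committed code does:

  (defun write-frame (sexp stream)               ; STREAM: an (unsigned-byte 8) stream
    (let* ((payload (string-to-utf8 (prin1-to-string sexp)))
           (len (length payload)))
      (write-byte 0 stream)                      ; byte0: currently always 0
      (write-byte (ldb (byte 8 16) len) stream)  ; 3-byte length, high byte first
      (write-byte (ldb (byte 8 8) len) stream)
      (write-byte (ldb (byte 8 0) len) stream)
      (write-sequence payload stream)
      (finish-output stream)))

  (defun read-frame (stream)
    (assert (zerop (read-byte stream)))          ; other byte0 values reserved
    (let* ((len (logior (ash (read-byte stream) 16)
                        (ash (read-byte stream) 8)
                        (read-byte stream)))
           (payload (make-array len :element-type '(unsigned-byte 8))))
      (read-sequence payload stream)
      (read-from-string (utf8-to-string payload))))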
The change breaks backward compatibility. When upgrading, make sure that both Lisp and Emacs use the new format.
I did some light testing with most of the backends and provided a portable version of the UTF-8 encoding/decoding. I didn't test SCL and CormanCL.
Third party backends, for Clojure etc., are obviously broken now. So if you need those, wait until somebody fixes them and the dust has settled.
Helmut
On Sun, 06 Nov 2011 18:13:07 +0100 Helmut Eller heller@common-lisp.net wrote:
The new format looks like this:
| byte0 | 3-byte length | ... payload ... |
The 3-byte length header specifies the length of the payload in bytes. The payload is an s-exp encoded as UTF-8 text. byte0 is currently always 0; other values are reserved for future use. Robert Brown said he'd like to use compression, so byte0 might come in handy.
I think that's a nice idea.
The change breaks backward compatibility. When upgrading, make sure that both Lisp and Emacs use the new format.
Did the previous protocol also have such a 0x00 byte prefix? If so, it would be possible to use 0x01 this time and provide more backwards compatibility...
If not, I could perhaps suggest that the byte in the future be used to hold a protocol version as well as flags (such as, as you mentioned, one for gzip compression, etc.).
* Matthew Mondor [2011-11-06 18:43] writes:
Did the previous protocol also have such a 0x00 byte prefix? If so, it would be possible to use 0x01 this time and provide more backwards compatibility...
No. We used something like (format nil "~6,'0x" <length>) for the header, so most of the time the first byte was (char-code #\0), which is 48.
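For example:

  (format nil "~6,'0x" 100)           ; => "000064"  (six hex digits, zero-padded)
  (parse-integer "000064" :radix 16)  ; => 100       (decoding on the other side)

So the old header was six ASCII hex characters, and any length below 16^5 (about 1 MB) started with the digit 0.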
If not, I could perhaps suggest that the byte in the future be used to hold a protocol version as well as flags (such as, as you mentioned, one for gzip compression, etc.).
I think one bit should be a "streaming bit", meaning that the payload should be concatenated with that of the next frame, until a frame with the streaming bit cleared comes along. That would allow messages of arbitrary length and also frames of fixed size (for easier buffering).
3 or 4 bits should probably be a "type code" for the payload. Type 0 would be "an s-exp in UTF-8". Type 1 probably "just plain bytes to be interpreted by a higher level". I'm not even sure that we need anything else :-). Anyway, that would leave 6 or 14 other type codes for future use and 3-4 unused bits.
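To make that concrete, a sketch of one possible byte0 layout (the bit positions here are hypothetical, not something anyone has implemented):

  ;; Hypothetical layout of byte0:
  ;;   bit 7    : streaming bit (1 = payload continues in the next frame)
  ;;   bits 4-6 : unused, reserved
  ;;   bits 0-3 : type code (0 = s-exp in UTF-8, 1 = plain bytes)
  (defun streaming-p (byte0)
    (logbitp 7 byte0))
  (defun payload-type (byte0)
    (ldb (byte 4 0) byte0))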
Helmut
On Sun, 06 Nov 2011 12:13:07 -0500, Helmut Eller heller@common-lisp.net wrote:
Counting characters was problematic, especially with Lisps that use UTF-16 internally (Allegro, CMUCL, JVM-based Lisps). Emacs counts the length of strings in Unicode code points, while in UTF-16 a single code point may occupy either one or two indexes (code units), so CL:LENGTH may return something different from what Emacs expects. For the same reason we can't use READ-SEQUENCE to read a specified number of code points.
The new format looks like this:
| byte0 | 3-byte length | ... payload ... |
The 3-byte length header specifies the length of the payload in bytes.
Is there a reason to start using a binary encoding of the message length? This makes the messages less easy to inspect, and less easy to write integration tests for.
The payload is an s-exp encoded as UTF-8 text.
Normalising on UTF-8 and counting bytes sounds like it would solve the original issue without changing to a binary encoding of the message length.
On Sun, 06 Nov 2011 23:04:44 -0500 "Hugo Duncan" hugo@hugoduncan.org wrote:
Is there a reason to start using a binary encoding of the message length? This makes the messages less easy to inspect, and less easy to write integration tests for.
The payload is an s-exp encoded as UTF-8 text.
Normalising on UTF-8 and counting bytes sounds like it would solve the original issue without changing to a binary encoding of the message length.
My assumption was that it was meant for performance, as large read syscalls can be used when the message length is known in advance (i.e. READ-SEQUENCE on a byte stream), but I'm not sure.
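E.g. once the byte count is known, the whole payload can be fetched in one call (a sketch; PAYLOAD-LENGTH and BYTE-STREAM are assumed bindings):

  (let ((buffer (make-array payload-length :element-type '(unsigned-byte 8))))
    (read-sequence buffer byte-stream)  ; one large read instead of a char-by-char loop
    buffer)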
* Hugo Duncan [2011-11-07 04:04] writes:
On Sun, 06 Nov 2011 12:13:07 -0500, Helmut Eller heller@common-lisp.net wrote:
Counting characters was problematic, especially with Lisps that use UTF-16 internally (Allegro, CMUCL, JVM-based Lisps). Emacs counts the length of strings in Unicode code points, while in UTF-16 a single code point may occupy either one or two indexes (code units), so CL:LENGTH may return something different from what Emacs expects. For the same reason we can't use READ-SEQUENCE to read a specified number of code points.
The new format looks like this:
| byte0 | 3-byte length | ... payload ... |
The 3-byte length header specifies the length of the payload in bytes.
Is there a reason to start using a binary encoding of the message length?
No deep reason. We actually used binary encoding before we used hex strings. That worked fine with Latin-1 but not with UTF-8. I guess it's just instinct; now that we explicitly work on a byte stream it's even more natural. Should probably have used network byte order.
This makes the messages less easy to inspect, and less easy to write integration tests for.
Only marginally. Shifting 3 bytes together is not exactly rocket science.
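For reference, the shifting in question amounts to one form (assuming the high byte comes first):

  (logior (ash b1 16) (ash b2 8) b3)  ; b1, b2, b3: the three length bytes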
The payload is an s-exp encoded as UTF-8 text.
Normalising on UTF-8 and counting bytes sounds like it would solve the original issue without changing to a binary encoding of the message length.
Right. It would not be backward compatible, though.
Helmut
On Mon, 07 Nov 2011 03:11:50 -0500, Helmut Eller heller@common-lisp.net wrote:
* Hugo Duncan [2011-11-07 04:04] writes:
Is there a reason to start using a binary encoding of the message length?
No deep reason. We actually used binary encoding before we used hex strings. That worked fine with Latin-1 but not with UTF-8. I guess it's just instinct; now that we explicitly work on a byte stream it's even more natural. Should probably have used network byte order.
This makes the messages less easy to inspect, and less easy to write integration tests for.
Only marginally. Shifting 3 bytes together is not exactly rocket science.
It isn't rocket science, but it does remove the possibility of visual verification, and of being able to send messages via scripting or a simple console. HTTP, SIP, SMTP, and STOMP are all good examples of protocols with text headers, and I think appropriate ones, if one looks at swank as a sort of control protocol.
What is the gain from changing to a binary header? As far as I can see it is just saving a few bytes.
The payload is an s-exp encoded as UTF-8 text.
Normalising on UTF-8 and counting bytes sounds like it would solve the original issue without changing to a binary encoding of the message length.
Right. It would not be backward compatible, though.
It seems to be worth solving encoding issues, so something has to give.
Given this is a breaking change, I also see the desire to introduce an extension mechanism at the same time. I would argue a text-based header/value extension would be more appropriate.
At the end of the day, I realise the balance between the respective merits of binary and text headers is somewhat subjective.
Hugo
* Hugo Duncan [2011-11-07 13:56] writes:
This makes the messages less easy to inspect, and less easy to write integration tests for.
Only marginally. Shifting 3 bytes together is not exactly rocket science.
It isn't rocket science, but it does remove the possibility of visual verification, and of being able to send messages via scripting or a simple console. HTTP, SIP, SMTP, and STOMP are all good examples of protocols with text headers, and I think appropriate ones, if one looks at swank as a sort of control protocol.
If HTTP is so great then why are WebSockets specified as a binary protocol?
What is the gain from changing to a binary header? As far as I can see it is just saving a few bytes.
Aesthetics?
Normalising on UTF-8 and counting bytes sounds like it would solve the original issue without changing to a binary encoding of the message length.
Right. It would not be backward compatible, though.
It seems to be worth solving encoding issues, so something has to give.
Given this is a breaking change, I also see the desire to introduce an extension mechanism at the same time. I would argue a text-based header/value extension would be more appropriate.
A fixed-size header is a clear efficiency win. It's also easier to implement.
At the end of the day, I realise the balance between the respective merits of binary and text headers is somewhat subjective.
Yes, it's also rather academic as there are only 2 or 3 people that need to debug this.
Helmut
On Mon, 07 Nov 2011 11:40:39 -0500, Helmut Eller heller@common-lisp.net wrote:
* Hugo Duncan [2011-11-07 13:56] writes:
This makes the messages less easy to inspect, and less easy to write integration tests for.
Only marginally. Shifting 3 bytes together is not exactly rocket science.
It isn't rocket science, but it does remove the possibility of visual verification, and of being able to send messages via scripting or a simple console. HTTP, SIP, SMTP, and STOMP are all good examples of protocols with text headers, and I think appropriate ones, if one looks at swank as a sort of control protocol.
If HTTP is so great then why are WebSockets specified as a binary protocol?
The WebSocket protocol has text headers: http://tools.ietf.org/html/draft-ietf-hybi-thewebsocketprotocol-17
Given this is a breaking change, I also see the desire to introduce an extension mechanism at the same time. I would argue a text-based header/value extension would be more appropriate.
A fixed-size header is a clear efficiency win. It's also easier to implement.
The efficiency difference is negligible, imho. The implementation isn't any easier in Clojure, at least.
At the end of the day, I realise the balance between the respective merits of binary and text headers is somewhat subjective.
Yes, it's also rather academic as there are only 2 or 3 people that need to debug this.
Being one of these people, it affects me [1]. As you say, it is not rocket science, and I have a working version with the new protocol. For what it is worth, I still think it is the wrong direction.
* Hugo Duncan [2011-11-07 17:16] writes:
If HTTP is so great then why are WebSockets specified as a binary protocol?
The WebSocket protocol has text headers: http://tools.ietf.org/html/draft-ietf-hybi-thewebsocketprotocol-17
Have you actually read that document? After switching from HTTP to WebSocket, the header format becomes binary.
Given this is a breaking change, I also see the desire to introduce an extension mechanism at the same time. I would argue a text-based header/value extension would be more appropriate.
A fixed-size header is a clear efficiency win. It's also easier to implement.
The efficiency difference is negligible, imho. The implementation isn't any easier in Clojure, at least.
SLIME never had variable-sized headers, so I'm not sure what you are comparing.
At the end of the day, I realise the balance between the respective merits of binary and text headers is somewhat subjective.
Yes, it's also rather academic as there are only 2 or 3 people that need to debug this.
Being one of these people, it affects me [1]. As you say, it is not rocket science, and I have a working version with the new protocol. For what it is worth, I still think it is the wrong direction.
It would be interesting to know which took more time: implementing the change or this discussion.
Helmut
On Mon, 07 Nov 2011 14:25:35 -0500, Helmut Eller heller@common-lisp.net wrote:
It would be interesting to know which took more time: implementing the change or this discussion.
For the record, updating the tests took the longest, but that hardly seems relevant. My apologies if you have seen this discussion as wasting your time.
My 2c
I do not claim any of this to be scientific opinion or backed by any data.
I do not yet have any Lisp stuff in production (as in web servers). The day I do, I'd like to be confident that I can troubleshoot SLIME connectivity issues if there are any (I don't yet know what that setup will look like). I use SLIME, but in dev I don't care much about SLIME connectivity issues.
To this day I still find the ability to do a simple telnet to exercise HTTP very useful. It takes away all of the client-side issues. This is point #1.
If all of SLIME is turned binary and that serves the purpose of performance, then that's OK. Making only the header binary for performance seems dubious. This is point #2.
My opinion (again) is that:
a. HTTP, HTML, JavaScript, and CSS are all crappy (in the sense that they did not take into account the knowledge accumulated by the time of their design, or lack thereof);
b. they are all enormously successful (for some measure of success, for whatever related and/or disjoint reasons);
c. they are (or were) all purely text based.
I think #c has a lot to do with #b (don't ask for proof, because I have none), even though it's not the only reason. This is point #3.
An unrelated point: I saw somewhere a mention of fixed versus variable length as if that had anything to do with text versus binary. Those are orthogonal. The real point may be that 4 bytes of binary can represent about 4 billion values, whereas a fixed-length decimal header would need 10 bytes of UTF-8 text for the same range.
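A quick check of that arithmetic (assuming a 4-byte binary length field):

  (expt 2 32)                                  ; => 4294967296, about 4 billion
  (length (princ-to-string (1- (expt 2 32))))  ; => 10 decimal digits needed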
I only lurk on the SLIME list. I don't contribute to the code or the discussion (due to priorities, laziness, and incompetence). I am writing this because I am afraid this is the wrong decision.
As I said, all of the above is unscientific, but I hope it is considered.
Thanks for reading -Antony
* Antony Sequeira [2011-11-08 06:51] writes:
My 2c
I do not claim any of this to be scientific opinion or backed by any data.
I do not yet have any Lisp stuff in production (as in web servers). The day I do, I'd like to be confident that I can troubleshoot SLIME connectivity issues if there are any (I don't yet know what that setup will look like). I use SLIME, but in dev I don't care much about SLIME connectivity issues.
See, that's the problem with imaginary problems: we need to solve them too, in particular the irrational ones. I'm giving up. I'll just change the header back.
Helmut