I'm using sbcl to write-char a 16-bit unsigned integer to a socket as two separate unsigned 8-bit bytes, for example 141 should appear as
#x00 #x8d.
SBCL appears to convert the #x8d into a two-byte utf-8 char, resulting in 3 bytes written to the stream
#x00 #xcd #x8d.
What is the proper incantation to achieve this? (SBCL on Windows, if that matters).
thanks pt
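A minimal sketch reproducing the symptom, assuming the stream's external format is (or defaults to) UTF-8; any code point above #x7f is then encoded as more than one octet:

(with-open-file (s "probe.bin" :direction :output
                   :if-exists :supersede
                   :external-format :utf-8)
  (write-char (code-char #x00) s)    ; one octet
  (write-char (code-char #x8d) s))   ; two octets: the UTF-8 encoding of code point #x8d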
On Apr 10, 2014, at 16:31 , Paul Tarvydas paultarvydas@gmail.com wrote:
I'm using sbcl to write-char a 16-bit unsigned integer to a socket as two separate unsigned 8-bit bytes, for example 141 should appear as
#x00 #x8d.
SBCL appears to convert the #x8d into a two-byte utf-8 char, resulting in 3 bytes written to the stream
#x00 #xcd #x8d.
What is the proper incantation to achieve this? (SBCL on Windows, if that matters).
It may not be very helpful, but the “right incantation” would be to write a CDR that specified the behavior of implementations that deal with UTF* and UNICODE.
Any takers?
Cheers — MA
I started such a thing a while ago, but never got it to the point of submitting. There is a student interested in enhancing the Unicode support in SBCL for GSOC 14: perhaps I can integrate that into his project, at least partially.
-tree
On Apr 10, 2014, at 11:05, Antoniotti Marco antoniotti.marco@disco.unimib.it wrote:
On Apr 10, 2014, at 16:31 , Paul Tarvydas paultarvydas@gmail.com wrote:
I'm using sbcl to write-char a 16-bit unsigned integer to a socket as two separate unsigned 8-bit bytes, for example 141 should appear as
#x00 #x8d.
SBCL appears to convert the #x8d into a two-byte utf-8 char, resulting in 3 bytes written to the stream
#x00 #xcd #x8d.
What is the proper incantation to achieve this? (SBCL on Windows, if that matters).
It may not be very helpful, but the “right incantation” would be to write a CDR that specified the behavior of implementations that deal with UTF* and UNICODE.
Any takers?
Cheers — MA
Antoniotti Marco antoniotti.marco@disco.unimib.it writes:
On Apr 10, 2014, at 16:31 , Paul Tarvydas paultarvydas@gmail.com wrote:
I'm using sbcl to write-char a 16-bit unsigned integer to a socket as two separate unsigned 8-bit bytes, for example 141 should appear as
#x00 #x8d.
SBCL appears to convert the #x8d into a two-byte utf-8 char, resulting in 3 bytes written to the stream
#x00 #xcd #x8d.
What is the proper incantation to achieve this? (SBCL on Windows, if that matters).
It may not be very helpful, but the “right incantation” would be to write a CDR that specified the behavior of implementations that deal with UTF* and UNICODE.
No, not in this case.
On 2014-04-11, 13:02 , "Pascal J. Bourguignon" pjb@informatimago.com wrote:
Antoniotti Marco antoniotti.marco@disco.unimib.it writes:
On Apr 10, 2014, at 16:31 , Paul Tarvydas paultarvydas@gmail.com wrote:
I'm using sbcl to write-char a 16-bit unsigned integer to a socket as two separate unsigned 8-bit bytes, for example 141 should appear as
#x00 #x8d.
SBCL appears to convert the #x8d into a two-byte utf-8 char, resulting in 3 bytes written to the stream
#x00 #xcd #x8d.
What is the proper incantation to achieve this? (SBCL on Windows, if that matters).
It may not be very helpful, but the “right incantation” would be to write a CDR that specified the behavior of implementations that deal with UTF* and UNICODE.
No, not in this case.
Why not?
Cheers — MA
Because this is a binary write, not a character write. It has nothing to do with Unicode or anything else. Unless the original problem has to do with writing UCS-2 or UTF-16, but there was nothing in the original question that had anything to do with characters, other than the incorrect use of write-char to write a binary value.
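A minimal sketch of what Bob describes, assuming the socket library hands back a stream with element type (unsigned-byte 8):

(defun send-u16 (value stream)
  ;; Split the 16-bit value into two octets, high octet first,
  ;; so 141 is written as #x00 #x8d.
  (write-byte (ldb (byte 8 8) value) stream)
  (write-byte (ldb (byte 8 0) value) stream))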
On Apr 11, 2014, at 7:37 AM, Antoniotti Marco antoniotti.marco@disco.unimib.it wrote:
On 2014-04-11, 13:02 , "Pascal J. Bourguignon" pjb@informatimago.com wrote:
Antoniotti Marco antoniotti.marco@disco.unimib.it writes:
On Apr 10, 2014, at 16:31 , Paul Tarvydas paultarvydas@gmail.com wrote:
I'm using sbcl to write-char a 16-bit unsigned integer to a socket as two separate unsigned 8-bit bytes, for example 141 should appear as
#x00 #x8d.
SBCL appears to convert the #x8d into a two-byte utf-8 char, resulting in 3 bytes written to the stream
#x00 #xcd #x8d.
What is the proper incantation to achieve this? (SBCL on Windows, if that matters).
It may not be very helpful, but the “right incantation” would be to write a CDR that specified the behavior of implementations that deal with UTF* and UNICODE.
No, not in this case.
Why not?
Cheers — MA
On Apr 11, 2014, at 22:18 , Bob Cassels bobcassels@netscape.net wrote:
Because this is a binary write, not a character write. It has nothing to do with Unicode or anything else. Unless the original problem has to do with writing UCS-2 or UTF-16, but there was nothing in the original question that had anything to do with characters, other than the incorrect use of write-char to write a binary value.
I understand that my original message was not on point. In fact I changed the subject line in my response… In any case, the issue nevertheless appears to be the handling of characters. Maybe Paul can clarify what he was really trying to do.
In any case… am I the only person who thinks that a “sub-standard” on these issues may be a Good Thing?
Cheers — MA
On Fri, Apr 11, 2014 at 4:32 PM, Antoniotti Marco antoniotti.marco@disco.unimib.it wrote:
I understand that my original message was not on point. In fact I changed the subject line in my response… In any case, the issue nevertheless appears to be the handling of characters. Maybe Paul can clarify what he was really trying to do.
In any case… am I the only person who thinks that a “sub-standard” on these issues may be a Good Thing?
For basic survival, ASDF since 2.21 has support for encodings, and asdf::*utf-8-external-format* (now exported by uiop) will let you portably handle utf-8 streams (falling back to :default on 8-bit implementations). UCS-2 and UTF-16 are not universally supported, but asdf-encodings can help you find your implementation's external-format for them if available. Or for portable behavior reimplementing things the hard way, you could use either cl-unicode and flexi-streams, or babel and streams of (unsigned-byte 8).
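A sketch of the last option Faré lists (babel plus an (unsigned-byte 8) stream); BABEL:STRING-TO-OCTETS does the encoding in user code, so the stream itself stays purely binary:

(defun write-utf-8-string (string octet-stream)
  ;; Encode in memory, then write the raw octets.
  (write-sequence (babel:string-to-octets string :encoding :utf-8)
                  octet-stream))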
If you can't convince the community to choose between babel and cl-unicode and whichever other alternatives may exist, what makes you think you can get yet another incompatible standard widely adopted? https://xkcd.com/927/
PS: all implementations that accept unicode accept :external-format :utf-8... except clisp, that requires you to use 'charset:utf-8. If you want to work towards a common external-format, start here
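A sketch of papering over the CLISP exception with a read-time conditional, which is roughly what asdf::*utf-8-external-format* does:

(defparameter *utf-8-external-format*
  #+clisp charset:utf-8
  #-clisp :utf-8
  "An external-format designator for UTF-8 on this implementation.")

;; e.g. (open path :external-format *utf-8-external-format*)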
On 2014-04-11, 22:57 , "Faré" fahree@gmail.com wrote:
On Fri, Apr 11, 2014 at 4:32 PM, Antoniotti Marco antoniotti.marco@disco.unimib.it wrote:
I understand that my original message was not on point. In fact I changed the subject line in my response… In any case, the issue nevertheless appears to be the handling of characters. Maybe Paul can clarify what he was really trying to do.
In any case… am I the only person who thinks that a “sub-standard” on these issues may be a Good Thing?
For basic survival, ASDF since 2.21 has support for encodings, and asdf::*utf-8-external-format* (now exported by uiop) will let you portably handle utf-8 streams (falling back to :default on 8-bit implementations). UCS-2 and UTF-16 are not universally supported, but asdf-encodings can help you find your implementation's external-format for them if available. Or for portable behavior reimplementing things the hard way, you could use either cl-unicode and flexi-streams, or babel and streams of (unsigned-byte 8).
I am aware of all the things “out there”, and yet, having a number of libraries or even a single library is not the same as “having a standard”.
If you can't convince the community to choose between babel and cl-unicode and whichever other alternatives may exist, what makes you think you can get yet another incompatible standard widely adopted? https://xkcd.com/927/
I am not advocating the proverbial 15th incompatible standard. Since by now people should know what they are doing, it would be nicer to have a document that summed things up. Didn't the ANSI spec essentially come about in that way?
PS: all implementations that accept unicode accept :external-format :utf-8... except clisp, that requires you to use 'charset:utf-8. If you want to work towards a common external-format, start here
I said “any takers?”. I am just the customer telling the market what would be nice to have :) and that is the reason why I will not build the 15th “standard” (or the next external-encoding library). The question I am posing to the authors of the libraries you mentioned is why they don't sit down and write such a summary collaborative document and agree on a common interface. Of course the usual responses may be put forth (time, money or both), so my request may be moot. I am aware of that. And yet, why not ask?
Cheers — MA
P.S. Networking anybody? Multiprocessing?
I am aware of all the things “out there”, and yet, having a number of libraries or even a single library is not the same as “having a standard”.
If there were a single open-source go-to library, and it were stable, that would be a *de facto* standard that you could then codify. But then, the advantages of the codification would be dubious, since you already have the library, and all the codification creates is opportunity for divergently buggy reimplementations.
IETF requires two independent interoperating implementations before it declares a standard adopted. What does interoperating mean here? The two libraries must use the same package? That's conflict. Different packages but the same symbol names? That's not interoperation.
There's no way to win this standardization game.
If you can't convince the community to choose between babel and cl-unicode and whichever other alternatives may exist, what makes you think you can get yet another incompatible standard widely adopted? https://xkcd.com/927/
I am not advocating the proverbial 15th incompatible standard. Since by now people should know what they are doing, it would be nicer to have a document that summed things up. Didn't the ANSI spec essentially come about in that way?
CL was standardized by trying to compromise between existing implementations; that was both its success (starting from something that exists, and making it converge a bit) and its limitation (with the horrible practice of underspecified functionality, which creates as many portability landmines).
If you want to restart a similar process, good luck trying to put the developers of these implementations on the same mailing-list:
maintained: abcl allegro ccl clisp cmucl lispworks mkcl sbcl
semi-maintained: ecl gcl scl
in development: clasp mocl sacla
unmaintained: corman genera mcl xcl
Back in the day, there were big customers and the threat of reduced DARPA funding that put everyone in the same room. No such incentive today.
PS: all implementations that accept unicode accept :external-format :utf-8... except clisp, that requires you to use 'charset:utf-8. If you want to work towards a common external-format, start here
I said “any takers?”. I am just the customer telling the market what would be nice to have :) and that is the reason why I will not build the 15th “standard” (or the next external-encoding library). The question I am posing to the authors of the libraries you mentioned is why they don't sit down and write such a summary collaborative document and agree on a common interface. Of course the usual responses may be put forth (time, money or both), so my request may be moot. I am aware of that. And yet, why not ask?
Well, given enough million dollars, I'm sure you can convince people to sit in the same room. Not that I would recommend that as a good investment.
P.S. Networking anybody? Multiprocessing?
iolib? lparallel?
How many man-years of lisp experts would you fund to quibble over language lawyering vs actually improving those libraries?
On Apr 12, 2014, at 24:37 , Faré fahree@gmail.com wrote:
I am aware of all the things “out there”, and yet, having a number of libraries or even a single library is not the same as “having a standard”.
If there were a single open-source go-to library, and it were stable, that would be a *de facto* standard that you could then codify. But then, the advantages of the codification would be dubious, since you already have the library, and all the codification creates is opportunity for divergently buggy reimplementations.
Yes. But if a consensus is there about the meaning of a particular API, there will be something to discuss.
IETF requires two independent interoperating implementations before it declares a standard adopted. What does interoperating mean here? The two libraries must use the same package? That's conflict. Different packages but the same symbol names? That's not interoperation.
There's no way to win this standardization game.
There are several points of view about who “wins” and who “loses”.
About the request for two different interoperable implementations, you could, in principle, deal with it at the package level, thanks to the nickname facility (well, it would work better if it were more “standardized” :) ).
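A sketch of the package-level idea Marco mentions; "THREAD-SPEC" is a hypothetical agreed-upon nickname, and bordeaux-threads is just the example library here:

(let ((pkg (find-package '#:bordeaux-threads)))
  ;; Add the agreed spec nickname while keeping the existing ones,
  ;; so client code can be written against THREAD-SPEC alone.
  (rename-package pkg (package-name pkg)
                  (cons "THREAD-SPEC" (package-nicknames pkg))))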
If you can't convince the community to choose between babel and cl-unicode and whichever other alternatives may exist, what makes you think you can get yet another incompatible standard widely adopted? https://xkcd.com/927/
I am not advocating the proverbial 15th incompatible standard. Since by now people should know what they are doing, it would be nicer to have a document that summed things up. Didn't the ANSI spec essentially come about in that way?
CL was standardized by trying to compromise between existing implementations; that was both its success (starting from something that exists, and making it converge a bit) and its limitation (with the horrible practice of underspecified functionality, which creates as many portability land mines).
Yes. Exactly. And mopping up these land mines is not something that can be done at the library level alone.
If you want to restart a similar process, good luck trying to put the developers of these implementations on the same mailing-list:
maintained: abcl allegro ccl clisp cmucl lispworks mkcl sbcl
semi-maintained: ecl gcl scl
in development: clasp mocl sacla
unmaintained: corman genera mcl xcl
I am in no position to “restart” a similar process. I am posting here because this is where - I believe - most people interested in such things may be listening. If the thing gets “restarted”, good. If not, it will not hurt me individually.
Back in the day, there were big customers and the threat of reduced DARPA funding that put everyone in the same room. No such incentive today.
Yes. But this argument has been circulated many times before, as I said. Maybe some collaborative effort can be put together nevertheless.
PS: all implementations that accept unicode accept :external-format :utf-8... except clisp, that requires you to use 'charset:utf-8. If you want to work towards a common external-format, start here
I said “any takers?”. I am just the customer telling the market what would be nice to have :) and that is the reason why I will not build the 15th “standard” (or the next external-encoding library). The question I am posing to the authors of the libraries you mentioned is why they don't sit down and write such a summary collaborative document and agree on a common interface. Of course the usual responses may be put forth (time, money or both), so my request may be moot. I am aware of that. And yet, why not ask?
Well, given enough million dollars, I'm sure you can convince people to sit in the same room. Not that I would recommend that as a good investment.
Nobody has millions of dollars for such a thing. We all know that (although Google may pour a couple of million into it, hint, hint :) :) ).
P.S. Networking anybody? Multiprocessing?
iolib? lparallel?
As good as they are, they are not “specifications”. I.e., I cannot write a separate (as buggy as you want) implementation of them. Nor can I write a separate bordeaux-threads. In fact, as an example, lparallel is just one of the “concurrency” libraries “out there” (cf. http://www.cliki.net/concurrency).
How many man-years of lisp experts would you fund to quibble over language lawyering vs actually improving those libraries?
Language lawyering is not necessarily a bad thing IMHO; moreover, it may lead to better designs and shared knowledge. Your question may be turned around and asked in the following way:
How many man-years of lisp experts have gone into building all the concurrency libraries listed on http://www.cliki.net/concurrency? How many man-years should we devote to debating how to choose which one to improve on?
Note that I do not think that those years spent on building concurrency libraries are lost (and I say that from the vantage point of a person who suffers from a severe case of NIH-syndrome :) ), but instead of picking a winner, it may be best at this point to stand back, take a deep breath and maybe produce something that at least a few of the actors can recognize as a good specification. If the specification will be based on the parallel API, hey! more power to it!
Of course this is a general statement that applies, in my intentions, especially to all the dark corners of the ANSI spec before everything else. At least that would be my priority (i.e., the :external-format thingy is just an example).
Cheers — MA
On Mon, 14 Apr 2014 10:02:46 +0000 Antoniotti Marco antoniotti.marco@disco.unimib.it wrote:
I am aware of all the things “out there”, and yet, having a number of libraries or even a single library is not the same as “having a standard”.
Sorry for complaining, but unfortunately this post was very difficult to follow (it was not obvious what was quoted, and there seems to be some character encoding issues).
Thanks,
Thus spake Matthew Mondor (2014-04-14, 09:14):
I am aware of all the things “out there”, and yet, having a number of libraries or even a single library is not the same as “having a standard”.
Sorry for complaining, but unfortunately this post was very difficult to follow (it was not obvious what was quoted, and there seems to be some character encoding issues).
Yeah, I had to switch to the html part (plain being my default) and all was more or less nicely quoted/indented (albeit with some single-byte vs utf8 issues apparent in quotes):
https://antoszka.pl/tmp/1d87edc3032609cc579b6116d7584fd0459dd3fd.png
vs. the more or less unreadable default plain-text part:
https://antoszka.pl/tmp/a943281be7299d02bc17b59dfbf20a1b547951f3.png
so I suppose Marco could have a look at what his e-mail client is actually outputting :).
Antoniotti Marco antoniotti.marco@disco.unimib.it writes:
As good as they are, they are not “specifications”. I.e., I cannot write a separate (as buggy as you want) implementation of them. Nor can I write a separate bordeaux-threads. In fact, as an example, lparallel is just one of the “concurrency” libraries “out there” (cf. http://www.cliki.net/concurrency).
How many man-years of lisp experts would you fund to quibble over language lawyering vs actually improving those libraries?
Language lawyering is not necessarily a bad thing IMHO; moreover, it may lead to better designs and shared knowledge. Your question may be turned around and asked in the following way:
How many man-years of lisp experts have gone into building all the concurrency libraries listed on http://www.cliki.net/concurrency? How many man-years should we devote to debating how to choose which one to improve on?
Note that I do not think that those years spent on building concurrency libraries are lost (and I say that from the vantage point of a person who suffers from a severe case of NIH-syndrome :) ), but instead of picking a winner, it may be best at this point to stand back, take a deep breath and maybe produce something that at least a few of the actors can recognize as a good specification. If the specification will be based on the parallel API, hey! more power to it!
Of course this is a general statement that applies, in my intentions, especially to all the dark corners of the ANSI spec before everything else. At least that would be my priority (i.e., the :external-format thingy is just an example).
I think we should be careful to distinguish system libraries providing fundamental services from framework libraries.
A portability library, such as Bordeaux-threads or Closer-Mop, is mostly a system library, since it provides only a common API for services already provided by the various implementations.
Other libraries, that may seem functionally redundant, are more often in the framework category. They may not be available on all platforms or implementations, and they may impose a strong structure on the project.
For those frameworks, you may very well want to avoid them and re-invent your own subset framework. Or adopt them and exploit them.
http://codeisbeautiful.org/why-the-not-invented-here-syndrom-is-often-a-good... http://www.joelonsoftware.com/articles/fog0000000007.html
Sometimes it's more important to know the entire code of a library or application you've written yourself, to be able to make it evolve swiftly (evolve, not grow), than to reuse a big framework library with hundreds of dependencies, possibly written in unfamiliar programming styles, where you would spend weeks finding bugs or implementing simple changes because of the sheer complexity and size of the body of code it represents.
That said, I'm definitely in favor of (sub-)standardizing the API of system-level services. Last century, I even started a cl-posix project to define a Common Lisp POSIX API on sourceforge (which unfortunately was garbage-collected by sourceforge for lack of activity a long time ago). Two years ago, I reported on the behavior of CL implementations on handling logical pathnames on POSIX systems. AFAIK, this hasn't moved implementers to improve the situation; I should have followed up with a CDR to let users require it from implementors with a simple message: please implement CDR-x. I may do that eventually.
There is very little for a substandard to specify without overreaching (or unnecessarily duplicating) other, more universal specifications.
Back when X3J13 spent a lot of time considering I18N, Unicode didn't yet exist. Unicode has been a big success in dealing with a very difficult problem, and had it existed, X3J13 probably would have specified it (if an implementation chooses to support more than the set of basic chars) just as Java did years later. Further, UTF-8 didn't yet exist, but if it had, implementations and perhaps even X3J13 would have adopted it as a default. But that's the history of a different universe, and there exist some historical implementations that have character code points different from Unicode.
Every Common Lisp implementation implements a character set and maps those characters onto nonnegative integer code points. That mapping is not specified, although Unicode (or perhaps just its intersection with ASCII) would be the sane choice in modern times. But this has nothing to do with external formats. Unicode does not define externalization formats -- it defines _only_ the mapping between zillions of characters (most not yet existent) and nonnegative integer code points in the range of 21 bits. It can and does do this without the blessing of the Lisp community.
UTF-8 defines a mapping of Unicode code points onto a sequence of octets. It was originally defined to support the encoding of arbitrary 32-bit nonnegative integers onto sequences of 1 to 6 octets, but it was subsequently tied closer to Unicode in that it is defined only over the 21-bit Unicode range, and in that certain code points (e.g. the surrogate pairs) are defined to be errors. (Much of this is explained understandably on the Wikipedia UTF-8 page.) So, UTF-8 is well defined and can work without the blessing of the Lisp community.
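As a worked instance of that mapping, the two-byte template 110xxxxx 10xxxxxx applied to code point #x8d (the value from the thread's original question) gives #xC2 #x8D:

(let ((cp #x8d))
  (list (logior #b11000000 (ldb (byte 5 6) cp))    ; leading byte => #xC2
        (logior #b10000000 (ldb (byte 6 0) cp))))  ; continuation => #x8D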
So, if an implementation supports UTF-8 as an external format, it ought translate whatever it uses for its internal code points into UTF-8 (which represents, of course, Unicode code points). Those internal code points are not the business of any specification, and the UTF-8 translation is already well defined by the Unicode and UTF-8 standards.
What's left? Well, there is a little that could still be productively substandardificated. Specifically, the ANS punts nearly completely on what can be used as the value of an :external-format argument. So quasi-portable code can't know what to specify if it wants to join the modern computing community and read/write UTF-8. I think the obvious answer is to draft a substandard for a convention of :keyword names which an implementation ought support for portability. (Allegro does this, and I'd be happy to provide a list of the many encodings and ef names that have been supported for decades.) The most important one is of course :UTF-8, but it would be worth semistandardizing this along with the many ISO8859-nn encodings plus the several traditional popular Japanese and Chinese encodings. All these encodings are rapidly falling out of usage, but there are historical web pages and other sources that Common Lisp ought be able to internalize (for those implementations that think this is important).
Other than external format naming, I can't think of anything that Common Lisp needs to standardize. Yes, the language would have been a better programming-ecology citizen if code points were defined as Unicode, but that would be backward incompatible.
On Sat, Apr 12, 2014 at 5:12 AM, Steve Haflich shaflich@gmail.com wrote:
Other than external format naming, I can't think of anything that Common Lisp needs to standardize.
The main thing that comes to mind is handling of errors. Implementations differ widely in how they handle encoding and decoding errors. Some ignore them silently, others signal errors. Some provide useful restarts, others don't. That was one of the main motivations behind Babel.
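To make that divergence concrete, a defensive sketch; it is hedged with a general ERROR handler because the condition class signalled for malformed input differs across implementations and libraries:

(defun octets-to-string-or-fallback (octets &optional (fallback "?"))
  ;; Try a strict UTF-8 decode; on any decoding error, return FALLBACK.
  (handler-case (babel:octets-to-string octets :encoding :utf-8)
    (error () fallback)))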
Thank you Steve for your very good summary.
IMHO, what is needed is at least the kind of summary document saying what “sub-standard” external formats are available; you say that that would be possible and the only sensible thing to do at this time. I agree, and that would already be a good thing to have.
The issue of how and whether specifying the mapping to code points would break backward compatibility is not clear to me, but then, my character-coding fu and its ramifications are very weak.
Cheers — MA
Antoniotti Marco antoniotti.marco@disco.unimib.it writes:
On Apr 11, 2014, at 22:18 , Bob Cassels bobcassels@netscape.net wrote:
Because this is a binary write, not a character write. It has nothing to do with Unicode or anything else. Unless the original problem has to do with writing UCS-2 or UTF-16, but there was nothing in the original question that had anything to do with characters, other than the incorrect use of write-char to write a binary value.
I understand that my original message was not on point. In fact I changed the subject line in my response… In any case, the issue nevertheless appears to be the handling of characters. Maybe Paul can clarify what he was really trying to do.
In any case… am I the only person who thinks that a “sub-standard” on these issues may be a Good Thing?
What could be a good thing is a partial standardization of :external-format. But I'd say that a standardization of the semantics of pathnames (including logical pathnames) on POSIX systems would be more urgent.
Paul,
you need to use a stream with :element-type '(unsigned-byte 8) if you want to control the binary encoding. If you need to mix binary and text, maybe FLEXI-STREAMS (http://weitz.de/flexi-streams/) is helpful.
-Hans
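A sketch of Hans's two-layer arrangement; MAKE-OCTET-SOCKET-STREAM is a hypothetical stand-in for however the (unsigned-byte 8) socket stream is actually obtained:

(let* ((binary (make-octet-socket-stream))   ; hypothetical; yields an (unsigned-byte 8) stream
       (text (flexi-streams:make-flexi-stream binary :external-format :utf-8)))
  (write-byte #x00 binary)            ; raw octets bypass character encoding
  (write-byte #x8d binary)
  (write-string "hello" text)         ; text goes through the UTF-8 encoder
  (force-output binary))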
2014-04-10 10:31 GMT-04:00 Paul Tarvydas paultarvydas@gmail.com:
I'm using sbcl to write-char a 16-bit unsigned integer to a socket as two separate unsigned 8-bit bytes, for example 141 should appear as
#x00 #x8d.
SBCL appears to convert the #x8d into a two-byte utf-8 char, resulting in 3 bytes written to the stream
#x00 #xcd #x8d.
What is the proper incantation to achieve this? (SBCL on Windows, if that matters).
thanks pt
I can't easily verify right now, but check the :external-format on your stream: it may be defaulting to UTF-8 and you will need to specify something else.
-tree
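One concrete “something else”, as a sketch: Latin-1 maps code points 0-255 one-to-one onto octets, so a character stream opened this way writes (CODE-CHAR #x8d) as the single octet #x8d (SBCL accepts :latin-1 as an external-format name):

(with-open-file (s "out.bin" :direction :output
                   :if-exists :supersede
                   :external-format :latin-1)
  (write-char (code-char #x00) s)
  (write-char (code-char #x8d) s))   ; exactly #x00 #x8d on disk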
On Apr 10, 2014, at 10:31, Paul Tarvydas paultarvydas@gmail.com wrote:
I'm using sbcl to write-char a 16-bit unsigned integer to a socket as two separate unsigned 8-bit bytes, for example 141 should appear as
#x00 #x8d.
SBCL appears to convert the #x8d into a two-byte utf-8 char, resulting in 3 bytes written to the stream
#x00 #xcd #x8d.
What is the proper incantation to achieve this? (SBCL on Windows, if that matters).
thanks pt
I end up doing something like:
(defmacro with-open-binary-file (args &rest rest)
  ;; WITH-OPEN-FILE with the element type forced to octets.
  `(with-open-file (,@args :element-type '(unsigned-byte 8))
     ,@rest))

(defun write-word (word out)
  ;; Write a 16-bit word as two octets, high octet first.
  (write-byte (ldb (byte 8 8) word) out)
  (write-byte (ldb (byte 8 0) word) out))
only because I'm exclusively writing binary stuff to the files this code serves, and because it parallels the C code that does the same thing fairly well. I'm not writing to a socket stream, but this may help anyway. It might need to account for endianness, but I'm not sure. It's been a while since I've looked at it closely.
Neil Gilmore raito@raito.com
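On the endianness point: Neil's WRITE-WORD sends the high octet first, i.e. network byte order, which matches the #x00 #x8d layout Paul wants for 141. A sketch of the little-endian variant, should the peer expect the opposite order:

(defun write-word-le (word out)
  ;; Low octet first: 141 goes out as #x8d #x00.
  (write-byte (ldb (byte 8 0) word) out)
  (write-byte (ldb (byte 8 8) word) out))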
I can't easily verify right now, but check the :external-format on your stream: it may be defaulting to UTF-8 and you will need to specify something else.
-tree
On Apr 10, 2014, at 10:31, Paul Tarvydas paultarvydas@gmail.com wrote:
I'm using sbcl to write-char a 16-bit unsigned integer to a socket as two separate unsigned 8-bit bytes, for example 141 should appear as
#x00 #x8d.
SBCL appears to convert the #x8d into a two-byte utf-8 char, resulting in 3 bytes written to the stream
#x00 #xcd #x8d.
What is the proper incantation to achieve this? (SBCL on Windows, if that matters).
thanks pt
Paul Tarvydas paultarvydas@gmail.com writes:
I'm using sbcl to write-char a 16-bit unsigned integer to a socket as two separate unsigned 8-bit bytes, for example 141 should appear as
#x00 #x8d.
SBCL appears to convert the #x8d into a two-byte utf-8 char, resulting in 3 bytes written to the stream
#x00 #xcd #x8d.
What is the proper incantation to achieve this? (SBCL on Windows, if that matters).
http://www.cliki.net/CloserLookAtCharacters