On Sun, Apr 12, 2009 at 9:35 PM, Gary Byers <gb@clozure.com> wrote:
I wasn't able to find a formal definition of UTF-8B anywhere; the informal descriptions that I saw suggested that it's a way of embedding binary data in UTF-8-encoded character data, with the binary data encoded in the low 8 bits of 16-bit codes whose high 8 bits contained #xdc.
IIUC, UTF-8B is meant as a way of converting random bytes that are *probably* in UTF-8 format into a Unicode string in such a way that it's possible to reconstruct the original byte sequence later on.
The "spec" for UTF-8B is in this email message from Markus Kuhn: http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html which I should have mentioned when I first brought this up. (Sorry, again.)
(defun decode-utf-8b-stream (stream)
  (do* ((code (read-utf-8-code stream) (read-utf-8-code stream)))
       ((null code))                    ; eof
    (if (eql *utf-8b-binary-marker-value*
             (ldb *utf-8b-binary-marker-byte* code))
      (process-binary (ldb *utf-8b-binary-data-byte* code))
      (process-character (or (code-char code) *replacement-character*)))))
Isn't that the basic idea, whether or not the details/parameters are right?
That would work. But it certainly seems much more convenient to use Lisp strings directly. I'll try to illustrate that with a concrete example.
These days, unix pathnames often seem to be encoded in UTF-8, but IIUC they can really be any random sequence of bytes -- or at least that seems to be the case on Linux.
Suppose I were implementing a directory browser in Lisp. If I could use UTF-8B to convert unix pathnames into Lisp strings, it would be straightforward to use Lisp strings and pathnames, pass them around, manipulate them with the standard string and pathname functions, and still be able to access the corresponding files through syscalls later on. In this scenario, my program wouldn't have trouble handling badly formed UTF-8 or other binary junk.
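For example, here is roughly what I imagine the directory browser doing. DECODE-UTF-8B and ENCODE-UTF-8B are hypothetical functions, not anything that exists in CCL today; they stand for "bytes -> string" and "string -> bytes" conversions using the UTF-8B rules:

;; Hypothetical round trip through UTF-8B.
(let* ((raw-name #(#x66 #x6f #x6f #xff #x2e #x74 #x78 #x74)) ; "foo" + #xFF + ".txt", as readdir() might return it
       (name     (decode-utf-8b raw-name))   ; plain Lisp string; the stray #xFF rides along as code point #xDCFF
       (backup   (concatenate 'string name ".bak")))
  ;; NAME and BACKUP can be passed around and manipulated like any
  ;; other strings, with the standard string and pathname functions.
  (encode-utf-8b backup))  ; the original bytes back, with ".bak" appended and #xFF intact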
The same applies to environment variables, command line arguments, and so on.
Does any of that make sense?