On Sun, 12 Apr 2009, Luís Oliveira wrote:
> Again, I'm curious how UTF-8B might be implemented when CODE-CHAR returns NIL for #xDC80 through #xDCFF.
Let's assume that we have something that reads a sequence of 1 or more UTF-8-encoded bytes from a stream (and that we have variants that do the same for bytes in foreign memory, a lisp vector, etc.). If it gets an EOF while trying to read the first byte of a sequence, it returns NIL; otherwise, it returns an unsigned integer less than #x110000. If it can tell that a sequence is malformed (overlong, whatever), it returns the CHAR-CODE of the Unicode replacement character (#xfffd); it does not reject encoded values that correspond to UTF-16 surrogate pairs or other non-character code points.
(defun read-utf-8-code (stream)
  "Try to read 1 or more octets from STREAM.  Return NIL if EOF is
encountered when reading the first octet; otherwise, return an unsigned
integer less than #x110000.  If a malformed UTF-8 sequence is detected,
return the character code of #\Replacement_Character; otherwise, return
the encoded value."
  (let* ((b0 (read-byte stream nil nil)))
    (when b0
      (if (< b0 #x80)
        b0
        (if (< b0 #xc2)
          (char-code #\Replacement_Character)
          (let* ((b1 (read-byte stream nil nil)))
            (if (null b1)
              ;; [Lots of other details to get right, not shown]
              )))))))
This (or something very much like it) has to exist in order to support UTF-8; the elided details are surprisingly complicated (if we want to reject malformed sequences).
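Just to give a flavor of what's elided, here's a rough sketch of the two-byte case only, assuming B0 is already known to be in the range #xc2-#xdf. DECODE-TWO-BYTE-UTF-8 is a made-up helper name, and this skips the longer sequences and most of the validity checks a real decoder needs:

(defun decode-two-byte-utf-8 (b0 b1)
  ;; B1 must be a continuation byte (#b10xxxxxx); anything else,
  ;; including EOF (NIL), is treated as malformed.
  (if (and b1 (= (ldb (byte 2 6) b1) #b10))
    (logior (ash (ldb (byte 5 0) b0) 6)   ; low 5 bits of B0
            (ldb (byte 6 0) b1))          ; low 6 bits of B1
    (char-code #\Replacement_Character)))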
I wasn't able to find a formal definition of UTF-8B anywhere; the informal descriptions that I saw suggested that it's a way of embedding binary data in UTF-8-encoded character data, with the binary data encoded in the low 8 bits of 16-bit codes whose high 8 bits contain #xdc. If the binary data is in fact embedded in the low 7 bits of codes in the range #xdc80-#xdcff, or something else entirely, then the following parameters would need to change:
(defparameter *utf-8b-binary-data-byte* (byte 8 0))
(defparameter *utf-8b-binary-marker-byte* (byte 13 8))
(defparameter *utf-8b-binary-marker-value* #xdc)
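Just to make the intent of those parameters concrete, here's a made-up helper (purely illustrative, not part of any real API) that shows how a raw binary byte would map into that marker range:

(defun encode-utf-8b-binary-code (octet)
  ;; e.g. (encode-utf-8b-binary-code #x80) => #xdc80
  (dpb *utf-8b-binary-marker-value* *utf-8b-binary-marker-byte*
       (dpb octet *utf-8b-binary-data-byte* 0)))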
PROCESS-BINARY and PROCESS-CHARACTER do whatever it is that you want to do with a byte of binary data or a character. A real decoder might want to take these functions - or a single function that processes either a byte or a character - as arguments.
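If it helps to see the decoder below compile, here are trivial stand-ins for those two functions (again, just for illustration; a real decoder would do something more useful):

(defun process-binary (octet)
  ;; Stand-in: just show the raw byte in hex.
  (format t "~&binary byte #x~2,'0x~%" octet))

(defun process-character (char)
  ;; Stand-in: just echo the character.
  (write-char char))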
This is just #\Replacement_Character in CCL:
(defparameter *replacement-character* (code-char #xfffd))
(defun decode-utf-8b-stream (stream)
  (do* ((code (read-utf-8-code stream) (read-utf-8-code stream)))
       ((null code))                    ; eof
    (if (eql *utf-8b-binary-marker-value*
             (ldb *utf-8b-binary-marker-byte* code))
      (process-binary (ldb *utf-8b-binary-data-byte* code))
      (process-character (or (code-char code) *replacement-character*)))))
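To run that against a file, you'd open it as an octet stream; something like this (the filename is obviously made up):

(with-open-file (in "some-file.txt" :element-type '(unsigned-byte 8))
  (decode-utf-8b-stream in))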
Isn't that the basic idea, whether the details/parameters are right or not?
--
Luís Oliveira
http://student.dei.uc.pt/~lmoliv/