Hi,
I have some byte arrays which are UTF-8 and some which are UTF-8 with a leading byte order mark (BOM).
I can convert these arrays to strings using
(babel:octets-to-string foo)
and
(babel:octets-to-string foo :start 3)
respectively, but I currently have to check by hand whether there is a BOM, like this:
(subseq foo 0 3)
#(239 187 191)
If I use (babel:octets-to-string foo) on a byte array with BOM markers, then my SBCL Lisp image dies.
Is there a better way to ask Babel to discover the correct encoding by looking for Byte Order Marks? Ideally I'd like one function call that worked with any array and figured out which encoding was being used automatically and works whether or not a BOM is present?
Sorry if I'm missing something obvious, I'm a Babel newbie. Any guidance or code samples gratefully received.
Thanks,
Rob.
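The manual check described above could be wrapped in a small helper so that callers make a single call. This is only a sketch: the names utf-8-bom-p and utf-8-octets-to-string are made up for illustration and are not part of Babel. It assumes Babel is already loaded and that the input is a vector of (unsigned-byte 8).

```lisp
;; Assumption: Babel is loaded, e.g. with (ql:quickload :babel).
;; Both function names below are hypothetical helpers, not Babel API.

(defun utf-8-bom-p (octets)
  "Return true if OCTETS starts with the UTF-8 BOM bytes 239 187 191."
  (and (>= (length octets) 3)
       (= (aref octets 0) 239)
       (= (aref octets 1) 187)
       (= (aref octets 2) 191)))

(defun utf-8-octets-to-string (octets)
  "Decode OCTETS as UTF-8, skipping a leading BOM if one is present."
  (babel:octets-to-string octets
                          :encoding :utf-8
                          :start (if (utf-8-bom-p octets) 3 0)))
```

With this in place, (utf-8-octets-to-string foo) can be called on either kind of array, and the BOM bytes never reach the decoder.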
Hello,
On Wed, Apr 6, 2011 at 11:07 AM, Rob Blackwell rob.blackwell@aws.net wrote:
If I use (babel:octets-to-string foo) on a byte array with BOM markers, then my SBCL Lisp image dies.
Is there a better way to ask Babel to discover the correct encoding by looking for Byte Order Marks? Ideally I’d like one function call that worked with any array and figured out which encoding was being used automatically and works whether or not a BOM is present?
Babel handles BOMs in UTF-16 and UTF-32 properly. It uses them to identify endianness then skips them. I'm not sure what one's supposed to do with BOMs in UTF-8; probably skip them, certainly not crash! This will require some debugging.
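For the "one function call" part of the question, one possible approach is to sniff the leading bytes and pick the encoding from them. The sketch below is not part of Babel; sniff-bom, starts-with-octets-p and decode-with-bom-sniffing are hypothetical names. It relies on the behaviour described above (the :utf-16 and :utf-32 decoders consume their own BOM), and it assumes that input without any BOM is UTF-8, which is a guess rather than real detection.

```lisp
;; Assumption: Babel is loaded, e.g. with (ql:quickload :babel).
;; All three function names are hypothetical helpers, not Babel API.

(defun starts-with-octets-p (octets prefix)
  "Return true if the vector OCTETS begins with the list of bytes PREFIX."
  (and (>= (length octets) (length prefix))
       (every #'= prefix octets)))

(defun sniff-bom (octets)
  "Return two values: an encoding keyword guessed from a leading BOM,
and the number of leading octets the caller should skip."
  (cond
    ;; UTF-32 is tested first because its little-endian BOM (FF FE 00 00)
    ;; begins with the UTF-16 little-endian BOM (FF FE).
    ((or (starts-with-octets-p octets '(#xFF #xFE #x00 #x00))
         (starts-with-octets-p octets '(#x00 #x00 #xFE #xFF)))
     (values :utf-32 0))              ; decoder is expected to skip the BOM
    ((or (starts-with-octets-p octets '(#xFE #xFF))
         (starts-with-octets-p octets '(#xFF #xFE)))
     (values :utf-16 0))              ; decoder is expected to skip the BOM
    ((starts-with-octets-p octets '(#xEF #xBB #xBF))
     (values :utf-8 3))               ; skip the three BOM bytes ourselves
    (t
     (values :utf-8 0))))             ; assumption: no BOM means UTF-8

(defun decode-with-bom-sniffing (octets)
  "Decode OCTETS using the encoding suggested by its leading BOM, if any."
  (multiple-value-bind (encoding skip) (sniff-bom octets)
    (babel:octets-to-string octets :encoding encoding :start skip)))
```

One known ambiguity: little-endian UTF-16 text whose first character after the BOM is U+0000 looks the same as a UTF-32 BOM, so this guess can be wrong for such input.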
Cheers,
Hello again,
On Wed, Apr 6, 2011 at 11:07 AM, Rob Blackwell rob.blackwell@aws.net wrote:
If I use (babel:octets-to-string foo) on a byte array with BOM markers, then my SBCL Lisp image dies.
I've tried this out and it works for me:
CL-USER> (babel:octets-to-string (babel-tests::ub8v 239 187 191 102 111 111))
"foo"
CL-USER> (length *)
4
I'm guessing you're using SLIME and you haven't set your slime-net-coding-system to 'utf-8-unix or something similar. Have a look at the *inferior-lisp* buffer when your Lisp crashes to see if that's the case.
HTH,
Luis,
I updated my .emacs as follows and it worked:
(set-language-environment "UTF-8")
(load (expand-file-name "~/quicklisp/slime-helper.el"))
(setq slime-net-coding-system 'utf-8-unix)
I'm still a little confused as to why the length is 4 and not 3 - shouldn’t the byte order mark have been discarded?
Many thanks!
Rob.
-----Original Message----- From: Luís Oliveira [mailto:luismbo@gmail.com] Sent: 12 April 2011 23:23 To: Rob Blackwell Cc: babel-devel@common-lisp.net Subject: Re: [babel-devel] octets-to-string with UTF8 and Byte Order Marker
-- Luís Oliveira http://r42.eu/~luis/
Hello,
Sorry for the late reply.
On Thu, Apr 21, 2011 at 10:36 PM, Rob Blackwell rob.blackwell@aws.net wrote:
I'm still a little confused as to why the length is 4 and not 3 - shouldn’t the byte order mark have been discarded?
I'm not sure. I couldn't find any clear indication of how a leading BOM should be handled for UTF-8. The BOM FAQ seems to indicate it should be converted to a ZERO WIDTH NO-BREAK SPACE (U+FEFF), maybe. Any comments? It would perhaps be interesting to check what well-established libraries such as ICU do.
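If the BOM is decoded as U+FEFF, which would explain the length of 4, a caller who does not want it could drop it after decoding. A minimal sketch, assuming a Lisp with Unicode characters (such as SBCL) and using the hypothetical name strip-leading-bom, which is not a Babel function:

```lisp
;; Hypothetical helper, not part of Babel.
;; Assumption: the implementation supports (code-char #xFEFF).

(defun strip-leading-bom (string)
  "Return STRING without a leading U+FEFF character, if it has one."
  (if (and (plusp (length string))
           (char= (char string 0) (code-char #xFEFF)))
      (subseq string 1)
      string))
```

Only the first character is examined, so a U+FEFF that appears later in the text is left alone.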
Cheers,