Dear all,
asdf-encodings now includes a version of Douglas Crosher's encoding detection algorithm. The detection is also automatically disabled on implementations without Unicode support.
As compared to Douglas's version, the detection algorithm uses :ascii or :latin1 instead of :default as a fallback when no declaration was found and no UTF-n encoding was detected. It also uses a 1024-byte buffer rather than a 320-byte buffer, to imitate what Emacs does with the beginning of a file.
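For those curious about the shape of the logic, here is a rough sketch of the strategy (NOT the actual asdf-encodings code; parse-coding-declaration and valid-utf-8-p are hypothetical placeholders for the declaration-parsing and UTF-8-validation steps):

  (defun guess-file-encoding (pathname)
    ;; Read up to 1024 octets, look for an Emacs-style coding
    ;; declaration, then guess from the octets themselves.
    (with-open-file (stream pathname :element-type '(unsigned-byte 8))
      (let* ((buffer (make-array 1024 :element-type '(unsigned-byte 8)))
             (end (read-sequence buffer stream))
             (head (subseq buffer 0 end)))
        (cond ((parse-coding-declaration head)) ; hypothetical: -*- coding: ... -*-
              ((every (lambda (o) (< o #x80)) head) :ascii)
              ((valid-utf-8-p head) :utf-8)     ; hypothetical helper
              (t :latin1)))))                   ; deterministic fallback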
I suggest that asdf-encodings is now ready for testing, and invite you to test it.
An example package that uses it is my lambda-reader, heavily modified from Brian Mastenbrook's original to make it hopefully portable to all implementations, with or without UTF-8 support.
—♯ƒ • François-René ÐVB Rideau •Reflection&Cybernethics• http://fare.tunes.org
Sin lies only in hurting other people unnecessarily. All other "sins" are invented nonsense. (Hurting yourself is not sinful — just stupid.) — Robert Heinlein, "Time Enough For Love"
On 04/22/2012 09:38 AM, Faré wrote:
> Dear all,
> asdf-encodings now includes a version of Douglas Crosher's encoding detection algorithm. The detection is also automatically disabled on implementations without Unicode support.
> As compared to Douglas's version, the detection algorithm uses :ascii or :latin1 instead of :default as a fallback when no declaration was found and no UTF-n encoding was detected. It also uses a 1024-byte buffer rather than a 320-byte buffer, to imitate what Emacs does with the beginning of a file.
Increasing this to 1024 bytes seems a good idea, just in case the declaration is a few lines down.
Note that a file having all octets below #x80 does not ensure it is ASCII, just that it does not have any UTF-8-specific codes.
The default should still be :default, because the file may be in an encoding that the CL implementation is aware of.
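For reference, :default is the one external-format designator the CL standard guarantees; it defers the choice of encoding to the implementation, typically according to the OS locale. A minimal illustration, with a made-up file name:

  ;; :default lets the implementation pick the encoding; what it picks
  ;; varies between Lisps and between machine configurations.
  (with-open-file (s "some-library.lisp" :external-format :default)
    (read-line s))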
> I suggest that asdf-encodings is now ready for testing, and invite you to test it.
> An example package that uses it is my lambda-reader, heavily modified from Brian Mastenbrook's original to make it hopefully portable to all implementations, with or without UTF-8 support.
From the lambda-reader source code:
;;; Note that this file uses UTF-8.
;;; But if you use an implementation that does not recognize UTF-8,
;;; and instead has 8-bit characters, it should still work,
;;; and other files should still be able to use its functionality, provided
;;; (1) you do NOT transcode either this file or the files that use it
;;; (2) you do not care that lambda be read as a sequence of characters
;;;     CE BB (or CE 9B for uppercase lambda) rather than a single character.
Making code dependent on the file encoding is not recommended, and writing a library that requires code that uses it to be in the same encoding is hardly defensible. If another author decides to do this, then Quicklisp releases become fractured. Please do not let this into a Quicklisp release.
A tool to automatically add the coding file option has been written. There is no need to contact library authors any further, requesting them to recode their files, as I am confident we can work with their code as it is. The tool can also recode files to UTF-8 or attempt to recode to ISO-8859-1.
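The 'coding' file option here is the Emacs-style file-variable line; presumably the tool prepends something like the following as the first line of each source file:

  ;;; -*- mode: Lisp; coding: utf-8 -*-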
Having libraries in Quicklisp that are sensitive to the source encoding makes such recoding fragile.
Using the ASDF :encoding declaration will likely break the reading of files recoded with this tool, as it is not practical to update the system definitions; so the use of the :encoding declaration is not recommended, and it really is a liability.
With substitutions, and ignoring UTF-8 in comments, the percentage of UTF-8 files required in Quicklisp is below 0.8%, and much of this is concentrated in an even smaller number of releases.
Regards,
Douglas Crosher
> Note that a file having all octets below #x80 does not ensure it is ASCII, just that it does not have any UTF-8-specific codes.
Indeed, but that's a good guess, and it works in practice.
> The default should still be :default, because the file may be in an encoding that the CL implementation is aware of.
I vehemently disagree: my policy is to always favor a deterministic behaviour of either working or breaking everywhere, over working in some places but breaking in other places. This makes debugging things much easier, and is the whole point of an abstraction layer. It's a principle I've strictly adhered to while developing ASDF 2, and will enforce as long as I'm ASDF maintainer. See the ASDF 2 paper: http://common-lisp.net/project/asdf/ilc2010draft.pdf
No one actually wants :default. Everyone knows (or should know) which encoding they are using, and if it's not standard, they should explicitly specify it. Everyone wants to actually use said implicit or explicit encoding, or, if it is not available, a predictable fallback that depends only on the implementation and not on any configurable setting. The only case where :default should ever be used is when it is the implementation's only option.
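For instance, a system definition can pin its encoding explicitly. A sketch, with a made-up system name (with the ASDF of that era, encodings are handled through asdf-encodings, hence the :defsystem-depends-on):

  (asdf:defsystem :my-system
    :defsystem-depends-on (:asdf-encodings)
    :encoding :utf-8                ; explicit, instead of relying on :default
    :components ((:file "lambda-reader")))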
> From the lambda-reader source code:
> ;;; Note that this file uses UTF-8. […]

> Making code dependent on the file encoding is not recommended,
Lambda-reader is not actually dependent on the file encoding; I went to great pains to make it work independently of the encoding. All it depends on is that client libraries either use the same encoding, or otherwise use another encoding that ends up interning symbols using the same lambda character(s) as used by the library. I.e., transcoding the library and/or its clients, so that one uses a lambda from utf-8 and the other a lambda from iso-8859-7, where each file is read with the appropriate external-format, should work just fine. The comment is still valid assuming you use the pristine utf-8 source code.
I updated the comment to make it more painfully clear what the actual constraints are.
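To make the constraint concrete, an illustration using the babel library's octets-to-string (babel is used here only for demonstration): the same two octets decode to one character or two depending on the encoding, and hence to different symbol names.

  ;; CE BB is a single lambda under UTF-8, two characters under Latin-1.
  (babel:octets-to-string
   (coerce #(#xCE #xBB) '(vector (unsigned-byte 8))) :encoding :utf-8)
  ;; => "λ"  (one character)
  (babel:octets-to-string
   (coerce #(#xCE #xBB) '(vector (unsigned-byte 8))) :encoding :latin-1)
  ;; => "Î»" (two characters)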
> and writing a library that requires code that uses it to be in the same encoding is hardly defensible.
All anyone ever requires is that the characters read by client and server match. I'm making no other demand.
And yet, I see no problem with a file including comments that state which encoding it is authored and distributed in. If someone recodes the file, he certainly cannot blame the original author for the file not working anymore, or for its comments being outdated; whoever modifies the file, including by transcoding it, takes responsibility for it.
> If another author decides to do this, then Quicklisp releases become fractured. Please do not let this into a Quicklisp release.
Why would you transcode files in Quicklisp? Then again, if you do, you're taking responsibility for the results. Don't blame authors for what you do.
> A tool to automatically add the coding file option has been written. There is no need to contact library authors any further, requesting them to recode their files, as I am confident we can work with their code as it is. The tool can also recode files to UTF-8 or attempt to recode to ISO-8859-1.
Recoding everything to use UTF-8 is an interesting approach for sure. (Although — will you avoid doing that on MCL-specific files?) What does Zach think of it?
For the sake of people getting their source upstream, I still think it's useful to encourage authors to use UTF-8 everywhere (without BOM).
—♯ƒ • François-René ÐVB Rideau •Reflection&Cybernethics• http://fare.tunes.org
Procrastination is great. It gives me a lot more time to do things that I'm never going to do.