Douglas Crosher:
I would ask you to reconsider the impact of releasing ASDF with the :encoding declaration bundled and of recommending its use.
I'd like to add :encoding now, if only because, if we are ever to use it, it must first ship in an ASDF that has made its way to all or most implementations before libraries can consider it universal enough to rely on. Whereas if we don't end up using it, it's easier to remove later.
But you're right: I should not actively encourage its use for now, except for people who know what they are doing and are ready to deal with stricter dependencies and possible future change.
As an example use of ASDF with encodings support, I'm using asdf-encodings with lambda-reader, which I just hacked to also work in 8-bit mode: one system loads it as UTF-8, and another loads it as Latin-1 (for testing purposes on SBCL). Punning the same file under two encodings is one reason I like the :encoding feature. (As a bonus, lambda-reader-8bit.asd includes code to define a file that gets loaded but not compiled, a feature asked for several times on this mailing list.)
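A minimal sketch of what such a pair of system definitions might look like (system and file names here are illustrative, not lambda-reader's actual definitions; non-UTF-8 encodings presumably require asdf-encodings to be loaded):

```lisp
;; Load the same source file under two different encodings.
;; "my-reader" and "reader" are hypothetical names.
(defsystem "my-reader"
  :encoding :utf-8                ; whole system read as UTF-8
  :components ((:file "reader")))

(defsystem "my-reader-8bit"
  :components ((:file "reader"
                :encoding :latin-1))) ; per-component override, for 8-bit testing
```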
If ASDF is released with the :encoding system definition declaration and further offered as the only solution to authors of portable UTF-8 code then some will no doubt start using this and the :encoding declarations will be in system definition files for tools to deal with and to be supported in future.
That's one of the reasons I wanted UTF-8 to be a default that didn't require any :encoding specification. But if it's going to break things for users, then it's not ideal either. And at least 6 libraries in Quicklisp haven't given me feedback since I warned about this issue. I will give them a few months' grace before I change the default and consider it OK to break them, since they're unsupported.
The encoding file option […]
I suppose you mean autodetection based on file contents including any Emacs-style declaration.
Indeed it might be the solution... but it forces every non-ASCII file to have a header, unless we have a default such as UTF-8. Also, autodetection adds a good few hundred lines to ASDF, especially if we want to do it right, i.e. in a way fully compatible with Emacs.
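For concreteness, the Emacs-style declaration in question is the file-variables cookie on the first line of a source file (or the second line, after a shebang), e.g.:

```lisp
;;; -*- mode: Lisp; coding: utf-8 -*-
```

Emacs also honors a "Local Variables:" block near the end of the file; full compatibility means handling both forms.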
Also, the cost of the component :encoding feature is ten to twenty lines of code on top of a mere encoding-autodetection hook (itself a few tens of lines), which frankly is tiny compared to full autodetection support.
It solves the problem of the system definition file's encoding, which the :encoding declaration can't. It can be used by automated recoding tools, which are badly needed; there is a path for Quicklisp to automatically recode projects to suit the CL implementation. It can be used by CL implementations for load and compile-file, and by editors. This seems like the best path for solving the problem.
I like this approach, but it requires more third-party coding. One of my principles in ASDF is to enable people to do things without having to wait on other people, i.e. to make coupling looser. The :encoding feature spares users from having to wait for quality autodetection and Quicklisp support.
Once the encoding file option is implemented, the :encoding declaration would seem to just be a liability.
It also provides a slight performance boost by not requiring autodetection. And even Emacs gives precedence to filename-based encoding detection. Additionally, it's much simpler to support and the code to support it is already there.
Code in asdf-encodings doesn't support encoding autodetection yet. I'm sorry, but I think that the code snippets you and pjb posted, while very good starts, are not 100% solutions. A 100% solution would be 100% Emacs-compatible.
Code that already uses the :encoding declaration will not assist other tools that look for the encoding file option.
It provides an API for querying the encoding used by a component.
For example, if some Quicklisp projects use an :encoding option, then recoding their source becomes more problematic, or the tools much more complex.
Why would you want to do that? And if you do, is it that difficult to edit the .asd file? Or is your problem that .asd files aren't declarative enough? In the latter case, we agree; but then you're welcome to help with XCVB.
Further, there is the problem of what to do in the future if there is a conflict between the :encoding declaration and an encoding file option.
Just like in Emacs: the external declaration or manual setting takes precedence.
What if someone does recode files and this adds a coding file option but does not track down and update :encoding declarations in scope?
That's a bug. And what if someone writes a wrong coding declaration, or keeps a stale declaration after recoding? That's a bug, too. I don't see this as a danger looming over developers, especially since providing deterministic compilation behavior means that developers will detect such bugs early and portably, as opposed to the "it works for me but not for you" hell of the current default behavior. I see an evolving autodetection algorithm as more version hell, and a non-evolving autodetection algorithm as a probable buggy hurdle. Both enabling developers and making them responsible for their choices, while giving them predictable, deterministic feedback: I call that progress.
For these reasons I think the :encoding declaration is a dead end and a liability, and that it should not be released, and that you should not be encouraging its use.
Sorry, I'm not convinced.
I suggest that the encoding file option is a good plan, that this be communicated to users, and that the social solution of having everyone use UTF-8 be toned down.
Even with autodetection, we should still encourage UTF-8, since it's one of the few encodings that is universally supported by all modern Lisp implementations. For instance, LispWorks on Unix supports very few encodings. If you want your code to be maximally portable, please use UTF-8.
However, keeping the encodings support separate, even temporarily, has several advantages:
Sure, it's a chunk of code that does not seem to really belong in ASDF.
Thanks for agreeing on that.
Are you sure that Quicklisp does not need any support bundled? Perhaps it can bootstrap from just asdf.lisp, keep itself ASCII-clean, build itself, then download and install asdf-encodings, which could be ASCII-clean too. Sounds like a good plan, if it can all work.
As long as we don't depend on the encoding of .asd files themselves, it's better for systems that depend on non-UTF-8 encodings to depend on asdf-encodings.
As for the encoding of .asd files, I propose that we should standardize on UTF-8 (US-ASCII being an acceptable subset of it).
Regarding the hooks, might it be better for them to be lists of functions to call in turn until successful, so that multiple projects can add hooks and still work together?
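To make the suggestion concrete, such a chained hook might look like this (a hypothetical sketch; these names are invented, not ASDF's actual API):

```lisp
(defvar *encoding-hooks* '()
  "Functions of one argument (a pathname), each returning an encoding
keyword, or NIL to defer to the next hook in the list.")

(defun detect-encoding (pathname)
  "Try each hook in turn; fall back to a default when all decline."
  (or (some (lambda (hook) (funcall hook pathname)) *encoding-hooks*)
      :utf-8))
```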
Sorry, I don't see how this could possibly work. I'd rather have an API that makes it clear that someone must be in charge, than an API that makes a mush of responsibility and is an invitation to catastrophic interactions.
- it allows this particular fast moving code to evolve and be refined
without burdening asdf, and without having to cast in stone design choices made before we fully understand the issues.
The same could be said of the :encoding declaration. You may regret releasing it and having to maintain it, deal with conflicts with future file-option solutions, deal with authors who keep using it, and deal with tools that don't work with it!
I understand that, but that's a risk I'm willing to take as ASDF maintainer. It's less than twenty lines of code, and only active package maintainers are going to use this feature in the next few months, so if I change my mind before next year, I expect I'll be able to back off if I really want to, and have those same active maintainers follow me (though of course not without my deservedly losing some of their future good will).
- it keeps ASDF small for most people, yet allows the extension code
to grow big.
Agreed, and I have been trying to strip it down to a bare minimum that could be bundled, and even this is in a hook that could be replaced when asdf-encodings is loaded.
I believe that "always return the default" is that bare minimum, with asdf:*default-encoding* currently being :default, but hopefully :utf-8 in the future (future ouch: when we change it, will it be a defparameter or a defvar?).
As for the specific code you propose,
- I asked on #emacs for pointers to how Emacs identifies coding.
I documented the results in comments in asdf-encodings. The Emacs way differs from your code in various ways. If we are going that way, is there any reason not to "just adopt" the Emacs code?
It's written in C, and is a big chunk of code that puts a lot more weight on auto-detection. My code does the bare minimum to read the file options, and it has been tested on every encoding supported by my system that also supports the characters needed for CL code (EBCDIC excluded, though even those work with another 40 lines of code).
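As a rough illustration of what "the bare minimum to read the file options" involves (a hypothetical sketch, not the code under discussion; it ignores BOMs, second-line cookies after a shebang, and the end-of-file "Local Variables:" block):

```lisp
(defun sketch-read-coding (pathname)
  "Return the coding: value from a first-line -*- ... -*- cookie, or NIL."
  (with-open-file (s pathname)
    (let* ((line (or (read-line s nil) ""))
           (start (search "-*-" line))
           (end (and start (search "-*-" line :start2 (+ start 3)))))
      (when end
        (let* ((vars (subseq line (+ start 3) end))   ; text between the -*- markers
               (pos (search "coding:" vars)))
          (when pos
            ;; value runs to the next semicolon or the end of the cookie
            (string-trim " "
                         (subseq vars (+ pos 7)
                                 (position #\; vars :start (+ pos 7))))))))))
```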
I suppose I should commit something based on your code for now. But that raises the question: won't we want to "improve" the algorithm to make it more like Emacs? If the algorithm changes, won't that create "interesting" side effects and changes in behavior that bite someone in the back?
- Does it make sense for a file to have a UTF-16LE header that
specifies coding: koi8-r ? I don't think so.
Yes, it is inconsistent, but it may be better to pass on the file option anyway, so that the error is detected. Only a few cases are detected from the BOM. Keep in mind that a BOM can be added when not appropriate, and most decoders will just ignore it and keep working, so reading and returning the file option seems the best path.
Or a pun file that in UTF-16BE says it's UTF-16LE, and the other way around (or a longer circuit)? I think your algorithm tries both too hard (as in this case) and too little (as in cases where Emacs finds a coding and your code doesn't).
I am not aware of any cases where my code fails to read the file options, and there is a big set of tests available to confirm this.
Keep in mind that detection is not 100% reliable, and there are often multiple encodings that match a file. One concern is that if people start trusting auto-detection and not adding a file option, then the mechanism becomes less reliable.
- another tool may not have the same detection algorithm
or make different assumptions.
Exactly the reason why I only like detection so much, with or without declaration.
Reading the file option is reliable, and would likely remain the first thing to check.
- All in all that doesn't mean your code is bad,
but that probably means we should experiment with it and tweak it, before we declare ourselves satisfied with burning it into ASDF (which is somewhat less easy to upgrade than a casual library).
I am not suggesting releasing the code, just making the progress available. The reading of the file options has been well tested, though. Other areas needing work are the translation of external-formats for each CL implementation, and compatibility with the Emacs codings.
OK. Well, at this point I'm accepting patches to asdf-encodings.
—♯ƒ • François-René ÐVB Rideau • Reflection & Cybernethics • http://fare.tunes.org
Malthus was right. It's hard to see how the solar system could support much more than 10^28 people or the universe more than 10^50. — John McCarthy