[Note to Robert: plenty of implementation notes in this email; you may want to save it to include the data in the TODO file and/or the manual.]
Dear James,
in case I didn't tell you, it's great to have you back in asdf-devel. You kind of left some time after I didn't take your suggestion to split ASDF into many files (because I felt the concatenation needed to be done by ASDF, or bust); years afterwards, I took your suggestion (after it became trivial for ASDF to handle the concatenation), and there you're back. I can't swear that it was a good idea to wait that long before to split ASDF, but in any case, the result of the splitting is immensely satisfying, when combined with the one-package-per-file approach of quick-build, that ensures some well-ordering of internal dependencies. Thanks for this and your many other suggestions and contributions (I admit I haven't fully grokked you asdf-pathname-test thing yet, despite massively refactoring it already — everytime I debug something, I have to try reunderstand what it's trying to do.)
i have understood from the ensuing discussion that,
the deprecated (2.*) version provided just one implementation for an effective operation - that is, for the behavior of perform given operation and system graph instances.
this effective operation incorporated depth-first graph traversal and post-order operator application.
ASDF is still traversing the graph depth-first for planning, then performing the planned actions in post-traversal order: that's just what it means to traverse a dependency graph.
The problem is that the downward and sideway dependencies were not implemented in an object-oriented way through COMPONENT-DEPENDS-ON, but baked into the infamous TRAVERSE algorithm inherited from ASDF 1. As we fixed bugs during the ASDF 2 days, Robert Goldman and I incrementally refactored the TRAVERSE algorithm into smaller functions until we finally understood what the thing actually does. Then trying to solve the one or two residual bugs that remained, I eventually figured what it instead *should have done* and implemented it, through many and many iterations, until it was just right. Now, with ASDF 3, the dependencies of an ACTION (operation + component) is wholly specified in an objected way through methods on COMPONENT-DEPENDS-ON (misnomer inherited from ASDF 1) — it should have been named ACTION-DEPENDS-ON instead, except that the concept of action was implicit in ASDF 1 and there was no name for it.
The problem was that if you traverse dependencies the ASDF 1 way, many of the actual dependencies end up missing: if module or system B depends on module or system A, then compiling and loading files in B should depend on loading and compiling files in A. People didn't usually notice in a build from scratch, because these dependencies were implicitly in the traversal order of the algorithm and serial nature of the build. But that could cause serious bugs in an incremental or parallel build: then, if files in A were recompiled, files in B wouldn't be. In practice, this means that incremental builds would sometimes fail in strange and mysterious ways, and people would then have to remove their build output files and restart from scratch. To get reliability in building large systems (e.g. at ITA), we needed to rebuilding things from scratch every time.
Robert Goldman, who had tried to fix the cross-system dependency problem, suggested there were such missing dependencies, but I didn't understand. Andreas Fuchs' parallel build (POIU) had dealt with it in an clever undocumented kludge, but I didn't understand either.
While implementing timestamp propagation (big bug #1), I had to massively rewrite the TRAVERSE algorithm, and rewrite POIU to fit the new ASDF internals, which is always a good test of ASDF extensibility. I tried to plainly remove the POIU kluge that I didn't understand (accumulating a huge list of dependencies from transitive parents), and of course the parallel build started failing. Then I realized that these dependencies were needed, and that the correct way to handle this situation without polynomial slowdown was to introduce new nodes: LOAD-OP and COMPILE-OP both need to depend on a node that corresponds to loading (via LOAD-OP) the parent's dependencies; I created a new operation called PARENT-OP and eventually renamed to PARENT-LOAD-OP then PREPARE-OP. Now, PREPARE-OP has to propagate upward, and neither downward nor sideway (or rather, sideway, it propagates to a LOAD-OP). And so I initially tried to kluge TRAVERSE to do that, until it became apparent that the TRAVERSE algorithm could be massively simplified by just moving all that stuff to COMPONENT-DEPENDS-ON, and then instead of an ununderstandable spaghetti mess of a TRAVERSE algorithm, there was a clean small jewel of an algorithm, and a much more versatile object model than before.
Then, I had to also finally understand the subtle difference between IN-ORDER-TO and DO-FIRST, that once again Andreas Fuchs had resolved in a clever undocumented kludge in POIU that I didn't understand and tried to remove, to my failure. The real issue is whether a dependency matters for its side effects in the file system or in the current image. In the former case (typically, compiling a lisp file to FASL), its associated timestamp is that of the most recent output file(s), it need not be redone if already done (i.e. its oldest output file later than its latest dependencies). In the latter case (typically, load an existing FASL), it must be redone in the current image if never done even if already done before in a previous image, and the timestamp associated to it must be that of its most recent file dependency, and not the wall clock time at time of loading. And the trick is that for correctness, we may traverse every action node twice: the first time, we only needed it to have happened once, in another image that created a FASL that we depend on; the second time, we realize we need it in the current image. That is why LOAD-OP depends directly on PREPARE-OP for its side-effects in the current image, and cannot "just" depend on it indirectly through COMPILE-OP, because the latter needs not have happened in the current image, and thus neither do its dependencies, even in-image dependencies (they would have been in-image for the previous image that did the COMPILE-OP). Once again, the DO-FIRST vs IN-ORDER-TO behavior of ASDF 1 was kind of working in the simple cases, but could not possibly handle more complex cases including timestamp propagation.
- that version is to be superseded with one (3.1.0.*) which binds the alternative traversal and application implementations to OPERATION specializations, where each specialization implements a specific combination of traversal and application. (the *-OPERATION classes)
That change ALREADY HAPPENED in ASDF 2.27 (a.k.a. ASDF 3 beta), and at that time, I got all OPERATION extensions in Quicklisp fixed. The problem is that some extensions are not in Quicklisp, and Robert was badly bitten, and so wants a solution to save other people from being bitten as badly.
- the 3.1.0.* version provides no implementation for the OPERATION class.
Starting with ASDF 2.27, OPERATION exists and has a trivial implementation that does not have any implicit dependencies, either through baked in TRAVERSE magic or COMPONENT-DEPENDS-ON methods. It is merely the base class for OPERATIONs, as many existing extensions assume that we don't want to break.
- the observed failure mode has been that in systems for which the definitions include an OPERATION specialization, the intended actions do not occur because no dependencies are observed.
Most extensions to OPERATION did not want the implicit DOWNWARD and SIDEWAY dependencies, but a few did. These extensions were broken by ASDF 3 (aka ASDF 2.27), and had to be fixed. I could only fix those in Quicklisp and at work. I assumed that and an annoucement in asdf-devel would be enough. I was wrong.
The problem is that for backward-compatibility, OPERATION has somehow to be *both* the abstract base class for every operation including those that don't propagate down and left, *and* the old default behavior of those operations that did.
We thought that was impossible; I had let down people with the second assumption with silent subtle breakage. Robert would rather break it loudly for everyone.
- in order to rectify these failures, it is proposed to reinstate the depth-first+post-order mechanism for the OPERATION class in addition to the alternative mechanisms for specializations of that class.
It is proposed to add DOWNWARD and SIDEWAY dependencies magically to operation classes that do not inherit from any of DOWNWARD-OPERATION, SIDEWAY-OPERATION, UPWARD-OPERATION and NON-PROPAGATING-OPERATION, as an emulated case of "non-monotonic inheritance". Ugly, but will work.
as yet further alternatives, it would be possible to implement the combinations of propagation and application by delegating the implementation to a collection of singleton instances. in this case the alternatives could be declared as the default initialization value(s) for the respective (ABSTRACT-)OPERATION subclass. this would no doubt require broader changes to the asdf code base, but would offer the advantage, that the combination could have a degree of independence from the class hierarchy. for the case at hand, it would then be more natural to declare the default behavior for OPERATION than the approach of duplicating the methods of other specializations.
Operations in ASDF 3 are already kind of singleton instances, via MAKE-OPERATION and FIND-OPERATION. However, since old user code here and there may still be using MAKE-INSTANCE, the unicity isn't really guaranteed. That doesn't matter too much so far, because ASDF equates all objects of the same class in its method COMPONENT-OPERATION-TIME, and it is therefore unsafe to ever perform two actions with the same component and semantically different operations of the same class.
Note to Robert: here is how you could do it if you wanted to allow semantically distinct operations of the same class. You'd need to have a protocol to canonicalize them in the *OPERATIONS* memoization table, not by class name, but by CONS of the class name and some CANONICAL-OPERATION-INITARGS. The latter would be a generic function called on the initargs, where any parasite initargs such as FORCE and FORCE-NOT have been removed, since they below to the PLAN, not the OPERATION: the OPERATE protocol would be refined to explicit split arguments to be passed to MAKE-PLAN or to MAKE-OPERATION. The default method for CANONICAL-OPERATION-INITARGS would SORT (a plist->alist of) the initargs, and that would replace the current ORIGINAL-INITARGS slot. For this scheme to work even in presence of undisciplined users using MAKE-INSTANCE on an operation class, the OPERATION class would have an extra slot EFFECTIVE-OPERATION, uninitialized by default (nil or unbound), whose accessor initializes it if it's uninitialized, by looking up a canonical instance in *OPERATIONS*, and if unfound registering the current operation as canonical. Then, each component's COMPONENT-OPERATION-TIME hash-table would be indexed by canonicalized operation object rather than by operation class, and POIU would have to be changed accordingly. Of course, this entire cleanup is incompatible with how SWANK and GBBopen currently abuse slots of operation, so these would have to be fixed first. And that's why I didn't do it. It looks like SWANK can be fixed soon, though, so we'll see.
PS: There was another question asked at the end of my walkthrough, for which I now remember the good answer I didn't have then. Yes, you *could* walk the data structure in two passes and only care to accumulate timestamps and planned flag only in the second pass; but then you would need to remember all the dependency arcs, instead of just being able to have them be implicit in the traversal; That's what POIU does. That would have simplified the 36-line TRAVERSE-ACTION function, but would have necessitate pulling in all the graph data structures from POIU, which would make the code overall code bigger and more complex. As for the ~50 line COMPUTE-ACTION-STAMP method, it's the other meaty part of the algorithm, generalizing and fixing ASDF 1's old OPERATION-DONE-P to correctly handle timestamp propagation, and parametrized by a plan object and a just-done flag.
There there. I hope I've explained all the important details before I go away from the project. Do not hesitate to ask more questions.
—♯ƒ • François-René ÐVB Rideau •Reflection&Cybernethics• http://fare.tunes.org So that there be reality, there must be an observer. "I am, therefore someone thinks." — Faré