If I run several SBCL processes on different nodes in my compute cluster, it might happen that
two different runs notice that the same file needs to be recompiled (via ASDF),
and they might try to compile it at the same time. What is the best way to prevent this?
You mean that these machines share the same host directory? Interesting.
Yes, the cluster shares some disk, and shares the home directory. And I believe two cores
on the same physical host share /tmp, but I'm not 100% sure about that.
That's an option. It is expensive, though: it means no sharing of fasl
files between hosts. If you have a cluster of 200 machines, that means
200x the disk space.
With regard to the question of efficient reuse of fasl files: this is completely irrelevant for my case. My
code takes hours (10 to 12 hours worst case) to run, but only 20 seconds (or less) to compile. I'm very happy to completely
remove the fasl files and regenerate them before each 10-hour run. (Note to self: I need to double-check that
I do in fact delete the fasl files every time.) Besides, my current flow allows me to simply git-check-in a change and
re-launch the code on the cluster in batch. I don't really want to add an error-prone manual local-build-and-deploy step
if that can be avoided, unless of course there is some great advantage to that approach.
What about instead building your application as an executable and
delivering that to the cluster?
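Something along these lines, as a rough sketch (the system name :my-sim and the entry point
my-sim:main are placeholders, not names from your project):

(require :asdf)
(asdf:load-system :my-sim)            ; compile and load the whole system once
(sb-ext:save-lisp-and-die "my-sim"    ; dump a self-contained executable
                          :toplevel #'my-sim:main
                          :executable t)

You would then copy the resulting binary to the cluster instead of the sources.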
One difficulty with your build-then-deliver suggestion is that my local machine is running macOS, and the cluster is
running Linux. I don't think I can build Linux executables on my Mac.
You can have different ASDF_OUTPUT_TRANSLATIONS or
asdf:*output-translations-parameter*
on each machine, or you can indeed have the user cache depend on
uiop:hostname and more.
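For instance, a host-keyed user cache could look roughly like this (the .cache/lisp-fasl/ layout
below is just an illustration, not something ASDF mandates):

(require :asdf)
;; Put fasls in a per-host subdirectory of the (shared) home directory.
(setf asdf::*user-cache*
      (ensure-directories-exist
       (merge-pathnames (format nil ".cache/lisp-fasl/~A/" (uiop:hostname))
                        (user-homedir-pathname))))

Or, from the outside, point ASDF_OUTPUT_TRANSLATIONS at a per-host cache directory in each job's
environment.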
This is what I’ve ended up doing. And it seems to work. Here is the code
I have inserted into all my scripts.
(require :sb-posix)  ; needed for getuid/getpid
(let ((home (directory-namestring (user-homedir-pathname)))
      (uid (sb-posix:getuid))
      (pid (sb-posix:getpid)))
  ;; Per-user, per-process fasl cache under /tmp, so concurrent runs never collide.
  (setf asdf::*user-cache*
        (ensure-directories-exist (format nil "/tmp~A~D/~D/" home uid pid))))
The Right Thing™ is still to build and test then deploy, rather than
deploy then build.
In response to your suggestion about build-then-deploy: this seems very dangerous and error-prone to me.
For example, what if different hosts want to run the same source code but with different optimization settings?
This is a real possibility, as some of my processes are running with profiling (debug 3) and collecting profiling results,
and others are running super-optimized (speed 3) code to try to find the fastest something-or-other.
I don't even know whether it is possible to create the .asd files so that changing an optimization declaration will trigger
everything depending on it to be recompiled. And if I think I've written my .asd files that way, how would I know
whether they are really correct?
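One workaround I can imagine (untested; the profile argument is just a placeholder I would pass in
from the batch script) is to key the per-process cache on the optimization profile as well, so that
the (debug 3) and (speed 3) fasls can never overwrite each other:

(require :sb-posix)
(defun set-fasl-cache (profile)
  ;; profile is e.g. "debug3" or "speed3"
  (let ((uid (sb-posix:getuid))
        (pid (sb-posix:getpid)))
    (setf asdf::*user-cache*
          (ensure-directories-exist
           (format nil "/tmp/fasl-cache/~D/~A/~D/" uid profile pid)))))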
It is not the case currently, but it may very well be that in the future I want different jobs in the cluster running different
git branches of my code. That would be a nightmare to manage if I try to share fasl files.
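If it ever comes to that, I suppose I could fold the branch name into the cache path too, something
like this (untested sketch):

(let ((branch (string-trim '(#\Newline #\Space)
                           (uiop:run-program '("git" "rev-parse" "--abbrev-ref" "HEAD")
                                             :output :string))))
  (setf asdf::*user-cache*
        (ensure-directories-exist (format nil "/tmp/fasl-cache/~A/" branch))))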
Using Bazel, you might even be able to build in parallel on your cluster.
Bazel sounds interesting, but I don't really see the advantage of building in parallel when it only
takes a few seconds to build but half a day to execute.
I still don't understand why your use case uses deploy-then-build
rather than build-then-deploy.