I wonder whether anybody has trodden this path yet.
I have been using CL on my school's cluster. I usually develop within Slime locally, then move the code to the cluster's master node and do a trial run over a remote Slime session; once everything works, I ditch Slime and invoke my programs on the worker nodes using a bash or Lisp script (yes, we are running without a queue; crazy, huh?). I would very much like to dispatch the jobs to the worker nodes from within Slime. More to the point, I would like to be able to start jobs on the worker nodes from my Slime connection to the master node. Now, of course I can connect to each node using a tunnel and start the jobs that way, but I want an automated interface as sketched below.
I was thinking about developing something that would work for me and perhaps others. I just wanted to put this out there first to make sure it hasn't already been done, and to see whether there are any suggestions or comments on an undertaking like this. Basically, all I would do is define a CL client for Swank and use it to connect, from the one instance on the master, to worker Lisps running on each worker node. Then, if I want to send a job to the first node, all I have to do is send some code over the Swank connection for the worker to evaluate and return the value.

I also thought I'd write a kind of forwarder for the Swank connection so I can jump to a worker node from the master: the master forwards all data down to the worker and returns anything coming from the worker back up its connection to Slime. (This could be arbitrarily deep; it is analogous to jumping from server to server using ssh, but without the security.)

I am fairly convinced that this will also work (sans persistent worker Lisps) on any cluster with a queuing system that allows interactive jobs (like the Torque/Moab combo that we use on another cluster here), and probably in an EC2-like cloud environment as well. I'm just saying that it is possible that this could be generally useful for other people too.
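To make the idea concrete, here is a rough sketch of the dispatch side. This is hypothetical code, not something I have written yet: it assumes usocket for the TCP connection, and it relies on the fact that Swank frames each message as a six-digit hexadecimal payload length followed by the printed form. A real client would also need to read back and decode the `:return` reply; that part is elided here.

```lisp
;; Hypothetical sketch: ask a worker's Swank server to evaluate a form.
;; Function names here (SWANK-SEND, DISPATCH-TO-WORKER) are made up.

(defun swank-send (stream form)
  "Write FORM to STREAM using Swank's length-prefixed framing."
  (let ((payload (with-standard-io-syntax
                   (prin1-to-string form))))
    ;; Six hex digits of payload length, then the payload itself.
    (format stream "~6,'0x~a" (length payload) payload)
    (finish-output stream)))

(defun dispatch-to-worker (host port form)
  "Send FORM to the Swank server at HOST:PORT for evaluation."
  (let* ((socket (usocket:socket-connect host port))
         (stream (usocket:socket-stream socket)))
    (unwind-protect
         ;; :emacs-rex is the RPC Swank expects: (form package thread id).
         (swank-send stream
                     `(:emacs-rex (swank:eval-and-grab-output
                                   ,(prin1-to-string form))
                                  "COMMON-LISP-USER" t 1))
      ;; A real client would loop reading replies before closing.
      (usocket:socket-close socket))))
```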
Almost all of this would be on the Swank side (and thank goodness because I feel like a fish out of water in ELisp). However, I have already had trouble understanding the Slime message protocol. I'll figure it out eventually, but are there any good internals guides for Slime/Swank?
Any advice is appreciated,
Zach KS
* Zach Kost-Smith [2011-10-19 16:20] writes:
> I wonder whether anybody has trodden this path yet.
> I have been using CL on my school's cluster. I usually develop within Slime locally, then move the code to the cluster's master node and do a trial run over a remote Slime session; once everything works, I ditch Slime and invoke my programs on the worker nodes using a bash or Lisp script (yes, we are running without a queue; crazy, huh?). I would very much like to dispatch the jobs to the worker nodes from within Slime. More to the point, I would like to be able to start jobs on the worker nodes from my Slime connection to the master node. Now, of course I can connect to each node using a tunnel and start the jobs that way, but I want an automated interface as sketched below.
It would certainly be nice to use Lisp/Slime on clusters. I haven't any real experience with this, but it seems to me that most clusters use some middleware like Torque to manage/configure nodes and job queues. So is it necessary/desirable to interact with the middleware?
> [...]
> Almost all of this would be on the Swank side (and thank goodness, because I feel like a fish out of water in ELisp). However, I have already had trouble understanding the Slime message protocol. I'll figure it out eventually, but are there any good internals guides for Slime/Swank?
Not really; there are some comments in slime.el that might be useful. Other than that, the *slime-events* buffer contains a log of the messages between Emacs <-> Lisp, which at least provides plenty of examples of real interactions. Remote debugging is the most complicated issue, but I guess you don't need that initially.
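For example, a typical request/response pair as it appears in *slime-events* looks something like this (illustrative values; on the wire each message is additionally prefixed with a six-digit hex length of the payload):

```lisp
(:emacs-rex (swank:connection-info) "COMMON-LISP-USER" t 1)
(:return (:ok (:pid 1234 :lisp-implementation (...) ...)) 1)
```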
Helmut
I'm using Slime's Swank protocol to do master/slave style processing on machines in a Google data center. I start up a master Lisp and a bunch of worker Lisps, all of them running swank servers. I ssh into the master Lisp and set up an Emacs/Slime connection to it. I run code on the master that farms out s-expressions to worker machines for evaluation using the Swank protocol. I have modified Slime slightly so that debugger breakpoints in the workers are forwarded through the master Lisp back to my Emacs/Slime session, so I can debug problems on both the master and the workers.
The code I'm using has been released as open source by Google:
http://github.com/brown/swank-client
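A minimal use of the client looks like the following, adapted from the repository's README (the hostname and port are placeholders for wherever a worker's Swank server is listening):

```lisp
;; Connect to a remote Swank server, evaluate a form there, and
;; return the result; the connection is closed on exit.
(swank-client:with-slime-connection (connection "worker1" 4005)
  (swank-client:slime-eval '(+ 1 2) connection))  ; => 3
```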
In addition to the Swank client code, I've patched SBCL to print specialized arrays readably, which is important if remote evaluation requests include them. That patch is in an SBCL bug report:
https://bugs.launchpad.net/sbcl/+bug/803665
Also, some SBCL strings are of type base-string and cannot be printed readably, so I modified slime/swank-rpc.lisp to fall back to non-readable printing when SBCL's print-not-readable condition is signaled.
Most important, however, is a small change to slime/swank.lisp that lets you debug problems on the remote worker Lisps where you are evaluating expressions. If something goes wrong, Swank on the remote worker sends a message back to the master asking it to pop up a debugger window. My change causes Swank on the master to forward this message to Emacs/Slime so that the problem can be debugged. You can see this code in action if you evaluate something like:
(slime-eval '(error "hello world") connection)
Anyway, I've included my patches for slime/swank.lisp and slime/swank-rpc.lisp as an attachment.
bob