I've finally been motivated to do some maintenance on my web server, and one of the things that has been bothering me for a long time is some relatively serious instability in my production-ish hunchentoot server. I had been using a pre-1.0-but-post-last-release development snapshot, and the main problem was that after a week or so of use my hunchentoot instance would fairly reliably crash, although it wasn't clear exactly when it would crash. After a time I began to suspect that the main culprit was my use of sb-ext:run-program in my hunchentoot-cgi stuff. hunchentoot-cgi is a hunchentoot handler and associated glue that allows hunchentoot to call CGI programs written in, say, Perl. I wrote it primarily so I could put gitweb behind my hunchentoot server directly, instead of using apache or lighttpd or whatever as a front-end to hunchentoot. Like most of my code, it is, at least in the beginning, SBCL-specific, and it uses sb-ext:run-program to launch the CGI script. So far, so good, but then I kept getting these crashes and it was driving me crazy.
Today I figured out that if I manually close the process, bad things happen a lot less frequently. This may seem obvious, but if after calling sb-ext:run-program I make an unwind-protect block and call sb-ext:process-wait and sb-ext:process-close, things get a lot better. I guess I should have suspected that something like this might have been the problem when I had errors such as:
[2009-06-10 20:37:03 [ERROR]] The value 1026 is not of type (MOD 1025).
in the logs, but I never put two and two together.
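For anyone wanting to see the shape of the fix, here is a minimal sketch of the pattern described above. It is not the actual hunchentoot-cgi code: RUN-AND-COLLECT is a hypothetical helper name, and the sketch assumes the child's output is simply read to end-of-file before the cleanup forms run.

```lisp
;; Sketch only: RUN-AND-COLLECT is a hypothetical helper, not the
;; code that hunchentoot-cgi really uses.
(defun run-and-collect (program args)
  "Run PROGRAM (a full pathname string) with ARGS without waiting,
read its standard output to EOF, and return it as a string."
  (let ((process (sb-ext:run-program program args
                                     :wait nil
                                     :output :stream)))
    (unwind-protect
         (with-output-to-string (out)
           (let ((in (sb-ext:process-output process)))
             (loop for line = (read-line in nil nil)
                   while line
                   do (write-line line out))))
      ;; Cleanup forms: these run on normal return and on any
      ;; non-local exit, so the process's streams always get closed.
      (sb-ext:process-wait process)
      (sb-ext:process-close process))))
```

Something like (run-and-collect "/bin/echo" '("hello")) should then return the child's output as a string. One caveat with this sketch: if the body is abandoned while the child is still writing, sb-ext:process-wait may block, so a more defensive version might want to signal the child with sb-ext:process-kill first.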
I can now bang on the server (calling CGIs) pretty hard and it generally responds well and hasn't crashed on me yet, although it hasn't been all that long. Nevertheless, a thousand or so hits from ab with the old way of doing things would bring down the server, and I can go for multiple thousands of hits on the new server with no noticeable problems.
I should caveat this by saying that I was seeing occasional problems (socket timeouts, writing to a closed pipe, etc.) with the 1.0 (or thereabouts, whatever is in luis' repo) release, but upgrading to the latest dev version from the ediware svn repo seems to have fixed even those problems. Well, I still occasionally see errors like the following:
[2009-06-10 22:35:25 [ERROR]] Couldn't write to #<SB-SYS:FD-STREAM for "a socket" {5A0E5BA1}>: Broken pipe
[2009-06-10 22:35:26 [ERROR]] Error while processing connection: Couldn't write to #<SB-SYS:FD-STREAM for "a socket" {5A0E5BA1}>: Broken pipe
but they neither bring down the server nor cause ab to freak out with a connection reset by peer error, which is nice.
So I guess this is all a long-winded way of saying:
1) If you're going to be using sb-ext:run-program :wait nil and reading from the process' stream, make sure you sb-ext:process-close the process.
2) thanks to the hunchentoot team for the 1.0 release and beyond!
While I'm at it, I'll put in another plug for my hunchentoot add-on modules, hunchentoot-cgi, hunchentoot-auth and hunchentoot-vhost, for calling CGIs from hunchentoot, for providing (more of) an infrastructure for user authentication, and for trivially handling multiple "virtual hosts" with hunchentoot, respectively:
http://git.cyrusharmon.org/cgi-bin/gitweb.cgi?p=hunchentoot-cgi.git
http://git.cyrusharmon.org/cgi-bin/gitweb.cgi?p=hunchentoot-auth.git
http://git.cyrusharmon.org/cgi-bin/gitweb.cgi?p=hunchentoot-vhost.git
All of which should now be working with hunchentoot 1.0 (and beyond).
Thanks again to Edi and Hans.
Cyrus
Cyrus Harmon ch-tbnl@bobobeach.com writes:
Today I figured out that if I manually close the process, bad things happen a lot less frequently. This may seem obvious,
Indeed - and I have been bitten by this too (I figured it out myself some months ago). So, although obvious, it could be very helpful for some people that you report it here.
but if after calling sb-ext:run-program I make an unwind-protect block and call sb-ext:process-wait and sb-ext:process-close, things get a lot better. I guess I should have suspected that something like this might have been the problem when I had errors such as:
[2009-06-10 20:37:03 [ERROR]] The value 1026 is not of type (MOD 1025).
How do you do the whole construct? I must admit that I did not use UNWIND-PROTECT for my fix. Probably the right thing is not to use run-program in such cases at all, but to introduce a wrapper macro for SBCL (which makes this more appropriate for the SBCL list, to which I am CCing this as well).
For example, see the Allegro CL documentation:
"The :osi module (see Operating System Interface Functionality in os-interface.htm) has these new operators relating to running subprocesses: the function command-output and the macros with-command-output and with-command-io. They are higher-level than run-shell-command and shell and are now recommended when the interaction with the subprocess requires input or produces output that must be captured."
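To make the idea concrete, a wrapper macro for SBCL might look roughly like the sketch below. WITH-PROGRAM-OUTPUT is a hypothetical name chosen for illustration; it is neither part of SBCL nor a port of the Allegro CL macros, and it assumes the body reads the child's output to the end before returning.

```lisp
;; Sketch only: WITH-PROGRAM-OUTPUT is a hypothetical macro, not an
;; existing SBCL or Allegro CL operator.
(defmacro with-program-output ((stream program args) &body body)
  "Run PROGRAM with ARGS without waiting, bind STREAM to its standard
output while BODY runs, then always wait for and close the process."
  (let ((process (gensym "PROCESS")))
    `(let ((,process (sb-ext:run-program ,program ,args
                                         :wait nil
                                         :output :stream)))
       (unwind-protect
            (let ((,stream (sb-ext:process-output ,process)))
              ,@body)
         ;; The cleanup lives in the macro so callers cannot forget it.
         (sb-ext:process-wait ,process)
         (sb-ext:process-close ,process)))))
```

A call such as (with-program-output (s "/bin/echo" '("hi")) (read-line s)) would then read one line from the child and leave no open process streams behind.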
Nicolas