Hi Hans,
Just to let you know, the server did not have any further problem after this fix.
So it looks like everything is running smoothly. I'm going to wait a few more days to make sure everything's stable and make a small announce on the ML regarding my software in case it's interesting to someone.
Thanks for your help solving this issue. Mathieu.
2012/8/16 Mathieu Lemoine mlemoine@mentel.com
Hi Hans,
Yes, that is the problem.
Here is the result of my swank session (*cheshire* being the variable holding my acceptor):
CL-USER> (hunchentoot::taskmaster-max-thread-count
(hunchentoot::acceptor-taskmaster *cheshire*)) 100 CL-USER> (hunchentoot::taskmaster-request-count (hunchentoot::acceptor-taskmaster *cheshire*)) 100 CL-USER> (hunchentoot::taskmaster-max-accept-count (hunchentoot::acceptor-taskmaster *cheshire*)) 120 CL-USER> (setf (hunchentoot::taskmaster-request-count (hunchentoot::acceptor-taskmaster *cheshire*)) 0) 0 CL-USER> (hunchentoot::note-free-connection (hunchentoot::acceptor-taskmaster *cheshire*)) 0
After that, the server resumed serving requests without any further problems.
What is weird, however, is that I never have the "Can't handle a new request, too many request threads already" error message in either of my logs.
This may be a long shot, but by reading handle-incoming-connection ( https://github.com/edicl/hunchentoot/blob/master/taskmaster.lisp#L288) and create-request-handler-thread ( https://github.com/edicl/hunchentoot/blob/master/taskmaster.lisp#L355), I think that if an error arises while creating a thread (such as the "getpeername": ENOTCONN [...] message), the counter is not decremented because the thread is never started (thus doesn't reach the dynamic scope of the unwind-protect).
I sent a pull request with the patch for this idea and will keep you posted if that solved the bug or not. I should be fixed by tomorrow since the bug occurs after roughly 9 hours of operations.
Thanks for your help.
2012/8/15 Hans Hübner hans.huebner@gmail.com
Hi Mathieu,
you may want write a log entry in the DO clause of the loop in this function:
https://github.com/edicl/hunchentoot/blob/master/taskmaster.lisp#L250
It could be that somehow, you're running out of worker threads, but the code fails to correctly decrement the free worker count. You can also inspect request-count slot of the taskmaster when you see the hanging server again.
Keep us posted about what you find. -Hans
On Wed, Aug 15, 2012 at 6:02 PM, Mathieu Lemoine mlemoine@mentel.com wrote:
Hi Hans,
I though this was the problem because of the error message I find in my *standard-error* log: Error while creating worker thread for new incoming connection: Socket
error
in "getpeername": ENOTCONN (Transport endpoint is not connected) However, I noticed reading your patches, this is the error message introduced by the patch. Since the error message is present many times, I think your code is
working,
otherwise the acceptor would crash instantly.
Currently, the symptoms I saw are these:
- Timeout when trying to reach port 80 (on which Hunchentoot is
listening)
- No error message in either *standard-error* nor *standard-output* logs
- Only one thread (hunchentoot listener?) running in addition to the
main
process thread (as shown by htop)
This problem occurs in production environment after a few hours of operations. I restarted it once again with a swank server this time. Next time the problem occurs, I will try to dig in it to gather more informations on the state of my software.
I still don't know what is causing the problem or how to reproduce it,
but I
will try to check it up and keep you posted.
Thanks for your time and help. Mathieu.
2012/8/15 Hans Hübner hans.huebner@gmail.com
Hi Mathieu,
I have mined the Subversion repository for the commits that you referenced, and I found them to represent this patch:
Index: taskmaster.lisp
--- taskmaster.lisp (revision 4427) +++ taskmaster.lisp (revision 4435) @@ -127,9 +127,21 @@
#-:lispworks (defmethod handle-incoming-connection ((taskmaster one-thread-per-connection-taskmaster) socket)
- (bt:make-thread (lambda ()
(process-connection (taskmaster-acceptor
taskmaster) socket))
:name (format nil "Hunchentoot worker \(client:
~A)" (client-as-string socket))))
- ;; We are handling all conditions here as we want to make sure that
- ;; the acceptor process never crashes while trying to create a
- ;; worker thread. One such problem exists in
- ;; GET-PEER-ADDRESS-AND-PORT which can signal socket conditions on
- ;; some platforms in certain situations.
- (handler-case
(bt:make-thread (lambda ()
(process-connection (taskmaster-acceptor
taskmaster) socket))
:name (format nil "Hunchentoot worker \(client:
~A)" (client-as-string socket)))
- (error (cond)
;; Need to bind *ACCEPTOR* so that LOG-MESSAGE can do its work.
(let ((*acceptor* (taskmaster-acceptor taskmaster)))
(log-message *lisp-errors-log-level*
"Error while creating worker thread for new
incoming connection: ~A" cond)))))
;; LispWorks implementation
Now, the patch actually does not apply, but it was on trunk and has been evolved into the current version:
https://github.com/edicl/hunchentoot/blob/master/taskmaster.lisp#L355
If I understand you correctly, you say that the code does not work. Can you somehow provide a way to reproduce the problem?
Thanks, Hans
On Wed, Aug 15, 2012 at 3:29 PM, Mathieu Lemoine mlemoine@mentel.com wrote:
Hi,
I am currently hitting the problem described in the following thread:
http://lists.common-lisp.net/pipermail/tbnl-devel/2009-July/004768.html
I'm using SBCL 1.0.58 with hunchentoot 1.2.3 and usocket 0.5.5 (both from quicklisp).
I understand the long-term fix will be some hard work. However, in
this
thread, Hans is referencing a few diff hunks :
http://bknr.net/trac/changeset/4430?format=diff&new=4430 http://bknr.net/trac/changeset/4434?format=diff&new=4434 http://bknr.net/trac/changeset/4435?format=diff&new=4435
Since the old trac is not working anymore, could it be possible to either point me toward the equivalent commits in the github repository or
the
content of these patches.
By the way, Peter, if you're still around, did you need to apply only 4435 or both this and 4434 in order to solve the second issue (unbound variable on *acceptor*)?
If anybody has information with respect to these issues, including
the
corresponding bug report in usocket, so I can do some follow up with them, it would be greatly appreciated.
Thanks in advance for your help and time.
Mathieu.
tbnl-devel site list tbnl-devel@common-lisp.net http://common-lisp.net/mailman/listinfo/tbnl-devel
tbnl-devel site list tbnl-devel@common-lisp.net http://common-lisp.net/mailman/listinfo/tbnl-devel
tbnl-devel site list tbnl-devel@common-lisp.net http://common-lisp.net/mailman/listinfo/tbnl-devel
tbnl-devel site list tbnl-devel@common-lisp.net http://common-lisp.net/mailman/listinfo/tbnl-devel