Hi Hans,
Just to let you know, the server did not have any further problem after this fix.
So it looks like everything is running smoothly. I'm going to wait a few more days to make sure everything's stable and make a small announce on the ML regarding my software in case it's interesting to someone.
Thanks for your help solving this issue.
Mathieu.
Hi Hans,
Yes, that is the problem.
Here is the result of my swank session (*cheshire* being the variable holding my acceptor):
CL-USER> (hunchentoot::taskmaster-max-thread-count (hunchentoot::acceptor-taskmaster *cheshire*))
100
CL-USER> (hunchentoot::taskmaster-request-count (hunchentoot::acceptor-taskmaster *cheshire*))
100
CL-USER> (hunchentoot::taskmaster-max-accept-count (hunchentoot::acceptor-taskmaster *cheshire*))
120
CL-USER> (setf (hunchentoot::taskmaster-request-count (hunchentoot::acceptor-taskmaster *cheshire*)) 0)
0
CL-USER> (hunchentoot::note-free-connection (hunchentoot::acceptor-taskmaster *cheshire*))
0
After that, the server resumed serving requests without any further problems.
What is weird, however, is that I never have the "Can't handle a new request, too many request threads already" error message in either of my logs.
This may be a long shot, but by reading handle-incoming-connection (https://github.com/edicl/hunchentoot/blob/master/taskmaster.lisp#L288) and create-request-handler-thread (https://github.com/edicl/hunchentoot/blob/master/taskmaster.lisp#L355), I think that if an error arises while creating a thread (such as the "getpeername": ENOTCONN [...] message), the counter is not decremented because the thread is never started (thus doesn't reach the dynamic scope of the unwind-protect).
I sent a pull request with the patch for this idea and will keep you posted if that solved the bug or not. I should be fixed by tomorrow since the bug occurs after roughly 9 hours of operations.
Thanks for your help.
2012/8/15 Hans Hübner <hans.huebner@gmail.com>Hi Mathieu,
you may want write a log entry in the DO clause of the loop in this function:
https://github.com/edicl/hunchentoot/blob/master/taskmaster.lisp#L250
It could be that somehow, you're running out of worker threads, but
the code fails to correctly decrement the free worker count. You can
also inspect request-count slot of the taskmaster when you see the
hanging server again.
Keep us posted about what you find.
-Hans
On Wed, Aug 15, 2012 at 6:02 PM, Mathieu Lemoine <mlemoine@mentel.com> wrote:
> Hi Hans,
>
> I though this was the problem because of the error message I find in my
> *standard-error* log:
> Error while creating worker thread for new incoming connection: Socket error
> in "getpeername": ENOTCONN (Transport endpoint is not connected)
> However, I noticed reading your patches, this is the error message
> introduced by the patch.
> Since the error message is present many times, I think your code is working,
> otherwise the acceptor would crash instantly.
>
> Currently, the symptoms I saw are these:
> - Timeout when trying to reach port 80 (on which Hunchentoot is listening)
> - No error message in either *standard-error* nor *standard-output* logs
> - Only one thread (hunchentoot listener?) running in addition to the main
> process thread (as shown by htop)
>
> This problem occurs in production environment after a few hours of
> operations. I restarted it once again with a swank server this time.
> Next time the problem occurs, I will try to dig in it to gather more
> informations on the state of my software.
>
> I still don't know what is causing the problem or how to reproduce it, but I
> will try to check it up and keep you posted.
>
> Thanks for your time and help.
> Mathieu.
>
>
> 2012/8/15 Hans Hübner <hans.huebner@gmail.com>
>>
>> Hi Mathieu,
>>
>> I have mined the Subversion repository for the commits that you
>> referenced, and I found them to represent this patch:
>>
>> Index: taskmaster.lisp
>> ===================================================================
>> --- taskmaster.lisp (revision 4427)
>> +++ taskmaster.lisp (revision 4435)
>> @@ -127,9 +127,21 @@
>>
>> #-:lispworks
>> (defmethod handle-incoming-connection ((taskmaster
>> one-thread-per-connection-taskmaster) socket)
>> - (bt:make-thread (lambda ()
>> - (process-connection (taskmaster-acceptor
>> taskmaster) socket))
>> - :name (format nil "Hunchentoot worker \(client:
>> ~A)" (client-as-string socket))))
>> + ;; We are handling all conditions here as we want to make sure that
>> + ;; the acceptor process never crashes while trying to create a
>> + ;; worker thread. One such problem exists in
>> + ;; GET-PEER-ADDRESS-AND-PORT which can signal socket conditions on
>> + ;; some platforms in certain situations.
>> + (handler-case
>> + (bt:make-thread (lambda ()
>> + (process-connection (taskmaster-acceptor
>> taskmaster) socket))
>> + :name (format nil "Hunchentoot worker \(client:
>> ~A)" (client-as-string socket)))
>> +
>> + (error (cond)
>> + ;; Need to bind *ACCEPTOR* so that LOG-MESSAGE can do its work.
>> + (let ((*acceptor* (taskmaster-acceptor taskmaster)))
>> + (log-message *lisp-errors-log-level*
>> + "Error while creating worker thread for new
>> incoming connection: ~A" cond)))))
>>
>> ;; LispWorks implementation
>>
>> Now, the patch actually does not apply, but it was on trunk and has
>> been evolved into the current version:
>>
>> https://github.com/edicl/hunchentoot/blob/master/taskmaster.lisp#L355
>>
>> If I understand you correctly, you say that the code does not work.
>> Can you somehow provide a way to reproduce the problem?
>>
>> Thanks,
>> Hans
>>
>> On Wed, Aug 15, 2012 at 3:29 PM, Mathieu Lemoine <mlemoine@mentel.com>
>> wrote:
>> > Hi,
>> >
>> > I am currently hitting the problem described in the following thread:
>> > http://lists.common-lisp.net/pipermail/tbnl-devel/2009-July/004768.html
>> >
>> > I'm using SBCL 1.0.58 with hunchentoot 1.2.3 and usocket 0.5.5 (both
>> > from
>> > quicklisp).
>> >
>> > I understand the long-term fix will be some hard work. However, in this
>> > thread, Hans is referencing a few diff hunks :
>> >
>> > http://bknr.net/trac/changeset/4430?format=diff&new=4430
>> > http://bknr.net/trac/changeset/4434?format=diff&new=4434
>> > http://bknr.net/trac/changeset/4435?format=diff&new=4435
>> >
>> > Since the old trac is not working anymore, could it be possible to
>> > either
>> > point me toward the equivalent commits in the github repository or the
>> > content of these patches.
>> >
>> > By the way, Peter, if you're still around, did you need to apply only
>> > 4435
>> > or both this and 4434 in order to solve the second issue (unbound
>> > variable
>> > on *acceptor*)?
>> >
>> > If anybody has information with respect to these issues, including the
>> > corresponding bug report in usocket, so I can do some follow up with
>> > them,
>> > it would be greatly appreciated.
>> >
>> > Thanks in advance for your help and time.
>> >
>> > Mathieu.
>> >
>> > _______________________________________________
>> > tbnl-devel site list
>> > tbnl-devel@common-lisp.net
>> > http://common-lisp.net/mailman/listinfo/tbnl-devel
>>
>> _______________________________________________
>> tbnl-devel site list
>> tbnl-devel@common-lisp.net
>> http://common-lisp.net/mailman/listinfo/tbnl-devel
>
>
>
> _______________________________________________
> tbnl-devel site list
> tbnl-devel@common-lisp.net
> http://common-lisp.net/mailman/listinfo/tbnl-devel
_______________________________________________
tbnl-devel site list
tbnl-devel@common-lisp.net
http://common-lisp.net/mailman/listinfo/tbnl-devel