Hi Hans,

Just to let you know, the server did not have any further problem after this fix.

So it looks like everything is running smoothly. I'm going to wait a few more days to make sure everything's stable and make a small announce on the ML regarding my software in case it's interesting to someone.

Thanks for your help solving this issue.
Mathieu.

2012/8/16 Mathieu Lemoine <mlemoine@mentel.com>
Hi Hans,

Yes, that is the problem.

Here is the result of my swank session (*cheshire* being the variable holding my acceptor):

CL-USER> (hunchentoot::taskmaster-max-thread-count (hunchentoot::acceptor-taskmaster *cheshire*))
100
CL-USER> (hunchentoot::taskmaster-request-count (hunchentoot::acceptor-taskmaster *cheshire*))
100
CL-USER> (hunchentoot::taskmaster-max-accept-count (hunchentoot::acceptor-taskmaster *cheshire*))
120
CL-USER> (setf (hunchentoot::taskmaster-request-count (hunchentoot::acceptor-taskmaster *cheshire*)) 0)
0
CL-USER> (hunchentoot::note-free-connection (hunchentoot::acceptor-taskmaster *cheshire*))
0

After that, the server resumed serving requests without any further problems.

What is weird, however, is that I never have the "Can't handle a new request, too many request threads already" error message in either of my logs.

This may be a long shot, but by reading handle-incoming-connection (https://github.com/edicl/hunchentoot/blob/master/taskmaster.lisp#L288) and create-request-handler-thread (https://github.com/edicl/hunchentoot/blob/master/taskmaster.lisp#L355), I think that if an error arises while creating a thread (such as the "getpeername": ENOTCONN [...] message), the counter is not decremented because the thread is never started (thus doesn't reach the dynamic scope of the unwind-protect).

I sent a pull request with the patch for this idea and will keep you posted if that solved the bug or not. I should be fixed by tomorrow since the bug occurs after roughly 9 hours of operations.

Thanks for your help.


2012/8/15 Hans Hübner <hans.huebner@gmail.com>
Hi Mathieu,

you may want write a log entry in the DO clause of the loop in this function:

https://github.com/edicl/hunchentoot/blob/master/taskmaster.lisp#L250

It could be that somehow, you're running out of worker threads, but
the code fails to correctly decrement the free worker count.  You can
also inspect request-count slot of the taskmaster when you see the
hanging server again.

Keep us posted about what you find.
-Hans

On Wed, Aug 15, 2012 at 6:02 PM, Mathieu Lemoine <mlemoine@mentel.com> wrote:
> Hi Hans,
>
> I though this was the problem because of the error message I find in my
> *standard-error* log:
> Error while creating worker thread for new incoming connection: Socket error
> in "getpeername": ENOTCONN (Transport endpoint is not connected)
> However, I noticed reading your patches, this is the error message
> introduced by the patch.
> Since the error message is present many times, I think your code is working,
> otherwise the acceptor would crash instantly.
>
> Currently, the symptoms I saw are these:
> - Timeout when trying to reach port 80 (on which Hunchentoot is listening)
> - No error message in either *standard-error* nor *standard-output* logs
> - Only one thread (hunchentoot listener?) running in addition to the main
> process thread (as shown by htop)
>
> This problem occurs in production environment after a few hours of
> operations. I restarted it once again with a swank server this time.
> Next time the problem occurs, I will try to dig in it to gather more
> informations on the state of my software.
>
> I still don't know what is causing the problem or how to reproduce it, but I
> will try to check it up and keep you posted.
>
> Thanks for your time and help.
> Mathieu.
>
>
> 2012/8/15 Hans Hübner <hans.huebner@gmail.com>
>>
>> Hi Mathieu,
>>
>> I have mined the Subversion repository for the commits that you
>> referenced, and I found them to represent this patch:
>>
>> Index: taskmaster.lisp
>> ===================================================================
>> --- taskmaster.lisp     (revision 4427)
>> +++ taskmaster.lisp     (revision 4435)
>> @@ -127,9 +127,21 @@
>>
>>  #-:lispworks
>>  (defmethod handle-incoming-connection ((taskmaster
>> one-thread-per-connection-taskmaster) socket)
>> -  (bt:make-thread (lambda ()
>> -                    (process-connection (taskmaster-acceptor
>> taskmaster) socket))
>> -                  :name (format nil "Hunchentoot worker \(client:
>> ~A)" (client-as-string socket))))
>> +  ;; We are handling all conditions here as we want to make sure that
>> +  ;; the acceptor process never crashes while trying to create a
>> +  ;; worker thread.  One such problem exists in
>> +  ;; GET-PEER-ADDRESS-AND-PORT which can signal socket conditions on
>> +  ;; some platforms in certain situations.
>> +  (handler-case
>> +      (bt:make-thread (lambda ()
>> +                        (process-connection (taskmaster-acceptor
>> taskmaster) socket))
>> +                      :name (format nil "Hunchentoot worker \(client:
>> ~A)" (client-as-string socket)))
>> +
>> +    (error (cond)
>> +      ;; Need to bind *ACCEPTOR* so that LOG-MESSAGE can do its work.
>> +      (let ((*acceptor* (taskmaster-acceptor taskmaster)))
>> +        (log-message *lisp-errors-log-level*
>> +                     "Error while creating worker thread for new
>> incoming connection: ~A" cond)))))
>>
>>  ;; LispWorks implementation
>>
>> Now, the patch actually does not apply, but it was on trunk and has
>> been evolved into the current version:
>>
>> https://github.com/edicl/hunchentoot/blob/master/taskmaster.lisp#L355
>>
>> If I understand you correctly, you say that the code does not work.
>> Can you somehow provide a way to reproduce the problem?
>>
>> Thanks,
>> Hans
>>
>> On Wed, Aug 15, 2012 at 3:29 PM, Mathieu Lemoine <mlemoine@mentel.com>
>> wrote:
>> > Hi,
>> >
>> > I am currently hitting the problem described in the following thread:
>> > http://lists.common-lisp.net/pipermail/tbnl-devel/2009-July/004768.html
>> >
>> > I'm using SBCL 1.0.58 with hunchentoot 1.2.3 and usocket 0.5.5 (both
>> > from
>> > quicklisp).
>> >
>> > I understand the long-term fix will be some hard work. However, in this
>> > thread, Hans is referencing a few diff hunks :
>> >
>> > http://bknr.net/trac/changeset/4430?format=diff&new=4430
>> > http://bknr.net/trac/changeset/4434?format=diff&new=4434
>> > http://bknr.net/trac/changeset/4435?format=diff&new=4435
>> >
>> > Since the old trac is not working anymore, could it be possible to
>> > either
>> > point me toward the equivalent commits in the github repository or the
>> > content of these patches.
>> >
>> > By the way, Peter, if you're still around, did you need to apply only
>> > 4435
>> > or both this and 4434 in order to solve the second issue (unbound
>> > variable
>> > on *acceptor*)?
>> >
>> > If anybody has information with respect to these issues, including the
>> > corresponding bug report in usocket, so I can do some follow up with
>> > them,
>> > it would be greatly appreciated.
>> >
>> > Thanks in advance for your help and time.
>> >
>> > Mathieu.
>> >
>> > _______________________________________________
>> > tbnl-devel site list
>> > tbnl-devel@common-lisp.net
>> > http://common-lisp.net/mailman/listinfo/tbnl-devel
>>
>> _______________________________________________
>> tbnl-devel site list
>> tbnl-devel@common-lisp.net
>> http://common-lisp.net/mailman/listinfo/tbnl-devel
>
>
>
> _______________________________________________
> tbnl-devel site list
> tbnl-devel@common-lisp.net
> http://common-lisp.net/mailman/listinfo/tbnl-devel

_______________________________________________
tbnl-devel site list
tbnl-devel@common-lisp.net
http://common-lisp.net/mailman/listinfo/tbnl-devel