[cmucl-imp] gc & blocked signals

Helmut Eller

31 Jan 2012

10:03 p.m.

I think there is a problem related to blocked signals and garbage collection:

1. Start cmucl -noinit -eval '(loop (ext:gc :full t))' in a terminal and let it run.

2. Under Linux, cat /proc/<pid>/status shows that SigBlk is 0 i.e. no signals are blocked.

3. Interrupt the loop with C-c (SIGINT) and wait for the debugger.

4. SigBlk is still 0.

5. Type c to continue the loop.

6. SigBlk is now 000000001fc90000

That's a bug, right? It should again be zero.

The sigmask 000000001fc90000 corresponds to the signals:

17 SIGCHLD
20 SIGTSTP
23 SIGURG
24 SIGXCPU
25 SIGXFSZ
26 SIGVTALRM
27 SIGPROF
28 SIGWINCH
29 SIGIO

SLIME uses SIGIO on a socket and if that stays blocked then SLIME can't interrupt the Lisp process. Also note that C-z in the terminal doesn't stop the process; presumably because SIGTSTP is blocked.

lisp-implementation-version is "20c release-20c (20C Unicode)".

Helmut
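The mapping from the SigBlk value to signal numbers can be checked mechanically: bit (n - 1) of the mask stands for signal n. The following is a minimal sketch in C, not code from CMUCL; the function name decode_sigmask is made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode a SigBlk-style bitmask into signal numbers.
 * Bit (n - 1) set means signal n is blocked.
 * Stores at most `cap` signal numbers in `out` and returns the total
 * number of set bits (hypothetical helper, for illustration only). */
size_t decode_sigmask(uint64_t mask, int *out, size_t cap)
{
    size_t count = 0;

    assert(out != NULL);
    for (int signo = 1; signo <= 64; signo++) {
        if (mask & (UINT64_C(1) << (signo - 1))) {
            if (count < cap)
                out[count] = signo;
            count++;
        }
    }
    return count;
}
```

Called with 0x1fc90000, this reports nine signals: 17, 20, and 23 through 29, which matches the list in the message above.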


Raymond Toy

1 Feb

3:27 a.m.

On 1/31/12 2:03 PM, Helmut Eller wrote:

...

I think there is a problem related to blocked signals and garbage collection:

1. Start cmucl -noinit -eval '(loop (ext:gc :full t))' in a terminal and let it run.

2. Under Linux, cat /proc/<pid>/status shows that SigBlk is 0 i.e. no signals are blocked.

3. Interrupt the loop with C-c (SIGINT) and wait for the debugger.

4. SigBlk is still 0.

5. Type c to continue the loop.

6. SigBlk is now 000000001fc90000

That's a bug, right? It should again be zero.

Yeah, that looks like a bug. I think the problem is not with gc, but with the signal handler(s). I was planning on doing some work on the signal handlers to make them simpler, based on what Carl did for Windows. This should make them simpler and safer. Don't know if it will take care of this problem or not.

...

29 SIGIO

SLIME uses SIGIO on a socket and if that stays blocked then SLIME can't interrupt the Lisp process. Also note that C-z in the terminal doesn't

Does SIGIO actually work for you? I stopped using it long ago because it caused strange errors to occur (on darwin). I think it's because SIGIO causes interrupts at bad places because cmucl isn't really interrupt-safe.

Ray

Helmut Eller

8:32 a.m.

* Raymond Toy [2012-02-01 03:27] writes:

...

Yeah, that looks like a bug. I think the problem is not with gc, but with the signal handler(s). I was planning on doing some work on the signal handlers to make them simpler, based on what Carl did for Windows. This should make them simpler and safer. Don't know if it will take care of this problem or not.

The problem doesn't occur with an empty loop. Therefore I think it has something to do with GC.

...

...
29 SIGIO

SLIME uses SIGIO on a socket and if that stays blocked then SLIME can't interrupt the Lisp process. Also note that C-z in the terminal doesn't

Does SIGIO actually work for you? I stopped using it long ago because it caused strange errors to occur (on darwin). I think it's because SIGIO causes interrupts at bad places because cmucl isn't really interrupt-safe.

Works well enough for me (on Linux). The GC should of course be interrupt-safe. The stream code is not reentrant, so it is problematic to use streams in signal handlers. SLIME tries to be careful when reading/writing to its own stream and tries to delay interrupts to safe-points. SLIME can't fix other streams or non-reentrant code, but the situation there is IMO the same as if the debugger is invoked with SIGINT.

Helmut

Raymond Toy

5:20 p.m.

On Wed, Feb 1, 2012 at 12:32 AM, Helmut Eller <heller@common-lisp.net> wrote:

...

* Raymond Toy [2012-02-01 03:27] writes:

...
Yeah, that looks like a bug. I think the problem is not with gc, but with the signal handler(s). I was planning on doing some work on the signal handlers to make them simpler, based on what Carl did for Windows. This should make them simpler and safer. Don't know if it will take care of this problem or not.

The problem doesn't occur with an empty loop. Therefore I think it has something to do with GC.

Ok. Can you write up a ticket for this?

...

...
...
29 SIGIO

SLIME uses SIGIO on a socket and if that stays blocked then SLIME can't interrupt the Lisp process. Also note that C-z in the terminal doesn't

Does SIGIO actually work for you? I stopped using it long ago because it caused strange errors to occur (on darwin). I think it's because SIGIO causes interrupts at bad places because cmucl isn't really interrupt-safe.

Works well enough for me (on Linux). The GC should of course be interrupt-safe. The stream code is not reentrant, so it is problematic to use streams in signal handlers. SLIME tries to be careful when reading/writing to its own stream and tries to delay interrupts to safe-points. SLIME can't fix other streams or non-reentrant code, but the situation there is IMO the same as if the debugger is invoked with SIGINT.

I will try sigio with slime again. That will certainly help motivate me to make it work better. (Or frustrate me to switch back to fd-handler. :-( )

Carl has suggested modifying the compiler so that we only take (most) signals at safe points. This will certainly help many situations so we can concentrate on the truly async signals like sigint. Such safe points will have some implications on speed, though. That's for another day (or year).

Ray
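The idea of taking signals only at safe points can be illustrated outside of CMUCL with a generic pattern: the asynchronous handler merely records the signal, and the program acts on it later at a point it chooses. This is a sketch of that general technique in plain C, not of CMUCL's runtime or of the compiler change Carl proposed; all names (note_signal, defer_signal, safe_point, signals_handled) are invented for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t pending_signal = 0; /* 0 means nothing pending */
static int handled_count = 0;

/* Asynchronous handler: only records which signal arrived. */
static void note_signal(int signo)
{
    pending_signal = signo;
}

/* Install the recording handler for signo. Returns 0 on success. */
int defer_signal(int signo)
{
    struct sigaction sa;

    assert(signo > 0);
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = note_signal;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/* Safe point: do the real work for a pending signal, if there is one.
 * Returns the signal number that was handled, or 0 if none was pending.
 * Simplification: a signal arriving between the read and the clear
 * below would be lost; real code would block signals around this. */
int safe_point(void)
{
    int signo = pending_signal;

    if (signo != 0) {
        pending_signal = 0;
        handled_count++; /* stand-in for the real handler body */
    }
    return signo;
}

int signals_handled(void)
{
    return handled_count;
}
```

After defer_signal(SIGUSR1) and raise(SIGUSR1), nothing happens until the program calls safe_point(), which then returns SIGUSR1 once. The cost Ray mentions corresponds to the check in safe_point(), which has to run wherever signals are allowed to be taken.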

Helmut Eller

6:26 p.m.

* Raymond Toy [2012-02-01 17:20] writes:

...

Ok. Can you write up a ticket for this?

Sure. See: http://trac.common-lisp.net/cmucl/ticket/55

Helmut

Raymond Toy

7:47 p.m.

On Wed, Feb 1, 2012 at 10:26 AM, Helmut Eller <heller@common-lisp.net> wrote:

...

* Raymond Toy [2012-02-01 17:20] writes:

...
Ok. Can you write up a ticket for this?

Sure. See: http://trac.common-lisp.net/cmucl/ticket/55

Thanks! Now I won't forget about it.

Ray

Raymond Toy

3 Feb

5:15 p.m.

On Tue, Jan 31, 2012 at 2:03 PM, Helmut Eller <heller@common-lisp.net> wrote:

...

I think there is a problem related to blocked signals and garbage collection:

1. Start cmucl -noinit -eval '(loop (ext:gc :full t))' in a terminal and let it run.

2. Under Linux, cat /proc/<pid>/status shows that SigBlk is 0 i.e. no signals are blocked.

3. Interrupt the loop with C-c (SIGINT) and wait for the debugger.

4. SigBlk is still 0.

5. Type c to continue the loop.

6. SigBlk is now 000000001fc90000

That's a bug, right? It should again be zero.

For your example with just a simple loop, a C-c causes an interrupt to happen right away, so things work. But when GC is running, interrupts are disabled, so the C-c is remembered and deferred to a later time and then handled with a pendingInterrupt trap. This is the main difference between the two cases.

I think the problem is in interrupt_handle_pending:

    memcpy(&context->uc_sigmask, &pending_mask, NSIG / LONG_BIT);

We are trying to restore the sigmask that was saved by setup_pending_signal (that coincidentally copies the entire uc_sigmask to pending_mask). NSIG = 65 and LONG_BIT = 32 on my machine, so memcpy only copies 2 bytes. I think we really want to copy at least 64 bits or 8 bytes. If I modify this code to copy 8 bytes, SigBlk is now 0 after returning, and C-c continues to work.

This seems like a very long-standing bug! Thanks for pointing it out. This change will get into the Feb snapshot. I hope you can do some testing with it and let me know how it works out.

Ray
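The effect of a 2-byte copy can be reproduced with a small simulation. This is a sketch under stated assumptions, not the CMUCL code: the 64-bit mask is modelled as 8 bytes in little-endian order (as on x86), and EXAMPLE_HANDLER_MASK is a hypothetical value for the mask in effect while the signal was deferred, since the thread does not state the real one. The helper names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MASK_BYTES 8 /* 64 signals, one bit each */

/* HYPOTHETICAL mask in effect while the signal was deferred; the thread
 * does not give the real value. The low 16 bits are set for illustration. */
#define EXAMPLE_HANDLER_MASK UINT64_C(0x1fc92002)

/* Store a 64-bit mask as bytes, least significant byte first. */
static void store_le64(uint8_t out[MASK_BYTES], uint64_t v)
{
    for (size_t i = 0; i < MASK_BYTES; i++)
        out[i] = (uint8_t)(v >> (8 * i));
}

/* Rebuild a 64-bit mask from bytes stored least significant byte first. */
static uint64_t load_le64(const uint8_t in[MASK_BYTES])
{
    uint64_t v = 0;

    for (size_t i = 0; i < MASK_BYTES; i++)
        v |= (uint64_t)in[i] << (8 * i);
    return v;
}

/* Overwrite the first nbytes of the context mask with the saved mask and
 * return the mask the process would resume with. */
uint64_t restore_mask(uint64_t context_mask, uint64_t saved_mask, size_t nbytes)
{
    uint8_t ctx[MASK_BYTES], saved[MASK_BYTES];

    assert(nbytes <= MASK_BYTES);
    store_le64(ctx, context_mask);
    store_le64(saved, saved_mask);
    memcpy(ctx, saved, nbytes);
    return load_le64(ctx);
}
```

With a saved mask of 0, restore_mask(EXAMPLE_HANDLER_MASK, 0, 2) leaves 0x1fc90000 behind, while copying all 8 bytes gives 0. That the observed SigBlk value has no bits set in its low 16 bits is at least consistent with only the first two bytes having been restored.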

Carl Shapiro

4 Feb

12:47 a.m.

On Fri, Feb 3, 2012 at 9:15 AM, Raymond Toy <toy.raymond@gmail.com> wrote:

...

NSIG = 65 and LONG_BIT = 32 on my machine, so memcpy only copies 2 bytes. I think we really want to copy at least 64 bits or 8 bytes. If I modify this code to copy 8 bytes, SigBlk is now 0 after returning, and C-c continues to work.

This looks like my fault

http://common-lisp.net/gitweb?p=projects/cmucl/cmucl.git;a=commit;h=342beebb...

LONG_BIT should have been CHAR_BIT.

I am surprised that the C library NSIG is now 65. When I made this change, the C library on the machine I used defined NSIG as 1024 but the kernel data structure knew it was 64. For example, compare

http://lxr.linux.no/#linux+v3.2.4/arch/x86/include/asm/signal.h

with

http://fxr.watson.org/fxr/source/sysdeps/unix/sysv/linux/bits/signum.h?v=GLI...

Using the wrong definition of NSIG may cause corruption.
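The byte counts discussed here follow from integer division. A small sketch of the arithmetic, with invented function names; long_bit is passed as a parameter because LONG_BIT is not defined by every C library:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Byte count the quoted memcpy call computes: NSIG / LONG_BIT. */
size_t bytes_copied_buggy(int nsig, int long_bit)
{
    assert(long_bit > 0);
    return (size_t)(nsig / long_bit);
}

/* Byte count with the divisor Carl says was intended: NSIG / CHAR_BIT. */
size_t bytes_copied_intended(int nsig)
{
    assert(nsig >= 0);
    return (size_t)(nsig / CHAR_BIT);
}
```

For NSIG = 65 and LONG_BIT = 32 the first function gives 2, and with CHAR_BIT = 8 the second gives 8, the number of bytes Ray found necessary. For NSIG = 1024 and LONG_BIT = 32 the first gives 32.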

Raymond Toy

3:33 p.m.

On 2/3/12 4:47 PM, Carl Shapiro wrote:

...

I am surprised that the C library NSIG is now 65. When I made this change, the C library on the machine I used defined NSIG as 1024 but the kernel data structure knew it was 64. For example, compare

So previously, the code was copying 32 bytes? That seems like too much too.

NSIG is 65 on an Ubuntu 10 system. It's also 65 on my OpenSuSE 11.3 system that I normally use to build cmucl.

Ray

Carl Shapiro

8 Feb

2:25 a.m.

On Sat, Feb 4, 2012 at 7:33 AM, Raymond Toy <toy.raymond@gmail.com> wrote:

...

So previously, the code was copying 32 bytes? That seems like too much too.

Yes. The structure assignment of the pending mask back to the ucontext would copy 32 bytes instead of 8 bytes.

    context->uc_sigmask = pending_mask;

I did not correctly recall the root cause. The problem was a disagreement over the size of sigset_t between glibc and the kernel. The glibc definition is a 32-byte bit set.

http://repo.or.cz/w/glibc.git/blob?f=sysdeps/unix/sysv/linux/bits/sigset.h
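Because the size of sigset_t depends on the C library, it can be queried with sizeof rather than assumed, and a copy can be limited to the bytes that NSIG actually covers. The following is a sketch of that idea, not the change that was committed to CMUCL. The function names are invented, and treating sigset_t as raw bytes is an assumption that POSIX does not guarantee.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <limits.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Bytes needed for one bit per signal when signals are numbered
 * 1 .. nsig - 1 (so nsig = 65 means 64 signals). */
size_t sigset_bytes_needed(int nsig)
{
    assert(nsig >= 1);
    return (size_t)(nsig - 1 + CHAR_BIT - 1) / CHAR_BIT;
}

/* Size of the C library's sigset_t on the system compiling this. */
size_t libc_sigset_bytes(void)
{
    return sizeof(sigset_t);
}

/* Copy only the leading bytes that nsig covers, never more than
 * sizeof(sigset_t).
 * ASSUMPTION: sigset_t is a plain bit set whose leading bytes hold the
 * low-numbered signals. This matches glibc's layout but is not portable. */
void copy_signal_mask(sigset_t *dst, const sigset_t *src, int nsig)
{
    size_t n = sigset_bytes_needed(nsig);

    if (n > sizeof(sigset_t))
        n = sizeof(sigset_t);
    memcpy(dst, src, n);
}
```

With nsig = 65, sigset_bytes_needed returns 8, so copy_signal_mask touches at most the 8 bytes the kernel uses regardless of how large the library's sigset_t is.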

...

NSIG is 65 on an Ubuntu 10 system. It's also 65 on my OpenSuSE 11.3 system that I normally use to build cmucl.

Yes, those numbers seem okay. NSIG is sized correctly. sigset_t is not.
