"Got signal before environment was installed on our thread"

Dima Pasechnik

31 Aug 2017

1:29 p.m.

Dear all,

I'm struggling to understand strange segfaults coming from ECL (+Maxima) embedded into Python on FreeBSD; they typically look as follows:

Got signal before environment was installed on our thread
[2: No such file or directory]

;;; ECL C Backtrace
;;; 0 ecl_internal_error (0x87d790765)
;;; 1 init_unixint (0x87d7b6bd0)
;;; 2 init_unixint (0x87d7b6972)
;;; 3 pthread_sigmask (0x80103779d)
;;; 4 pthread_getspecific (0x801036d6f)
;;; 5 unknown (0x7ffffffff193)
;;; 6 GC_push_all_stacks (0x87db1ea2c)
;;; 7 GC_mark_some (0x87db12eec)
;;; 8 GC_stopped_mark (0x87db09baa)
;;; 9 GC_try_to_collect_inner (0x87db09a75)
;;; 10 GC_init (0x87db16f4f)
;;; 11 init_alloc (0x87d7caa59)
;;; 12 cl_boot (0x87d694a5b)
;;; 13 initecl (0x87d218340)
;;; 14 initecl (0x87d20a43f)
;;; 15 initecl (0x87d207e28)
;;; 16 _PyImport_LoadDynamicModule (0x800b3ed1c)
;;; 17 PyImport_AppendInittab (0x800b3d71f)
;;; 18 PyImport_AppendInittab (0x800b3d1a8)
;;; 19 PyImport_ImportModuleLevel (0x800b3c2ce)
;;; 20 _PyBuiltin_Init (0x800b162d7)
;;; 21 PyObject_Call (0x800a7d3e3)
;;; 22 PyEval_EvalFrameEx (0x800b2121c)
;;; 23 PyEval_EvalCodeEx (0x800b1b5d4)
;;; 24 PyEval_EvalCode (0x800b1ad96)
;;; 25 PyImport_ExecCodeModuleEx (0x800b3ad11)
;;; 26 PyImport_AppendInittab (0x800b3ddb8)
;;; 27 PyImport_AppendInittab (0x800b3d71f)
;;; 28 PyImport_AppendInittab (0x800b3d1a8)
;;; 29 PyImport_ImportModuleLevel (0x800b3c2ce)
;;; 30 _PyBuiltin_Init (0x800b162d7)
;;; 31 PyEval_EvalFrameEx (0x800b22dd1)
Segmentation fault (core dumped)

It looks as if ECL (version 16.1.2) is being called before an initialisation is complete, but is it possible to say more without a debugger?

More details: this is on FreeBSD 11.0, clang 3.8.0, GC version 7.6.0 with libatomic_ops version 7.4.6. It is only reproducible on FreeBSD.

ECL is built with --disable-threads; GC is built with or without threads, and the result is the same (so it's unclear to me where the pthread_* calls in the trace come from).

Thanks,
Dima

PS. The segfault is at the bottom of https://trac.sagemath.org/ticket/22679#comment:87
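
A trace like the one in this message can be read mechanically even without a debugger. The following is only a minimal sketch: the frame-line pattern is an assumption inferred from the trace quoted here (it is not a format documented by ECL), and the helper names `parse_frames` and `frames_inside` are hypothetical. It lists which frames are nested inside a given outer call, which shows whether that call was still in progress when the error was raised.

```python
import re

# Frame lines look like ";;; 12 cl_boot (0x87d694a5b)". This pattern is an
# assumption based on the trace quoted in this thread, not an ECL-defined format.
FRAME_RE = re.compile(r";;;\s+(\d+)\s+(\S+)\s+\((0x[0-9a-fA-F]+)\)")


def parse_frames(trace):
    """Return (index, symbol, address) tuples, innermost frame first."""
    return [(int(i), sym, int(addr, 16)) for i, sym, addr in FRAME_RE.findall(trace)]


def frames_inside(frames, outer):
    """Symbols of the frames nested inside the first frame named `outer`."""
    for pos, (_, sym, _) in enumerate(frames):
        if sym == outer:
            return [s for _, s, _ in frames[:pos]]
    return []  # `outer` is not on the stack at all


# First 13 frames of the trace from this message (flattened, as in the archive).
TRACE = (
    ";;; ECL C Backtrace ;;; 0 ecl_internal_error (0x87d790765) "
    ";;; 1 init_unixint (0x87d7b6bd0) ;;; 2 init_unixint (0x87d7b6972) "
    ";;; 3 pthread_sigmask (0x80103779d) ;;; 4 pthread_getspecific (0x801036d6f) "
    ";;; 5 unknown (0x7ffffffff193) ;;; 6 GC_push_all_stacks (0x87db1ea2c) "
    ";;; 7 GC_mark_some (0x87db12eec) ;;; 8 GC_stopped_mark (0x87db09baa) "
    ";;; 9 GC_try_to_collect_inner (0x87db09a75) ;;; 10 GC_init (0x87db16f4f) "
    ";;; 11 init_alloc (0x87d7caa59) ;;; 12 cl_boot (0x87d694a5b)"
)

if __name__ == "__main__":
    for name in frames_inside(parse_frames(TRACE), "cl_boot"):
        print(name)
```

Because `cl_boot` is itself a frame in the trace, with `init_alloc` and `GC_init` nested inside it, the error was raised while `cl_boot` was still running. One caveat: symbol names in traces of this kind are usually the nearest exported symbol to each address, so repeated names such as `init_unixint` or `initecl` may stand for unexported functions rather than for the named function itself.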

Daniel Kochmański

1 Sep 2017

6:38 a.m.

Hey Dima,

this looks like the issue with having GC initialized before ECL kicks in. See https://gitlab.com/embeddable-common-lisp/ecl/issues/371 for a discussion about this problem. Basically some other component already called GC_init and ECL calls it once more. It's arguably not a bug.

Best regards,
Daniel

Dima Pasechnik

12:47 p.m.

Hi Daniel,

Thanks for the message. The scenario you describe only happens if GC is a shared library, right?

I've rebuilt GC with shared libraries disabled, and ECL linking statically to GC. I still get very similar segfaults:

;;; ECL C Backtrace
;;; 0 ecl_internal_error (0x87d79b375)
;;; 1 init_unixint (0x87d7c17e0)
;;; 2 init_unixint (0x87d7c1582)
;;; 3 pthread_sigmask (0x80103779d)
;;; 4 pthread_getspecific (0x801036d6f)
;;; 5 unknown (0x7ffffffff193)
;;; 6 GC_push_current_stack (0x87d7ef7c3)
;;; 7 GC_with_callee_saves_pushed (0x87d7f7360)
;;; 8 GC_push_roots (0x87d7ef9c2)
;;; 9 GC_mark_some (0x87d7ec97c)
;;; 10 GC_stopped_mark (0x87d7e6b7a)
;;; 11 GC_try_to_collect_inner (0x87d7e6a75)
;;; 12 GC_init (0x87d7f08ea)
;;; 13 init_alloc (0x87d7d5669)
;;; 14 cl_boot (0x87d69f66b)
...

A very similar picture appears on the develop branch of ECL, although I had to change our code, as in particular ECL_OPT_TRAP_SIGCHLD is gone...

So, what can it be? Some signals issue?

Thanks,
Dima

Daniel Kochmański

12:57 p.m.

I don't think it's related to shared vs static; rather, two GCs running concurrently. Try commenting out the GC_init call in ECL and see what happens.

Do you need SIGCHLD for anything? run-program was rewritten and SIGCHLD handling wasn't a viable option for it anymore.

I'm on my phone; I'll be available after the weekend.

Regards,
D.

Dima Pasechnik

4 Sep 2017

10:04 a.m.

On Fri, Sep 1, 2017 at 1:57 PM, Daniel Kochmański <daniel@turtleware.eu> wrote:

> I don't think it's related to shared vs static; rather, two GCs running concurrently. Try commenting out the GC_init call in ECL and see what happens.

I don't understand how two GCs can run concurrently on a memory region controlled by ECL, which is statically linked to GC... In fact I am pretty sure no other instances of GC are running anywhere within our process tree.

By the way, I don't know whether it's obvious from the backtrace that cl_boot() has completed or not.

If it actually had completed, could there be a bug that invalidates the bit indicating that cl_boot() has been done?

We have seen similar trouble with clang recently, related to FPE. There an FPE bit was flipped by assignment of a double to an integer type (sic!). It took us a lot of head banging on various hard surfaces to debug this (https://trac.sagemath.org/ticket/22799); it turned out we had hit a known bug: https://bugs.llvm.org//show_bug.cgi?id=17686

> Do you need SIGCHLD for anything? run-program was rewritten and SIGCHLD handling wasn't a viable option for it anymore.

We do set ECL_OPT_TRAP_SIGCHLD to 0, thus I presume we can now simply skip it altogether.

Thanks,
Dima

Daniel Kochmański

10:15 a.m.

From the backtrace it is certain that the failure occurs inside the call to GC_init. Such errors are known to have happened when another GC was already initialized on the system (I've linked the issue). It might be caused by something else in bdwgc, I don't know. Either way I'd focus on the GC_init part.

To make sure that my assertion is right, you may put a printf before and after the call to GC_init. I'm not familiar enough with bdwgc internals to say what is wrong, though. Maybe updating the bundled GC sources will help? Or linking with the system libgc? It might be a bug in bdwgc which has already been fixed.

Regards,
Daniel

Dima Pasechnik

11 Sep 2017

11:13 p.m.

On Mon, Sep 4, 2017 at 11:15 AM, Daniel Kochmański <daniel@turtleware.eu> wrote:

> From the backtrace it is certain that the failure occurs inside the call to GC_init. Such errors are known to have happened when another GC was already initialized on the system (I've linked the issue). It might be caused by something else in bdwgc, I don't know. Either way I'd focus on the GC_init part.

Our project (SageMath) only uses libgc within the embedded ECL. Thus I am really puzzled how another libgc instance might kick in and spoil the game for ECL.

One possibility is that clang is using libgc, and thus, in principle, libgc might be sitting somewhere in the runtime?!

> To make sure that my assertion is right, you may put a printf before and after the call to GC_init. I'm not familiar enough with bdwgc internals to say what is wrong, though. Maybe updating the bundled GC sources will help? Or linking with the system libgc? It might be a bug in bdwgc which has already been fixed.

We are not using the bdwgc shipped with ECL; we use a separate libgc 7.6.0, which is the latest stable release. (Is there a reason to ship bdwgc sources with ECL? Do you patch it?)

Thanks,
Dima
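
Whether some copy of libgc is already visible in the Python process before the ECL module is imported can be probed from Python itself. This is a rough sketch only, under stated assumptions: it relies on `ctypes.CDLL(None)` behaving like `dlopen(NULL)` on the platform, and the helper name `symbol_visible` is hypothetical. `GC_init` and `GC_malloc` are exported bdwgc functions, used here only as probe names.

```python
import ctypes


def symbol_visible(name):
    """True if `name` resolves in the process-wide symbol scope (dlopen(NULL))."""
    try:
        process = ctypes.CDLL(None)  # handle for the running process itself
    except (OSError, TypeError):
        # Platforms without dlopen(NULL) semantics; nothing can be concluded.
        return False
    try:
        getattr(process, name)  # ctypes raises AttributeError if lookup fails
    except AttributeError:
        return False
    return True


if __name__ == "__main__":
    # Run this before importing the module that embeds ECL.
    for sym in ("GC_init", "GC_malloc"):
        print(sym, symbol_visible(sym))
```

A `True` result before ECL is loaded would suggest that another component has already brought a GC into the process. A `False` result is not conclusive: libraries loaded with local symbol visibility, which is the usual default for Python extension modules, or a GC linked statically with hidden symbols, would not show up this way.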

Fabrizio Fabbri

12 Sep 2017

12:18 a.m.

On Sep 11, 2017, at 7:13 PM, Dima Pasechnik <dimpase+ecl@gmail.com> wrote:

> We are not using the bdwgc shipped with ECL; we use a separate libgc 7.6.0, which is the latest stable release. (Is there a reason to ship bdwgc sources with ECL? Do you patch it?)

I'm using ECL with the non-embedded bdwgc as well and I don't have any issues. Make sure that bdwgc is not also built statically into ECL. I would expect linking problems in that case, but it is worth double-checking.

...

Thanks, Dima

...
Regards,

Daniel

...
On 04.09.2017 12:04, Dima Pasechnik wrote:

On Fri, Sep 1, 2017 at 1:57 PM, Daniel Kochmański <daniel@turtleware.eu> wrote:

...
I dont think its related to shared vs static - rather two gc running concurrently. Try commenting out GC_init call in ecl and see what happens.

I don't understand how two GCs can run concurrently on a memory region controlled by ECL which is statically linked to GC... In fact I am pretty sure no other instances of GC are running anywhere within our process tree.

By the way, I don't know whether it's obvious from the backtrace that cl_boot() has been completed, or not.

If it actually was completed, could it be a bug that invalidates the bit indicating that cl_boot() has been done?

We have seen similar troubles with clang recently, related to FPE. There an FPE bit was flipped by assignment of a double to an integer type (sic!). It took us a lot of head banging on various hard surfaces to debug this: https://trac.sagemath.org/ticket/22799 it turned out we did hit a known bug: https://bugs.llvm.org//show_bug.cgi?id=17686

...
Do you need sigchld for anything? Run-program was rewritten and sigchld handling wasnt viable option anymore for it.

We do set ECL_OPT_TRAP_SIGCHLD to 0, thus I presume we now can simply skip it all together.

Thanks, Dima

...
Im on phone, will be avail after the weekend.

Regards, D.

Dnia 1 września 2017 14:47:57 CEST, Dima Pasechnik <dimpase+ecl@gmail.com> napisał(a):

...
Hi Daniel, Thanks for the message. The scenario you talk about only happens if GC is a shared library, right?

I've rebuilt GC disabling shared libs, and ECL doing static linking to GC. And I still get very similar segfaults:

;;; ECL C Backtrace
;;; 0 ecl_internal_error (0x87d79b375)
;;; 1 init_unixint (0x87d7c17e0)
;;; 2 init_unixint (0x87d7c1582)
;;; 3 pthread_sigmask (0x80103779d)
;;; 4 pthread_getspecific (0x801036d6f)
;;; 5 unknown (0x7ffffffff193)
;;; 6 GC_push_current_stack (0x87d7ef7c3)
;;; 7 GC_with_callee_saves_pushed (0x87d7f7360)
;;; 8 GC_push_roots (0x87d7ef9c2)
;;; 9 GC_mark_some (0x87d7ec97c)
;;; 10 GC_stopped_mark (0x87d7e6b7a)
;;; 11 GC_try_to_collect_inner (0x87d7e6a75)
;;; 12 GC_init (0x87d7f08ea)
;;; 13 init_alloc (0x87d7d5669)
;;; 14 cl_boot (0x87d69f66b)
...

And a very similar picture on the develop branch of ECL - although I had to change our code, as in particular ECL_OPT_TRAP_SIGCHLD is gone...

So, what can it be? Some signals issue?

Thanks, Dima

On Fri, Sep 1, 2017 at 7:38 AM, Daniel Kochmański <daniel@turtleware.eu> wrote:

...
Hey Dima,

this looks like the issue with having GC initialized before ECL kicks in. See https://gitlab.com/embeddable-common-lisp/ecl/issues/371 for a discussion about this problem. Basically some other component already called GC_init and ECL calls it once more. It's arguably not a bug.

Best regards,

Daniel



Dima Pasechnik

21 Sep 2017

12:31 p.m.

On Tue, Sep 12, 2017 at 1:18 AM, Fabrizio Fabbri <strabixbox@yahoo.com> wrote:

On Sep 11, 2017, at 7:13 PM, Dima Pasechnik <dimpase+ecl@gmail.com> wrote:

On Mon, Sep 4, 2017 at 11:15 AM, Daniel Kochmański <daniel@turtleware.eu> wrote:
From the backtrace it is sure that fail is caused inside the call to GC_init. Such errors are known to have happened when another GC was initialized already on the system (I've linked the issue). It might be caused by something else in bdwgc, I don't know. Either way I'd focus on GC_init part.

Our project (sagemath) only uses libgc within the embedded ECL. Thus I am really puzzled how another libgc instance might kick in and spoil the game for ECL.

One possibility is that clang is using libgc, and thus, in principle, libgc might be sitting somewhere in the runtime?!

To make sure that my assertion is right, you may put a printf before and after the call to GC_init. I'm not familiar enough with bdwgc internals to say what is wrong, though. Maybe updating the bundled GC sources will help? Or linking with the system libgc? It might be a bug in bdwgc that has already been fixed.

We are not using the bdwgc shipped with ECL, we use a separate libgc 7.6.0, which is the latest stable. (Is there a reason to ship bdwgc sources with ECL - do you patch it?)

I'm using ECL with the non-embedded bdwgc as well and I don't have this issue. Make sure that bdwgc is not also built statically into ECL. I would expect linking problems in that case, but it is worth double-checking.

Here is part of a stack trace from the debugger, in a scenario where a call to embedded ECL from Python leads to an ECL stack overflow on an already initialised ECL; it seems to be related to the particular thread this call comes from (another, usual, calling sequence does not lead to crashes). There is no mention of GC in the stack trace. This looks to me like a lack of thread safety on the ECL side, although I might be wrong.

...
frame #16: 0x000000088444b9d6 libecl.so.16.1`si_serror(narg=6, cformat=0x0000000000d27ba0, eformat=0x00000008847d12a0) at error.d:549
frame #17: 0x000000088448bd42 libecl.so.16.1`ecl_cs_overflow at stacks.d:76
frame #18: 0x00000008844168af libecl.so.16.1`ecl_interpret(frame=0x00007fffdeff2658, env=0x0000000000000001, bytecodes=0x0000000000db33c0) at interpreter.d:286
frame #19: 0x0000000884414afc libecl.so.16.1`ecl_apply_from_stack_frame(frame=0x00007fffdeff2658, x=0x0000000000db33c0) at eval.d:79
frame #20: 0x000000088441545b libecl.so.16.1`cl_apply(narg=0, fun=0x0000000000db33c0, lastarg=0x0000000000000001) at eval.d:164
frame #21: 0x0000000883e0e1b4 ecl.so`__pyx_f_4sage_4libs_3ecl_ecl_safe_funcall(__pyx_v_func=0x0000000000769600, __pyx_v_arg=0x0000000000e6dfa0) at ecl.c:5831
frame #22: 0x0000000883e0d519 ecl.so`__pyx_f_4sage_4libs_3ecl_ecl_safe_read_string(__pyx_v_s="(setf *load-verbose* NIL)") at ecl.c:6084
frame #23: 0x0000000883e0d02b ecl.so`__pyx_f_4sage_4libs_3ecl_ecl_eval(__pyx_v_s=0x0000000882add970, __pyx_skip_dispatch=0) at ecl.c:10682
frame #24: 0x0000000883e0cd4c ecl.so`__pyx_pf_4sage_4libs_3ecl_10ecl_eval(__pyx_self=0x0000000000000000, __pyx_v_s=0x0000000882add970) at ecl.c:10762
frame #25: 0x0000000883e0cab7 ecl.so`__pyx_pw_4sage_4libs_3ecl_11ecl_eval(__pyx_self=0x0000000000000000, __pyx_v_s=0x0000000882add970) at ecl.c:10745
frame #26: 0x0000000800d8a68f libpython2.7.so.1`call_function(pp_stack=0x00007fffdeff2c00, oparg=1) at ceval.c:4340
frame #27: 0x0000000800d854d2 libpython2.7.so.1`PyEval_EvalFrameEx(f=0x00000008829939b0, throwflag=0) at ceval.c:2989
...
frame #91: 0x0000000800d88361 libpython2.7.so.1`PyEval_CallObjectWithKeywords(func=0x000000087cdf99e0, arg=0x000000080064e060, kw=0x0000000000000000) at ceval.c:4221
frame #92: 0x0000000800de60d1 libpython2.7.so.1`t_bootstrap(boot_raw=0x0000000807015598) at threadmodule.c:620
frame #93: 0x00000008012d3b55 libthr.so.3`___lldb_unnamed_symbol1$$libthr.so.3 + 325


Fabrizio Fabbri

1:23 p.m.

...

On Sep 21, 2017, at 8:31 AM, Dima Pasechnik <dimpase+ecl@gmail.com> wrote:


Here is part of a stack trace from the debugger, in a scenario where a call to embedded ECL from Python leads to an ECL stack overflow on an already initialised ECL; it seems to be related to the particular thread this call comes from (another, usual, calling sequence does not lead to crashes). There is no mention of GC in the stack trace.

If the current thread was created outside the Lisp environment, you need to import it before calling any ECL function. That is done with ecl_import_current_thread and ecl_release_current_thread.

You can see an example here: https://gitlab.com/embeddable-common-lisp/ecl/tree/develop/examples/threads/...

Maybe you already do that, but it is worth mentioning.

Best, F.
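The import/release pairing described above can be pictured with a rough Python analogy. This is not ECL code. It assumes only that each thread needs its own environment before Lisp can be called, and every name in it (`_env`, `import_current_thread`, `release_current_thread`, `call_lisp`) is a hypothetical stand-in, not the C API.

```python
import threading

# Hypothetical stand-in for a per-thread Lisp environment; not ECL's data structure.
_env = threading.local()


def import_current_thread():
    """Analogue of ecl_import_current_thread: give this thread an environment."""
    _env.value = {"owner": threading.current_thread().name}


def release_current_thread():
    """Analogue of ecl_release_current_thread: drop this thread's environment."""
    del _env.value


def call_lisp(form):
    """Pretend to evaluate `form`; fails if the thread was never imported."""
    if not hasattr(_env, "value"):
        raise RuntimeError("no environment installed on this thread")
    return "evaluated %s" % form


def worker(results, do_import):
    """Thread body: optionally import, call into 'Lisp', then release."""
    try:
        if do_import:
            import_current_thread()
        try:
            results.append(call_lisp("(+ 1 2)"))
        finally:
            if do_import:
                release_current_thread()
    except RuntimeError as exc:
        results.append(str(exc))


def run_in_thread(do_import):
    """Run `worker` in a fresh thread and return what it recorded."""
    results = []
    t = threading.Thread(target=worker, args=(results, do_import))
    t.start()
    t.join()
    return results[0]
```

A thread that skips the import step fails on its first call, while one that imports and later releases works. In this toy model the failure is a clean exception, whereas in the reports in this thread it is a crash.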


Dima Pasechnik

22 Sep 2017

11:31 a.m.

On Thu, Sep 21, 2017 at 2:23 PM, Fabrizio Fabbri <strabixbox@yahoo.com> wrote:


If the current thread was created outside the Lisp environment, you need to import it before calling any ECL function. That is done with ecl_import_current_thread and ecl_release_current_thread.

Thanks for pointing this out - it's new to me!

...

You could see the example here: https://gitlab.com/embeddable-common-lisp/ecl/tree/develop/examples/threads/...

Maybe you already do that but worth mentioning that.

No, we have not done that before, and everything worked on Linux and OSX, and even on Cygwin (that is to say, we were lucky with the thread implementations on these platforms, depending on some sort of undefined behaviour).

Now I am trying ecl_import_current_thread/ecl_release_current_thread on FreeBSD, and it certainly appears to be the right direction, but I have a couple of questions, at least one of them related to signal handling.

0) Any advice on signal flags to be set to certain values? Namely, ECL_OPT_SIGNAL_HANDLING_THREAD and ECL_OPT_THREAD_INTERRUPT_SIGNAL? They seem to affect the setup quite a bit; I had to do some trial and error, and setting the former to 1 and the latter to 67 (probably an OS-specific value) seems to have done the trick...

1) As ECL must be built with --enable-threads, does it mean that it will also try to spawn threads on its own? (So far we have always used --disable-threads; for debugging purposes I'd rather not let ECL run its own threads.) [I'd say this is a documentation issue, too, as it's not clear what exactly --enable-threads is doing: enabling ECL's own threads, or enabling ECL embedding in a multithreaded program, or both?]

2) For some reason calling ecl_release_current_thread() leads to a nasty crash, with lines like

frame #299974: 0x0000000883a52463 libecl.so.16.1`FElibc_error(msg="", narg=0) at error.d:490
frame #299975: 0x0000000883ab3e2c libecl.so.16.1`ecl_process_env at process.d:70
frame #299976: 0x0000000883aba9d4 libecl.so.16.1`ecl_alloc_compact_object(t=t_base_string, extra_space=12) at alloc_2.d:622
frame #299977: 0x0000000883a8c782 libecl.so.16.1`ecl_alloc_simple_vector(l=11, aet=ecl_aet_bc) at array.d:585
frame #299978: 0x0000000883a5331d libecl.so.16.1`make_base_string_copy(s="No error: 0") at string.d:136
frame #299979: 0x0000000883a52320 libecl.so.16.1`_ecl_strerror(code=0) at error.d:475
frame #299980: 0x0000000883a52463

repeating endlessly in the backtrace. Must it be called at all? (The test program in the examples you pointed at does work for me, with a few makefile changes...)

3) How does one call cl_boot() in such a multithreaded setting? I tried merely putting the call to ecl_import_current_thread() before the call to cl_boot(), but I get an error from GC: "Threads explicit registering is not previously enabled" and the program aborts. Without doing ecl_import_current_thread(), cl_boot() succeeds in the "main" thread, but coredumps if invoked from another thread (this is the behaviour you mistook for another instance of GC kicking in). While we can probably live with cl_boot() always being called in the main thread, this would be an extra burden to implement...

4) GC_THREADS is #define'd both in ECL and in GC headers. This seems wrong to me.

Thanks, Dima
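The fallback mentioned in question 3, always booting from the main thread, could be enforced on the Python side roughly as follows. This is a sketch under assumptions: `fake_cl_boot` is a hypothetical placeholder for the wrapper's real call into cl_boot(), `ensure_booted` is an invented helper, and `threading.main_thread()` requires Python 3.4 or later (the traces above come from Python 2.7).

```python
import threading

_boot_lock = threading.Lock()
_booted = False


def fake_cl_boot():
    """Hypothetical placeholder for the wrapper's real call into cl_boot()."""
    return True


def ensure_booted():
    """Boot once, and only from the main thread.

    Returns True if this call performed the boot, False if it was already done.
    Raises RuntimeError when the first boot is attempted from another thread.
    """
    global _booted
    with _boot_lock:
        if _booted:
            return False  # already initialised, nothing to do
        if threading.current_thread() is not threading.main_thread():
            raise RuntimeError("boot must happen in the main thread first")
        fake_cl_boot()
        _booted = True
        return True
```

Calling `ensure_booted()` at import time of the wrapper module, which normally happens in the main thread, keeps later calls from worker threads from triggering the boot themselves. Those threads would still need the import step discussed earlier.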

...

Best F.

This looks to me as a lack of thread safety on ECL side, although I might be wrong. ... frame #16: 0x000000088444b9d6 libecl.so.16.1`si_serror(narg=6, cformat=0x0000000000d27ba0, eformat=0x00000008847d12a0) at error.d:549 frame #17: 0x000000088448bd42 libecl.so.16.1`ecl_cs_overflow at stacks.d:76 frame #18: 0x00000008844168af libecl.so.16.1`ecl_interpret(frame=0x00007fffdeff2658, env=0x0000000000000001, bytecodes=0x0000000000db33c0) at interpreter.d:286 frame #19: 0x0000000884414afc libecl.so.16.1`ecl_apply_from_stack_frame(frame=0x00007fffdeff2658, x=0x0000000000db33c0) at eval.d:79 frame #20: 0x000000088441545b libecl.so.16.1`cl_apply(narg=0, fun=0x0000000000db33c0, lastarg=0x0000000000000001) at eval.d:164 frame #21: 0x0000000883e0e1b4 ecl.so`__pyx_f_4sage_4libs_3ecl_ecl_safe_funcall(__pyx_v_func=0x0000000000769600, __pyx_v_arg=0x0000000000e6dfa0) at ecl.c:5831 frame #22: 0x0000000883e0d519 ecl.so`__pyx_f_4sage_4libs_3ecl_ecl_safe_read_string(__pyx_v_s="(setf *load-verbose* NIL)") at ecl.c:6084 frame #23: 0x0000000883e0d02b ecl.so`__pyx_f_4sage_4libs_3ecl_ecl_eval(__pyx_v_s=0x0000000882add970, __pyx_skip_dispatch=0) at ecl.c:10682 frame #24: 0x0000000883e0cd4c ecl.so`__pyx_pf_4sage_4libs_3ecl_10ecl_eval(__pyx_self=0x0000000000000000, __pyx_v_s=0x0000000882add970) at ecl.c:10762 frame #25: 0x0000000883e0cab7 ecl.so`__pyx_pw_4sage_4libs_3ecl_11ecl_eval(__pyx_self=0x0000000000000000, __pyx_v_s=0x0000000882add970) at ecl.c:10745 frame #26: 0x0000000800d8a68f libpython2.7.so.1`call_function(pp_stack=0x00007fffdeff2c00, oparg=1) at ceval.c:4340 frame #27: 0x0000000800d854d2 libpython2.7.so.1`PyEval_EvalFrameEx(f=0x00000008829939b0, throwflag=0) at ceval.c:2989 ... 
frame #91: 0x0000000800d88361 libpython2.7.so.1`PyEval_CallObjectWithKeywords(func=0x000000087cdf99e0, arg=0x000000080064e060, kw=0x0000000000000000) at ceval.c:4221 frame #92: 0x0000000800de60d1 libpython2.7.so.1`t_bootstrap(boot_raw=0x0000000807015598) at threadmodule.c:620 frame #93: 0x00000008012d3b55 libthr.so.3`___lldb_unnamed_symbol1$$libthr.so.3 + 325

...
...
Thanks, Dima

...
Regards,

Daniel

...
On 04.09.2017 12:04, Dima Pasechnik wrote:

On Fri, Sep 1, 2017 at 1:57 PM, Daniel Kochmański <daniel@turtleware.eu> wrote:

...
I dont think its related to shared vs static - rather two gc running concurrently. Try commenting out GC_init call in ecl and see what happens.

I don't understand how two GCs can run concurrently on a memory region controlled by ECL which is statically linked to GC... In fact I am pretty sure no other instances of GC are running anywhere within our process tree.

By the way, I don't know whether it's obvious from the backtrace that cl_boot() has been completed, or not.

If it actually was completed, could it be a bug that invalidates the bit indicating that cl_boot() has been done?

We have seen similar troubles with clang recently, related to FPE. There an FPE bit was flipped by assignment of a double to an integer type (sic!). It took us a lot of head banging on various hard surfaces to debug this: https://trac.sagemath.org/ticket/22799 it turned out we did hit a known bug: https://bugs.llvm.org//show_bug.cgi?id=17686

...
Do you need sigchld for anything? Run-program was rewritten and sigchld handling wasnt viable option anymore for it.

We do set ECL_OPT_TRAP_SIGCHLD to 0, thus I presume we now can simply skip it all together.

Thanks, Dima

...
Im on phone, will be avail after the weekend.

Regards, D.

On 1 September 2017 14:47:57 CEST, Dima Pasechnik <dimpase+ecl@gmail.com> wrote:
>
> Hi Daniel,
> Thanks for the message. The scenario you talk about only happens if GC
> is a shared library, right?
>
> I've rebuilt GC disabling shared libs, and ECL doing static linking to GC.
> And I still get very similar segfaults:
>
> ;;; ECL C Backtrace
> ;;; 0 ecl_internal_error (0x87d79b375)
> ;;; 1 init_unixint (0x87d7c17e0)
> ;;; 2 init_unixint (0x87d7c1582)
> ;;; 3 pthread_sigmask (0x80103779d)
> ;;; 4 pthread_getspecific (0x801036d6f)
> ;;; 5 unknown (0x7ffffffff193)
> ;;; 6 GC_push_current_stack (0x87d7ef7c3)
> ;;; 7 GC_with_callee_saves_pushed (0x87d7f7360)
> ;;; 8 GC_push_roots (0x87d7ef9c2)
> ;;; 9 GC_mark_some (0x87d7ec97c)
> ;;; 10 GC_stopped_mark (0x87d7e6b7a)
> ;;; 11 GC_try_to_collect_inner (0x87d7e6a75)
> ;;; 12 GC_init (0x87d7f08ea)
> ;;; 13 init_alloc (0x87d7d5669)
> ;;; 14 cl_boot (0x87d69f66b)
> ...
>
> And a very similar picture on the develop branch of ECL - although
> I had to change our code, as in particular
> ECL_OPT_TRAP_SIGCHLD is gone...
>
> So, what can it be? Some signals issue?
>
> Thanks,
> Dima
>
> On Fri, Sep 1, 2017 at 7:38 AM, Daniel Kochmański <daniel@turtleware.eu> wrote:
>>
>> Hey Dima,
>>
>> this looks like the issue with having GC initialized before ECL kicks in.
>> See https://gitlab.com/embeddable-common-lisp/ecl/issues/371 for a
>> discussion about this problem. Basically some other component already
>> called GC_init and ECL calls it once more. It's arguably not a bug.
>>
>> Best regards,
>>
>> Daniel
>>
>>> On 31.08.2017 15:29, Dima Pasechnik wrote:
>>>
>>> Dear all,
>>>
>>> I'm struggling to understand strange segfaults coming from
>>> ECL(+Maxima) on FreeBSD embedded into Python; they typically look as
>>> follows:
>>>
>>> Got signal before environment was installed on our thread
>>> [2: No such file or directory]
>>>
>>> ;;; ECL C Backtrace
>>> ;;; 0 ecl_internal_error (0x87d790765)
>>> ;;; 1 init_unixint (0x87d7b6bd0)
>>> ;;; 2 init_unixint (0x87d7b6972)
>>> ;;; 3 pthread_sigmask (0x80103779d)
>>> ;;; 4 pthread_getspecific (0x801036d6f)
>>> ;;; 5 unknown (0x7ffffffff193)
>>> ;;; 6 GC_push_all_stacks (0x87db1ea2c)
>>> ;;; 7 GC_mark_some (0x87db12eec)
>>> ;;; 8 GC_stopped_mark (0x87db09baa)
>>> ;;; 9 GC_try_to_collect_inner (0x87db09a75)
>>> ;;; 10 GC_init (0x87db16f4f)
>>> ;;; 11 init_alloc (0x87d7caa59)
>>> ;;; 12 cl_boot (0x87d694a5b)
>>> ;;; 13 initecl (0x87d218340)
>>> ;;; 14 initecl (0x87d20a43f)
>>> ;;; 15 initecl (0x87d207e28)
>>> ;;; 16 _PyImport_LoadDynamicModule (0x800b3ed1c)
>>> ;;; 17 PyImport_AppendInittab (0x800b3d71f)
>>> ;;; 18 PyImport_AppendInittab (0x800b3d1a8)
>>> ;;; 19 PyImport_ImportModuleLevel (0x800b3c2ce)
>>> ;;; 20 _PyBuiltin_Init (0x800b162d7)
>>> ;;; 21 PyObject_Call (0x800a7d3e3)
>>> ;;; 22 PyEval_EvalFrameEx (0x800b2121c)
>>> ;;; 23 PyEval_EvalCodeEx (0x800b1b5d4)
>>> ;;; 24 PyEval_EvalCode (0x800b1ad96)
>>> ;;; 25 PyImport_ExecCodeModuleEx (0x800b3ad11)
>>> ;;; 26 PyImport_AppendInittab (0x800b3ddb8)
>>> ;;; 27 PyImport_AppendInittab (0x800b3d71f)
>>> ;;; 28 PyImport_AppendInittab (0x800b3d1a8)
>>> ;;; 29 PyImport_ImportModuleLevel (0x800b3c2ce)
>>> ;;; 30 _PyBuiltin_Init (0x800b162d7)
>>> ;;; 31 PyEval_EvalFrameEx (0x800b22dd1)
>>> Segmentation fault (core dumped)
>>>
>>> It looks as if ECL (version 16.1.2) is being called before an
>>> initialisation is complete, but it it possible to say more without a
>>> debugger?
>>>
>>> More details: is is on FreeBSD 11.0, clang 3.8.0, GC version 7.6.0
>>> with libatomic_ops version 7.4.6.
>>> And only reproducible on FreeBSD.
>>>
>>> ECL is built with --disable-threads; GC is built with or without
>>> threads---result is still the same.
>>> (so it's unclear to me where pthread_* calls in the trace come from).
>>>
>>> Thanks,
>>> Dima
>>>
>>> PS. the segfault is at the bottom of
>>> https://trac.sagemath.org/ticket/22679#comment:87

Daniel Kochmański

12:14 p.m.

...

On 22.09.2017 13:31, Dima Pasechnik wrote:

> No, we have not done that before, and everything worked on Linux and OSX, and even on Cygwin (that is to say, we were lucky with the thread implementations on these platforms, depending on some sort of undefined behaviour). Now I am trying ecl_import_current_thread/ecl_release_current_thread on FreeBSD, and it certainly appears to be the right direction, but I have a couple of questions, at least one of them related to signal handling.
>
> 0) Any advice on signal flags to be set to certain values? Namely, ECL_OPT_SIGNAL_HANDLING_THREAD and ECL_OPT_THREAD_INTERRUPT_SIGNAL? They seem to affect the setup quite a bit; after some trial and error, setting the former to 1 and the latter to 67 (probably an OS-specific value) seems to have done the trick...
>
> 1) As ECL must be built with --enable-threads, does it mean that it will also try to spawn threads on its own? (So far we have always used --disable-threads; for debugging purposes I'd rather not let ECL run its own threads.)
>
> [I'd say this is a documentation issue, too, as it's not clear what exactly --enable-threads is doing: enabling ECL's own threads, or enabling ECL embedding in a multithreaded program, or both?]

--enable-threads gives ECL the ability to use threads (i.e. the programmer may create their own threads in a Lisp program). ECL may be embedded in either scenario (with threads enabled or disabled), but if threads are disabled for ECL, then it is the calling program's responsibility to ensure that ECL is accessed synchronously (i.e. no two ECL functions are called at the same time).

ECL_OPT_SIGNAL_HANDLING_THREAD is a flag which, when set to 1, makes ECL create a separate thread for handling signals (so ECL in that case runs two threads). If ECL has threads disabled, this flag does nothing.
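
To make the point about synchronous access concrete, here is a minimal C sketch of one way a host program might funnel every call into a non-thread-safe runtime through a single pthread mutex. It is only an illustration: runtime_lock, unsafe_runtime_call, serialized_runtime_call and run_workers are hypothetical names, not part of ECL, and the counter merely stands in for whatever state the library keeps internally.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* One process-wide lock guarding every entry into the embedded runtime. */
static pthread_mutex_t runtime_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical stand-in for a call into a non-thread-safe library:
 * an unprotected read-modify-write that can lose updates when two
 * threads interleave. */
static long shared_counter = 0;
static void unsafe_runtime_call(void) { shared_counter = shared_counter + 1; }

/* Every thread goes through this wrapper instead of calling the
 * runtime directly, so at most one thread is inside it at a time. */
void serialized_runtime_call(void)
{
    int rc = pthread_mutex_lock(&runtime_lock);
    assert(rc == 0);
    unsafe_runtime_call();
    rc = pthread_mutex_unlock(&runtime_lock);
    assert(rc == 0);
    (void)rc;
}

enum { N_THREADS = 4, CALLS_PER_THREAD = 10000 };

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < CALLS_PER_THREAD; i++)
        serialized_runtime_call();
    return NULL;
}

/* Starts the workers, waits for all of them, and returns the counter.
 * The counter is never reset, so a second call keeps accumulating. */
long run_workers(void)
{
    pthread_t threads[N_THREADS];
    for (int i = 0; i < N_THREADS; i++) {
        int rc = pthread_create(&threads[i], NULL, worker, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < N_THREADS; i++)
        pthread_join(threads[i], NULL);
    return shared_counter;
}
```

Calling run_workers() once from main() should return N_THREADS * CALLS_PER_THREAD, because the lock keeps increments from interleaving. A single global lock is the bluntest option; it costs concurrency but matches the "one caller at a time" requirement described above.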

...

> 2) For some reason, calling ecl_release_current_thread() leads to a nasty crash, with lines like
>
> frame #299974: 0x0000000883a52463 libecl.so.16.1`FElibc_error(msg="", narg=0) at error.d:490
> frame #299975: 0x0000000883ab3e2c libecl.so.16.1`ecl_process_env at process.d:70
> frame #299976: 0x0000000883aba9d4 libecl.so.16.1`ecl_alloc_compact_object(t=t_base_string, extra_space=12) at alloc_2.d:622
> frame #299977: 0x0000000883a8c782 libecl.so.16.1`ecl_alloc_simple_vector(l=11, aet=ecl_aet_bc) at array.d:585
> frame #299978: 0x0000000883a5331d libecl.so.16.1`make_base_string_copy(s="No error: 0") at string.d:136
> frame #299979: 0x0000000883a52320 libecl.so.16.1`_ecl_strerror(code=0) at error.d:475
> frame #299980: 0x0000000883a52463
>
> repeating endlessly in the backtrace. Must it be called at all? (The test program in the examples you pointed at does work for me, with a few makefile changes...)
>
> 3) How does one call cl_boot() in such a multithreaded setting? I tried merely putting the call to ecl_import_current_thread() before the call to cl_boot(), but I get an error from GC: "Threads explicit registering is not previously enabled", and the program aborts. Without ecl_import_current_thread(), cl_boot() succeeds in the "main" thread, but dumps core if invoked from another thread (this is the behaviour you mistook for another instance of GC kicking in).
>
> While we can probably live with cl_boot() always being called in the main thread, this would be an extra burden to implement...

ecl_import_current_thread and ecl_release_current_thread need to be called in threads that call ECL and are different from the thread where cl_boot is called. For the thread where cl_boot was called, you don't call import/release; they are implicit in cl_boot / cl_shutdown. You may run cl_boot in a thread other than the main one; that's not the issue.
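
The call ordering described here can be sketched in C roughly as follows. This is a sketch under assumptions, not a tested recipe for FreeBSD: when compiled with -DUSE_ECL and linked against a threads-enabled libecl it uses the real cl_boot, cl_shutdown, ecl_import_current_thread and ecl_release_current_thread; otherwise it falls back to counting stand-ins (hypothetical, not ECL code) so that only the ordering can be exercised. The names lisp_worker and run_embedded are made up for the example, and whether the garbage collector needs further thread-registration setup on a given platform is not covered.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#ifdef USE_ECL
#include <ecl/ecl.h>
#else
/* Stand-ins (NOT the ECL implementation): they only count calls. */
typedef void *cl_object;
#define ECL_NIL ((cl_object)0)
static int n_boot, n_import, n_release, n_shutdown;
static int cl_boot(int argc, char **argv)
{ (void)argc; (void)argv; return ++n_boot; }
static void cl_shutdown(void) { ++n_shutdown; }
static int ecl_import_current_thread(cl_object name, cl_object bindings)
{ (void)name; (void)bindings; ++n_import; return 1; }
static void ecl_release_current_thread(void) { ++n_release; }
#endif

/* A thread other than the one that ran cl_boot(): it registers itself,
 * would do its Lisp work, and unregisters before exiting. */
static void *lisp_worker(void *arg)
{
    (void)arg;
    if (!ecl_import_current_thread(ECL_NIL, ECL_NIL))
        return NULL;            /* registration failed: do not touch ECL */
    /* ... calls into ECL would go here ... */
    ecl_release_current_thread();
    return NULL;
}

/* The booting thread: no import/release here, since cl_boot() and
 * cl_shutdown() are said to cover it implicitly. */
int run_embedded(int argc, char **argv)
{
    pthread_t t;
    cl_boot(argc, argv);
    if (pthread_create(&t, NULL, lisp_worker, NULL) != 0) {
        cl_shutdown();
        return -1;
    }
    pthread_join(t, NULL);
    cl_shutdown();
    return 0;
}
```

The booting thread does not have to be the process's main thread; what matters in this sketch is that import/release bracket ECL use only in the other threads, and that cl_boot() has returned before any of them starts.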

...

> 4) GC_THREADS is #define'd both in ECL and in GC headers. This seems wrong to me.

No idea why it is that way. I'll keep in mind to investigate that.

Regards, Daniel

Dima Pasechnik

26 Sep 2017

5:57 p.m.

I am discussing the issue with the GC people here: https://github.com/ivmai/bdwgc/issues/180

On Fri, Sep 22, 2017 at 1:14 PM, Daniel Kochmański <daniel@turtleware.eu> wrote:

...

