Re: "Got signal before environment was installed on our thread"

21 Sep 2017

...
On Sep 21, 2017, at 8:31 AM, Dima Pasechnik <dimpase+ecl@gmail.com> wrote:
On Tue, Sep 12, 2017 at 1:18 AM, Fabrizio Fabbri <strabixbox@yahoo.com> wrote:
...
...
On Sep 11, 2017, at 7:13 PM, Dima Pasechnik <dimpase+ecl@gmail.com> wrote:
...
On Mon, Sep 4, 2017 at 11:15 AM, Daniel Kochmański <daniel@turtleware.eu> wrote:
The backtrace makes it clear that the failure happens inside the call to
GC_init. Such errors are known to have happened when another GC was
already initialized on the system (I've linked the issue). It might be
caused by something else in bdwgc, I don't know. Either way I'd focus on
the GC_init part.
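One way to check this mechanically, rather than by eye, is to list the frames that sit above GC_init in the ECL C backtrace. What follows is a minimal sketch in Python, assuming the ";;; N name (0xaddr)" line format seen in the traces in this thread. The helper names parse_frames and frames_inside are made up for illustration and are not part of ECL; the sample is the top of the trace from the original report.

```python
import re

# One frame per line, e.g. ";;; 10 GC_init (0x87db16f4f)".
# Leading mail quote markers (">", ">>>") are tolerated.
FRAME = re.compile(r"^[>\s]*;;;\s+(\d+)\s+(\S+)\s+\((0x[0-9a-fA-F]+)\)\s*$")


def parse_frames(text):
    """Return [(index, function, address), ...] for every frame line found."""
    frames = []
    for line in text.splitlines():
        match = FRAME.match(line)
        if match:
            frames.append((int(match.group(1)), match.group(2), match.group(3)))
    return frames


def frames_inside(text, function):
    """Names of the frames listed above the first `function` frame,
    i.e. everything that was running on behalf of that call."""
    frames = parse_frames(text)
    for position, (_, name, _) in enumerate(frames):
        if name == function:
            return [inner for _, inner, _ in frames[:position]]
    return []


SAMPLE = """\
;;; ECL C Backtrace
;;; 0 ecl_internal_error (0x87d790765)
;;; 1 init_unixint (0x87d7b6bd0)
;;; 2 init_unixint (0x87d7b6972)
;;; 3 pthread_sigmask (0x80103779d)
;;; 4 pthread_getspecific (0x801036d6f)
;;; 5 unknown (0x7ffffffff193)
;;; 6 GC_push_all_stacks (0x87db1ea2c)
;;; 7 GC_mark_some (0x87db12eec)
;;; 8 GC_stopped_mark (0x87db09baa)
;;; 9 GC_try_to_collect_inner (0x87db09a75)
;;; 10 GC_init (0x87db16f4f)
;;; 11 init_alloc (0x87d7caa59)
;;; 12 cl_boot (0x87d694a5b)
"""

# Everything from the internal error down to the collector was
# reached from within GC_init, which cl_boot had called.
print(frames_inside(SAMPLE, "GC_init"))
```

If ecl_internal_error appears in that list, the signal was raised while GC_init was still running, which is the point made above.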
Our project (sagemath) only uses libgc within the embedded ECL. Thus I
am really puzzled how another libgc instance might kick in and spoil
the game for ECL.
One possibility is that clang is using libgc, and thus, in principle,
libgc might be sitting somewhere in the runtime?!
...
To make sure that my assertion is right, you may put a printf before and
after the call to GC_init. I'm not familiar enough with bdwgc internals to say
what is wrong, though. Maybe updating the bundled sources of GC will help? Or
linking with the libgc on the system? It might be that it was a bug in bdwgc
which has already been fixed.
We are not using the bdwgc shipped with ECL, we use a separate libgc
7.6.0, which is the latest stable.
(Is there a reason to ship bdwgc sources with ECL - do you patch it?)
I'm using ECL with the non-embedded bdwgc as well and I don't have any issues.
Ensure that bdwgc is not also built statically into ECL. I would expect linking problems in that case, but it is worth a double check.
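A rough way to do that double check is to look at how GC_init shows up in the dynamic symbol table of the ECL shared library: nm prints a defined symbol with a type letter such as T, and a symbol that is only imported from libgc with U. The following is a sketch in Python, assuming nm is installed and accepts -D; the helper names are hypothetical, and the exact output can vary between toolchains.

```python
import subprocess


def parse_nm(nm_output, symbol):
    """Return the set of nm type letters reported for `symbol`.

    Defined symbols look like "<address> <type> <name>",
    undefined ones like "U <name>".
    """
    kinds = set()
    for line in nm_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[-1] == symbol:
            kinds.add(fields[-2])
    return kinds


def gc_init_is_bundled(library_path):
    """True if the library itself defines GC_init in its text section."""
    result = subprocess.run(
        ["nm", "-D", library_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return bool(parse_nm(result.stdout, "GC_init") & {"T", "t"})
```

Calling gc_init_is_bundled on the path of your own libecl build returns True when a copy of the collector was linked into ECL, and False when ECL only refers to GC_init from an external libgc.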
Here is part of a stack trace from the debugger, from a scenario where
a call to embedded ECL from Python leads to an ECL stack overflow on
an already initialised ECL. It seems to be related to the particular thread this call comes from (another, usual, calling sequence
does not lead to crashes). There is no mention of GC in the stack trace.
If the current thread was created outside the Lisp environment, you need to import it before calling any ECL function.
That is done with
ecl_import_current_thread
ecl_release_current_thread

You can see an example here:
https://gitlab.com/embeddable-common-lisp/ecl/tree/develop/examples/threads/...

Maybe you already do that, but it is worth mentioning.
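If importing each foreign thread is not practical on the Python side, an alternative is to funnel every call through one dedicated worker thread, so the runtime only ever sees the thread that booted it. This is a different technique from ecl_import_current_thread, and not something ECL or Sage provides. A minimal sketch in Python 3 follows; boot_runtime and eval_form are hypothetical stand-ins for the real binding and only report which thread they ran on.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# A pool with exactly one worker: every submitted job runs on the same thread.
_runtime = ThreadPoolExecutor(max_workers=1)


def boot_runtime():
    """Hypothetical stand-in for booting the embedded runtime."""
    return threading.get_ident()


def eval_form(form):
    """Hypothetical stand-in for evaluating a form; reports the executing thread."""
    return threading.get_ident(), form


def on_runtime_thread(func, *args):
    """Run func(*args) on the dedicated thread and wait for its result."""
    return _runtime.submit(func, *args).result()


if __name__ == "__main__":
    boot_thread = on_runtime_thread(boot_runtime)
    results = []
    # A thread started from Python, like the one at the bottom of the trace.
    caller = threading.Thread(
        target=lambda: results.append(on_runtime_thread(eval_form, "(+ 1 2)"))
    )
    caller.start()
    caller.join()
    # Compare the thread that evaluated the form with the one that booted.
    print(results[0][0] == boot_thread)
```

The trade-off is that all calls are serialised, and calling on_runtime_thread from inside a job that is already running on the dedicated thread would block forever, so nested calls have to invoke the function directly.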

Best
F.
...
This looks to me like a lack of thread safety on the ECL side, although I might be wrong.
...
frame #16: 0x000000088444b9d6 libecl.so.16.1`si_serror(narg=6, cformat=0x0000000000d27ba0, eformat=0x00000008847d12a0) at error.d:549
frame #17: 0x000000088448bd42 libecl.so.16.1`ecl_cs_overflow at stacks.d:76
frame #18: 0x00000008844168af libecl.so.16.1`ecl_interpret(frame=0x00007fffdeff2658, env=0x0000000000000001, bytecodes=0x0000000000db33c0) at interpreter.d:286
frame #19: 0x0000000884414afc libecl.so.16.1`ecl_apply_from_stack_frame(frame=0x00007fffdeff2658, x=0x0000000000db33c0) at eval.d:79
frame #20: 0x000000088441545b libecl.so.16.1`cl_apply(narg=0, fun=0x0000000000db33c0, lastarg=0x0000000000000001) at eval.d:164
frame #21: 0x0000000883e0e1b4 ecl.so`__pyx_f_4sage_4libs_3ecl_ecl_safe_funcall(__pyx_v_func=0x0000000000769600, __pyx_v_arg=0x0000000000e6dfa0) at ecl.c:5831
frame #22: 0x0000000883e0d519 ecl.so`__pyx_f_4sage_4libs_3ecl_ecl_safe_read_string(__pyx_v_s="(setf *load-verbose* NIL)") at ecl.c:6084
frame #23: 0x0000000883e0d02b ecl.so`__pyx_f_4sage_4libs_3ecl_ecl_eval(__pyx_v_s=0x0000000882add970, __pyx_skip_dispatch=0) at ecl.c:10682
frame #24: 0x0000000883e0cd4c ecl.so`__pyx_pf_4sage_4libs_3ecl_10ecl_eval(__pyx_self=0x0000000000000000, __pyx_v_s=0x0000000882add970) at ecl.c:10762
frame #25: 0x0000000883e0cab7 ecl.so`__pyx_pw_4sage_4libs_3ecl_11ecl_eval(__pyx_self=0x0000000000000000, __pyx_v_s=0x0000000882add970) at ecl.c:10745
frame #26: 0x0000000800d8a68f libpython2.7.so.1`call_function(pp_stack=0x00007fffdeff2c00, oparg=1) at ceval.c:4340
frame #27: 0x0000000800d854d2 libpython2.7.so.1`PyEval_EvalFrameEx(f=0x00000008829939b0, throwflag=0) at ceval.c:2989
...
frame #91: 0x0000000800d88361 libpython2.7.so.1`PyEval_CallObjectWithKeywords(func=0x000000087cdf99e0, arg=0x000000080064e060, kw=0x0000000000000000) at ceval.c:4221
frame #92: 0x0000000800de60d1 libpython2.7.so.1`t_bootstrap(boot_raw=0x0000000807015598) at threadmodule.c:620
frame #93: 0x00000008012d3b55 libthr.so.3`___lldb_unnamed_symbol1$$libthr.so.3 + 325
...
...
Thanks,
Dima
...
Regards,
Daniel
...
On 04.09.2017 12:04, Dima Pasechnik wrote:
On Fri, Sep 1, 2017 at 1:57 PM, Daniel Kochmański <daniel@turtleware.eu>
wrote:
...
I don't think it's related to shared vs static; rather, to two GCs running
concurrently. Try commenting out the GC_init call in ECL and see what
happens.
I don't understand how two GCs can run concurrently on a memory region
controlled by ECL which is statically linked to GC...
In fact I am pretty sure no other instances of GC are running anywhere
within our process tree.
By the way, I don't know whether it's obvious from the backtrace that
cl_boot() has been completed, or not.
If it actually was completed, could it be a bug that invalidates the
bit indicating that cl_boot() has been done?
We have seen similar trouble with clang recently, related to FPE.
There, an FPE bit was flipped by the assignment of a double to an
integer type (sic!).
It took us a lot of head banging on various hard surfaces to debug this:
https://trac.sagemath.org/ticket/22799
it turned out we did hit a known bug:
https://bugs.llvm.org//show_bug.cgi?id=17686
...
Do you need SIGCHLD for anything? Run-program was rewritten, and SIGCHLD
handling was no longer a viable option for it.
We do set ECL_OPT_TRAP_SIGCHLD to 0, so I presume we
can now simply skip it altogether.
Thanks,
Dima
...
I'm on my phone; I will be available after the weekend.
Regards, D.
On 1 September 2017 14:47:57 CEST, Dima Pasechnik
<dimpase+ecl@gmail.com>
wrote:
>
> Hi Daniel,
> Thanks for the message. The scenario you talk about only happens if GC
> is a shared library, right?
>
> I've rebuilt GC with shared libs disabled, and ECL statically linked to
> GC.
> And I still get very similar segfaults:
>
> ;;; ECL C Backtrace
> ;;; 0 ecl_internal_error (0x87d79b375)
> ;;; 1 init_unixint (0x87d7c17e0)
> ;;; 2 init_unixint (0x87d7c1582)
> ;;; 3 pthread_sigmask (0x80103779d)
> ;;; 4 pthread_getspecific (0x801036d6f)
> ;;; 5 unknown (0x7ffffffff193)
> ;;; 6 GC_push_current_stack (0x87d7ef7c3)
> ;;; 7 GC_with_callee_saves_pushed (0x87d7f7360)
> ;;; 8 GC_push_roots (0x87d7ef9c2)
> ;;; 9 GC_mark_some (0x87d7ec97c)
> ;;; 10 GC_stopped_mark (0x87d7e6b7a)
> ;;; 11 GC_try_to_collect_inner (0x87d7e6a75)
> ;;; 12 GC_init (0x87d7f08ea)
> ;;; 13 init_alloc (0x87d7d5669)
> ;;; 14 cl_boot (0x87d69f66b)
> ...
>
> And a very similar picture on the develop branch of ECL - although
> I had to change our code, as in particular
> ECL_OPT_TRAP_SIGCHLD is gone...
>
> So, what can it be? Some signals issue?
>
> Thanks,
> Dima
>
> On Fri, Sep 1, 2017 at 7:38 AM, Daniel Kochmański <daniel@turtleware.eu>
> wrote:
>>
>> Hey Dima,
>>
>> this looks like the issue with having GC initialized before ECL kicks in.
>> See https://gitlab.com/embeddable-common-lisp/ecl/issues/371 for a
>> discussion about this problem. Basically some other component already
>> called GC_init and ECL calls it once more. It's arguably not a bug.
>>
>> Best regards,
>>
>> Daniel
>>
>>
>>> On 31.08.2017 15:29, Dima Pasechnik wrote:
>>>
>>>
>>> Dear all,
>>>
>>> I'm struggling to understand strange segfaults coming from
>>> ECL(+Maxima) on FreeBSD embedded into Python; they typically look as
>>> follows:
>>>
>>> Got signal before environment was installed on our thread
>>> [2: No such file or directory]
>>>
>>> ;;; ECL C Backtrace
>>> ;;; 0 ecl_internal_error (0x87d790765)
>>> ;;; 1 init_unixint (0x87d7b6bd0)
>>> ;;; 2 init_unixint (0x87d7b6972)
>>> ;;; 3 pthread_sigmask (0x80103779d)
>>> ;;; 4 pthread_getspecific (0x801036d6f)
>>> ;;; 5 unknown (0x7ffffffff193)
>>> ;;; 6 GC_push_all_stacks (0x87db1ea2c)
>>> ;;; 7 GC_mark_some (0x87db12eec)
>>> ;;; 8 GC_stopped_mark (0x87db09baa)
>>> ;;; 9 GC_try_to_collect_inner (0x87db09a75)
>>> ;;; 10 GC_init (0x87db16f4f)
>>> ;;; 11 init_alloc (0x87d7caa59)
>>> ;;; 12 cl_boot (0x87d694a5b)
>>> ;;; 13 initecl (0x87d218340)
>>> ;;; 14 initecl (0x87d20a43f)
>>> ;;; 15 initecl (0x87d207e28)
>>> ;;; 16 _PyImport_LoadDynamicModule (0x800b3ed1c)
>>> ;;; 17 PyImport_AppendInittab (0x800b3d71f)
>>> ;;; 18 PyImport_AppendInittab (0x800b3d1a8)
>>> ;;; 19 PyImport_ImportModuleLevel (0x800b3c2ce)
>>> ;;; 20 _PyBuiltin_Init (0x800b162d7)
>>> ;;; 21 PyObject_Call (0x800a7d3e3)
>>> ;;; 22 PyEval_EvalFrameEx (0x800b2121c)
>>> ;;; 23 PyEval_EvalCodeEx (0x800b1b5d4)
>>> ;;; 24 PyEval_EvalCode (0x800b1ad96)
>>> ;;; 25 PyImport_ExecCodeModuleEx (0x800b3ad11)
>>> ;;; 26 PyImport_AppendInittab (0x800b3ddb8)
>>> ;;; 27 PyImport_AppendInittab (0x800b3d71f)
>>> ;;; 28 PyImport_AppendInittab (0x800b3d1a8)
>>> ;;; 29 PyImport_ImportModuleLevel (0x800b3c2ce)
>>> ;;; 30 _PyBuiltin_Init (0x800b162d7)
>>> ;;; 31 PyEval_EvalFrameEx (0x800b22dd1)
>>> Segmentation fault (core dumped)
>>>
>>> It looks as if ECL (version 16.1.2) is being called before
>>> initialisation is complete, but is it possible to say more without a
>>> debugger?
>>>
>>> More details: this is on FreeBSD 11.0, clang 3.8.0, GC version 7.6.0
>>> with libatomic_ops version 7.4.6.
>>> It is only reproducible on FreeBSD.
>>>
>>> ECL is built with --disable-threads; GC is built with or without
>>> threads, and the result is the same either way.
>>> (so it's unclear to me where pthread_* calls in the trace
>>> come from).
>>>
>>> Thanks,
>>> Dima
>>>
>>> PS. the segfault is at the bottom of
>>> https://trac.sagemath.org/ticket/22679#comment:87