On Tue, Sep 12, 2017 at 1:18 AM, Fabrizio Fabbri <strabixbox@yahoo.com> wrote:
>
>> On Sep 11, 2017, at 7:13 PM, Dima Pasechnik <dimpase+ecl@gmail.com> wrote:
>>
>>> On Mon, Sep 4, 2017 at 11:15 AM, Daniel Kochmański <daniel@turtleware.eu> wrote:
>>> From the backtrace it is clear that the failure occurs inside the call to
>>> GC_init. Such errors are known to happen when another GC has already
>>> been initialized on the system (I've linked the issue). It might be
>>> caused by something else in bdwgc, I don't know. Either way I'd focus on
>>> the GC_init part.
>>
>> Our project (sagemath) only uses libgc within the embedded ECL. Thus I
>> am really puzzled how another libgc instance might kick in and spoil
>> the game for ECL.
>>
>> One possibility is that clang is using libgc, and thus, in principle,
>> libgc might be sitting somewhere in the runtime?!
>>
>>
>>>
>>> To check that my assertion is right, you could put a printf before and
>>> after the call to GC_init. I'm not familiar enough with bdwgc internals to
>>> say what is wrong, though. Maybe updating the bundled GC sources will help?
>>> Or linking with the libgc on the system? It might be a bug in bdwgc
>>> that has already been fixed.
>>
>> We are not using the bdwgc shipped with ECL, we use a separate libgc
>> 7.6.0, which is the latest stable.
>> (Is there a reason to ship bdwgc sources with ECL - do you patch it?)
>>
>
> I'm using ECL with the non-embedded bdwgc as well, and I don't have any issues.
>
> Make sure that bdwgc is not also built statically into ECL. I would expect linking problems in that case, but it is worth double-checking.

Here is part of a stack trace from the debugger, in a scenario where
a call to embedded ECL from Python leads to an ECL stack overflow on
an already initialised ECL. It seems to be related to the particular
thread the call comes from (another, usual, calling sequence
does not lead to crashes). There is no mention of GC in the stack trace.

This looks to me like a lack of thread safety on the ECL side, although I might be wrong.
...
frame #16: 0x000000088444b9d6 libecl.so.16.1`si_serror(narg=6, cformat=0x0000000000d27ba0, eformat=0x00000008847d12a0) at error.d:549
frame #17: 0x000000088448bd42 libecl.so.16.1`ecl_cs_overflow at stacks.d:76
frame #18: 0x00000008844168af libecl.so.16.1`ecl_interpret(frame=0x00007fffdeff2658, env=0x0000000000000001, bytecodes=0x0000000000db33c0) at interpreter.d:286
frame #19: 0x0000000884414afc libecl.so.16.1`ecl_apply_from_stack_frame(frame=0x00007fffdeff2658, x=0x0000000000db33c0) at eval.d:79
frame #20: 0x000000088441545b libecl.so.16.1`cl_apply(narg=0, fun=0x0000000000db33c0, lastarg=0x0000000000000001) at eval.d:164
frame #21: 0x0000000883e0e1b4 ecl.so`__pyx_f_4sage_4libs_3ecl_ecl_safe_funcall(__pyx_v_func=0x0000000000769600, __pyx_v_arg=0x0000000000e6dfa0) at ecl.c:5831
frame #22: 0x0000000883e0d519 ecl.so`__pyx_f_4sage_4libs_3ecl_ecl_safe_read_string(__pyx_v_s="(setf *load-verbose* NIL)") at ecl.c:6084
frame #23: 0x0000000883e0d02b ecl.so`__pyx_f_4sage_4libs_3ecl_ecl_eval(__pyx_v_s=0x0000000882add970, __pyx_skip_dispatch=0) at ecl.c:10682
frame #24: 0x0000000883e0cd4c ecl.so`__pyx_pf_4sage_4libs_3ecl_10ecl_eval(__pyx_self=0x0000000000000000, __pyx_v_s=0x0000000882add970) at ecl.c:10762
frame #25: 0x0000000883e0cab7 ecl.so`__pyx_pw_4sage_4libs_3ecl_11ecl_eval(__pyx_self=0x0000000000000000, __pyx_v_s=0x0000000882add970) at ecl.c:10745
frame #26: 0x0000000800d8a68f libpython2.7.so.1`call_function(pp_stack=0x00007fffdeff2c00, oparg=1) at ceval.c:4340
frame #27: 0x0000000800d854d2 libpython2.7.so.1`PyEval_EvalFrameEx(f=0x00000008829939b0, throwflag=0) at ceval.c:2989
...
frame #91: 0x0000000800d88361 libpython2.7.so.1`PyEval_CallObjectWithKeywords(func=0x000000087cdf99e0, arg=0x000000080064e060, kw=0x0000000000000000) at ceval.c:4221
frame #92: 0x0000000800de60d1 libpython2.7.so.1`t_bootstrap(boot_raw=0x0000000807015598) at threadmodule.c:620
frame #93: 0x00000008012d3b55 libthr.so.3`___lldb_unnamed_symbol1$$libthr.so.3 + 325
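
If the overflow really depends on which thread makes the call, one way to test that would be to funnel every call into the embedded interpreter through a single dedicated thread and see whether the crash goes away. Below is a minimal sketch of that idea in plain Python 3, not anything Sage or ECL actually provides: SingleThreadExecutor and fake_ecl_eval are hypothetical names, and fake_ecl_eval merely stands in for the real sage.libs.ecl.ecl_eval seen in the trace.

```python
import queue
import threading
from concurrent.futures import Future


class SingleThreadExecutor:
    """Hypothetical helper: run every submitted callable on one worker thread."""

    def __init__(self):
        self._tasks = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        # Everything executed here runs on the same OS thread.
        while True:
            item = self._tasks.get()
            if item is None:  # shutdown sentinel
                break
            future, func, args = item
            try:
                future.set_result(func(*args))
            except Exception as exc:
                # Hand the error back to the thread that made the request.
                future.set_exception(exc)

    def call(self, func, *args):
        """Queue func(*args) for the worker and block until it has finished."""
        future = Future()
        self._tasks.put((future, func, args))
        return future.result()

    def shutdown(self):
        self._tasks.put(None)
        self._worker.join()


def fake_ecl_eval(form):
    # Stand-in (assumption) for the real ecl_eval: it only reports which
    # thread evaluated the form, so the routing can be checked.
    return (threading.get_ident(), form)
```

With this in place, a caller on any thread would write something like executor.call(fake_ecl_eval, "(setf *load-verbose* NIL)"), and the evaluation would always happen on the worker. For the experiment to be meaningful, the interpreter would presumably have to be booted on that same worker thread as well.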



>
>> Thanks,
>> Dima
>>
>>>
>>> Regards,
>>>
>>> Daniel
>>>
>>>
>>>
>>>> On 04.09.2017 12:04, Dima Pasechnik wrote:
>>>>
>>>> On Fri, Sep 1, 2017 at 1:57 PM, Daniel Kochmański <daniel@turtleware.eu>
>>>> wrote:
>>>>>
>>>>> I don't think it's related to shared vs static, but rather to two GCs
>>>>> running concurrently. Try commenting out the GC_init call in ECL and see
>>>>> what happens.
>>>>
>>>> I don't understand how two GCs can run concurrently on a memory region
>>>> controlled by ECL which is statically linked to GC...
>>>> In fact I am pretty sure no other instances of GC are running anywhere
>>>> within our process tree.
>>>>
>>>> By the way, I don't know whether it's obvious from the backtrace that
>>>> cl_boot() has been completed, or not.
>>>>
>>>> If it actually was completed, could it be a bug that invalidates the
>>>> bit indicating that cl_boot() has been done?
>>>>
>>>> We have seen similar troubles with clang recently, related to FPE.
>>>> There an FPE bit was flipped by assignment of a double to an
>>>> integer type (sic!).
>>>> It took us a lot of head banging on various hard surfaces to debug this:
>>>> https://trac.sagemath.org/ticket/22799
>>>> It turned out we had hit a known bug:
>>>> https://bugs.llvm.org//show_bug.cgi?id=17686
>>>>
>>>>> Do you need SIGCHLD for anything? Run-program was rewritten, and SIGCHLD
>>>>> handling was no longer a viable option for it.
>>>>>
>>>> We do set ECL_OPT_TRAP_SIGCHLD to 0, thus I presume we
>>>> can now simply skip it altogether.
>>>>
>>>> Thanks,
>>>> Dima
>>>>
>>>>> I'm on my phone, will be available after the weekend.
>>>>>
>>>>> Regards, D.
>>>>>
>>>>>
>>>>> On 1 September 2017 14:47:57 CEST, Dima Pasechnik
>>>>> <dimpase+ecl@gmail.com>
>>>>> wrote:
>>>>>>
>>>>>> Hi Daniel,
>>>>>> Thanks for the message. The scenario you talk about only happens if GC
>>>>>> is a shared library, right?
>>>>>>
>>>>>> I've rebuilt GC disabling shared libs, and ECL doing static linking to
>>>>>> GC.
>>>>>> And I still get very similar segfaults:
>>>>>>
>>>>>> ;;; ECL C Backtrace
>>>>>> ;;; 0 ecl_internal_error (0x87d79b375)
>>>>>> ;;; 1 init_unixint (0x87d7c17e0)
>>>>>> ;;; 2 init_unixint (0x87d7c1582)
>>>>>> ;;; 3 pthread_sigmask (0x80103779d)
>>>>>> ;;; 4 pthread_getspecific (0x801036d6f)
>>>>>> ;;; 5 unknown (0x7ffffffff193)
>>>>>> ;;; 6 GC_push_current_stack (0x87d7ef7c3)
>>>>>> ;;; 7 GC_with_callee_saves_pushed (0x87d7f7360)
>>>>>> ;;; 8 GC_push_roots (0x87d7ef9c2)
>>>>>> ;;; 9 GC_mark_some (0x87d7ec97c)
>>>>>> ;;; 10 GC_stopped_mark (0x87d7e6b7a)
>>>>>> ;;; 11 GC_try_to_collect_inner (0x87d7e6a75)
>>>>>> ;;; 12 GC_init (0x87d7f08ea)
>>>>>> ;;; 13 init_alloc (0x87d7d5669)
>>>>>> ;;; 14 cl_boot (0x87d69f66b)
>>>>>> ...
>>>>>>
>>>>>> And a very similar picture on the develop branch of ECL - although
>>>>>> I had to change our code, as in particular
>>>>>> ECL_OPT_TRAP_SIGCHLD is gone...
>>>>>>
>>>>>> So, what can it be? Some signals issue?
>>>>>>
>>>>>> Thanks,
>>>>>> Dima
>>>>>>
>>>>>> On Fri, Sep 1, 2017 at 7:38 AM, Daniel Kochmański <daniel@turtleware.eu>
>>>>>> wrote:
>>>>>>>
>>>>>>> Hey Dima,
>>>>>>>
>>>>>>> this looks like the issue with having GC initialized before ECL kicks
>>>>>>> in.
>>>>>>> See https://gitlab.com/embeddable-common-lisp/ecl/issues/371 for a
>>>>>>> discussion about this problem. Basically some other component already
>>>>>>> called
>>>>>>> GC_init and ECL calls it once more. It's arguably not a bug.
>>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>> Daniel
>>>>>>>
>>>>>>>
>>>>>>>> On 31.08.2017 15:29, Dima Pasechnik wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Dear all,
>>>>>>>>
>>>>>>>> I'm struggling to understand strange segfaults coming from
>>>>>>>> ECL(+Maxima) on FreeBSD embedded into Python; they typically look as
>>>>>>>> follows:
>>>>>>>>
>>>>>>>> Got signal before environment was installed on our thread
>>>>>>>> [2: No such file or directory]
>>>>>>>>
>>>>>>>> ;;; ECL C Backtrace
>>>>>>>> ;;; 0 ecl_internal_error (0x87d790765)
>>>>>>>> ;;; 1 init_unixint (0x87d7b6bd0)
>>>>>>>> ;;; 2 init_unixint (0x87d7b6972)
>>>>>>>> ;;; 3 pthread_sigmask (0x80103779d)
>>>>>>>> ;;; 4 pthread_getspecific (0x801036d6f)
>>>>>>>> ;;; 5 unknown (0x7ffffffff193)
>>>>>>>> ;;; 6 GC_push_all_stacks (0x87db1ea2c)
>>>>>>>> ;;; 7 GC_mark_some (0x87db12eec)
>>>>>>>> ;;; 8 GC_stopped_mark (0x87db09baa)
>>>>>>>> ;;; 9 GC_try_to_collect_inner (0x87db09a75)
>>>>>>>> ;;; 10 GC_init (0x87db16f4f)
>>>>>>>> ;;; 11 init_alloc (0x87d7caa59)
>>>>>>>> ;;; 12 cl_boot (0x87d694a5b)
>>>>>>>> ;;; 13 initecl (0x87d218340)
>>>>>>>> ;;; 14 initecl (0x87d20a43f)
>>>>>>>> ;;; 15 initecl (0x87d207e28)
>>>>>>>> ;;; 16 _PyImport_LoadDynamicModule (0x800b3ed1c)
>>>>>>>> ;;; 17 PyImport_AppendInittab (0x800b3d71f)
>>>>>>>> ;;; 18 PyImport_AppendInittab (0x800b3d1a8)
>>>>>>>> ;;; 19 PyImport_ImportModuleLevel (0x800b3c2ce)
>>>>>>>> ;;; 20 _PyBuiltin_Init (0x800b162d7)
>>>>>>>> ;;; 21 PyObject_Call (0x800a7d3e3)
>>>>>>>> ;;; 22 PyEval_EvalFrameEx (0x800b2121c)
>>>>>>>> ;;; 23 PyEval_EvalCodeEx (0x800b1b5d4)
>>>>>>>> ;;; 24 PyEval_EvalCode (0x800b1ad96)
>>>>>>>> ;;; 25 PyImport_ExecCodeModuleEx (0x800b3ad11)
>>>>>>>> ;;; 26 PyImport_AppendInittab (0x800b3ddb8)
>>>>>>>> ;;; 27 PyImport_AppendInittab (0x800b3d71f)
>>>>>>>> ;;; 28 PyImport_AppendInittab (0x800b3d1a8)
>>>>>>>> ;;; 29 PyImport_ImportModuleLevel (0x800b3c2ce)
>>>>>>>> ;;; 30 _PyBuiltin_Init (0x800b162d7)
>>>>>>>> ;;; 31 PyEval_EvalFrameEx (0x800b22dd1)
>>>>>>>> Segmentation fault (core dumped)
>>>>>>>>
>>>>>>>> It looks as if ECL (version 16.1.2) is being called before its
>>>>>>>> initialisation is complete, but is it possible to say more without a
>>>>>>>> debugger?
>>>>>>>>
>>>>>>>> More details: this is on FreeBSD 11.0, clang 3.8.0, GC version 7.6.0
>>>>>>>> with libatomic_ops version 7.4.6.
>>>>>>>> It is only reproducible on FreeBSD.
>>>>>>>>
>>>>>>>> ECL is built with --disable-threads; GC is built with or without
>>>>>>>> threads, and the result is the same either way
>>>>>>>> (so it's unclear to me where the pthread_* calls in the trace
>>>>>>>> come from).
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Dima
>>>>>>>>
>>>>>>>> PS. the segfault is at the bottom of
>>>>>>>> https://trac.sagemath.org/ticket/22679#comment:87