Any ideas what might trigger cl-postgres (postmodern 1.14) coming to its knees with the backtrace shown below? No error or condition description is given - sbcl (1.0.28) just retires from active duty and drops into the debugger with the ("bogus stack frame") indication. The postgresql server (8.3) log doesn't have any indication of trouble at the time cl-postgres seems to run into trouble. This seems to only happen intermittently (every few thousand queries), irreproducibly (doesn't seem to be associated with a specific query or anything along those lines), and only when the postgresql server is being utilized by several lisp instances concurrently.
Thanks in advance for any input/ideas/troubleshooting suggestions...
- Alan
...
32: ("bogus stack frame")
33: (SB-IMPL::SUB-SUB-SERVE-EVENT NIL NIL)
34: (SB-IMPL::SUB-SERVE-EVENT NIL NIL NIL)
35: (SB-SYS:WAIT-UNTIL-FD-USABLE 5 :INPUT NIL)
36: (SB-IMPL::REFILL-INPUT-BUFFER #<SB-SYS:FD-STREAM for "a socket" {C265BA9}>)
37: (SB-IMPL::INPUT-UNSIGNED-8BIT-BYTE #<SB-SYS:FD-STREAM for "a socket" {C265BA9}> T NIL)
38: (CL-POSTGRES::SKIP-BYTES #<unavailable argument> #<unavailable argument>)
39: (CL-POSTGRES::TRY-TO-SYNC #<SB-SYS:FD-STREAM for "a socket" {C265BA9}> T)
40: ((FLET #:CLEANUP-FUN-[FORM-FUN-[SEND-QUERY]389]390))[:CLEANUP]
41: (CL-POSTGRES::SEND-QUERY #<unavailable argument> #<unavailable argument> #<unavailable argument>)
42: ((LABELS #:G161))
43: (SQLG::QUERY-CORE "select program from programs where programskey = '192828' ;" 20 #<unused argument>)
There has been some noise on #lisp the past days about current usocket using an internal SBCL feature that's breaking with current SBCL. Try downgrading your SBCL to see if that solves the problem, and if it does, wait for an usocket update.
I recently patched cl-postgres to use the non-portable socket interface when on ACL, I think I had to change only three lines. You could try a similar thing for SBCL. Attila also talked about having cl-postgres use iolib (though that has the disadvantage of a huge set of dependencies).
Best, Marijn
> There has been some noise on #lisp the past days about current usocket using an internal SBCL feature that's breaking with current SBCL. Try downgrading your SBCL to see if that solves the problem, and if it does, wait for an usocket update.
that's not it and it only happens with a few days old, freshly compiled sbcl.
on the other hand sbcl's network stuff is the last source of major stability issues we have with our codebase, that's why i have it on my TODO to add an option to use iolib for the network communication in cl-postgres. i've seen sub-serve-event too many times in weird backtraces...
a few days ago i had a 30 min timeframe to move cl-postgres to iolib/babel, but gave up on something. iirc, the read-a-string-until-a-zero-byte-comes was unnecessarily hard to implement with the current iolib streams (which are being worked on currently by Stelian).
On Mon, May 18, 2009 at 4:14 AM, Attila Lendvai attila.lendvai@gmail.com wrote:
> > Try downgrading your SBCL to see if that solves the problem, and if it does, wait for an usocket update.
> that's not it and it only happens with a few days old, freshly compiled sbcl.
> on the other hand sbcl's network stuff is the last source of major stability issues we have with our codebase, that's why i have it on my TODO to add an option to use iolib for the network communication in cl-postgres. i've seen sub-serve-event too many times in weird backtraces...
It's definitely not a postmodern/cl-postgres issue... Out of curiosity, I tried using CLSQL (4.0.3) with its :postgresql-socket interface under similar circumstances. Ended up with similar SBCL unhappiness: nothing showing up in the postgresql server log but "broken pipe" errors triggering SBCL (1.0.28) sporadically dropping into ldb with
Signal 13 masked
fatal error encountered in SBCL pid [somepid](tid [sometid]):
some deferrable signals blocked, some unblocked
The ldb backtraces all look pretty similar (example backtrace below).
- Alan
Backtrace:
0: Foreign fp = 0xb70abec8, ra = 0x8059314
1: Foreign fp = 0xb70abee8, ra = 0x8055afe
2: Foreign fp = 0xb70abf98, ra = 0x8056616
3: Foreign fp = 0xb70abfd8, ra = 0x8056f0b
4: Foreign fp = 0xb70abff8, ra = 0x8057089
5: Foreign fp = 0xb70ac028, ra = 0x8058c4a
6: Foreign fp = 0xb70ac444, ra = 0xb7fe5440
7: Foreign fp = 0xb70ac564, ra = 0xb75c6b79
8: Foreign fp = 0xb70ac724, ra = 0xb75c6ced
9: Foreign fp = 0xb70ac854, ra = 0xb75be9be
10: Foreign fp = 0xb70ac874, ra = 0xb75bc36b
11: Foreign fp = 0xb70ac894, ra = 0xb75be00c
12: (SB-PCL::FAST-METHOD CLSQL-SYS::DATABASE-QUERY (COMMON-LISP::T CLSQL-POSTGRESQL::POSTGRESQL-DATABASE COMMON-LISP::T COMMON-LISP::T))
13: (COMMON-LISP::LAMBDA (SB-PCL::.PV. SB-PCL::.NEXT-METHOD-CALL. SB-PCL::.ARG0. SB-PCL::.ARG1. SB-PCL::.ARG2. SB-PCL::.ARG3.))
14: (SB-C::TL-XEP (SB-PCL::FAST-METHOD CLSQL-SYS::QUERY (COMMON-LISP::STRING)))
15: (SB-C::TL-XEP SQLG::COLUMNS-FROM-TABLE-WHERE)
16: SQLG::ATOM-FROM-TABLE-WHERE
17: RP-GENEVAL::GET-PROGRAM-FROM-DB
18: (SB-C::TL-XEP RP-GENEVAL::GET-TEMPLATE-MATCH01)
19: RP-GENEVAL::KSST-CORE
20: SB-INT::SIMPLE-EVAL-IN-LEXENV
21: (SB-C::HAIRY-ARG-PROCESSOR RP-GENEVAL::KRW)
22: RP-GENEVAL::KRWU
23: SB-INT::SIMPLE-EVAL-IN-LEXENV
24: (SB-C::HAIRY-ARG-PROCESSOR RP-GENEVAL::KRW)
25: RP-GENEVAL::KRWU
26: SB-INT::SIMPLE-EVAL-IN-LEXENV
27: (SB-C::HAIRY-ARG-PROCESSOR RP-GENEVAL::RPG-EVAL)
28: (SB-C::HAIRY-ARG-PROCESSOR RP-GENEVAL::EVAL-METHOD)
29: (SB-C::HAIRY-ARG-PROCESSOR RP-GENEVAL::CRITERION-FUNCTION)
30: (SB-C::HAIRY-ARG-PROCESSOR RP-GENEVAL::EVALUATE-PROGRAM-INTERNAL)
31: (SB-C::TL-XEP RP-GENEVAL::EVALUATE-*PROGRAM-POPULATION*-FITNESS)
Since SBCL 1.0.26 or so, there's been a sanity check in interrupt handling that ensures that all signals in a set are either blocked or unblocked. This check is dumping SBCL into LDB, and here's why:
Postgresql's libpq will block sigpipe for its own reasons. When another interrupt is signaled during a query (PQexec or PQsendQuery), SBCL has to decide what to do about the interrupt, whether to handle it or not. What winds up happening though is that the sanity check runs and sees that signal 13 is blocked, while the rest of the set of "deferrable" signals are not. It does not expect this, freaks out, and you get dumped into LDB.
To get around it, calls to PQexec need to get wrapped in sb-thread::block-deferrable-signals and sb-unix::unblock-deferrable-signals for now. It's a bummer, as potentially useful signals like sigint and sigalrm will get blocked.
I'm also going to try a patched sbcl that allows sbcl to tell the sanity check to ignore checking certain blocked signals, which isn't the best solution, but it's the quick and dirty solution I thought up. Hopefully future SBCLs will play nice with shared object library functions, but only time will tell.
Regards, Andrew Golding
Hey Andrew,
Thanks for the information. However, Postmodern does not use libpq. It only uses foreign code when creating SSL sockets. Was this mail just a general heads-up, or did you have a problem with Postmodern corrupting stack frames? (It uses (safety 0) all over the place, so potentially it could do bad things.)
Best, Marijn
postmodern-devel@common-lisp.net