This may be a CLX issue, or a problem in the CLX backend -- I would not know, but the test-case is pure CLIM. After the error future attempts to RUN-FRAME-TOP-LEVEL result in similar errors.
Backtrace follows, self contained test-case attached (RUN-FRAME-TOP-LEVEL, then "Stress" -- it may take a while, but eventually it breaks for me.)
This is CVS McCLIM, Christophe's CLX, and SBCL 1.0.13.45, on a MacBook running Tiger.
$ uname -a Darwin manic.local 8.11.1 Darwin Kernel Version 8.11.1: Wed Oct 10 18:23:28 PDT 2007; root:xnu-792.25.20~1/RELEASE_I386 i386 i386
X is the Apple packaged one: X11 1.1.3 - XFree86 4.4.0
Cheers,
-- Nikodemus
Asynchronous ID-CHOICE-ERROR in request 10265 (last request was 10266) Code 55.0 [CreateGC] ID #x600002 [Condition of type XLIB:ID-CHOICE-ERROR]
Restarts: 0: [CONTINUE] Ignore 1: [ABORT] Return to application command loop 2: [ABORT] Return to SLIME's top level. 3: [ABORT] Exit debugger, returning to top level.
Backtrace: 0: (XLIB::X-CERROR "Ignore" XLIB:ID-CHOICE-ERROR) 1: (XLIB:DEFAULT-ERROR-HANDLER #<XLIB:DISPLAY :0 (The XFree86 Project, Inc R40400000)> XLIB:ID-CHOICE-ERROR) 2: (XLIB::REPORT-ERROR #<XLIB:DISPLAY :0 (The XFree86 Project, Inc R40400000)> 14 XLIB:ID-CHOICE-ERROR) 3: (XLIB::READ-ERROR-INPUT #<XLIB:DISPLAY :0 (The XFree86 Project, Inc R40400000)> 10265 #S(XLIB::REPLY-BUFFER :SIZE 32 :IBUF8 #(0 14 25 40 2 0 ...) :NEXT NIL :DATA-SIZE 32) #<SB-THREAD:THREAD "initial thread" {116920A9}>) 4: (XLIB::READ-INPUT #<XLIB:DISPLAY :0 (The XFree86 Project, Inc R40400000)> NIL NIL #<FUNCTION (LAMBDA (XLIB::PENDING-COMMAND)) {13969ECD}>) 5: (XLIB::READ-REPLY #<unavailable argument> #<unavailable argument>) 6: (XLIB:ALLOC-COLOR #<XLIB:COLORMAP TRUE-COLOR :0 32> #<XLIB:COLOR 0.641639 0.52443945 0.84608877>) 7: ((SB-PCL::FAST-METHOD CLIM-CLX::X-PIXEL (CLIM-CLX::CLX-PORT T)) #(17 NIL) #<unavailable argument> #<CLIM-CLX::CLX-PORT :HOST "" :DISPLAY-ID 0 {122D0AC1}> #<CLIM-INTERNALS::NAMED-COLOR "Unnamed color">) 8: ((SB-PCL::FAST-METHOD CLIM-CLX::MEDIUM-GCONTEXT (CLIM-CLX::CLX-MEDIUM COLOR)) #(9 NIL) #<unused argument> #<CLIM-CLX::CLX-MEDIUM {13F67CE1}> #<CLIM-INTERNALS::NAMED-COLOR "Unnamed color">) 9: ((LAMBDA (SB-PCL::.PV. SB-PCL::.NEXT-METHOD-CALL. SB-PCL::.ARG0. SB-PCL::.ARG1.)) #<unavailable argument> #<unavailable argument> #<CLIM-CLX::CLX-MEDIUM {13F67CE1}> #<CLIM-INTERNALS::NAMED-COLOR "Unnamed color">) 10: ((SB-PCL::FAST-METHOD MEDIUM-DRAW-POINT* (CLIM-CLX::CLX-MEDIUM T T)) #<unavailable argument> #<unavailable argument> #<CLIM-CLX::CLX-MEDIUM {13F67CE1}> 290 142) 11: ((SB-PCL::FAST-METHOD CLIM-INTERNALS::DO-GRAPHICS-WITH-OPTIONS-INTERNAL (MEDIUM T T)) #<unavailable argument> #<unavailable argument> #CLIM-CLX::CLX-MEDIUM {13F67CE1} #<CLIM-CLX::CLX-MEDIUM {13F67CE1}> #<CLOSURE (FLET CLIM-INTERNALS::GRAPHICS-OP) {13F67CC5}>) 12: ((SB-PCL::FAST-METHOD CLIM-INTERNALS::DO-GRAPHICS-WITH-OPTIONS (CLIM-INTERNALS::PIXMAP T)) #<unavailable argument> #<unavailable argument> #<CLIM-INTERNALS::MIRRORED-PIXMAP {117385F9}> #<CLOSURE (FLET CLIM-INTERNALS::GRAPHICS-OP) {13F67CC5}>) 13: ((LAMBDA (COLOR X Y)) #(0.641639 0.52443945 0.84608877) 290 142) 14: (RANDOM-RENDER 400 300 #<CLOSURE (LAMBDA (COLOR X Y)) {11738895}>) 15: (RENDER-SCENE #<APPLICATION-PANE CANVAS {11F91F99}>) 16: (COM-STRESS) 17: ((SB-PCL::FAST-METHOD DEFAULT-FRAME-TOP-LEVEL (APPLICATION-FRAME)) #<unavailable argument> #<unavailable argument> #<TEST-FRAME {1381E9F1}>) 18: ((SB-PCL::FAST-METHOD RUN-FRAME-TOP-LEVEL (APPLICATION-FRAME)) #<unavailable argument> #<unavailable argument> #<TEST-FRAME {1381E9F1}>) 19: ((SB-PCL::FAST-METHOD RUN-FRAME-TOP-LEVEL :AROUND (APPLICATION-FRAME)) #<unavailable argument> #S(SB-PCL::FAST-METHOD-CALL :FUNCTION #<FUNCTION (SB-C::&OPTIONAL-DISPATCH #) {131373FD}> :PV NIL :NEXT-METHOD-CALL NIL :ARG-INFO (1 . T)) #<TEST-FRAME {1381E9F1}>) 20: (SB-INT:SIMPLE-EVAL-IN-LEXENV (RUN-FRAME-TOP-LEVEL (MAKE-APPLICATION-FRAME (QUOTE TEST-FRAME))) #<NULL-LEXENV>)
I just encountered the "Asynchronous ID-CHOICE-ERROR" myself tonight, in my case when multiple mcclim applications (in multiple threads) on one screen race to repaint. I don't know if it's related to Nikodemus' test case, which doesn't make any use of threads (excluding the omnipresent CLX event dispatching thread), but at least it's the same error.
Attached is a patch to CLX intending to address a race condition when multiple threads attempt to allocate a resource ID - DISPLAY-RESOURCE-ID-COUNT is incremented, but there is no locking to protect the slot, resulting in intermittent duplicate IDs and subsequent ID-CHOICE-ERROR. I've wrapped the allocation in a with-display, and done the same to the nearby deallocate-resource-id-internal for good measure. This should protect the incf in resourcealloc, the (setf gethash) in save-id, and the remhash in deallocate-resource-id-internal.
--- old-clx/display.lisp 2008-08-21 23:27:35.000000000 -0400 +++ new-clx/display.lisp 2008-08-21 23:27:35.000000000 -0400 @@ -155,11 +155,12 @@
(defmacro allocate-resource-id (display object type) ;; Allocate a resource-id for OBJECT in DISPLAY - (if (member (eval type) +clx-cached-types+) - `(let ((id (funcall (display-xid ,display) ,display))) - (save-id ,display id ,object) - id) - `(funcall (display-xid ,display) ,display))) + `(with-display (,display) + ,(if (member (eval type) +clx-cached-types+) + `(let ((id (funcall (display-xid ,display) ,display))) + (save-id ,display id ,object) + id) + `(funcall (display-xid ,display) ,display))))
(defmacro deallocate-resource-id (display id type) ;; Deallocate a resource-id for OBJECT in DISPLAY @@ -167,7 +168,8 @@ `(deallocate-resource-id-internal ,display ,id)))
(defun deallocate-resource-id-internal (display id) - (remhash id (display-resource-id-map display))) + (with-display (display) + (remhash id (display-resource-id-map display))))
(defun lookup-resource-id (display id) ;; Find the object associated with resource ID
"Andy Hefner" ahefner@gmail.com writes:
Attached is a patch to CLX intending to address a race condition when multiple threads attempt to allocate a resource ID - DISPLAY-RESOURCE-ID-COUNT is incremented, but there is no locking to protect the slot, resulting in intermittent duplicate IDs and subsequent ID-CHOICE-ERROR. I've wrapped the allocation in a with-display, and done the same to the nearby deallocate-resource-id-internal for good measure. This should protect the incf in resourcealloc, the (setf gethash) in save-id, and the remhash in deallocate-resource-id-internal.
Thanks, pushed.
Best,
Christophe
On Fri, Aug 22, 2008 at 6:06 AM, Andy Hefner ahefner@gmail.com wrote:
I just encountered the "Asynchronous ID-CHOICE-ERROR" myself tonight, in my case when multiple mcclim applications (in multiple threads) on one screen race to repaint. I don't know if it's related to Nikodemus' test case, which doesn't make any use of threads (excluding the omnipresent CLX event dispatching thread), but at least it's the same error.
I don't remember offhand for sure, but I think that was actually on a single-threaded SBCL -- so there would perforce be only one thread in Lisp Land.
Since it seems to be in danger of falling through the cracks, here is a (possibly related) report from sbcl-devel.
Cheers,
-- Nikodemus
---------- Forwarded message ---------- From: Shawn sabetts@gmail.com Date: Sat, Jul 12, 2008 at 5:27 AM Subject: Re: [Sbcl-devel] breaking the clx xid cache To: sbcl-devel@lists.sf.net
Hi folks,
It looks like the portable-clx mailing list is down so I'm sending this here.
In stumpwm I've been getting more and more error reports where the :window slot for events is a pixmap instead of a window. I've tracked it to a problem with the xid cache in clx. X appears to be recycling XIDs which confuses clx. Here's a function that exposes the bug:
(defun break-display-xid-cache () (labels ((make-win (dpy) (xlib:create-window :parent (xlib:screen-root (first (xlib:display-roots dpy))) :x 0 :y 0 :width 50 :height 50)) (make-pixmap (window) (xlib:create-pixmap :width (random 100) :height (random 100) :depth 8 :drawable window)) (first-pass (dpy) ;; Open a fresh connection. Create a window and a pixmap. (let* ((dpy2 (xlib:open-default-display)) (window (make-win dpy2)) (pixmap (make-pixmap window))) ;; make the pixmap the window's icon pixmap hint. (setf (xlib:wm-hints window) (xlib:make-wm-hints :icon-pixmap pixmap)) (format t "Window ID: ~s pixmap ID: ~s~%" (xlib:window-id window) (xlib:pixmap-id pixmap)) (xlib:display-finish-output dpy2) ;; On the old connection, list the root window children ;; and the icon pixmap hint to cache their XIDs. (loop for w in (xlib:query-tree (xlib:screen-root (first (xlib:display-roots dpy)))) for hints = (xlib:wm-hints w) when hints do (format t "top level window id: ~s | icon pixmap hint: ~s~%" (xlib:window-id w) (xlib:wm-hints-icon-pixmap hints))) (xlib:close-display dpy2))) (second-pass (dpy) ;; Open a fresh connection and create 2 windows. (let* ((dpy2 (xlib:open-default-display)) (window1 (make-win dpy2)) (window2 (make-win dpy2))) (format t "Window#1 ID: ~s Window#2 ID: ~s~%" (xlib:window-id window1) (xlib:window-id window2)) (xlib:display-finish-output dpy2) ;; On the old connection, list the root window children ;; and note the second window is erroneously a pixmap ;; due to too agressive caching in clx. (loop for w in (xlib:query-tree (xlib:screen-root (first (xlib:display-roots dpy)))) do (format t "window: ~s~%" w)) (xlib:close-display dpy2)))) (let ((dpy (xlib:open-default-display))) (first-pass dpy) (second-pass dpy) (xlib:close-display dpy))))
Note that the last window in the window list printed at the end is a pixmap and not a window! Here's the relevant output when i run it:
* (break-display-xid-cache) Window ID: 14680065 pixmap ID: 14680066 top level window id: 14680065 | icon pixmap hint: #<XLIB:PIXMAP :0 E00002> Window#1 ID: 14680065 Window#2 ID: 14680066 window: #<XLIB:WINDOW :0 E00001> window: #<XLIB:PIXMAP :0 E00002> NIL *
I propose the following patch. Since a cache error isn't the sign of a bug, I don't think checking the type should depend +type-check?+ constant. In the event of a lookup error, I've added two restarts for invalidating the cache and creating fresh XID cache entries. This is how it has been addressed in new-clx.
--- display.lisp 2005-08-26 03:13:28.000000000 -0700 +++ display.lisp 2008-07-11 17:22:39.000000000 -0700 @@ -199,17 +199,21 @@ :display display :id id)) (save-id display id ,type)) ;; Found. Check the type - ,(cond ((null +type-check?+) - `(t ,type)) - ((member type '(window pixmap)) - `((type? ,type 'drawable) ,type)) - (t `((type? ,type ',type) ,type))) - ,@(when +type-check?+ - `((t (x-error 'lookup-error - :id id - :display display - :type ',type - :object ,type)))))) + ((type? ,type ',type) ,type) + (t + (restart-case + (x-error 'lookup-error + :id id + :display display + :type ',type + :object ,type) + (one () + :report "Invalidate this cache entry" + (save-id display id (,(xintern 'make- type) :display display :id id))) + (all () + :report "Invalidate all display cache" + (clrhash (display-resource-id-map display)) + (save-id display id (,(xintern 'make- type) :display display :id id))))))) ;; Not being cached. Create a new one each time. `(,(xintern 'make- type) :display display :id id))))
-Shawn
"Nikodemus Siivola" nikodemus@random-state.net writes:
On Fri, Aug 22, 2008 at 6:06 AM, Andy Hefner ahefner@gmail.com wrote:
I just encountered the "Asynchronous ID-CHOICE-ERROR" myself tonight, in my case when multiple mcclim applications (in multiple threads) on one screen race to repaint. I don't know if it's related to Nikodemus' test case, which doesn't make any use of threads (excluding the omnipresent CLX event dispatching thread), but at least it's the same error.
I don't remember offhand for sure, but I think that was actually on a single-threaded SBCL -- so there would perforce be only one thread in Lisp Land.
Since it seems to be in danger of falling through the cracks, here is a (possibly related) report from sbcl-devel.
I don't think this is related, but I have been investigating that one this morning anyway, while waiting for your test case to break. (It's been about an hour so far; is that normal? I can suggest diagnostics to run on your system if not.) Because your resource ID was suspicious (had a large number of 0s in it: #x600002 or so) I would tend to suspect either a bug in the resource ID logic in clx, or a bug in the X server (this is Apple, after all...)
I think that the fix to Shawn's problem is to teach clx to cache only its own client resource IDs, and not anyone else's. Before installing that fix, I want to read xlib or xcb sources, and possibly a window manager's, to check that that's sane. (If someone else wants to read xlib/xcb sources instead, please be my guest!)
Best,
Christophe
On Fri, Aug 22, 2008 at 1:16 PM, Christophe Rhodes csr21@cantab.net wrote:
"Nikodemus Siivola" nikodemus@random-state.net writes:
I don't think this is related, but I have been investigating that one this morning anyway, while waiting for your test case to break. (It's been about an hour so far; is that normal? I can suggest diagnostics to run on your system if not.)
IIRC it used to break in less then five minutes. I'll try to reproduce again before updating CLX.
Cheers,
-- Nikodemus
"Nikodemus Siivola" nikodemus@random-state.net writes:
On Fri, Aug 22, 2008 at 1:16 PM, Christophe Rhodes csr21@cantab.net wrote:
"Nikodemus Siivola" nikodemus@random-state.net writes:
I don't think this is related, but I have been investigating that one this morning anyway, while waiting for your test case to break. (It's been about an hour so far; is that normal? I can suggest diagnostics to run on your system if not.)
IIRC it used to break in less then five minutes. I'll try to reproduce again before updating CLX.
It's still going here, about 4 hours later. Can you run
(let ((display (xlib:open-default-display))) (values (xlib::display-resource-id-mask display) (xlib::display-resource-id-base display)))
against your X server?
Thanks,
Christophe
On Fri, Aug 22, 2008 at 3:20 PM, Christophe Rhodes csr21@cantab.net wrote:
"Nikodemus Siivola" nikodemus@random-state.net writes:
On Fri, Aug 22, 2008 at 1:16 PM, Christophe Rhodes csr21@cantab.net wrote:
"Nikodemus Siivola" nikodemus@random-state.net writes:
I don't think this is related, but I have been investigating that one this morning anyway, while waiting for your test case to break. (It's been about an hour so far; is that normal? I can suggest diagnostics to run on your system if not.)
IIRC it used to break in less then five minutes. I'll try to reproduce again before updating CLX.
It's still going here, about 4 hours later.
Was able to reproduce here, though it took something like 30 minutes.
Can you run
(let ((display (xlib:open-default-display))) (values (xlib::display-resource-id-mask display) (xlib::display-resource-id-base display)))
In the image where I had just duplicated the error, while in the debugger:
2097151 10485760
In a clean image with just CLX loaded:
2097151 8388608
Cheers,
-- Nikodemus
"Nikodemus Siivola" nikodemus@random-state.net writes:
Was able to reproduce here, though it took something like 30 minutes.
Can you run
(let ((display (xlib:open-default-display))) (values (xlib::display-resource-id-mask display) (xlib::display-resource-id-base display)))
In the image where I had just duplicated the error, while in the debugger:
2097151 10485760
Please tell me you saved the exact error message and can now add that piece of information... in particular, I'm interested in the hex number that is reported as part of the id-choice error.
C.
On Fri, Aug 22, 2008 at 6:30 PM, Christophe Rhodes csr21@cantab.net wrote:
"Nikodemus Siivola" nikodemus@random-state.net writes:
Was able to reproduce here, though it took something like 30 minutes.
Can you run
(let ((display (xlib:open-default-display))) (values (xlib::display-resource-id-mask display) (xlib::display-resource-id-base display)))
In the image where I had just duplicated the error, while in the debugger:
2097151 10485760
Please tell me you saved the exact error message and can now add that piece of information... in particular, I'm interested in the hex number that is reported as part of the id-choice error.
I did not, but here are the results from a new run:
Received CLX ID-CHOICE-ERROR in process NIL
debugger invoked on a XLIB:ID-CHOICE-ERROR: Asynchronous ID-CHOICE-ERROR in request 10247 (last request was 10248) Code 55.0 [CreateGC] ID #x800002
Type HELP for debugger help, or (SB-EXT:QUIT) to exit from SBCL.
restarts (invokable by number or by possibly-abbreviated name): 0: [CONTINUE ] Ignore 1: [TRY-AGAIN] Try executing the command COM-STRESS again 2: [ABORT ] Return to application command loop 3: Exit debugger, returning to top level.
(XLIB::X-CERROR "Ignore" XLIB:ID-CHOICE-ERROR)[:EXTERNAL] 0] (let ((display (xlib:open-default-display))) (values (xlib:display-resource-id-mask display) (xlib:display-resource-id-base display)))
2097151 10485760 0]
I'm keeping the session alive, so let me know if you need anything else.
Cheers,
-- Nikodemus
I think there are two bugs at play here - first, that CLX makes no attempt to avoid reusing IDs that are still in use once the count wraps around, and second that in this instance mcclim creates millions of gcontexts without freeing them.
I've hacked resourcealloc in CLX to check against the id cache, skipping IDs which are still in use, but this runs into the problem that not all types (in particular, graphics contexts) are stored in the cache. There's some comment about this not being useful and wasting space in depdefs.lisp next to defconstant +clx-cached-types+, and without considering the performance implications I've worked around it by setting the hash table entry to T at the point the ID is allocated, so that (I hope) every allocated resource is now there. To do this correctly also requirse that the deallocate-resource-id be changed to unconditionally invoke deallocate-resource-id-internal (thereby making the macro useless), which I've neglected to do. With that in mind, here's a version of resourcealloc with the changes described:
(defun resourcealloc (display) ;; Allocate a resource-id for in DISPLAY (declare (type display display)) (declare (clx-values resource-id))
(loop for next-count upfrom (1+ (display-resource-id-count display)) repeat (1+ (display-resource-id-mask display)) as id = (dpb next-count (display-resource-id-byte display) (display-resource-id-base display)) do (when (nth-value 1 (gethash id (display-resource-id-map display))) #+NIL (format *trace-output* "~&Skipping ~X (~A)~%" id (gethash id (display-resource-id-map display)))) (when (zerop (random 10000)) (format *trace-output* "~&~X~%" id)) (unless (nth-value 1 (gethash id (display-resource-id-map display))) (setf (display-resource-id-count display) next-count) #+NIL (format *trace-output* "~&Allocating ~X (~A IDs in use)~%" id (hash-table-count (display-resource-id-map display))) (setf (gethash id (display-resource-id-map display)) t) (when (= id #x2A00027) (break "Hi there.")) (return-from resourcealloc id))) (assert (= (hash-table-count (display-resource-id-map display)) (1+ (display-resource-id-mask display)))) (error "Available X resource IDs have been exhausted."))
Fixing this allocation issue doesn't solve the underlying problem that this test causes mcclim to exhaust the available space of resource-ids by never freeing these gcontexts.
"Andy Hefner" ahefner@gmail.com writes:
Fixing this allocation issue doesn't solve the underlying problem that this test causes mcclim to exhaust the available space of resource-ids by never freeing these gcontexts.
I attach a patch for consideration against mcclim (yeah, I know, I have commit access, but I'm rusty) which seems to fix the leaky gcontext problem in the CLX backend. There may well be an analogous problem in other backends (except the Null backend :-); I haven't investigated.
I've also got something like Andy's resourcealloc patch in my clx (not pushed yet), along with a fix for Shawn's resource ID issue. Wow, it's almost like being up to date...
Best,
Christophe
(defmethod port-deallocate-pixmap ((port clx-port) pixmap) + (let ((medium (climi::pixmap-medium pixmap))) + (when medium + (with-slots (gc) medium + (when gc + (xlib:free-gcontext gc))))) (when (port-lookup-mirror port pixmap) (destroy-mirror port pixmap)))
Would it be more appropriate to move the subform starting with (with-slots (gc) ..) to deallocate-medium (or possibly degraft-medium), and one of those instead? Not a big deal, given virtually nothing else in mcclim calls these when it probably should.