Hi David,
Are there any known thread safety issues with CLOS? I've been getting some weird MOP errors that I made go away by redefining slot-value such that it has to acquire a global lock before it can do anything. I assume that the problem is in my code: but the threads that are executing when I get the failures share no code (or variables), and every time the errors pop up slot-value is on the callstack someplace.
Well, there have not been any specific threading issues mentioned. Over two years ago, back when I started doing ABCL, there were threading issues which looked like threading issues in CLOS. They turned out to be threading issues in the compiler though.
Having said that, I see there have not been taken synchronization measures on a high level: there are no locks and threads:synchronized-on blocks in clos.lisp. There are some synchronized calls on the Java side; I'm not sure the low-level protection is good enough to maintain consistency though.
Can you share some examples which show the issues here? Adding thread-safety to clos.lisp - in specific, targeted places - should be well doable.
Bye,
Erik.
On Sat, Apr 24, 2010 at 7:50 AM, Erik Huelsmann ehuels@gmail.com wrote:
Can you share some examples which show the issues here? Adding thread-safety to clos.lisp - in specific, targeted places - should be well doable.
It was surprisingly difficult to create a fragment which would reproduce the errors. But here it is!
(defclass counter () ((count :initform 0)))
(setf *ncount* 10000) (setf *nthread* 1000)
(defun count-thread (amount) #'(lambda () (let ((c (make-instance 'counter))) (dotimes (i *ncount*) (incf (slot-value c 'count) amount)) (format t "count ~A = ~A~%" amount (slot-value c 'count)))))
(dotimes (i *nthread*) (threads:make-thread (count-thread i)))
On my machine, a half dozen or so threads will reliably die with either:
Missing slot MOP::NAME
or
Missing slot MOP::PRECEDENCE-LIST
The MOP::NAME error has about 5x more common than the MOP::PRECEDENCE-LIST error. The key to reproducing the error is to use a lot of threads, with *nthread* = 2, the code can run all morning without any trouble. But it will eventually give an error: I originally noticed this with long running code and two threads. Even with *nthread* = 10, it works without error most of the time.
It also works (I mean gives errors) if you drop the final format in count-thread. The format is not thread safe, but it does not seem to cause errors by itself -- it just mangles output -- and it helps to see which threads are finishing.
-david k.
David Kirkman dkirkman@ucsd.edu writes:
On Sat, Apr 24, 2010 at 7:50 AM, Erik Huelsmann ehuels@gmail.com wrote:
Can you share some examples which show the issues here? Adding thread-safety to clos.lisp - in specific, targeted places - should be well doable.
It was surprisingly difficult to create a fragment which would reproduce the errors. But here it is!
(defclass counter () ((count :initform 0)))
(setf *ncount* 10000) (setf *nthread* 1000)
(defun count-thread (amount) #'(lambda () (let ((c (make-instance 'counter))) (dotimes (i *ncount*) (incf (slot-value c 'count) amount)) (format t "count ~A = ~A~%" amount (slot-value c 'count)))))
(dotimes (i *nthread*) (threads:make-thread (count-thread i)))
On my machine, a half dozen or so threads will reliably die with either:
Missing slot MOP::NAME
or
Missing slot MOP::PRECEDENCE-LIST
The MOP::NAME error has about 5x more common than the MOP::PRECEDENCE-LIST error. The key to reproducing the error is to use a lot of threads, with *nthread* = 2, the code can run all morning without any trouble. But it will eventually give an error: I originally noticed this with long running code and two threads. Even with *nthread* = 10, it works without error most of the time.
It also works (I mean gives errors) if you drop the final format in count-thread. The format is not thread safe, but it does not seem to cause errors by itself -- it just mangles output -- and it helps to see which threads are finishing.
There's a trick against the mangling. Use
(fresh-line) (write-line (format nil <fmt-ctrl> . <fmt-args>))
HTH
-T.
On Sat, Apr 24, 2010 at 9:44 AM, Tobias C. Rittweiler tcr@freebits.de wrote:
There's a trick against the mangling. Use
(fresh-line) (write-line (format nil <fmt-ctrl> . <fmt-args>))
That seems to have a much smaller but still non-vanishing interaction cross section. It's much better than just a plain (format t ...), but it still occasionally mangles.
(defun print-stuff () #'(lambda () (dotimes (i 1000) (fresh-line) (write-line (format nil "i = ~A" i) ))))
(dotimes (i 10) (threads:make-thread (print-stuff)))
Produces a few lines like these:
i = 831 i = 826i = 832 i = 833 i = 828
In my actual code I use a sledge hammer and
(defvar *output-lock* (cons nil nil))
then I wrap anything that writes a message to the screen with
(with-thread-lock (*output-lock*) (format ...))
Of course, combining the two approaches and wrapping the (write-line ...) would minimize thread contention, but about the only thing I use this for is debug output ...
Thanks,
david
David Kirkman dkirkman@ucsd.edu writes:
On Sat, Apr 24, 2010 at 9:44 AM, Tobias C. Rittweiler tcr@freebits.de wrote:
There's a trick against the mangling. Use
(fresh-line) (write-line (format nil <fmt-ctrl> . <fmt-args>))
That seems to have a much smaller but still non-vanishing interaction cross section. It's much better than just a plain (format t ...), but it still occasionally mangles.
I'm still on a uniprocessor system, so the above does it for me. You should also add (FORCE-OUTPUT) after the WRITE-LINE that will further minimize the window for clobbering.
(That's especially true in case you're using ABCL in combination with Slime. In that case, and it's not good enough, you could also hack SlimeOutputStream.java to make _writeLine() always involve _finishOutput().)
-T.
On Sat, Apr 24, 2010 at 11:43 AM, Tobias C. Rittweiler tcr@freebits.de wrote:
David Kirkman dkirkman@ucsd.edu writes:
On Sat, Apr 24, 2010 at 9:44 AM, Tobias C. Rittweiler tcr@freebits.de wrote:
There's a trick against the mangling. Use
(fresh-line) (write-line (format nil <fmt-ctrl> . <fmt-args>))
That seems to have a much smaller but still non-vanishing interaction cross section. It's much better than just a plain (format t ...), but it still occasionally mangles.
I'm still on a uniprocessor system, so the above does it for me. You should also add (FORCE-OUTPUT) after the WRITE-LINE that will further minimize the window for clobbering.
(That's especially true in case you're using ABCL in combination with Slime. In that case, and it's not good enough, you could also hack SlimeOutputStream.java to make _writeLine() always involve _finishOutput().)
Yep, adding a (FORCE_OUTPUT) either makes the cross section zero, or close enough for simple screen output. And yes, I am using abcl in combination with slime, and was kind of surprised at just how often the threads were clobbering each others output.
Thanks!
-david
"dk" == David Kirkman dkirkman@ucsd.edu writes:
dk> On my machine, a half dozen or so threads will reliably die with either: dk> dk> Missing slot MOP::NAME
On my machine (4-core AMD64 running Linux), I'm not able to reproduce the bug with the server JVM, but it triggers very easily when running in interpreter mode (java -Xint).
OpenJDK / Java 1.6.0_18
On Sat, Apr 24, 2010 at 11:36 AM, Eric Marsden eric.marsden@free.fr wrote:
"dk" == David Kirkman dkirkman@ucsd.edu writes:
dk> On my machine, a half dozen or so threads will reliably die with either: dk> dk> Missing slot MOP::NAME
On my machine (4-core AMD64 running Linux), I'm not able to reproduce the bug with the server JVM, but it triggers very easily when running in interpreter mode (java -Xint).
OpenJDK / Java 1.6.0_18
I was using the client jvm (2-core macbook), java 1.5.0_22. I can also trigger it the the server jvm, but possibly with lower frequency.
But in trying out the server jvm, I realized that that code fragment is less reliable than I thought -- sometimes it generates dozens of errors, and other times I can run it with no trouble ...
Note that 1000 threads takes a lot of memory -- if I start it with the default jvm parameters I just thrash. I don't get errors, but it does not actually end up computing anything.
On my machine, if it does not generate errors, it starts generating output after about 10 sec, and is done withing a minute. If it generates errors, everything slows down a lot -- it dos not ever seem to finish (just keeps my cpus pegged at 100% until I get tired of waiting for it).
On Sat, Apr 24, 2010 at 9:17 AM, David Kirkman dkirkman@ucsd.edu wrote:
On Sat, Apr 24, 2010 at 7:50 AM, Erik Huelsmann ehuels@gmail.com wrote:
Can you share some examples which show the issues here? Adding thread-safety to clos.lisp - in specific, targeted places - should be well doable.
Here is a second example that reliably gives me an error. This time, I only need two threads:
(defclass counter () ((count :initform 0)))
(defun make-many (id) #'(lambda () (dotimes (i 1000) (make-instance 'counter)) (fresh-line) (write-line (format nil "done with ~A" id))))
(progn (threads:make-thread (make-many 'thread1)) (threads:make-thread (make-many 'thread2)))
This fails 100% of the time for me when I try to make 1000 instances, and about 25% of the time when I only try 100 instances, so if it does not give you an error try upping the 1000 by a factor of a few. I'm using a dual core G5 right now, java 1.5.0_19 (it fails with both the client and the sever VM)
I now think that the original example is also failing because of make-instance, which is why I have to use so many threads to get an error.
-david
It was surprisingly difficult to create a fragment which would reproduce the errors. But here it is!
(defclass counter () ((count :initform 0)))
(setf *ncount* 10000) (setf *nthread* 1000)
(defun count-thread (amount) #'(lambda () (let ((c (make-instance 'counter))) (dotimes (i *ncount*) (incf (slot-value c 'count) amount)) (format t "count ~A = ~A~%" amount (slot-value c 'count)))))
(dotimes (i *nthread*) (threads:make-thread (count-thread i)))
On my machine, a half dozen or so threads will reliably die with either:
Missing slot MOP::NAME
or
Missing slot MOP::PRECEDENCE-LIST
The MOP::NAME error has about 5x more common than the MOP::PRECEDENCE-LIST error. The key to reproducing the error is to use a lot of threads, with *nthread* = 2, the code can run all morning without any trouble. But it will eventually give an error: I originally noticed this with long running code and two threads. Even with *nthread* = 10, it works without error most of the time.
It also works (I mean gives errors) if you drop the final format in count-thread. The format is not thread safe, but it does not seem to cause errors by itself -- it just mangles output -- and it helps to see which threads are finishing.
-david k.
Hi David,
On Mon, Apr 26, 2010 at 8:36 PM, David Kirkman dkirkman@ucsd.edu wrote:
On Sat, Apr 24, 2010 at 9:17 AM, David Kirkman dkirkman@ucsd.edu wrote:
On Sat, Apr 24, 2010 at 7:50 AM, Erik Huelsmann ehuels@gmail.com wrote:
Can you share some examples which show the issues here? Adding thread-safety to clos.lisp - in specific, targeted places - should be well doable.
Here is a second example that reliably gives me an error. This time, I only need two threads:
(defclass counter () ((count :initform 0)))
Yesterday, I took a little time to look at the issue you mention. You said with 1 big "lock around everything" it all worked. Where did you put that lock? I tried putting some locks in a number of places (educated guesses), but nothing helped so far. I don't have data regarding the stack trace you see when running into this issue - there wasn't enough time for it.
I know now that the error comes from one of two functions in StandardGenericFunction.java, but that's simply based on the text you're seeing in the error message.
The above is to let you know I did some work on it - and the current state. Maybe someone has time to work on it before me: I may have time to work on it some more somewhere next week.
Bye,
Erik.
On Tue, Apr 27, 2010 at 11:37 AM, Erik Huelsmann ehuels@gmail.com wrote:
Yesterday, I took a little time to look at the issue you mention. You said with 1 big "lock around everything" it all worked. Where did you put that lock? I tried putting some locks in a number of places (educated guesses), but nothing helped so far. I don't have data regarding the stack trace you see when running into this issue - there wasn't enough time for it.
I redefined make-instance and slot-value. This seems to fix things, or at least greatly reduce the frequency of problems. (I'm still having an occasional MOP::NAME error, but I can't reproduce it). Other than make-instance on slot-value, my code uses a lot of generic functions.
(defvar *skye-clos-lock* (make-hash-table))
(defmacro make-instance (&body body) `(progn (threads:with-thread-lock (*skye-clos-lock*) (common-lisp:make-instance ,@body))))
(defun slot-value (&rest args) (threads:with-thread-lock (*skye-clos-lock*) (apply #'common-lisp:slot-value args)))
(defun set-slot-value (c slot val) (threads:with-thread-lock (*skye-clos-lock*) (setf (common-lisp:slot-value c slot) val)))
(defsetf slot-value set-slot-value)
Can you reproduce the error? The second snippet that just exercises make-instance is more reliable at failing. Do you want a stack trace?
Cheers,
-david k.
Hi David,
On Tue, Apr 27, 2010 at 8:56 PM, David Kirkman dkirkman@ucsd.edu wrote:
On Tue, Apr 27, 2010 at 11:37 AM, Erik Huelsmann ehuels@gmail.com wrote:
Yesterday, I took a little time to look at the issue you mention. You said with 1 big "lock around everything" it all worked. Where did you put that lock? I tried putting some locks in a number of places (educated guesses), but nothing helped so far. I don't have data regarding the stack trace you see when running into this issue - there wasn't enough time for it.
I redefined make-instance and slot-value. This seems to fix things, or at least greatly reduce the frequency of problems. (I'm still having an occasional MOP::NAME error, but I can't reproduce it). Other than make-instance on slot-value, my code uses a lot of generic functions.
(defvar *skye-clos-lock* (make-hash-table))
(defmacro make-instance (&body body) `(progn (threads:with-thread-lock (*skye-clos-lock*) (common-lisp:make-instance ,@body))))
(defun slot-value (&rest args) (threads:with-thread-lock (*skye-clos-lock*) (apply #'common-lisp:slot-value args)))
(defun set-slot-value (c slot val) (threads:with-thread-lock (*skye-clos-lock*) (setf (common-lisp:slot-value c slot) val)))
(defsetf slot-value set-slot-value)
Can you reproduce the error?
Yes! Absolutely, I'm able to reproduce reliably (on a Core2 Duo). I just was inserting locks *inside* make-instance.
The second snippet that just exercises make-instance is more reliable at failing. Do you want a stack trace?
If you can get a stacktrace of the failing thread easily, yes, please!
Bye,
Erik.
On Tue, Apr 27, 2010 at 12:01 PM, Erik Huelsmann ehuels@gmail.com wrote:
If you can get a stacktrace of the failing thread easily, yes, please!
Here's one from slime: --------------------------------------
Missing slot MOP::NAME [Condition of type ERROR]
Restarts: 0: [ABORT] Abort thread.
Backtrace: 0: (#<FUNCTION {ABA511}> #<ERROR {66D0BE}> #<FUNCTION {ABA511}>) 1: (APPLY #<FUNCTION {ABA511}> (#<ERROR {66D0BE}> #<FUNCTION {ABA511}>)) 2: (SYSTEM::RUN-HOOK SYSTEM::*INVOKE-DEBUGGER-HOOK* #<ERROR {66D0BE}> #<FUNCTION {ABA511}>) 3: (INVOKE-DEBUGGER #<ERROR {66D0BE}>) 4: org.armedbear.lisp.Lisp.error(Lisp.java:356) 5: org.armedbear.lisp.StandardObject.getInstanceSlotValue(StandardObject.java:313) 6: org.armedbear.lisp.StandardClass.getName(StandardClass.java:122) 7: org.armedbear.lisp.StandardObject.typep(StandardObject.java:184) 8: org.armedbear.lisp.LispClass.typep(LispClass.java:298) 9: org.armedbear.lisp.SlotClass.typep(SlotClass.java:80) 10: org.armedbear.lisp.StandardClass.typep(StandardClass.java:293) 11: org.armedbear.lisp.typep_4.execute(typep.lisp:65) 12: org.armedbear.lisp.Symbol.execute(Symbol.java:787) 13: org.armedbear.lisp.LispThread.execute(LispThread.java:595) 14: org.armedbear.lisp.clos_164.execute(clos.lisp:1375) 15: org.armedbear.lisp.Symbol.execute(Symbol.java:787) 16: org.armedbear.lisp.LispThread.execute(LispThread.java:595) 17: org.armedbear.lisp.clos_165.execute(clos.lisp:1386) 18: org.armedbear.lisp.Symbol.execute(Symbol.java:787) 19: org.armedbear.lisp.LispThread.execute(LispThread.java:595) 20: org.armedbear.lisp.clos_525.execute(clos.lisp:2330) 21: org.armedbear.lisp.clos_177.execute(clos.lisp:1484) 22: org.armedbear.lisp.clos_157.execute(clos.lisp:1252) 23: org.armedbear.lisp.StandardGenericFunction.execute(StandardGenericFunction.java:138) 24: org.armedbear.lisp.Symbol.execute(Symbol.java:787) 25: org.armedbear.lisp.LispThread.execute(LispThread.java:595) 26: org.armedbear.lisp.clos_461.execute(clos.lisp:2081) 27: org.armedbear.lisp.Symbol.execute(Symbol.java:799) 28: org.armedbear.lisp.LispThread.execute(LispThread.java:612) 29: org.armedbear.lisp.clos_468.execute(clos.lisp:2136) 30: org.armedbear.lisp.CompiledClosure.execute(CompiledClosure.java:101) 31: org.armedbear.lisp.LispThread.execute(LispThread.java:578) 32: org.armedbear.lisp.Primitives$pf_apply.execute(Primitives.java:2778) 33: (SYSTEM::%TYPEP #<STANDARD-CLASS STANDARD-OBJECT {50B45A}> MOP::EQL-SPECIALIZER) 34: (MOP::METHOD-APPLICABLE-P #<STANDARD-METHOD SHARED-INITIALIZE (STANDARD-OBJECT T) {CEBDB2}> (#<COUNTER {7F8B6}> T)) 35: (MOP::%COMPUTE-APPLICABLE-METHODS #<STANDARD-GENERIC-FUNCTION SHARED-INITIALIZE {B1807D}> (#<COUNTER {7F8B6}> T)) 36: (COMPUTE-APPLICABLE-METHODS #<STANDARD-GENERIC-FUNCTION SHARED-INITIALIZE {B1807D}> (#<COUNTER {7F8B6}> T)) 37: (MOP::CHECK-INITARGS #<COUNTER {7F8B6}> T NIL) 38: (#<FUNCTION {5988BE}> #<STANDARD-CLASS COUNTER {E4FAFE}>) 39: (APPLY #<FUNCTION {5988BE}> (#<STANDARD-CLASS COUNTER {E4FAFE}>)) 40: (#<STANDARD-GENERIC-FUNCTION MAKE-INSTANCE {B2145C}> #<STANDARD-CLASS COUNTER {E4FAFE}>) 41: (APPLY #<STANDARD-GENERIC-FUNCTION MAKE-INSTANCE {B2145C}> #<STANDARD-CLASS COUNTER {E4FAFE}> NIL) 42: (#<FUNCTION {20B1B9}> COUNTER) 43: (APPLY #<FUNCTION {20B1B9}> (COUNTER)) 44: (#<STANDARD-GENERIC-FUNCTION MAKE-INSTANCE {B2145C}> COUNTER) 45: (#<FUNCTION (LAMBDA ()) {7416E9}>) 46: (THREADS::THREAD-FUNCTION-WRAPPER #<FUNCTION (LAMBDA ()) {7416E9}>)
I think I worked out the problem here: there are accesses to HashTables from unsynchronized get and put methods. Adding a 'synchronized' to the get/put methods in EqHashTable fixes this particular threading problem.
When the example below fails, it looks likeEqHashTable.get is being called from Layout.getSlotIndex(LispObject slotName) (line 191), but I can't find a way to fix the problem by synchronizing on getSlotIndex or on the slotTable variable, perhaps there are direct get calls to the slotTable from someplace else?
I've attached a patch that shows where I had to add 'synchronized' to fix these problems (which have become a fairly large problem for me!). I'm unsure of the fix though, because it looks like HashTable was designed for accesses through the getHash and putHash methods, which are already synchronized. If you apply this fix, you probably also want to update EqlHashTable and EqualHashTable.
Cheers,
-david k.
P.S.
Here is an updated code fragment that always fails for me before adding the synchronized keywords, and which now works (along with, as far as I can tell, the rest of my code).
(labels ((make-many () #'(lambda () (dotimes (i 5000) (make-instance 'slot-value-test))))) (let ((thread1 (make-thread (make-many))) (thread2 (make-thread (make-many)))) (threads:thread-join thread1) (threads:thread-join thread2)))
P.P.S.
Is testing 'ant test.ansi.compiled' currently broken? I get an end of file error from test/lisp/ansi/parse-ansi-errors.lisp when I try to run the tests. Looks like (defvar *default-database-file* is not closed
On Tue, Apr 27, 2010 at 11:37 AM, Erik Huelsmann ehuels@gmail.com wrote:
Hi David,
On Mon, Apr 26, 2010 at 8:36 PM, David Kirkman dkirkman@ucsd.edu wrote:
On Sat, Apr 24, 2010 at 9:17 AM, David Kirkman dkirkman@ucsd.edu wrote:
On Sat, Apr 24, 2010 at 7:50 AM, Erik Huelsmann ehuels@gmail.com wrote:
Can you share some examples which show the issues here? Adding thread-safety to clos.lisp - in specific, targeted places - should be well doable.
Here is a second example that reliably gives me an error. This time, I only need two threads:
(defclass counter () ((count :initform 0)))
Yesterday, I took a little time to look at the issue you mention. You said with 1 big "lock around everything" it all worked. Where did you put that lock? I tried putting some locks in a number of places (educated guesses), but nothing helped so far. I don't have data regarding the stack trace you see when running into this issue - there wasn't enough time for it.
I know now that the error comes from one of two functions in StandardGenericFunction.java, but that's simply based on the text you're seeing in the error message.
The above is to let you know I did some work on it - and the current state. Maybe someone has time to work on it before me: I may have time to work on it some more somewhere next week.
Bye,
Erik.
Hi David!
On Sun, Oct 3, 2010 at 10:09 PM, David Kirkman dkirkman@ucsd.edu wrote:
I think I worked out the problem here: there are accesses to HashTables from unsynchronized get and put methods. Adding a 'synchronized' to the get/put methods in EqHashTable fixes this particular threading problem.
Ville told me through IRC he'll be playing with your patch a bit before committing, but let me express my excitement with the submission! Thanks! I'd have thought to be working to find the source of this issue for at least a few weeks. And it's already the case that we're going to have a great 0.23 release with Maxima support, our own generic class file writer, possibly support for the long form of DEFINE-METHOD-COMBINATION and now *this*!
When the example below fails, it looks likeEqHashTable.get is being called from Layout.getSlotIndex(LispObject slotName) (line 191), but I can't find a way to fix the problem by synchronizing on getSlotIndex or on the slotTable variable, perhaps there are direct get calls to the slotTable from someplace else?
Thanks again!
Bye,
Erik.
On 3 October 2010 23:09, David Kirkman dkirkman@ucsd.edu wrote:
I think I worked out the problem here: there are accesses to HashTables from
Committed in r12946. We honestly can't thank you enough for finding this problem and fixing it, fabulous work!
armedbear-devel@common-lisp.net