Howdy,
I wonder if those of you who have worked with threads might have a quick look to see if I am doing something stupid.
https://lsw2.googlecode.com/svn/branches/bona/util/jargrep.lisp
The situation is that I want to do stuff (like look for matches to a regular expression) in 240k files which comprise 52G of data.
I am running on a VM allocated 5 CPUs each with three cores.
Because at the moment the disk subsystem isn't very fast, I decided to approach this by breaking up the 240k files into 15 parts and putting each part in a jar file.
The code mentioned above looks for a regular expression (two methods for two different regex handlers: java and dk.brics.automaton).
It is invoked something like:
(jar-map-threads-automaton-find regex (generate-filename-sequence "/data/jars/15/file#.jar" 2 0 14))
This spawns off 15 threads that go at it for something around a minute. As they find hits they save them in a lisp hash table keyed by the entry name in the jar file, which is unique across all the jar files.
The result of running this is about (and there's the rub) 20 key value pairs in the hash table (I had read that ABCL hash tables are thread safe). The problem is that different runs of this code on the same data get different numbers of key value pairs, between 13 and 24!
I'm not sure whether I'm just not doing this the right way, in which case it would be very helpful to get an explanation of why not, or there's a problem somewhere in the implementation.
Any ideas would be greatly appreciated.
Best, Alan
(LISP-IMPLEMENTATION-VERSION) "1.2.0-dev-svn-14436M"
"Java_HotSpot(TM)_64-Bit_Server_VM-Oracle_Corporation-1.7.0_21-b11"
"amd64-Linux-3.8.0-30-generic"
On 9/26/13 0728, Alan Ruttenberg wrote:
> Howdy,
> I wonder if those of you who have worked with threads might have a quick look to see if I am doing something stupid.
> https://lsw2.googlecode.com/svn/branches/bona/util/jargrep.lisp
I whacked away at your file, converting it to the attached form to use the JSS namespace and ABCL-ASDF to resolve the dk.brics.automaton artifact, but I can't seem to get the matches to occur. Not having your jar files to test, I just ran it across Maven jars as follows:
CL-USER> (jar-map-threads-automaton-find "Manifest" (jss::all-jars-below "~/.m2"))
12.295 seconds real time
1897572 cons cells
0 0
CL-USER> (length (jss::all-jars-below "~/.m2"))
460
which should result in matches for all jars, because every jar that Maven uses has a manifest that contains the string "Manifest-Version: 1.0". But I get no hits, and the execution is so fast that I suspect the matcher is not actually working on anything for some reason. Since you pass a closure with a reference to the regex as the function to THREADS:MAKE-THREAD, trying to TRACE stuff doesn't seem to work so well.
I need to spend more time with the matcher to understand why I am not generating any hits. Any ideas on your end?
[…]
> The result of running this is about (and there's the rub) 20 key value pairs in the hash table (I had read that ABCL hash tables are thread safe). The problem is that different runs of this code on the same data get different numbers of key value pairs, between 13 and 24!
ABCL hashtables should indeed be thread-safe, with all accesses protected by an underlying java.util.concurrent.locks.ReentrantLock.
> I'm not sure whether I'm just not doing this the right way, in which case it would be very helpful to get an explanation of why not, or there's a problem somewhere in the implementation.
For the record, I used
CL-USER> (lisp-implementation-version)
"1.3.0-dev"
"Java_HotSpot(TM)_64-Bit_Server_VM-Oracle_Corporation-1.7.0_40-b43"
"amd64-Linux-2.6.18-348.16.1.el5.centos.plus"
to run my tests, but I have no reason to currently suspect the ABCL version is at fault here.
More later when I get the time, Mark
Yah. The search is case sensitive. You are searching for MANIFEST and the title of the entry is MANIFEST but the file has the string "Manifest" :)
-Alan
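To make the case-sensitivity point concrete, here is a minimal Java sketch using java.util.regex, one of the two regex handlers mentioned in the thread. The class and method names are made up for illustration and are not taken from jargrep.lisp. Whether dk.brics.automaton offers an equivalent switch is not something this sketch assumes.

```java
import java.util.regex.Pattern;

class CaseSensitivitySketch {
    // Returns true if the regex matches anywhere in the text.
    // With ignoreCase set, Pattern.CASE_INSENSITIVE is passed to the compiler.
    static boolean found(String regex, String text, boolean ignoreCase) {
        int flags = ignoreCase ? Pattern.CASE_INSENSITIVE : 0;
        return Pattern.compile(regex, flags).matcher(text).find();
    }

    public static void main(String[] args) {
        String line = "Manifest-Version: 1.0";
        System.out.println(found("MANIFEST", line, false)); // false: case differs
        System.out.println(found("MANIFEST", line, true));  // true: case ignored
        System.out.println(found("Manifest", line, false)); // true: exact case
    }
}
```

By default java.util.regex matching is case sensitive, so a pattern only hits text written in the same case unless the flag is supplied.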
Another reason might be that the string looks like garbage. Maybe a character encoding issue. I haven't run it on a Mac, which I am on now. It was definitely working on the big machine.
...
Yes, character encoding. For the purposes of testing, use
(unwind-protect
     (let ((name (#"getName" next-in)))
       (funcall fn (jss:new 'java.lang.string buffer 0 size "UTF-8") name))
  (#"close" in-stream))
in jar-map, which gets the string as UTF-8. I need to think about what the right way to handle this is in general. Presumably there is some way to figure out the character encoding...
-Alan
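For readers following along on the Java side, the same idea (decode the entry's bytes with an explicit charset instead of relying on the platform default) might look like the sketch below. It is only an illustration: the class and method names are hypothetical, and the archive is built in memory so the example needs no jar files on disk.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class JarEntryDecodeSketch {
    // Build a one-entry archive in memory (jar files use the zip format).
    static byte[] makeArchive(String entryName, String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write(text.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Read every entry and decode its bytes explicitly as UTF-8,
    // so the result does not depend on the platform default charset.
    static Map<String, String> readEntriesAsUtf8(byte[] archive) {
        Map<String, String> contents = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(archive))) {
            ZipEntry entry;
            byte[] chunk = new byte[8192];
            while ((entry = zip.getNextEntry()) != null) {
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                int n;
                while ((n = zip.read(chunk)) != -1) {
                    buffer.write(chunk, 0, n);
                }
                byte[] raw = buffer.toByteArray();
                contents.put(entry.getName(),
                             new String(raw, 0, raw.length, StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return contents;
    }

    public static void main(String[] args) {
        byte[] jar = makeArchive("META-INF/MANIFEST.MF", "Manifest-Version: 1.0\n");
        System.out.println(readEntriesAsUtf8(jar));
    }
}
```

Passing the charset explicitly is what makes the result the same on the Linux VM and on a Mac, whatever default encoding each JVM happens to pick.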
On Thu, Sep 26, 2013 at 5:17 PM, Alan Ruttenberg alanruttenberg@gmail.com wrote:
> I need to think about what the right way to handle this is in general.
OK I've thought about it. I'd better just know. For java-defined entries they are always UTF-8. I guess I will assume UTF-8 unless I know otherwise.
-Alan
Oh, and nthreads isn't what you might think it is. It actually just subsets the filenames to be nthreads long. For testing. So if you used :nthreads 4 and it went too fast to grep through the 1000 files you gave it, that'd be why. I will fix that confusion next time I edit the file.
-Alan
ps. (abcl-asdf:resolve-dependencies "dk.brics.automaton" "automaton") is very cute :)
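To spell out the difference between the two meanings of nthreads, here is a small Java sketch with hypothetical names (nothing here is taken from jargrep.lisp). The truncate method does what the message above describes, keeping only the first n file names; the partition method is one possible alternative that deals all the files out across n workers.

```java
import java.util.ArrayList;
import java.util.List;

class PartitionSketch {
    // Deal the file names out round-robin so that every file
    // lands in exactly one of the nThreads chunks.
    static List<List<String>> partition(List<String> files, int nThreads) {
        List<List<String>> chunks = new ArrayList<>();
        for (int i = 0; i < nThreads; i++) {
            chunks.add(new ArrayList<>());
        }
        for (int i = 0; i < files.size(); i++) {
            chunks.get(i % nThreads).add(files.get(i));
        }
        return chunks;
    }

    // The subsetting behaviour described in the thread:
    // only the first n file names are kept, the rest are never searched.
    static List<String> truncate(List<String> files, int n) {
        return new ArrayList<>(files.subList(0, Math.min(n, files.size())));
    }

    public static void main(String[] args) {
        List<String> files = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            files.add("file" + i + ".jar");
        }
        System.out.println(partition(files, 4)); // four chunks covering all 10 files
        System.out.println(truncate(files, 4));  // just the first 4 files
    }
}
```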
Thanks for this! I'll have a look a little later - crazy day. It occurred to me that perhaps the automaton library wasn't re-entrant but I asked the developer and he thinks it is.
-Alan
Attached find a test that abstracts all the file i/o away, and seemingly shows that there *is* a problem with concurrent hashtable access in ABCL.
CL-USER> (run 1)
Spawning 1 threads... Done.
Spawned threads that should sum to 10 while the shared value is 10.
NIL
CL-USER> (run 10)
Spawning 10 threads... Done.
Spawned threads that should sum to 100 while the shared value is 100.
NIL
CL-USER> (run 100)
Spawning 100 threads... Done.
Spawned threads that should sum to 1000 while the shared value is 993.
NIL
In this test, each thread sleeps for (/ (random 1000) 1000) seconds to allow contention to arise.
Inspection of the code and observations of whether this test should be expected to "fail" for some other problem are solicited.
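One possibility worth ruling out is a lost update. The attached test is not reproduced here, so this is a guess about its shape: if each thread reads the shared value out of the hash table and then writes back an incremented value, a lock around each individual access does not make the read-plus-write pair atomic, and two threads can both write back the same number. The Java sketch below illustrates that effect with a hypothetical LockedMap class that guards every get and put with a ReentrantLock. It is not ABCL's implementation, only an analogy to the per-access locking described earlier in the thread.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

class LostUpdateSketch {
    // Hypothetical stand-in for a hash table that is thread-safe per access.
    static final class LockedMap {
        private final Map<String, Integer> map = new HashMap<>();
        private final ReentrantLock lock = new ReentrantLock();

        int get(String key) {
            lock.lock();
            try { return map.getOrDefault(key, 0); } finally { lock.unlock(); }
        }

        void put(String key, int value) {
            lock.lock();
            try { map.put(key, value); } finally { lock.unlock(); }
        }

        // Read and write under a single lock acquisition.
        void increment(String key) {
            lock.lock();
            try { map.put(key, map.getOrDefault(key, 0) + 1); } finally { lock.unlock(); }
        }
    }

    // Spawn nThreads threads that each add 1 to the shared entry perThread times.
    static int run(int nThreads, int perThread, boolean atomic) {
        LockedMap table = new LockedMap();
        Thread[] threads = new Thread[nThreads];
        for (int i = 0; i < nThreads; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    if (atomic) {
                        table.increment("shared");
                    } else {
                        int seen = table.get("shared"); // lock is released after this call,
                        table.put("shared", seen + 1);  // so another write can slip in between
                    }
                }
            });
            threads[i].start();
        }
        try {
            for (Thread t : threads) {
                t.join(); // wait for every worker before reading the total
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        return table.get("shared");
    }

    public static void main(String[] args) {
        System.out.println("separate get/put: " + run(100, 10, false) + " (can fall short of 1000)");
        System.out.println("single locked op: " + run(100, 10, true));
    }
}
```

If the test does follow this pattern, a total slightly under 1000 would be expected even with a correctly locked table, and the fix would be to hold one lock around the whole read-modify-write. If it does not, the sketch says nothing about where the missing updates go.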
armedbear-devel@common-lisp.net