Howdy,
I wonder if those of you who have worked with threads might have a quick look to see if I am doing something stupid.
https://lsw2.googlecode.com/svn/branches/bona/util/jargrep.lisp
The situation is that I want to do stuff (like look for matches to a regular expression) in 240k files which comprise 52G of data.
I am running on a VM allocated 5 CPUs, each with three cores.
Because at the moment the disk subsystem isn't very fast, I decided to approach this by breaking up the 240k files into 15 parts and putting each part in a jar file.
The code mentioned above looks for a regular expression (two methods for two different regex handlers: Java and dk.brics.automaton).
It is invoked something like:
(jar-map-threads-automaton-find regex (generate-filename-sequence "/data/jars/15/file#.jar" 2 0 14))
This spawns off 15 threads that go at it for around a minute. As they find hits, they save them in a Lisp hash table keyed by the entry name in the jar file, which is unique across all the jar files.
The result of running this is about 20 key-value pairs in the hash table (I had read that ABCL hash tables are thread safe). And there's the rub: different runs of this code on the same data produce different numbers of key-value pairs, anywhere between 13 and 24!
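To make the pattern concrete, here is a minimal sketch of it, not the code in jargrep.lisp. Several threads write unique keys into one shared hash table, every write is wrapped in a mutex, and every thread is joined before the table is read. It assumes ABCL's threads package (threads:make-thread, threads:thread-join, threads:make-mutex, threads:with-mutex). The function name, the key format and the worker body are made up for illustration.

```lisp
;; Sketch only: the worker body is a stand-in, not the code in jargrep.lisp.
;; Assumes ABCL, whose THREADS package provides the calls used below.
(defun collect-hits-with-lock (n-threads hits-per-thread)
  "Run N-THREADS workers that each record HITS-PER-THREAD entries in one
shared hash table, taking a mutex around every write."
  (let ((table (make-hash-table :test 'equal))
        (lock (threads:make-mutex)))
    (let ((workers
            (loop for i below n-threads
                  collect (let ((id i)) ; fresh binding for each closure
                            (threads:make-thread
                             (lambda ()
                               (dotimes (j hits-per-thread)
                                 ;; Hypothetical unique key, standing in
                                 ;; for a jar entry name.
                                 (let ((key (format nil "jar~D/entry~D" id j)))
                                   (threads:with-mutex (lock)
                                     (setf (gethash key table) id)))))
                             :name (format nil "worker-~D" id))))))
      ;; Wait for every worker to finish before the table is read.
      (dolist (w workers)
        (threads:thread-join w)))
    table))
```

With the lock and the joins in place, (hash-table-count (collect-hits-with-lock 15 100)) should be 1500 on every run. If a variant without the mutex, or without the joins, gives a count that varies from run to run, that would point at either the table writes or reading the table before all threads have finished.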
I'm not sure whether I'm just not doing this the right way, in which case it would be very helpful to get an explanation of why not, or there's a problem somewhere in the implementation.
Any ideas would be greatly appreciated.
Best, Alan
(LISP-IMPLEMENTATION-VERSION) "1.2.0-dev-svn-14436M"
"Java_HotSpot(TM)_64-Bit_Server_VM-Oracle_Corporation-1.7.0_21-b11"
"amd64-Linux-3.8.0-30-generic"