Hello,
Not sure if this is the appropriate forum as the email is not related to the development of cl-ppcre, but I did not find a list for users. Please feel free to redirect me elsewhere.
I could use some help in figuring out why this regexp is so slow. As far as I can tell, there is nothing abnormal about it. I currently use the same regexp in python and its blazes through the input file. Bear in mind, this is the first time that I've used cl-ppcre. It is was an experiment to see if I could lisp for this little application.
Here is the regexp (at least a small portion of it that exhibits the behavior I am seeing):
^(?:\S+ ){7}(\S+)\s+- commAlarm
Here is the input line it is matching against (note: this is a single line albeit a long one):
1105243660 11 Sun Jan 09 04:07:40 2005 sclax02.ibasis.net - commAlarm ovnyc00p.ov.i\vanet.net [1] private.enterprises.2496.1.1.5.5.1.0 (Integer): 0 [2] private.enterprises.\2496.1.1.5.5.2.0 (Integer): 115 [3] private.enterprises.2496.1.1.5.5.3.0 (OctetString): \ISUP: UNEX ANM [4] private.enterprises.2496.1.1.5.5.4.0 (OctetString): ISDN User Part Un\expected ANM [5] private.enterprises.2496.1.1.5.5.5.0 (Integer): 2 [6] private.enterpri\ses.2496.1.1.5.5.6.0 (Integer): 1 [7] private.enterprises.2496.1.1.5.5.7.0 (Integer): 1 \ [8] private.enterprises.2496.1.1.5.5.8.0 (Integer): 2 [9] private.enterprises.2496.1.1.\1.1.1.1.1.1.1.1376258 (Integer): 1376258 [10] private.enterprises.2496.1.1.1.1.1.1.1.1.2.1376258 (Integer): 21 [11] private.enterprises.2496.1.1.1.1.1.1.1.1.4.1376258 (OctetStr\ing): ss7path-att [12] private.enterprises.2496.1.1.1.1.1.1.1.1.5.1376258 (OctetString):\ SS7 Path For ATT and NGT DPC 5.21.39 [13] private.enterprises.2496.1.1.1.1.1.1.1.1.3.13\76258 (Integer): 1245188 [14] private.enterprises.2496.1.1.5.5.9.0 (Integer): 1105243880;1 .1.3.6.1.4.1.2496.1.1.4.1 0
Stuff 51 of those lines above into a into a file and try to match on that regexp and I get the following results:
PGW> (time (parse-file "/tmp/sample")) Evaluation took: 2.984 seconds of real time 1.81 seconds of user run time 1.12 seconds of system run time 0 page faults and 228,191,424 bytes consed.
I am hoping to parse a file that has close to 75,000 lines in that format. At this rate, I will never make it in a reasonable amount of time. Here is the PARSE-FILE function I am using:
(defun parse-file (file) (with-open-file (in file) (do ((line (read-line in nil :eof) (read-line in nil :eof))) ((eql line :eof) t) (do-register-groups ((#'intern host)) ("^(?:\S+ ){7}(\S+)\s+- commAlarm" line) (format t "Found host ~A~%" host)))))
I've also read the docs and I am unable to find anything to help identify the problem with my regexp. I have tried single-line mode, however the results were very similiar. My platform is SBCL 0.8.18.23 and version 1.0 of cl-ppcre.
Any help would be appreciated.
Thanks, Pete
Hi!
On Thu, 20 Jan 2005 15:16:52 -0500, pete-cl-ppcre@kazmier.com wrote:
Not sure if this is the appropriate forum as the email is not related to the development of cl-ppcre, but I did not find a list for users. Please feel free to redirect me elsewhere.
It's fine to ask this kind of questions here.
I could use some help in figuring out why this regexp is so slow. As far as I can tell, there is nothing abnormal about it. I currently use the same regexp in python and its blazes through the input file. Bear in mind, this is the first time that I've used cl-ppcre. It is was an experiment to see if I could lisp for this little application.
Here is the regexp (at least a small portion of it that exhibits the behavior I am seeing):
^(?:\\S+ ){7}(\\S+)\\s+- commAlarm
Here is the input line it is matching against (note: this is a single line albeit a long one):
1105243660 11 Sun Jan 09 04:07:40 2005 sclax02.ibasis.net - commAlarm ovnyc00p.ov.i\vanet.net [1] private.enterprises.2496.1.1.5.5.1.0 (Integer): 0 [2] private.enterprises.\2496.1.1.5.5.2.0 (Integer): 115 [3] private.enterprises.2496.1.1.5.5.3.0 (OctetString): \ISUP: UNEX ANM [4] private.enterprises.2496.1.1.5.5.4.0 (OctetString): ISDN User Part Un\expected ANM [5] private.enterprises.2496.1.1.5.5.5.0 (Integer): 2 [6] private.enterpri\ses.2496.1.1.5.5.6.0 (Integer): 1 [7] private.enterprises.2496.1.1.5.5.7.0 (Integer): 1 \ [8] private.enterprises.2496.1.1.5.5.8.0 (Integer): 2 [9] private.enterprises.2496.1.1.\1.1.1.1.1.1.1.1376258 (Integer): 1376258 [10] private.enterprises.2496.1.1.1.1.1.1.1.1.2.1376258 (Integer): 21 [11] private.enterprises.2496.1.1.1.1.1.1.1.1.4.1376258 (OctetStr\ing): ss7path-att [12] private.enterprises.2496.1.1.1.1.1.1.1.1.5.1376258 (OctetString):\ SS7 Path For ATT and NGT DPC 5.21.39 [13] private.enterprises.2496.1.1.1.1.1.1.1.1.3.13\76258 (Integer): 1245188 [14] private.enterprises.2496.1.1.5.5.9.0 (Integer): 1105243880;1 .1.3.6.1.4.1.2496.1.1.4.1 0
Stuff 51 of those lines above into a into a file and try to match on that regexp and I get the following results:
PGW> (time (parse-file "/tmp/sample")) Evaluation took: 2.984 seconds of real time 1.81 seconds of user run time 1.12 seconds of system run time 0 page faults and 228,191,424 bytes consed.
That's much too slow and much too much consing. FWIW, here's what I get with SBCL 0.8.16 and 50 lines like the one from above:
* (time (parse-file "/tmp/sample")) Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Evaluation took: 0.148 seconds of real time 0.1 seconds of user run time 0.01 seconds of system run time 0 page faults and 506,416 bytes consed. T
My platform is SBCL 0.8.18.23 and version 1.0 of cl-ppcre.
My wild guess is that this is due your version of SBCL being from the new Unicode branch which I haven't tried yet. If you don't need full Unicode support then maybe you should switch it off. Or better, report this to the SBCL maintainers (if my guess is right). Also, see the note about simple strings in the CL-PPCRE docs.
To show you that CL-PPCRE is not necessarily slow with full Unicode support here's the output from AllegroCL 7.0:
CL-USER(5): (time (parse-file "/tmp/sample")) Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net Found host sclax02.ibasis.net ; cpu time (non-gc) 100 msec user, 30 msec system ; cpu time (gc) 110 msec user, 0 msec system ; cpu time (total) 210 msec user, 30 msec system ; real time 294 msec ; space allocation: ; 15,326 cons cells, 13,453,192 other bytes, 0 static bytes T
I am hoping to parse a file that has close to 75,000 lines in that format. At this rate, I will never make it in a reasonable amount of time.
Here's the output from CMUCL 19a for 100,000 lines like above (with the FORMAT form in your function removed and Linux running within VMWare on my laptop):
* (time (parse-file "/tmp/sample")) ; Compiling LAMBDA NIL: ; Compiling Top-Level Form:
; Evaluation took: ; 20.27 seconds of real time ; 3.01 seconds of user run time ; 17.25 seconds of system run time ; 40,541,141,192 CPU cycles ; [Run times include 1.05 seconds GC run time] ; 0 page faults and ; 938,959,528 bytes consed. ; T
If that doesn't help, let me know.
Cheers, Edi.
On Thu, Jan 20, 2005 at 09:51:05PM +0100, Edi Weitz wrote:
My wild guess is that this is due your version of SBCL being from the new Unicode branch which I haven't tried yet. If you don't need full Unicode support then maybe you should switch it off.
Hmmm ... right now I am using a debian package, but I'll build myself an SBCL without unicode support and see what happens.
Thanks, Pete
On Thu, 20 Jan 2005 17:15:32 -0500, pete-cl-ppcre@kazmier.com wrote:
Hmmm ... right now I am using a debian package, but I'll build myself an SBCL without unicode support and see what happens.
I just checked myself - I installed the SBCL version from Debian unstable (which currently is 0.8.18.34) and got the same bad results that you got (for 50 lines):
Evaluation took: 1.951 seconds of real time 0.4 seconds of user run time 1.54 seconds of system run time 1 page fault and 224,923,896 bytes consed. T
Looks like you should ask on the SBCL list or switch.
Cheers, Edi.
And my SBCL build just finished (no unicode support) and things work as expected. I'll ask the folks on #lisp where to send a report.
On Fri, Jan 21, 2005 at 12:13:27AM +0100, Edi Weitz wrote:
On Thu, 20 Jan 2005 17:15:32 -0500, pete-cl-ppcre@kazmier.com wrote:
Hmmm ... right now I am using a debian package, but I'll build myself an SBCL without unicode support and see what happens.
I just checked myself - I installed the SBCL version from Debian unstable (which currently is 0.8.18.34) and got the same bad results that you got (for 50 lines):
Evaluation took: 1.951 seconds of real time 0.4 seconds of user run time 1.54 seconds of system run time 1 page fault and 224,923,896 bytes consed. T
Looks like you should ask on the SBCL list or switch.
Cheers, Edi.
Actually I just realized that your problem revealed a thinko in CL-PPCRE. I'll release a new version in a couple of hours (have to cook something for my wife and my kid first...) which fixes that.
Cheers, Edi.
OK, now these are the results I get with CL-PPCRE 1.1.0:
* (lisp-implementation-type)
"SBCL" * (lisp-implementation-version)
"0.8.18.34" * *regex-char-code-limit*
1114112 * (defun parse-file (file) (with-open-file (in file) (do ((line (read-line in nil :eof) (read-line in nil :eof))) ((eql line :eof) t) (do-register-groups ((#'intern host)) ("^(?:\S+ ){7}(\S+)\s+- commAlarm" line) ;; do something other than printing for more accurate TIME ;; results host))))
PARSE-FILE * (time (parse-file "/tmp/sample50")) ;; your sample, 50 lines
Evaluation took: 0.03 seconds of real time 0.01 seconds of user run time 0.02 seconds of system run time 0 page faults and 519,008 bytes consed. T * (time (parse-file "/tmp/sample500")) ;; your sample, 500 lines
Evaluation took: 0.314 seconds of real time 0.24 seconds of user run time 0.07 seconds of system run time 0 page faults and 5,194,960 bytes consed. T * (time (parse-file "/tmp/sample5000")) ;; your sample, 5000 lines
Evaluation took: 1.187 seconds of real time 0.34 seconds of user run time 0.85 seconds of system run time 0 page faults and 51,942,672 bytes consed. T
That's better, isn't it? Thanks for making me aware of this problem.
Cheers, Edi.
On Sun, Jan 23, 2005 at 06:51:32PM +0100, Edi Weitz wrote:
OK, now these are the results I get with CL-PPCRE 1.1.0:
- (time (parse-file "/tmp/sample5000")) ;; your sample, 5000 lines
Evaluation took: 1.187 seconds of real time 0.34 seconds of user run time 0.85 seconds of system run time 0 page faults and 51,942,672 bytes consed. T
That's better, isn't it? Thanks for making me aware of this problem.
It is better now. However, I am no longer using DO-REGISTER-GROUPS as I realized that I intended to use REGISTER-GROUPS-BIND all along. In any case, I am thrilled with the performance of cl-ppcre now that I have set *regex-char-code-limit* to an appropriate value.
I think your library is great, my only problem now is the performance of READ-LINE in the free CL implementations (which is certainly not your problem). Its absolutely terrible in SBCL. I've inquired about it, but have not really received any good responses yet:
http://sourceforge.net/mailarchive/forum.php?thread_id=6400934&forum_id=...
On Mon, 24 Jan 2005 15:24:47 -0500, pete-cl-ppcre@kazmier.com wrote:
It is better now. However, I am no longer using DO-REGISTER-GROUPS as I realized that I intended to use REGISTER-GROUPS-BIND all along. In any case, I am thrilled with the performance of cl-ppcre now that I have set *regex-char-code-limit* to an appropriate value.
Actually, if your Lisp implementation has a decent hash table implementation the value of *REGEX-CHAR-CODE-LIMIT* shouldn't matter much during scanning - only when a regular expression is created.
The problem that I fixed in 1.1.0 was that your function created a new scanner each time it was called although this wasn't necessary. This resulted in much too much consing (because of the huge hash tables that were generated).
I think your library is great, my only problem now is the performance of READ-LINE in the free CL implementations (which is certainly not your problem). Its absolutely terrible in SBCL. I've inquired about it, but have not really received any good responses yet:
http://sourceforge.net/mailarchive/forum.php?thread_id=6400934&forum_id=...
Yeah, I've read that and I'm concerned, too. I'd love to see an encouraging response. Have you tried to download the trial versions of AllegroCL and/or LispWorks just to check how they cope with this?
Cheers, Edi.
On Mon, Jan 24, 2005 at 09:57:38PM +0100, Edi Weitz wrote:
Yeah, I've read that and I'm concerned, too. I'd love to see an encouraging response. Have you tried to download the trial versions of AllegroCL and/or LispWorks just to check how they cope with this?
I spent the morning downloading the trial versions of these products to test their speed in comparison to the free implementations. LW was very fast, Allegro was still slow compared to the equivalent python program, but faster than the fastest free implementation (Clisp).
I want to see how fast LW and cl-ppcre compare to the equivalent python program (I think it'll be faster), I just haven't figured out how to get ASDF working with it yet. And I'm still hoping for some developer to comment on my post to the SBCL user list as I'm stuck with the open-source implementations.
LispWorks: CL-USER 17 > (time (tester "/tmp/pgw-logs/trapd.log.ovnyc00p")) Timing the evaluation of (TESTER "/tmp/pgw-logs/trapd.log.ovnyc00p")
user time = 0.880 system time = 0.180 Elapsed time = 0:00:01 Allocation = 36822608 bytes standard / 4070 bytes conses 0 Page faults Calls to %EVAL 34 T
Allegro: CL-USER(7): (time (tester "/tmp/pgw-logs/trapd.log.ovnyc00p")) ; cpu time (non-gc) 2,730 msec user, 90 msec system ; cpu time (gc) 80 msec user, 0 msec system ; cpu time (total) 2,810 msec user, 90 msec system ; real time 3,493 msec ; space allocation: ; 67 cons cells, 67,825,256 other bytes, 0 static bytes T
cl-ppcre-devel@common-lisp.net