Hi!
On Thu, 20 Jan 2005 15:16:52 -0500, pete-cl-ppcre@kazmier.com wrote:
Not sure if this is the appropriate forum as the email is not related to the development of cl-ppcre, but I did not find a list for users. Please feel free to redirect me elsewhere.
It's fine to ask this kind of question here.
I could use some help in figuring out why this regexp is so slow. As far as I can tell, there is nothing abnormal about it. I currently use the same regexp in Python and it blazes through the input file. Bear in mind, this is the first time that I've used cl-ppcre. It was an experiment to see if I could use Lisp for this little application.
Here is the regexp (at least a small portion of it that exhibits the behavior I am seeing):
^(?:\\S+ ){7}(\\S+)\\s+- commAlarm
Here is the input line it is matching against (note: this is a single line albeit a long one):
1105243660 11 Sun Jan 09 04:07:40 2005 sclax02.ibasis.net - commAlarm ovnyc00p.ov.ivanet.net [1] private.enterprises.2496.1.1.5.5.1.0 (Integer): 0 [2] private.enterprises.2496.1.1.5.5.2.0 (Integer): 115 [3] private.enterprises.2496.1.1.5.5.3.0 (OctetString): ISUP: UNEX ANM [4] private.enterprises.2496.1.1.5.5.4.0 (OctetString): ISDN User Part Unexpected ANM [5] private.enterprises.2496.1.1.5.5.5.0 (Integer): 2 [6] private.enterprises.2496.1.1.5.5.6.0 (Integer): 1 [7] private.enterprises.2496.1.1.5.5.7.0 (Integer): 1 [8] private.enterprises.2496.1.1.5.5.8.0 (Integer): 2 [9] private.enterprises.2496.1.1.1.1.1.1.1.1.1.1376258 (Integer): 1376258 [10] private.enterprises.2496.1.1.1.1.1.1.1.1.2.1376258 (Integer): 21 [11] private.enterprises.2496.1.1.1.1.1.1.1.1.4.1376258 (OctetString): ss7path-att [12] private.enterprises.2496.1.1.1.1.1.1.1.1.5.1376258 (OctetString): SS7 Path For ATT and NGT DPC 5.21.39 [13] private.enterprises.2496.1.1.1.1.1.1.1.1.3.1376258 (Integer): 1245188 [14] private.enterprises.2496.1.1.5.5.9.0 (Integer): 1105243880;1 .1.3.6.1.4.1.2496.1.1.4.1 0
Stuff 51 of those lines above into a file, try to match against that regexp, and I get the following results:
PGW> (time (parse-file "/tmp/sample"))
Evaluation took:
  2.984 seconds of real time
  1.81 seconds of user run time
  1.12 seconds of system run time
  0 page faults and
  228,191,424 bytes consed.
That's much too slow and much too much consing. FWIW, here's what I get with SBCL 0.8.16 and 50 lines like the one from above:
* (time (parse-file "/tmp/sample"))
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Evaluation took:
  0.148 seconds of real time
  0.1 seconds of user run time
  0.01 seconds of system run time
  0 page faults and
  506,416 bytes consed.
T
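The PARSE-FILE function that produced this output is not reproduced in this message. A minimal sketch of what such a function might look like follows. Everything in it is an assumption made for illustration: the function body, the variable name *HOST-SCANNER*, and the use of ASDF to load CL-PPCRE. Only the regexp and the "Found host" output format are taken from the text above.

```lisp
;; Assumption: CL-PPCRE is installed somewhere ASDF can find it.
(eval-when (:compile-toplevel :load-toplevel :execute)
  (require :asdf)
  (asdf:load-system :cl-ppcre))

;; Build the scanner once instead of re-parsing the regexp for every line.
;; *HOST-SCANNER* is a hypothetical name, not taken from the original code.
(defparameter *host-scanner*
  (cl-ppcre:create-scanner "^(?:\\S+ ){7}(\\S+)\\s+- commAlarm"))

(defun parse-file (path)
  "Hypothetical reconstruction: print the host name captured from each
line of the file at PATH that matches the regexp. Returns T."
  (with-open-file (in path :direction :input)
    (loop for line = (read-line in nil nil)
          while line
          ;; The body only runs when the line matches; HOST is bound to
          ;; the single register group (the eighth whitespace-separated field).
          do (cl-ppcre:register-groups-bind (host) (*host-scanner* line)
               (format t "Found host ~A~%" host)))
    t))
```

Creating the scanner outside the loop keeps regexp compilation out of the timing, so a slowdown measured with TIME would more likely come from the matching itself or from how the implementation reads and represents the lines.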
My platform is SBCL 0.8.18.23 and version 1.0 of cl-ppcre.
My wild guess is that this is due to your version of SBCL being from the new Unicode branch, which I haven't tried yet. If you don't need full Unicode support then maybe you should switch it off. Or better, report this to the SBCL maintainers (if my guess is right). Also, see the note about simple strings in the CL-PPCRE docs.
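As an illustration of the simple-string point, here is a small sketch in standard Common Lisp that checks whether a string is a SIMPLE-STRING and converts it if not. ENSURE-SIMPLE-STRING is a hypothetical helper name, not part of CL-PPCRE, and whether this makes any difference for the case at hand is untested.

```lisp
(defun ensure-simple-string (string)
  "Return STRING itself if it is already a SIMPLE-STRING, otherwise a
simple copy with the same characters. Hypothetical helper for illustration."
  (if (simple-string-p string)
      string
      (coerce string 'simple-string)))

;; Usage: a string with a fill pointer is typically not simple,
;; although the standard leaves that up to the implementation.
(defparameter *example-line*
  (make-array 5 :element-type 'character
                :fill-pointer 5
                :adjustable t
                :initial-contents "hello"))

(defparameter *example-simple-line*
  (ensure-simple-string *example-line*))
```

Calling (simple-string-p line) on the strings actually handed to the scanner is a cheap way to find out whether the note applies before changing any code.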
To show you that CL-PPCRE is not necessarily slow with full Unicode support here's the output from AllegroCL 7.0:
CL-USER(5): (time (parse-file "/tmp/sample"))
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
; cpu time (non-gc) 100 msec user, 30 msec system
; cpu time (gc)     110 msec user, 0 msec system
; cpu time (total)  210 msec user, 30 msec system
; real time  294 msec
; space allocation:
;  15,326 cons cells, 13,453,192 other bytes, 0 static bytes
T
I am hoping to parse a file that has close to 75,000 lines in that format. At this rate, I will never make it in a reasonable amount of time.
Here's the output from CMUCL 19a for 100,000 lines like above (with the FORMAT form in your function removed and Linux running within VMWare on my laptop):
* (time (parse-file "/tmp/sample"))
; Compiling LAMBDA NIL:
; Compiling Top-Level Form:

; Evaluation took:
;   20.27 seconds of real time
;   3.01 seconds of user run time
;   17.25 seconds of system run time
;   40,541,141,192 CPU cycles
;   [Run times include 1.05 seconds GC run time]
;   0 page faults and
;   938,959,528 bytes consed.
;
T
If that doesn't help, let me know.
Cheers,
Edi.