Hi,
In a part of the program, I find out that the below CL-PPCRE:SCAN call really slows down the whole operation:
(cl-ppcre:scan "^\[([^ ]{1,})+[ ]*(.{1,})?\]" "foo [[Main]] [http://baz%5D%27%27bold%27%27%27%27%27%27%27bar''''" :start 13)
When I remove the :START keyword and the beginning `^' regex character, SCAN finishes the operation quite fast, as it should be. What can be the problem in here? (A bug?) How can I fix this weird behaviour?
Regards.
(cl-ppcre:scan "^\[([^ ]{1,})+[ ]*(.{1,})?\]" "foo [[Main]] [http://baz%5D%27%27bold%27%27%27%27%27%27%27bar''''" :start 13)
This is a regular expression that does lots of backtracking when it fails. If you change that you'll most likely see a large performance improvement.
A small change is to simplify the first grouping:
"^\[([^ ]{1,})[ ]*(.{1,})?\]"
The reason that having :start is so much slower is that the regex matches a different string that needs far less backtracking that without the :start.
Cheers, Chris Dean
Chris Dean ctdean@sokitomi.com writes:
(cl-ppcre:scan "^\[([^ ]{1,})+[ ]*(.{1,})?\]" "foo [[Main]] [http://baz%5D%27%27bold%27%27%27%27%27%27%27bar''''" :start 13)
This is a regular expression that does lots of backtracking when it fails. If you change that you'll most likely see a large performance improvement.
A small change is to simplify the first grouping:
"^\[([^ ]{1,})[ ]*(.{1,})?\]"
The reason that having :start is so much slower is that the regex matches a different string that needs far less backtracking that without the :start.
Next time, how can I understand when a regex will need that much backtracking? I'll be really appreciated if you'd explain the pattern a little bit more.
By the way, what I'm trying to do is to parse string patterns like `[href]' and `[href text]'. And as you can realize from :START 13 keyword, I'm previously determined that at 13th character, there exists a `['. Do you suggest any other method to parse such strings more efficiently?
Regards.
By the way, what I'm trying to do is to parse string patterns like `[href]' and `[href text]'. And as you can realize from :START 13 keyword, I'm previously determined that at 13th character, there exists a `['. Do you suggest any other method to parse such strings more efficiently?
A regex seems like a fine way to me.
Next time, how can I understand when a regex will need that much backtracking? I'll be really appreciated if you'd explain the pattern a little bit more.
You can play around with the Regex Coach http://weitz.de/regex-coach/ and step through the matching. If you really want a deeper understanding Jeffrey Friedl's Mastering Regular Expressions is very good. And most compiler text books will cover regexs as well.
Maybe someone else has a simple backtracking explanation.
Cheers, Chris Dean
cl-ppcre-devel@common-lisp.net