For an :ALTERNATION, is there an O(1) way to tell which alternate matched? e.g. for a regexp (a)|(b)|(c)|(d)|... I don't want the O(n) performance of scanning the list returned by SCAN or SCAN-TO-STRINGS. Instead, I'd just like a number saying which one matched. I understand that there can normally be nested groups, alternatives that are not groups, and other complications - so maybe a list/vector of groups that were matched would work as a general interface.
More background:
For a quick and dirty scanner (similar to lex) I thought I could use CL-PPCRE. The "rules" (in lex terminology) are just regular expressions, and so I combine them into a set of alternatives, e.g.
\s+ -> whitespace [0-9]+ -> number [a-zA-Z][a-zA-Z0-9]* -> identifier
=> "(\s+)|([0-9]+)|([a-zA-Z][a-zA-Z0-9]*)"
Actually, I use CL-PPCRE::PARSE-STRING to get a parse tree for each expression, change any :REGISTER tags to :GROUP tags, and combine them under an :ALTERNATION tag:
(defun combine-regexps (&rest regexps) "Combine several regexp strings into a CL-PPCRE parse tree of alternatives." ;; All registers are changed into groups, i.e. (x) -> (?:x) in regexp syntax. ;; That keeps the registers from messing up the scanner's expectation that ;; each register is one of the rules, and allows the list of rules to use ;; () instead of (?:) throughout for readability. (let ((registers (mapcar (lambda (r) `(:register ,(subst :group :register (cl-ppcre::parse-string r)))) regexps))) `(:sequence :start-anchor (:group (:alternation ,@registers)))))
On Mon, 12 Feb 2007 16:18:43 -0800, Harold Lee harold@hotelling.net wrote:
For an :ALTERNATION, is there an O(1) way to tell which alternate matched? e.g. for a regexp (a)|(b)|(c)|(d)|... I don't want the O(n) performance of scanning the list returned by SCAN or SCAN-TO-STRINGS. Instead, I'd just like a number saying which one matched.
You could use filters:
http://weitz.de/cl-ppcre/#filters
However, are you really sure that the O(n) operation of looping through the registers is causing performance problems? Have you profiled the code? This looks like premature optimization to me.
I'm pretty sure that whatever you'll do to "improve" this (filters or changing CL-PPCRE internally to return more information for example) you'll end up with something even slower.
For a quick and dirty scanner (similar to lex) I thought I could use CL-PPCRE. The "rules" (in lex terminology) are just regular expressions, and so I combine them into a set of alternatives, e.g.
\s+ -> whitespace [0-9]+ -> number [a-zA-Z][a-zA-Z0-9]* -> identifier
=> "(\s+)|([0-9]+)|([a-zA-Z][a-zA-Z0-9]*)"
Actually, I use CL-PPCRE::PARSE-STRING to get a parse tree for each expression, change any :REGISTER tags to :GROUP tags, and combine them under an :ALTERNATION tag:
That doesn't feel right. I wouldn't create regex strings first just to parse them into s-expressions afterwards. Why don't you start with s-expressions right away?
Cheers, Edi.
Edi Weitz wrote:
However, are you really sure that the O(n) operation of looping through the registers is causing performance problems? Have you profiled the code? This looks like premature optimization to me.
I'm pretty sure that whatever you'll do to "improve" this (filters or changing CL-PPCRE internally to return more information for example) you'll end up with something even slower.
I am guilty of premature optimization at some level, but I think flex has set the bar high for scanner performance. I'll spend more time examining performance to see if this is really needed.
That doesn't feel right. I wouldn't create regex strings first just to parse them into s-expressions afterwards. Why don't you start with s-expressions right away?
I'll change COMBINE-REGEXPS to only call CL-PPCRE::PARSE-STRING for strings (and assume other data is an appropriate s-expression). I'd like to make this very similar to lex / flex in allowing users of this package to use regular expressions. I'm not worried about the parsing performance because I am doing this at compile time (via a macro, DEFSCANNER).
cl-ppcre-devel@common-lisp.net