Hi Thomas!
Sorry for the delay.
On Sat, 15 May 2004 21:52:25 +0200, Thomas Schilling tjs_ng@yahoo.de wrote:
I'm currently writing a pattern matcher for trees. This is (hopefully) going to become a tool for easily extracting information from tree-structured documents, specifically XML/HTML files. (Yes, I know of SAX and XSLT and XPath but I think this can become more powerful--I plan to allow embeddable code. Nevertheless it could be added to e.g. CL-XML.)
In fact it's a regular expression matcher that two-dimensionally matches on lists. (car is considered to be the node's name, i.e. tag-name, and the cdr it's children). So that's the reason why I'm asking you.
So here's my problem: I, well, stole some ideas from you. Because we need heavy backtracking and looking ahead I implemented it (just like you, I think) largely via closures, i.e. each rule is matched via a function, and the rules are linked via a next-rule (function-) variable that is put inside each closure. So e.g. the repetion function can check by itself if the rest of the pattern matches when the number of repetitions is tried. This works great except in the folloing case:
When I have (in perl-notation) "abaa" =~ /^(a|ab)*a/ the match failes. That's because the alternative which has only the information of the two choices "a" and "ab" but no successor information (because it can have two--the alternative itself or the "a" after the repetition) succeeds when matching the first alternative and never tries to match the second, although then the repetition could match.
I currently have no real solution to this problem. How did you solve this? I looked it up in your code but I couldn't find it (maybe it's masked by the many optimizations?).
I would be very thankful if you could help me.
Ok, so here's the code: http://www.inf.tu-dresden.de/~s3815210/lisp tree-match.lisp. You can simply C-c C-k it. and then C-x C-e the (test-all) at the end of the file. It should print the one failing test. The interesting functions are probably OR-M and GREEDY-REPEAT-M. The problem is mentioned in line 286 in the PARSE-PATTERN function.
I only looked briefly at your code 'cause I'm busy with other things but I /think/ the problem is in OR-M:
(defun or-m (options next-rule) (declare (type function next-rule)) (matcher-fn (do ((r options (cdr r))) ((null r) nil) (multiple-value-bind (pos inp) (funcall (the function (car r)) input position) (when pos (multiple-value-bind (pos2 inp2) (funcall next-rule inp pos) (when pos2 (return (values position inp2)))))))))
Here you try each function in OPTIONS in turn and if one of them succeeds you immediately call NEXT-RULE so there's no chance for backtracking. What you have to do is to construct a new function for each element in OPTIONS which itself knows how to proceed (namely via NEXT-RULE) and try these new functions instead of the ones in OPTIONS.
FWIW, here's the relevant code from CL-PPCRE:
(defmethod create-matcher-aux ((alternation alternation) next-fn) ;; first create closures for all alternations of ALTERNATION (let ((all-matchers (mapcar #'(lambda (choice) (create-matcher-aux choice next-fn)) (choices alternation)))) ;; now create a closure which checks if one of the closures ;; created above can succeed (lambda (start-pos) (declare (type fixnum start-pos)) (loop for matcher in all-matchers thereis (funcall (the function matcher) start-pos)))))
I don't think this is masked by too many optimizations... :)
The important part is the creation of ALL-MATCHERS from (CHOICES ALTERNATION) (which would be your OPTIONS).
Does that help?
This is just a wild guess from looking at your code for a couple of minutes. If that doesn't help feel free to ask me again.
Please ask me if you have any questions.
If you think of a place to which should post this please tell me.
I've sent a copy to the cl-ppcre-devel mailing list. We should take further discussion there so they'll be archived (if you don't mind).
HTH, Edi.