[cl-ppcre-devel] Buffered multi-line question

Sébastien Saint-Sevin

11 Oct 2004

4:52 p.m.

Hi Edi & list,

I'm doing multi-line regex searches over big files that can't be converted to a single string. So I introduced a kind of buffer that I search in.

Now I need to add a constraint to scan, do-scans & others (in addition to (&key start end)): I want to be able to tell the engine that a match must start before a certain index in the string (to avoid searching for further results that will be discarded later because of my buffered multi-line matching process).

Logically, this :must-start-before value corresponds to the first line of my buffer. If nothing starts on the first line, I need to move the search one line forward, so everything that the engine would match later on in the string is wasted time.

How can I do it?

Cheers, Sebastien.

PS: Edi, if you are back, my previous post is still an open question ;-) (the one with FILTER...)

Edi Weitz

11 Oct 2004

5:32 p.m.

Hi Sébastien! On Mon, 11 Oct 2004 18:52:56 +0200, Sébastien Saint-Sevin <seb-cl-mailist@matchix.com> wrote:

...

I'm doing multi-lines regex searches over big files that can't be converted to single string. So I introduced a kind of buffer that I'm using to search.

Now, I need to add a constraint to scan, do-scans & others (in addition to (&key start end)) : I want to be able to specify to the engine that a scan must start before a certain index in the string (to avoid searching further results that will be cancelled later because of my buffered multi-line matching process).

Logically, this :must-start-before value corresponds to the first line of my buffer. If nothing starts on the first line, I need to move the search one line forward, so everything that the engine would match later on in the string is wasted time.

How can I do it ?

Have you considered using something like

(?s:(?=.{n}))<actual-regular-expression>

where n obviously is an integer computed from your constraints above? I don't know how this'll behave performance-wise but you could just try it... :)

Or have I misunderstood your question? Actually, I'm not sure why the END keyword parameter doesn't suffice. Can you give an example?

...

PS: Edi, if you are back, my previous post is still an open question ;-) (the one with FILTER...)

Yes, I'm back but unfortunately I'm very busy with commercial stuff right now. Sorry, filters will have to wait some more. Cheers, Edi.

Sébastien Saint-Sevin

7:35 p.m.

...

Hi Sébastien!

On Mon, 11 Oct 2004 18:52:56 +0200, Sébastien Saint-Sevin <seb-cl-mailist@matchix.com> wrote:

...
I'm doing multi-lines regex searches over big files that can't be converted to single string. So I introduced a kind of buffer that I'm using to search.

Now, I need to add a constraint to scan, do-scans & others (in addition to (&key start end)) : I want to be able to specify to the engine that a scan must start before a certain index in the string (to avoid searching further results that will be cancelled later because of my buffered multi-line matching process).

Logically, this :must-start-before value corresponds to the first line of my buffer. If nothing starts on the first line, I need to move the search one line forward, so everything that the engine would match later on in the string is wasted time.

How can I do it ?

Have you considered using something like

(?s:(?=.{n}))<actual-regular-expression>

where n obviously is an integer computed from your constraints above? I don't know how this'll behave performance-wise but you could just try it... :)

Or have I misunderstood your question? Actually, I'm not sure why the END keyword parameter doesn't suffice. Can you give an example?

As far as I understand it, (?s:(?=.{n})) will only guarantee that at least n chars remain after match-start in the consumed string. This is not what I want. I want something that guarantees that match-start will be before index n (meaning the n'th char in the consumed string), whether match-end is before or after this index n.
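A quick check of this reading, as a sketch only: it assumes CL-PPCRE can be loaded (here through Quicklisp, which is an assumption) and uses a made-up target string. The look-ahead constrains how many characters remain after the match start, not where the match is allowed to start.

```lisp
;; Assumption: Quicklisp is installed; any other way of loading CL-PPCRE works too.
(ql:quickload :cl-ppcre)

;; Hypothetical target string: "bar" starts at index 4, where only 3 characters remain.
(defparameter *target* "foo bar")

;; Requires at least 5 characters after the match start: no match.
(cl-ppcre:scan "(?s:(?=.{5}))bar" *target*)   ; => NIL

;; Requires at least 3 characters after the match start: "bar" matches at index 4.
(cl-ppcre:scan "(?s:(?=.{3}))bar" *target*)   ; => 4, 7, #(), #()
```

In both calls the only candidate start is index 4, so the value of n says nothing about an upper bound on the start position.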

...

...
PS: Edi, if you are back, my previous post is still an open question ;-) (the one with FILTER...)

Yes, I'm back but unfortunately I'm very busy with commercial stuff right now. Sorry, filters will have to wait some more.

Cheers, Edi.

Here is what I've got right now (it's ok for my needs actually).

(defclass filter (regex)
  ((num :initarg :num
        :accessor num
        :type fixnum
        :documentation "The number of the register this filter refers to.")
   (predicate :initarg :predicate
              :accessor predicate
              :documentation "The predicate to validate the register with"))
  (:documentation "FILTER objects represent the combination of a register and a
predicate. This is not available in regex string, but only used in
parse tree."))

(defmethod create-matcher-aux ((filter filter) next-fn)
  (declare (type function next-fn))
  ;; the position of the corresponding REGISTER within the whole
  ;; regex; we start to count at 0
  (let ((num (num filter)))
    (lambda (start-pos)
      (declare (type fixnum start-pos))
      (let ((reg-start (svref *reg-starts* num))
            (reg-end (svref *reg-ends* num)))
        ;; only bother to check if the corresponding REGISTER has
        ;; matched successfully already
        (and reg-start
             (funcall (predicate filter) (subseq *string* reg-start reg-end))
             (funcall next-fn start-pos))))))

ADDED TO

(defun convert-aux (parse-tree)
  ...
  ;; (:FILTER <number> <predicate>)
  ((:filter)
   (let ((backref-number (second parse-tree))
         (predicate (third parse-tree)))
     (declare (type fixnum backref-number))
     (when (or (not (typep backref-number 'fixnum))
               (<= backref-number 0))
       (signal-ppcre-syntax-error "Illegal back-reference: ~S" parse-tree))
     (unless (or (typep predicate 'symbol)
                 (typep predicate 'function))
       (signal-ppcre-syntax-error "Illegal predicate: ~S" parse-tree))
     ;; stop accumulating into STARTS-WITH and increase
     ;; MAX-BACK-REF if necessary
     (setq accumulate-start-p nil
           max-back-ref (max (the fixnum max-back-ref) backref-number))
     (make-instance 'filter
                    ;; we start counting from 0 internally
                    :num (1- backref-number)
                    :predicate predicate)))

ADDED FOR MY PURPOSES...

(defmethod create-scanner-with-predicate ((regex-string string) predicate
                                          &key case-insensitive-mode
                                               multi-line-mode
                                               single-line-mode
                                               extended-mode
                                               destructive)
  (declare (optimize speed (safety 0) (space 0) (debug 0) (compilation-speed 0)
                     #+:lispworks (hcl:fixnum-safety 0)))
  (declare (ignore destructive))
  ;; parse the string into a parse-tree and then call CREATE-SCANNER again
  (let* ((*extended-mode-p* extended-mode)
         (quoted-regex-string (if *allow-quoting*
                                  (quote-sections (clean-comments regex-string extended-mode))
                                  regex-string))
         (*syntax-error-string* (copy-seq quoted-regex-string))
         (parse-tree (parse-string quoted-regex-string)))
    ;; wrap the result with FILTER to check for predicate
    (create-scanner `(:sequence (:register ,(shift-back-reference parse-tree))
                                (:filter 1 ,predicate))
                    :case-insensitive-mode case-insensitive-mode
                    :multi-line-mode multi-line-mode
                    :single-line-mode single-line-mode
                    :destructive t)))

(defun shift-back-reference (tree)
  (if (and (consp tree)
           (eq (first tree) :back-reference))
      `(:back-reference ,(1+ (second tree)))
      (if (atom tree)
          tree
          (cons (shift-back-reference (car tree))
                (shift-back-reference (cdr tree))))))

Edi Weitz

13 Oct 2004

11:05 p.m.

On Mon, 11 Oct 2004 21:35:41 +0200, Sébastien Saint-Sevin <seb-cl-mailist@matchix.com> wrote:

...

As far as I understand it, (?s:(?=.{n})) will only guarantee that at least n chars remain after match-start in the consumed string. This is not what I want. I want something that guarantees that match-start will be before index n (meaning the n'th char in the consumed string), whether match-end is before or after this index n.

Well, you could compute n from what you know but that would imply creating a new regular expression for each iteration which is probably not what you want.

...

Here is what I've got right now (it's ok for my needs actually).

I was actually thinking about a simpler version which was just a zero-length thingy that you could insert anywhere in your code and which would call a user-defined function. It'd be more efficient and I think you could still achieve with it what you want. I'll try to release something in the next days. Cheers, Edi.

Sébastien Saint-Sevin

14 Oct 2004

8:46 a.m.

...

...
As far as I understand it, (?s:(?=.{n})) will only guarantee that at least n chars remain after match-start in the consumed string. This is not what I want. I want something that guarantees that match-start will be before index n (meaning the n'th char in the consumed string), whether match-end is before or after this index n.

Well, you could compute n from what you know but that would imply creating a new regular expression for each iteration which is probably not what you want.

Exactly, in fact I need n to be a parameter of the engine, or a parameter of the compiled regex (like prepared SQL statements !).

...

...
Here is what I've got right now (it's ok for my needs actually).

I was actually thinking about a simpler version which was just a zero-length thingy that you could insert anywhere in your code and which would call a user-defined function. It'd be more efficient and I think you could still achieve with it what you want.

I'll try to release something in the next days.

Cheers, Edi.

I'm not sure I fully understand what you mean. I've coupled the filter with registers coz I plan to use it at multiple places in the regex.

Ex: If I use two dictionaries, I can say (regex strings in double quotes, no parse tree here):

(:sequence (:register "\b\w+\b")
           (:filter 1 check-dic1)
           " *[0-9]{5} *"
           (:register "\b\w+\b")
           (:filter 2 check-dic2))

This would match the full string that consists of two words that are in my dictionaries and that are separated by space(s)-fivedigits-space(s). Plus, I can extract the two elected values via registers.

Cheers, Sebastien.

Edi Weitz

12:53 p.m.

On Thu, 14 Oct 2004 10:46:56 +0200, Sébastien Saint-Sevin <seb-cl-mailist@matchix.com> wrote:

...

I'm not sure I fully understand what you mean. I've coupled the filter with registers coz I plan to use it at multiple places in the regex.

Yes, but the coupling with registers is costly if your filter doesn't use registers.

...

Ex: If I use two dictionaries, I can say (regex string in double quote (no parse tree here)).

(:sequence (:register "\b\w+\b") (:filter 1 check-dic1) " *[0-9]{5} *" (:register "\b\w+\b") (:filter 2 check-dic2))

This would match the full string that consists of two words that are in my dictionaries and that are separated by space(s)-fivedigits-space(s). Plus, I can extract via registers the two elected values.

I've just released a new version which implements a filter variant that should enable you to do this as well. These filters are also (hopefully) properly integrated into the optimization process. Thanks for urging me to do this... :) Cheers, Edi.

Sébastien Saint-Sevin

2:14 p.m.

FIRST POINT -----------

...

Thanks for urging me to do this... :)

Thanks for making it that quick. I will try it.

SECOND POINT ------------

...

Well, you could compute n from what you know but that would imply creating a new regular expression for each iteration which is probably not what you want.

Can you confirm that you see no other way of doing it? Right now in my code, I just do the full scan and throw the results away if the start was too far into the string. I've not tried compiling a new regex at each iteration, but I guess it would be slower.

Cheers, Sebastien.

Edi Weitz

2:22 p.m.

On Thu, 14 Oct 2004 16:14:27 +0200, Sébastien Saint-Sevin <seb-cl-mailist@matchix.com> wrote:

...

SECOND POINT ------------

...
Well, you could compute n from what you know but that would imply creating a new regular expression for each iteration which is probably not what you want.

Can you confirm that you see no other way of doing it? Right now in my code, I just do the full scan and throw the results away if the start was too far into the string. I've not tried compiling a new regex at each iteration, but I guess it would be slower.

With the new filter facility you should be able to create a filter which checks the current position against some special variable, say *MAX-START*. You can set *MAX-START* accordingly before each scan but the regular expression will only be compiled once because it doesn't change. Something like that should work, shouldn't it? Cheers, Edi.

Sébastien Saint-Sevin

3:01 p.m.

...

...
SECOND POINT ------------

...
Well, you could compute n from what you know but that would imply creating a new regular expression for each iteration which is probably not what you want.

Can you confirm that you see no other way of doing it? Right now in my code, I just do the full scan and throw the results away if the start was too far into the string. I've not tried compiling a new regex at each iteration, but I guess it would be slower.

With the new filter facility you should be able to create a filter which checks the current position against some special variable, say *MAX-START*. You can set *MAX-START* accordingly before each scan but the regular expression will only be compiled once because it doesn't change. Something like that should work, shouldn't it?

It should. I'll have to try it. How can I then abort the scan quickly, while avoiding funcalling the filter with the rest of the string ? Something like (setf *start-pos* end-of-string-value) ? Thanks a lot, you're so good ;-) Sebastien.

Edi Weitz

3:32 p.m.

On Thu, 14 Oct 2004 17:01:39 +0200, Sébastien Saint-Sevin <seb-cl-mailist@matchix.com> wrote:

...

How can I then abort the scan quickly, while avoiding funcalling the filter with the rest of the string ? Something like (setf *start-pos* end-of-string-value) ?

No, never change these internal values unless you're looking for trouble - see docs. Just return NIL from the filter. (I suppose you're talking about the 0.9.0 filters here.)

Something like

(defvar *max-start-pos* 0)

(defun my-filter (pos)
  (and (< pos *max-start-pos*) pos))

(scan '(:sequence ... (:filter my-filter 0) ...) target)

should assure that there's only a match if the position between the first ... and the second ... is below *MAX-START-POS*.

The zero is optional but it'll potentially help the regex engine to optimize the scanner depending on the rest of the parse tree. Here's an example for optimization:

* (defun my-filter (pos) (print "I was called") pos)
MY-FILTER
* (cl-ppcre:scan '(:sequence "fo" (:filter my-filter) "bar") "xyzfoobar")
"I was called"
NIL
* (cl-ppcre:scan '(:sequence "fo" (:filter my-filter 0) "bar") "xyzfoobar")
NIL

Note that in the second example the filter wasn't called at all because due to the zero-length declaration the regex engine was able to determine that the target string must end with "fobar" - which it didn't. In the first example this couldn't be done because there wasn't enough information available. You shouldn't lie to the regex engine, though... :)

...

Thanks a lot, you're so good ;-)

Nah... :) Cheers, Edi.
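
The special-variable approach from this message can be tried end to end. The following is a minimal sketch, not taken from the original messages: the names START-BEFORE-LIMIT and *WORD2-SCANNER* and the target string are made up for illustration, and Quicklisp is assumed for loading CL-PPCRE. The scanner is created once; only the binding of *MAX-START-POS* changes between scans.

```lisp
;; Assumption: Quicklisp is installed; any other way of loading CL-PPCRE works too.
(ql:quickload :cl-ppcre)

(defvar *max-start-pos* 0
  "A match is only accepted if it starts before this index.")

;; Hypothetical filter name; returns the position to continue, or NIL to fail.
(defun start-before-limit (pos)
  (and (< pos *max-start-pos*) pos))

;; Built once and reused for every scan.
(defparameter *word2-scanner*
  (cl-ppcre:create-scanner
   '(:sequence (:filter start-before-limit 0) "word2")))

;; Made-up target; "word2" starts at index 12.
(defparameter *target* "line1 word1 word2")

(let ((*max-start-pos* 10))
  (cl-ppcre:scan *word2-scanner* *target*))   ; => NIL

(let ((*max-start-pos* 13))
  (cl-ppcre:scan *word2-scanner* *target*))   ; => 12, 17, #(), #()
```

Rebinding the special variable with LET keeps the limit local to each call, which is one way of getting the "prepared statement" behaviour asked for earlier in the thread.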

Sébastien Saint-Sevin

4:18 p.m.

...

...
How can I then abort the scan quickly, while avoiding funcalling the filter with the rest of the string ? Something like (setf *start-pos* end-of-string-value) ?

No, never change these internal values unless you're looking for trouble - see docs. Just return NIL from the filter. (I suppose you're talking about the 0.9.0 filters here.)

Something like

(defvar *max-start-pos* 0)

(defun my-filter (pos) (and (< pos *max-start-pos*) pos))

(scan '(:sequence ... (:filter my-filter 0) ...) target)

should assure that there's only a match if the position between the first ... and the second ... is below *MAX-START-POS*.

The zero is optional but it'll potentially help the regex engine to optimize the scanner depending on the rest of the parse tree.

The majority of the regexes I'm using are unfortunately not optimizable.

Going back to my buffer. Let's say I'm looking at ten lines at a time. I want the start to occur only on the first line and I can do it with filters (that's great!). But the engine will still continue moving forward into the string for the nine remaining lines, and it will call my filter for each position in each line just to get nil every time.

So the question is how to force a full abort immediately rather than calling the filter so many times. In fact this applies to every filter that, once it has returned nil, will return nil forever (and is in a position in the parse tree where it can't be shadowed by some backtracking!).

I know it's an optimization problem but I'm running regexes on big files...

Cheers, Sebastien.
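
For concreteness, the buffered window described above might be built along these lines. This is only a sketch under assumptions: the thread never shows the actual buffer code, the lines are taken to be already available as a list of strings, and MAKE-WINDOW and MAP-WINDOWS are hypothetical names. It uses plain Common Lisp and leaves the scanning itself to the caller.

```lisp
(defun make-window (lines)
  "Join LINES with newlines. Returns the joined string and, as a second
value, the index at which the first line ends (the start limit)."
  (values (format nil "~{~A~^~%~}" lines)
          (length (first lines))))

(defun map-windows (fn lines window-size)
  "Call FN with each window string and its start limit, sliding the
window forward one line at a time."
  (loop for tail on lines
        do (multiple-value-bind (window limit)
               (make-window (subseq tail 0 (min window-size (length tail))))
             (funcall fn window limit))))

;; Usage: print each two-line window together with its start limit.
(map-windows (lambda (window limit)
               (format t "limit ~D: ~S~%" limit window))
             '("line1 word1 word2" "line2 word1 word2" "line3 word1 word2")
             2)
```

The second value plays the role of the :must-start-before index: a match that starts at or after it belongs to a later window.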

Edi Weitz

5:20 p.m.

On Thu, 14 Oct 2004 18:18:46 +0200, Sébastien Saint-Sevin <seb-cl-mailist@matchix.com> wrote:

...

...
(defvar *max-start-pos* 0)

(defun my-filter (pos) (and (< pos *max-start-pos*) pos))

(scan '(:sequence ... (:filter my-filter 0) ...) target)

...

The majority of regex I'm using are unfortunately not optimizable.

Are you sure?

...

Going back to my buffer. Let's say I'm looking at ten lines at a time. I want the start to occur only on the first line and I can do it with filters (that's great!). But the engine will still continue moving forward into the string for the nine remaining lines, and it will call my filter for each position in each line just to get nil every time.

I'm sorry but I still don't fully understand your problem. Could you give an example with actual code and data?

...

So the question is how to force a full abort immediately rather than calling the filter so many times. In fact this applies to every filter that, once it has returned nil, will return nil forever (and is in a position in the parse tree where it can't be shadowed by some backtracking!).

Are you using DO-SCANS or another loop construct? How about this?

(defvar *max-start-pos* 0)

(defvar *stop-immediately* nil)

(defun my-filter (pos)
  (cond ((< pos *max-start-pos*) pos)
        (t (setq *stop-immediately* t)
           nil)))

(let (*stop-immediately*)
  (do-scans (...)
    (when *stop-immediately*
      (return))
    ;;; your stuff here
    ))

So once *STOP-IMMEDIATELY* is set by your filter the loop will be instantly exited.

Cheers, Edi.

Sébastien Saint-Sevin

6:38 p.m.

...

...
Going back to my buffer. Let's say I'm looking at ten lines at a time. I want the start to occur only on the first line and I can do it with filters (that's great!). But the engine will still continue moving forward into the string for the nine remaining lines, and it will call my filter for each position in each line just to get nil every time.

I'm sorry but I still don't fully understand your problem. Could you give an example with actual code and data?

(defvar *my-string* "line1 word1 word2
line2 word1 word2
line3 word1 word2")

(defvar *my-scanner*
  '(:sequence (:filter my-filter 0)
              :WORD-BOUNDARY
              (:GREEDY-REPETITION 1 NIL :WORD-CHAR-CLASS)
              :WORD-BOUNDARY))

(let ((end-of-first-line 17))
  (defun my-filter (pos)
    (format t "Called at: ~A~%" pos)
    (and (< pos end-of-first-line) pos)))

CL-PPCRE 87 > (scan *my-scanner* *my-string*)
Called at: 0
0
5
#()
#()

==> OK, a match is found on the first line.

CL-PPCRE 88 > (setf *my-scanner*
                    '(:sequence (:filter my-filter 0)
                                :WORD-BOUNDARY
                                "line2"
                                (:GREEDY-REPETITION 1 NIL :WORD-CHAR-CLASS)
                                :WORD-BOUNDARY))
(:SEQUENCE (:FILTER MY-FILTER) :WORD-BOUNDARY "line2" (:GREEDY-REPETITION 1 NIL :WORD-CHAR-CLASS) :WORD-BOUNDARY)

CL-PPCRE 89 > (scan *my-scanner* *my-string*)
Called at: 0
Called at: 1
Called at: 2
Called at: 3
Called at: 4
Called at: 5
Called at: 6
Called at: 7
Called at: 8
Called at: 9
Called at: 10
Called at: 11
Called at: 12
Called at: 13
Called at: 14
Called at: 15
Called at: 16
Called at: 17
Called at: 18
Called at: 19
Called at: 20
Called at: 21
Called at: 22
Called at: 23
Called at: 24
Called at: 25
Called at: 26
Called at: 27
Called at: 28
Called at: 29
Called at: 30
Called at: 31
Called at: 32
Called at: 33
Called at: 34
Called at: 35
Called at: 36
Called at: 37
Called at: 38
Called at: 39
Called at: 40
Called at: 41
Called at: 42
Called at: 43
Called at: 44
Called at: 45
Called at: 46
Called at: 47
NIL

==> Here is the trouble: how to make the match abort when position 17 is reached. Coz from there, the filter will always return nil. So the last 30 calls are wasted time.

...

...
So the question is how to force a full abort immediately rather than calling the filter so many times. In fact this applies to every filter that, once it has returned nil, will return nil forever (and is in a position in the parse tree where it can't be shadowed by some backtracking!).

Are you using DO-SCANS or another loop construct? How about this?

No. I think the loop I'm speaking about is created by "insert-advance-fn" & "create-scanner-aux" (though I don't understand all the details yet...)

Last point, I can't access the position where the match actually has started (the first of the four values returned by scan), so I have no way to extract the current global match without using a register.

Cheers, Sebastien.

Edi Weitz

9:11 p.m.

On Thu, 14 Oct 2004 20:38:17 +0200, Sébastien Saint-Sevin <seb-cl-mailist@matchix.com> wrote:

...

==> Here is the trouble: how to make the match abort when position 17 is reached. Coz from there, the filter will always return nil. So the last 30 calls are wasted time.

Well, this is Common Lisp...

CL-USER> (defvar *my-string* "line1 word1 word2
line2 word1 word2
line3 word1 word2")
*MY-STRING*
CL-USER> (defvar *my-scanner*
           '(:sequence (:filter my-filter 0)
                       :word-boundary
                       (:greedy-repetition 1 nil :word-char-class)
                       :word-boundary))
*MY-SCANNER*
CL-USER> (let ((end-of-first-line 17))
           (defun my-filter (pos)
             (format t "Called at: ~A~%" pos)
             (cond ((< pos end-of-first-line) pos)
                   (t (throw 'stop-it nil)))))
; Converted MY-FILTER.
MY-FILTER
CL-USER> (catch 'stop-it (scan *my-scanner* *my-string*))
Called at: 0
0
5
#()
#()
CL-USER> (setf *my-scanner*
               '(:sequence (:filter my-filter 0)
                           :word-boundary
                           "line2"
                           (:greedy-repetition 1 nil :word-char-class)
                           :word-boundary))
(:SEQUENCE (:FILTER MY-FILTER 0) :WORD-BOUNDARY "line2" (:GREEDY-REPETITION 1 NIL :WORD-CHAR-CLASS) :WORD-BOUNDARY)
CL-USER> (catch 'stop-it (scan *my-scanner* *my-string*))
Called at: 0
Called at: 1
Called at: 2
Called at: 3
Called at: 4
Called at: 5
Called at: 6
Called at: 7
Called at: 8
Called at: 9
Called at: 10
Called at: 11
Called at: 12
Called at: 13
Called at: 14
Called at: 15
Called at: 16
Called at: 17
NIL
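
The CATCH/THROW pattern in the transcript above could be combined with the special-variable limit and wrapped in a small helper. This is a sketch under assumptions, not code from the thread: LIMIT-START, SCAN-STARTING-BEFORE, the catch tag START-LIMIT-REACHED and the sample scanner are hypothetical names, and Quicklisp is assumed for loading CL-PPCRE. The THROW is only appropriate when the filter sits at the very beginning of the regex, so that every later call would fail as well.

```lisp
;; Assumption: Quicklisp is installed; any other way of loading CL-PPCRE works too.
(ql:quickload :cl-ppcre)

(defvar *max-start-pos* 0
  "A match is only accepted if it starts before this index.")

(defun limit-start (pos)
  "Filter: continue before the limit, otherwise abandon the whole scan."
  (if (< pos *max-start-pos*)
      pos
      (throw 'start-limit-reached nil)))

(defun scan-starting-before (limit scanner target)
  "Like CL-PPCRE:SCAN, but returns NIL as soon as LIMIT-START gives up.
SCANNER is expected to begin with (:filter limit-start 0)."
  (let ((*max-start-pos* limit))
    (catch 'start-limit-reached
      (cl-ppcre:scan scanner target))))

;; Sample scanner, created once and reused with different limits.
(defparameter *line2-scanner*
  (cl-ppcre:create-scanner
   '(:sequence (:filter limit-start 0) "line2")))

;; Two lines of 17 characters each; "line2" starts at index 18.
(defparameter *two-lines* "line1 word1 word2
line2 word1 word2")

(scan-starting-before 17 *line2-scanner* *two-lines*)   ; => NIL
(scan-starting-before 19 *line2-scanner* *two-lines*)   ; => 18, 23, #(), #()
```

Because CATCH passes multiple values through, a successful scan still returns the usual four values of SCAN.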

...

I think the loop I'm speaking about is created by "insert-advance-fn"

Yes. It's the normal loop that advances through the regular expression.

...

Last point, I can't access the position where the match actually has started (the first of the four values returned by scan), so I have no way to extract the current global match without using a register.

Sure you can:

CL-USER> (let (match-start)
           (defun set-match-start (pos)
             (setq match-start pos))
           (defun show-match-start (pos)
             (format t "Match start is ~A, pos is ~A~%" match-start pos)
             pos))
; Converted SET-MATCH-START.
; Converted SHOW-MATCH-START.
SHOW-MATCH-START
CL-USER> (setf *my-scanner*
               '(:sequence (:filter set-match-start 0)
                           "abc"
                           (:filter show-match-start 0)
                           (:alternation #\x #\y)))
(:SEQUENCE (:FILTER SET-MATCH-START 0) "abc" (:FILTER SHOW-MATCH-START 0) (:ALTERNATION #\x #\y))
CL-USER> (scan *my-scanner* "abczabcabcx")
Match start is 0, pos is 3
Match start is 4, pos is 7
Match start is 7, pos is 10
7
11
#()
#()

Just make sure SET-MATCH-START is at the very beginning of your regular expression and not within a group or alternation or somesuch.

Cheers, Edi.

Sébastien Saint-Sevin

9:34 p.m.

...

...
==> Here is the trouble: how to make the match abort when position 17 is reached. Coz from there, the filter will always return nil. So the last 30 calls are wasted time.

Well, this is Common Lisp...

Throw & Catch, of course. I'm just not very familiar with this kind of big jumps. I should !!!!

...

...
I think the loop I'm speaking about is created by "insert-advance-fn"

Yes. It's the normal loop that advances through the regular expression.

...
Last point, I can't access the position where the match actually has started (the first of the four values returned by scan), so I have no way to extract the current global match without using a register.

Sure you can:

Just make sure SET-MATCH-START is at the very beginning of your regular expression and not within a group or alternation or somesuch.

It just adds a little work to craft the parse tree but that's OK. It seems that filters are really powerful!!!

I've got everything I need for now. I will try all that & will give you some feedback when it's done in a few days.

Finally, I just want to thank you very much, Edi, for all your help & work.

Cheers, Sebastien.

Edi Weitz

10 p.m.

On Thu, 14 Oct 2004 23:34:30 +0200, Sébastien Saint-Sevin <seb-cl-mailist@matchix.com> wrote:

...

Throw & Catch, of course. I'm just not very familiar with this kind of big jumps. I should !!!!

BTW, I don't know if these were just examples or if they're actually related to your real problem but in this specific case there's certainly room for improvement:

CL-USER> (defvar *my-string* "line1 word1 word2
line2 word1 word2
line3 word1 word2")
*MY-STRING*
CL-USER> (let ((end-of-first-line 17))
           (defun my-filter (pos)
             (format t "Called at: ~A~%" pos)
             (cond ((< pos end-of-first-line) pos)
                   (t (throw 'stop-it nil)))))
MY-FILTER
CL-USER> (defvar *my-scanner*
           '(:sequence (:filter my-filter 0)
                       :word-boundary
                       "line2"
                       (:greedy-repetition 1 nil :word-char-class)
                       :word-boundary))
*MY-SCANNER*
CL-USER> (catch 'stop-it (scan *my-scanner* *my-string*))
Called at: 0
Called at: 1
Called at: 2
Called at: 3
Called at: 4
Called at: 5
Called at: 6
Called at: 7
Called at: 8
Called at: 9
Called at: 10
Called at: 11
Called at: 12
Called at: 13
Called at: 14
Called at: 15
Called at: 16
Called at: 17
NIL
CL-USER> (setf *my-scanner*
               '(:sequence :multi-line-mode-p
                           :start-anchor
                           (:filter my-filter 0)
                           :word-boundary
                           "line2"
                           (:greedy-repetition 1 nil :word-char-class)
                           :word-boundary))
(:SEQUENCE :MULTI-LINE-MODE-P :START-ANCHOR (:FILTER MY-FILTER 0) :WORD-BOUNDARY "line2" (:GREEDY-REPETITION 1 NIL :WORD-CHAR-CLASS) :WORD-BOUNDARY)
CL-USER> (catch 'stop-it (scan *my-scanner* *my-string*))
Called at: 0
Called at: 18
NIL

Not sure if that's relevant, though.

Cheers, Edi.

Sébastien Saint-Sevin

10:24 p.m.

...

...
Throw & Catch, of course. I'm just not very familiar with this kind of big jumps. I should !!!!

BTW, I don't know if these were just examples or if they're actually related to your real problem but in this specific case there's certainly room for improvement:

Not sure if that's relevant, though.

It was just an example. I chose "line2" as the first word that wasn't in line1 so that the match fails. Usually, I use as many anchors as I can in the regex coz it considerably decreases the amount of backtracking with quantifiers of all kinds.

Cheers, Sebastien.
