Hello,
This is my first post to this list. I have looked through the archives (searching on "greedy" and some other terms) but didn't find anything that seems to relate to my problem, so I'm writing to see if anyone else on the list has encountered something like this. There is something I don't understand about the "." operator, especially the "non-greedy" version of it, and in particular its behaviour when it is followed by an optional parenthesized term.
Below are the pattern and a sample target string, followed by my comments on the results I get from Regex Coach. I ran the pattern with "i" checked.
Pattern: ^\s*An appeal.+?(Joined )?Cases? ?t ?[-] ?\d{1,3}/ ?\d{2}.+?(between)?
Target string: An appeal against the judgment delivered on 15 January 2003 by the Second Chamber (Extended Composition) of the Court of First Instance of the European Communities in joined cases T-377/00 (1), T-379/00 (2), T-380/00 (2), T-260/01 (3) and T-272/01 (4) between Philip Morris International, Inc., R.J. Reynolds Tobacco Holdings, Inc., RJR Acquisition Corp., R.J. Reynolds Tobacco Company, R.J. Reynolds Tobacco International Inc., and Japan Tobacco, Inc., and Commission of the European Communities, supported by European Parliament, Kingdom of Spain, French Republic, Italian Republic, Portuguese Republic, Republic of Finland, Federal Republic of Germany, Hellenic Republic, Kingdom of the Netherlands, was brought before the Court of Justice of the European Communities on 25 March 2003 by R.J. Reynolds Tobacco Holdings, Inc., established in Winston-Salem, North Carolina (United States), RJR Acquisition Corp., established in Wilmington, Delaware (United States), R.J. Reynolds Tobacco Company, established in Winston-Salem, North Carolina (United States), R.J. Reynolds Tobacco International Inc., established in Winston-Salem, North Carolina (United States) and Japan Tobacco, Inc., established in Tokyo (Japan), represented by O.W. Brouwer, lawyer, and P. Lomas, solicitor.
What I want it to do is match the string from the beginning through "between", and when there is no instance of "between", I want it to match the entire string.
I would expect the example above to give me a match on 0-259 (i.e. through "between"). But instead I get a match only on 0-189 (through the first case number). This makes no sense to me whatsoever. I would consider it a bug, but Regex Coach and Perl v5.8.3 on FreeBSD give me the same results.
^\s*An appeal.+?(Joined )?Cases? ?t ?[-] ?\d{1,3}/ ?\d{2}.+(between)? gives me a match on 0-1279 (the whole string). Why doesn't it stop when it finds "between"?
^\s*An appeal.+?(Joined )?Cases? ?t ?[-] ?\d{1,3}/ ?\d{2}.+?(between) gives me the match I expect, 0-259.
^\s*An appeal.+?(Joined )?Cases? ?t ?[-] ?\d{1,3}/ ?\d{2}.+(between) also gives me the match I expect, 0-259.
But if I make the "(between)" optional, by putting a "?" after it,
- the regex engine doesn't stop there when the ".+" is greedy, and
- the regex engine doesn't find "between" when the ".+" is non-greedy, i.e. ".+?".
Can anyone enlighten me?
Many thanks, John Clements
On Sun, 22 Aug 2004 14:04:47 +0100, John Clements johnjc-regex@publicinfo.net wrote:
> I ran the pattern with "i" checked.
I guess you also had "s" checked because your target string contained line breaks.
> What I want it to do is match the string from the beginning through "between", and when there is no instance of "between", I want it to match the entire string.
This regex should work:
^\s*An appeal.+?(Joined )?Cases? ?t ?[-] ?\d{1,3}/ ?\d{2}(.+?between|.*)
The behaviour you saw was right. (As a rule of thumb Regex Coach is always right as long as it does the same as Perl... :)
You had ".+?(between)?" which meant "match as few characters as possible up to ..." where ... was "the string 'between' OR ANYTHING" because you made 'between' optional, i.e. your regex was effectively equivalent to ".+?". So, the regex engine matched exactly one character there (the minimum that ".+?" allows) and stopped.
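The explanation above can be checked with a small sketch. It uses Python's `re` module as a stand-in for Perl and Regex Coach (an assumption, though lazy and greedy quantifiers backtrack the same way for these patterns), and `TEXT` is a shortened, made-up version of the target string; the helper name is likewise only illustrative.

```python
import re

# Shortened, made-up stand-in for the target string in the original post.
TEXT = ("An appeal against the judgment in joined cases T-377/00 (1), "
        "T-379/00 (2) between Philip Morris and Commission, supported by Spain.")

# Everything up to and including the first case number, as in the thread.
PREFIX = r"^\s*An appeal.+?(Joined )?Cases? ?t ?[-] ?\d{1,3}/ ?\d{2}"
FLAGS = re.IGNORECASE | re.DOTALL  # roughly the "i" and "s" checkboxes

def tail_after_case_number(suffix):
    """Return (text matched after the case number, the optional group)."""
    case_end = re.search(PREFIX, TEXT, FLAGS).end()
    m = re.search(PREFIX + suffix, TEXT, FLAGS)
    return m.group(0)[case_end:], m.group(2)

# Lazy: ".+?" takes one character, then "(between)?" is satisfied by nothing.
lazy_tail, lazy_group = tail_after_case_number(r".+?(between)?")
# Greedy: ".+" runs to the end, then "(between)?" is satisfied by nothing.
greedy_tail, greedy_group = tail_after_case_number(r".+(between)?")

print(repr(lazy_tail), lazy_group)         # ' ' None
print(greedy_tail == TEXT[TEXT.index(" (1)"):], greedy_group)  # True None
```

In both cases the optional group stays unmatched: nothing forces the engine to find "between", so the quantifier in front of it alone decides where the match ends.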
Does that help?
Cheers, Edi.
That is absolutely brilliant, Edi! Thank you so much!
At 14:55 22/08/04, you wrote:
> On Sun, 22 Aug 2004 14:04:47 +0100, John Clements johnjc-regex@publicinfo.net wrote:
>> I ran the pattern with "i" checked.
> I guess you also had "s" checked because your target string contained line breaks.
If it had line breaks they were introduced by the mailer(s) because Regex Coach didn't show any.
>> What I want it to do is match the string from the beginning through "between", and when there is no instance of "between", I want it to match the entire string.
> This regex should work:
> ^\s*An appeal.+?(Joined )?Cases? ?t ?[-] ?\d{1,3}/ ?\d{2}(.+?between|.*)
I was just looking over some tutorial material which was talking about what you enclose in parentheses and what not, and it hadn't dawned on me that it was relevant to my problem!
Yes, putting the ".+?" inside the parentheses does the trick. And the "|.*" makes perfect sense. It says directly: "or the rest of the string".
I had settled for a solution that used the "greedy" version of ".+" before "between", which in the presence of a second instance of the word "between" would have brought in unwanted text. Now it's just right. I really appreciate this!
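A minimal sketch of the difference described here, again assuming Python's `re` as a stand-in for Perl and using made-up sample strings: the lazy form stops at the first "between", the greedy form runs on to the last one, and the "|.*" branch takes the rest of the string when there is no "between" at all.

```python
import re

PREFIX = r"^\s*An appeal.+?(Joined )?Cases? ?t ?[-] ?\d{1,3}/ ?\d{2}"
FLAGS = re.IGNORECASE | re.DOTALL

# The regex suggested in the thread, and the greedy variant settled for earlier.
FIXED = re.compile(PREFIX + r"(.+?between|.*)", FLAGS)
GREEDY = re.compile(PREFIX + r"(.+between|.*)", FLAGS)

# Made-up sample strings, not the real target text.
two_betweens = ("An appeal in joined cases T-377/00 (1) between A and B, "
                "heard between March and May.")
no_between = "An appeal in case T-260/01 (3) was brought before the Court."

lazy_match = FIXED.match(two_betweens).group(0)
greedy_match = GREEDY.match(two_betweens).group(0)
fallback_match = FIXED.match(no_between).group(0)

print(lazy_match)      # stops at the first "between"
print(greedy_match)    # runs on to the second "between"
print(fallback_match)  # no "between": the whole string
```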
> The behaviour you saw was right. (As a rule of thumb Regex Coach is always right as long as it does the same as Perl... :)
Yeah, that's what I thought, too. :)
> You had ".+?(between)?" which meant "match as few characters as possible up to ..." where ... was "the string 'between' OR ANYTHING" because you made 'between' optional, i.e. your regex was effectively equivalent to ".+?". So, the regex engine matched exactly one character there (the minimum that ".+?" allows) and stopped.
> Does that help?
Indeed, indeed! Thanks for that explanation, too. I need to see the logic of something to really absorb it. I had accepted what I saw as the limitation of the regex engine but without understanding its logic hadn't worked out how to get that refinement that I needed.
All the best, John
On Sun, 22 Aug 2004 17:37:36 +0100, John Clements johnjc-regex@publicinfo.net wrote:
> ^\s*An appeal.+?(Joined )?Cases? ?t ?[-] ?\d{1,3}/ ?\d{2}(.+?between|.*)
> Yes, putting the ".+?" inside the parentheses does the trick. And the "|.*" makes perfect sense. It says directly: "or the rest of the string".
What I forgot to say: Note that the order is important. This regex
^\s*An appeal.+?(Joined )?Cases? ?t ?[-] ?\d{1,3}/ ?\d{2}(.*|.+?between)
won't work because the engine will try the "rest of the string" first and will succeed, so it will stop.
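The point about ordering can be illustrated with a short sketch, assuming Python's `re` (which, like Perl, tries alternatives left to right and keeps the first one that lets the overall match succeed). The sample string is made up.

```python
import re

PREFIX = r"^\s*An appeal.+?(Joined )?Cases? ?t ?[-] ?\d{1,3}/ ?\d{2}"
FLAGS = re.IGNORECASE | re.DOTALL

# Made-up sample string containing one "between".
TEXT = "An appeal in joined cases T-377/00 (1) between A and B."

# "up to 'between'" first, "rest of the string" as the fallback.
good = re.search(PREFIX + r"(.+?between|.*)", TEXT, FLAGS)
# Same alternatives, swapped: ".*" always succeeds, so the second is never tried.
bad = re.search(PREFIX + r"(.*|.+?between)", TEXT, FLAGS)

print(good.group(0))  # An appeal in joined cases T-377/00 (1) between
print(bad.group(0))   # An appeal in joined cases T-377/00 (1) between A and B.
```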
Cheers, Edi.