The "|m" was just in there for the example. I'm actually trying to match HTML tags with small contents inside, like: <tag> V </tag> <tag> I </tag> <tag> A </tag> <tag> G </tag> <tag> R </tag> <tag> A </tag>
with an expression similar to "/(>.{1,4}<[\S\s]*){4}/", and noticed the behaviour when I changed the ".{1,4}" to "[.^<]{1,4}". I did notice that "\s" and "\S" are matched inside a character class, which is what I think led me to assume other meta-characters would be too. I've only been using regex for a short while, so I am stumbling along. Thanks for the man page. I'm reading it now. Any other advice you could give for this expression would be great.
Thanks, Bryn
----- Original Message -----
From: Edi Weitz edi@agharta.de
To: sites@brynmosher.com
Cc: regex-coach@common-lisp.net
Date: Wed, 26 Apr 2006 23:17:25 +0200
Subject: Re: [regex-coach] problem with dot '.' inside brackets
On Wed, 26 Apr 2006 12:44:30 -0700, sites@brynmosher.com wrote:
I've been using Regex-Coach 0.8.4 on Windows to test some SpamAssassin rules and noticed something odd:
Placing the following expression: [.|m]
to match the following data: bleh.com
Matches the '.' in bleh.com and not the first non-linefeed character, as the '.' character in the expression should match. It's almost as if I had escaped the '.' like '\.'. Using the expression '[.]' yields the same result. I've also noticed that the non-match character '^' doesn't work inside brackets either.
Is this an error or am I crazy?
Well, at least it's not an error... :)
Most characters that have a special meaning in regular expressions (like the dot or the pipe symbol, for example) are treated like normal characters within character classes, i.e. within square brackets.
See 'man perlre' for details.
BTW, it seems that your understanding of character classes as a whole is wrong. If the dot /would/ match every non-linefeed character, then "[.|m]" would be equivalent to "[.]".
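A minimal sketch of the point above, using Python's re module as a stand-in (an assumption: Regex Coach uses its own Perl-compatible engine, but the character-class rule shown here is the same in both). The sample strings are only illustrative.

```python
import re

data = "bleh.com"

# Outside a character class, the dot matches any character except a newline,
# so the first match is the very first character of the data.
first_any = re.search(r".", data)

# Inside a character class, '.', '|' and 'm' are ordinary literal characters,
# so the first match is the literal dot in "bleh.com".
first_in_class = re.search(r"[.|m]", data)

print(first_any.group(), first_any.start())            # b 0
print(first_in_class.group(), first_in_class.start())  # . 4

# The class is simply the set {'.', '|', 'm'}; there is no alternation in it.
members = re.findall(r"[.|m]", "a|b.m")
print(members)                                         # ['|', '.', 'm']
```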
Cheers, Edi.
On Wed, 26 Apr 2006 14:41:09 -0700, sites@brynmosher.com said:
I'm actually trying to match html tags with small contents inside like; <tag> V </tag> with the expression similar to "/(>.{1,4}<[\S\s]*){4}/" and noticed the behaviour when I changed the ".{1,4}" to "[.^<]{1,4}".
This also explains your comment about ^ not working. It needs to be first in the character class, like [^.<]
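To illustrate the position rule for ^, here is a small sketch in Python's re module (assumed here to behave like the Perl-compatible syntax for this case; the sample text is made up for the demonstration).

```python
import re

text = "a.^<b"

# '^' is not the first character, so it is just another literal member:
# the class is the set {'.', '^', '<'}.
literal_set = re.findall(r"[.^<]", text)
print(literal_set)   # ['.', '^', '<']

# '^' comes first, so the class is negated:
# it matches any character except '.' and '<'.
negated_set = re.findall(r"[^.<]", text)
print(negated_set)   # ['a', '^', 'b']
```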
But actually, I'm guessing your real problem is with the greediness of the * operator, which would skip over as much as possible, and that is why you have artificially constrained it with the {1,4} to only match a few characters.
What you're actually looking for, then, is "a > followed by anything except <", i.e. >[^<]*, yielding /(>[^<]*<[^>]*){4}/ ... or even, in Perl-compatible regular expressions, the non-greedy *?, but that is a bit hard to apply here without more knowledge of what you are actually trying to match. (I still don't understand the significance of the final {4}, for example. Or maybe you were meaning to say <.{1,4} but applying the repeat to the wrong scope?)
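As a rough comparison of the two expressions, the sketch below runs both against the sample data from the original question, again using Python's re module as an assumed stand-in for a Perl-compatible engine. It only shows how far each match extends; it does not settle what the final {4} was meant to do.

```python
import re

sample = ("<tag> V </tag> <tag> I </tag> <tag> A </tag> "
          "<tag> G </tag> <tag> R </tag> <tag> A </tag>")

original = r"(>.{1,4}<[\S\s]*){4}"   # greedy [\S\s]* can swallow the rest of the input
suggested = r"(>[^<]*<[^>]*){4}"     # negated classes stop at the next '<' or '>'

greedy_match = re.search(original, sample)
bounded_match = re.search(suggested, sample)

# The greedy version runs all the way to the end of the data.
print(greedy_match.end() == len(sample))   # True

# The negated-class version stops after exactly four '>...<...' pieces.
print(bounded_match.group())               # > V </tag> <tag> I </tag> <tag
```

Note that four repetitions cover only two of the tag pairs, because each "> ... <" piece spans either a tag's contents or the gap between tags.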
Hope this helps,
/* era */