Hi,
I'm using CL-PPCRE to develop a character-at-a-time lexer. This is causing me some perplexity, though, with regexes like the common notation for hexadecimal literals:
"^0(?:x[0-9A-Fa-f]+)?$"
This should match both the string "0", between positions 0 and 1, as just a bare literal zero, and should also match things like "0xa6" between positions 0 and 4, but should not match simply "0x". But I want the longest match possible, so (for example) I'd like to know that while "0x" didn't match, parts of the regex *did* match and might produce a "real" match depending on what comes after "x".
So, in succession, if the input is "0xa6 ", my scanner gets called thus:
1. Input: "0".
   a) A match.
   b) But it *could* possibly match more, depending on what comes next.
2. Input: "0x".
   a) Not a match.
   b) But, once again, the possibility exists that more input could still produce a longer match than "0".
3. Input: "0xa".
   a) A match.
   b) Because of the "+" attached to the character class, a longer match is still possible.
4. Input: "0xa6".
   a) A match.
   b) As above.
5. Input: "0xa6 ".
   a) Not a match.
   b) Will *never* match no matter how much more input you add to it.
CL-PPCRE just tells me a), and I also want to know b). Is there any way to get this information (if it even exists) out of the scanner?
TIA,
-Dan
-- 
Dan Debertin | airboss@nodewarrior.org | www.nodewarrior.org
On Wed, 26 Apr 2006 15:46:35 -0500, Dan Debertin airboss@nodewarrior.org wrote:
> CL-PPCRE just tells me a), and I also want to know b). Is there any way to get this information (if it even exists) out of the scanner?
If I understand you correctly, this information isn't available. The regex engine doesn't make any plans about what might have happened if it had scanned another target string instead of the current one, so to say... :)
You might want to look at filters, though:
http://weitz.de/cl-ppcre/#filters
But generally, to me your problem looks like one where regular expressions aren't the right tool.
Cheers, Edi.
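One way to read the suggestion that regular expressions aren't the right way here is to write the state machine by hand, so that both a) and b) fall out of the current state. What follows is only a minimal sketch in plain Common Lisp; HEX-DIGIT-P, NEXT-STATE and CLASSIFY are hypothetical names made up for this illustration and are not part of CL-PPCRE. It assumes the language is exactly the one described by "^0(?:x[0-9A-Fa-f]+)?$".

```lisp
;;; Hypothetical helpers, not part of CL-PPCRE: a hand-written state machine
;;; for the language of "^0(?:x[0-9A-Fa-f]+)?$".

(defun hex-digit-p (char)
  "True if CHAR is one of 0-9, A-F, a-f."
  (find char "0123456789ABCDEFabcdef"))

(defun next-state (state char)
  "Return the state reached from STATE on CHAR.
:DEAD means no amount of further input can produce a match."
  (case state
    (:start (if (char= char #\0) :zero :dead))   ; expect the leading 0
    (:zero  (if (char= char #\x) :x :dead))      ; optional x after 0
    (:x     (if (hex-digit-p char) :hex :dead))  ; at least one hex digit
    (:hex   (if (hex-digit-p char) :hex :dead))  ; any number of further digits
    (t      :dead)))                             ; :DEAD stays :DEAD

(defun classify (input)
  "Return two values for the string INPUT:
a) true if INPUT as a whole is a match,
b) true if appending more characters could still yield a match."
  (let ((state :start))
    (loop for char across input
          do (setf state (next-state state char)))
    (values (and (member state '(:zero :hex)) t)
            (not (eq state :dead)))))
```

With these definitions, (classify "0x") returns NIL and T (not a match, but still viable), while (classify "0xa6 ") returns NIL and NIL. For this particular language every state other than :DEAD can be extended to a longer match, which is why b) is simply "not dead"; other languages may need a separate set of extendable states. A character-at-a-time lexer could also keep the state between calls and feed NEXT-STATE one character at a time instead of rescanning the whole input.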
cl-ppcre-devel@common-lisp.net