Hi Daniel!
On Sat, 12 Jun 2004 15:54:11 +0200, Daniel Skarda <0rfelyus(a)ucw.cz> wrote:
> today I explored the possibilities of regular expression
> implementations in various Debian Common Lisp packages. I really
> liked your library - thank you for writing cl-ppcre.
You're welcome.
> I also looked into the elegant cl-lexer package built on top of the
> cl-regex library. What I missed in cl-ppcre is a parse-tree node
> similar to cl-regex's 'success node, which defines the return value of
> the match/scan functions. With a 'success node one can build a `deflexer'
> macro on top of cl-ppcre as easily as on top of the cl-regex package.
>
> Is it possible to extend cl-ppcre with a similar feature?
I might look into this for a future version, but see below.
> Footnote: In cl-lexer, the deflexer macro
>
>   (deflexer foo
>     ("regexp" some action)            ; 0
>     ("another regexp" another action) ; 1
>     ...)
>
> numbers each pair of regexp and action, then combines the regexp parse
> trees into one big parse tree
>
>   `(alt
>      (seq (regexp tree) (success 0))
>      (seq (another regexp tree) (success 1))
>      ...)
>
> and uses the return value from the match (i.e. the regexp's serial
> number) to select the action associated with the matching regexp.
I've recently written demo code like this for another CL-PPCRE user
who also wanted to build a lexer:
(in-package :cl-user)

(eval-when (:compile-toplevel :load-toplevel :execute)
  (defmacro with-unique-names ((&rest bindings) &body body)
    ;; see <http://www.cliki.net/Common%20Lisp%20Utilities>
    `(let ,(mapcar #'(lambda (binding)
                       (check-type binding (or cons symbol))
                       (if (consp binding)
                           (destructuring-bind (var x) binding
                             (check-type var symbol)
                             `(,var (gensym ,(etypecase x
                                               (symbol (symbol-name x))
                                               (character (string x))
                                               (string x)))))
                           `(,binding (gensym ,(symbol-name binding)))))
                   bindings)
       ,@body)))

(defmacro deflexer (name &body body)
  (with-unique-names (regex-table regex token sexpr-regex anchored-regex string start scanner next-pos)
    `(let ((,regex-table
             (loop for (,regex . ,token) in (list ,@(loop for (regex token) in body
                                                          collect `(cons ,regex ,token)))
                   for ,sexpr-regex = (etypecase ,regex
                                        (function
                                         (error "Compiled scanners are not allowed here"))
                                        (string
                                         (cl-ppcre::parse-string ,regex))
                                        (list
                                         ,regex))
                   for ,anchored-regex = (cl-ppcre:create-scanner `(:sequence
                                                                    :modeless-start-anchor
                                                                    ,,sexpr-regex))
                   collect (cons ,anchored-regex ,token))))
       (defun ,name (,string &key ((:start ,start) 0))
         (loop for (,scanner . ,token) in ,regex-table
               for ,next-pos = (nth-value 1 (cl-ppcre:scan ,scanner ,string :start ,start))
               when ,next-pos do (return (values ,token ,next-pos)))))))
You should be able to use it like this:
* (deflexer mylexer
    ("'.*'" 'string)
    ("#.*$" 'comment)
    ("[ \\t\\r\\f]+" 'ws) ; doubled: the Lisp reader turns "\t" into a plain "t"
    (":=" 'assign)
    ("[\[]" 'lbrack)
    ("[\]]" 'rbrack)
    ("[\,]" 'comma)
    ("[\:]" 'colon)
    ("[\;]" 'semicolon)
    ("[+-]?[0-9]*[\.][0-9]+([eE][+-]?[0-9]+)?" 'float)
    ("[+-]?[0-9]+" 'integer)
    ("[a-zA-Z0-9_]+" 'id)
    ("." 'unknown))
; Converted MYLEXER.
MYLEXER
* (mylexer "a:=123.4?")
ID
1
* (mylexer "a:=123.4?" :start 1)
ASSIGN
3
* (mylexer "a:=123.4?" :start 3)
FLOAT
8
* (mylexer "a:=123.4?" :start 8)
UNKNOWN
9
This one only returns tokens but it should be trivial to change the
macro such that the newly-defined lexer invokes functions
instead. Wouldn't that already do what you want? I'm not sure what the
approach you sketched above would buy you compared to this one.
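A minimal sketch of that change, assuming each clause pairs a regex
string with a function of the matched text. DEFLEXER-FN, MYLEXER-FN and
the calling convention are hypothetical, not part of the demo above or
of CL-PPCRE, and the sketch assumes CL-PPCRE can be loaded through ASDF:

```lisp
(require :asdf)
(asdf:load-system :cl-ppcre)

;; Hypothetical variant of DEFLEXER: each clause is (regex-string function).
;; The generated lexer calls the function on the matched substring and
;; returns its value together with the position after the match.
(defmacro deflexer-fn (name &body clauses)
  (let ((table (gensym "TABLE"))
        (string (gensym "STRING"))
        (start (gensym "START"))
        (scanner (gensym "SCANNER"))
        (fn (gensym "FN"))
        (match-start (gensym "MATCH-START"))
        (match-end (gensym "MATCH-END")))
    `(let ((,table
             (list ,@(loop for (regex action) in clauses
                           ;; PARSE-STRING is reached with :: as in the demo,
                           ;; so it works whether or not it is exported.
                           collect `(cons (cl-ppcre:create-scanner
                                           (list :sequence
                                                 :modeless-start-anchor
                                                 (cl-ppcre::parse-string ,regex)))
                                          ,action)))))
       (defun ,name (,string &key ((:start ,start) 0))
         ;; Try the clauses in order; the first one matching at START wins.
         (loop for (,scanner . ,fn) in ,table
               do (multiple-value-bind (,match-start ,match-end)
                      (cl-ppcre:scan ,scanner ,string :start ,start)
                    (when ,match-start
                      (return (values (funcall ,fn (subseq ,string
                                                           ,match-start
                                                           ,match-end))
                                      ,match-end)))))))))

(deflexer-fn mylexer-fn
  ("[+-]?[0-9]+" #'parse-integer)
  ("[a-zA-Z0-9_]+" #'string-upcase)
  ("." #'identity))

(mylexer-fn "abc:=42")          ; => "ABC", 3
(mylexer-fn "abc:=42" :start 5) ; => 42, 7
```

As in the token-returning version, the lexer returns NIL when no clause
matches at the given start position.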
Cheers,
Edi.
PS: Please, if possible, continue this conversation on the mailing
list. Thanks.
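For comparison, the single-alternation approach from Daniel's footnote
can be approximated in CL-PPCRE without a success node by wrapping each
branch in a register and checking which register took part in the
match. This is only a sketch under stated assumptions:
MAKE-BRANCH-SCANNER and MATCHING-BRANCH are hypothetical helpers, the
individual regexes must not contain registers of their own, and
CL-PPCRE is assumed to be loadable through ASDF.

```lisp
(require :asdf)
(asdf:load-system :cl-ppcre)

(defun make-branch-scanner (regex-strings)
  "Return a scanner for an anchored alternation with one register per branch."
  (cl-ppcre:create-scanner
   `(:sequence
     :modeless-start-anchor
     (:alternation
      ,@(loop for regex in regex-strings
              ;; PARSE-STRING is reached with :: as in the demo above.
              collect `(:register ,(cl-ppcre::parse-string regex)))))))

(defun matching-branch (scanner string &key (start 0))
  "Return the index of the branch that matched at START and the end position."
  (multiple-value-bind (match-start match-end reg-starts)
      (cl-ppcre:scan scanner string :start start)
    (when match-start
      ;; Registers of branches that did not take part in the match are NIL.
      (values (position-if-not #'null reg-starts) match-end))))

(defparameter *branches*
  (make-branch-scanner '("[0-9]+" "[a-z]+" ".")))

(matching-branch *branches* "abc1")          ; => 1, 3
(matching-branch *branches* "abc1" :start 3) ; => 0, 4
```

The returned index plays the role of the regexp serial number: it can
be used to look up the action in a vector, at the cost of one scan
instead of one scan per clause.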
I have just posted version 0.2.1 of 'defpatt' - a mechanism for
defining and using regular expression abstractions with CL-PPCRE. The
update fixes a bothersome error afflicting certain types of defpatt
expressions. I strongly recommend that anyone looking at or using
defpatt upgrade.
'defpatt' can be downloaded from http://www.harbo.net/downloads.
best regards,
-Klaus.
Working with cl-ppcre, I have found that I increasingly use the s-expr
representation rather than the traditional string representation with
its infix operators. To make it easier to work with the s-expressions,
I've developed 'defpatt' - a package which implements a notation for
defining and referring to regular expressions in terms of cl-ppcre
s-expressions. I thought it might interest the readers of this list.
The package can be downloaded from
http://www.harbo.net/downloads/defpatt-0.2.tar.gz .
Suggestions, comments, improvements are welcome.
best regards,
-Klaus.
------
defpatt examples (from defpatt.lisp):
------
#| EXAMPLES
; If you want to try the examples, be sure to evaluate the
; expression below first - otherwise the other ones won't work.
> (defpatt:defpatt-set-default-macro-char)
; Defines #\¤ as macro character
=> T
> (cl-ppcre:all-matches-as-strings ¤(alt "a" "c" "f") "abcdefghi")
; Note: Equivalent to "a|c|f"
=> ("a" "c" "f")
; That's all very well, but doesn't buy us very much.
; However `defpatt' (as per cl-ppcre's sexpr-based
; representation of REs) enables us to both document
; the patterns much better by letting us insert comments
; into REs...
> (cl-ppcre:scan-to-strings
    ¤(seq digit+ ; used space
          ws+
          digit+ ; available space
          ws+
          digit+ ; remaining space
          ) "123 4567 7887")
; Note: `ws+' and `digit+' are defined above, in `defpatt-initialize'.
=> "123 4567 7887", #()
; ...and lets us capture data in a structured fashion...
> (cl-ppcre:register-groups-bind (used avail remain)
      (¤(seq (reg digit+) ; used space
             ws+
             (reg digit+) ; available space
             ws+
             (reg digit+) ; remaining space
             ) "123 4567 7887")
    (mapcar #'parse-integer (list used avail remain)))
; Note: `(reg ...)' creates a register binding
=> (123 4567 7887)
; ...but also lets us _FIRST_ define and document the abstraction...
> (defpatt match-nums ()
    ¤(seq (reg digit+) ; used space
          ws+
          (reg digit+) ; available space
          ws+
          (reg digit+) ; remaining space
          ))
=> MATCH-NUMS
; ...and _THEN_ use it...
> (cl-ppcre:register-groups-bind (used avail remain)
      (¤match-nums "123 4567 7887")
    (mapcar #'parse-integer (list used avail remain)))
=> (123 4567 7887)
; which is a lot more easily understood, as I am sure you will
; agree.
> (cl-ppcre:scan-to-strings ¤(upto "efg") "abcdefghi")
=> "abcd", #()
> (cl-ppcre:scan-to-strings ¤(upto+ "efg") "abcdefghi")
=> "abcdefg", #()
; To see the raw cl-ppcre expansion of a `defpatt' expression,
; simply enter it:
> ¤(seq (reg digit+) ; used space
        ws+
        (reg digit+) ; available space
        ws+
        (reg digit+) ; remaining space
        )
=> (:SEQUENCE
    (:REGISTER (:GREEDY-REPETITION 1 NIL (:CHAR-CLASS (:RANGE #\0 #\9))))
    (:GREEDY-REPETITION 1 NIL :WHITESPACE-CHAR-CLASS)
    (:REGISTER (:GREEDY-REPETITION 1 NIL (:CHAR-CLASS (:RANGE #\0 #\9))))
    (:GREEDY-REPETITION 1 NIL :WHITESPACE-CHAR-CLASS)
    (:REGISTER (:GREEDY-REPETITION 1 NIL (:CHAR-CLASS (:RANGE #\0 #\9)))))
; To see _HOW_ `defpatt' expands an expression use `macroexpand':
> (macroexpand-1 '¤(seq (reg digit+) ; used space
                        ws+
                        (reg digit+) ; available space
                        ws+
                        (reg digit+) ; remaining space
                        ))
=> (LABELS ((++ (PATT) (REP PATT 1 NIL))
            (UPTO (PATT)
              `(:SEQUENCE
                (:FLAGS :SINGLE-LINE-MODE-P)
                (:GREEDY-REPETITION
                 0
                 NIL
                 (:SEQUENCE :EVERYTHING (:NEGATIVE-LOOKAHEAD ,PATT)))
                :EVERYTHING))
            (?? (PATT) (REP PATT 0 1))
            (UPTO+ (PATT) `(:SEQUENCE ,(UPTO PATT) ,PATT))
            (ALT (&REST ARGS) `(:ALTERNATION ,@ARGS))
            (** (PATT) (REP PATT 0 NIL))
            (SEQ (&REST ARGS) `(:SEQUENCE ,@ARGS))
            (REG (&REST ARGS) `(:REGISTER ,@ARGS))
            (REP (PATT &OPTIONAL (MIN 0) (MAX NIL))
              `(:GREEDY-REPETITION ,MIN ,MAX ,PATT)))
     (SYMBOL-MACROLET ((WS+
                        '(:GREEDY-REPETITION
                          1
                          NIL
                          :WHITESPACE-CHAR-CLASS))
                       (WS*
                        '(:GREEDY-REPETITION
                          0
                          NIL
                          :WHITESPACE-CHAR-CLASS))
                       (DIGIT '(:CHAR-CLASS (:RANGE #\0 #\9)))
                       (DIGIT+ (++ DIGIT))
                       (MATCH-NUMS
                        (DEFPATT-PATTERN (SEQ
                                          (REG DIGIT+)
                                          WS+
                                          (REG DIGIT+)
                                          WS+
                                          (REG DIGIT+))))
                       (DIGIT* (** DIGIT)))
       (SEQ
        (REG DIGIT+)
        WS+
        (REG DIGIT+)
        WS+
        (REG DIGIT+))))
; `upto' and `upto+' are good examples of how having an abstraction
; mechanism helps keep REs maintainable and understandable. See
; their definitions above.
|#
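The raw expansions shown in the examples are ordinary CL-PPCRE parse
trees, so they can also be passed to CL-PPCRE without defpatt. The
sketch below copies the MATCH-NUMS expansion and rewrites `upto' and
`upto+' as plain functions; *MATCH-NUMS* and PARSE-SPACE-LINE are
hypothetical names, and CL-PPCRE is assumed to be loadable through
ASDF.

```lisp
(require :asdf)
(asdf:load-system :cl-ppcre)

;; The expansion of the MATCH-NUMS pattern, written out by hand.
(defparameter *match-nums*
  '(:sequence
    (:register (:greedy-repetition 1 nil (:char-class (:range #\0 #\9))))
    (:greedy-repetition 1 nil :whitespace-char-class)
    (:register (:greedy-repetition 1 nil (:char-class (:range #\0 #\9))))
    (:greedy-repetition 1 nil :whitespace-char-class)
    (:register (:greedy-repetition 1 nil (:char-class (:range #\0 #\9))))))

(defun parse-space-line (line)
  "Return the three numbers in LINE as integers, or NIL if LINE does not match."
  (cl-ppcre:register-groups-bind (used avail remain)
      (*match-nums* line)
    (mapcar #'parse-integer (list used avail remain))))

;; `upto' and `upto+' as plain functions that build parse trees,
;; following the definitions in the macroexpansion above.
(defun upto (patt)
  `(:sequence
    (:flags :single-line-mode-p)
    (:greedy-repetition 0 nil
                        (:sequence :everything (:negative-lookahead ,patt)))
    :everything))

(defun upto+ (patt)
  `(:sequence ,(upto patt) ,patt))

(parse-space-line "123 4567 7887")                      ; => (123 4567 7887)
(cl-ppcre:scan-to-strings (upto "efg") "abcdefghi")     ; => "abcd", #()
(cl-ppcre:scan-to-strings (upto+ "efg") "abcdefghi")    ; => "abcdefg", #()
```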