I suppose this is only marginally related to Common Lisp, but everything I'm talking about is written in Common Lisp.
I use cl-yacc for a lot of parsing, but one thing that has always seemed harder than it needs to be is writing lexers to feed it. One thing I've found helpful is creating a custom lexer for each parser state by making that state's action-table entry available to the lexer. This provides the terminals the parser is looking for and narrows the set of tokens the lexer has to consider at each step. However, it also means I am maintaining my own fork of cl-yacc.
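In rough pseudo-Lisp, the shape of it is something like this (EXPECTED-TERMINALS is the hook my fork adds; TRY-TOKEN-PATTERN is a placeholder for whatever matching machinery you have):

    ;; Sketch only: in my fork the parser calls the lexer with the list
    ;; of terminals its current action-table entry can shift, instead of
    ;; calling a zero-argument lexer the way stock cl-yacc does.
    (defun make-narrowing-lexer (stream)
      (lambda (expected-terminals)
        ;; Try only the token patterns the parser can actually use here.
        (dolist (terminal expected-terminals (values nil nil))
          (multiple-value-bind (matchedp value)
              (try-token-pattern terminal stream) ; placeholder matcher
            (when matchedp
              (return (values terminal value)))))))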
It seems (from my admittedly limited search) that this is not a common modification of yacc. Before I start bugging the maintainer about my changes, I want to know: am I abusing yacc?
I do like the separation between low-level tokenization and the higher-level symbol parse, but is there another general parsing technique (available as a Lisp library, of course) that either works at a lower level than yacc usually does or allows the lexer to access more context about the parse?
Matt
I've been very pleased with cl-parser-combinators. Not sure what you're trying to parse, but it's pretty flexible and powerful. I've used it for parsing a printed representation of molecules (SMILES strings) and have found it a pleasure to work with.
Cyrus
It does look nice, and parser combinators are very cool, but I scrape a lot of semi-structured documents. Sometimes it's hard to operate with a whole document in memory, which is a current requirement of cl-parser-combinators.
Thank you though,
Matt
I've used cl-yacc exactly once, and chose to implement "start conditions" in my lexer - I use a closed-over variable to restrict the set of patterns that I want to match in the lexer. I set and test the start condition in the lexer only, but it would certainly be possible to modify the start condition from parser rules, too.
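Schematically, it looks something like this (the condition names, pattern tables, and MATCH-TOKEN are placeholders, not my actual code):

    (defvar *initial-patterns* nil "Token patterns for the default state.")
    (defvar *string-patterns* nil "Token patterns inside a string literal.")

    ;; The closed-over START-CONDITION selects which pattern table the
    ;; lexer matches against; a match may switch the condition for
    ;; subsequent tokens.
    (defun make-lexer (stream)
      (let ((start-condition :initial))
        (lambda ()
          (let ((patterns (ecase start-condition
                            (:initial *initial-patterns*)
                            (:in-string *string-patterns*))))
            (multiple-value-bind (terminal value new-condition)
                (match-token stream patterns) ; placeholder matcher
              (when new-condition
                (setf start-condition new-condition))
              (values terminal value))))))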
I'll definitely use cl-yacc again, if the need arises - it's another of those good-quality, low-profile Lisp libraries that "just work" (tm).
[I use cl-yacc to implement a filter parser for trivial-ldap; the one supplied with trivial-ldap is somewhat unconventional.]
Yeah, that is basically what I'm doing: driving start conditions from the parser. One hint that unnecessary code duplication was going on was that the state in some of my older lexers started looking like the shift entries in the parser's action table.
Matt

On 02/04/2011 01:21 AM, Raymond Wiker wrote:
I've used cl-yacc exactly once, and chose to implement "start conditions" in my lexer
symbol parse, but is there another general parsing technique (available as a Lisp library, of course) that either works at a lower level than yacc usually does or allows the lexer to access more context about the parse?
The relatively new PEG packrat parser technologies make it possible to use just one universal description for both scanning and parsing. I see that cl-peg exists, but I haven't tried it out.
[I have also built for myself a variant of Holt's S/SL (syntax semantic language) that preserves whitespace and allows changing scanners on the fly (which allows parsing different languages embedded in one another), but I have not taken the time to polish it enough for release.]
pt
On 4 February 2011 16:39, Paul Tarvydas paul.tarvydas@rogers.com wrote:
The relatively new PEG packrat parser technologies make it possible to use just one universal description for both scanning and parsing. I see that cl-peg exists, but I haven't tried it out.
Esrap is another packrat parser for CL:
https://github.com/nikodemus/esrap
I had to parse some semi-structured text and wrote Esrap for that. Its primary limitations are lack of support for parsing from streams (it wants a string) and very little documentation.
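For flavor, rules look something like this (a sketch from memory, so treat the details as approximate):

    ;; A run of whitespace; produce NIL instead of the matched text.
    (esrap:defrule whitespace (+ (or #\space #\tab #\newline))
      (:constant nil))

    ;; A run of alphanumeric characters, returned as a string.
    (esrap:defrule word (+ (alphanumericp character))
      (:text t))

    ;; (esrap:parse 'word "hello") => "hello"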
Cheers,
-- Nikodemus
I am absolutely biased towards meta-sexp:
"A META parser generator using LL(1) grammars with s-expressions."
https://github.com/vy/meta-sexp
It seems dirt simple to use, at least to me, and the performance has been acceptable.
Regards,
~ Tom

Thomas M. Hermann
Odonata Research LLC
http://www.odonata-research.com/
http://www.linkedin.com/in/thomasmhermann
Meta does seem like a parser for people allergic to parsing. Of course it has a great Lisp pedigree as well. I am biased by early exposure to Ken Thompson's work on regular expressions, and to LALR parsers.
Matt
I had heard about cl-peg, but not Esrap. Most of the time I can get away with the space requirements needed by both parser combinators and packrat parsers, but there are some large documents where that isn't practical.
One thing I would like to eventually see is a mature library that can take a spec like a PEG and generate different types of parsers for it.
Matt
Paul Tarvydas paul.tarvydas@rogers.com writes:
symbol parse, but is there another general parsing technique (available as a Lisp library, of course) that either works at a lower level than yacc usually does or allows the lexer to access more context about the parse?
The relatively new PEG packrat parser technologies make it possible to use just one universal description for both scanning and parsing. I see that cl-peg exists, but I haven't tried it out.
Well, if we may distract the OP from cl-yacc, I'll note that Zebu also contains an (unoptimized) lexer (it just uses regexps naively to match tokens).
I would also note that, given that context-free languages include regular languages, there's little theoretical point in distinguishing a lexer from a parser: you can describe the tokens using normal grammar rules.
    space-or-comment := { space | comment } .
    comment          := '#' '|' { comment-chars } '|' '#' .
    comment-chars    := '|' not-sharp | not-pipe .
    not-sharp        := space | letter | digit | '|' | '+' | '-' | ... .
    not-pipe         := space | letter | digit | '#' | '+' | '-' | ... .
    identifier       := space-or-comment letter { digit | letter } .
etc., so basically the only lexer you need is READ-CHAR.
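For instance, the identifier rule above comes out as a few lines over the bare character stream (a sketch that ignores the space-or-comment prefix):

    ;; identifier := letter { digit | letter } .
    ;; No tokenizer anywhere: the rule reads characters directly.
    (defun parse-identifier (stream)
      (let ((c (peek-char nil stream nil)))
        (unless (and c (alpha-char-p c))
          (error "identifier expected"))
        (with-output-to-string (out)
          (loop for c = (peek-char nil stream nil)
                while (and c (alphanumericp c))
                do (write-char (read-char stream) out)))))

    ;; (parse-identifier (make-string-input-stream "foo42 bar")) => "foo42"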
On Fri, Feb 4, 2011 at 8:06 AM, Pascal J. Bourguignon pjb@informatimago.com wrote:
I would also note that, given that context-free languages include regular languages, there's little theoretical point in distinguishing a lexer from a parser: you can describe the tokens using normal grammar rules.
etc., so basically the only lexer you need is READ-CHAR.
Except that as a matter of notational convenience, we normally allow lexical grammars to be ambiguous, and apply the additional disambiguating rule that the longest possible match at each point in the input is to be preferred over shorter ones (thus the token "ifx" is not the token "if" followed by the token "x"). This rule is not directly expressible in the context-free formalism. Also, as shown by your example, expressing negation is tedious -- you have to enumerate the complement of the set of characters you wish to exclude.
Boolean grammars solve both these problems. Let's look at how they work.
The key insight is that we consider nonterminal definitions to be operations on predicates over strings. The definition
A ::= B | C
is fundamentally a disjunction: a string is an A if it's a B, or if it's a C. Why not also allow the other boolean operations of conjunction and negation? This is what boolean grammars do. So you can have rules like
A ::= B & ~C
which says that a string is an A if it's a B _and_ it's not a C.
If you consider the input to consist not of single characters but of pairs of (a character and its following character), you can express the longest-match rule. In practice, rather than making you do this explicitly, implementations like SBP provide a "followed by" notation that allows you to constrain the success of a production on the immediately following character. Thus you can write a production for "`if' followed by a token break character" fairly succinctly.
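In an invented notation (not SBP's actual syntax, just the shape of the idea):

    if-keyword ::= "if" followed-by token-break-char
    identifier ::= letter { letter | digit } & ~keyword

so "ifx" can never be parsed as the keyword "if" followed by the identifier "x", and identifiers that happen to spell keywords are ruled out by the negation.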
See Megacz' paper, which you can download from here: http://www.cs.berkeley.edu/~megacz/research/
-- Scott
On Sat, 5 Feb 2011 13:29:39 -0800, "Scott L. Burson" Scott@sympoiesis.com wrote:
Boolean grammars solve both these problems. Let's look at how they work.
I'm surprised that nobody has mentioned OMeta/OMeta2 yet.
It has this boolean functionality you speak of:
http://subvert-the-dominant-paradigm.net/blog/?p=38
It's unfortunate that the author wasn't very experienced with CL, so his implementation doesn't feel idiomatic and the code isn't always clean.
Besides this, and the somewhat idiosyncratic syntax of OMeta grammars (which is, to a degree, by necessity), it's a very interesting framework.
I've got modifications adding match tracing and source-location support, which I've never gotten around to submitting to the author.
I'd guess the next step would be converting it to use sexps for grammar descriptions.
On Mon, Feb 14, 2011 at 2:43 AM, Samium Gromoff _deepfire@feelingofgreen.ru wrote:
I'm surprised that nobody has mentioned OMeta/ometa2 as of yet.
It has this boolean functionality you speak of:
I wasn't familiar with it. Based on a quick glance, I don't think it can handle boolean grammars in their full generality. It's PEG-based, and as I understand it, PEGs cannot handle all context-free grammars. The full boolean grammar formalism, on the other hand, is a strict superset of context-free.
The big difference in practice is what is happening at runtime. PEGs are deterministic; SBP, on the other hand, is GLR-based, meaning that it is constructing multiple possible parse trees simultaneously, discarding the ones that don't work out. If a string is ambiguous under the given grammar, GLR will actually return all the possible parses (there's a reasonably compact representation for this called a "parse forest"). PEG by definition will only ever give one parse.
Of course, either of those could be a desirable property, depending on what you're trying to parse and why.
-- Scott
I know I'm late to the parade, but if you need to write grammars that can be easily augmented by other people, without them needing to know much (if any) Lisp, I can recommend the EBNF parser written by Daniel Herring. Its onion of macros also expands into pretty understandable code.
On Mon, 14 Feb 2011, William Halliburton wrote:
Thank you for the kind words. That was one of my first real CL projects. A great attraction of CL was that I would no longer need to write parser code for DSLs...
I always meant to revisit that library, clean it up, and add some more features (e.g. support for infinite streams and source locations) but got distracted in the land of continuations and never made it back.
Here's a reply to an earlier post. It was stuck in my drafts folder.
Re the OP: Yes, it sounded like you were abusing the YACC paradigm.
On Sat, 5 Feb 2011, Scott L. Burson wrote:
Boolean grammars solve both these problems. Let's look at how they work. ...
A few years ago, when I was parsing a lot (a motivator toward Lisp), I grew to like recursive-descent parsers and EBNF (the ISO 14977 variant).
AFAICT, the tokenization/parsing divide was mostly a performance hack. The tokenizer uses a minimal syntax and thus can run quickly. The parser has more conditions, but EQL is faster than STRING=. Thus the two put together ran fast, and the factorization also simplified specification.
Today, recursive-descent parsers are fast enough for most uses. Add a form that prevents backtracks, and they can handle infinite streams of data. It is easy to convert most formalisms into this form, and nothing is more flexible than "write whatever function you want here".
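To sketch the distinction (not my library's API, just the idea): a soft matcher fails without consuming input, while a committing matcher turns failure into a hard error, so the parser never rewinds and one character of PEEK-CHAR lookahead is all the buffering an infinite stream needs.

    ;; MATCH-CHAR fails softly: PEEK-CHAR leaves the stream untouched,
    ;; so the caller is free to try an alternative.
    (defun match-char (stream char)
      (when (eql (peek-char nil stream nil) char)
        (read-char stream)))

    ;; EXPECT-CHAR commits: past this point failure is an error, never a
    ;; backtrack, so no input has to be retained for rewinding.
    (defun expect-char (stream char)
      (or (match-char stream char)
          (error "Expected ~C, got ~S" char (peek-char nil stream nil))))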
If people are looking for inspiration, ANTLR is a good framework to borrow ideas from. Parsing is only half the battle. Then you reach trees, walkers, output generation, error handling, etc. Boost::spirit also has some good features.
Later, Daniel
On Thu, Feb 3, 2011 at 10:33 PM, Matthew D. Swank akopa@charter.net wrote:
It seems (from my admittedly limited search) that this is not a common modification of yacc. Before I start bugging the maintainer about my changes, I want to know: am I abusing yacc?
I've had to do that kind of thing for parsing languages like COBOL that were designed before the advent of formal parsing theory.
It is an abuse in the sense that it makes it harder to say formally exactly what language you're parsing, but hey, you do what you have to do in this business :-)
My own pet parser generator project is a CL reimplementation of Adam Megacz' Scannerless Boolean Parser: http://research.cs.berkeley.edu/project/sbp/
Scannerless parsing obviates the kind of games you're having to play by integrating the lexer into the grammar. Boolean grammars are more expressive than context-free grammars. Both of these things are cool. What you don't get in this framework, though, is a proof that your grammar is unambiguous.
My reimplementation is not far enough along to release, alas, nor do I really have any time to work on it. Maybe later this year...
-- Scott
So the next time anyone says that there aren't any libraries for Common Lisp, we can reply that there are so many good parser libraries that one must compare notes to figure out which is best for which situation. So there, ye of little faith! :)
-- Dan
In fact, I should have consolidated my replies, but by the time I realized I was probably spamming the list, I was too far in. I used to get the impression that writing parsers in Lisp was like moonshining: lots of people had rough-and-ready parsers out in the back woods, but didn't really like talking about them.
I am grateful, and a little overwhelmed by all the responses here.
Thank you,
Matt

On 02/04/2011 02:59 PM, Daniel Weinreb wrote:
So the next time anyone says that there aren't any libraries for Common Lisp, we can reply that there are so many good parser libraries that one must compare notes to figure out which is best for which situation.
I would be very interested to see that if it ever becomes something you'd be comfortable releasing.

On 02/04/2011 02:39 PM, Scott L. Burson wrote:
My own pet parser generator project is a CL reimplementation of Adam Megacz' Scannerless Boolean Parser: http://research.cs.berkeley.edu/project/sbp/
On 02/04/2011 02:39 PM, Scott L. Burson wrote:
It is an abuse in the sense that it makes it harder to say formally exactly what language you're parsing, but hey, you do what you have to do in this business :-)
I ended up writing a lexer start-condition generator that derives its state machine from the compiled yacc parser. So the point is moot, I guess.
Matt