On Fri, Feb 4, 2011 at 8:06 AM, Pascal J. Bourguignon pjb@informatimago.com wrote:
I would also note that given that context free languages include regular languages, there's also little theorical point in distinguishing a lexer from a parser: you can describe the tokens using normal grammar rules.
space-or-comment := { space | comment } . comment := '#' '|' { comment-chars } '|' '#' . comment-chars := '|' not-sharp | not-pipe . not-sharp := space | letter | digit | '|' | '+' | '-' | ... . not-pipe := space | letter | digit | '#' | '+' | '-' | ... . identifier := space-or-comment letter { digit | letter } .
etc, so basically the only lexer you need is READ-CHAR.
Except that as a matter of notational convenience, we normally allow lexical grammars to be ambiguous, and apply the additional disambiguating rule that the longest possible match at each point in the input is to be preferred over shorter ones (thus the token "ifx" is not the token "if" followed by the token "x"). This rule is not directly expressible in the context-free formalism. Also, as shown by your example, expressing negation is tedious -- you have to enumerate the complement of the set of characters you wish to exclude.
Boolean grammars solve both these problems. Let's look at how they work.
The key insight is that we consider nonterminal definitions to be operations on predicates over strings. The definition
A ::= B | C
is fundamentally a disjunction: a string is an A if it's a B, or if it's a C. Why not also allow the other boolean operations of conjunction and negation? This is what boolean grammars do. So you can have rules like
A ::= B & ~C
which says that a string is an A if it's a B _and_ it's not a C.
If you consider the input to consist not of single characters but of pairs of (a character and its following character), you can express the longest-match rule. In practice, rather than making you do this explicitly, implementations like SBP provide a "followed by" notation that allows you to constrain the success of a production on the immediately following character. Thus you can write a production for "`if' followed by a token break character" fairly succinctly.
See Megacz' paper, which you can download from here: http://www.cs.berkeley.edu/~megacz/research/
-- Scott