Dear mailing list member,
There have been a few exciting developments since the previous
progress report.
First, we (me, Brian Mastenbrook, Christophe Rhodes) have a paper
about Climacs syntax protocols accepted at the ILC 2005. In it, we
explain the difference between per-buffer and per-window syntax
updating and give examples of modules that use these methods. When
reading what my co-authors had written, I was struck by the number of
syntax modules that already exist for Climacs and how varied they are:
a very complete syntax module for ISO Prolog, a module for Tabcode
notation used for Lute scores, a module for Testing and Test Control
Notation 3 (TTCN3), and of course existing but incomplete syntax
modules for Common Lisp and HTML.
Second, due to difficulties achieving good performance with the Earley
parser, I have written an alternative parser based on incremental LR
parsing technology. I also wrote the definitions necessary for this
module to parse Common Lisp. This syntax module is named "Lisp" to
distinguish it from the one called "Common Lisp". It is experimental
for now, so both are available. The incremental LR parser keeps a
parse tree for the entire buffer up to date. It is thus a per-buffer
module, which would be impossible with the current Earley parser.
Parsing all of lisp-syntax.lisp (around 1000 lines) takes around 2
seconds on my machine, but this is only done when the file is
initially loaded, or when edits to the buffer are made that makes the
contents parse entirely differently from last time, such as when a
closing parenthesis is introduced at the beginning of the buffer. In
almost all cases, the incremental LR parser is able to patch the
existing parse tree. As an example, typing a function definition at
the beginning of a buffer containing lisp-syntax. lisp requires less
than 10ms per keystroke to update the entire parse tree.
Current feature of the "Lisp" syntax module include C-M-b and C-M-f
(backward and forward expression), C-M-x (eval defun), and of course
syntax highlighting of strings, keywords, and comments.
One interesting feature of the LR parser that I implemented is that
the lexer is a generic function that depends on the state of the
parser. This method is slower than the one used for the Earley
parser, but we could afford this since the parser is incremental. The
consequences of this lexer organization is that the parser could enter
a state in which a totally different language, with different lexemes,
is to be parsed. This feature opens the door to modular parsers
allowing us to parse a string of PHP code inside an HTML buffer, or to
check the grammar of strings and comments of a program.
Plans are to factor out code that could be common for more LR-based
syntax modules, either into syntax.lisp or to a new file, but probably
the former since there are definitions like that of a parse tree that
look very similar in the Earley parser and in the LR parser. I also
haven't given up on trying to make the Earley parser truly incremental
in the same way that the LR parser now is.
Plans also include writing a new LALR parser generator, simply because
the one in Climacs uses generic functions instead of tables for the
traditional `action' and `goto' (called `new-state' in Climacs)
functions of LR parsing. The new parser generator would have to
discover inheritance relations between parser states. Once such a
generator exists, I might rewrite the HTML syntax module to take
advantage of it.
For a detailed list of updates, I refer you to the CVS archives.
Take care,
--
Robert Strandh
---------------------------------------------------------------------
Greenspun's Tenth Rule of Programming: any sufficiently complicated C
or Fortran program contains an ad hoc informally-specified bug-ridden
slow implementation of half of Common Lisp.
---------------------------------------------------------------------