[mac-lisp-ide] Thoughts on syntax editing

9 Jul 2004

      I've been thinking for several weeks about the issues involved with
supporting syntax-aware editing in an editor, particularly Common Lisp
syntax. Recently I wrote an incremental parser to handle syntax
coloring for lisppaste, and so that experience has helped me to get a
better handle on what the relevant issues are to supporting robust
syntax-aware editing in a text editor.

What I'm including in the umbrella of syntax-aware editing are the
following tasks:

* Coloring the source based on syntactic type
* Detecting invalid syntax and (a) informing the user and (b) skipping
  over detectable sections of invalid syntax when performing other
  commands
* Detecting sections of comment and strings, and ignoring these for
  other commands appropriately
* and, commands which operate on the parse tree of the source,
  including the C-M-* functions which operate on an entire
* s-expression

There are several approaches to making this work. Two of them have
been done before en masse, with various degrees of success:

* Force the user to only edit valid syntax, and insert balanced pairs
  of parens / quotes / comment delimiters. This approach works fine
  for editing new source but does not work so well when reading in an
  existing, possibly invalid file, and also can feel much like a
  straightjacket. It also requires a large amount of adaptation to the
  environment. The best exemplar of this approach is Interlisp's
  S-Edit structural editor.
* Maintain the view that the text is merely an octet stream, and
  locally use regexps to try to determine what syntax things are,
  setting character properties along the way. This approach is (as far
  as I understand it) the approach used by Emacs, which leads to
  massive confusion when editing unbalanced strings and
  comments. Sometimes Emacs never recovers; especially with CL-style
  #||# multiline comments.

What I would like to see is a third path: a robust editor which always
knows the syntactic type of the text, but allows the user to edit code
as if they were using a plain text editor (or allows a slightly
smarter mode where balanced syntax is inserted by default). This is
the holy grail of useful syntactic editing. It's also very
complicated.

My first approach in writing such an editor revolved around viewing a
buffer as a doubly-linked list of lines, which themselves were a
doubly linked list composed of segments of text broken up by markers
(which delineate syntactic type and also include the cursor and
mark). This approach quickly got very complex: it was too difficult to
write general text-manipulation routines when text was constantly
being broken up by markers, and there could be an unbounded number of
markers between each character.

 ------------------------------------------------------
| "a" | #<marker> | "bc" | #<syntactic-marker> | "def" |
 ------------------------------------------------------
 The "markers embedded in lines" approach: too complex.

Dan Barlow had a better idea: keep the text as a doubly linked list of
strings representing lines, but still use markers to represent the
beginning and end of sections with a particular syntactic type. In
other words, the markers would be separated out from the text itself,
making it possible to have views of the same text with different
collections of markers. Primitive editing operations would then be
responsible for making sure that whatever marker invariants were
necessary were preserved, but this could be simpler when markers are
disjoint and it's easy to pull out the set of markers affected by an
operation. This is, I think, the best approach, as it would allow
keeping the "line" abstraction that hemlock uses as-is.

I am envisioning that markers can be used for several different
purposes - extensible at the user's request. This would include: the
cursor, the mark, delimiters for syntax types, even the beginning and
end of the line. Allowing multiple cursor-type markers and setting one
as primary would allow collaborative editing fairly easily. (On the
subject of collaborative editing, the disjoint-markers idea is
probably a good one here too: it means that updates to the text can be
sent out without syntactic information, and each client updates the
marker set to account for the new text on its own.)

The next question of relevance is to figure out how to use these
markers to implement knowledge of the file syntax. This Ain't Easy,
for several reasons.

First, we need to know the raw syntactic role of a various section of
text. Is it a symbol? A string? A list opening or closing? But if we
view s-expression editing via the likes of C-M-t as the same problem
as syntax coloring, this means we need to maintain a nested view of
the current syntax, because finding the end of the current
s-expression means understanding not just the raw syntactic role of an
element but its level of nesting inside other syntactic types. Nesting
is also necessary for robust coloring in Common Lisp: emacs famously
fails to handle nested #||# comments (and even sometimes non-nested
ones), leaving most of your buffer showing in the font lock comment
color.

Maintaining this nested view of syntax while allowing the user to
insert unbalanced syntactic elements is more than merely a SMOP,
however. A simple insertion of a character might affect the syntactic
type of the entire rest of the file. Inserting a closing character
would revert it back then - but possibly not a clean
reversion. Deleting a character might require restarting the parser at
some point prior to the previous character or syntactic type change.

Here are some specific examples to think about:

The cursor is on #\A in "#3A((1) (2) (3))". The user hits the delete
key.

The cursor is on the first #\( in "() (+ 1 2))". The user hits the #\(
key.

The cursor is at the beginning of the second line of the following
section of text. The user hits the #\" key.

-------------------------------
(format nil
Mary had a little lamb. Its fleece was as white as ~A. #|"
'|#snow|) ; |
-------------------------------

These examples demonstrate that robust syntax-aware editing is not a
simple matter of local regular expressions or of a parser that can be
run until the top-level syntactic role matches what it was before the
edit. A single edit may actually have a deeper meaning - for instance,
in the second example, inserting a paren means "insert a level of
list-ness at this position in the syntactic role stack until an
unmatched close paren is found on this level".

To solve the problem of editing unbalanced strings or multiline
comments, Andreas Fuchs suggested that it be possible to revert the
syntactic type of an entire region when the insertion of an unmatched
opening element would otherwise destroy this information. I think this
is a good idea. It can be implemented by "hiding" one set of markers
when the unmatched opening element is inserted, and un-hiding them
after. Any edits the user makes in the meantime can be rectified to
both the hidden and unhidden syntactic markers, thus meaning zero
reparsing when the close element is inserted. However, this sounds
like just the type of SMOP that is far more difficult than it sounds.

I'm curious to know what other people think about this. It seems to me
that this is a rather difficult problem when the various nooks and
crannies of CL syntax are taken into account (unless you're willing to
live with the possibility of reparsing the entire source file on many
edits). If this doesn't make any sense at all I'd like to know that
too.

Brian

--
Brian Mastenbrook                         "God made the natural numbers;
http://www.cs.indiana.edu/~bmastenb/       all else is the work of man."
bmastenb@cs.indiana.edu                    -- Leopold Kroneker