Update of /project/climacs/cvsroot/papers/ilc2005/syntax In directory common-lisp.net:/tmp/cvs-serv1803
Modified Files: climacssyntax.tex Added Files: parserclasses.eps parserclasses.pdf ttcn3msc.eps ttcn3msc.pdf Log Message: Buffer protocol description; some syntax
Date: Mon May 23 03:05:55 2005 Author: bmastenbrook
Index: papers/ilc2005/syntax/climacssyntax.tex diff -u papers/ilc2005/syntax/climacssyntax.tex:1.13 papers/ilc2005/syntax/climacssyntax.tex:1.14 --- papers/ilc2005/syntax/climacssyntax.tex:1.13 Sun May 22 13:47:26 2005 +++ papers/ilc2005/syntax/climacssyntax.tex Mon May 23 03:05:55 2005 @@ -58,42 +58,65 @@ \section{Introduction}
The field of advanced text editors is a crowded one, with a long -history and an apparent ability to cause passionate argument. Climacs is philosophically part of the Emacs editor tradition, which has spawned many variants with many different approaches to buffer management, incremental redisplay, and syntax analysis. Emacs itself traces its lineage to TECO, where Emacs was originally implemented as a set of TECO macros. Climacs is compared to several interesting variants of Emacs in Table \ref{table:editorcompare}. +history and an apparent ability to cause passionate argument. Climacs +is philosophically part of the Emacs editor tradition, which has +spawned many variants with many different approaches to buffer +management, incremental redisplay, and syntax analysis. Emacs itself +traces its lineage to TECO, where Emacs was originally implemented as +a set of TECO macros. Climacs is compared to several interesting +variants of Emacs in Table \ref{table:editorcompare}.
-\begin{figure*} \begin{center} +\begin{figure*} +\begin{center} \begin{tabular}{|c|c|c|c|c|} \hline \textbf{Editor} & \textbf{Buffer Implementation} & \textbf{Syntax Analysis} & \textbf{Language} \ -\hline TECO & Gap Buffer & Unknown & Assembly + TECO Macros +\hline TECO & Gap buffer & Unknown & Assembly + TECO Macros \ \hline Zmacs & Probably doubly-linked list of lines & None & MacLisp \ -\hline GNU Emacs & Gap Buffer & Regular Expressions & C + Emacs Lisp +\hline GNU Emacs & Gap buffer & Regular Expressions & C + Emacs Lisp \ \hline Hemlock & Doubly-linked list of lines & None & Common Lisp \ -\hline FRED & Gap Buffer & None & PPC Assembly + Common Lisp +\hline FRED & Gap buffer & None & PPC Assembly + Common Lisp \ \hline Deuce & Doubly-linked lists of lines & Unknown & Dylan \ \hline Climacs & Multiple & Multiple & Common Lisp \\hline \end{tabular} -\caption{Implementation strategies of multiple Emacs variants} \end{center} \label{table:editorcompare} \end{figure*} - -Climacs' syntax analysis is a flexible protocol which can be implemented with a full language lexer and parser. GNU Emacs, the most commonly used Emacs-like editor, uses regular expressions for its syntax analysis. Because these regular expressions are applied lazily and not on the whole buffer, constructs such as Common Lisp's nestable \verb+#| |#+ block comments will often confuse the regular expressions. If the parser starts after the opening \verb+#|+ then the closing \verb+|#+ will be treated as the start of an escaped symbol name. Even if the regular expression parses the whole block comment correctly, other expressions can still match on the contents of the comment, leading to issues when the first character in a column in the block comment is the start of a definition. Emacs users quickly learn to insert a space before the open paren to work around Emacs' font-lock deficiencies. +\caption{Implementation strategies of multiple Emacs variants} +\end{center} +\label{table:editorcompare} +\end{figure*} + +Climacs' syntax analysis is a flexible protocol which can be +implemented with a full language lexer and parser. GNU Emacs, the most +commonly used Emacs-like editor, uses regular expressions for its +syntax analysis. Because these regular expressions are applied lazily +and not on the whole buffer, constructs such as Common Lisp's nestable +\verb+#| |#+ block comments will often confuse the regular +expressions. If the parser starts after the opening \verb+#|+ then the +closing \verb+|#+ will be treated as the start of an escaped symbol +name. Even if the regular expression parses the whole block comment +correctly, other expressions can still match on the contents of the +comment, leading to issues when the first character in a column in the +block comment is the start of a definition. Emacs users quickly learn +to insert a space before the open paren to work around Emacs' +font-lock deficiencies.
The Climacs text editor is a combination of frameworks for buffer representation and parsing, loosely coupled with a CLIM-based display engine. It includes the Flexichain library \cite{flexichain}, which provides an editable sequence representation and mark (cursor) management using a simple linked lists used for implementing the -buffer protocol; and an implementation of a -slight modification of the Earley parsing algorithm \cite{earley}, to -assist in the creation of syntax-aware editing modes. An application -can combine a particular implementation of the buffer protocol, the -syntax protocol, and its own display methods to produce a -sophisticated editor for a particular language. +buffer protocol; and an implementation of a slight modification of the +Earley parsing algorithm \cite{earley}, to assist in the creation of +syntax-aware editing modes. An application can combine a particular +implementation of the buffer protocol, the syntax protocol, and its +own display methods to produce a sophisticated editor for a particular +language.
The Climacs buffer protocol, which provides a standard interface to common text editor buffer operations, uses the Flexichain library; we @@ -112,39 +135,119 @@ \section{Buffer Protocol} \label{sec:buffer}
-The buffer protocol provides a set of methods for modifying and -reading the contents of a buffer which can contain arbitrary objects. -Typically these objects are characters, but the buffer can contain any -object that the syntax protocol can display in the Climacs window. The -buffer protocol is independent of any implementation of the protocol, -which allows flexible representations of buffers. - -The aim is for Climacs to provide several implementations of the -buffer protocol, one of which will use a sequence of lines organized -into a tree for quick access. In this implementation, a line can be -considered opened or closed. Editing operations are performed on -opened lines, which are represented as a cursorchain (a Flexichain -with an arbitrary number of cursors). A fixed number of lines are kept -opened according to a LRU scheme. When a line is closed it is -converted to a vector. If the line contains only base-char objects -this vector is a base-string; otherwise, it is unspecialized. -Currently, however, Climacs uses a single cursorchain for the entire -buffer, in effect making it a simple gap-buffer implementation. - -``Protocol'' is not just an empty claim, as there are already multiple -buffer implementations: Aleksandar Bakic's persistent buffer -implementations, providing a cheap implementation of the undo protocol. -(FIXME: a citation, either to something ``in preparation'' or to some -previous description of the ideas?) +Climacs abstracts the implementation of editable sequences as editor +buffers using the buffer protocol. The buffer protocol provides a set +of methods for modifying and reading the contents of a buffer, and +setting and retrieving marks in the buffer. Buffers can contain +arbitrary objects. Typically these objects are characters, but the +buffer can contain any object that the syntax protocol can display in +the Climacs window. The buffer protocol is independent of any +implementation of the protocol, which allows flexible representations +of buffers. + +Currently Climacs uses a single Flexichain ``cursorchain'' as the +editable sequence representation for standard buffers. A cursorchain +is a circular gap buffer with an arbitrary number of cursors in the +buffer. Climacs uses these cursors as the implementation of marks in +the buffer. The single gap-buffer implementation is used by many other +editors, including GNU Emacs. Flexichain improves on this by making +the gap buffer circular in its underlying representation; the start of +the sequence is stored in a separate slot, along with the beginning of +the gap. + +Climacs also provides a buffer implementation utilizing functional +data structures as the editable sequence representation. This buffer +implementation provides a low-cost implementation of undo along +multiple edit histories. (FIXME: a citation, either to something ``in +preparation'' or to some previous description of the ideas? It would +help if someone with more understanding than I of this buffer +implementation would sketch this out.) + +Climacs is intended to provide other buffer implementations, one of +which will use a sequence of lines organized into a tree for quick +access. In this structure, a line can be considered opened or +closed. When a line is opened, it is represented as a Flexichain +cursorchain. All editing operations are performed on open lines. A +fixed number of lines are kept opened according to a +least-recently-used scheme. When a line is closed it is converted to a +vector for efficient storage and access. If the line contains only +base-char objects this vector is a base-string; otherwise, it is +unspecialized. + +This structure has the advantage of efficient line-based access in the +buffer. It also provides much less pessimistic behavior than that of a +single gap buffer when a user is editing two disparate sections in a +large file. With a single gap buffer, the gap must be moved to the +point of edit before an edit operation is allowed. When the buffer is +large and edit operations occur frequently at multiple locations in +the buffer, this requires a substantial amount of copying between +edits. In this situation single-gap-buffer editors like Emacs will +noticably pause between edits to move the gap. A structure which +contains a sequence of lines and keeps the most recently used lines +open as gap buffers can operate as a multi-gap buffer with automatic +gap placement, while not providing pessimistic performance when +accessing a specific line in the buffer. + +The efficiency of Climacs buffers depends greatly on the +implementation of the buffer protocol that is used. Space efficiency +will also depend on the implementation of Common Lisp which is used to +run Climacs. In Steel Bank Common Lisp, the type character is +represented with the full 22 bits used by the Unicode character +space. External character encodings other than Unicode are converted +to Unicode when a file is read in to memory. If the characters are +stored in a specialized array, this will net a worst case space +efficiency of four bytes of every byte in the file. However, the time +advantages of this representation outweigh the space +inefficiency. Searching for an individual character in a sequence of +characters encoded in UTF-8 is $O(n)$, because each individual +character must be examined to determine the number of octets which are +stored to represent that character. + +The Flexichain library uses an unspecialized vector for its storage, +which uses one machine word per element, either as an immediate value +or as a pointer to a larger element. Climacs buffers can contain any +object, so in a suitably complex syntax and buffer protocol +implementation an object in a buffer may
\section{Syntax Protocol} \label{sec:syntax}
-A syntax in Climacs is an incremental parser which creates and updates -a parse tree of a buffer's contents, and provides a mechanism for -drawing these parsed objects in a window. The parser accepts lexemes -produced by an incremental lexical analyzer associated with the -syntax. +\begin{figure*} + \begin{center} + \includegraphics{parserclasses} + \caption{Organization of classes used by a typical syntax} + \label{fig:syntaxclasses} + \end{center} +\end{figure*} + +Climacs is designed to allow multiple implementation strategies for +buffer parsers and syntax-aware display mechanisms. The set of hooks +that Climacs provides to allow this is the syntax protocol. A syntax +in Climacs is a class and set of methods which provide a lexical +analyzer, parser, and display methods for a buffer. The incremental +parser associated with a syntax creates and updates a parse tree of a +buffer's contents, and provides a mechanism for drawing these parsed +objects in a window. The parser accepts lexemes produced by an +incremental lexical analyzer. Display is handled by drawing methods +implemented using the CLIM high-level display toolkit. + +Though a syntax is free to choose its own implementation strategy, +lexical analysis and parsing of the buffer is typically done in an +object-oriented fashion. The lexer operates on objects in the buffer, +usually characters, and returns objects of various classes. Each +parser production is represented by a class. In complex syntaxes, the +parser rules can be quite complicated and involve arbitrary code, but +for a simple grammar the parsing rules can be entirely represented by +matching on the classes returned by the tokenizer and parser. Figure +\ref{fig:syntaxclasses} shows the organization of classes in the TTCN3 +grammar. + +Climacs provides an Earley parser \cite{earley} which is sufficient to +parse most context-free grammars. Earley parser: discussion of +generality and asymptotic efficiency (in general and for usual cases). +Possibility of replacement in case of complicated grammars? (What was +that parsing algorithm that someone mentioned on \verb+#lisp+? +\textit{I have no idea.})
The syntax analysis can be applied either in a per-window or per-buffer function. The per-window approach is best suited to @@ -154,11 +257,6 @@ approach is appropriate when the parse tree will also be used for some other display or analysis of the text in the buffer.
-Earley parser: discussion of generality and asymptotic efficiency (in -general and for usual cases). Possibility of replacement in case of -complicated grammars? (What was that parsing algorithm that someone -mentioned on \verb+#lisp+?) - \section{Syntaxes} \label{sec:syntaxes}
@@ -186,7 +284,7 @@ Christophe might want to describe the Prolog syntax a bit here. In particular any details having to do with the implementation of operator precedence as specified in the ISO Prolog standard might be -interesting. +interesting.
[ Maybe just the \textit{priority} stuff? I think that's the only interesting bit of non-context-freeness; user-defined @@ -209,6 +307,12 @@ slightly-buggy or incomplete syntax mode will render the whole thing useless. Violation of WiB?
+The Testing and Test Control Notation 3 (TTCN3) language is a language +which captures detailed test specifications. TTCN provides both a +textual ``core'' grammar and a graphical presentation format similar +to Message Sequence Charts (see figure \ref{fig:ttcn3gr}). Climacs +currently provides an editor for a subset of the TTCN3 core language. + The TTCN3 syntax is implemented with a high-level macro which defines classes and adds syntax rules using the syntax protocol for each terminal and non-terminal in the grammar. The syntax of this macro @@ -218,6 +322,20 @@ receive their own non-terminal entry in the grammar. In addition, this macro defines basic display functions for the syntax objects produced by the parser, with language keywords appearing in a separate color. + +\begin{figure*} + \begin{center} + \parbox{0.45\linewidth}{\includegraphics{ttcn3msc}} + \parbox{0.25\linewidth}{\texttt{\ [1cm]\noindent + testcase MyTestCase (inout integer MyPar)\ + runs on MyMTCtype system SystemType {\ + var integer MyVar := 1;\ + MyMTCPort.send(MyTemplate);\ + }}} + \caption{A fragment of TTCN3 and its corresponding Graphical Representation (GR)} + \label{fig:ttcn3gr} + \end{center} +\end{figure*}
\subsection{Per-Buffer Syntax: a tablature editor} \label{sec:tabeditor}