Hello all,
[just in case some of you forgot to subscribe to the list, I put you on CC this one time]
I have been thinking about the API and the internal representation of a buffer.
First, I think we should start by generalizing the contents to Unicode characters. It would be OK with me, if we choose some normalized form and stick to it. I also think that we should not generalize further right now (allowing arbitrary objects), but what I am suggesting below is compatible with such an extension.
Concerning the API first, we should get rid of the "doubly-linked list of lines". The API should not mention the existence of lines, except as the text that is contained between two newline characters.
It is probably still a good thing to represent the buffer internally as a sequence (but not necessarily a doubly-linked list) of lines, but that's another story. I know Gilbert would like to associate reader state with lines, but as I will try to show below, it does not have to be lines, and in fact, it would not be very optimal to associate it with lines.
For the internal representation, I suggest representing the buffer as a cursorchain (see our flexichain paper at the first European workshop on Lisp and Scheme) of lines. There would be many different representations of a line. At first, I thought that it would be a good idea to have a small number of `open' lines represented as cursorchains of Unicode characters (or fixnums if the Lisp implementation does not have Unicode characters), and the others would have some kind of compressed format such as a UTF-8 string or simply a gzipped version of the line. Such transformations would complicate search functions that would have to `open' every line, with possibly huge performance penalties, both in terms of CPU and memory consumption.
Instead I suggest that open lines be represented as cursorchains and closed lines as vectors, but in both cases, there can be different element types. At least three different element types would be used: one-byte Unicode character (used when the line contains only Unicode characters from 0 to 255, which includes latin-1), two-byte Unicode character (in most other cases) or four-byte Unicode characters (for serious Unicode users). The buffer would automatically convert from one-byte to two-byte to four-byte whenever an attempt is made to insert a Unicode character that is beyond what can be represented. When a line is closed, it is scanned to see whether a more compact representation can be used. I suggest keeping several lines open and closing them in an LRU-like way.
This representation has the interesting property that every character in the buffer is always represented as a valid Unicode character, so that nothing special needs to be done for search functions. One could identify some interesting special cases, such as latin-15, which would normally require more than one byte, except that very few characters need more than one. The search function would be slightly harder for such special cases, but not much. Also, sometime in the future, one might identify line types that can have arbitrary Lisp objects in them. Another obvious property of the representation is that latin-1 texts are very compact indeed, an important special case, I imagine.
Now, concerning syntax awareness, I think that what Gilbert is thinking of is the most promising idea so far (with minor modifications). His idea is have a `read' function that can have its state stored, and to store such state in association with (the beginning of) each line in the buffer. Modifying the contents of the buffer would require recomputing read state from the beginning of the line where the modification took place to the end of the (last) window on display. In most cases, where there is only one window on display, there would be a very modest amount of work to do. In the worst case (when the first line of a huge buffer is modified, and there is another window at the end of it), the entire buffer would need to be recomputed.
The only objection I have to Gilbert's proposal is that he would like read states to be associated with lines. In fact, if it weren't for memory consumption, we could associate such state with every character in the buffer. Using a line as a unit is just a way of storing such information every so often, often enough to minimize computations and rarely enough that memory consumption remain reasonable. But this is suboptimal because there could be empty lines or lines with just a few characters on them, and other lines that have hundreds of lines. A better idea would be to associate read state with marks (say) every couple hundred characters. Such marks would be added and removed dynamically as the text grows and shrinks so that the distribution of marks remain roughly the same. These read-state marks would probably be stored in another cursorchain specific to each buffer.
phemlock-devel@common-lisp.net