Good data-structure for characters in editor buffer?

From: Christopher R. Barry
Subject: Good data-structure for characters in editor buffer?
Date: Fri, 25 Feb 2000 00:00:00 +0000
Message-ID: <87aekpokzh.fsf@2xtreme.net>

What's an ideal data-structure for storing the characters of an editor
buffer? I'm thinking a doubly-linked list. Also, isn't there something
in the GOF book about this very thing? (Don't own the book, but
flipped through it at Barnes & Noble and got the impression that there
are probably better books for Lisp programmers to be reading, but I
really don't know.)

Christopher

Re: Good data-structure for characters in editor buffer? Roger Corman
- Re: Good data-structure for characters in editor buffer? Tim Bradshaw
Re: Good data-structure for characters in editor buffer? Erik Naggum
Re: Good data-structure for characters in editor buffer? ·······@labri.u-bordeaux.fr
Re: Good data-structure for characters in editor buffer? Gareth Rees
- Re: Good data-structure for characters in editor buffer? nonya
- Re: Good data-structure for characters in editor buffer? Robert Monfera
  - Re: Good data-structure for characters in editor buffer? Gareth Rees
    - Re: Good data-structure for characters in editor buffer? Barry Margolin
  - Re: Good data-structure for characters in editor buffer? Christopher C Stacy
Re: Good data-structure for characters in editor buffer? Paolo Amoroso

From: Roger Corman
Subject: Re: Good data-structure for characters in editor buffer?
Date: Fri, 25 Feb 2000 00:00:00 +0000
Message-ID: <38b6d898.66162887@nntp.best.com>

On Fri, 25 Feb 2000 06:24:01 GMT, ······@2xtreme.net (Christopher R.
Barry) wrote:

>What's an ideal data-structure for storing the characters of an editor
>buffer? I'm thinking a doubly-linked list. Also, isn't there something
>in the GOF book about this very thing? (Don't own the book, but
>flipped through it at Barnes & Noble and got the impression that there
>are probably better books for Lisp programmers to be reading, but I
>really don't know.)
>
>Christopher
There is no single ideal data structure. It really depends what the
requirements are. To make a high-quality editor which handles very
large files (multi-megabyte) and keeps good performance, can be very
difficult and challenging (I've written a few, with middling success).
If your goals are more modest, things get simpler. In all cases, I
would say keeping text as raw bytes in a buffer will be better than
using a linked list of characters.

One of the most popular (because it gets used so much) and unpopular
(because of its limitations) editors in the world is the built-in Edit
control in Windows 95/98. It is limited to 32k of text, and obviously
this size of file could be easily kept and manipulated as a single
string of text.

If you want to handle larger files, a good upper limit might be 4 or 8
megs. Not that you necessarily need an upper limit, but it almost
always helps to establish practical limits to determine architecture
and data structure requirements. Text editors in the past
traditionally rolled their own virtual memory management scheme
(swapping text between a file and memory as needed) but I would think
on any modern Unix system or an OS such as Windows which has good
virtual memory support it probably makes sense to just use memory and
rely on the OS to do the swapping. If you can handle performance
degrading linearly with file size (as most text files tend to be on
the small end of the range, you just have to suffer when you need to
edit a particularly large file) then you can probably manipulate the
text using simple operations.

A few techniques, such as others have mentioned in this thread, will
make it much more efficient of course. One thing, be very careful of
operations that require exponential time (avoid them, of course) like
inserting text one character at a time, when each character requires
all the following characters to be moved. Develop efficient functions
for cutting and pasting arbitrary regions of text from/to the buffer.
This is where a raw text buffer will be much more efficient than
linked lists of text, and leverage these operations as much as
possible when adding editor functionality.

Roger Corman

From: Tim Bradshaw
Subject: Re: Good data-structure for characters in editor buffer?
Date: Fri, 25 Feb 2000 00:00:00 +0000
Message-ID: <ey3bt55ro85.fsf@tfeb.org>

* Roger Corman wrote:
> If you want to handle larger files, a good upper limit might be 4 or 8
> megs. Not that you necessarily need an upper limit, but it almost
> always helps to establish practical limits to determine architecture
> and data structure requirements. 

I'd find a hard upper limit pretty unpalatable -- especially one that
small.  I guess many people wouldn't, but I find it very convenient
that you can edit really large files if you need to.  Emacs used to be
a pain because it had some fairly draconian limit (16M? 8M?) on buffer
size, but it now seems to be quite large (128M and heading for 1G by
the look of it).

--tim

From: Erik Naggum
Subject: Re: Good data-structure for characters in editor buffer?
Date: Fri, 25 Feb 2000 00:00:00 +0000
Message-ID: <3160459010404972@naggum.no>

* ······@2xtreme.net (Christopher R. Barry)
| What's an ideal data-structure for storing the characters of an editor
| buffer?

  a string or, more probably, a region of foreign-allocated memory.  you
  will have to engage in your own memory management stuff with such editor
  buffers no matter what you do, as the operations you want to perform on
  one are highly specialized and you would waste a lot of memory and CPU
  time on more general data structures.

  however, that said, you would want to think hard about how to store the
  results of the operations on the buffer efficiently.  it is not the text
  that is the problem.

#:Erik

From: ·······@labri.u-bordeaux.fr
Subject: Re: Good data-structure for characters in editor buffer?
Date: Fri, 25 Feb 2000 00:00:00 +0000
Message-ID: <6wd7plo9tq.fsf@napperon.labri.u-bordeaux.fr>

······@2xtreme.net (Christopher R. Barry) writes:

> What's an ideal data-structure for storing the characters of an editor
> buffer? I'm thinking a doubly-linked list.

I don't know what the ideal structure is, but a doubly linked list of
characters would be very wasteful in terms of memory use.

IIRC, Multics Emacs used a (doubly?) linked list of _lines_, where
each line was a string.  Multics had special hardware to operate on
these strings, but you could probably do it just as well in software
nowadays.  The problem with this representation is that performance is
severely degraded if you have long lines.  Moving by lines is very
fast instead.

GNU Emacs uses (at least it did at some point) a big buffer of
characters usually larger than the space needed for the text.  The
text in the buffer is divided into two parts.  The first part starts
at the beginning of the buffer and the second part ends at the end of
the buffer.  The result is a _hole_ that is moved whenever an
insertion or a deletion operation is called for, so that such an
operation always operates on the character just before or just after
the hole.  The idea is that insert and delete operations are usually
not randomly scattered, but clustered.  An operation is likely
followed by several others not too far away.  Moving by lines requires
scanning the contents of the buffer to find newline characters. 
-- 
Robert Strandh

---------------------------------------------------------------------
Greenspun's Tenth Rule of Programming: any sufficiently complicated C
or Fortran program contains an ad hoc informally-specified bug-ridden
slow implementation of half of Common Lisp.
---------------------------------------------------------------------

From: Gareth Rees
Subject: Re: Good data-structure for characters in editor buffer?
Date: Fri, 25 Feb 2000 00:00:00 +0000
Message-ID: <uln49h7yw.fsf@pobox.com>

Christopher R. Barry <······@2xtreme.net> wrote:
> What's an ideal data-structure for storing the characters of an editor
> buffer?

Line-oriented editors often have two linked lists of lines: one
containing lines before the current line (in reverse order), one
containing lines following the current line.  This is a good approach
for stream editors, or for editing files larger than memory.  It works
poorly when there are very long lines or when operations are not
line-oriented.

Robert Standh (in <··············@napperon.labri.u-bordeaux.fr>)
describes an approach using a buffer of characters with a hole at the
current editing position which is suitable when operations are largely
character-oriented and tend to cluster -- if you have a mixture of
character-oriented and line-oriented operation you might adopt a hybrid
representation consisting of a character buffer together with linked
lists of lines (actually pointers into the character buffer).

There's a book on the subject:

   C. A. Finseth 
   The craft of text editing
   Springer-Verlag, 1991
   ISBN: 3540976167

-- 
Gareth Rees

From: nonya
Subject: Re: Good data-structure for characters in editor buffer?
Date: Fri, 25 Feb 2000 00:00:00 +0000
Message-ID: <mHvt4.5553$W92.126035@typhoon.ne.mediaone.net>

> There's a book on the subject:
>
>    C. A. Finseth
>    The craft of text editing
>    Springer-Verlag, 1991
>    ISBN: 3540976167

You can find this book online at:
http://www.finseth.com/~fin/craft/index.html See chapter 6 for information
on the buffer gap data structure.

-Nonya

From: Robert Monfera
Subject: Re: Good data-structure for characters in editor buffer?
Date: Fri, 25 Feb 2000 00:00:00 +0000
Message-ID: <38B68A5A.5F8496B7@fisec.com>

Gareth Rees wrote:

> Line-oriented editors often have two linked lists of lines: one
> containing lines before the current line (in reverse order), one
> containing lines following the current line.

I've heard of something very similar, called a gap editor, which
introduced the gap not at the line level but at the character level, so
all characters before the cursor were in one pack and all after the
cursor in another.  Maybe you are also thinking of this one.

Robert

From: Gareth Rees
Subject: Re: Good data-structure for characters in editor buffer?
Date: Fri, 25 Feb 2000 00:00:00 +0000
Message-ID: <uhfexgyqw.fsf@pobox.com>

Gareth Rees wrote:
> Line-oriented editors often have two linked lists of lines: one
> containing lines before the current line (in reverse order), one
> containing lines following the current line.

Robert Monfera wrote:
> I've heard of something very similar, called a gap editor, which
> introduced the gap not at the line level but at the character level, so
> all characters before the cursor were in one pack and all after the
> cursor in another.  Maybe you are also thinking of this one.

"Zed", a stream editor than ran on Phoenix (an IBM 3070 mainframe at
Cambridge UK), used the "pair of linked lists of lines" technique.  When
the "lines before the current line" ran out of space, lines could be
written out to the destination and removed from memory; when the "lines
after the current line" became empty, more lines could be read into
memory from the source.  But as long as you stayed in a section of the
file that fit into memory you could move back and forth in the file as
you chose.

-- 
Gareth Rees

From: Barry Margolin
Subject: Re: Good data-structure for characters in editor buffer?
Date: Fri, 25 Feb 2000 00:00:00 +0000
Message-ID: <%TAt4.64$I31.1734@burlma1-snr2>

In article <·············@pobox.com>,
Gareth Rees  <···········@pobox.com> wrote:
>Gareth Rees wrote:
>> Line-oriented editors often have two linked lists of lines: one
>> containing lines before the current line (in reverse order), one
>> containing lines following the current line.
>
>Robert Monfera wrote:
>> I've heard of something very similar, called a gap editor, which
>> introduced the gap not at the line level but at the character level, so
>> all characters before the cursor were in one pack and all after the
>> cursor in another.  Maybe you are also thinking of this one.
>
>"Zed", a stream editor than ran on Phoenix (an IBM 3070 mainframe at
>Cambridge UK), used the "pair of linked lists of lines" technique.  When
>the "lines before the current line" ran out of space, lines could be
>written out to the destination and removed from memory; when the "lines
>after the current line" became empty, more lines could be read into
>memory from the source.  But as long as you stayed in a section of the
>file that fit into memory you could move back and forth in the file as
>you chose.

This design is a remnant of the days when memory was not plentiful and many
systems didn't have virtual memory.  Nowadays, a single doubly-linked list
or buffer-with-gap is almost always OK, and virtual memory will handle most
of the details.

-- 
Barry Margolin, ······@bbnplanet.com
GTE Internetworking, Powered by BBN, Burlington, MA
*** DON'T SEND TECHNICAL QUESTIONS DIRECTLY TO ME, post them to newsgroups.
Please DON'T copy followups to me -- I'll assume it wasn't posted to the group.

From: Christopher C Stacy
Subject: Re: Good data-structure for characters in editor buffer?
Date: Sat, 26 Feb 2000 00:00:00 +0000
Message-ID: <x8ld7pjmut9.fsf@world.std.com>

>>>>> On Fri, 25 Feb 2000 08:57:46 -0500, Robert Monfera ("Robert") writes:
 Robert> I've heard of something very similar, called a gap editor, which
 Robert> introduced the gap not at the line level but at the character level

EMACS

From: Paolo Amoroso
Subject: Re: Good data-structure for characters in editor buffer?
Date: Tue, 29 Feb 2000 00:00:00 +0000
Message-ID: <pSK8OFyL7=u0JvZ3FLCzXU4NzFe1@4ax.com>

On Fri, 25 Feb 2000 06:24:01 GMT, ······@2xtreme.net (Christopher R. Barry)
wrote:

> What's an ideal data-structure for storing the characters of an editor
> buffer? I'm thinking a doubly-linked list. Also, isn't there something

If you can get your hands on "Project Oberon" by Wirth, it may be worth
checking the chapter on text management. The internal representation of
text in Oberon was based on "piece chains", which were used by the Bravo
text editor developed at Xerox PARC. I don't know whether the data
structures illustrated in the book are ideal, but they are a useful source
of ideas.

Paolo
-- 
EncyCMUCLopedia * Extensive collection of CMU Common Lisp documentation
http://cvs2.cons.org:8000/cmucl/doc/EncyCMUCLopedia/