From: Ron Garret
Subject: To diff or not to diff
Date: 
Message-ID: <rNOSPAMon-92BE47.15051714082004@nntp1.jpl.nasa.gov>
I am starting a multi-developer project and have come to the conclusion 
that none of the existing revision control systems meet my needs, so 
I've decided to write my own (in Lisp, of course).  There's a major 
design decision that I'm having trouble making: should I store revision 
chains as full files, or as diffs?  Storing original files uses more 
storage but makes the coding simpler.  Storing diffs makes the coding 
somewhat more complicated, but feels more elegant and less wasteful.  
Still, disk space is cheap, and redundancy is not necessarily a bad 
thing.  I thought I'd ask if anyone here had any experience with this 
sort of thing and had an opinion one way or the other.

Thanks,
rg

From: Dave Watson
Subject: Re: To diff or not to diff
Date: 
Message-ID: <8d2fc94f.0408141747.5e23ca06@posting.google.com>
> design decision that I'm having trouble making: should I store revision 
> chains as full files, or as diffs?  Storing original files uses more 

Depends how much time you have.  It seems diffs are the ideal answer. 
 A better question to ask is if you want the archive itself to be made
up of a set of changes (arch), or a set of files with history (cvs).
From: Bruce Stephens
Subject: Re: To diff or not to diff
Date: 
Message-ID: <87ekm9b1m6.fsf@cenderis.demon.co.uk>
·········@docwatson.org (Dave Watson) writes:

>> design decision that I'm having trouble making: should I store revision 
>> chains as full files, or as diffs?  Storing original files uses more 
>
> Depends how much time you have.  It seems diffs are the ideal
> answer.

Storing diffs to save space is an implementation detail.

>  A better question to ask is if you want the archive itself to be
> made up of a set of changes (arch), or a set of files with history
> (cvs).

I don't think that's quite the best way to describe the choices.  I
think a better division would be between systems that are about
storing and recreating directory trees (like subversion), and systems
that care about recording changes (like darcs).  None of the new
systems seem to be interested in dealing with individual files as CVS
did; I doubt anyone would want to create such a thing nowadays:
handling atomic multifile changes seems like such an obviously useful
thing to be able to do.
From: Matthew Danish
Subject: Re: To diff or not to diff
Date: 
Message-ID: <20040815002327.GQ15746@mapcar.org>
On Sat, Aug 14, 2004 at 03:05:17PM -0700, Ron Garret wrote:
> I am starting a multi-developer project and have come to the conclusion 
> that none of the existing revision control systems meet my needs, so 
> I've decided to write my own (in Lisp, of course).  There's a major 
> design decision that I'm having trouble making: should I store revision 
> chains as full files, or as diffs?  Storing original files uses more 
> storage but makes the coding simpler.  Storing diffs makes the coding 
> somewhat more complicated, but feels more elegant and less wasteful.  
> Still, disk space is cheap, and redundancy is not necessarily a bad 
> thing.  I thought I'd ask if anyone here had any experience with this 
> sort of thing and had an opinion one way or the other.

I don't know what revision control systems you looked at, but the issues
go a lot further than this.  I hope you have a lot of free time.

CVS and Subversion are based around the notion of versioned files or
trees of files.  They also use a central repository and require the
check-out of `working copies'.  On the other hand, Darcs and Arch are
centered around the idea of a `changeset', i.e. the set of patches which
compose a particular change to the source.  A repository is just a
collection of changesets.  Also, these two don't require a `central'
repository; when you wish to work on the source you `get' your own
personal repository which you can use as your own with all the features
of the version control system.  Changesets that you record into your
repository can be sent to (or received from) other repositories.  It's a
more distributed way of working.  The darcs manual describes the theory
behind it fairly thoroughly:
http://abridgegame.org/darcs/manual/node8.html

There are lots of other systems out there, too.  Personally, I've
started to use darcs a lot, because it's really easy to get started and
use, and it seems to have big advantages for projects with developers in
disparate locations.

-- 
;;;; Matthew Danish -- user: mrd domain: cmu.edu
;;;; OpenPGP public key: C24B6010 on keyring.debian.org
From: Frank Buss
Subject: Re: To diff or not to diff
Date: 
Message-ID: <cfm5us$81q$1@newsreader2.netcologne.de>
Ron Garret <·········@flownet.com> wrote:

> I am starting a multi-developer project and have come to the conclusion 
> that none of the existing revision control systems meet my needs, so 
> I've decided to write my own (in Lisp, of course).

why doesn't CVS meet your needs? It's not perfect, but used in many 
projects and it is stable.

> There's a major 
> design decision that I'm having trouble making: should I store revision 
> chains as full files, or as diffs?  Storing original files uses more 
> storage but makes the coding simpler.  Storing diffs makes the coding 
> somewhat more complicated, but feels more elegant and less wasteful.  

I think most of the time you are accessing the latest version, so saving 
the last version and diffs to previous versions, like CVS does, looks like 
a good idea.

-- 
Frank Buß, ··@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de
From: Hannah Schroeter
Subject: Re: To diff or not to diff
Date: 
Message-ID: <cfm7up$v6u$1@c3po.use.schlund.de>
Hello!

Frank Buss  <··@frank-buss.de> wrote:
>Ron Garret <·········@flownet.com> wrote:

>> I am starting a multi-developer project and have come to the conclusion 
>> that none of the existing revision control systems meet my needs, so 
>> I've decided to write my own (in Lisp, of course).

>why doesn't CVS meet your needs? It's not perfect, but used in many 
>projects and it is stable.

CVS has many downsides. But even if someone's doing their own, they
should still first check out existing systems, so they can learn from
what others have done.

Things like Subversion and GNU arch (tla) are definitely worth a look,
and there are other systems out there, some of them built as a kind of
extension of CVS.

>[...]

Kind regards,

Hannah.
From: Ron Garret
Subject: Re: To diff or not to diff
Date: 
Message-ID: <rNOSPAMon-9996C5.10012015082004@nntp1.jpl.nasa.gov>
In article <············@newsreader2.netcologne.de>,
 Frank Buss <··@frank-buss.de> wrote:

> why doesn't CVS meet your needs?

Mainly because, as far as I can tell, its branch model is broken or, at 
the very least, hopelessly confusing.  I don't understand it, and I've 
watched people who do understand it struggle mightily to get it to do 
the Right Thing and fail.  Everyone I know who gets into really serious 
development eventually abandons CVS.

I had not heard of subversion before.  It looks promising, but it too 
(like most version control systems I've looked at) is missing what I 
consider to be a crucial feature: the ability to create a private 
"checkpoint" of changes you've made to a working copy without making 
those changes visible to others.

rg
From: Steven E. Harris
Subject: Re: To diff or not to diff
Date: 
Message-ID: <83n00w6jv1.fsf@torus.sehlabs.com>
Ron Garret <·········@flownet.com> writes:

> It looks promising, but it too (like most version control systems
> I've looked at) is missing what I consider to be a crucial feature:
> the ability to create a private "checkpoint" of changes you've made
> to a working copy without making those changes visible to others.

BitKeeper¹ supports that concept. You can commit "deltas" to individual
files, but none of those deltas can even be seen by another repository
until you bundle them up into a "changeset".² Even then, propagating
changesets from your repository is optional.


Footnotes: 
¹ http://www.bitkeeper.com/
² http://www.bitkeeper.com/UG/BitKeeper.ChangeSets.html

-- 
Steven E. Harris
From: Ron Garret
Subject: Re: To diff or not to diff
Date: 
Message-ID: <rNOSPAMon-74F7AF.11070216082004@nntp1.jpl.nasa.gov>
In article <··············@torus.sehlabs.com>,
 "Steven E. Harris" <···@torus.sehlabs.com> wrote:

> Ron Garret <·········@flownet.com> writes:
> 
> > It looks promising, but it too (like most version control systems
> > I've looked at) is missing what I consider to be a crucial feature:
> > the ability to create a private "checkpoint" of changes you've made
> > to a working copy without making those changes visible to others.
> 
> BitKeeper¹ supports that concept.

Yes, but I'm cheap and Bitkeeper isn't.  :-)

rg
From: Steven E. Harris
Subject: Re: To diff or not to diff
Date: 
Message-ID: <jk4fz6mon6m.fsf@W003275.na.alarismed.com>
Ron Garret <·········@flownet.com> writes:

> Yes, but I'm cheap and Bitkeeper isn't.  :-)

I agree it's not cheap, unless you're more or less working alone and
plan on using less than 1000 files per repository. At least those were
the licensing terms for free, non-openlogging use the last time I
looked. Oh, and you can't be developing a competing system.

I aspire to either convince an employer to pay for BitKeeper, or have
a good enough reason and means to pay for it myself. It's that good.

-- 
Steven E. Harris
From: Ron Garret
Subject: Re: To diff or not to diff
Date: 
Message-ID: <rNOSPAMon-489999.13351116082004@nntp1.jpl.nasa.gov>
In article <···············@W003275.na.alarismed.com>,
 "Steven E. Harris" <···@panix.com> wrote:

> Ron Garret <·········@flownet.com> writes:
> 
> > Yes, but I'm cheap and Bitkeeper isn't.  :-)
> 
> I agree it's not cheap, unless you're more or less working alone and
> plan on using less than 1000 files per repository. At least those were
> the licensing terms for free, non-openlogging use the last time I
> looked. Oh, and you can't be developing a competing system.
> 
> I aspire to either convince an employer to pay for BitKeeper, or have
> a good enough reason and means to pay for it myself. It's that good.

That's quite an endorsement.  What makes it so much better than 
everything else?

rg
From: Steven E. Harris
Subject: Re: To diff or not to diff
Date: 
Message-ID: <jk48ycemwzu.fsf@W003275.na.alarismed.com>
Ron Garret <·········@flownet.com> writes:

> That's quite an endorsement.  What makes it so much better than
> everything else?

BitKeeper formalizes and enables a useful working model that many
other systems either emulate poorly or just barely permit. At my last
job, we struggled for months, no, years, trying to make ClearCase work
a little more like BitKeeper.

BitKeeper is simple to set up and is immediately useful on a lone
computer. It is designed to function mostly disconnected from any
peers. And even then, it's not just queuing up would-be "real"
changes. Every copy of a repository is a wholly valid, potentially
independent repository unto itself. Consequently, there's less chance
of losing the Centralized Big Pile of Stuff. Losing access to some
shared upstream repository won't leave an army of developers with
nothing to do for the day.

It tracks a complete view of one's files: directories, files, and
permissions are all versioned. At first I found its lock-to-edit model
annoying, but came to appreciate how it allows me to monitor and
comment on which files I'm deliberately working on. Since two people
would rarely share a given repository, the locking is less about
keeping other people out than about distinguishing between intentional and
accidental changes.

Its dual delta-changeset model is useful. Small, granular checkins
("deltas") are encouraged. This allows one to comment on progress
frequently. These checkins are private, volatile even, until batched
up in a multi-file changeset. The changeset itself carries a comment
to explain the batched deltas. As only changesets can be propagated
among repositories, there is less perceived risk in receiving a
partial or inconsistent update.
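
For illustration, here is a toy model of the two-level idea described
above: private per-file deltas that only become visible to other
repositories once they are bundled into a changeset. It is a sketch in
Python, not BitKeeper's actual data model or command set; the names
`Delta`, `ChangeSet`, `Repository`, `checkin`, `commit` and `pull` are
all made up for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Delta:
    """One small, private check-in against a single file (hypothetical)."""
    path: str
    comment: str


@dataclass(frozen=True)
class ChangeSet:
    """A batch of deltas with its own comment; the only thing that propagates."""
    comment: str
    deltas: tuple


@dataclass
class Repository:
    pending: list = field(default_factory=list)      # private deltas
    changesets: list = field(default_factory=list)   # shareable history

    def checkin(self, path, comment):
        """Record a private delta; nothing is visible to other repositories."""
        self.pending.append(Delta(path, comment))

    def commit(self, comment):
        """Bundle all pending deltas into one changeset."""
        cs = ChangeSet(comment, tuple(self.pending))
        self.pending.clear()
        self.changesets.append(cs)
        return cs

    def pull(self, other):
        """Copy changesets we lack from `other`; its pending deltas are ignored."""
        for cs in other.changesets:
            if cs not in self.changesets:
                self.changesets.append(cs)
```

In this model, calling `pull` on a repository whose owner has only done
`checkin` transfers nothing, which is the "private checkpoint" property.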

Repository cloning and the flexible inter-repository topology allow
for almost any kind of workflow. BitKeeper has some primitive
facilities to control workflow (repository levels), but one could
argue that enforcing that workflow belongs at a higher level. Branches
need not be first-class, formal entities; any user can clone any
repository and effectively create a new "branch"; whether and how that
work gets integrated into other repositories is up to the cooperating
team. Hence, BitKeeper enables safe, private experimentation, without
a developer worrying about corrupting his one good sandbox or having
others even know of his activity.

Finally, BitKeeper is fast. Interaction with other, possibly-remote
repositories is the exception in normal workflow; most operations need
only touch the local repository. The GUI tools are adequate but not
great. I prefer to ignore them. The real beauty is there not only in
the command-line tools, but in the underlying model of the system.

There is also the hosting angle, with BitKeeper offering various ways
to host repositories, but I haven't used those facilities enough to
comment on their usefulness.

-- 
Steven E. Harris
From: David Steuber
Subject: Re: To diff or not to diff
Date: 
Message-ID: <87u0v12hy6.fsf@david-steuber.com>
Ron Garret <·········@flownet.com> writes:

> In article <···············@W003275.na.alarismed.com>,
>  "Steven E. Harris" <···@panix.com> wrote:
> 
> > Ron Garret <·········@flownet.com> writes:
> > 
> > > Yes, but I'm cheap and Bitkeeper isn't.  :-)
> > 
> > I agree it's not cheap, unless you're more or less working alone and
> > plan on using less than 1000 files per repository. At least those were
> > the licensing terms for free, non-openlogging use the last time I
> > looked. Oh, and you can't be developing a competing system.
> > 
> > I aspire to either convince an employer to pay for BitKeeper, or have
> > a good enough reason and means to pay for it myself. It's that good.
> 
> That's quite an endorsement.  What makes it so much better than 
> everything else?

Linus Torvalds uses it ;-)

The real question is which do you have more of, time or money?  I
guess the same argument applies to Free Lisps vs commercial Lisps.

-- 
An ideal world is left as an exercise to the reader.
   --- Paul Graham, On Lisp 8.1
From: Johannes Groedem
Subject: Re: To diff or not to diff
Date: 
Message-ID: <87657jbkqw.fsf@ifi.uio.no>
* Ron Garret <·········@flownet.com>:

> I had not heard of subversion before.  It looks promising, but it too 
> (like most version control systems I've looked at) is missing what I 
> consider to be a crucial feature: the ability to create a private 
> "checkpoint" of changes you've made to a working copy without making 
> those changes visible to others.

You can do this with Arch.  (That is, you create private branches, for
example locally on your development host, which you can then commit to
as much as you like, and then later commit it to your central
repository.)

http://gnuarch.org/

-- 
Johannes Groedem <OpenPGP: 5055654C>
From: Ron Garret
Subject: Re: To diff or not to diff
Date: 
Message-ID: <rNOSPAMon-3EC358.11053416082004@nntp1.jpl.nasa.gov>
In article <··············@ifi.uio.no>,
 Johannes Groedem <······@ifi.uio.no> wrote:

> * Ron Garret <·········@flownet.com>:
> 
> > I had not heard of subversion before.  It looks promising, but it too 
> > (like most version control systems I've looked at) is missing what I 
> > consider to be a crucial feature: the ability to create a private 
> > "checkpoint" of changes you've made to a working copy without making 
> > those changes visible to others.
> 
> You can do this with Arch.  (That is, you create private branches, for
> example locally on your development host, which you can then commit to
> as much as you like, and then later commit it to your central
> repository.)
> 
> http://gnuarch.org/

Yes, I considered that.  The problem is that makes it more cumbersome to 
get revisions that other people are checking in to the parent branch.  
But maybe the right thing to do is to start with arch or subversion or 
some such thing and layer on some scripts that do what I need.

One of the reasons I decided to try to roll my own is that I 
specifically don't want much of the power and flexibility (and, more to 
the point, complexity) that comes with most of the available rcs 
systems.  It sometimes feels like as much work to slog 
through the documentation of many of these systems as it would be for me to 
just write what I need from scratch.   I mean, once you have diff3 -m 
everything else is just a little bit of glue code, right?  ;-)
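
To give a rough idea of what the merge step involves, here is a minimal
line-based three-way merge sketched in Python. It is not the algorithm
GNU diff3 uses and it will report conflicts that a real tool might
resolve; the function names `matching_blocks` and `merge3` and the
conflict marker text are assumptions made for this example. It relies
only on `difflib.SequenceMatcher` from the standard library.

```python
from difflib import SequenceMatcher


def matching_blocks(base, other):
    """Runs of lines that `base` and `other` have in common, in order."""
    return SequenceMatcher(a=base, b=other, autojunk=False).get_matching_blocks()


def merge3(base, ours, theirs):
    """Merge two edited copies of `base` (all three are lists of lines).

    Returns (merged_lines, number_of_conflicts).
    """
    blocks_a = matching_blocks(base, ours)
    blocks_b = matching_blocks(base, theirs)
    merged, conflicts = [], 0
    pz = pa = pb = 0  # how far base, ours and theirs have been consumed

    def resolve(old, a, b):
        nonlocal conflicts
        if a == b or b == old:   # both sides agree, or only ours changed
            merged.extend(a)
        elif a == old:           # only theirs changed
            merged.extend(b)
        else:                    # both changed the same region differently
            conflicts += 1
            merged.extend(["<<<<<<< ours", *a, "=======", *b, ">>>>>>> theirs"])

    ia = ib = 0
    while ia < len(blocks_a) and ib < len(blocks_b):
        za, xa, na = blocks_a[ia]
        zb, xb, nb = blocks_b[ib]
        lo, hi = max(za, zb), min(za + na, zb + nb)
        if lo < hi:  # base[lo:hi] is unchanged in both copies
            sa, sb = xa + (lo - za), xb + (lo - zb)
            resolve(base[pz:lo], ours[pa:sa], theirs[pb:sb])
            merged.extend(base[lo:hi])
            pz, pa, pb = hi, sa + (hi - lo), sb + (hi - lo)
        # Advance whichever block ends first in base.
        if za + na < zb + nb:
            ia += 1
        else:
            ib += 1
    resolve(base[pz:], ours[pa:], theirs[pb:])
    return merged, conflicts
```

The glue code around this (locating the common ancestor, recording the
result, deciding what to do with conflicts) is where the design choices
discussed in this thread come in.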

BTW, I've already written the equivalent of RCS (which I know is the 
easy part).  It's <180 lines of CLisp code, and took me about three 
hours to write.  http://www.flownet.com/ron/srcs in case anyone is 
interested.

Thanks for all the responses.

rg
From: Bruce Stephens
Subject: Re: To diff or not to diff
Date: 
Message-ID: <87oelaxamo.fsf@cenderis.demon.co.uk>
Ron Garret <·········@flownet.com> writes:

[...]

> Yes, I considered that.  The problem is that makes it more
> cumbersome to get revisions that other people are checking in to the
> parent branch.  But maybe the right thing to do is to start with
> arch or subversion or some such thing and layer on some scripts that
> do what I need.

Possibly, although I can see a definite attraction in having the whole
system in some decent language.  You might also find the problems of
mismatching that Josh MacDonald encountered with PRCS (which was based
on RCS): RCS provided too much functionality, and it got in the way of
some things he wanted to do.

[...]

> BTW, I've already written the equivalent of RCS (which I know is the
> easy part).  It's <180 lines of CLisp code, and took me about three
> hours to write.  http://www.flownet.com/ron/srcs in case anyone is
> interested.

Which is cool, but not necessarily the right way to begin.  Some
systems have a per-file history system underneath (BitKeeper has an
SCCS-like system, for example), but not all do.  I think it's worth
looking at some of the existing systems, if only to make sure you know
about the variety of design choices (or those that are being explored
currently, anyway).
From: Stefan Scholl
Subject: Re: To diff or not to diff
Date: 
Message-ID: <1n7ikplloytzx$.dlg@parsec.no-spoon.de>
On 2004-08-15 19:01:20, Ron Garret wrote:
> In article <············@newsreader2.netcologne.de>,
>  Frank Buss <··@frank-buss.de> wrote:
>> why doesn't CVS meet your needs?
> 
> Mainly because, as far as I can tell, its branch model is broken or, at 
> the very least, hopelessly confusing.

In Subversion it's just a copy. Very easy to understand.


> I had not heard of subversion before.  It looks promising, but it too 
> (like most version control systems I've looked at) is missing what I 
> consider to be a crucial feature: the ability to create a private 
> "checkpoint" of changes you've made to a working copy without making 
> those changes visible to others.

You can kind of mirror a project in a private repository and then
commit all the changes at once to the main repository. I haven't
tried it but I think http://svk.elixus.org/ is what you are looking
for.
From: David Golden
Subject: Re: To diff or not to diff
Date: 
Message-ID: <JqxTc.25003$Z14.7772@news.indigo.ie>
Ron Garret wrote:

> 
> I am starting a multi-developer project and have come to the conclusion
> that none of the existing revision control systems meet my needs, so
> I've decided to write my own (in Lisp, of course). 

I do hope it's not a time-constrained project... revision control systems
can get quite complex quite quickly.  Though lisp should be a good
choice for implementing one...

In the meantime, just in case: have you looked at OpenCM? It's often missed
from people's lists because they use the term "configuration management"
rather than (version|revision) control.

It was written when the guys developing EROS ("Extremely Reliable
Operating System") got pissed off at existing systems... and then,
EROS development slowed to a crawl, and OpenCM appeared...

The web site hasn't changed since 2003, but the mailing lists still look
active, so I guess it hasn't been abandoned - there's a bit of a story
about c++ causing enough grief to cause a 3 month development hiatus:
http://srl.cs.jhu.edu/~shap/complexity++.html
"In abstract, I'ld love to switch this code base over to Java or C#.
God knows we've rewritten it enough times that we should be able to do
so with our eyes more or less closed. At the end of the day, though,
Java remains a language without any conforming, open reference
implementation or a publicly available, conforming class library."

Maybe someone should write to him and suggest common lisp as 
an implementation language? :-)


> There's a major 
> design decision that I'm having trouble making: should I store revision
> chains as full files, or as diffs?

Storage space is now extremely cheap, so I'd consider full files. 
This is also because lisp is IMHO less diff-friendly
than many languages (unless you use a sexp-aware diff tool, which I 
believe did exist at one stage?)
From: Bruce Stephens
Subject: Re: To diff or not to diff
Date: 
Message-ID: <87zn4xw8be.fsf@cenderis.demon.co.uk>
David Golden <············@oceanfree.net> writes:

[...]

> In the meantime, just in case: have you looked at OpenCM? It's often missed
> from people's lists because they use the term "configuration management"
> rather than (version|revision) control.
>
> It was written when the guys developing EROS ("Extremely Reliable
> Operating System") got pissed off at existing systems... and then,
> EROS development slowed to a crawl, and OpenCM appeared...
>
> The web site hasn't changed since 2003, but the mailing lists still look
> active,

As far as I know, there's only one mailing list, and it seems pretty
low-activity to me.  (Last message 18 July.)  Not dead, but certainly
not lively.

[...]

> Maybe someone should write to him and suggest common lisp as an
> implementation language? :-)

I think he knows about common lisp.

>> There's a major 
>> design decision that I'm having trouble making: should I store revision
>> chains as full files, or as diffs?
>
> Storage space is now extremely cheap, so I'd consider full files.
> This is also because lisp is IMHO less diff-friendly than many
> languages (unless you use a sexp-aware diff tool, which I believe
> did exist at one stage?)

Traditional line-based diff tools might work less well for lisp, but
there are other possibilities now.  Version control tools also use
diffs for merging changes and things, but storing full files doesn't
help with that---using sexp-aware tools might.  subversion, monotone
and OpenCM (and possibly others) use binary change formats, and can
store binary files with revisions efficiently (provided the versions
of the binary files are similar).
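
As a rough illustration of what "sexp-aware" comparison could mean, the
Python sketch below splits Lisp source into top-level forms and diffs
the forms rather than the lines, so that re-indenting a function does
not register as a change. It is a toy, not any existing tool: the
functions `top_level_forms` and `form_diff` are invented for this
example, the reader ignores character literals such as `#\(`, and the
whitespace normalization also squashes whitespace inside strings.

```python
from difflib import SequenceMatcher


def top_level_forms(source):
    """Return the top-level parenthesized forms of `source` (toy reader).

    Skips ';' comments and the contents of double-quoted strings when
    counting parentheses. Each form is whitespace-normalized so that
    layout changes compare equal.
    """
    forms, depth, start = [], 0, None
    in_string = in_comment = escaped = False
    for i, ch in enumerate(source):
        if in_comment:
            in_comment = ch != "\n"
        elif in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == ";":
            in_comment = True
        elif ch == '"':
            in_string = True
        elif ch == "(":
            if depth == 0:
                start = i
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
            if depth == 0:
                forms.append(" ".join(source[start:i + 1].split()))
    return forms


def form_diff(old, new):
    """List (tag, old_forms, new_forms) for every changed run of forms."""
    a, b = top_level_forms(old), top_level_forms(new)
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]
```

A real tool would presumably recurse into the forms to report which
subexpression changed; this sketch stops at the top level.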

I must admit, a year ago it looked to me like building a new free
version control system would be a good thing to do, because there just
weren't that many (CVS, mcvs, Aegis).  Nowadays, though, there are
lots.  Good ones to look at include subversion (a superior replacement
for CVS); GNU Arch (a distributed system, with good support for
merging distributed branches); darcs (a distributed system based
strongly on patches); monotone (a distributed system where lots of
things are stored as certificates).  It's also worth looking at
OpenCM, since the design looked interesting, even if the
implementation seems to have stalled; also Aegis, which is mature and
presumably stable, even if it lacks popularity for whatever reason.
Stellation's also interesting, but mostly for people interested in
Java (currently, anyway).

Subversion's probably the safe one to go for.  I think monotone is the
most promising, but there's enough of a variety that there's something
interesting in each of them.
From: David Golden
Subject: Re: To diff or not to diff
Date: 
Message-ID: <2zyTc.25007$Z14.7596@news.indigo.ie>
Bruce Stephens wrote:

>> Maybe someone should write to him and suggest common lisp as an
>> implementation language? :-)
> 
> I think he knows about common lisp.

I would think so too, of course, it would seem rather unlikely that
he [Shapiro] wouldn't.


The smiley was for resurgent CL hypemongering that has IMHO been evident
recently.  I realise that ":-)" doesn't exactly  express that very well
without a bit more context.
From: Pascal Bourguignon
Subject: Re: To diff or not to diff
Date: 
Message-ID: <871xi9cmjt.fsf@thalassa.informatimago.com>
Ron Garret <·········@flownet.com> writes:

> I am starting a multi-developer project and have come to the conclusion 
> that none of the existing revision control systems meet my needs, so 
> I've decided to write my own (in Lisp, of course).  


mcvs aka meta-cvs is written in common-lisp.  
It's based on cvs, but you won't notice it.
Did you evaluate it?
http://freshmeat.net/projects/mcvs/
In any case, it could be a base for your own branch.


> There's a major 
> design decision that I'm having trouble making: should I store revision 
> chains as full files, or as diffs?  Storing original files uses more 
> storage but makes the coding simpler.  Storing diffs makes the coding 
> somewhat more complicated, but feels more elegant and less wasteful.  
> Still, disk space is cheap, and redundancy is not necessarily a bad 
> thing.  I thought I'd ask if anyone here had any experience with this 
> sort of thing and had an opinion one way or the other.

Usually, the last version is kept as plain file, and the previous
versions are stored as diff (in reverse chronological order).

For a lot of files, the difference between two revisions is often *small*,
so it's well worth storing diffs.

On the other hand, if you have to store binary files, diff does not work
as well and we have to store the whole file for each revision anyway.
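
A minimal sketch of that layout (newest revision stored in full, older
revisions as reverse deltas), written in Python for illustration. The
names `make_delta`, `apply_delta` and `ReverseDeltaStore` are made up
here, and the delta format is a simplification, not the file format
that RCS or CVS actually uses.

```python
from difflib import SequenceMatcher


def make_delta(src, dst):
    """Return a compact edit script that turns list `src` into list `dst`."""
    ops = SequenceMatcher(a=src, b=dst, autojunk=False).get_opcodes()
    # Only changed regions are stored; unchanged runs are implied by src.
    return [(i1, i2, dst[j1:j2]) for tag, i1, i2, j1, j2 in ops if tag != "equal"]


def apply_delta(src, delta):
    """Apply an edit script produced by make_delta to `src`."""
    out, pos = [], 0
    for i1, i2, replacement in delta:
        out.extend(src[pos:i1])    # unchanged lines before this edit
        out.extend(replacement)    # lines that replace src[i1:i2]
        pos = i2
    out.extend(src[pos:])
    return out


class ReverseDeltaStore:
    """Newest revision in full; each older one as a delta from its successor."""

    def __init__(self, lines):
        self.head = list(lines)
        self.deltas = []           # deltas[-1] turns head into the previous revision

    def commit(self, lines):
        lines = list(lines)
        self.deltas.append(make_delta(lines, self.head))
        self.head = lines

    def checkout(self, steps_back=0):
        """Return the revision `steps_back` commits before the newest one."""
        if not 0 <= steps_back <= len(self.deltas):
            raise IndexError("no such revision")
        lines = list(self.head)
        for delta in reversed(self.deltas[len(self.deltas) - steps_back:]):
            lines = apply_delta(lines, delta)
        return lines
```

Checking out the newest revision costs nothing extra, while older
revisions cost one delta application per step back, which matches the
assumption that the latest version is the one accessed most often.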

I'd rather have the diffs stored. For example, I've got a couple of
libraries totaling 3 453 029 bytes of source, whose CVS
repository is only 4 025 344 bytes even though there are up to 20 revisions.
It'd be much harder to back up if it were stored as a 60 MB repository,
and I'd have to buy one more hard disk.


-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

Our enemies are innovative and resourceful, and so are we. They never
stop thinking about new ways to harm our country and our people, and
neither do we.
From: Ron Garret
Subject: Re: To diff or not to diff
Date: 
Message-ID: <rNOSPAMon-C0A346.10065815082004@nntp1.jpl.nasa.gov>
In article <··············@thalassa.informatimago.com>,
 Pascal Bourguignon <····@mouse-potato.com> wrote:

> Ron Garret <·········@flownet.com> writes:
> 
> > I am starting a multi-developer project and have come to the conclusion 
> > that none of the existing revision control systems meet my needs, so 
> > I've decided to write my own (in Lisp, of course).  
> 
> 
> mcvs aka meta-cvs is written in common-lisp.  
> It's based on cvs, but you won't notice it.
> Did you evaluate it?

Nope, this is the first I've heard of it.  Thanks for the pointer.  It 
looks promising.  I love the tagline.

rg
From: Gareth McCaughan
Subject: Re: To diff or not to diff
Date: 
Message-ID: <87u0v4lpu4.fsf@g.mccaughan.ntlworld.com>
Ron Garret <·········@flownet.com> writes:

> I am starting a multi-developer project and have come to the conclusion 
> that none of the existing revision control systems meet my needs, so 
> I've decided to write my own (in Lisp, of course).  There's a major 
> design decision that I'm having trouble making: should I store revision 
> chains as full files, or as diffs?  Storing original files uses more 
> storage but makes the coding simpler.  Storing diffs makes the coding 
> somewhat more complicated, but feels more elegant and less wasteful.  
> Still, disk space is cheap, and redundancy is not necessarily a bad 
> thing.  I thought I'd ask if anyone here had any experience with this 
> sort of thing and had an opinion one way or the other.

Disc space is cheap but network bandwidth isn't. You're
going to need diffs somewhere in the system. It seems
unlikely that the extra cost of storing data diffily
is going to be that high.

But if you really think building your own revision control
system from scratch is the most efficient way to proceed,
I reckon there's about a 90% chance that you're nuts.[1]
Would you like to say briefly what's wrong with the leading
contenders that are already out there?

    [1] If I didn't know that you're a very clever chap
        and therefore less likely to be spectacularly
        wrong about this than most people, and if you
        were proposing to write the system in something
        like C++ or Java, then I'd say 99% instead.

-- 
Gareth McCaughan
.sig under construc
From: ·······@Yahoo.Com
Subject: Re: To diff or not to diff
Date: 
Message-ID: <REM-2004aug15-003@Yahoo.Com>
> From: Gareth McCaughan <················@pobox.com>
> Disc [sic] space is cheap but network bandwidth isn't. You're going
> to need diffs somewhere in the system.

That might be misleading. You don't actually need to maintain any
static diffs as files anywhere. It's sufficient to do a remote-compare
as needed to reconcile versions that have drifted from each other
and/or to verify that two alleged identical base versions of a file
actually are identical. I conceived the algorithm way back when the
fastest modem in ordinary use was 1200 baud (but I had only 300 baud at
the time) and the fastest net backbone was 56k bps. I never found
anybody interested in my algorithm at the time, and as the net and
modems sped up, everyone said more and more that it's fast enough to just
FTP or e-mail-attach one of the two files to the other site and do a
comparison locally there. But lately disks have gotten awfully big
(as many gigabytes as I currently have megabytes) and CPUs have gotten
awfully fast, yet the net has gotten so clogged with GIFs and JPGs from Web
browsing that there isn't really a lot of bandwidth for any given
process, so maybe my algorithm ought to be reconsidered? The basic
idea is to do a checksum of the whole file, so if they agree then
presumably the files are indeed identical, but do it in such a way that
checksums of the pieces are available already, so pieces can be
compared to see which pieces match and which don't, and then break down
the different pieces into smaller pieces, until we're down to a simple
list of which tiny pieces are different and at that point it's faster
to copy those tiny pieces en masse rather than do any more
sub-dividing. The key trick is using statistics of the data in the
files to generate canonical places to split them, so the splits are
done at matching places at both sites, so the sub-checksums line up, or
using ProxHash to support building a nearest-neighbor network of
unsynchronized sub-checksums from which fuzzy segment-matching can be
done before exact matching.
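
A minimal sketch of the splitting-plus-checksum idea, in Python. It is
not the algorithm described above, whose details are not given here: it
uses only two levels (whole file, then pieces) instead of recursive
subdivision, and it leaves out ProxHash and fuzzy matching entirely.
The function names, the 8-byte window, the 0x1F mask and the choice of
Adler-32 and SHA-1 are all assumptions made for this example. The point
it shows is that split positions chosen from the data itself line up at
both sites even after an insertion shifts everything.

```python
import hashlib
import zlib


def chunks(data, window=8, mask=0x1F):
    """Split `data` at positions that depend only on the preceding bytes.

    A cut is made before position i whenever a checksum of the `window`
    bytes ending at i has its low bits all zero, so identical content
    produces identical cuts wherever it sits in the file.
    """
    start = 0
    for i in range(window, len(data)):
        if zlib.adler32(data[i - window:i]) & mask == 0:
            yield data[start:i]
            start = i
    yield data[start:]


def digest(piece):
    return hashlib.sha1(piece).hexdigest()


def summary(data):
    """What one site would send: a whole-file checksum plus piece checksums."""
    return digest(data), [digest(c) for c in chunks(data)]


def missing_chunks(local, remote_summary):
    """Pieces of `local` whose checksums the remote site does not have."""
    whole, parts = remote_summary
    if digest(local) == whole:     # files presumably identical, nothing to send
        return []
    known = set(parts)
    return [c for c in chunks(local) if digest(c) not in known]
```

With this scheme, only the pieces around an edit need to cross the
network; the rest are identified by checksum alone.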

Note that unlike diff (the unix utility) etc., my proposed
remote-compare would correctly handle any decent sized segments that
are rearranged instead of treating them as unrelated insertions and
deletions. Also because it goes to extra work, effectively full
lookahead instead of one-pass with tiny lookahead window, to make sure
the longest matching segments are eliminated first, it doesn't fall out
of sync like diff does and match some random common thing such as
(defun out of context and then report function #47 of one file totally
different from function #48 of the other file etc. If you've seen those
genome-comparison diagrams, showing how the DNA loci in one genome are
rearranged with respect to similar genes in another genome, you get an
idea of the goal of remote-compare. For example, suppose some function
is moved around in one copy of the file but not in the other, and a
change is also made to that function (for example it was pulled out of
the master file and put in an interpreted debugging file, fixed there,
and then put back in the master file but not where it came from). Then
remote-compare should show first that the entire function was
moved, by showing a REFERENCE to where it came from and where it went
to, and second that the two versions of that function are slightly
different, by showing at the target of that reference the local diffs.
Before posting this, I'll concoct a file showing hypothetical sample output:
  http://www.rawbw.com/~rem/gareth1.txt (old draft with my typo fix)
  http://www.rawbw.com/~rem/gareth2.txt (his final draft, missing my fix)
  http://www.rawbw.com/~rem/gareth.diff.txt
  http://www.rawbw.com/~rem/gareth-remote-compare.txt
I did that very crudely because I don't feel like going to all the
trouble of coding the HTML to make drop-down menus for each of the
changes. So I just made a text file. I probably also made one or two
mistakes in manually matching up text.
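
The move-plus-local-diff report described above can be sketched roughly
as follows. This is a toy under loud assumptions, not remote-compare
itself: it works on two files held locally, treats blank-line-separated
blocks as the unit, and uses Python's difflib for similarity; the names
blocks and remote_compare and the 0.6 threshold are hypothetical. Pairs
are matched best-first, which approximates "eliminate the longest
matching segments first".

```python
import difflib


def blocks(text: str) -> list[str]:
    """Split text into blank-line-separated blocks (e.g. top-level forms)."""
    return [b.strip() for b in text.split("\n\n") if b.strip()]


def remote_compare(old_text: str, new_text: str, threshold: float = 0.6):
    """Return (old_index, new_index, local_diff) triples.

    old_index is None for a block that only exists in the new file;
    new_index is None for a block that only exists in the old file.
    """
    old, new = blocks(old_text), blocks(new_text)
    # Score every pair, then take the most similar pairs first.
    scored = sorted(
        ((difflib.SequenceMatcher(None, o, n).ratio(), i, j)
         for i, o in enumerate(old) for j, n in enumerate(new)),
        reverse=True)
    came_from, used_old = {}, set()
    for ratio, i, j in scored:
        if ratio < threshold:
            break
        if i in used_old or j in came_from:
            continue
        came_from[j] = i
        used_old.add(i)
    report = []
    for j, block in enumerate(new):
        if j in came_from:
            i = came_from[j]
            # Empty when the block moved unchanged.
            local = list(difflib.unified_diff(
                old[i].splitlines(), block.splitlines(), lineterm="", n=0))
            report.append((i, j, local))
        else:
            report.append((None, j, []))
    report += [(i, None, []) for i in range(len(old)) if i not in used_old]
    return report
```

Each triple is the REFERENCE (where the block came from, where it went)
together with the local diffs at the target of that reference.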

Would anybody with a good source of funds be interested in funding
my further work on (either version of) my algorithm?
From: Rob Warnock
Subject: Re: To diff or not to diff
Date: 
Message-ID: <rLCdnTcwc6izCL3cRVn-iw@speakeasy.net>
·······@Yahoo.Com <··········@YahooGroups.Com> wrote:
+---------------
| > From: Gareth McCaughan <················@pobox.com>
| > Disc [sic] space is cheap but network bandwidth isn't. You're going
| > to need diffs somewhere in the system.
| 
| That might be misleading. You don't actually need to maintain any
| static diffs as files anywhere. It's sufficient to do a remote-compare
| as needed to reconcile versions that have drifted from each other
| and/or to verify that two alleged identical base versions of a file
| actually are identical. I conceived the algorithm way back when the
| fastest modem in ordinary use was 1200 baud (but I had only 300 baud at
| the time) and the fastest net backbone was 56k bps. I never found
| anybody interested in my algorithm at the time...  The basic idea
| is to do a checksum of the whole file, so if they agree then
| presumably the files are indeed identical, but do it in such a way that
| checksums of the pieces are available already, so pieces can be
| compared to see which pieces match and which don't..
+---------------

The "rsync" program [a standard utility on Linux and xxxBSD and most
other Unixes, see <http://samba.anu.edu.au/rsync/features.html>]
does exactly this, although I believe its unit of granularity is
from the first change through the end of the file. Still, it does
a *very* nice job on files which are mostly appended to (such as
programs during development, mailboxes, and archives of saved mail
or netnews), as well as handling large trees of same.


-Rob

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607
From: Bruce Stephens
Subject: Re: To diff or not to diff
Date: 
Message-ID: <87brhbl4vs.fsf@cenderis.demon.co.uk>
··········@YahooGroups.Com (·······@Yahoo.Com) writes:

[...]

> It's sufficient to do a remote-compare as needed to reconcile
> versions that have drifted from each other and/or to verify that two
> alleged identical base versions of a file actually are identical. I
> conceived the algorithm way back when the fastest modem in ordinary
> use was 1200 baud (but I had only 300 baud at the time) and the
> fastest net backbone was 56k bps. I never found anybody interested
> in my algorithm at the time, and as the net and modems sped up every
> one said more and more it's fast enough to just FTP or e-mail-attach
> one of the two files to the other site and do a comparison locally
> there.

This sounds very similar to netsync/rsync/xdelta (webdav also has
something).  There are also Merkle trees.
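
For readers who haven't met Merkle trees, here is a minimal Python
sketch of how one narrows down which blocks differ, starting from a
single top-level checksum. It is an illustration only: the names build
and differing_leaves are hypothetical, SHA-256 is an arbitrary choice,
and it assumes both sides cut their files into the same number of
blocks.

```python
import hashlib


def build(blocks: list[bytes]) -> list[list[bytes]]:
    """Return the tree as a list of levels: leaf hashes first, root last."""
    level = [hashlib.sha256(b).digest() for b in blocks]
    levels = [level]
    while len(level) > 1:
        # Each parent is the hash of its (one or two) children.
        level = [hashlib.sha256(b"".join(level[i:i + 2])).digest()
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def differing_leaves(a: list[list[bytes]], b: list[list[bytes]]) -> list[int]:
    """Indices of leaf blocks whose hashes differ between two trees."""
    assert len(a[0]) == len(b[0]), "sketch assumes equal block counts"
    top = len(a) - 1
    if a[top] == b[top]:
        return []  # roots agree, so presumably the files are identical
    suspects = [0]
    for depth in range(top, 0, -1):
        below_a, below_b = a[depth - 1], b[depth - 1]
        nxt = []
        for i in suspects:
            for child in (2 * i, 2 * i + 1):
                if child < len(below_a) and below_a[child] != below_b[child]:
                    nxt.append(child)
        suspects = nxt
    return suspects
```

In a networked setting only the hashes along the suspect paths would
need to cross the wire, one level per round trip; here both trees are
simply held in memory.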

[...]

> Would anybody with a good source of funds be interested in
> funding my further work on (either version of) my algorithm?

I think you'll have to show potential improvements over the existing
techniques.
From: Pascal Bourguignon
Subject: Re: To diff or not to diff
Date: 
Message-ID: <87hdr39wai.fsf@thalassa.informatimago.com>
··········@YahooGroups.Com (·······@Yahoo.Com) writes:
> That might be misleading. You don't actually need to maintain any
> static diffs as files anywhere. It's sufficient to do a remote-compare
> as needed to reconcile versions that have drifted from each other
> and/or to verify that two alleged identical base versions of a file
> actually are identical. I conceived the algorithm way back when the
> fastest modem in ordinary use was 1200 baud (but I had only 300 baud at
> the time) and the fastest net backbone was 56k bps. I never found
> anybody interested in my algorithm at the time, and as the net and
> modems sped up every one said more and more it's fast enough to just
> FTP or e-mail-attach one of the two files to the other site and do a
> comparison locally there. But lately disks have gotten awfully bigger
> (as many gigabytes as I currently have megabytes) and CPUs have gotten
> awfully fast, yet the net has gotten clogged with GIFs and JPGs via Web
> browsing that there isn't really a lot of bandwidth for any given
> process, so maybe my algorithm ought to be reconsidered? The basic
> idea is to do a checksum of the whole file, so if they agree then
> presumably the files are indeed identical, but do it in such a way that
> checksums of the pieces are available already, so pieces can be
> compared to see which pieces match and which don't, and then break down
> the different pieces into smaller pieces, until we're down to a simple
> list of which tiny pieces are different and at that point it's faster
> to copy those tiny pieces en masse rather than do any more
> sub-dividing. The key trick is using statistics of the data in the
> files to generate canonical places to split them, so the splits are
> done at matching places at both sites, so the sub-checksums line up, or
> using ProxHash to support building a nearest-neighbor network of
> unsynchronized sub-checksums from which fuzzy segment-matching can be
> done before exact matching.

I doubt that it's the .gif and .jpg of the Web that are clogging the
bandwidth. I'd bet on the .mp3 and the .avi of the P2P...

rsync uses the same block checksumming algorithm.  It may be more
naive at guessing the block limits though.
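
To make "more naive at guessing the block limits" concrete: as far as I
understand it, the receiver cuts its old copy into fixed-size blocks,
and the sender slides a window over the new file one byte at a time,
using a cheap rolling checksum to spot any of those blocks at any
offset. The Python below is a simplified sketch in that style, not
rsync's actual code or wire format; the function names and the use of
SHA-256 as the strong checksum are assumptions.

```python
import hashlib

M = 1 << 16


def weak_checksum(block: bytes) -> tuple[int, int]:
    """Two 16-bit sums that can be updated cheaply as the window slides."""
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a, b


def roll(a: int, b: int, out_byte: int, in_byte: int, size: int):
    """Slide the window one byte: drop out_byte, take in in_byte."""
    a = (a - out_byte + in_byte) % M
    b = (b - size * out_byte + a) % M
    return a, b


def signatures(old: bytes, size: int) -> dict:
    """Receiver side: weak checksum -> [(strong checksum, block number)]."""
    table = {}
    for n, off in enumerate(range(0, len(old) - size + 1, size)):
        block = old[off:off + size]
        table.setdefault(weak_checksum(block), []).append(
            (hashlib.sha256(block).digest(), n))
    return table


def delta(new: bytes, table: dict, size: int) -> list:
    """Sender side: ints name blocks the receiver has, bytes are literals."""
    ops, lit, i = [], bytearray(), 0
    if len(new) >= size:
        a, b = weak_checksum(new[:size])
    while i + size <= len(new):
        hit = None
        for strong, n in table.get((a, b), ()):
            if hashlib.sha256(new[i:i + size]).digest() == strong:
                hit = n
                break
        if hit is not None:
            if lit:
                ops.append(bytes(lit))
                lit = bytearray()
            ops.append(hit)
            i += size
            if i + size <= len(new):
                a, b = weak_checksum(new[i:i + size])
        else:
            lit.append(new[i])
            if i + size < len(new):
                a, b = roll(a, b, new[i], new[i + size], size)
            i += 1
    lit += new[i:]
    if lit:
        ops.append(bytes(lit))
    return ops


def patch(old: bytes, ops: list, size: int) -> bytes:
    """Receiver side: rebuild the new file from old blocks and literals."""
    return b"".join(old[op * size:(op + 1) * size] if isinstance(op, int)
                    else op for op in ops)
```

The block limits are fixed on the receiving side only; it is the
byte-by-byte rolling on the sending side that lets a match be found
even after an insertion has shifted everything.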

> Would anybody with a good source of funds be interested in funding
> my further work on (either version of) my algorithm?

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

Our enemies are innovative and resourceful, and so are we. They never
stop thinking about new ways to harm our country and our people, and
neither do we.
From: Jens Axel Søgaard
Subject: Re: To diff or not to diff
Date: 
Message-ID: <41209c80$0$224$edfadb0f@dread11.news.tele.dk>
Pascal Bourguignon wrote:
> I doubt that it's the .gif and .jpg of the Web that are clogging the
> bandwidth. I'd bet on the .mp3 and the .avi of the P2P...
> 
> rsync uses the same block checksumming algorithm.  It may be more
> naive at guessing the block limits though.

Andrew Tridgell tells the story of how he got the idea for rsync
on page 58 of <http://samba.org/~tridge/phd_thesis.pdf>. The
following two pages give a very clear and readable overview of
the main ideas used in the rsync algorithm.

The page <http://samba.org/~tridge/> is also worth a visit.

-- 
Jens Axel Søgaard
From: Marco Antoniotti
Subject: Re: To diff or not to diff
Date: 
Message-ID: <rV3Uc.34$D5.10472@typhoon.nyu.edu>
Hi Ron (nee Erann)

I always liked the PRCS system and the underlying Xdelta system.  You 
can have a look at them to figure out a possible answer to your dilemmas.

Cheers

marco




Ron Garret wrote:

> I am starting a multi-developer project and have come to the conclusion 
> that none of the existing revision control systems meet my needs, so 
> I've decided to write my own (in Lisp, of course).  There's a major 
> design decision that I'm having trouble making: should I store revision 
> chains as full files, or as diffs?  Storing original files uses more 
> storage but makes the coding simpler.  Storing diffs makes the coding 
> somewhat more complicated, but feels more elegant and less wasteful.  
> Still, disk space is cheap, and redundancy is not necessarily a bad 
> thing.  I thought I'd ask if anyone here had any experience with this 
> sort of thing and had an opinion one way or the other.
> 
> Thanks,
> rg