How to do a checksum?

From: Alex Kilpatrick
Subject: How to do a checksum?
Date: Tue, 26 Dec 1995 00:00:00 +0000
Message-ID: <30e0794e.781647921@news.concentric.net>

I have a large file (about 1 meg) composed of large list structures
(about 3K each).  I would like to insert new lists into the file only
if they are not present already.  I'm trying to do this by generating
a 'checksum' of each structure in the file, and then comparing against
this list of checksums before I write the new list.  My problem is
that I don't know a good formula for generating the checksum.  It
doesn't have to be guaranteed to not eliminate unique objects, but it
should be unlikely.  My first attempt was to convert the list to
characters, then add up the ASCII values, but that is a very poor
method because two lists with different order '(a b) vs '(b a) will
have the same checksum.

Any ideas?

Re: How to do a checksum? Michael Greenwald
- Re: How to do a checksum? Alex Kilpatrick
  - Re: How to do a checksum? Michael Greenwald
  - Re: How to do a checksum? Barry Margolin
Re: How to do a checksum? Joe User

From: Michael Greenwald
Subject: Re: How to do a checksum?
Date: Wed, 27 Dec 1995 00:00:00 +0000
Message-ID: <michaelg.820093915@Xenon.Stanford.EDU>

·····@cris.com (Alex Kilpatrick) writes:

>I have a large file (about 1 meg) composed of large list structures
>(about 3K each).  I would like to insert new lists into the file only
>if they are not present already.  I'm trying to do this by generating
>a 'checksum' of each structure in the file, and then comparing against
>this list of checksums before I write the new list.  My problem is
>that I don't know a good formula for generating the checksum.  It
>doesn't have to be guaranteed to not eliminate unique objects, but it
>should be unlikely.  My first attempt was to convert the list to
>characters, then add up the ASCII values, but that is a very poor
>method because two lists with different order '(a b) vs '(b a) will
>have the same checksum.

>Any ideas?  

It depends on what the contents of the list are.  But, SXHASH might be
a good first try.  You might be able to come up with something more
efficient if you use domain specific knowledge and write your own.
Also, writing your own guarantees portability --- SXHASH won't
(necessarily) work if you expect the file to be read my multiple lisp
implementations by different vendors. But from your brief description
it sounds like SXHASH is the easiest way to proceed.

From: Alex Kilpatrick
Subject: Re: How to do a checksum?
Date: Thu, 28 Dec 1995 00:00:00 +0000
Message-ID: <30e22ade.892656900@news.concentric.net>

On 27 Dec 95 19:51:55 GMT, ········@Xenon.Stanford.EDU (Michael
Greenwald) wrote:

>·····@cris.com (Alex Kilpatrick) writes:
> 
>>I have a large file (about 1 meg) composed of large list structures
>>(about 3K each).  I would like to insert new lists into the file only
>>if they are not present already.  I'm trying to do this by generating
>>a 'checksum' of each structure in the file, and then comparing against

>
>It depends on what the contents of the list are.  But, SXHASH might be
>a good first try.  You might be able to come up with something more
>efficient if you use domain specific knowledge and write your own.
>Also, writing your own guarantees portability --- SXHASH won't
>(necessarily) work if you expect the file to be read my multiple lisp
>implementations by different vendors. But from your brief description
>it sounds like SXHASH is the easiest way to proceed.

I'm not worried about portability, and SXHASH seems to work pretty
well. (and it is simple and fast)  However, I'm not sure how likely
sxhash is to generate a "false positive" where two non-EQUAL objects
have the same SXHASH values.  In ACL/Win, sxhash seems to generate a
16-bit number (it is always less than 32767), so it would seem that a
false positive is not all that unlikely.

From: Michael Greenwald
Subject: Re: How to do a checksum?
Date: Thu, 28 Dec 1995 00:00:00 +0000
Message-ID: <michaelg.820133247@Xenon.Stanford.EDU>

·····@cris.com (Alex Kilpatrick) writes:

>I'm not worried about portability, and SXHASH seems to work pretty
>well. (and it is simple and fast)  However, I'm not sure how likely
>sxhash is to generate a "false positive" where two non-EQUAL objects
>have the same SXHASH values.  In ACL/Win, sxhash seems to generate a
>16-bit number (it is always less than 32767), so it would seem that a
>false positive is not all that unlikely.

I assumed you meant to use the checksum as a hint, to avoid the EQUAL
test except in cases where there >might< be a match.  If you want to
reduce that overhead, too, to a lower level, then you can write your
own N-bit version, where the likelihood of a false positive is low
enough to be acceptable.  But there's always a chance of a false
positive, so you'll always have to check.  (The checksum will map a
large domain into a smaller range in a many-to-one mapping.)

From: Barry Margolin
Subject: Re: How to do a checksum?
Date: Fri, 29 Dec 1995 00:00:00 +0000
Message-ID: <4c2416$iao@tools.bbnplanet.com>

In article <··················@news.concentric.net>,
Alex Kilpatrick <·····@cris.com> wrote:
>I'm not worried about portability, and SXHASH seems to work pretty
>well. (and it is simple and fast)  However, I'm not sure how likely
>sxhash is to generate a "false positive" where two non-EQUAL objects
>have the same SXHASH values.

In some CL implementations, SXHASH completely gives up for some data types.
For instance, I think Symbolics Common Lisp always returns 0 for any
defstruct objects or Flavors/CLOS instances.  I think it only behaves
reasonably well for numbers, symbols, arrays, and lists/trees whose leaves
are of those types.
-- 
Barry Margolin
BBN PlaNET Corporation, Cambridge, MA
······@bbnplanet.com
Phone (617) 873-3126 - Fax (617) 873-6351

From: Joe User
Subject: Re: How to do a checksum?
Date: Wed, 03 Jan 1996 00:00:00 +0000
Message-ID: <4cedqa$vm6@merlin.delphi.com>

·····@cris.com (Alex Kilpatrick) wrote:
>... by generating
>a 'checksum' of each structure in the file, and then comparing against
>this list of checksums ...

If you want a really strong 'checksum' technique have a look
at the MD5 (message digest 5) algorithm. It is published along
with C source code in an internet RFC. ftp://ftp.near.net/rfc/rfc1334.txt
and is commonly used in various internet protocols (e.g. PPP/CHAP).

On symbols, strings, and numbers you would presumably apply
the md5_update function to the printed representation.
And on anything structural you would apply some kind of recursive
update. Perhaps you introduce some kind of bias factor
(like hashing the characters "( )." so that the resulting md5 is the
same as if you had computed md5 of the printed representation
of the entire list structure)).

That should give you an 8-byte 'checksum' for which you are
extremely unlikely to find another list structure which results
in the same hash, even if you tried very very hard to come up
with one.

Anyway, I've been using md5 to good effect in siod for some time.
See ftp://ftp.std.com/pub/gjc/siod_tar.gz

Perhaps somebody has recoded MD5 in common lisp someplace.

-gjc