From: Cory Spencer
Subject: Lisp and huge datasets...
Date: 
Message-ID: <a9v4ua$cd0$1@nntp.itservices.ubc.ca>
I realize that this isn't necessarily a Lisp-specific question; however, after reading an article
on Paul Graham's website <http://www.paulgraham.com/carl.html> pertaining to ITA Software and
their implementation of the Orbitz airline reservation system, I have a couple of questions.

In point four, the article indicates that Orbitz had several gigs of static data stored as C
structs in memory-mapped files.  Having come across a similar situation in an application that I
am also working on (large read-only dataset requiring rapid access, and not requiring a full-out
RDBMS to host the data), I'm wondering about the best methods for indexing/accessing the data
stored in each record.  Each record can be assumed to have uniquely identifying
data.  Constant access time for each record is preferable.  The dataset will be refreshed at
most once per day.

The solution I currently have in mind is to step through each record when the dataset is
refreshed, creating a separate index file.  (A hash key based on the uniquely identifying
characteristics of each record would map to the record's actual index into the array of
C structs.)  When the application starts, it only needs to load the index file to determine
the location of individual records.

Is there a preferable method for doing this?

From: Bulent Murtezaoglu
Subject: Re: Lisp and huge datasets...
Date: 
Message-ID: <877kn0yfsa.fsf@nkapi.internal>
If you are indexing on a single key, you might want to look into Berkeley DB; 
I believe it can be tuned to pull the file into memory if it is read-only data.
  
Paul Foley wrote a CMUCL interface for it; see:
http://users.actrix.co.nz/mycroft/cl.html

I don't know what the issues would be if you run into memory limits on x86 
(are you on x86?).  But in that case, given that this is read-only data, you might 
want to keep everything in C w/o linking from CL and whip up a Unix IPC 
interface from CL to do your lookups.  Berkeley DB (or at least the Sleepycat
version of it) should be able to handle >3 GB of data.

cheers,

BM
From: Joe Marshall
Subject: Re: Lisp and huge datasets...
Date: 
Message-ID: <BjGw8.37207$%s3.15655225@typhoon.ne.ipsvc.net>
"Cory Spencer" <········@interchange.ubc.ca> wrote in message
·················@nntp.itservices.ubc.ca...
> I realize that this isn't necessarily a Lisp-specific question; however,
> after reading an article on Paul Graham's website
> <http://www.paulgraham.com/carl.html> pertaining to ITA Software and
> their implementation of the Orbitz airline reservation system, I have a
> couple of questions.
>
> In point four, the article indicates that Orbitz had several gigs of
> static data stored as C structs in memory-mapped files.  Having come
> across a similar situation in an application that I am also working on
> (large read-only dataset requiring rapid access, and not requiring a
> full-out RDBMS to host the data), I'm wondering about the best methods
> for indexing/accessing the data stored in each record.  Each record can
> be assumed to have uniquely identifying data.  Constant access time for
> each record is preferable.  The dataset will be refreshed at most once
> per day.

Orbitz has a small enough data set that it fits into the address space
(at least it did last summer).  The data are parsed from the raw sources
(various crufty airline and travel industry agencies) and are `massaged'
a bit to put them into a format that is easier to search.

The code for querying the data just has to be quick, not clever.
It turns out that there is little advantage in indexing the data
because you cannot cluster this kind of data (the price of a ticket
from Chicago to Denver *is* relevant to the price of a ticket
from Boston to San Francisco).  You pretty much end
up looking at records in unpredictable ways.  The lookup code was
optimized until it wasn't taking up the bulk of the time, and then
it was left alone.