From: John Klein
Subject: database feasibility
Date: 
Message-ID: <b70a59c6.0211250316.56cef4f5@posting.google.com>
This is only marginally lisp related, but I'll
ask it here anyway as there are probably people
here who know the answer.

Someone I know needs to store lots of objects roughly of the following
type:

(defstruct THING
   (id 0 :type (unsigned-byte 32))
   (x 0d0 :type double-float)
   (y 0d0 :type double-float)
   (flags 0 :type (unsigned-byte 32))
   (q1 0.0 :type single-float)
   (q2 0.0 :type single-float)
   ;; ... a few more 'qualities' q
   ;; tvec and yvec are vectors of single-floats, each of length
   ;; up to a few*10^3:
   (tvec nil :type (or null (simple-array single-float (*))))
   (yvec nil :type (or null (simple-array single-float (*)))))


There will be about 5e8 of these THINGs, for a total of a few
terabytes.  tvec and yvec could be stored apart from THING if
necessary.
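
For what it's worth, the back-of-envelope arithmetic, assuming the
vectors average ~2000 elements (that number is my guess):

(let* ((n-things 5e8)
       (vec-len  2000)                ; "few*10^3", call it 2000
       (fixed    48)                  ; id+x+y+flags+qs, roughly
       (bytes-per-thing
        (+ fixed (* 2 vec-len 4))))   ; two single-float vectors
  (* n-things bytes-per-thing))       ; => ~8e12 bytes, a few TB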

Requirements would be retrieval by x,y ranges and if possible by flags
and qs.

These will need to be updated about once a day, by appending to the
vectors tvec and yvec and by modifying the flags and qs.  This
implies about 10^4 modifications a second just to keep up day by day,
or a budget of about 10^5 machine ops per update.
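
The arithmetic, assuming a machine doing on the order of 1e9 ops/sec
(my guess for current hardware):

(/ 5e8 (* 24 60 60))   ; => ~5.8e3, call it 10^4 updates/sec
(/ 1e9 1e4)            ; => ~1e5 machine ops available per update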

I think that a compiled Lisp should be able to handle this.

But is there any database that could handle this load, particularly as
tvec and yvec grow big?  tvec and yvec wouldn't have to be loaded at
every update, just appended to.  Updates could be bundled in batches
as
   (get-objects-in-xyrange)
   (update-objects-in-xyrange)
   (submit-updated-objects-in-xyrange)
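
i.e., roughly (just a sketch on my part; those three functions are
hypothetical):

(defun daily-update (ranges)
  (dolist (range ranges)
    (let ((things (get-objects-in-xyrange range)))   ; one bulk read
      (update-objects-in-xyrange things)             ; mutate in memory
      (submit-updated-objects-in-xyrange things))))  ; one bulk write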

Would IPC be the bottleneck, requiring use of a non-IPC db like
berkeley-db (to which there exists a Lisp interface)?

Has anyone tried to do anything of this or greater magnitude, especially
with Lisp?  On a fast one-CPU machine?  

According to a benchmark I found, postgres can do ~1e3 transactions/sec,
so with improved CPU speeds this might go up to 5e3.  I think it is
feasible, but just barely.

From: Tim Bradshaw
Subject: Re: database feasibility
Date: 
Message-ID: <ey3wun17u2t.fsf@cley.com>
* John Klein wrote:
> According to a benchmark I found, postgres can do ~1e3 transactions/sec,
> so with improved CPU speeds this might go up to 5e3.  I think it is
> feasible, but just barely.

One thing is that, if you only have one modifier of these things, you
don't necessarily need a transactional database, and you might (?)
find that non-transactional ones are faster because they have to do
less work.

--tim
From: Will Hartung
Subject: Re: database feasibility
Date: 
Message-ID: <E1DE9.336$KH1.23281899@newssvr21.news.prodigy.com>
"John Klein" <········@yahoo.com> wrote in message
·································@posting.google.com...
> This is only marginally lisp related, but I'll
> ask it here anyway as there are probably people
> here who know the answer.
>
> Someone I know needs to store lots of objects roughly of the following
> type:
>

*snip object-def*

> There will be about 5e8 of these THINGs, for a total of a few
> terabytes.  tvec and yvec could be stored apart from THING if
> necessary.
>
> Requirements would be retrieval by x,y ranges and if possible by flags
> and qs.
>
> These will need to be updated about once a day, by appending to the
> vectors tvec and yvec and by modifying the flags and qs.  This
> implies about 10^4 modifications a second just to keep up day by day,
> or a budget of about 10^5 machine ops per update.
>
> I think that a compiled Lisp should be able to handle this.
>
> But is there any database that could handle this load, particularly as
> tvec and yvec grow big?  tvec and yvec wouldn't have to be loaded at
> every update, just appended to.  Updates could be bundled in batches
> as
>    (get-objects-in-xyrange)
>    (update-objects-in-xyrange)
>    (submit-updated-objects-in-xyrange)
>
> Would IPC be the bottleneck, requiring use of a non-IPC db like
> berkeley-db (to which there exists a Lisp interface)?

What are the magnitudes of X and Y? What's their resolution (i.e. how many
distinct X's per Y, or Y's per X)?

There are all sorts of issues here, mostly relating to the number of files,
file sizes, volume sizes, and the like.

And you're planning on updating all (or at least most) of the THINGs
every day? Will the properties change (x, y, qs, flags), or just the
dependent vectors? How big will your "get-objects-in-xyrange" queries be?
How many objects are you planning on operating on simultaneously?

Using my high-tech back-of-napkin technique, the THING data alone can hit
0.5 TB.

The things to concern yourself with are how your indexes are changing, and
how to handle the vectors.

If your indexes are constantly changing, that makes it a much different
problem than if your indexes are fairly static.

Depending on the volume of work that you are doing, it may be better to
perform your batch updates and then recreate the indexes from scratch each
time. Dramatic index changes send the disk bouncing all over the platter,
and that's expensive.
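
Schematically (BUILD-INDEX-FROM-SORTED is a stand-in for whatever
bulk-load entry point your storage layer ends up having):

(defun rebuild-index (things key)
  ;; Sort once, then write the index out sequentially, instead of
  ;; splicing each day's changes into an existing structure.
  (build-index-from-sorted
   (sort (copy-seq things) #'< :key key)))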

If your data is not really transactional in nature, and you don't really
need the log based recovery options of typical DBs, then that's another
consideration, as the logs can suck up a lot of CPU and I/O bandwidth.

I would consider the "general purpose" databases last in any case: with
your volume, you'll want to take as many shortcuts as you can on the path
from program to disk.

There's a data structure called the R-Tree (R-Tree, R+Tree, R*Tree) that is
used to index things like X-Y ranges, and that sounds like something you'd
want to use in this project, but I don't know of any popular databases that
directly support this structure (I'm sure they exist, I'm just unaware of
them).
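
To give the flavor, here's a greatly simplified in-memory sketch of the
query side (insertion and node splitting, which are the hard parts, are
left out):

(defstruct rnode
  (xmin 0d0) (xmax 0d0) (ymin 0d0) (ymax 0d0)  ; bounding box
  children   ; child RNODEs, or NIL at a leaf
  things)    ; THINGs stored at a leaf

(defun overlaps-p (node x0 x1 y0 y1)
  (and (<= (rnode-xmin node) x1) (<= x0 (rnode-xmax node))
       (<= (rnode-ymin node) y1) (<= y0 (rnode-ymax node))))

(defun range-query (node x0 x1 y0 y1)
  ;; Descend only into children whose boxes intersect the query box;
  ;; at a leaf, filter the THINGs themselves.
  (if (rnode-children node)
      (loop for child in (rnode-children node)
            when (overlaps-p child x0 x1 y0 y1)
              append (range-query child x0 x1 y0 y1))
      (remove-if-not (lambda (th)
                       (and (<= x0 (thing-x th) x1)
                            (<= y0 (thing-y th) y1)))
                     (rnode-things node))))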

If you are simply appending to the vectors (are you actually using the
vector data, or simply recording it from day to day?), then I'd contrive
some method to chain vector blocks together rather than update them in
place. If you're reading these things back a lot, you'll get some
thrashing on the disk, so you need to think about how the vectors are used.
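
In memory, the shape of the idea is something like the following (on disk
you'd chain block addresses rather than pointers, and the chunk size is a
made-up tuning knob):

(defconstant +chunk-size+ 256)

(defstruct chunk
  (data (make-array +chunk-size+ :element-type 'single-float
                                 :initial-element 0.0))
  (fill 0)      ; slots used so far in DATA
  (next nil))   ; previously filled chunk, or NIL

(defun chunk-append (head value)
  ;; Old chunks are never rewritten, only linked to, so an append
  ;; touches at most one block.
  (when (= (chunk-fill head) +chunk-size+)
    (setf head (make-chunk :next head)))
  (setf (aref (chunk-data head) (chunk-fill head)) value)
  (incf (chunk-fill head))
  head)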

If you are really updating the entire thing every day, then you'd gain a
lot by simply recreating it every day (except for disk space, which is a
non-trivial detail with the numbers you're talking about). Indexes tend to
build much faster from sorted data than by being incrementally updated in
place based on changes. Plus, many indexes behave differently in "create"
mode than in "update" mode (for example, balancing the trees, etc.).

It also sounds (to me) like it would be smart to have the update process
separated from the query process (if practical). That is, if you end up
rewriting this thing almost every day, then let the users query
"yesterday's" version in read-only mode, and spend the day recreating it
for tomorrow.
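
That way the whole switchover can hinge on a single reference swap
(sketch; BUILD-NEW-DB is a placeholder for the nightly rebuild):

(defvar *live-db* nil)   ; yesterday's version, queried read-only

(defun nightly-rebuild (updates)
  ;; Build the new generation off to the side, in fresh files, then
  ;; flip one reference so readers switch over without ever blocking.
  (let ((new-db (build-new-db *live-db* updates)))
    (setf *live-db* new-db)))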

You may want to dredge through www.vldb.org and look for some pointers
there. You have so much data, and so many instances, that every little bit
of overhead is going to kill your processing time, I would think.

But what would be best really depends on how big your overall data set is,
and how much of it changes regularly.

And, finally, make sure you look into making your Lisp efficient, and try
to keep the numbers from getting all boxed up if it's practical.
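
For instance, specialized arrays plus type declarations keep the
single-floats unboxed in inner loops (a standard trick, sketched here):

(defun sum-tvec (tvec)
  ;; With a specialized array and type declarations, a decent compiler
  ;; works on raw single-floats through the whole loop.
  (declare (type (simple-array single-float (*)) tvec)
           (optimize (speed 3) (safety 0)))
  (let ((sum 0.0))
    (declare (type single-float sum))
    (dotimes (i (length tvec) sum)
      (incf sum (aref tvec i)))))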

Regards,

Will Hartung
(·····@msoft.com)