Storing sh'loads of data

From: Michael J. Barillier
Subject: Storing sh'loads of data
Date: Thu, 06 Sep 2001 18:15:40 +0000
Message-ID: <ul8k7zc2njn.fsf@babu.pcisys.net>

As a proof-of-concept idea, I want to pitch Lisp as an alternative to
C++ (see sig below) on a particular project.  One issue is massive
data flow and storage.  Currently this project processes on the order
of 10 million records a day, caching them for up to 3 days.  I'd
prefer a free or low-cost solution, but anything anyone can suggest
for data storage and fast access would be appreciated.

And regarding roll-your-own data storage in Lisp, I was wondering:
What would the drawbacks be to writing Lisp objects to a file using
prin1 with \n delimiters and keeping an index file (or in memory) of
the start of each record?  Or is there a prefered idiom for persisting
data structures?

Thanks!

-- 
Michael J. Barillier
<················@pcisys.net> <http://www.pcisys.net/~blackwolf/>

> (print "OO *sucks*.")

Re: Storing sh'loads of data Wade Humeniuk
Re: Storing sh'loads of data ···@itasoftware.com
Re: Storing sh'loads of data Kent M Pitman
- Re: Storing sh'loads of data Michael J. Barillier
  - Re: Storing sh'loads of data Thomas F. Burdick
    - Re: Storing sh'loads of data Kent M Pitman
    - Re: Storing sh'loads of data Roger Corman
- Re: Storing sh'loads of data Pekka P. Pirinen
  - Re: Storing sh'loads of data Lieven Marchand
  - Re: Storing sh'loads of data Kent M Pitman
    - Re: Storing sh'loads of data Pekka P. Pirinen
      - Re: Storing sh'loads of data Kent M Pitman
Re: Storing sh'loads of data Peter Seibel
- Re: Storing sh'loads of data Roger Corman
- Re: Storing sh'loads of data Bulent Murtezaoglu
  - Re: Storing sh'loads of data Tim Bradshaw
- Re: Storing sh'loads of data Christophe Rhodes
  - Re: Storing sh'loads of data Peter Seibel
- Re: Storing sh'loads of data Frode Vatvedt Fjeld
Re: Storing sh'loads of data Raymond Wiker

From: Wade Humeniuk
Subject: Re: Storing sh'loads of data
Date: Thu, 06 Sep 2001 20:56:57 +0000
Message-ID: <9n8o0i$rec$1@news3.cadvision.com>

If you just want to store records you might look at using the Berkeley DB.
You can store arbitrary records under a key.  You would have to write a FFI
to it but that should not be too hard.  A Berkeley DB can hold tons of data,
more than you need.  Runs on most everything.

See

http://www.sleeycat.com

http://www.sleepycat.com/products.html

Here's a page from the online manual.

http://www.sleepycat.com/docs/api_c/db_put.html


I have thought about using it for the basis of some higher level Lisp
storage mechanism but you could just store the results of using
write-to-string.

Wade

"Michael J. Barillier" <·········@pcisys.net> wrote in message
····················@babu.pcisys.net...
> As a proof-of-concept idea, I want to pitch Lisp as an alternative to
> C++ (see sig below) on a particular project.  One issue is massive
> data flow and storage.  Currently this project processes on the order
> of 10 million records a day, caching them for up to 3 days.  I'd
> prefer a free or low-cost solution, but anything anyone can suggest
> for data storage and fast access would be appreciated.
>
> And regarding roll-your-own data storage in Lisp, I was wondering:
> What would the drawbacks be to writing Lisp objects to a file using
> prin1 with \n delimiters and keeping an index file (or in memory) of
> the start of each record?  Or is there a prefered idiom for persisting
> data structures?
>
> Thanks!
>
> --
> Michael J. Barillier
> <················@pcisys.net> <http://www.pcisys.net/~blackwolf/>
>
> > (print "OO *sucks*.")
>

From: ···@itasoftware.com
Subject: Re: Storing sh'loads of data
Date: Thu, 06 Sep 2001 20:40:39 +0000
Message-ID: <ofoo833s.fsf@itasoftware.com>

"Michael J. Barillier" <·········@pcisys.net> writes:

> As a proof-of-concept idea, I want to pitch Lisp as an alternative to
> C++ (see sig below) on a particular project.  One issue is massive
> data flow and storage.  Currently this project processes on the order
> of 10 million records a day, caching them for up to 3 days.  I'd
> prefer a free or low-cost solution, but anything anyone can suggest
> for data storage and fast access would be appreciated.
> 
> And regarding roll-your-own data storage in Lisp, I was wondering:
> What would the drawbacks be to writing Lisp objects to a file using
> prin1 with \n delimiters and keeping an index file (or in memory) of
> the start of each record?  Or is there a prefered idiom for persisting
> data structures?

With this magnitude of data you might want to write some foreign
function calls to MMAP it in as a raw array of chars rather than use
READ and PRIN1 to deal with it.

From: Kent M Pitman
Subject: Re: Storing sh'loads of data
Date: Thu, 06 Sep 2001 18:54:47 +0000
Message-ID: <sfwbsko1760.fsf@world.std.com>

"Michael J. Barillier" <·········@pcisys.net> writes:

> 
> As a proof-of-concept idea, I want to pitch Lisp as an alternative to
> C++ (see sig below) on a particular project.  One issue is massive
> data flow and storage.  Currently this project processes on the order
> of 10 million records a day, caching them for up to 3 days.  I'd
> prefer a free or low-cost solution, but anything anyone can suggest
> for data storage and fast access would be appreciated.
> 
> And regarding roll-your-own data storage in Lisp, I was wondering:
> What would the drawbacks be to writing Lisp objects to a file using
> prin1 with \n delimiters and keeping an index file (or in memory) of
> the start of each record?  Or is there a prefered idiom for persisting
> data structures?

Well, portably of course, you aren't guaranteed set-file-position
won't seek linearly from the start, i.e., that it won't be O(n) where
n is the file position.  And, in fact, in any character encoding that
uses escape prefixes, it sort of has to... if it works at all.

If you ensure that the external file format that you use is randomly
accessible (this is vendor specific) you can perhaps win.

Sounds like you're just using "\n" as an arbitrary break.  I assume
you know \n is not a lisp term.  We use #\Newline here, and it might
be a CR or an LF or a CRLF in an external file, depending on the
system, so make sure you count correctly!  (You could literally use
that two-character marker I suppose, for all it matters.  Maybe even
more portable than newline. :-)

Don't use PRIN1 expecting READ to work without binding at least 
*PRINT-READABLY* around the print, and probably WITH-STANDARD-IO-SYNTAX.

Keep in mind that Newline could occur mid-expression, so don't write any
tools that count newlines to find out what recorde number you're in.
That is, Newline might be an ok spacer but it can't be a delimiter if you're
using PRINT/PRIN1+READ.

You could build a tree using the file system itself, btw, with one structure
per file.  Depending on how big your typical object was and how big your
system cluster size was.

(I've always thought it a shame that we don't have the ability to compile to
a binary stream, so that we could compile and load from an array and you
could store binary things in databases.)

Depending also on your data structure type and its constraints, you might
be able to make your own trivial fasl file format that worked with 8 bit files
and was portable.  The Lisp Machine fasl file format isn't portable among
platforms (I-Machine vs L-Machine for those who know what that is), but I
think that's because of instructions.  I think dumped "data", as in 
sys:dump-forms-to-file (not compiled code, but compiled data), in the same
format is portable across platforms.

From: Michael J. Barillier
Subject: Re: Storing sh'loads of data
Date: Thu, 06 Sep 2001 19:59:38 +0000
Message-ID: <ul8itew2iqd.fsf@babu.pcisys.net>

Kent M Pitman <······@world.std.com> writes:

> Well, portably of course, you aren't guaranteed set-file-position
> won't seek linearly from the start, i.e., that it won't be O(n) where
> n is the file position.  And, in fact, in any character encoding that
> uses escape prefixes, it sort of has to... if it works at all.

I'm not overly worried about portability -- the current version is so
poorly designed that they're having no end of trouble getting the
thing moved over to a DEC (using automake/autoconf apparently didn't
occur to anyone way back when).

> Sounds like you're just using "\n" as an arbitrary break.  I assume
> you know \n is not a lisp term.  We use #\Newline here, and it might
> be a CR or an LF or a CRLF in an external file, depending on the
> system, so make sure you count correctly!  (You could literally use
> that two-character marker I suppose, for all it matters.  Maybe even
> more portable than newline. :-)

Sorry 'bout that, Chief.  Yeah, I know it's #\Newline, but this
business with the DOJ effectively dropping the Micro$oft case has me a
bit riled.  It won't happen again.

> You could build a tree using the file system itself, btw, with one structure
> per file.  Depending on how big your typical object was and how big your
> system cluster size was.

Small records, and lots of 'em.  I was thinking about something like
that, but the records don't have any natural hierarchical ordering to
them (other than separate directories for records for ranges of
timestamps), but overall I was thinking that grouping them in a single
file would be less resource-intensive.  CPU and processing effort
would have to be benchmarked.

> Depending also on your data structure type and its constraints, you might
> be able to make your own trivial fasl file format that worked with 8 bit files
> and was portable.  The Lisp Machine fasl file format isn't portable among
> platforms (I-Machine vs L-Machine for those who know what that is), but I
> think that's because of instructions.  I think dumped "data", as in 
> sys:dump-forms-to-file (not compiled code, but compiled data), in the same
> format is portable across platforms.

Something to look into.

So how come there's never the perfect tool that fits in perfectly with
an application?  Sheesh. :)

-- 
Michael J. Barillier
<················@pcisys.net> <http://www.pcisys.net/~blackwolf/>

> (prin1 "OO *sucks*.")

From: Thomas F. Burdick
Subject: Re: Storing sh'loads of data
Date: Fri, 07 Sep 2001 02:49:06 +0000
Message-ID: <xcvbsknu34t.fsf@conquest.OCF.Berkeley.EDU>

"Michael J. Barillier" <·········@pcisys.net> writes:

> Kent M Pitman <······@world.std.com> writes:

> > Sounds like you're just using "\n" as an arbitrary break.  I assume
> > you know \n is not a lisp term.  We use #\Newline here, and it might
> > be a CR or an LF or a CRLF in an external file, depending on the
> > system, so make sure you count correctly!  (You could literally use
> > that two-character marker I suppose, for all it matters.  Maybe even
> > more portable than newline. :-)
> 
> Sorry 'bout that, Chief.  Yeah, I know it's #\Newline, but this
> business with the DOJ effectively dropping the Micro$oft case has me a
> bit riled.  It won't happen again.

Microsoft, shmicrosoft.  Macs use LF as #\Newline.

From: Kent M Pitman
Subject: Re: Storing sh'loads of data
Date: Fri, 07 Sep 2001 03:33:10 +0000
Message-ID: <sfw66avr7yh.fsf@world.std.com>

···@conquest.OCF.Berkeley.EDU (Thomas F. Burdick) writes:

> 
> "Michael J. Barillier" <·········@pcisys.net> writes:
> 
> > Kent M Pitman <······@world.std.com> writes:
> 
> > > Sounds like you're just using "\n" as an arbitrary break.  I assume
> > > you know \n is not a lisp term.  We use #\Newline here, and it might
> > > be a CR or an LF or a CRLF in an external file, depending on the
> > > system, so make sure you count correctly!  (You could literally use
> > > that two-character marker I suppose, for all it matters.  Maybe even
> > > more portable than newline. :-)
> > 
> > Sorry 'bout that, Chief.  Yeah, I know it's #\Newline, but this
> > business with the DOJ effectively dropping the Micro$oft case has me a
> > bit riled.  It won't happen again.
> 
> Microsoft, shmicrosoft.  Macs use LF as #\Newline.

Most implementations use either #\Linefeed or #\Return as #\Newline.
But the critical thing to understand is that IF an implementation makes
(eql #\Newline #\Return) or (eql #\Newline #\Linefeed), then there is a
problem about knowing what the exact effect of 
 (write-char #\Return stream)
 (write-char #\Linefeed stream)
will be, since it's going to be the same as 
 (write-char #\Return stream)
 (write-char #\Newline stream) ;if newline = lf
or
 (write-char #\Newline stream) ;if newline = cr
 (write-char #\Linefeed stream)
and you're going to risk that the system is not doing what you expect it
is doing.  Certainly there is no portable behavior that is well-defined here.
As a consequence, be careful in how many characters get output or how you
count.. best to just "ask".  But part of my point is that if you're going to
only read between two points, you don't really need ANY particular separator;
it's the pos that matters.  So you read in the range between the poses into
a string buffer and then read-from-string exactly from that buffer.  You might
as well put "FOO" between expressions OR the null string for that matter.

From: Roger Corman
Subject: Re: Storing sh'loads of data
Date: Fri, 07 Sep 2001 06:35:01 +0000
Message-ID: <3b986a66.282566018@news.callatg.com>

On 06 Sep 2001 19:49:06 -0700, ···@conquest.OCF.Berkeley.EDU (Thomas F. Burdick)
wrote:

>"Michael J. Barillier" <·········@pcisys.net> writes:
>
>> Kent M Pitman <······@world.std.com> writes:
>
>> > Sounds like you're just using "\n" as an arbitrary break.  I assume
>> > you know \n is not a lisp term.  We use #\Newline here, and it might
>> > be a CR or an LF or a CRLF in an external file, depending on the
>> > system, so make sure you count correctly!  (You could literally use
>> > that two-character marker I suppose, for all it matters.  Maybe even
>> > more portable than newline. :-)
>> 
>> Sorry 'bout that, Chief.  Yeah, I know it's #\Newline, but this
>> business with the DOJ effectively dropping the Micro$oft case has me a
>> bit riled.  It won't happen again.
>
>Microsoft, shmicrosoft.  Macs use LF as #\Newline.

You mean CR. Macs use CR as #\Newline.

From: Pekka P. Pirinen
Subject: Re: Storing sh'loads of data
Date: Wed, 12 Sep 2001 16:22:25 +0000
Message-ID: <usndsbcqm.fsf@globalgraphics.com>

Kent M Pitman <······@world.std.com> writes:
> Well, portably of course, you aren't guaranteed set-file-position
> won't seek linearly from the start, i.e., that it won't be O(n) where
> n is the file position.  And, in fact, in any character encoding that
> uses escape prefixes, it sort of has to... if it works at all.

IMO, that's worse than just refusing to support positioning by
returning NIL from FILE-POSITION.  There is a better way: Encode the
"shift state" in the low bits of the file position.  That's how I
implemented it for LW.  This is legal since it just has to be a
monotonically increasing integer.  (And I thought the committee was
very clever to leave me the leeway to do that.)
-- 
Pekka P. Pirinen, Global Graphics Software
 More than any time in history mankind faces a crossroads.  One path
 leads to despair and utter hopelessness, the other to total extinction.
 Let us pray that we have the wisdom to choose correctly.  - Woody Allen

From: Lieven Marchand
Subject: Re: Storing sh'loads of data
Date: Wed, 12 Sep 2001 21:37:47 +0000
Message-ID: <m3r8tc9jkk.fsf@localhost.localdomain>

···············@globalgraphics.com (Pekka P. Pirinen) writes:

> Kent M Pitman <······@world.std.com> writes:
> > Well, portably of course, you aren't guaranteed set-file-position
> > won't seek linearly from the start, i.e., that it won't be O(n) where
> > n is the file position.  And, in fact, in any character encoding that
> > uses escape prefixes, it sort of has to... if it works at all.
> 
> IMO, that's worse than just refusing to support positioning by
> returning NIL from FILE-POSITION.  

Not always. Sometimes the operating system doesn't give you a
choice. In VM/CMS your files can have record format fixed or
variable. For variable record format type files, you do need to scan,
but since this is under the users control, you shouldn't second guess
him by refusing to support it if the user chose to make that file
variable. The C library did and supported stuff like fopen("FILENAME
FILETYPE A ( RECFM F LRECL 120", "r"). CL would have had a more
elegant solution by having OPEN accept more keyword arguments.

I don't know whether there was ever a CL implementation for VM/CMS so
this is pretty academic. There was Portable Standard Lisp or whatever
the thing REDUCE was written in was called and an IBM lisp that had a
MAPCAR that supported only one list and had the order of arguments
(mapcar list function).

-- 
Lieven Marchand <···@wyrd.be>
She says, "Honey, you're a Bastard of great proportion."
He says, "Darling, I plead guilty to that sin."
Cowboy Junkies -- A few simple words

From: Kent M Pitman
Subject: Re: Storing sh'loads of data
Date: Thu, 13 Sep 2001 17:48:16 +0000
Message-ID: <sfwsndr6kyn.fsf@world.std.com>

···············@globalgraphics.com (Pekka P. Pirinen) writes:

> Kent M Pitman <······@world.std.com> writes:
> > Well, portably of course, you aren't guaranteed set-file-position
> > won't seek linearly from the start, i.e., that it won't be O(n) where
> > n is the file position.  And, in fact, in any character encoding that
> > uses escape prefixes, it sort of has to... if it works at all.
> 
> IMO, that's worse than just refusing to support positioning by
> returning NIL from FILE-POSITION.  There is a better way: Encode the
> "shift state" in the low bits of the file position.  That's how I
> implemented it for LW.  This is legal since it just has to be a
> monotonically increasing integer.  (And I thought the committee was
> very clever to leave me the leeway to do that.)

I didn't mean the "shift-is-on" information like doublequote and
vertical bar, though that's an issue, too I was more meaning the
things that compromise the ability to guess the position of the 7th
character, which some implementations let you do.  For that, you have
to know how many of characters 1-6 were multi-byte representations
and how many were fixed.  Sounds like your file position markers don't
themselves represent how many characters precede.  I think on the LispM
if you open in base-char you'll get direct access to the nth char when
you set-file-position to n.  But if you open in character, it will write 
out some chars as single byte and some as multi-byte, but the position
will be the nth character, so you can dead-reckon a file-position you
didn't ever ask for, but at the expense that moving to it can be tough.
If you're accessing a document linearly, it's often no big deal, but 
random access is awful.  I agree that definitionally, not knowing
it's going to do this is bad.

From: Pekka P. Pirinen
Subject: Re: Storing sh'loads of data
Date: Fri, 14 Sep 2001 18:01:57 +0000
Message-ID: <uwv31lkh6.fsf@globalgraphics.com>

Kent M Pitman <······@world.std.com> writes:
> ···············@globalgraphics.com (Pekka P. Pirinen) writes:
> > [...] Encode the
> > "shift state" in the low bits of the file position.
> 
> I didn't mean the "shift-is-on" information like doublequote and
> vertical bar, though that's an issue, too

To be specific, escape sequences that change the character set,
as in the JIS encoding for Japanese characters.

> I was more meaning the
> things that compromise the ability to guess the position of the 7th
> character, which some implementations let you do.  For that, you have
> to know how many of characters 1-6 were multi-byte representations
> and how many were fixed.  Sounds like your file position markers don't
> themselves represent how many characters precede.

That's right.  And it isn't enough to know the encodings of the
characters, you also have to know how file positions are represented
as integers.  So it's hopeless to try to do that with
  (FILE-POSITION stream 7)
unless you have implemtation-specific knowledge about it.

> If you're accessing a document linearly, it's often no big deal, but 
> random access is awful.  I agree that definitionally, not knowing
> it's going to do this is bad.

Well, FILE-POSITION is for random access; if it doesn't do that
efficiently, it's not a good implementation.

However, the concern you raised doesn't apply to the situation the
original poster was asking about: indexing a file they've written
themselves.  The program can just query the file positions as it's
doing the writing and store them in the index, no need to guess.
-- 
Pekka P. Pirinen, Global Graphics Software
If a war scene makes you think "Thank God I'm not out there!" rather
than "Whee, that looks fun!  When's the computer game coming out?"
then it's succeeded.  - nomad_requiem.demon.co.uk (Occam's Razor)

From: Kent M Pitman
Subject: Re: Storing sh'loads of data
Date: Fri, 14 Sep 2001 20:53:26 +0000
Message-ID: <sfwwv318pfd.fsf@world.std.com>

···············@globalgraphics.com (Pekka P. Pirinen) writes:

> However, the concern you raised doesn't apply to the situation the
> original poster was asking about: indexing a file they've written
> themselves.  The program can just query the file positions as it's
> doing the writing and store them in the index, no need to guess.

Agreed.

From: Peter Seibel
Subject: Re: Storing sh'loads of data
Date: Fri, 07 Sep 2001 06:21:51 +0000
Message-ID: <m31yljed1c.fsf@localhost.localdomain>

"Michael J. Barillier" <·········@pcisys.net> writes:

> As a proof-of-concept idea, I want to pitch Lisp as an alternative to
> C++ (see sig below) on a particular project.  One issue is massive
> data flow and storage.  Currently this project processes on the order
> of 10 million records a day, caching them for up to 3 days.  I'd
> prefer a free or low-cost solution, but anything anyone can suggest
> for data storage and fast access would be appreciated.
> 
> And regarding roll-your-own data storage in Lisp, I was wondering:
> What would the drawbacks be to writing Lisp objects to a file using
> prin1 with \n delimiters and keeping an index file (or in memory) of
> the start of each record?  Or is there a prefered idiom for persisting
> data structures?

I've had a similar question bouncing around my mind lately, as a Lisp
newbie. Suppose I wanted to write a on-disk btree implementation in
Lisp; it seems to me (based on flipping through CLtL2) that Common
Lisp doesn't specify very elaborate APIs for reading and writing
binary data to and from disk. All I could find were two functions: one
for reading a byte at a time and one for writing a byte at a
time. Which is fine except when you're writing a paged storage system
like a btree you'd really like an API that lets you read and write
arrays of bytes of you page size in one system call. (Well, really if
you really wanted top performance there's lots you'd want to do beyond
that such as turning of the OS page cache and perhaps using a raw
partition. But I can deal with not having that as long as I don't have
to dribble by 4k blocks out one byte at a time.)

I assume that you could always do this with a FFI to C (at least on
Unix, etc.) I was just wondering if that's The Right Thing(tm) to
do. Or do different vendors provide their own proprietary ways of
doing things like this?

-Peter

P.S. Michael, speaking not as a Lisp newbie, but as someone who builds
(or tries to) scalable software in other languages for a living, it
seems that the success of your proposal will depend on how you
implement the index file. It doesn't seem to me that any "simple"
indexing scheme is likely to scale well to processing 10,000,000 new
records a day and storing 30,000,000 an any given time. You'll need to
do--assuming best case conditions of perfectly even load--115
transactions a second, not counting removing expired records. And it
doesn't seem you can keep the index just in memory: what happens if
your process crashes; how long would it take to rebuild the in-memory
index from the data file? Plus, even keeping the whole index in memory
might cause you to run afoul of the "five-minute rule" that basically
says if you have data that is not accessed often enough (on the order
of five minutes in the original paper by Jim Gray) it's more
economical to swap it out to secondary storage than to keep it in
memory. I suppose you can trade the higher cost of memory off against
the cost of programmer time if you choose. But you'll probably have a
hard time finding boxes with more than 4G of RAM in them, and with 30
million records in your index, you only have 143 bytes per record to
play with (not to mention the memory used for things other than the
index, like the rest of the Lisp runtime.) You could use virtual
memory but then, unless you're very careful about your memory access
patterns (which you may not even have that much control over, given
the garbage-collector), you'll probably thrash yourself to
death--you'd be better off handling the paging yourself with something
like a btree. If you really need to write the code yourself. It seems
like you might want to use a database (I assume there are already Lisp
libraries for talking to databases) or at least something like
BerkeleyDB (free from http://www.sleepycat.com), which you could wrap
in a FFI.

Or maybe I'm just misunderstanding the requirements of your software.

> 
> Thanks!
> 
> -- 
> Michael J. Barillier
> <················@pcisys.net> <http://www.pcisys.net/~blackwolf/>
> 
> > (print "OO *sucks*.")

-- 
Peter Seibel <·····@javamonkey.com>

From: Roger Corman
Subject: Re: Storing sh'loads of data
Date: Fri, 07 Sep 2001 06:40:46 +0000
Message-ID: <3b986ae4.282692550@news.callatg.com>

On 06 Sep 2001 23:21:51 -0700, Peter Seibel <·····@javamonkey.com> wrote:

>"Michael J. Barillier" <·········@pcisys.net> writes:
;..
>I've had a similar question bouncing around my mind lately, as a Lisp
>newbie. Suppose I wanted to write a on-disk btree implementation in
>Lisp; it seems to me (based on flipping through CLtL2) that Common
>Lisp doesn't specify very elaborate APIs for reading and writing
>binary data to and from disk. All I could find were two functions: one
>for reading a byte at a time and one for writing a byte at a
>time. Which is fine except when you're writing a paged storage system
>like a btree you'd really like an API that lets you read and write
>arrays of bytes of you page size in one system call. (Well, really if
>you really wanted top performance there's lots you'd want to do beyond
>that such as turning of the OS page cache and perhaps using a raw
>partition. But I can deal with not having that as long as I don't have
>to dribble by 4k blocks out one byte at a time.)
>
>I assume that you could always do this with a FFI to C (at least on
>Unix, etc.) I was just wondering if that's The Right Thing(tm) to
>do. Or do different vendors provide their own proprietary ways of
>doing things like this?
>

I believe you would use READ-SEQUENCE and WRITE-SEQUENCE for this,
which are ANSI functions. Ideally an implementation would provide an
optimized implementation for the case you are interested in (large blocks of
bytes).

From the Hyperspec:

read-sequence is identical in effect to iterating over the indicated subsequence
and reading one element at a time from stream and storing it into sequence, but
may be more efficient than the equivalent loop. An efficient implementation is
more likely to exist for the case where the sequence is a vector with the same
element type as the stream. 

Roger

From: Bulent Murtezaoglu
Subject: Re: Storing sh'loads of data
Date: Fri, 07 Sep 2001 06:41:18 +0000
Message-ID: <87bsknmrht.fsf@nkapi.internal>

>>>>> "PS" == Peter Seibel <·····@javamonkey.com> writes:

    PS> [...] All
    PS> I could find were two functions: one for reading a byte at a
    PS> time and one for writing a byte at a time. 

See read-sequence and write-sequence.  Some implementations might optimize
them havily for specialized arrays (I know CMUCL tries to do a good job
at least).  

    PS> [...] But I can deal with not having
    PS> that as long as I don't have to dribble by 4k blocks out one
    PS> byte at a time.)

You don't.  You can read a 4k block at once with read-sequence and
a 4k-long array of element type (unsigned-byte 8) or whatever.  You 
might want to check with your vendor if this turns out to have 
inefficiency problems.  I don't see anything inherently inefficient
in the way those functions are specified.  An SC[1] should be able to
turn read-sequence into a call to read with minimal overhead for the 
right array type (with the right declerations).

    PS> I assume that you could always do this with a FFI to C (at
    PS> least on Unix, etc.) I was just wondering if that's The Right
    PS> Thing(tm) to do. [...]

Well, that depends on who you ask.  I wouldn't have a problem with 
using a well tested library through FFI for this.  If you are dealing 
with truly massive amounts of data, you might even try having a lisp
program writing your C code (see [2] and think something lispier than 
their cymbal language).  On the other hand, if you did indeed do it
all in Lisp and tuned it for adequate performance you'd get to know 
the I/O substrate part of your implementation very well and that might 
prove very useful later.

cheers,

BM

[1] I don't think an SSC is needed for this.
[2] http://www.research.att.com/projects/daytona/

From: Tim Bradshaw
Subject: Re: Storing sh'loads of data
Date: Mon, 10 Sep 2001 08:32:58 +0000
Message-ID: <ey3heubo37p.fsf@cley.com>

* Bulent Murtezaoglu wrote:

> You don't.  You can read a 4k block at once with read-sequence and
> a 4k-long array of element type (unsigned-byte 8) or whatever.  You 
> might want to check with your vendor if this turns out to have 
> inefficiency problems.  I don't see anything inherently inefficient
> in the way those functions are specified.  An SC[1] should be able to
> turn read-sequence into a call to read with minimal overhead for the 
> right array type (with the right declerations).

I did some experiments along these lines, and at least ACL (I think
also CMUCL, although I am not sure) can sustain very good IO rates
using READ-SEQUENCE / WRITE-SEQUENCE for suitable types.  I don't have
fancy hardware to play with, but on PC-class machines you can easily
saturate the disk (and in fact with a memory filesystem like /tmp on
Solaris you can get performance that looks like it's asymptotically
memory-bandwidth limited).

--tim

From: Christophe Rhodes
Subject: Re: Storing sh'loads of data
Date: Fri, 07 Sep 2001 08:29:32 +0000
Message-ID: <sqn147v1xv.fsf@lambda.jesus.cam.ac.uk>

Peter Seibel <·····@javamonkey.com> writes:

> I've had a similar question bouncing around my mind lately, as a Lisp
> newbie. Suppose I wanted to write a on-disk btree implementation in
> Lisp; it seems to me (based on flipping through CLtL2) that Common
> Lisp doesn't specify very elaborate APIs for reading and writing
> binary data to and from disk. All I could find were two functions: one
> for reading a byte at a time and one for writing a byte at a
> time. 

CLtL2 is not the specification of Common Lisp; it was an attempt at
summarizing the "state of the nation" during the ANSI standardization
process.

As others have mentioned, since CLtL2 was published the functions
READ-SEQUENCE and WRITE-SEQUENCE were added to the language, available
in a hyperspec near you.

Christophe
-- 
Jesus College, Cambridge, CB5 8BL                           +44 1223 510 299
http://www-jcsu.jesus.cam.ac.uk/~csr21/                  (defun pling-dollar 
(str schar arg) (first (last +))) (make-dispatch-macro-character #\! t)
(set-dispatch-macro-character #\! #\$ #'pling-dollar)

From: Peter Seibel
Subject: Re: Storing sh'loads of data
Date: Sat, 08 Sep 2001 21:39:28 +0000
Message-ID: <m3ofolnyzz.fsf@localhost.localdomain>

Christophe Rhodes <·····@cam.ac.uk> writes:

> Peter Seibel <·····@javamonkey.com> writes:
> 
> > I've had a similar question bouncing around my mind lately, as a Lisp
> > newbie. Suppose I wanted to write a on-disk btree implementation in
> > Lisp; it seems to me (based on flipping through CLtL2) that Common
> > Lisp doesn't specify very elaborate APIs for reading and writing
> > binary data to and from disk. All I could find were two functions: one
> > for reading a byte at a time and one for writing a byte at a
> > time. 
> 
> CLtL2 is not the specification of Common Lisp; it was an attempt at
> summarizing the "state of the nation" during the ANSI standardization
> process.
> 
> As others have mentioned, since CLtL2 was published the functions
> READ-SEQUENCE and WRITE-SEQUENCE were added to the language, available
> in a hyperspec near you.

Fair enough. CLtL2 was what I had at hand; I'll look into the
HyperSpec more carefully. Thanks.

-Peter


> 
> Christophe
> -- 
> Jesus College, Cambridge, CB5 8BL                           +44 1223 510 299
> http://www-jcsu.jesus.cam.ac.uk/~csr21/                  (defun pling-dollar 
> (str schar arg) (first (last +))) (make-dispatch-macro-character #\! t)
> (set-dispatch-macro-character #\! #\$ #'pling-dollar)

From: Frode Vatvedt Fjeld
Subject: Re: Storing sh'loads of data
Date: Fri, 07 Sep 2001 08:19:06 +0000
Message-ID: <2hlmjr5s79.fsf@dslab7.cs.uit.no>

Peter Seibel <·····@javamonkey.com> writes:

> I've had a similar question bouncing around my mind lately, as a
> Lisp newbie. Suppose I wanted to write a on-disk btree
> implementation in Lisp; it seems to me (based on flipping through
> CLtL2) that Common Lisp doesn't specify very elaborate APIs for
> reading and writing binary data to and from disk.

Perhaps you'd be interested in my binary-types
package. <URL:http://www.cs.uit.no/~frodef/sw/binary-types/>

Just don't expect great I/O performance from it in its current form,
as that has not been a concern of mine. (My main concern has been to
make the mapping to and from binary data seamless.) It's basically
built on top of (8-bit) READ-BYTE and WRITE-BYTE. I've also introduced
some cludges to allow binary data be written to vectors (of suitable
element-type) which subsequently could be output with WRITE-SEQUENCE.

> All I could find were two functions: one for reading a byte at a
> time and one for writing a byte at a time. Which is fine except when
> you're writing a paged storage system like a btree you'd really like
> an API that lets you read and write arrays of bytes of you page size
> in one system call.

I'd certainly expect a lisp implementation to buffer streams, in a
similar fashion to what libc/stdio does. There is still the overhead
of a function call for each and every byte, of course. Note that
e.g. WRITE-BYTE is likely to be able to deal with 32-bit bytes (on a
32-bit system), so if you're dealing in 32-bit sized objects there
shouldn't be so much overhead. Well, except now you'll possibly have
the problem of consing lots of bignums.. :)

> (Well, really if you really wanted top performance there's lots
> you'd want to do beyond that such as turning of the OS page cache
> and perhaps using a raw partition. But I can deal with not having
> that as long as I don't have to dribble by 4k blocks out one byte at
> a time.)

I think that if extreme I/O performance is the most important goal,
you might want to ditch the CL stream abstraction and build your own
abstractions on top of some low-level OS interface.

-- 
Frode Vatvedt Fjeld

From: Raymond Wiker
Subject: Re: Storing sh'loads of data
Date: Fri, 07 Sep 2001 07:51:04 +0000
Message-ID: <86g09z4exj.fsf@raw.grenland.fast.no>

        Is it even necessary to use external storage? Given that you
need a few tens of millions of small objects, it might be possible to
just keep everything in memory (and/or swap). Obviously depends on
what you mean by "small", though :-)

-- 
Raymond Wiker
·············@fast.no