From: Nicholas Harbour
Subject: Efficient large binary file IO
Date: 
Message-ID: <3ciccns6.fsf@diety.brainpowerconsulting.com>
I apologize if this is a common topic, but I could not find anything in the FAQ.

I am writing an application that is well suited to functional programming, but I have run into a rather basic obstacle.  One of the things I need to do is read data from a stream and write it out to a file until I reach a predefined delimiter.  The data could be binary and arbitrarily large.

I've tried looping read-line (which actually works nicely, because the delimiter in this data always follows a newline).  It runs a little slow, but not too bad.  My problem with that method is that read-line strips the trailing newline, which makes it useless for reproducing binary data exactly.  It is also rather useless if the data is, for example, all binary zeros :)

So I have an accurate solution using a looping read-char.  I can handle the extra complexity it adds to the code, but I cannot handle the extra I/O overhead.  Even the most basic "do" loop with only a read-char and a write-char takes several seconds, or even minutes, to read and write a 5MB file.  Add in the overhead of searching for a text pattern and you can see where I am having problems.
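
For concreteness, here is roughly what my loop looks like (a minimal sketch with made-up file names; the delimiter check is left out):

  (with-open-file (in "input.dat" :direction :input)
    (with-open-file (out "output.dat" :direction :output
                         :if-exists :supersede)
      ;; One stream operation per character is what kills the
      ;; performance on a 5MB file.
      (do ((c (read-char in nil nil) (read-char in nil nil)))
          ((null c))
        (write-char c out))))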

I've been hacking with Lisp on and off for a couple of years now, so it pains me to ask such a basic question, but here it is:

What is the most efficient solution for basic file IO?

-- 
Nicholas Harbour,  ···············@yahoo.com

"Nearly half of all people are below average."

From: Thomas F. Burdick
Subject: Re: Efficient large binary file IO
Date: 
Message-ID: <xcvhe6rkedp.fsf@famine.OCF.Berkeley.EDU>
Nicholas Harbour <···············@yahoo.com> writes:

> I apologize if this is a common topic, but I could not find anything in the FAQ.
> 
> I am writing an application that is well suited to functional
> programming, but I have run into a rather basic obstacle.  One of
> the things I need to do is read data from a stream and write it
> out to a file until I reach a predefined delimiter.  The data
> could be binary and arbitrarily large.
> 
> I've tried looping read-line [...]

[ Yuck! I don't even do that in Perl! (not that I haven't been tempted) ]
>
> So I have an accurate solution using a looping read-char.

If it's binary input, my first suggestion is that you use binary
streams, not character streams.

> I can handle the extra complexity it adds to the code, but I
> cannot handle the extra I/O overhead.  Even the most basic "do"
> loop with only a read-char and a write-char takes several
> seconds, or even minutes, to read and write a 5MB file.  Add in
> the overhead of searching for a text pattern and you can see
> where I am having problems.

So, use buffering.  Combining binary streams and READ-SEQUENCE on
(vector (unsigned-byte 32)) objects should get you the efficient I/O
you need.
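
Something like this sketch (untested, with made-up file names and
buffer size; I'm using 8-bit bytes here since your data is raw
octets):

  (with-open-file (in "input.dat" :direction :input
                      :element-type '(unsigned-byte 8))
    (with-open-file (out "output.dat" :direction :output
                         :element-type '(unsigned-byte 8)
                         :if-exists :supersede)
      (let ((buffer (make-array 65536 :element-type '(unsigned-byte 8))))
        ;; READ-SEQUENCE returns the index just past the last
        ;; element it filled in, so a return value of 0 means EOF.
        (loop for end = (read-sequence buffer in)
              until (zerop end)
              do (write-sequence buffer out :end end)))))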

As for combining efficiency with ease of use, if you haven't checked
out Gray streams, you should.  And if you're searching for a text
pattern and you're concerned with efficiency, hopefully you've built a
search tree or something similar, and you're reading in your data in
appropriate chunks.
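
For the delimiter search itself, SEARCH on the octet buffer is a
reasonable start (a sketch; DELIMITER is assumed to be an octet
vector, and a real version also has to handle a match that
straddles two buffer loads):

  (defun find-delimiter (buffer end delimiter)
    ;; Look for DELIMITER in the first END octets of BUFFER.
    ;; Returns the start position of the match, or NIL.
    (search delimiter buffer :end2 end))

  ;; For an ASCII delimiter string:
  ;; (find-delimiter buffer end
  ;;                 (map '(vector (unsigned-byte 8)) #'char-code "--END--"))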

-- 
           /|_     .-----------------------.                        
         ,'  .\  / | No to Imperialist war |                        
     ,--'    _,'   | Wage class war!       |                        
    /       /      `-----------------------'                        
   (   -.  |                               
   |     ) |                               
  (`-.  '--.)                              
   `. )----'                               
From: Janis Dzerins
Subject: Re: Efficient large binary file IO
Date: 
Message-ID: <twkfzm7wq2v.fsf@gulbis.latnet.lv>
···@famine.OCF.Berkeley.EDU (Thomas F. Burdick) writes:

> Nicholas Harbour <···············@yahoo.com> writes:
> 
> > I can handle the extra complexity it adds to the code, but I
> > cannot handle the extra I/O overhead.  Even the most basic "do"
> > loop with only a read-char and a write-char takes several
> > seconds, or even minutes, to read and write a 5MB file.  Add in
> > the overhead of searching for a text pattern and you can see
> > where I am having problems.
> 
> So, use buffering.  Combining binary streams and READ-SEQUENCE on
> (vector (unsigned-byte 32)) objects should get you the efficient I/O
> you need.

Two things:

1. It might be better to use a simple-array with the needed element type (see the sketch below).
2. The fact that 32-bit bytes should be used is just an example, right?
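
I.e., something along these lines (just a sketch; the size and
element type are whatever fits the data):

  ;; MAKE-ARRAY without :adjustable, :fill-pointer or :displaced-to
  ;; is guaranteed to return a SIMPLE-ARRAY, which compilers can
  ;; open-code access to.
  (make-array 65536 :element-type '(unsigned-byte 8))
  ;; => a (SIMPLE-ARRAY (UNSIGNED-BYTE 8) (65536))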

-- 
Janis Dzerins

  If a million people say a stupid thing, it's still a stupid thing.