From: Tamas K Papp
Subject: reading large fixed-width datasets
Date: 
Message-ID: <6lh8m9Fce97aU1@mid.individual.net>
Hi,

I am working with some Current Population Survey data.  The files 
themselves are huge (145MB for each month, 13x12 months), and the data 
format is fixed-width tables, like this:

002600310997690 82008 120100-1 1 1-1 1-3-1-1-1  36409348 1 1 7 7 2 0
002909470991091 82008 220100-1 1 1-1 014-1-1-1  34777557 1 5 1 8 1 1
002909470991091 82008 220100-1 1 1-1 014-1-1-1  34777557 1 5 1 8 1 1
002909470991091 82008 220100-1 1 1-1 014-1-1-1  34777557 1 5 1 8 1 1

E.g. those 1-1's are a 1 and a -1.  The column widths and positions are 
known in advance and don't change.  I only need part of each row (always 
the same columns), so the plan is to

1) read one row at a time,
2) extract the columns I need and convert them to integers (mostly 
fixnums),
3) save them in a matrix.

I would like to do this as fast as possible.  What is the best way?  I am 
using SBCL if that matters.

Thanks,

Tamas

From: Thomas A. Russ
Subject: Re: reading large fixed-width datasets
Date: 
Message-ID: <ymihc7gigyh.fsf@blackcat.isi.edu>
Tamas K Papp <······@gmail.com> writes:

> I am working with some Current Population Survey data.  The files
> themselves are huge (145MB for each month, 13x12 months), and the data
> format is fixed-width tables [...] I only need part of each row (always
> the same columns) [...] I would like to do this as fast as possible.
> What is the best way?  I am using SBCL if that matters.

I would suggest using READ-SEQUENCE to get each row.  With a fixed
size, you can align your read buffer to the row size of the data.  Just
remember to allow for the end-of-line character!

Alternately, and perhaps more simply, you could use READ-LINE, but
READ-SEQUENCE is likely to be faster.

Once you have the data in a buffer, you should be able to use
PARSE-INTEGER with :START and :END arguments to extract the portions of
the line that you need.

(defvar *buffer* (make-string 69))  ;; line length of 68 + 1 for EOL

(defun read-next-line (buffer stream)
  (read-sequence buffer stream))

(defun extract-fields (buffer field-list)
  ;; FIELD-LIST entries are (name start end), with zero-based,
  ;; half-open bounds as PARSE-INTEGER expects.
  (loop for (name start end) in field-list
        do (format t "~A = ~D~%"
                   name (parse-integer buffer :start start :end end))))

(with-input-from-string (s "002909470991091 82008 220100-1 1 1-1 014-1-1-1  34777557 1 5 1 8 1 1")
   (read-next-line *buffer* s)
   (extract-fields *buffer* '((first 16 21) (one 32 34) (negative 34 36))))

-- 
Thomas A. Russ,  USC/Information Sciences Institute
From: sross
Subject: Re: reading large fixed-width datasets
Date: 
Message-ID: <7766a27e-3638-4874-b144-e7fa02ee3766@i76g2000hsf.googlegroups.com>
On Oct 13, 5:43 pm, ····@sevak.isi.edu (Thomas A. Russ) wrote:
> Tamas K Papp <······@gmail.com> writes:
>
> > I am working with some Current Population Survey data. [...]
>
> I would suggest using READ-SEQUENCE to get each row.  With a fixed
> size, you can align your read buffer to the row size of the data.  Just
> remember to allow for the end-of-line character!
> [...]
> Once you have the data in a buffer, you should be able to use
> PARSE-INTEGER with :START and :END arguments to extract the portions of
> the line that you need.
>
> (defvar *buffer* (make-string 69))  ;; line length of 68 + 1 for EOL
> [...]



Another useful trick is not to extract the data using SUBSEQ but to
define arrays which are displaced onto your buffer.  When these arrays
have fill-pointers you can also resize them at will to get a cheap
`window` onto the data.  Of course this is only useful when the data
item in question is a string, but it allows you to drastically reduce
the amount of interim garbage which gets created.

(defvar *buffer* (make-string 69))


;; Optionally add :fill-pointer t
(defun field (size &optional (start 0))
  (make-array size :element-type (array-element-type *buffer*)
              :displaced-to *buffer* :displaced-index-offset start))

(defvar *field-one* (field 15))
(defvar *field-two* (field 5 16))
(defvar *field-three* (field 8 22))


(defun read-next-line (buffer stream)
  (read-sequence buffer stream))

(with-input-from-string (s "002909470991091 82008 220100-1 1 1-1
014-1-1-1  34777557 1 5 1 8 1 1")
  (read-next-line *buffer* s)
  (values *field-one* *field-two* *field-three*))

=> "002909470991091",
   "82008",
   "220100-1"



I've used this technique to good effect when parsing gigabytes' worth
of UK postcode files.

sean.
From: Thomas A. Russ
Subject: Re: reading large fixed-width datasets
Date: 
Message-ID: <ymid4i4i2px.fsf@blackcat.isi.edu>
sross <······@gmail.com> writes:

> On Oct 13, 5:43 pm, ····@sevak.isi.edu (Thomas A. Russ) wrote:
> > Once you have the data in a buffer, you should be able to use
> > PARSE-INTEGER with :START and :END arguments to extract the portions of
> > the line that you need.
...
> > (defun extract-fields (buffer field-list)
> >   (loop for (name start end) in field-list
> >         do (format t "~A = ~D~%"
> >                    name (parse-integer buffer :start start :end end))))


> Another useful trick is not to extract the data using SUBSEQ but to
> define arrays which are displaced onto your buffer.  When these arrays
> have fill-pointers you can also resize them at will to get a cheap
> `window` onto the data.  Of course this is only useful when the data
> item in question is a string, but it allows you to drastically reduce
> the amount of interim garbage which gets created.

Well, avoiding SUBSEQ is precisely why I use the START and END arguments
to PARSE-INTEGER.  I'm pretty sure that any decent implementation of
that function will not use SUBSEQ internally to create a separate, new
sequence, but will instead use the given sequence and the indices to
operate on the original data.

I can't imagine anyone doing it differently, especially since one
motivation for including those START and END keywords in so many
sequence functions is to allow operating on subsequences without the
need to explicitly create them and then have them be garbage
collected.
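
For example, these two forms return the same value, but the first one
conses a fresh five-character string on every call:

(parse-integer (subseq *buffer* 16 21))
(parse-integer *buffer* :start 16 :end 21)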
 

-- 
Thomas A. Russ,  USC/Information Sciences Institute
From: sross
Subject: Re: reading large fixed-width datasets
Date: 
Message-ID: <27c48556-5bfe-4924-9338-05098d62f370@h2g2000hsg.googlegroups.com>
On Oct 13, 10:50 pm, ····@sevak.isi.edu (Thomas A. Russ) wrote:
> sross <······@gmail.com> writes:
> > Another useful trick is not to extract the data using SUBSEQ but to
> > define arrays which are displaced onto your buffer. [...]
>
> Well, avoiding SUBSEQ is precisely why I use the START and END arguments
> to PARSE-INTEGER.  I'm pretty sure that any decent implementation of
> that function will not use SUBSEQ internally to create a separate, new
> sequence, but will instead use the given sequence and the indices to
> operate on the original data.
> [...]


Indeed, but sometimes it is not appropriate to use PARSE-INTEGER on
data elements, and when it isn't, displaced arrays can be a useful tool.

In this example, 220100-1 does not have a particularly meaningful
integer representation, and PARSE-INTEGER may not be what you are after
if you want to keep the leading zeros in 002909470991091.
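
For instance, with the displaced fields from my earlier post, the
string keeps its leading zeros while PARSE-INTEGER drops them:

(values *field-one* (parse-integer *field-one*))

=> "002909470991091",
   2909470991091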


sean.
From: Thomas A. Russ
Subject: Re: reading large fixed-width datasets
Date: 
Message-ID: <ymi3aiy99np.fsf@blackcat.isi.edu>
sross <······@gmail.com> writes:

> In this example, 220100-1 does not have a particularly meaningful
> integer representation

Well, given the notes in the OP's message, and the fact that these are
fixed-width fields and thus not necessarily white-space delimited, there
are a number of potential integer representations, among them:
 (220100 -1) and (220 100 -1), etc.

It just depends on where the field boundaries are, which is why they get
specified separately.

-- 
Thomas A. Russ,  USC/Information Sciences Institute
From: sross
Subject: Re: reading large fixed-width datasets
Date: 
Message-ID: <dd89ff5e-e897-4b2c-a3d9-36d5dfa981d5@i20g2000prf.googlegroups.com>
On Oct 14, 9:58 pm, ····@sevak.isi.edu (Thomas A. Russ) wrote:
> sross <······@gmail.com> writes:
> > In this example, 220100-1 does not have a particularly meaningful
> > integer representation
>
> Well, given the notes in the OP's message, and the fact that these are
> fixed-width fields and thus not necessarily white-space delimited, there
> are a number of potential integer representations, among them:
>  (220100 -1) and (220 100 -1), etc.
>
> It just depends on where the field boundaries are, which is why they get
> specified separately.


I think you are missing the point of my post, namely:

- parse-integer is useful.
- It is not necessarily the best or only option.

sean
From: Pascal J. Bourguignon
Subject: Re: reading large fixed-width datasets
Date: 
Message-ID: <87ej2kz43u.fsf@hubble.informatimago.com>
···@sevak.isi.edu (Thomas A. Russ) writes:
> Alternately, and perhaps more simply, you could use READ-LINE, but
> READ-SEQUENCE is likely to be faster.
>
> Once you have the data in a buffer, you should be able to use
> PARSE-INTEGER with :START and :END arguments to extract the portions of
> the line that you need.

Yes, but it would be faster to avoid converting to CHARACTERs, so let's
use a vector of (unsigned-byte 8) and assume ASCII encoding.

For a two-byte wide field:

(= (+ (* (- hi 48) 10) (- lo 48))
   (+ (* hi 10) lo -528))


(defun get-2-digit-integer (buffer position)
  ;; 32 = ASCII space, 45 = #\-, 48 = #\0.
  (let ((hi (aref buffer position))
        (lo (aref buffer (1+ position))))
    (case hi
      ((32)      (- lo 48))             ; " d": single digit
      ((45)      (- 48 lo))             ; "-d": negated digit
      (otherwise (+ (* hi 10) lo -528)))))

and so on.
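
A sketch of the general case, under the same ASCII assumption (fields
containing only padding spaces, an optional minus, and digits, as in
the sample):

(defun get-integer (buffer start end)
  ;; Parse a right-aligned decimal field from an ASCII byte buffer,
  ;; e.g. " 1", "-3", or "  34777557".
  (let ((result 0)
        (sign 1))
    (loop for i from start below end
          for byte = (aref buffer i)
          do (case byte
               ((32))                        ; padding space: skip
               ((45) (setf sign -1))         ; minus sign
               (otherwise
                (setf result (+ (* result 10) (- byte 48))))))
    (* sign result)))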

Once you avoid converting to characters, you can even use mmap...


-- 
__Pascal Bourguignon__                     http://www.informatimago.com/
Our enemies are innovative and resourceful, and so are we. They never
stop thinking about new ways to harm our country and our people, and
neither do we. -- Georges W. Bush
From: Volkan YAZICI
Subject: Re: reading large fixed-width datasets
Date: 
Message-ID: <0fa128a9-95ce-4f67-86ef-bcbf0b4593c5@m73g2000hsh.googlegroups.com>
On Oct 13, 6:44 pm, Tamas K Papp <······@gmail.com> wrote:
> I am working with some Current Population Survey data.  The files
> themselves are huge (145MB for each month, 13x12 months), and the data
> format is fixed-width tables [...] I only need part of each row (always
> the same columns) [...]

I'd read the lines of the file into a sequence of type BASE-STRING and
then extract the relevant parts of the input using NSUBSEQ[1].


Regards.

[1] http://darcs.informatimago.com/lisp/common-lisp/utility.lisp
From: Pascal J. Bourguignon
Subject: Re: reading large fixed-width datasets
Date: 
Message-ID: <87iqrwz4fb.fsf@hubble.informatimago.com>
Volkan YAZICI <·············@gmail.com> writes:

> I'd read the lines of the file into a sequence of type BASE-STRING and
> then extract the relevant parts of the input using NSUBSEQ[1].

For string parts, yes.  However, for integers, PARSE-INTEGER takes a
pair of :START and :END parameters to avoid having to split the
string.

> Regards.
>
> [1] http://darcs.informatimago.com/lisp/common-lisp/utility.lisp

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/
Our enemies are innovative and resourceful, and so are we. They never
stop thinking about new ways to harm our country and our people, and
neither do we. -- Georges W. Bush
From: Thomas A. Russ
Subject: Re: reading large fixed-width datasets
Date: 
Message-ID: <ymitzbdfyrj.fsf@blackcat.isi.edu>
Volkan YAZICI <·············@gmail.com> writes:

> I'd read the lines of the file into a sequence of type BASE-STRING and
> then extract the relevant parts of the input using NSUBSEQ[1].
>
> [1] http://darcs.informatimago.com/lisp/common-lisp/utility.lisp

Well, that only works if you either use a new sequence to hold each
line as it comes in, or, with a fixed buffer, process the entire line
at once and don't save any of the information for further processing.

-- 
Thomas A. Russ,  USC/Information Sciences Institute
From: William James
Subject: Re: reading large fixed-width datasets
Date: 
Message-ID: <4581a29c-1baf-4752-ba21-8b2337ef8dee@i76g2000hsf.googlegroups.com>
On Oct 13, 10:44 am, Tamas K Papp <······@gmail.com> wrote:
> I am working with some Current Population Survey data. [...]
>
> 002600310997690 82008 120100-1 1 1-1 1-3-1-1-1  36409348 1 1 7 7 2 0
> [...]
>
> 1) read one row at a time,
> 2) extract the columns I need and convert them to integers (mostly
> fixnums),
> 3) save them in a matrix.

Ruby:

fields = [9,3], [38,2]   # [start, length] pairs
matrix = []
IO.foreach("junk1"){|line|
  matrix << fields.map{|p,n| line[p,n].to_i }}
From: Rainer Joswig
Subject: Re: reading large fixed-width datasets
Date: 
Message-ID: <joswig-1872AD.19103913102008@news-europe.giganews.com>
In article <··············@mid.individual.net>,
 Tamas K Papp <······@gmail.com> wrote:

> I am working with some Current Population Survey data.  The files
> themselves are huge (145MB for each month, 13x12 months), and the data
> format is fixed-width tables [...]
>
> 1) read one row at a time,
> 2) extract the columns I need and convert them to integers (mostly
> fixnums),
> 3) save them in a matrix.
>
> I would like to do this as fast as possible.  What is the best way?  I am
> using SBCL if that matters.

Usually I would use a line-reader that uses a buffer, so
that the line string can be reused.

Then use PARSE-INTEGER to extract the integers from the line:

CL-USER 1 > (parse-integer "123456" :start 2 :end 4)
34
4

CL-USER 2 > (parse-integer "==34==" :start 2 :end 4)
34
4

CL-USER 3 > (parse-integer "==-34==" :start 2 :end 5)
-34
5



CL-USER 8 > (let ((lines '("12345678" "45678901"))
                  (bounds '((1 3) (4 8))))
              (loop for line in lines
                    collect (loop for (start end) in bounds
                                  collect (parse-integer line :start start :end end))))
((23 5678) (56 8901))

-- 
http://lispm.dyndns.org/
From: Pascal J. Bourguignon
Subject: Re: reading large fixed-width datasets
Date: 
Message-ID: <87myh8z4k2.fsf@hubble.informatimago.com>
Rainer Joswig <······@lisp.de> writes:

> [...]
>
> Usually I would use a line-reader that uses a buffer, so
> that the line string can be reused.
>
> Then use PARSE-INTEGER to extract the integers from the line:

Way too slow!
He said he wanted to do this as fast as possible, in the best way!!!

The absolutely fastest way to do something is not to do it at all.  I
mean lazy evaluation.  Don't even try to read the file (right now); you
might very well not even need it.

Also, notice that this file seems to contain only ASCII bytes.  Why
waste memory and processing time converting these bytes into lisp
CHARACTERs?  When you need to read it, be sure to do it as binary.

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/
Our enemies are innovative and resourceful, and so are we. They never
stop thinking about new ways to harm our country and our people, and
neither do we. -- Georges W. Bush
From: Richard M Kreuter
Subject: Re: reading large fixed-width datasets
Date: 
Message-ID: <87prm4e1m9.fsf@progn.net>
Tamas K Papp <······@gmail.com> writes:

> [T]he plan is to
>
> 1) read one row at a time,
> 2) extract the columns I need and convert them to integers (mostly 
> fixnums),
> 3) save them in a matrix.
>
> I would like to do this as fast as possible.  What is the best way?  I am 
> using SBCL if that matters.

Generally speaking, things that can make Lisp I/O more costly than it
might otherwise be are consing fresh objects for units of input,
transcoding data according to an external-format, and possibly some
undesirable consing and/or copying in the streams implementation.

To control transcoding cost, you might do integer I/O by opening the
file with an element-type that's a suitable subtype of UNSIGNED-BYTE,
and decoding the input manually (or not at all).  To reduce consing of
new objects, you might use READ-SEQUENCE with a preallocated buffer.
You can't necessarily do anything to avoid undesirable stuff in the
streams implementation, if there is any.
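
For instance, a minimal sketch along those lines (the file name is made
up, the 69-byte record length comes from the sample data, and
PROCESS-ROW stands in for whatever decoding you do):

(with-open-file (s "cps-month.dat" :element-type '(unsigned-byte 8))
  (let ((row (make-array 69 :element-type '(unsigned-byte 8))))
    (loop while (= (read-sequence row s) 69)   ; 68 data bytes + newline
          do (process-row row))))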

Depending on your problem, you might also consider processing the input
into some binary format you can read in more quickly, or into something
you can memory map.

In any case, I'd suggest benchmarking a couple of approaches and seeing
whether there's any difference in wall-clock time, since the operating
system's performance is liable to dominate anything you can do in Lisp.
(In one recent I/O profiling session, it turned out that the file
system's behavior when reading from a heavily fragmented file accounted
for a factor of 10 slowdown in the program's wall-clock duration.)

--
RmK
From: Thomas F. Burdick
Subject: Re: reading large fixed-width datasets
Date: 
Message-ID: <636f8cd4-1d1e-4b29-85ab-3757a18fb166@q26g2000prq.googlegroups.com>
Tamas K Papp wrote:
> I am working with some Current Population Survey data.  The files
> themselves are huge (145MB for each month, 13x12 months) [...]
>
> I would like to do this as fast as possible.  What is the best way?  I am
> using SBCL if that matters.

I would mmap the file.  If you can edit it slightly, you could add the
necessary bits to the head and map it directly as a simple-base-string.
Otherwise, treat it as an alien vector of chars.
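
A rough sketch of the mmap route, assuming SBCL's sb-posix bindings
(file name invented, error handling omitted):

(require :sb-posix)

(let* ((fd   (sb-posix:open "cps-month.dat" sb-posix:o-rdonly))
       (size (sb-posix:stat-size (sb-posix:fstat fd)))
       (sap  (sb-posix:mmap nil size sb-posix:prot-read
                            sb-posix:map-private fd 0)))
  (unwind-protect
      ;; SAP-REF-8 reads one raw byte; combine with the arithmetic
      ;; from Pascal's post to decode fields without consing.
      (sb-sys:sap-ref-8 sap 0)
    (sb-posix:munmap sap size)
    (sb-posix:close fd)))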
From: ·······@eurogaran.com
Subject: Re: reading large fixed-width datasets
Date: 
Message-ID: <9f562a44-de12-40ae-a6e7-d80f6da76e53@t54g2000hsg.googlegroups.com>
Tamas K Papp wrote:
> Hi,
>
> I am working with some Current Population Survey data.  The files
> themselves are huge (145MB for each month 13x12 months), and the data
> format is fixed-width tables, like this:
>
> 002600310997690 82008 120100-1 1 1-1 1-3-1-1-1  36409348 1 1 7 7 2 0

Think about having the data in a database (MySQL is good).
Consider future data growth, no loading time, etc.
Very good libraries to manage and access MySQL databases are
available for most Common Lisp implementations.
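
For example, with CLSQL (the connection details and table layout here
are invented):

;; After loading the CLSQL system:
(clsql:connect '("localhost" "cps" "user" "password")
               :database-type :mysql)

;; Hypothetical table with one row per household-month:
(clsql:query "SELECT hhid, empstat FROM cps WHERE month = 200801")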
From: Volkan YAZICI
Subject: Re: reading large fixed-width datasets
Date: 
Message-ID: <d516848e-bb9a-4f31-97f8-c8e8c3e26eed@q9g2000hsb.googlegroups.com>
On Oct 13, 6:44 pm, Tamas K Papp <······@gmail.com> wrote:
> ...
>
> 002600310997690 82008 120100-1 1 1-1 1-3-1-1-1  36409348 1 1 7 7 2 0
> 002909470991091 82008 220100-1 1 1-1 014-1-1-1  34777557 1 5 1 8 1 1
> 002909470991091 82008 220100-1 1 1-1 014-1-1-1  34777557 1 5 1 8 1 1
> 002909470991091 82008 220100-1 1 1-1 014-1-1-1  34777557 1 5 1 8 1 1
>
> ...
>
> 1) read one row at a time,
> 2) extract the columns I need and convert them to integers
>    (mostly fixnums),
> 3) save them in a matrix.

Do you mind if I ask what you are planning to do with these parsed
fields?  Because, without knowing the purpose of the parsed lines,
one will need to allocate a sequence for each line and create another
array to hold string (or numeric) representations of the specified
fields.  Even the most memory/processor-efficient way of doing this
will still be O(n).  If you tell us more about your aim, I'm sure
you'll receive much more sensible replies.


Regards.
From: Tamas K Papp
Subject: Re: reading large fixed-width datasets
Date: 
Message-ID: <6ljiijFcqlrmU1@mid.individual.net>
On Tue, 14 Oct 2008 03:56:14 -0700, Volkan YAZICI wrote:

> On Oct 13, 6:44 pm, Tamas K Papp <······@gmail.com> wrote:
> > [...]
> 
> Do you mind if I ask what you are planning to do with these parsed
> fields?  Because, without knowing the purpose of the parsed lines,
> one will need to allocate a sequence for each line and create another
> array to hold string (or numeric) representations of the specified
> fields.  Even the most memory/processor-efficient way of doing this
> will still be O(n).  If you tell us more about your aim, I'm sure
> you'll receive much more sensible replies.

Fair question.

Once the entries are read, I need to match entries using the 15-digit 
household identifier.  Then I will have a time series for each household, 
which I will analyze using Bayesian methods.

There are 56000 households in each month's sample, and the panel is 
rotated at frequency 1/8, so roughly there will be more than a million 
households.  I still have to see whether I can fit that in RAM (4GB) or 
whether I need a database backend.  I have a chance of fitting stuff in 
RAM because for the main analysis I only need employment status 
transitions (Unemployment->Employment, E->E, E->U) and that can be coded 
using 4 bits for each transition.
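
(In SBCL a specialized array should give me that packing; a sketch
with a made-up length:)

(defvar *transitions*
  ;; SBCL stores (unsigned-byte 4) elements packed, two per byte.
  (make-array 8000000 :element-type '(unsigned-byte 4)
                      :initial-element 0))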

I am still exploring how to do the matching efficiently.  E.g. I could 
put the households in a hash table by ID, and add observations from 
lines as they are read.
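
A rough sketch of what I have in mind (the column positions are made
up; the real ones come from the CPS documentation):

(defvar *households* (make-hash-table :test #'equal))

(defun add-line (line)
  ;; Key on the 15-digit household identifier, kept as a string so
  ;; that leading zeros survive; push this line's fields onto it.
  (let ((id (subseq line 0 15))
        (fields (loop for (start end) in '((16 21) (32 34) (34 36))
                      collect (parse-integer line :start start :end end))))
    (push fields (gethash id *households*))))

(defun read-cps-file (pathname)
  (with-open-file (s pathname)
    (loop for line = (read-line s nil nil)
          while line
          do (add-line line))))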

I would like to thank everyone for the replies; they were really 
helpful.  I think that initially I will read line-by-line and use 
parse-integer.

Thanks,

Tamas
From: Thomas A. Russ
Subject: Re: reading large fixed-width datasets
Date: 
Message-ID: <ymiy70q7uvf.fsf@blackcat.isi.edu>
Tamas K Papp <······@gmail.com> writes:

> I think that initially I will read line-by-line and use
> parse-integer.

Sounds reasonable (especially since that was my solution).  I would
suggest using READ-SEQUENCE into a pre-allocated string buffer rather
than READ-LINE, since READ-LINE will create a new string for each line
and that will result in a lot of garbage.

-- 
Thomas A. Russ,  USC/Information Sciences Institute
From: Pascal J. Bourguignon
Subject: Re: reading large fixed-width datasets
Date: 
Message-ID: <7cwsgb5o3a.fsf@pbourguignon.anevia.com>
Tamas K Papp <······@gmail.com> writes:
> Once the entries are read, I need to match entries using the 15-digit 
> household identifier.  Then I will have a time series for each household, 
> which I will analyze using Bayesian methods.

Therefore you don't need the whole data set in RAM at the same time.
You only need the data concerning one household.

> There are 56000 households in each month's sample, and the panel is 
> rotated at frequency 1/8, so roughly there will be more than a million 
> households.  I still have to see whether I can fit that in RAM (4GB) or 
> whether I need a database backend.  I have a chance of fitting stuff in 
> RAM because for the main analysis I only need employment status 
> transitions (Unemployment->Employment, E->E, U->E) and that can be coded 
> using 4 bits for each transition.


That's about 150M/56K = 2678 bytes per household.  There should be no
problem storing that in RAM, and processing it in any way you need;
there's no point in trying to optimize anything here.


> I am still exploring how to do the matching efficiently.  Eg I could put 
> the households in a hash table by ID, and add observations from lines as 
> they are read.

The simplest would be to keep the files sorted by household.  If you
don't want to sort them, then it's simple enough to build an index to
be able to read only the records concerning the household currently
being processed.
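
A sketch of such an index, assuming the household ID occupies the
first 15 columns as in the sample:

(defun build-index (pathname)
  ;; Map each household ID to the file positions of its lines, so a
  ;; later pass can FILE-POSITION straight to them.
  (let ((index (make-hash-table :test #'equal)))
    (with-open-file (s pathname)
      (loop for position = (file-position s)
            for line = (read-line s nil nil)
            while line
            do (push position (gethash (subseq line 0 15) index))))
    index))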

> I would like to thank everyone for the replies, they were really 
> helpful.  I think that initially I will read line-by-line and use parse-
> integer.

Yes, I don't see that you need the very best fastest faster way there.

-- 
__Pascal Bourguignon__
From: Tamas K Papp
Subject: Re: reading large fixed-width datasets
Date: 
Message-ID: <6ljllhFcp3ivU1@mid.individual.net>
On Tue, 14 Oct 2008 14:59:52 +0200, Pascal J. Bourguignon wrote:

> Tamas K Papp <······@gmail.com> writes:
>> Once the entries are read, I need to match entries using the 15-digit
>> household identifier.  Then I will have a time series for each
>> household, which I will analyze using Bayesian methods.
> 
> Therefore you don't need the whole data set in RAM at the same time. You
> only need the data concerning one household.

But I do need the (matched, processed) data in RAM.  Bayesian analysis 
works by calculating the posterior density (which means going through the 
whole data set) many (think millions) times.

> That's about 150M/56K = 2678 bytes per household.  There should be no
> problem storing that in RAM, and processing it in any way you need; there's
> no point in trying to optimize anything here.

Later on I plan to include other independent variables (age, sex, 
education, etc).  But I agree, that might fit too.

> The simplest would be to keep the files sorted by household.  If you
> don't want to sort them, then it's simple enough to build an index to be
> able to read only the records concerning the household currently being
> processed.
> [...]
> Yes, I don't see that you need the very best fastest faster way there.

I think you misunderstand.  The raw data for _each_ month is 150MB.  
There are 150 months.  At any given moment, there are 56k households, but 
each is in the sample for only 8 months, so the total number of 
households is much higher.  Extracting the data fast and fitting the 
whole thing in RAM are both important: having large chunks of data in 
the CPU cache would speed up my analysis considerably.

Anyhow, preliminary benchmarking indicates that for the data extraction 
with parse-integer, the hard disk speed is my bottleneck.  So now I will 
profile the Bayesian analysis.

Thanks,

Tamas