Hi,
I am working with some Current Population Survey data. The files
themselves are huge (145MB for each month, 13x12 months), and the data
format is fixed-width tables, like this:
002600310997690 82008 120100-1 1 1-1 1-3-1-1-1 36409348 1 1 7 7 2 0
002909470991091 82008 220100-1 1 1-1 014-1-1-1 34777557 1 5 1 8 1 1
002909470991091 82008 220100-1 1 1-1 014-1-1-1 34777557 1 5 1 8 1 1
002909470991091 82008 220100-1 1 1-1 014-1-1-1 34777557 1 5 1 8 1 1
E.g. those 1-1's are a 1 and a -1. The column width and positions are
known in advance and don't change. I only need part of each row (always
the same columns), so the plan is to
1) read one row at a time,
2) extract the columns I need and convert them to integers (mostly
fixnums),
3) save them in a matrix.
I would like to do this as fast as possible. What is the best way? I am
using SBCL if that matters.
Thanks,
Tamas
Tamas K Papp <······@gmail.com> writes:

> [...] The column width and positions are known in advance and don't
> change. I only need part of each row (always the same columns), so the
> plan is to
>
> 1) read one row at a time,
> 2) extract the columns I need and convert them to integers (mostly
> fixnums),
> 3) save them in a matrix.
>
> I would like to do this as fast as possible. What is the best way? I am
> using SBCL if that matters.
I would think of using READ-SEQUENCE to get each row. With a fixed
size, you can align your read buffer to the row size of the data. Just
remember to allow for the end-of-line character!
Alternately, and perhaps more simply, you could use READ-LINE, but
READ-SEQUENCE is likely to be faster.
Once you have the data in a buffer, you should be able to use
PARSE-INTEGER with :START and :END arguments to extract the portions of
the line that you need.
(defvar *buffer* (make-string 69))  ;; line length of 68 + 1 for EOL

(defun read-next-line (buffer stream)
  (read-sequence buffer stream))

(defun extract-fields (buffer field-list)
  (loop for (name start end) in field-list
        do (format t "~A = ~D~%"
                   name (parse-integer buffer :start start :end end))))

(with-input-from-string (s "002909470991091 82008 220100-1 1 1-1 014-1-1-1 34777557 1 5 1 8 1 1")
  (read-next-line *buffer* s)
  (extract-fields *buffer* '((first 16 21) (one 32 34) (negative 34 36))))
--
Thomas A. Russ, USC/Information Sciences Institute
On Oct 13, 5:43 pm, ····@sevak.isi.edu (Thomas A. Russ) wrote:
> Once you have the data in a buffer, you should be able to use
> PARSE-INTEGER with :START and :END arguments to extract the portions of
> the line that you need.
Another useful trick is not to extract the data using subseq but to
define arrays which are displaced onto your buffer. When these arrays
have fill-pointers you can also resize them at will to get a cheap
`window` onto the data.

Of course this is only useful when the data item in question is a
string, but it allows you to drastically reduce the amount of interim
garbage which can be created.
(defvar *buffer* (make-string 69))

;; Optionally add :fill-pointer t
(defun field (size &optional (start 0))
  (make-array size :element-type (array-element-type *buffer*)
              :displaced-to *buffer* :displaced-index-offset start))

(defvar *field-one* (field 15))
(defvar *field-two* (field 5 16))
(defvar *field-three* (field 8 22))

(defun read-next-line (buffer stream)
  (read-sequence buffer stream))

(with-input-from-string (s "002909470991091 82008 220100-1 1 1-1 014-1-1-1 34777557 1 5 1 8 1 1")
  (read-next-line *buffer* s)
  (values *field-one* *field-two* *field-three*))

=> "002909470991091",
   "82008",
   "220100-1"
I've used this technique to good effect when parsing GB's worth of UK
postcode files.
sean.
sross <······@gmail.com> writes:
> On Oct 13, 5:43 pm, ····@sevak.isi.edu (Thomas A. Russ) wrote:
> > Once you have the data in a buffer, you should be able to use
> > PARSE-INTEGER with :START and :END arguments to extract the portions of
> > the line that you need.
...
> > (defun extract-fields (buffer field-list)
> >   (loop for (name start end) in field-list
> >         do (format t "~A = ~D~%"
> >                    name (parse-integer buffer :start start :end end))))

> Another useful trick is not to extract the data using subseq but to
> define arrays which are displaced onto your buffer. When these arrays
> have fill-pointers you can also resize them at will to get a cheap
> `window` onto the data. Of course this is only useful when the data
> item in question is a string, but it allows you to drastically reduce
> the amount of interim garbage which can be created.
Well, avoiding SUBSEQ is precisely why I use the START and END arguments
to PARSE-INTEGER. I'm pretty sure that any decent implementation of
that function will not use SUBSEQ internally to create a separate, new
sequence, but will instead use the given sequence and the indices to
operate on the original data.
I can't imagine anyone doing it differently, especially since one
motivation for including those START and END keywords to so many
sequence functions has to be to allow the operation on subsequences
without the need to explicitly create them and then have them be garbage
collected.
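For illustration, a minimal sketch of the difference (field positions
taken from my earlier example):

(defparameter *line* "002909470991091 82008 220100-1 1 1-1 014-1-1-1 34777557 1 5 1 8 1 1")

;; conses a fresh 5-character string on every call:
(parse-integer (subseq *line* 16 21))      ; => 82008

;; operates directly on the original string; no intermediate garbage:
(parse-integer *line* :start 16 :end 21)   ; => 82008, 21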
--
Thomas A. Russ, USC/Information Sciences Institute
On Oct 13, 10:50 pm, ····@sevak.isi.edu (Thomas A. Russ) wrote:
> Well, avoiding SUBSEQ is precisely why I use the START and END arguments
> to PARSE-INTEGER. I'm pretty sure that any decent implementation of
> that function will not use SUBSEQ internally to create a separate, new
> sequence, but will instead use the given sequence and the indices to
> operate on the original data.
Indeed, but sometimes it is not appropriate to use parse-integer on
data elements, and when it isn't, displaced arrays can be a useful
tool.

In this example 220100-1 does not have a particularly meaningful
integer representation, and parse-integer may not be what you are
after if you want to keep the leading zeros in 002909470991091.
sean.
sross <······@gmail.com> writes:
> In this example 220100-1 does not have a particularly meaningful
> integer representation
Well, given the notes in the OP's message, and the fact that these are
fixed width fields and thus not necessarily white-space delimited, there
are a number of potential integer representations, among them:
(220100 -1) and (220 100 -1), etc.
It just depends on where the field boundaries are, which is why they get
specified separately.
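For instance, placing the boundaries one plausible way (a small
sketch; offsets are relative to the 8-character field):

(let ((field "220100-1"))
  (values (parse-integer field :start 0 :end 6)     ; 220100
          (parse-integer field :start 6 :end 8)))   ; -1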
--
Thomas A. Russ, USC/Information Sciences Institute
On Oct 14, 9:58 pm, ····@sevak.isi.edu (Thomas A. Russ) wrote:
> Well, given the notes in the OP's message, and the fact that these are
> fixed width fields and thus not necessarily white-space delimited, there
> are a number of potential integer representations, among them:
> (220100 -1) and (220 100 -1), etc.
>
> It just depends on where the field boundaries are, which is why they get
> specified separately.
I think you are missing the point of my post, namely:
- parse-integer is useful.
- it is not necessarily the best or only option.
sean
···@sevak.isi.edu (Thomas A. Russ) writes:
> Alternately, and perhaps more simply, you could use READ-LINE, but
> READ-SEQUENCE is likely to be faster.
>
> Once you have the data in a buffer, you should be able to use
> PARSE-INTEGER with :START and :END arguments to extract the portions of
> the line that you need.
Yes, but it would be faster to avoid converting to CHARACTER, so let's
use a vector of (unsigned-byte 8), and assume ASCII code.
For a two-byte-wide field:

  (= (+ (* (- hi 48) 10) (- lo 48))
     (+ (* hi 10) lo -528))
(defun get-2-digit-integer (buffer position)
  (let ((hi (aref buffer position))
        (lo (aref buffer (1+ position))))
    (case hi
      ((32) (- lo 48))                      ; leading space, e.g. " 1"
      ((45) (- 48 lo))                      ; leading minus, e.g. "-1"
      (otherwise (+ (* hi 10) lo -528)))))  ; two digits, e.g. "14"
and so on.
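Generalized to an arbitrary field width, something like this (an
untested sketch; it assumes a field holds only leading spaces, an
optional minus sign, and digits):

(defun get-integer (buffer start end)
  (let ((i start))
    ;; skip leading spaces (code 32)
    (loop while (and (< i end) (= (aref buffer i) 32))
          do (incf i))
    (let ((negativep (and (< i end) (= (aref buffer i) 45)))) ; minus sign
      (when negativep (incf i))
      (let ((n 0))
        ;; accumulate decimal digits (codes 48 to 57)
        (loop while (< i end)
              do (setf n (+ (* n 10) (- (aref buffer i) 48)))
                 (incf i))
        (if negativep (- n) n)))))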
Once you avoid converting to characters, you can even use mmap...
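On SBCL, the shape of it would be roughly this (a sketch from memory;
check the sb-posix API of your version):

(require :sb-posix)

(defun sum-row-bytes (path)
  ;; map PATH read-only and touch its bytes without any stream I/O
  (let* ((fd (sb-posix:open path sb-posix:o-rdonly))
         (size (sb-posix:stat-size (sb-posix:fstat fd)))
         (sap (sb-posix:mmap nil size sb-posix:prot-read
                             sb-posix:map-private fd 0)))
    (unwind-protect
         (loop for i below (min size 69)   ; e.g. the first row
               sum (sb-sys:sap-ref-8 sap i))
      (sb-posix:munmap sap size)
      (sb-posix:close fd))))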
--
__Pascal Bourguignon__ http://www.informatimago.com/
Our enemies are innovative and resourceful, and so are we. They never
stop thinking about new ways to harm our country and our people, and
neither do we. -- Georges W. Bush
On Oct 13, 6:44 pm, Tamas K Papp <······@gmail.com> wrote:
> I am working with some Current Population Survey data. The files
> themselves are huge (145MB for each month, 13x12 months), and the data
> format is fixed-width tables. [...] I only need part of each row (always
> the same columns). [...] I would like to do this as fast as possible.
I'd read lines of the file into a sequence of type BASE-STRING and then
divide related parts of the input using NSUBSEQ[1].
Regards.
[1] http://darcs.informatimago.com/lisp/common-lisp/utility.lisp
Volkan YAZICI <·············@gmail.com> writes:
> I'd read lines of file into a sequence of type BASE-STRING and then
> divide related parts of the input using NSUBSEQ[1].
For string parts, yes. However, for integers, PARSE-INTEGER takes a
pair of :START and :END parameters to avoid having to split the
string.
> Regards.
>
> [1] http://darcs.informatimago.com/lisp/common-lisp/utility.lisp
--
__Pascal Bourguignon__ http://www.informatimago.com/
Our enemies are innovative and resourceful, and so are we. They never
stop thinking about new ways to harm our country and our people, and
neither do we. -- Georges W. Bush
Volkan YAZICI <·············@gmail.com> writes:
> I'd read lines of file into a sequence of type BASE-STRING and then
> divide related parts of the input using NSUBSEQ[1].
>
> [1] http://darcs.informatimago.com/lisp/common-lisp/utility.lisp
Well, that would only work if you either use a new sequence to hold each
line as it comes in, or use a fixed buffer and process the entire line
at once without saving any of the information for further processing.
--
Thomas A. Russ, USC/Information Sciences Institute
On Oct 13, 10:44 am, Tamas K Papp <······@gmail.com> wrote:
> [...] The column width and positions are known in advance and don't
> change. I only need part of each row (always the same columns), so the
> plan is to
>
> 1) read one row at a time,
> 2) extract the columns I need and convert them to integers (mostly
> fixnums),
> 3) save them in a matrix.
Ruby:

fields = [[9, 3], [38, 2]]
matrix = []
IO.foreach("junk1") { |line|
  matrix << fields.map { |p, n| line[p, n].to_i } }
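For comparison, a direct Common Lisp transliteration (same hypothetical
"junk1" file and position/length pairs):

(defun read-matrix (path fields)
  (with-open-file (in path)
    (loop for line = (read-line in nil)
          while line
          collect (mapcar (lambda (field)
                            (destructuring-bind (pos len) field
                              (parse-integer line :start pos :end (+ pos len))))
                          fields))))

;; (read-matrix "junk1" '((9 3) (38 2)))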
In article <··············@mid.individual.net>,
Tamas K Papp <······@gmail.com> wrote:

> [...] I only need part of each row (always the same columns), so the
> plan is to
>
> 1) read one row at a time,
> 2) extract the columns I need and convert them to integers (mostly
> fixnums),
> 3) save them in a matrix.
Usually I would use a line-reader that uses a buffer, so
that the line string can be reused.
Then use PARSE-INTEGER to extract the integers from the line:
CL-USER 1 > (parse-integer "123456" :start 2 :end 4)
34
4

CL-USER 2 > (parse-integer "==34==" :start 2 :end 4)
34
4

CL-USER 3 > (parse-integer "==-34==" :start 2 :end 5)
-34
5

CL-USER 8 > (let ((lines  '("12345678" "45678901"))
                  (bounds '((1 3) (4 8))))
              (loop for line in lines
                    collect (loop for (start end) in bounds
                                  collect (parse-integer line :start start :end end))))
((23 5678) (56 8901))
--
http://lispm.dyndns.org/
Rainer Joswig <······@lisp.de> writes:

> Usually I would use a line-reader that uses a buffer, so
> that the line string can be reused.
>
> Then use PARSE-INTEGER to extract the integers from the line:
Way too slow!
He said he wanted to do this as fast as possible, in the best way!!!
The absolutely fastest way to do something is not to do it at all. I
mean lazy evaluation. Don't even try to read the file (right now);
you might very well not even need it.

Also, notice that this file seems to contain only ASCII bytes. Why
waste memory and processing time converting these bytes into lisp
CHARACTERs? When you need to read it, be sure to do it as binary.
--
__Pascal Bourguignon__ http://www.informatimago.com/
Our enemies are innovative and resourceful, and so are we. They never
stop thinking about new ways to harm our country and our people, and
neither do we. -- Georges W. Bush
From: Richard M Kreuter
Tamas K Papp <······@gmail.com> writes:
> [T]he plan is to
>
> 1) read one row at a time,
> 2) extract the columns I need and convert them to integers (mostly
> fixnums),
> 3) save them in a matrix.
>
> I would like to do this as fast as possible. What is the best way? I am
> using SBCL if that matters.
Generally speaking, things that can make Lisp I/O more costly than it
might otherwise be are consing fresh objects for units of input,
transcoding data according to an external-format, and possibly some
undesirable consing and/or copying in the streams implementation.
To control transcoding cost, you might do integer I/O by opening the
file with an element-type that's a suitable subtype of UNSIGNED-BYTE,
and decoding the input manually (or not at all). To reduce consing of
new objects, you might use READ-SEQUENCE with a preallocated buffer.
You can't necessarily do anything to avoid undesirable stuff in the
streams implementation, if there is any.
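Concretely, a sketch of those two suggestions combined (the 69-byte row
length, newline included, is borrowed from earlier in the thread):

(defun map-rows (path row-length function)
  ;; call FUNCTION on each fixed-width row, reusing one byte buffer
  (with-open-file (in path :element-type '(unsigned-byte 8))
    (let ((buffer (make-array row-length :element-type '(unsigned-byte 8))))
      (loop while (= (read-sequence buffer in) row-length)
            do (funcall function buffer)))))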
Depending on your problem, you might also consider processing the input
into some binary format you can read in more quickly, or into something
you can memory map.
In any case, I'd suggest benchmarking a couple of approaches and seeing
whether there's any difference in wall-clock time, since the operating
system's performance is liable to dominate anything you can do in Lisp.
(In one recent I/O profiling session, it turned out that the file
system's behavior when reading from a heavily fragmented file accounted
for a factor of 10 slowdown in the program's wall-clock duration.)
--
RmK
Tamas K Papp wrote:
> [...] I would like to do this as fast as possible. What is the best
> way? I am using SBCL if that matters.
I would mmap the file. If you can edit it slightly, you could add the
necessary bits to the head and map it directly as a simple-base-string.
Otherwise, treat it as an alien vector of chars.
Tamas K Papp wrote:
> Hi,
>
> I am working with some Current Population Survey data. The files
> themselves are huge (145MB for each month 13x12 months), and the data
> format is fixed-width tables, like this:
>
> 002600310997690 82008 120100-1 1 1-1 1-3-1-1-1 36409348 1 1 7 7 2 0
Think about having the data in a database (MySQL is good).
Consider future data growth, no loading time, etc.
Very good libraries to manage and access (MySQL) databases are
available for most Common Lisp implementations.
On Oct 13, 6:44 pm, Tamas K Papp <······@gmail.com> wrote:
> ...
>
> 002600310997690 82008 120100-1 1 1-1 1-3-1-1-1 36409348 1 1 7 7 2 0
> 002909470991091 82008 220100-1 1 1-1 014-1-1-1 34777557 1 5 1 8 1 1
> 002909470991091 82008 220100-1 1 1-1 014-1-1-1 34777557 1 5 1 8 1 1
> 002909470991091 82008 220100-1 1 1-1 014-1-1-1 34777557 1 5 1 8 1 1
>
> ...
>
> 1) read one row at a time,
> 2) extract the columns I need and convert them to integers
> (mostly fixnums),
> 3) save them in a matrix.
Do you mind if I ask what you are planning to do with these parsed
fields? Because, without specifying the purpose of the parsed lines,
one will need to allocate a sequence for each line and create another
array to hold string (or numeric) representations of the specified
fields. It's obvious that even the most memory/processor-efficient way
of doing this will still be O(n). If you tell us more about your aim,
I'm sure you'll receive much more sensible replies.
Regards.
On Tue, 14 Oct 2008 03:56:14 -0700, Volkan YAZICI wrote:
> Do you mind if I ask what you are planning to do with these parsed
> fields? Because, without specifying the purpose of the parsed lines,
> one will need to allocate a sequence for each line and create another
> array to hold string (or numeric) representations of the specified
> fields. It's obvious that even the most memory/processor-efficient way
> of doing this will still be O(n). If you tell us more about your aim,
> I'm sure you'll receive much more sensible replies.
Fair question.
Once the entries are read, I need to match entries using the 15-digit
household identifier. Then I will have a time series for each household,
which I will analyze using Bayesian methods.
There are 56000 households in each month's sample, and the panel is
rotated at frequency 1/8, so roughly there will be more than a million
households. I still have to see whether I can fit that in RAM (4GB) or
whether I need a database backend. I have a chance of fitting stuff in
RAM because for the main analysis I only need employment status
transitions (Unemployment->Employment, E->E, U->E) and that can be coded
using 4 bits for each transition.
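A sketch of the packing I have in mind (the encoding itself is
hypothetical at this point):

;; 2 bits per status, so a transition fits in 4 bits
(deftype status () '(integer 0 3))     ; e.g. 0 = E, 1 = U, 2 = N

(defun code-transition (from to)
  (declare (type status from to))
  (logior (ash from 2) to))

;; SBCL specializes this to a genuine 4-bit-per-element array
(defvar *transitions*
  (make-array 1000000 :element-type '(unsigned-byte 4)))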
I am still exploring how to do the matching efficiently. E.g. I could put
the households in a hash table by ID, and add observations from lines as
they are read.
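Something like this, perhaps (PARSE-EMPLOYMENT-STATUS is just a
stand-in for the actual field extraction):

(defvar *households* (make-hash-table :test #'equal))

(defun add-observation (line month)
  ;; keep the ID as a string so leading zeros survive
  (let ((id (subseq line 0 15)))
    (push (cons month (parse-employment-status line)) ; stand-in parser
          (gethash id *households* '()))))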
I would like to thank everyone for the replies, they were really
helpful. I think that initially I will read line-by-line and use
parse-integer.
Thanks,
Tamas
Tamas K Papp <······@gmail.com> writes:
> I think that initially I will read line-by-line and use
> parse-integer.
Sounds reasonable (especially since that was my solution). I would
suggest using READ-SEQUENCE into a pre-allocated string buffer rather
than READ-LINE, since READ-LINE will create a new string for each line
and that will result in a lot of garbage.
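Putting the pieces together, roughly (assuming the 69-character rows,
newline included, from earlier in the thread):

(defun process-file (path fields function)
  (let ((buffer (make-string 69)))     ; 68 characters + newline
    (with-open-file (in path)
      (loop while (= (read-sequence buffer in) 69)
            do (funcall function
                        (mapcar (lambda (bounds)
                                  (parse-integer buffer
                                                 :start (first bounds)
                                                 :end (second bounds)))
                                fields))))))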
--
Thomas A. Russ, USC/Information Sciences Institute
Tamas K Papp <······@gmail.com> writes:
> Once the entries are read, I need to match entries using the 15-digit
> household identifier. Then I will have a time series for each household,
> which I will analyze using Bayesian methods.
Therefore you don't need the whole data set in RAM at the same time.
You only need the data concerning one household.
> There are 56000 households in each month's sample, and the panel is
> rotated at frequency 1/8, so roughly there will be more than a million
> households. I still have to see whether I can fit that in RAM (4GB) or
> whether I need a database backend. I have a chance of fitting stuff in
> RAM because for the main analysis I only need employment status
> transitions (Unemployment->Employment, E->E, U->E) and that can be coded
> using 4 bits for each transition.
That's about 150M/56K = 2678 bytes per household. There should be no
problem storing that in RAM and processing it in any way you need;
there's no point in trying to optimize anything here.
> I am still exploring how to do the matching efficiently. Eg I could put
> the households in a hash table by ID, and add observations from lines as
> they are read.
The simplest would be to keep the files sorted by household. If you
don't want to sort them, then it's simple enough to build an index to
be able to read only the records concerning the household currently
being processed.
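Building such an index is only a few lines (a sketch, using the
thread's assumed layout: 69-character rows with the ID in the first 15
columns):

(defun build-index (path &optional (row-length 69))
  (let ((index (make-hash-table :test #'equal))
        (buffer (make-string row-length)))
    (with-open-file (in path)
      (loop for pos = (file-position in)
            while (= (read-sequence buffer in) row-length)
            do (push pos (gethash (subseq buffer 0 15) index '()))))
    ;; later: (file-position stream pos) jumps back to one household's rows
    index))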
> I would like to thank everyone for the replies, they were really
> helpful. I think that initially I will read line-by-line and use parse-
> integer.
Yes, I don't see that you need the very fastest way there.
--
__Pascal Bourguignon__
On Tue, 14 Oct 2008 14:59:52 +0200, Pascal J. Bourguignon wrote:
> Tamas K Papp <······@gmail.com> writes:
>> Once the entries are read, I need to match entries using the 15-digit
>> household identifier. Then I will have a time series for each
>> household, which I will analyze using Bayesian methods.
>
> Therefore you don't need the whole data set in RAM at the same time. You
> only need the data concerning one household.
But I do need the (matched, processed) data in RAM. Bayesian analysis
works by calculating the posterior density (which means going through the
whole data set) many (think millions) times.
> That's about 150M/56K = 2678 bytes per household. There should be no
> problem storing that in RAM and processing it in any way you need;
> there's no point in trying to optimize anything here.
Later on I plan to include other independent variables (age, sex,
education, etc). But I agree, that might fit too.
> The simplest would be to keep the files sorted by household. If you
> don't want to sort them, then it's simple enough to build an index to be
> able to read only the records concerning the household currently being
> processed.
> [...]
> Yes, I don't see that you need the very fastest way there.
I think you misunderstand. The raw data for _each_ month is 150MB.
There are 150 months. At any given moment, there are 56k households, but
each is in the sample for only 8 months, so the total number of
households is much higher. Extracting the data fast and fitting the
whole thing in RAM are important: having large chunks of data in the
CPU cache would speed up my analysis considerably.
Anyhow, preliminary benchmarking indicates that for the data extraction
with parse-integer, the hard disk speed is my bottleneck. So now I will
profile the Bayesian analysis.
Thanks,
Tamas