From: rif
Subject: Fast serialization?
Date: 
Message-ID: <wj0r735gpmb.fsf@five-percent-nation.mit.edu>
Actually, I'm not quite sure what I need.

Basically, I'm reading a bunch of files in, and producing some
intermediate results, which I want to save to disk, to later read in
again.  The original files are raw data, so I read them with
read-line, then I use split-sequence functions to get the fields out.
For the intermediate results, I'm storing them in instances of a CLOS
class, and writing them out to disk via a print-object method.  I
basically am storing them as

(class-name field-name-1 field-value-1 field-name-2 field-value-2 ...)

Then when I want to get them back for the next computation, I call
read to get a list, and convert the list with something like

(make-instance 'class-name 'field-name-1 (nth 2 list) 
                           'field-name-2 (nth 4 list) ... )
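
For concreteness, here's roughly the shape of what I'm doing now (the
class and slot names are made up for illustration):

(defclass record ()
  ((field-1 :initarg field-1 :accessor field-1)
   (field-2 :initarg field-2 :accessor field-2)))

(defmethod print-object ((r record) stream)
  ;; Writes (RECORD FIELD-1 <value-1> FIELD-2 <value-2>) to the stream.
  (format stream "(~S ~S ~S ~S ~S)"
          'record 'field-1 (field-1 r) 'field-2 (field-2 r)))

(defun read-record (stream)
  ;; Reads one such list back and rebuilds the instance by hand.
  (let ((list (read stream nil nil)))
    (when list
      (make-instance (first list)
                     'field-1 (nth 2 list)
                     'field-2 (nth 4 list)))))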

This works, but doesn't seem especially elegant (although it'd be more
elegant if I hid it in some macrology), and more importantly it's
slow.  Profiling (Pentium IV, SBCL 0.9.8) indicates that almost all
the runtime is spent in the read function (note: there is some lossage
because I should only traverse the list once, and calling (nth 2 list)
then (nth 4 list) on the same list is inefficient, but profiling
indicates the call to read is really the timesink).  The overall
running time is slow enough that I'd like to figure out a better
approach.

Any thoughts?

rif

From: JP Massar
Subject: Re: Fast serialization?
Date: 
Message-ID: <6clt525spkoqnna2906mt47fc75ae2vnn7@4ax.com>
On 07 May 2006 23:30:20 -0400, rif <···@mit.edu> wrote:

>
>Actually, I'm not quite sure what I need.
>
>Basically, I'm reading a bunch of files in, and producing some
>intermediate results, which I want to save to disk, to later read in
>again.  The original files are raw data, so I read them with
>read-line, then I use split-sequence functions to get the fields out.
>For the intermediate results, I'm storing them in instances of a CLOS
>class, and writing them out to disk via a print-object method.  I
>basically am storing them as
>
>(class-name field-name-1 field-value-1 field-name-2 field-value-2 ...)
>
>Then when I want to get them back for the next computation, I call
>read to get a list, and convert the list with something like
>
>(make-instance 'class-name 'field-name-1 (nth 2 list) 
>                           'field-name-2 (nth 4 list) ... )
>
>This works, but doesn't seem especially elegant (although it'd be more
>elegant if I hid it in some macrology), and more importantly it's
>slow.  Profiling (Pentium IV, SBCL 0.9.8) indicates that almost all
>the runtime is spent in the read function (note: there is some lossage
>because I should only traverse the list once, and calling (nth 2 list)
>then (nth 4 list) on the same list is inefficient, but profiling
>indicates the call to read is really the timesink).  The overall
>running time is slow enough that I'd like to figure out a better
>approach.
>
>Any thoughts?
 
It might be faster to read the file into a huge string using
READ-SEQUENCE, then use WITH-INPUT-FROM-STRING and call READ on the
resulting string stream.
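
Something along these lines (untested sketch):

(defun read-all-forms (pathname)
  ;; Slurp the whole file into one string with READ-SEQUENCE, then
  ;; READ repeatedly from an in-memory string stream.
  (with-open-file (in pathname :direction :input)
    (let* ((buffer (make-string (file-length in)))
           (end (read-sequence buffer in)))
      (with-input-from-string (s buffer :end end)
        (loop for form = (read s nil 'eof)
              until (eq form 'eof)
              collect form)))))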
From: Barry Margolin
Subject: Re: Fast serialization?
Date: 
Message-ID: <barmar-B51EE8.23392607052006@comcast.dca.giganews.com>
In article <···············@five-percent-nation.mit.edu>,
 rif <···@mit.edu> wrote:

> Actually, I'm not quite sure what I need.
> 
> Basically, I'm reading a bunch of files in, and producing some
> intermediate results, which I want to save to disk, to later read in
> again.  The original files are raw data, so I read them with
> read-line, then I use split-sequence functions to get the fields out.
> For the intermediate results, I'm storing them in instances of a CLOS
> class, and writing them out to disk via a print-object method.  I
> basically am storing them as
> 
> (class-name field-name-1 field-value-1 field-name-2 field-value-2 ...)
> 
> Then when I want to get them back for the next computation, I call
> read to get a list, and convert the list with something like
> 
> (make-instance 'class-name 'field-name-1 (nth 2 list) 
>                            'field-name-2 (nth 4 list) ... )

Wouldn't it be simpler to just do:

(apply #'make-instance item)

?
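
I.e. since each item you wrote out is already a MAKE-INSTANCE argument
list (the class name followed by initarg/value pairs), the whole
read-back loop could collapse to something like this (sketch):

(defun read-instances (stream)
  ;; Each READ returns (class-name initarg value ...), which is
  ;; exactly the argument list MAKE-INSTANCE wants; stop at EOF.
  (loop for item = (read stream nil nil)
        while item
        collect (apply #'make-instance item)))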

> 
> This works, but doesn't seem especially elegant (although it'd be more
> elegant if I hid it in some macrology), and more importantly it's
> slow.  Profiling (Pentium IV, SBCL 0.9.8) indicates that almost all
> the runtime is spent in the read function (note: there is some lossage
> because I should only traverse the list once, and calling (nth 2 list)
> then (nth 4 list) on the same list is inefficient, but profiling
> indicates the call to read is really the timesink).  The overall
> running time is slow enough that I'd like to figure out a better
> approach.
> 
> Any thoughts?

You can design your own serialization format.  This can take advantage 
of what you know about the field values to avoid the general-purpose 
parser in READ.

But if the field values can be arbitrary Lisp values, you're probably 
going to have to resort to calling READ-FROM-STRING on those items, 
which just gets you almost back to where you are.
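
For instance, if you knew every field value were an integer, something
like this would skip READ entirely (a sketch only; it reuses the
split-sequence library you're already using on the raw data):

(defun write-record (values stream)
  ;; VALUES is a list of integers; one record per line, tab-separated.
  (loop for (v . rest) on values
        do (princ v stream)
           (when rest (write-char #\Tab stream)))
  (terpri stream))

(defun parse-record (line)
  ;; Inverse of WRITE-RECORD: split on tabs and parse each field.
  (mapcar #'parse-integer
          (split-sequence:split-sequence #\Tab line)))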

-- 
Barry Margolin, ······@alum.mit.edu
Arlington, MA
*** PLEASE post questions in newsgroups, not directly to me ***
*** PLEASE don't copy me on replies, I'll read them in the group ***
From: Bill Atkins
Subject: Re: Fast serialization?
Date: 
Message-ID: <87veshgil6.fsf@rpi.edu>
rif <···@mit.edu> writes:

> Actually, I'm not quite sure what I need.
>
> Basically, I'm reading a bunch of files in, and producing some
> intermediate results, which I want to save to disk, to later read in
> again.  The original files are raw data, so I read them with
> read-line, then I use split-sequence functions to get the fields out.
> For the intermediate results, I'm storing them in instances of a CLOS
> class, and writing them out to disk via a print-object method.  I
> basically am storing them as
>
> (class-name field-name-1 field-value-1 field-name-2 field-value-2 ...)
>
> Then when I want to get them back for the next computation, I call
> read to get a list, and convert the list with something like
>
> (make-instance 'class-name 'field-name-1 (nth 2 list) 
>                            'field-name-2 (nth 4 list) ... )
>
> This works, but doesn't seem especially elegant (although it'd be more
> elegant if I hid it in some macrology), and more importantly it's
> slow.  Profiling (Pentium IV, SBCL 0.9.8) indicates that almost all
> the runtime is spent in the read function (note: there is some lossage
> because I should only traverse the list once, and calling (nth 2 list)
> then (nth 4 list) on the same list is inefficient, but profiling
> indicates the call to read is really the timesink).  The overall
> running time is slow enough that I'd like to figure out a better
> approach.
>
> Any thoughts?
>
> rif
>

(defun serialize (thing file function-name &optional (compilep t))
  "Saves THING to FILE.  When you load FILE, or its compiled version,
run the function FUNCTION-NAME to get the data back"
  (with-open-file (s file :direction :output :if-exists :supersede
		     :if-does-not-exist :create)
    (write-string "(defun " s)
    (write function-name :stream s)
    (write-string " () (quote " s)
    (write thing :stream s :length nil :level nil :readably t)
    (write-string "))" s))
  (when compilep
    (compile-file file)))

CL-USER> (progn
	   (defvar thing
	     (loop for x from 1 to 10000
	        collect (loop for y from 1 to 100
		           collect (elt '(a b c d e) (random 5)))))
	   nil)
NIL
CL-USER> (time (serialize thing "foo.lisp" 'foobar t))
; compiling file "/home/bill/lisp/foo.lisp" (written 08 MAY 2006 01:51:47 AM):
; compiling (DEFUN FOOBAR ...)

; /home/bill/lisp/foo.fasl written
; compilation finished in 0:00:09
Evaluation took:
  15.81 seconds of real time
  8.271743 seconds of user run time
  0.355946 seconds of system run time
  0 page faults and
  200,217,128 bytes consed.
#P"/home/bill/lisp/foo.fasl"
NIL
NIL
CL-USER> (time (serialize thing "foo.lisp" 'foobar nil))
Evaluation took:
  7.123 seconds of real time
  4.228358 seconds of user run time
  0.178973 seconds of system run time
  0 page faults and
  41,686,112 bytes consed.
NIL
CL-USER> (time (progn (load *) (length (foobar))))
STYLE-WARNING: redefining FOOBAR in DEFUN
Evaluation took:
  0.284 seconds of real time
  0.258961 seconds of user run time
  0.001 seconds of system run time
  0 page faults and
  8,236,624 bytes consed.
10000
CL-USER> (time (progn (load "foo.lisp") (length (foobar))))
STYLE-WARNING: redefining FOOBAR in DEFUN
Evaluation took:
  1.629 seconds of real time
  1.550765 seconds of user run time
  0.054992 seconds of system run time
  0 page faults and
  25,329,568 bytes consed.
10000

 ------

Compiling the data lets you load it about 6 times faster than
READ'ing in the text representation, although the compilation roughly
doubles the time needed for serialization.
Tested with CVS SBCL.

There could be some subtle problems with this that I'm not aware of,
but it seems like as long as your data isn't anything complicated (and
it doesn't seem to be, since you were able to use READ directly on the
lists of properties), this should be OK.

Anyway, it's something to try.

-- 
This is a song that took me ten years to live and two years to write.
 - Bob Dylan
From: Bill Atkins
Subject: Re: Fast serialization?
Date: 
Message-ID: <87lktdgicd.fsf@rpi.edu>
Bill Atkins <············@rpi.edu> writes:

> rif <···@mit.edu> writes:
>
>> Actually, I'm not quite sure what I need.
>>
>> Basically, I'm reading a bunch of files in, and producing some
>> intermediate results, which I want to save to disk, to later read in
>> again.  The original files are raw data, so I read them with
>> read-line, then I use split-sequence functions to get the fields out.
>> For the intermediate results, I'm storing them in instances of a CLOS
>> class, and writing them out to disk via a print-object method.  I
>> basically am storing them as
>>
>> (class-name field-name-1 field-value-1 field-name-2 field-value-2 ...)
>>
>> Then when I want to get them back for the next computation, I call
>> read to get a list, and convert the list with something like
>>
>> (make-instance 'class-name 'field-name-1 (nth 2 list) 
>>                            'field-name-2 (nth 4 list) ... )
>>
>> This works, but doesn't seem especially elegant (although it'd be more
>> elegant if I hid it in some macrology), and more importantly it's
>> slow.  Profiling (Pentium IV, SBCL 0.9.8) indicates that almost all
>> the runtime is spent in the read function (note: there is some lossage
>> because I should only traverse the list once, and calling (nth 2 list)
>> then (nth 4 list) on the same list is inefficient, but profiling
>> indicates the call to read is really the timesink).  The overall
>> running time is slow enough that I'd like to figure out a better
>> approach.
>>
>> Any thoughts?
>>
>> rif
>>
>
> (defun serialize (thing file function-name &optional (compilep t))
>   "Saves THING to FILE.  When you load FILE, or its compiled version,
> run the function FUNCTION-NAME to get the data back"
>   (with-open-file (s file :direction :output :if-exists :supersede
> 		     :if-does-not-exist :create)
>     (write-string "(defun " s)
>     (write function-name :stream s)
>     (write-string " () (quote " s)
>     (write thing :stream s :length nil :level nil :readably t)
>     (write-string "))" s))
>   (when compilep
>     (compile-file file)))
>
> CL-USER> (progn
> 	   (defvar thing
> 	     (loop for x from 1 to 10000
> 	        collect (loop for y from 1 to 100
> 		           collect (elt '(a b c d e) (random 5)))))
> 	   nil)
> NIL
> CL-USER> (time (serialize thing "foo.lisp" 'foobar t))
> ; compiling file "/home/bill/lisp/foo.lisp" (written 08 MAY 2006 01:51:47 AM):
> ; compiling (DEFUN FOOBAR ...)
>
> ; /home/bill/lisp/foo.fasl written
> ; compilation finished in 0:00:09
> Evaluation took:
>   15.81 seconds of real time
>   8.271743 seconds of user run time
>   0.355946 seconds of system run time
>   0 page faults and
>   200,217,128 bytes consed.
> #P"/home/bill/lisp/foo.fasl"
> NIL
> NIL
> CL-USER> (time (serialize thing "foo.lisp" 'foobar nil))
> Evaluation took:
>   7.123 seconds of real time
>   4.228358 seconds of user run time
>   0.178973 seconds of system run time
>   0 page faults and
>   41,686,112 bytes consed.
> NIL
> CL-USER> (time (progn (load *) (length (foobar))))
> STYLE-WARNING: redefining FOOBAR in DEFUN
> Evaluation took:
>   0.284 seconds of real time
>   0.258961 seconds of user run time
>   0.001 seconds of system run time
>   0 page faults and
>   8,236,624 bytes consed.
> 10000
> CL-USER> (time (progn (load "foo.lisp") (length (foobar))))
> STYLE-WARNING: redefining FOOBAR in DEFUN
> Evaluation took:
>   1.629 seconds of real time
>   1.550765 seconds of user run time
>   0.054992 seconds of system run time
>   0 page faults and
>   25,329,568 bytes consed.
> 10000
>
>  ------
>
> Compiling the data lets you load the data about 6 times faster than
> READ'ing in the text representation, although the compilation
> increases the time needed for serialization by a factor of about 2.
> Tested with CVS SBCL.
>
> There could be some subtle problems with this that I'm not aware of,
> but it seems like as long as your data isn't anything complicated (and
> it doesn't seem to be, since you were able to use READ directly on the
> lists of properties), this should be OK.
>
> Anyway, it's something to try.
>
> -- 
> This is a song that took me ten years to live and two years to write.
>  - Bob Dylan

This is a much nicer version of SERIALIZE:

(defun serialize (thing file function-name &optional (compilep t))
  (with-open-file (s file :direction :output :if-exists :supersede
		     :if-does-not-exist :create)
    (write `(defun ,function-name ()
	      (quote ,thing))
	   :stream s :length nil :level nil :readably t))
  (if compilep
      (compile-file file)
      file))

I don't know what I was thinking in my previous post. :)
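
For completeness, getting the data back is just loading the written
file (or its fasl) and calling the generated function.  A trivial
helper (untested):

(defun deserialize (file function-name)
  ;; Load the file written by SERIALIZE (source or compiled) and call
  ;; the generated accessor to recover the data.
  (load file)
  (funcall function-name))

With the SERIALIZE above, (deserialize (serialize thing "foo.lisp"
'foobar) 'foobar) should round-trip the data, since COMPILE-FILE
returns the fasl's pathname.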

-- 
This is a song that took me ten years to live and two years to write.
 - Bob Dylan