From: Bruce L. Lambert, Ph.D.
Subject: How to read lots of data from disk quickly?
Date: 
Message-ID: <4l8j5u$ocu@piglet.cc.uic.edu>
In the course of my information retrieval experiments, I often need to
read large data files in from disk. I've got a couple of functions that
do this successfully, but they seem awful damn slow to me. They also seem
really convoluted for what should be a more straightforward task. Of
course, I am a graduate of the Rube Goldberg school of programming. I
never saw a superfluous step I didn't like.

In any case, here's a function to read in the indexes and values from a
sparse array into a hash table. Is there a more efficient way of going
about this?

(defun read-nonzero-lower-diagonal (fname)
  (print "Loading similarity matrix...")
  (let ((row-index         0)
        (column-index      0)
        (current-sim-value 0)
        (sim-table         (make-hash-table :test #'equal)))
    (with-open-file (istream fname :direction :input)
      (loop
        (when (null (peek-char t istream nil nil)) (return))
        (setf row-index (read istream nil nil))
        (setf column-index (read istream nil nil))
        (setf current-sim-value (read istream nil nil))
        (setf (gethash (cons row-index column-index) sim-table)
              current-sim-value)))
    (print "Done loading similarity matrix")
    sim-table))

This next one is supposed to read each line from a text file and return
it as a list of symbols (not as a string). It seems almost unbelievably
convoluted to me. Is there a better way?

(defun read-collection (filename)
  (with-open-file (istream filename :direction :input)
    (let ((collection (make-array 0 :adjustable t :fill-pointer t))
          (current-clause '()))
      (loop
        (cond ((null (peek-char nil istream nil nil)) (return))
              ;; when next char is end of file, return
              ((char= (peek-char nil istream) #\newline)
               ;; when next char is newline,
               (read-char istream)
               ;; pluck the newline off of istream,
               (vector-push-extend current-clause collection)
               ;; put the current clause on the end of the collection,
               (setf current-clause '()))
              ;; and clear the contents of current clause
              (t (loop
                   (if (or (null (peek-char nil istream nil nil))
                           ;; if next char is end of file
                           (char= (peek-char nil istream) #\newline))
                           ;; or newline,
                       (return)
                       ;; return; else push the next word onto the
                       ;; end of the current clause
                       (push-end (read-preserving-whitespace istream)
                                 current-clause))))))
      collection)))

The previous function uses a macro called push-end that a friend of mine
wrote. It's mighty convenient.

(defmacro push-end (object list-variable)
  `(setf ,list-variable
         (cond ((consp ,list-variable)
                (nconc ,list-variable (list ,object)))
               ((null ,list-variable) (list ,object))
               (t (format t "ERROR -- PUSH-END was given invalid arguments")))))

From: Benjamin Shults
Subject: Re: How to read lots of data from disk quickly?
Date: 
Message-ID: <3177EF09.52AD63AE@math.utexas.edu>
Bruce L. Lambert, Ph.D. wrote:
> 
> In the course of my information retrieval experiments, I often need to
> read large data files in from disk. I've got a couple of functions that
> do this successfully, but they seem awful damn slow to me. They also seem
> really convoluted for what should be a more straightforward task. Of
> course, I am a graduate of the Rube Goldberg school of programming. I
> never saw a superfluous step I didn't like.

This should make some difference on the first one:

(defun read-nonzero-lower-diagonal (fname)
  (print "Loading similarity matrix...")
  (let ((sim-table (make-hash-table :test #'equal)))
    (with-open-file (istream fname :direction :input)
      (loop
        (when (null (peek-char t istream nil nil)) (return))
        (setf (gethash (cons (read istream nil nil) (read istream nil nil))
                       sim-table)
              (read istream nil nil))))
    (print "Done loading similarity matrix")
    sim-table))
 
On the second one, I would probably use READ-LINE to get a string and then
READ-FROM-STRING.  I'm not sure that would be faster, but it would be less
convoluted.
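That might look something like this for the first function -- a sketch of my
suggestion, assuming one row/column/value triple per line (the READ-based
original did not actually require that layout):

```lisp
(defun read-nonzero-lower-diagonal (fname)
  ;; Sketch: one "row col value" triple per line; blank lines skipped.
  (let ((sim-table (make-hash-table :test #'equal)))
    (with-open-file (istream fname :direction :input)
      (loop for line = (read-line istream nil nil)
            while line
            do (multiple-value-bind (row pos)
                   (read-from-string line nil nil)
                 (when row          ; ignore blank lines
                   (multiple-value-bind (col pos2)
                       (read-from-string line nil nil :start pos)
                     (setf (gethash (cons row col) sim-table)
                           (read-from-string line nil nil :start pos2)))))))
    sim-table))
```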

It has seemed to me that Lisp's file reading functions are slow.  This
probably depends on the Lisp implementation.  If speed is important, you
might try another implementation.

-- 
Benjamin Shults                 Email:  ·······@math.utexas.edu
Department of Mathematics       Phone:  (512) 471-7711 ext. 208
University of Texas at Austin   WWW:    http://www.ma.utexas.edu/users/bshults
Austin, TX  78712   USA         FAX:    (512) 471-9038 (attn: Benjamin Shults)
From: Pete Grant
Subject: Re: How to read lots of data from disk quickly?
Date: 
Message-ID: <4lb7kp$1jm@news1.h1.usa.pipeline.com>
On Apr 19, 1996 17:40:14 in article <How to read lots of data from disk
quickly?>, 'Bruce L. Lambert, Ph.D. <········@uic.edu>' wrote: 
 
 
>In the course of my information retrieval experiments, I often need to 
>read large data files in from disk. I've got a couple of functions that 
>do this successfully, but they seem awful damn slow to me. They also seem 
>really convoluted for what should be a more straightforward task. Of 
>course, I am a graduate of the Rube Goldberg school of programming. I 
>never saw a superfluous step I didn't like. 
> 
>In any case, here's a function to read in the indexes and values from a 
>sparse array into a hash table. Is there a more efficient way of going 
>about this. 
> 
>(defun read-nonzero-lower-diagonal (fname)
>  (print "Loading similarity matrix...")
>  (let ((row-index         0)
>        (column-index      0)
>        (current-sim-value 0)
>        (sim-table         (make-hash-table :test #'equal)))
>    (with-open-file (istream fname :direction :input)
>      (loop
>        (when (null (peek-char t istream nil nil)) (return))
>        (setf row-index (read istream nil nil))
>        (setf column-index (read istream nil nil))
>        (setf current-sim-value (read istream nil nil))
>        (setf (gethash (cons row-index column-index) sim-table)
>              current-sim-value)))
>    (print "Done loading similarity matrix")
>    sim-table))
> 
 
The answer depends on how much control you have over matters, 
and how hard you are willing to work to speed things up. 
For example, a significant amount of time is expended in  
extracting numerical data from ASCII character strings.  If 
you could store the numbers in binary form, much time could 
be saved by not doing the conversion as well as eliminating 
the inherent inefficiencies in general-purpose routines such 
as READ. 
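For illustration, here is a minimal sketch of the binary idea (the format is my
invention, not Bruce's actual data layout): if the row, column, and suitably
scaled similarity values all fit in non-negative 32-bit integers, they can be
written and read with an (UNSIGNED-BYTE 32) element-type, skipping text
conversion entirely:

```lisp
;; Hypothetical sketch: store each (row col value) triple as three
;; raw 32-bit unsigned integers instead of ASCII text.
(defun write-binary-triples (fname triples)
  (with-open-file (out fname :direction :output
                             :element-type '(unsigned-byte 32)
                             :if-exists :supersede)
    (dolist (triple triples)
      (dolist (n triple)
        (write-byte n out)))))

(defun read-binary-triples (fname)
  (with-open-file (in fname :element-type '(unsigned-byte 32))
    (loop for row = (read-byte in nil nil)   ; nil => end of file
          while row
          collect (list row (read-byte in) (read-byte in)))))
```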
 
If external constraints dictate text-mode data, could you 
at least write it in a formatted, i.e., column-sensitive, form? 
If so, you could save some time by using READ-LINE and your 
own custom integer extractors that convert text into numeric 
from specified columns.  I have written some that I could share. 
An example of using one such would be: 
  (loop as data-line = (read-line infile nil nil)
     while data-line
     as row = (extract-integer data-line 0 6)    ;; cols 1-6
     as col = (extract-integer data-line 7 13)   ;; cols 8-13
     as val = (extract-integer data-line 14 20)  ;; cols 15-20
     do
     (setf (gethash (cons row col) sim-table) val))
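EXTRACT-INTEGER itself wasn't posted; here is a guess at the intended
interface, built on the standard PARSE-INTEGER (which already takes :START and
:END arguments):

```lisp
;; Hypothetical stand-in for the unposted EXTRACT-INTEGER: parse the
;; integer found in STRING between columns START and END.
(defun extract-integer (string start end)
  (parse-integer string
                 :start start
                 :end (min end (length string))
                 :junk-allowed t))   ; tolerate padding spaces
```

Under these assumptions, (extract-integer "     1     2" 0 6) returns 1.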
 
If you have no control over any of this, there's not much 
you can do.  (Somebody else's suggestion of eliminating 
the intermediate variables will not make a noticeable 
difference).  You could easily eliminate the PEEK-CHAR call  
but its effect is implementation-dependent and probably  
will not make a difference worth beans. 
 
- - - - - 
 
>This next one is supposed to read each line from a text file and return 
>it as a list of symbols (not as a string). It seems almost unbelievably 
>convoluted to me. Is there a better way? 
> 
>(defun read-collection (filename)
>  (with-open-file (istream filename :direction :input)
>    (let ((collection (make-array 0 :adjustable t :fill-pointer t))
>          (current-clause '()))
>      (loop
>        (cond ((null (peek-char nil istream nil nil)) (return))
>              ;; when next char is end of file, return
>              ((char= (peek-char nil istream) #\newline)
>               ;; when next char is newline,
>               (read-char istream)
>               ;; pluck the newline off of istream,
>               (vector-push-extend current-clause collection)
>               ;; put the current clause on the end of the collection,
>               (setf current-clause '()))
>              ;; and clear the contents of current clause
>              (t (loop
>                   (if (or (null (peek-char nil istream nil nil))
>                           ;; if next char is end of file
>                           (char= (peek-char nil istream) #\newline))
>                           ;; or newline,
>                       (return)
>                       ;; return; else push the next word onto the
>                       ;; end of the current clause
>                       (push-end (read-preserving-whitespace istream)
>                                 current-clause))))))
>      collection)))
> 
>The previous function uses a macro called push-end that a friend of mine 
>wrote. It's mighty convenient. 
> 
>(defmacro push-end (object list-variable)
>  `(setf ,list-variable
>         (cond ((consp ,list-variable)
>                (nconc ,list-variable (list ,object)))
>               ((null ,list-variable) (list ,object))
>               (t (format t "ERROR -- PUSH-END was given invalid arguments")))))
 
Now here, a bunch of improvements can be obtained, and 
the 'convolution' can definitely be diminished. 
 
First, if you have some idea of how big the array is 
going to be, make it that size and save the overhead of 
the multiple grow-array calls that are being made 'behind the 
scenes'.  If the range of sizes is great, waste some 
space and initialize it to some reasonable size; e.g., 
1000.  Alternatively, my favorite technique is to 
initially gather a list and then make an array out of 
it -- see below for an example. 
 
Second, the PEEK-CHAR issue surfaces again.  This 
time, however, it's much more significant in that it 
potentially gets called many, many times.  The best (? 
best is in the eye of the beholder) method is to read 
the entire line and process it in a more efficient manner. 
This eliminates all of the peeking. 
 
Third, although the PUSH-END works, it's potentially 
somewhat inefficient.  Each time you push an item to 
the end of the list, you must traverse the list.  If 
your lists are lengthy, this adds up to a fair amount of 
time.  Better to build the list in reverse order, then 
unreverse it just before storing into the array. 
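In code, that pattern is simply (names here are mine, for illustration):

```lisp
;; Build the list in reverse with O(1) PUSH, then NREVERSE once at
;; the end, instead of paying an O(n) tail-append per item.
(defun collect-in-order (items)
  (let ((result '()))
    (dolist (item items)
      (push item result))   ; O(1) each
    (nreverse result)))     ; one O(n) pass restores original order
```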
 
Here's how I would do it.  The idea, as partially 
outlined above, is to read a line at a time, creating 
the list from the text line, then collecting a list 
of lists.  When done, build an array from the list. 
It wastes a bit of cons space, but is potentially 
a lot faster.  Further improvements are also 
possible, but at diminishing returns for the effort. 
For example, I have a special read-line-into-reusable-string 
that is noticeably faster, but only with large files. 
 
(defun build-array (&optional (filename "e:\\data\\foo.bar")) 
   (let ((list-of-lists nil) 
         (len 0) 
         arr) 
      (with-open-file (f filename) 
         (loop as line = (read-line f nil nil) 
            while line 
            as wordlist = (make-wordlist line) 
            do 
            (push wordlist list-of-lists) 
            (incf len))) 
      ;; you could eliminate the len var and just 
      ;; take the length of the list instead. 
      (setf arr (make-array len)) 
      (loop for wlist in list-of-lists 
         for i downfrom (1- len) 
         do 
         (setf (aref arr i) wlist)) 
      arr)) ;; return the array to caller 
    
(defun make-wordlist (text) 
   (loop with pos = 0 
      and list = nil 
      do 
      (multiple-value-bind (word npos) 
          (read-from-string text nil nil :start pos) 
         (unless word 
            (return-from make-wordlist (nreverse list))) 
         (push word list) 
         (setf pos npos)))) 
 
Note:  Tested on Allegro CL 3.0 on WinNT 3.51. 
 
 
-- 
Pete Grant 
Kalevi, Inc. 
Software Engineering & development
From: Erik Naggum
Subject: Re: How to read lots of data from disk quickly?
Date: 
Message-ID: <3039042460679868@arcana.naggum.no>
[Bruce L. Lambert, Ph.D.]

|   In the course of my information retrieval experiments, I often need to
|   read large data files in from disk.  I've got a couple of functions
|   that do this successfully, but they seem awful damn slow to me.  They
|   also seem really convoluted for what should be a more straightforward
|   task.

(this is not an answer to your question, but it could become one.)

the ANSI CL function `read-sequence' (`write-sequence') will replace
successive objects in a sequence (file) with objects in the file
(sequence).  the efficiency gains hinge on equality of the element-type of
the stream and the sequence.  `open' is not able to portably open a file
with anything but element-types that are subtypes of character, finite
subtypes of integer, signed-byte or unsigned-byte, barring an unusual
interpretation of :default.
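for contrast, here is a portable sketch using one of the always-supported
element-types (my example, not erik's):

```lisp
;; Portable: (unsigned-byte 8) is openable in any conforming
;; implementation, and READ-SEQUENCE fills the buffer in one call.
(defun slurp-bytes (filespec)
  (with-open-file (stream filespec :element-type '(unsigned-byte 8))
    (let ((buffer (make-array (file-length stream)
                              :element-type '(unsigned-byte 8))))
      (read-sequence buffer stream)
      buffer)))
```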

would it be a non-conforming extension to pass floating-point types as the
element-type to `open' so one could do the following?

    (let* ((type 'double-float)
	   (vector (make-array <dimensions>
                              :element-type type
			      :initial-element (coerce 0 type))))
      (with-open-file (stream <filespec>
			      :element-type type)
	(read-sequence vector stream))
      vector)

IMHO, if this is a non-conforming extension, we have a problem.  if it is a
conforming extension, vendors should be encouraged to supply ways to allow
files to handle user-specified extended element-types.  it would perhaps
make sense to achieve consensus on the interface to such extensions.

(should this be posted to comp.std.lisp?  or is the group defunct?)

#<Erik>
-- 
errare umanum hest