From: remixer
Subject: massive data analysis with lisp
Date: 
Message-ID: <1160372084.377509.8350@i42g2000cwa.googlegroups.com>
I don't know the details of Lisp memory management, but I am going to tell
you how I failed to do something that shouldn't be very difficult.

I was thinking of playing with the data for the Netflix Prize
(http://www.netflixprize.com/index) -- it is about 2 GB of movie
ratings in text files. I wanted to read it in and store it in a
database. I tried Franz and AllegroCache -- and it failed with GC
errors. Understandable: I was creating too many objects that didn't fit
in memory. I searched in vain for a feature that would let me delete
the objects from RAM and just keep them on disk. Couldn't find it.

Then I tried CMUCL, sqlite3 and the cmucl-sqlite bridge. Very slow: it
was reading 1 MB/min (which would take over a day to read the data). But
then this bombed out with an out-of-memory error. I was reading text and
shoving it in the db, and the ratings are purely numerical.

I am using a decent machine (2 GB RAM, dual Xeon, etc.). I am a little
frustrated with this experience, as I finally gave up; I couldn't even
read in the data. My friends using Perl and even PHP seem to have had
no such problems. What am I doing wrong? I am looking for a solution
to the following problem: how to efficiently access X GB of data
with Y GB of RAM, where X > Y.

From: samantha
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160381290.871678.34000@h48g2000cwc.googlegroups.com>
Someone may be able to help you if a bit more information is given.

a) exactly what kind of memory errors did you experience?  stack traces
if you have them.

b) please describe your algorithms.  Consider showing the relevant
code.

Generally if GC can't release enough it is because you are holding
something that is pinning the information in memory, e.g., some large
collection.
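
For instance, here is a minimal sketch of the difference (PROCESS-LINE
is a placeholder for whatever you do with each record):

;; Pins everything: the global list keeps every line reachable,
;; so the GC can never release them.
(defvar *all-lines* '())
(defun load-pinned (file)
  (with-open-file (in file)
    (loop for line = (read-line in nil nil)
          while line
          do (push line *all-lines*))))

;; GC-friendly: each line becomes garbage right after it is processed.
(defun load-streaming (file)
  (with-open-file (in file)
    (loop for line = (read-line in nil nil)
          while line
          do (process-line line)))) ; PROCESS-LINE: your per-record work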

- samantha

remixer wrote:
> I don't know the details of Lisp memory management, but I am going to tell
> you how I failed to do something that shouldn't be very difficult.
>
> I was thinking of playing with the data for the Netflix Prize
> (http://www.netflixprize.com/index) -- it is about 2 GB of movie
> ratings in text files. I wanted to read it in and store it in a
> database. I tried Franz and AllegroCache -- and it failed with GC
> errors. Understandable: I was creating too many objects that didn't fit
> in memory. I searched in vain for a feature that would let me delete
> the objects from RAM and just keep them on disk. Couldn't find it.
>
> Then I tried CMUCL, sqlite3 and the cmucl-sqlite bridge. Very slow: it
> was reading 1 MB/min (which would take over a day to read the data). But
> then this bombed out with an out-of-memory error. I was reading text and
> shoving it in the db, and the ratings are purely numerical.
>
> I am using a decent machine (2 GB RAM, dual Xeon, etc.). I am a little
> frustrated with this experience, as I finally gave up; I couldn't even
> read in the data. My friends using Perl and even PHP seem to have had
> no such problems. What am I doing wrong? I am looking for a solution
> to the following problem: how to efficiently access X GB of data
> with Y GB of RAM, where X > Y.
From: remixer
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160389711.472307.9170@c28g2000cwb.googlegroups.com>
Here is the code that I was using to read in the data when trying
Allegro. I am using the full version of ACL 8.0, and I tried Joel's
suggestion of capping the number of objects in the cache -- it is not
helping with regard to speed.

(require :acache "acache-1.1.15.fasl")
(require :pcache "pcache-1.1.15.fasl")

(defclass movie ()
    ((mid :initarg :mid :accessor mid :index :any)
     (name :initarg :name :accessor name)
     (year :initarg :year :accessor year))
  (:metaclass persistent-class))

(defclass rating ()
    ((mid :initarg :mid :accessor mid :index :any)
     (uid :initarg :uid :accessor uid :index :any)
     (date :initarg :date :accessor date)
     (stars :initarg :stars :accessor stars))
  (:metaclass persistent-class))


(defun open-db (db-path)
  (setq *db* (open-file-database db-path
		 :if-does-not-exist :create)))
(defun close-db ()
  (close-database))

(defun delete-db (&optional (db *moviedb*))
  (close-db)
  (excl::delete-directory-and-files db))

(defvar *db* nil)
(defparameter *moviefile* "/data/movie_titles.txt")
(defparameter *moviedb* "/data/nf.db")
(defparameter *trainingdir* "/data/training_set_full/")

(defun load-netflix-data ()
  (format t "Opening Allegrocache DB...~%")
  (open-db *moviedb*)
  (setf (object-cache-size 'movie) 17770)  ;; total number of movies
  (setf (object-cache-size 'rating) 10000) ;; more than enough ratings per movie
  ;; load movies
  (format t "Reading movies...~%")
  (load-movies *moviefile*)
  (commit)
  ;; load-ratings
  (format t "Reading ratings...~%")
  (load-all-ratings-sequentially)
  (commit)
  (close-db))

;; Read the movies -- this takes a few seconds

(defun load-movies (moviefile)
  (with-open-file (fin moviefile :direction :input)
    (let* ((fin (make-string-input-stream (slurp-file file))))
      (do ((line (read-line fin nil :eof) (read-line fin nil :eof)))
          ((eql line :eof) t)
        (line->movie line mid)))))

;;1,2003,Dinosaur Planet
(defun line->movie (line)
  (let ((m (csv->list line)))
    (make-instance 'movie
	   :mid (parse-integer (first m))
	   :year (unless (equal (second m) "NULL")
		   (parse-integer (second m)))
	   :name (third m))))


;; Read the ratings -- this wants to take two days on a dual Xeon 3GHz
;; PC with 2 GB RAM

(defun load-all-ratings (&aux (i 0))
  (dolist (filename (directory *trainingdir*))
    (load-ratings filename)
    (when (= 0 (mod (incf i) 100))
      (format t "Read ~A movies.~%" i)
      (commit))))


(defun load-ratings (file) ;; the file corresponds to a movie
  (with-open-file (fin file :direction :input)
    ;; movie id
    (let* ((fin (make-string-input-stream (slurp-file file)))
	   (line1 (read-line fin nil :eof))
	   (mid (parse-integer (subseq line1 0 (- (length line1) 1)))))
      (do ((line (read-line fin nil :eof) (read-line fin nil :eof)))
	  ((eql line :eof) t)
	(line->rating line mid)))))


;;1488844,3,2005-09-06
(defun line->rating (line mid)
  (let ((r (csv->list line)))
    (make-instance 'rating
		   :mid mid
		   :uid (parse-integer (first r))
		   :stars (parse-integer (second r))
		   :date (third r))))

(defun slurp-file (file)
  (with-open-file (fin file :direction :input)
    (let ((seq (make-string (file-length fin))))
      (read-sequence seq fin)
      seq)))

(defun csv->list (str)
  (excl::split-re "," str))

;; Alternate method for loading all the files in the training directory
;; Does not have to keep the list of files
;; MovieIDs range from 1 to 17770 sequentially

(defparameter *num-movies* 17770)

(defun load-all-ratings-sequentially (&optional (num-movies *num-movies*))
  (dotimes (i num-movies)
    (load-ratings (make-filename-for-movie (+ i 1)))
    (when (= 0 (mod i 100))
      (format t "Read ~A movies.~%" i)
      (commit))))

;; mv_0000026.txt
(defun make-filename-for-movie (mid)
  (format nil "~Amv_~7,'0D.txt" *trainingdir* mid))
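
For example:

CL-USER(1): (make-filename-for-movie 26)
"/data/training_set_full/mv_0000026.txt"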

samantha wrote:
> Someone may be able to help you if a bit more information is given.
>
> a) exactly what kind of memory errors did you experience?  stack traces
> if you have them.
>
> b) please describe your algorithms.  Consider showing the relevant
> code.
>
> Generally if GC can't release enough it is because you are holding
> something that is pinning the information in memory, e.g., some large
> collection.
>
> - samantha
>
> remixer wrote:
> > I don't know the details of Lisp memory management, but I am going to tell
> > you how I failed to do something that shouldn't be very difficult.
> >
> > I was thinking of playing with the data for the Netflix Prize
> > (http://www.netflixprize.com/index) -- it is about 2 GB of movie
> > ratings in text files. I wanted to read it in and store it in a
> > database. I tried Franz and AllegroCache -- and it failed with GC
> > errors. Understandable: I was creating too many objects that didn't fit
> > in memory. I searched in vain for a feature that would let me delete
> > the objects from RAM and just keep them on disk. Couldn't find it.
> >
> > Then I tried CMUCL, sqlite3 and the cmucl-sqlite bridge. Very slow: it
> > was reading 1 MB/min (which would take over a day to read the data). But
> > then this bombed out with an out-of-memory error. I was reading text and
> > shoving it in the db, and the ratings are purely numerical.
> >
> > I am using a decent machine (2 GB RAM, dual Xeon, etc.). I am a little
> > frustrated with this experience, as I finally gave up; I couldn't even
> > read in the data. My friends using Perl and even PHP seem to have had
> > no such problems. What am I doing wrong? I am looking for a solution
> > to the following problem: how to efficiently access X GB of data
> > with Y GB of RAM, where X > Y.
From: Thomas A. Russ
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <ymi64et1jdb.fsf@sevak.isi.edu>
"remixer" <········@gmail.com> writes:

> ;; Read the ratings -- this wants to take two days on a dual Xeon 3GHz
> ;; PC with 2 GB RAM

Well, it looks like you need to do some profiling on these functions to
see where all the time is going.  That will get you some idea of where
the problem lies.

 > (defun load-all-ratings (&aux (i 0))
 >   (dolist (filename (directory *trainingdir*))
 >     (load-ratings filename)
 >     (when (= 0 (mod (incf i) 100))
 >       (format t "Read ~A movies.~%" i)
 >       (commit))))
 > 
 > 
 > (defun load-ratings (file) ;; the file corresponds to a movie
 >   (with-open-file (fin file :direction :input)
 >     ;; movie id
 >     (let* ((fin (make-string-input-stream (slurp-file file)))
 > 	   (line1 (read-line fin nil :eof))
 > 	   (mid (parse-integer (subseq line1 0 (- (length line1) 1)))))
 >       (do ((line (read-line fin nil :eof) (read-line fin nil :eof)))
 > 	  ((eql line :eof) t)
 > 	(line->rating line mid)))))
 > 
 > 
 > ;;1488844,3,2005-09-06
 > (defun line->rating (line mid)
 >   (let ((r (csv->list line)))
 >     (make-instance 'rating
 > 		   :mid mid
 > 		   :uid (parse-integer (first r))
 > 		   :stars (parse-integer (second r))
 > 		   :date (third r))))
 > 
 > (defun slurp-file (file)
 >   (with-open-file (fin file :direction :input)
 >     (let ((seq (make-string (file-length fin))))
 >       (read-sequence seq fin)
 >       seq)))
 > 
 > (defun csv->list (str)
 >   (excl::split-re "," str))

-- 
Thomas A. Russ,  USC/Information Sciences Institute
From: Peder O. Klingenberg
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <ksirit22ds.fsf@beto.netfonds.no>
"remixer" <········@gmail.com> writes:

> (defun load-movies (moviefile)
>   (with-open-file (fin moviefile :direction :input)
>     (let* ((fin (make-string-input-stream (slurp-file file))))
        ^^^^^^^^^^^^

This line doesn't make any sense.  Redefining fin immediately after
defining it, and calling slurp-file on some undefined FILE.  Is this
really the code you're using?

> (defun slurp-file (file)
>   (with-open-file (fin file :direction :input)
>     (let ((seq (make-string (file-length fin))))
>       (read-sequence seq fin)
>       seq)))

So you allocate room for the entire file in memory before you turn it
back into a stream and call read-line on it.  Seems wasteful.  What's
wrong with using read-line directly on the file you're reading?
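
Something like this minimal sketch, using the one-argument LINE->MOVIE
from your own post, should be all it takes:

(defun load-movies (moviefile)
  (with-open-file (fin moviefile :direction :input)
    (loop for line = (read-line fin nil nil)
          while line
          do (line->movie line))))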

...Peder...
-- 
I wish a new life awaited _me_ in some off-world colony.
From: ·······@gmail.com
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160391278.772202.77990@i3g2000cwc.googlegroups.com>
Peder O. Klingenberg wrote:
> "remixer" <········@gmail.com> writes:
>
> > (defun load-movies (moviefile)
> >   (with-open-file (fin moviefile :direction :input)
> >     (let* ((fin (make-string-input-stream (slurp-file file))))
>         ^^^^^^^^^^^^
>
> This line doesn't make any sense.  Redefining fin immediately after
> defining it, and calling slurp-file on some undefined FILE.  Is this
> really the code you're using?

Sorry, it should be:

(defun load-movies (moviefile)
  (let ((fin (make-string-input-stream (slurp-file moviefile))))
    (do ((line (read-line fin nil :eof) (read-line fin nil :eof)))
        ((eql line :eof) t)
      (line->movie line))))

> > (defun slurp-file (file)
> >   (with-open-file (fin file :direction :input)
> >     (let ((seq (make-string (file-length fin))))
> >       (read-sequence seq fin)
> >       seq)))
>
> So you allocate room for the entire file in memory before you turn it
> back into a stream and call read-line on it.  Seems wasteful.  What's
> wrong with using read-line directly on the file you're reading?

According to http://www.emmett.ca/~sabetts/slurp.html this can be more
efficient. However, plain read-line as you are suggesting yields the
same sluggish performance, of roughly 1MB per *minute*.
From: Rob Warnock
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <CKKdnUz0Ku6SQ7fYnZ2dnUVZ_tWdnZ2d@speakeasy.net>
<·······@gmail.com> wrote:
+---------------
| Peder O. Klingenberg wrote:
| > So you allocate room for the entire file in memory before you turn it
| > back into a stream and call read-line on it.  Seems wasteful.  What's
| > wrong with using read-line directly on the file you're reading?
| 
| According to http://www.emmett.ca/~sabetts/slurp.html this can be more
| efficient.
+---------------

Only if you can *fit* the whole file into memory. If not,
you *must* process it incrementally (e.g., use READ-LINE
or modest-sized READ-SEQUENCE on it directly).
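
E.g., a minimal chunked-reading sketch (PROCESS-CHUNK stands in
for whatever you do with the data):

    (with-open-file (s "ratings.txt")
      (let ((buf (make-string 65536)))
        (loop for end = (read-sequence buf s)   ; END = chars actually read
              while (plusp end)
              do (process-chunk buf end))))     ; PROCESS-CHUNK: your work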

+---------------
| However, plain read-line as you are suggesting yields the
| same sluggish performance, of roughly 1MB per *minute*
+---------------

Something's seriously wrong here, then. A simple CMUCL loop
gets ~16 MB per *second* (~240 K lines/s) on a 117 MB file --
and it was an *NFS-mounted* file at that!!

    cmu> (time (with-open-file (s "fairly-large-mail-file.txt")
		 (loop for line = (read-line s nil nil)
		       while line
		   count line into count
		   sum (1+ (length line)) into total-length
		   finally (return (list count total-length)))))
    ; Compiling LAMBDA NIL: 
    ; Compiling Top-Level Form: 
    ; [GC threshold exceeded with 13,070,440 bytes in use.  Commencing GC.]
    ; [GC completed with 1,165,192 bytes retained and 11,905,248 bytes freed.]
    ; [GC will next occur when at least 13,165,192 bytes are in use.]
    ...[another 12 similar GC triplets]...
    ; [GC threshold exceeded with 13,390,112 bytes in use.  Commencing GC.]
    ; [GC completed with 1,386,680 bytes retained and 12,003,432 bytes freed.]
    ; [GC will next occur when at least 13,386,680 bytes are in use.]
    ; Evaluation took:
    ;   7.18f0 seconds of real time
    ;   5.65f0 seconds of user run time
    ;   0.3f0 seconds of system run time
    ;   12,932,822,018 CPU cycles
    ;   [Run times include 0.27f0 seconds GC run time]
    ;   0 page faults and
    ;   174,137,920 bytes consed.
    ; 
    (1721482 116790736)
    cmu> (/ 116790736 7.18f0)

    1.626612f+7
    cmu> (/ 1721482 7.18f0)

    239760.73f0
    cmu> 

Note that by doing individual READ-LINEs and making sure that the
values were garbage by the time the next GC occurred, the above run
never used more than ~13 MB of additional heap at any one time,
despite the fact that the file was an order of magnitude larger.

Any decent CL implementation should get similar (or better!)
performance. Try a similar simple loop in your configuration,
and see if you can spot the bottleneck.


-Rob

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607
From: remixer
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160505270.867328.20430@m73g2000cwd.googlegroups.com>
Rob Warnock wrote:
> <·······@gmail.com> wrote:
> +---------------
> | Peder O. Klingenberg wrote:
> | > So you allocate room for the entire file in memory before you turn it
> | > back into a stream and call read-line on it.  Seems wasteful.  What's
> | > wrong with using read-line directly on the file you're reading?
> |
> | According to http://www.emmett.ca/~sabetts/slurp.html this can be more
> | efficient.
> +---------------
>
> Only if you can *fit* the whole file into memory. If not,
> you *must* process it incrementally (e.g., use READ-LINE
> or modest-sized READ-SEQUENCE on it directly).
>
> +---------------
> | However, plain read-line as you are suggesting yields the
> | same sluggish performance, of roughly 1MB per *minute*
> +---------------
>
> Something's seriously wrong here, then. A simple CMUCL loop
> gets ~16 MB per *second* (~240 K lines/s) on a 117 MB file --
> and it was an *NFS-mounted* file at that!!
>
>     cmu> (time (with-open-file (s "fairly-large-mail-file.txt")
> 		 (loop for line = (read-line s nil nil)
> 		       while line
> 		   count line into count
> 		   sum (1+ (length line)) into total-length
> 		   finally (return (list count total-length)))))
>     ; Compiling LAMBDA NIL:
>     ; Compiling Top-Level Form:
>     ; [GC threshold exceeded with 13,070,440 bytes in use.  Commencing GC.]
>     ; [GC completed with 1,165,192 bytes retained and 11,905,248 bytes freed.]
>     ; [GC will next occur when at least 13,165,192 bytes are in use.]
>     ...[another 12 similar GC triplets]...
>     ; [GC threshold exceeded with 13,390,112 bytes in use.  Commencing GC.]
>     ; [GC completed with 1,386,680 bytes retained and 12,003,432 bytes freed.]
>     ; [GC will next occur when at least 13,386,680 bytes are in use.]
>     ; Evaluation took:
>     ;   7.18f0 seconds of real time
>     ;   5.65f0 seconds of user run time
>     ;   0.3f0 seconds of system run time
>     ;   12,932,822,018 CPU cycles
>     ;   [Run times include 0.27f0 seconds GC run time]
>     ;   0 page faults and
>     ;   174,137,920 bytes consed.
>     ;
>     (1721482 116790736)
>     cmu> (/ 116790736 7.18f0)
>
>     1.626612f+7
>     cmu> (/ 1721482 7.18f0)
>
>     239760.73f0
>     cmu>
>
> Note that by doing individual READ-LINEs and making sure that the
> values were garbage by the time the next GC occurred, the above run
> never used more than ~13 MB of additional heap at any one time,
> despite the fact that the file was an order of magnitude larger.

Are there ways to make sure something is GC'ed? There seem to be ways
to do the reverse, i.e. tenure an object. I read ratings from disk,
create objects, and store them using AllegroCache. (BTW, you're right
-- the reading is fast; it is creating the objects and storing them
that takes the time, so I should have said 1MB/min of *processing*
time, not reading time.) I tell AllegroCache about limits on the
cache, but that doesn't help -- it is sluggish and the heap blows up
after reading in less than 1% of the data.


> Any decent CL implementation should get similar (or better!)
> performance. Try a similar simple loop in your configuration,
> and see if you can spot the bottleneck.
>
>
> -Rob
>
> -----
> Rob Warnock			<····@rpw3.org>
> 627 26th Avenue			<URL:http://rpw3.org/>
> San Mateo, CA 94403		(650)572-2607
From: Rob Warnock
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <N9qdnab9XdqBpbHYnZ2dnUVZ_oadnZ2d@speakeasy.net>
remixer <········@gmail.com> wrote:
+---------------
| Rob Warnock wrote:
| > Note that by doing individual READ-LINEs and making sure that the
| > values were garbage by the time the next GC occurred, the above run
| > never used more than ~13 MB of additional heap at any one time,
| > despite the fact that the file was an order of magnitude larger.
| 
| Are there ways to make sure something is GC'ed? ...
+---------------

"Don't hold onto it" is the only way I know. That is, overwrite any
variables or object slots that are holding a reference to the thing
you want to drop.
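
For example (a sketch):

    > (defvar *data* (make-array 10000000)) ; ~40 MB of heap

    *DATA*
    > (setq *data* nil)    ; overwrite the only reference

    NIL
    > (gc :full t)         ; now nothing refers to the array

    NIL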

NOTA BENE: When working in the REPL, don't forget these:

    [CLHS]
    25.1.1 Top level loop
    ...

    *    +    /    -  
    **   ++   //      
    ***  +++  ///     
    Figure 25-1. Variables maintained by the Read-Eval-Print Loop

Fortunately, clearing them *all* out requires only typing "0<return>"
[or any other expression with minimal results] three or four times, e.g.:

    > (setf *print-length* 20)

    20
    > (make-array 10000000)

    #(0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...)
    > (gc :full t)
    ; [GC threshold exceeded with 40,040,408 bytes in use.  Commencing GC.]
    ; [GC completed with 40,034,984 bytes retained and 5,424 bytes freed.]
    ; [GC will next occur when at least 52,034,984 bytes are in use.]

    NIL
    > (gc :full t)
    ; [GC threshold exceeded with 40,039,016 bytes in use.  Commencing GC.]
    ; [GC completed with 40,032,848 bytes retained and 6,168 bytes freed.]
    ; [GC will next occur when at least 52,032,848 bytes are in use.]

    NIL
    > (gc :full t)
    ; [GC threshold exceeded with 40,036,880 bytes in use.  Commencing GC.]
    ; [GC completed with 40,032,848 bytes retained and 4,032 bytes freed.]
    ; [GC will next occur when at least 52,032,848 bytes are in use.]

    NIL
    > (gc :full t)
    ; [GC threshold exceeded with 40,036,880 bytes in use.  Commencing GC.]
    ; [GC completed with 32,848 bytes retained and 40,004,032 bytes freed.]
    ; [GC will next occur when at least 12,032,848 bytes are in use.]

    NIL
    > 

And now it's finally gone!  [Hint: What was in "*" after each input?]

+---------------
| I tell AllegroCache about limits on the cache, but that doesn't help --
| it is sluggish and the heap blows up after reading in less than 1% of the data.
+---------------

Sounds like an Allegrocache issue then, but since I don't use it
I can't be of any further help. (Sorry.)


-Rob

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607
From: ········@gmail.com
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160526230.987415.233290@i42g2000cwa.googlegroups.com>
> "Don't hold onto it" is the only way I know. That is, overwrite any
> variables or object slots that are holding a reference to the thing
> you want to drop.

Although you're right, I think (as per other notes I've left in this
thread) that in these sorts of applications one generally doesn't want
to load all the data incore (here dating myself!) at all, but just scan
it "entry at a time", modulate the perceptron or whatever, and then
immediately replace it with the next entry. This may require rescanning
a bunch of times, but you never have the whole dataset in core -- The
old time/space tradeoff. (I know that you know this, Rob; I'm just
preaching it for everyone's benefit in case some don't!) If scanning
the whole dataset takes, say, an hour, this is perfectly reasonable
(esp. as we have 5 years! :-) But obviously one also wants to just use
a subset of the data first to get the "loudest" statistics first, and
then refine. This could lead to overfitting if one isn't careful... but
one is, of course, careful! :-)
From: remixer
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160554628.482949.111050@m73g2000cwd.googlegroups.com>
········@gmail.com wrote:
> > "Don't hold onto it" is the only way I know. That is, overwrite any
> > variables or object slots that are holding a reference to the thing
> > you want to drop.
>
> Although you're right, I think (as per other notes I've left in this
> thread) that in these sorts of applications one generally doesn't want
> to load all the data incore (here dating myself!) at all, but just scan
> it "entry at a time", modulate the perceptron or whatever, and then
> immediately replace it with the next entry. This may require rescanning
> a bunch of times, but you never have the whole dataset in core -- The
> old time/space tradeoff. (I know that you know this, Rob; I'm just
> preaching it for everyone's benefit in case some don't!) If scanning
> the whole dataset takes, say, an hour, this is perfectly reasonable
> (esp. as we have 5 years! :-) But obviously one also wants to just use
> a subset of the data first to get the "loudest" statistics first, and
> then refine. This could lead to overfitting if one isn't careful... but
> one is, of course, careful! :-)

Thanks for your suggestions. I got rid of AllegroCache and things work.
But here is a counter-argument, in favor of storing all the data in
RAM. Let's say for each rating I keep the userid and how many stars
they gave it -- that is 10 million instances of a struct that contains
two fixnums. I am using a Franz Allegro CL image with 8-byte words, so
that leads to a total space usage of 10 million * 2 * 8 bytes = 160 MB.
There are 18000 movies; for each we have a year and a name string.
Let's say the name averages 20 characters, so (20 + 8) * 18000 = ~500 KB.

My lisp heap is 900 MB, so in theory I can have all of this in RAM.
But I am missing the overheads and the cost of indexing. Here is how I
am storing the data:

(defstruct rating
  (uid 0 :type fixnum)
  (stars 0 :type fixnum))

(defstruct movie
  (name "" :type string)
  (year 0 :type fixnum))

(excl:tenuring ;;keep these around

  (defvar *rating-data* (make-array 18000))
  (defvar *movie-data* (make-array 18000)))

The entries of *movie-data* are instances of the movie struct, and the
entries of *rating-data* are lists of rating objects. What is a more
realistic estimate of how much RAM is needed? My estimate of 160.5 MB
is wrong, as the heap blows up well before all the data is read in. I
am positive that I am not adding anything else to the memory.
From: ······@mytrashmail.com
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160665907.302517.167030@i42g2000cwa.googlegroups.com>
remixer wrote:
> Thanks for your suggestions. I got rid of AllegroCache and things work.
> But here is a counter-argument, in favor of storing all the data in
> RAM. Let's say for each rating I keep the userid and how many stars
> they gave it -- that is 10 million instances of a struct that contains
> two fixnums. I am using a Franz Allegro CL image with 8-byte words, so
> that leads to a total space usage of 10 million * 2 * 8 bytes = 160 MB.
> There are 18000 movies; for each we have a year and a name string.
> Let's say the name averages 20 characters, so (20 + 8) * 18000 = ~500 KB.

> My estimate of 160.5 MB is wrong, as the heap blows up well before
> all the data is read in. I am positive that I am not adding anything
> else to the memory.

I believe that the correct number of instances is 100 million,
which means that you are looking at about 1.6 GB of memory used.
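
And that is counting only the slots. A rough accounting of the
overhead (assuming, as is typical for a 64-bit lisp, one header word
per struct instance and one cons cell per list link): each rating then
costs about 3 * 8 = 24 bytes for the struct plus 16 bytes for the
cons, i.e. ~40 bytes rather than 16. For 100 million ratings that is
on the order of 4 GB -- far beyond a 900 MB heap.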
From: Asbjørn Bjørnstad
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160497560.247107.21210@h48g2000cwc.googlegroups.com>
·······@gmail.com wrote:

> However, plain read-line as you are suggesting yields the
> same sluggish performance, of roughly 1MB per *minute*

Have you checked what the CPU is doing? Databases are slow
when inserting/updating data because of all that ACID stuff.
You may want to convert the data using lisp (or whatever) and
then load the tables directly into the database if possible.
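
A minimal sketch of that route (assuming MySQL; LOAD DATA INFILE is
its bulk loader, and the file and table names here are made up):

(defun csv->tsv (in-path out)
  ;; Turn "uid,stars,date" lines into tab-separated lines,
  ;; which bulk loaders ingest directly.
  (with-open-file (in in-path)
    (loop for line = (read-line in nil nil)
          while line
          do (write-line (substitute #\Tab #\, line) out))))

;; mysql> LOAD DATA INFILE '/tmp/ratings.tsv' INTO TABLE ratings;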
-- 
 -asbjørn
From: Ondrej Svitek
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160481273.783505.139630@i42g2000cwa.googlegroups.com>
remixer wrote:
> (defun csv->list (str)
>  (excl::split-re "," str))

Try this instead:

(defun csv->list (str)
  (excl::split-re (load-time-value (excl::compile-re "," :return :index))
                  str))

One would expect there to be a compiler-macro doing a similar
transformation, but it is sometimes better not to expect anything.
Calling compile-re many times at runtime can really slow things down.
From: jayessay
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <m33b9wyy2n.fsf@rigel.goldenthreadtech.com>
"remixer" <········@gmail.com> writes:

>   (setq *db* (open-file-database db-path
> 		 :if-does-not-exist :create
                 :class-cache-size 10000000)) ; or some such number

That will force acache to try to limit cache size to what you gave.
Otherwise, it will just keep trying to increase the cache size.


/Jon

-- 
'j' - a n t h o n y at romeo/charley/november com
From: Joel Reymont
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160382066.027665.270860@k70g2000cwa.googlegroups.com>
remixer wrote:
> I tried Franz and Allegrocache -- and it failed with GC
> errors. Understandable, I was creating too many objects that didnt fit
> the memory. I searched in vain for a feature that would let me delete
> the objects from RAM and just keep them on disk. Couldnt find it.

Were you using the personal edition of ACL?

AllegroCache has an option to set the number of objects to keep in the
memory cache.
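
In the code you posted, that is the OBJECT-CACHE-SIZE setting -- e.g.,
something like (a sketch; check the AllegroCache documentation for the
details):

(setf (object-cache-size 'rating) 100000)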
From: boating12345
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160387458.857088.251970@m7g2000cwm.googlegroups.com>
We are going to have a bash at this...

I would have thought the first task is to get the data into a MySQL
database for analysis.

We will probably use SBCL and GBBOPEN under lisp to do the analysis.

Should be fun!

Paul
http://www.hyperstring.net
From: Ken Tilton
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <C3rWg.5$Uk4.0@newsfe08.lga>
remixer wrote:
> I don't know the details of Lisp memory management, but I am going to tell
> you how I failed to do something that shouldn't be very difficult.
> 
> I was thinking of playing with the data for the Netflix Prize
> (http://www.netflixprize.com/index) -- 

"An eligible algorithm ... must not require any third-party software or 
licenses, payment on the part of Netflix, or otherwise prevent Netflix 
from exercising the rights granted hereunder."

I love how the prospect of having to pay a hundred bucks to use some 
third-party lib "prevents Netflix from [using the software]".

So you cannot use AllegroCache, and....thank you, Richard Stallman!

kt

-- 
Cells: http://common-lisp.net/project/cells/

"I'll say I'm losing my grip, and it feels terrific."
    -- Smiling husband to scowling wife, New Yorker cartoon
From: Christopher Browne
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <87y7rp9y73.fsf@wolfe.cbbrowne.com>
In the last exciting episode, Ken Tilton <·········@gmail.com> wrote:
> remixer wrote:
>> I don't know the details of Lisp memory management, but I am going to tell
>> you how I failed to do something that shouldn't be very difficult.
>> I was thinking of playing with the data for the Netflix Prize
>> (http://www.netflixprize.com/index) -- 
>
> "An eligible algorithm ... must not require any third-party software
> or licenses, payment on the part of Netflix, or otherwise prevent
> Netflix from exercising the rights granted hereunder."
>
> I love how the prospect of having to pay a hundred bucks to use some
> third-party lib "prevents Netflix from [using the software]".
>
> So you cannot use AllegroCache, and....thank you, Richard Stallman!

Nor can you use MySQL(tm); that represents third-party software
requiring licensing.  Depending on the purpose, licenses may either be
free of charge, or cost on the order of $400 USD...
-- 
(format nil ···@~S" "cbbrowne" "gmail.com")
http://linuxdatabases.info/info/slony.html
"That's  convenience, not cracker-proofing.   Security is  an emergent
property, not a feature." -- void <·····@interport.net>
From: Jack Unrue
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <8dpki29e1ek5aandi8jtle8t88d6uc7ekm@4ax.com>
On Mon, 09 Oct 2006 08:13:26 -0400, Ken Tilton <·········@gmail.com> wrote:
> 
> remixer wrote:
> > I don't know the details of Lisp memory management, but I am going to tell
> > you how I failed to do something that shouldn't be very difficult.
> > 
> > I was thinking of playing with the data for the Netflix Prize
> > (http://www.netflixprize.com/index) -- 
> 
> "An eligible algorithm ... must not require any third-party software or 
> licenses, payment on the part of Netflix, or otherwise prevent Netflix 
> from exercising the rights granted hereunder."
> 
> I love how the prospect of having to pay a hundred bucks to use some 
> third-party lib "prevents Netflix from [using the software]".

This subject has been addressed in the forum FAQ. There is no
prohibition against using well-known third-party tools or libs
as long as there are alternatives:

http://www.netflixprize.com/community/viewtopic.php?id=74

The issue is whether your *algorithm* is available for them to use
without having to license it or be sued or whatever.

-- 
Jack Unrue
From: Ken Tilton
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <wzCWg.16$Fv5.12@newsfe10.lga>
Jack Unrue wrote:
> On Mon, 09 Oct 2006 08:13:26 -0400, Ken Tilton <·········@gmail.com> wrote:
> 
>>remixer wrote:
>>
>>>I don't know the details of Lisp memory management, but I am going to tell
>>>you how I failed to do something that shouldn't be very difficult.
>>>
>>>I was thinking of playing with the data for the Netflix Prize
>>>(http://www.netflixprize.com/index) -- 
>>
>>"An eligible algorithm ... must not require any third-party software or 
>>licenses, payment on the part of Netflix, or otherwise prevent Netflix 
>>from exercising the rights granted hereunder."
>>
>>I love how the prospect of having to pay a hundred bucks to use some 
>>third-party lib "prevents Netflix from [using the software]".
> 
> 
> This subject has been addressed in the forum FAQ. There is no
> prohibtion against using well-known third-party tools or libs
> as long as there are alternatives:

I feel a naggum coming on: "Please assume I understand English."

I quoted them saying "require". I know the word "require". Ergo my 
hypothetical third-party library was a commercial, proprietary, 
non-open, patented, go-fuck-yourself-its-ours but really great --- I 
don't know -- math or statistics or GA or something library. No lib, no 
10% improvement. $99 or $199 or $999 a seat, no runtime licensing 
charge. Actually, it could be $.01 but licensed proprietary.

That would "prevent Netflix from using it". PWUUAHAHAHAHAHAA. I wonder 
if they pay for electricity, and if so, how they can sleep at night. 
PWHAHAHAHAHAHAHA!

kenny

-- 
Cells: http://common-lisp.net/project/cells/

"I'll say I'm losing my grip, and it feels terrific."
    -- Smiling husband to scowling wife, New Yorker cartoon
From: Rahul Jain
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <871wpgkbzc.fsf@nyct.net>
Ken Tilton <·········@gmail.com> writes:

> That would "prevent Netflix from using it". PWUUAHAHAHAHAHAA. I wonder
> if they pay for electricity, and if so, how they can sleep at night. 

In the summer, in a very hot room.
In the winter, in a very cold room.

-- 
Rahul Jain
·····@nyct.net
Professional Software Developer, Amateur Quantum Mechanicist
From: Thomas A. Russ
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <ymi1wpg15tc.fsf@sevak.isi.edu>
Rahul Jain <·····@nyct.net> writes:

> Ken Tilton <·········@gmail.com> writes:
> 
> > That would "prevent Netflix from using it". PWUUAHAHAHAHAHAA. I wonder
> > if they pay for electricity, and if so, how they can sleep at night. 
> 
> In the summer, in a very hot room.
> In the winter, in a very cold room.

In California, it's all the same and it doesn't matter
what the season. ;)


-- 
Thomas A. Russ,  USC/Information Sciences Institute
From: Rahul Jain
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <87k637j3av.fsf@nyct.net>
···@sevak.isi.edu (Thomas A. Russ) writes:

> Rahul Jain <·····@nyct.net> writes:
>
>> Ken Tilton <·········@gmail.com> writes:
>> 
>> > That would "prevent Netflix from using it". PWUUAHAHAHAHAHAA. I wonder
>> > if they pay for electricity, and if so, how they can sleep at night. 
>> 
>> In the summer, in a very hot room.
>> In the winter, in a very cold room.
>
> In California, it's all the same and it doesn't matter
> what the season. ;)

Obviously! That's the secret weapon in their business plan!

-- 
Rahul Jain
·····@nyct.net
Professional Software Developer, Amateur Quantum Mechanicist
From: ········@gmail.com
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160515215.268997.19460@i3g2000cwc.googlegroups.com>
Rahul Jain wrote:
> ···@sevak.isi.edu (Thomas A. Russ) writes:
>
> > Rahul Jain <·····@nyct.net> writes:
> >
> >> Ken Tilton <·········@gmail.com> writes:
> >>
> >> > That would "prevent Netflix from using it". PWUUAHAHAHAHAHAA. I wonder
> >> > if they pay for electricity, and if so, how they can sleep at night.
> >>
> >> In the summer, in a very hot room.
> >> In the winter, in a very cold room.
> >
> > In California, it's all the same and it doesn't matter
> > what the season. ;)
>
> Obviously! That's the secret weapon in their business plan!
>
> --
> Rahul Jain
> ·····@nyct.net
> Professional Software Developer, Amateur Quantum Mechanicist

You guys clearly don't live in California, or at least not the same
California that I live in; it is decidedly NOT the same temperature all
year 'round! (At least not in N.Cal.) But since everyone THINKS that it
is (including those who live here!) efficient heaters and air
conditioners are rare (in homes), so not only isn't it the same temp.
*outside*, but it's not the same temp *INSIDE* either!
Brrrrrrrrrrrrrrrr!
From: Raffael Cavallaro
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <2006101018100350073-raffaelcavallaro@pasdespamsilvousplaitmaccom>
On 2006-10-10 17:20:15 -0400, ········@gmail.com said:

> But since everyone THINKS that it
> is (including those who live here!) efficient heaters and air
> conditioners are rare (in homes), so not only isn't it the same temp.
> *outside*, but it's not the same temp *INSIDE* either!

Yes, but the temperature variation is so small that this inefficiency 
can be worked around by simply putting on a sweater when it's cool or 
wearing shorts and a t-shirt when it's hot. The sweater won't do it in a 
part of the world where temperatures get well below freezing in the 
winter - something that just never happens in most of California. It's 
always amusing to see the confused looks on the faces of new students 
from California wearing shorts and flip-flops outdoors in freezing 
weather shivering their asses off because they looked out of their 
well-heated dorm rooms in the morning and saw that it was sunny...
From: Thomas A. Russ
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <ymifydvzpv3.fsf@sevak.isi.edu>
Raffael Cavallaro <················@pas-d'espam-s'il-vous-plait-mac.com> writes:

> It's
> always amusing to see the confused looks on the faces of new students
> from California wearing shorts and flip-flops outdoors in freezing
> weather shivering their asses off because they looked out of their
> well-heated dorm rooms in the morning and saw that it was sunny...

Well, the LA Times weather page today had the headline "Unseasonably
Cold", so I put on a long-sleeved shirt this morning to go to work.  :)

-- 
Thomas A. Russ,  USC/Information Sciences Institute
From: Rahul Jain
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <877iz7iyvr.fsf@nyct.net>
········@gmail.com writes:

> Rahul Jain wrote:
>> ···@sevak.isi.edu (Thomas A. Russ) writes:
>>
>> > Rahul Jain <·····@nyct.net> writes:
>> >
>> >> Ken Tilton <·········@gmail.com> writes:
>> >
>> > In California, it's all the same and it doesn't matter
>> > what the season. ;)
>>
>> Obviously! That's the secret weapon in their business plan!
>
> You guys clearly don't live in California, or at least not the same
> California that I live in; it is decidedly NOT the same temperature all
> year 'round! (At least not in N.Cal.) But since everyone THINKS that it
> is (including those who live here!) efficient heaters and air
> conditioners are rare (in homes), so not only isn't it the same temp.
> *outside*, but it's not the same temp *INSIDE* either!
> Brrrrrrrrrrrrrrrr!

Shhhh! you'll confuse Kenny!

-- 
Rahul Jain
·····@nyct.net
Professional Software Developer, Amateur Quantum Mechanicist
From: ········@gmail.com
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160522530.315147.236790@b28g2000cwb.googlegroups.com>
Rahul Jain wrote:
>...
> > You guys clearly don't live in California, or at least not the same
> > California that I live in; it is decidedly NOT the same temperature all
> > year 'round! (At least not in N.Cal.) But since everyone THINKS that it
> > is (including those who live here!) efficient heaters and air
> > conditioners are rare (in homes), so not only isn't it the same temp.
> > *outside*, but it's not the same temp *INSIDE* either!
> > Brrrrrrrrrrrrrrrr!
>
> Shhhh! you'll confuse Kenny!

I think it very unlikely, although one can always hold out hope. :-)
From: Rahul Jain
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <87lknmhmm6.fsf@nyct.net>
········@gmail.com writes:

> Rahul Jain wrote:
>>...
>> > You guys clearly don't live in California, or at least not the same
>> > California that I live in; it is decidedly NOT the same temperature all
>> > year 'round! (At least not in N.Cal.) But since everyone THINKS that it
>> > is (including those who live here!) efficient heaters and air
>> > conditioners are rare (in homes), so not only isn't it the same temp.
>> > *outside*, but it's not the same temp *INSIDE* either!
>> > Brrrrrrrrrrrrrrrr!
>>
>> Shhhh! you'll confuse Kenny!
>
> I think it very unlikely, although one can always hold out hope. :-)

That's true. He's already confused himself. ;)

-- 
Rahul Jain
·····@nyct.net
Professional Software Developer, Amateur Quantum Mechanicist
From: Pascal Bourguignon
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <874pubg864.fsf@thalassa.informatimago.com>
········@gmail.com writes:

> Rahul Jain wrote:
>> ···@sevak.isi.edu (Thomas A. Russ) writes:
>>
>> > Rahul Jain <·····@nyct.net> writes:
>> >
>> >> Ken Tilton <·········@gmail.com> writes:
>> >>
>> >> > That would "prevent Netflix from using it". PWUUAHAHAHAHAHAA. I wonder
>> >> > if they pay for electricity, and if so, how they can sleep at night.
>> >>
>> >> In the summer, in a very hot room.
>> >> In the winter, in a very cold room.
>> >
>> > In California, it's all the same and it doesn't matter
>> > what the season. ;)
>>
>> Obviously! That's the secret weapon in their business plan!
>>
>> --
>> Rahul Jain
>> ·····@nyct.net
>> Professional Software Developer, Amateur Quantum Mechanicist
>
> You guys clearly don't live in California, or at least not the same
> California that I live in; it is decidedly NOT the same temperature all
> year 'round! (At least not in N.Cal.) But since everyone THINKS that it
> is (including those who live here!) efficient heaters and air
> conditioners are rare (in homes), so not only isn't it the same temp.
> *outside*, but it's not the same temp *INSIDE* either!
> Brrrrrrrrrrrrrrrr!

Is it still California north of San Francisco?

;-)

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/
You never feed me.
Perhaps I'll sleep on your face.
That will sure show you.
From: Espen Vestre
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <m11wpfz0wk.fsf@doduo.netfonds.no>
········@gmail.com writes:

> year 'round! (At least not in N.Cal.) But since everyone THINKS that it
> is (including those who live here!) efficient heaters and air
> conditioners are rare (in homes), so not only isn't it the same temp.
> *outside*, but it's not the same temp *INSIDE* either!
> Brrrrrrrrrrrrrrrr!

Yeah, this applies to a lot of office buildings as well, and is the
reason why a Xerox Lisp Machine was actually used as a heater on the
Stanford Campus in January years ago :-)

(hah! You never thought I could bring the thread on-topic again *that*
 way, did you?)
-- 
  (espen)
From: ········@gmail.com
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160433144.111898.212650@e3g2000cwe.googlegroups.com>
> I don't know the details of Lisp memory management, but I am going to tell
> you how I failed to do something that shouldn't be very difficult.

You should read line-by-line through the file, and split-re is very
slow. Also, compile your code!

Some experiments (on the actual datafile!):

Allegro CL Enterprise Edition
8.0 [Linux (x86)] (Aug 2, 2006 16:05)
Copyright (C) 1985-2005, Franz Inc., Oakland, CA, USA.  All Rights
Reserved.

This development copy of Allegro CL is licensed to:
   [TC9917] The Carnegie Inst. of Washington

;; Loading cl-init.cl

;; **> Use (force-all) to setup for complete recompile...
;; **> Use (bll) for a quick load of basic BioLingua...
;; **> Use (lwl) to load BioLingua and the databases...

;; Optimization settings: safety 1, space 1, speed 1, debug 2.
;; For a complete description of all compiler switches given the
;; current optimization settings evaluate (EXPLAIN-COMPILER-SETTINGS).
CL-USER(10): (time
 (with-open-file
  (i "netflix.txt")
  (time
   (dotimes (k 10000)
     (let* ((line (read-line i nil nil))
            (parse (excl::split-re "," line))))))))
; cpu time (non-gc) 1,000 msec user, 0 msec system
; cpu time (gc)     40 msec user, 0 msec system
; cpu time (total)  1,040 msec user, 0 msec system
; real time  2,080 msec
; space allocation:
;  1,596,447 cons cells, 16,725,376 other bytes, 0 static bytes
; cpu time (non-gc) 1,000 msec user, 0 msec system
; cpu time (gc)     40 msec user, 0 msec system
; cpu time (total)  1,040 msec user, 0 msec system
; real time  2,082 msec
; space allocation:
;  1,596,766 cons cells, 16,752,448 other bytes, 0 static bytes
NIL
CL-USER(11): (time
 (with-open-file
  (i "netflix.txt")
  (time
   (dotimes (k 10000)
     (let* ((line (read-line i nil nil))
;           (parse (excl::split-re "," line))
            ))))))
; cpu time (non-gc) 100 msec user, 0 msec system
; cpu time (gc)     0 msec user, 0 msec system
; cpu time (total)  100 msec user, 0 msec system
; real time  108 msec
; space allocation:
;  190,017 cons cells, 803,664 other bytes, 0 static bytes
; cpu time (non-gc) 110 msec user, 0 msec system
; cpu time (gc)     0 msec user, 0 msec system
; cpu time (total)  110 msec user, 0 msec system
; real time  110 msec
; space allocation:
;  190,347 cons cells, 830,896 other bytes, 0 static bytes
NIL
CL-USER(12): (time
 (with-open-file
  (i "netflix.txt")
  (time
   (dotimes (k 10000)
     (let* ((line (read-line i nil nil))
            (comma-pos (position #\, line))
            (leftpart (subseq line 0 comma-pos))
            (rightpart (subseq line comma-pos)))
            )))))
; cpu time (non-gc) 170 msec user, 0 msec system
; cpu time (gc)     10 msec user, 0 msec system
; cpu time (total)  180 msec user, 0 msec system
; real time  293 msec
; space allocation:
;  480,017 cons cells, 1,051,216 other bytes, 0 static bytes
; cpu time (non-gc) 170 msec user, 0 msec system
; cpu time (gc)     10 msec user, 0 msec system
; cpu time (total)  180 msec user, 0 msec system
; real time  293 msec
; space allocation:
;  480,347 cons cells, 1,078,456 other bytes, 0 static bytes

You can even get faster than this if you reuse array space.... Oh, and
this isn't compiled. Here's the compiled speed (note that I had to
actually do something with the results or the compiler just removes
most of the code!):

CL-USER(1):  (progn
(compile
 (defun foo ()
   (with-open-file
    (i "netflix.txt")
    (time
     (dotimes (k 10000)
       (let* ((line (read-line i nil nil))
              (comma-pos (position #\, line))
              (leftpart (when comma-pos (subseq line 0 comma-pos)))
              (rightpart (when comma-pos (subseq line comma-pos)))
              )))))))
(time (foo))
)
; While compiling (:INTERNAL FOO 0):
Warning: Variable RIGHTPART is never used.
Warning: Variable LEFTPART is never used.
; cpu time (non-gc) 20 msec user, 0 msec system
; cpu time (gc)     10 msec user, 0 msec system
; cpu time (total)  30 msec user, 0 msec system
; real time  32 msec
; space allocation:
;  0 cons cells, 950,128 other bytes, 0 static bytes
; cpu time (non-gc) 20 msec user, 0 msec system
; cpu time (gc)     10 msec user, 0 msec system
; cpu time (total)  30 msec user, 0 msec system
; real time  33 msec
; space allocation:
;  340 cons cells, 974,992 other bytes, 0 static bytes
NIL

At those speeds, you shouldn't have any trouble at all processing that
file using Lisp! Should take you, like, a few minutes -- maybe an
hour if you're doing something complex.
From: ········@gmail.com
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160434584.296122.90280@i42g2000cwa.googlegroups.com>
Oops. I told myself to do something with the results and didn't! Here's
the "real" compiled timing. (Not much different!)

> You can even get faster than this if you reuse array space.... Oh, and
> this isn't compiled. Here's the compiled speed (note that I had to
> actually do something with the results or the compiler just removes
> most of the code!):
>
CL-USER(1): (progn
  (compile
   (defun foo ()
     (with-open-file
      (i "netflix.txt")
      (time
       (dotimes (k 10000)
         (let* ((line (read-line i nil nil))
                (comma-pos (position #\, line))
                (leftpart (when comma-pos (subseq line 0 comma-pos)))
                (rightpart (when comma-pos (subseq line comma-pos)))
                )
           (setf x (cons leftpart rightpart)))
         )))))
  (time (foo))
  )
; While compiling (:INTERNAL FOO 0):
Warning: Free reference to undeclared variable X assumed special.
; cpu time (non-gc) 20 msec user, 0 msec system
; cpu time (gc)     10 msec user, 0 msec system
; cpu time (total)  30 msec user, 0 msec system
; real time  32 msec
; space allocation:
;  10,000 cons cells, 950,128 other bytes, 0 static bytes
; cpu time (non-gc) 20 msec user, 0 msec system
; cpu time (gc)     10 msec user, 0 msec system
; cpu time (total)  30 msec user, 0 msec system
; real time  34 msec
; space allocation:
;  10,340 cons cells, 974,992 other bytes, 0 static bytes
NIL
From: ········@gmail.com
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160434788.117343.294590@e3g2000cwe.googlegroups.com>
In fact, compiled ACL8 read-lines (only) through the file at a million
lines/second on my fairly small machine (as described above).
From: ········@gmail.com
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160461246.154728.166700@c28g2000cwb.googlegroups.com>
With very little work (using compiled ACL8 straight CL on my pokey
computer, as above), I can scan and reduce the entire netflix prize
download dataset into a more compact form in just over an hour. The
compact form permits much faster scanning of the whole dataset, but
what one really wants to do is to be able to index into this compacted
form in order to do specific computations. That wouldn't be much harder
than what I've already done.
From: remixer
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160463716.586258.183930@h48g2000cwc.googlegroups.com>
········@gmail.com wrote:
> With very little work (using compiled ACL8 straight CL on my pokey
> computer, as above), I can scan and reduce the entire netflix prize
> download dataset into a more compact form in just over an hour. The
> compact form permits much faster scanning of the whole dataset, but
> what one really wants to do is to be able to index into this compacted
> form in order to do specific computations. That wouldn't be much harder
> than what I've already done.

There are two indices that are really needed: movie-id -> ratings and
user-id -> ratings, plus a list of all user-ids. Would you like to share
your code that does the above compaction? Or team up for the prize?
The Netflix engineers should be a little unhappy, though -- someone has
already reached the 10% improvement milestone within a week (they had
the competition open for 5 years!). But just for kicks, I'd still be up
for playing around with it.
From: ········@gmail.com
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160496923.704534.161210@i3g2000cwc.googlegroups.com>
remixer wrote:
> ········@gmail.com wrote:
> > With very little work (using compiled ACL8 straight CL on my pokey
> > computer, as above), I can scan and reduce the entire netflix prize
> > download dataset into a more compact form in just over an hour. The
> > compact form permits much faster scanning of the whole dataset, but
> > what one really wants to do is to be able to index into this compacted
> > form in order to do specific computations. That wouldn't be much harder
> > than what I've already done.
>
> There are two indices that are really needed: movie-id -> ratings and
> user-id -> ratings, plus a list of all user-ids. Would you like to share
> your code that does the above compaction? Or team up for the prize?
> The Netflix engineers should be a little unhappy, though -- someone has
> already reached the 10% improvement milestone within a week (they had
> the competition open for 5 years!). But just for kicks, I'd still be up
> for playing around with it.

I've got it down to est. 45min preconversion for the whole dataset, and
then, once converted, it takes only est. 19min to scan the entire set
(with no computation or storing -- you don't want to store it
anyway[*]; you want to scan it and tune your prediction algorithm). I
don't want to partner as I don't think that this is an interesting
challenge -- to make it interesting you'd need a features list for
every movie. All they give you is the date and people's scores. But if
you publicly agree to give me 10% of whatever you get (regardless of
whether you end up using my code), I'll post it. (And if you publicly
agree to give me 25% I'll give it to you privately! :-)

[*] Estimates based on the first 1000 samples. There are ~17,000
entries in the whole set. That is, it took ~2.6min to convert 1000
entries, and ~1.1min to re-read the converted data. And I'm sure that
these could be improved somewhat if needed.
From: remixer
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160504856.110153.223900@c28g2000cwb.googlegroups.com>
········@gmail.com wrote:
> remixer wrote:
> > ········@gmail.com wrote:
> > > With very little work (using compiled ACL8 straight CL on my pokey
> > > computer, as above), I can scan and reduce the entire netflix prize
> > > download dataset into a more compact form in just over an hour. The
> > > compact form permits much faster scanning of the whole dataset, but
> > > what one really wants to do is to be able to index into this compacted
> > > form in order to do specific computations. That wouldn't be much harder
> > > than what I've already done.
> >
> > There are two indices that are really needed: movie-id -> ratings and
> > user-id -> ratings, plus a list of all user-ids. Would you like to share
> > your code that does the above compaction? Or team up for the prize?
> > The Netflix engineers should be a little unhappy, though -- someone has
> > already reached the 10% improvement milestone within a week (they had
> > the competition open for 5 years!). But just for kicks, I'd still be up
> > for playing around with it.
>
> I've got it down to est. 45min preconversion for the whole dataset, and
> then, once converted, it takes only est. 19min to scan the entire set
> (with no computation or storing -- you don't want to store it
> anyway[*]; you want to scan it and tune your prediction algorithm). I
> don't want to partner as I don't think that this is an interesting
> challenge -- to make it interesting you'd need a features list for
> every movie. All they give you is the date and people's scores. But if
> you publicly agree to give me 10% of whatever you get (regardless of
> whether you end up using my code), I'll post it. (And if you publicly
> agree to give me 25% I'll give it to you privately! :-)

Sure, I'll agree to the public 10% version -- I'll give you 10% of
whatever I make from this contest. I should warn you, though, that
someone has already achieved the Netflix milestone, and the competition
is over for all practical purposes, so the chances of you making
anything are almost zero. But I (and others) could learn
something from what you did.

Thanks.
From: ········@gmail.com
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160510708.221923.7440@k70g2000cwa.googlegroups.com>
> I should warn you, though, that
> someone has already achieved the Netflix milestone, and the competition
> is over for all practical purposes, so the chances of you making
> anything are almost zero.

Yeah, I know; I was sort of kidding. (Again, those pesky smilies! :-)

> But I (and others) could learn
> something from what you did.

The main thing they'd learn is to use compile-file. The only thing
that's even vaguely tricky here is that instead of bothering to parse
the numbers, I just put them out to the bulk data file in lisp-readable
form, so that (READ ...) will do all the work when they are re-read
from that file.

Here's the code, but to get good speed, you need to compile it first:

;;; Suck in all the training files (in the training_set directory)
;;; into one big data file that has the data in a lisp-readable form.

(defun nfconvert (&key (dates? nil))
  (with-open-file
      (o "data.lisp" :direction :output :if-exists :supersede)
    (time
     (loop for file in (directory "/seed/jshrager/nf/training_set/*.*")
           as j from 1 to 1000 ; remove this limit to do them all
           ;; Just tells us we're still alive.
           do (when (zerop (mod j 100)) (print j))
              (nfloaddirdata file o dates?)))))

(defun nfloaddirdata (file o dates?)
  (with-open-file
      (i file)
    ;; The header line is the movie id followed by a colon; drop the
    ;; colon and open this movie's list.
    (let ((headline (read-line i)))
      (format o "(~a~%" (subseq headline 0 (1- (length headline)))))
    (loop as line = (read-line i nil nil)
          until (null line)
          as j from 1 by 1
          ;; Since everything's in the SAME FORMAT I can use very specific
          ;; parsing instead of having to split the line. This could be made
          ;; even more efficient by re-using string arrays, and prob. other
          ;; techniques.
          do (when (zerop (mod j 5)) (format o "~%"))
             (let* ((comma-pos-1 (position #\, line))
                    (comma-pos-2 (position #\, line :start (1+ comma-pos-1))))
               (if dates?
                   (format o " (~a ~a ~s) "
                           (subseq line 0 comma-pos-1)
                           (subseq line (1+ comma-pos-1) comma-pos-2)
                           (subseq line (1+ comma-pos-2)))
                   ;; Use dots to save cons cells if we don't need the date.
                   (format o " (~a . ~a) "
                           (subseq line 0 comma-pos-1)
                           (subseq line (1+ comma-pos-1) comma-pos-2)))))
    (format o "~% )~%")))

;;; Just tests re-read speed.

(defun readtest (n)
  (time
   (with-open-file
    (o "data.lisp")
    (loop for i below n
          collect (length (read o))))))
From: K Livingston
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160547198.029575.216590@k70g2000cwa.googlegroups.com>
Why all the calls to subseq?  And calling position and then subseq and
then presumably calling read (or read-from-string) on that -- which
makes for at least 2 passes (3 if you count the subsequent read) over
each number.  You could do it in one pass, writing right into memory if
you so choose, with the current file format.

read-from-string (or read) will work to read one number at a time, it
requires no copying of strings in the lisp code we write, and it can be
given a start location so it can jump over the commas.  (Or, if you are
using read on a stream directly, just pull a character off the stream
to grab the comma, etc.)  Here is a simple example that grabs two
numbers; it could obviously be put into a loop (a sketch of that
follows the note below) or whatever is needed.

cl-user(2): (let ((str "1234,5678,9101112"))
              (multiple-value-bind (item1 end-pos-item1)
                  (read-from-string str)
                (pprint item1)
                (pprint (read-from-string str t nil
                                          :start (1+ end-pos-item1)))))

1234
5678

note: read-from-string has two optional parameters (like read)
regarding EOF error conditions, which you must specify before you get
to the keyword list where you can specify a start location.
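
Here is a minimal sketch of that loop version (mine, not part of the
original post) -- it walks a whole comma-separated line one number at a
time, using :start so that no substrings are ever copied:

(defun parse-line-numbers (str)
  ;; Collect every number on a comma-separated line of numbers.
  ;; read-from-string stops at each comma (a terminating macro
  ;; character), and (1+ end-pos) steps over the comma itself.
  (loop with pos = 0
        while (< pos (length str))
        collect (multiple-value-bind (item end-pos)
                    (read-from-string str t nil :start pos)
                  (setf pos (1+ end-pos))
                  item)))

;; (parse-line-numbers "1234,5678,9101112") => (1234 5678 9101112)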

-Kevin



········@gmail.com wrote:
> [nfloaddirdata code snipped; see upthread]
From: ········@gmail.com
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160581142.237255.324170@c28g2000cwb.googlegroups.com>
K Livingston wrote:
> Why all the calls to subseq?  and calling position and then subseq and
> then presumably calling read (or read-from-string) on that

Much slower:

CL-USER(4): (time (loop for i below 1000000
do (subseq "123,456,789" (position #\, "123,456,789"))))
; cpu time (non-gc) 10,910 msec user, 0 msec system
; cpu time (gc)     540 msec user, 0 msec system
; cpu time (total)  11,450 msec user, 0 msec system
; real time  22,935 msec
; space allocation:
;  17,000,167 cons cells, 136,001,384 other bytes, 0 static bytes
NIL
CL-USER(8): (time (loop for i below 1000000
do (read-from-string "123,456,789")))
; cpu time (non-gc) 15,720 msec user, 0 msec system
; cpu time (gc)     430 msec user, 0 msec system
; cpu time (total)  16,150 msec user, 0 msec system
; real time  32,619 msec
; space allocation:
;  15,000,615 cons cells, 104,170,016 other bytes, 0 static bytes
NIL
CL-USER(9):
From: Thomas A. Russ
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <ymibqoizscf.fsf@sevak.isi.edu>
········@gmail.com writes:

> K Livingston wrote:
> > Why all the calls to subseq?  and calling position and then subseq and
> > then presumably calling read (or read-from-string) on that

> Much slower:
> 
> CL-USER(4): (time (loop for i below 1000000
> do (subseq "123,456,789" (position #\, "123,456,789"))))
> ; cpu time (non-gc) 10,910 msec user, 0 msec system
> ; cpu time (gc)     540 msec user, 0 msec system
> ; cpu time (total)  11,450 msec user, 0 msec system
> ; real time  22,935 msec
> ; space allocation:
> ;  17,000,167 cons cells, 136,001,384 other bytes, 0 static bytes
> NIL
> CL-USER(8): (time (loop for i below 1000000
> do (read-from-string "123,456,789")))
> ; cpu time (non-gc) 15,720 msec user, 0 msec system
> ; cpu time (gc)     430 msec user, 0 msec system
> ; cpu time (total)  16,150 msec user, 0 msec system
> ; real time  32,619 msec
> ; space allocation:
> ;  15,000,615 cons cells, 104,170,016 other bytes, 0 static bytes
> NIL
> CL-USER(9):

But remember that you are computing two different things in these
loops.  The result of the "do" clause of the first loop is a string,
whereas in the second case it is a number.  To compare the timing
you should at least put a PARSE-INTEGER call in the first loop.

Speaking of PARSE-INTEGER, you could use it just directly with offsets,
right?

(defun read-integers (input)
  (let ((results nil)
        (int nil)
        (pos 0))
    (loop do (multiple-value-setq (int pos)
                (parse-integer input :start pos :junk-allowed t))
          when (null int) do (loop-finish)
          else do (push int results)
                  ;; Step over the separator, but don't run past the
                  ;; end of the string (an invalid :start index).
                  (setf pos (min (1+ pos) (length input))))
    (nreverse results)))

This could be optimized a bit more, perhaps if you had a fixed number
of fields to process.
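
For instance, a hedged sketch (mine, not from the post) of the
fixed-field case for the ratings lines, which always hold exactly a
customer id, a rating, and a date:

(defun parse-rating-line (line)
  ;; A line looks like "1488844,3,2005-09-06" (a made-up example in
  ;; the training-set format): integer, integer, date string.
  (multiple-value-bind (customer-id pos1)
      (parse-integer line :junk-allowed t)
    (multiple-value-bind (rating pos2)
        (parse-integer line :start (1+ pos1) :junk-allowed t)
      (values customer-id rating (subseq line (1+ pos2))))))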

-- 
Thomas A. Russ,  USC/Information Sciences Institute
From: ········@gmail.com
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160593622.539169.51400@e3g2000cwe.googlegroups.com>
Thomas A. Russ wrote:
> ········@gmail.com writes:
>
> > K Livingston wrote:
> > > Why all the calls to subseq?  and calling position and then subseq and
> > > then presumably calling read (or read-from-string) on that
...
> But remember that you are computing two different things in these
> loops.  The result of the "do" clause of the first loop is a string,
> whereas in the second case it is a number.  To compare the timing
> you should at least put a PARSE-INTEGER call in the first loop.

No No No! Everyone has missed the point here! I do NOT need a
parse-integer. I'm lisp-i-forming the data by writing it directly out
to a file (with ~a, so there aren't any double quotes!). Then when the
file is re-read it gets re-parsed by the reader. The read-from-string
solution does an EXTRA parse-integer (more precisely: an extra
read-from-string). My solution is ABSOLUTELY better than Livingston's
(at least for this approach). (This isn't to say that my approach
couldn't be improved, but Livingston's solution is worse.)
From: grackle
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160602801.215829.266660@h48g2000cwc.googlegroups.com>
I think everyone has missed another point here, which is that reading
files and parsing strings is not a big issue for this problem.  I
originally parsed the netflix data in the dumbest and easiest way
possible, even using cl-ppcre:split to break up the lines.  After
reading the comments here, I replaced the regexes with position and
subseq and got a minor speedup.

Bypassing the database inserts, on the other hand, gives a huge
speedup.  Performance with database inserts, for qualifying.txt (51 MB,
about 2.8 million lines, resulting in about 2.8 million rows inserted):

Evaluation took:
  734.149 seconds of real time
  544.71405 seconds of user run time
  39.726482 seconds of system run time
  [Run times include 30.258 seconds GC run time.]
  0 calls to %EVAL
  0 page faults and
  63,460,744,592 bytes consed.

Performance without database inserts:

Evaluation took:
  11.467 seconds of real time
  11.328708 seconds of user run time
  0.108007 seconds of system run time
  [Run times include 0.552 seconds GC run time.]
  0 calls to %EVAL
  0 page faults and
  929,187,392 bytes consed.

The OP should parse the files in whatever manner he finds easiest and
worry instead about how he gets the data into the database.

My code is below.  Comments on the code are welcome!  (Can anyone tell
me what the (when file) test is for?  I copied the with-open-file code
out of Practical Common Lisp.)

-David

#!/usr/local/bin/sbcl --noinform

(require 'asdf)
(asdf:operate 'asdf:load-op 'clsql)
(asdf:operate 'asdf:load-op 'cl-ppcre)

(use-package :clsql-user)

(connect '("" db-name db-user db-password) :database-type :mysql)
;; host, database, user, and password (placeholders redacted here)
(enable-sql-reader-syntax)

;; Populate table "qualifying" from qualifying.txt
;; For output-func I pass one of these two functions:
;;     qualifying-output-db
;;     qualifying-output-null
(defun populate-qualifying-position (output-func)
  (with-open-file
   (file "../download/qualifying.txt" :external-format :iso-8859-1)
   (when file  ;; can anyone tell me what this test is for?
     (let ((movie-id "-1")
	   (count 0))
       (do ((line (read-line file nil) (read-line file nil)))
	   ((null line))
	   (let ((colon-pos (position #\: line)))
	     (if colon-pos
		 (setf movie-id (subseq line 0 colon-pos))
	       (let ((comma-pos (position #\, line)))
		 (let ((customer-id (subseq line 0 comma-pos))
		       (date (subseq line (+ comma-pos 1) nil)))
		   (setf count (+ count 1))
		   (funcall output-func movie-id customer-id date count))))))))))

(defun qualifying-output-db (movie-id customer-id date count)
  (insert-records :into [qualifying]
		  :attributes '([movie_id] [customer_id] [date])
		  :values (list movie-id customer-id date))
  (when (zerop (mod count 10000))
    (format t "Inserted ~a records~%" count)))

(defun qualifying-output-null (movie-id customer-id date count)
  (declare (ignore movie-id customer-id date))
  (when (zerop (mod count 10000))
    (format t "Ignored ~a records~%" count)))

(time (populate-qualifying-position #'qualifying-output-db))
;;(time (populate-qualifying-position #'qualifying-output-null))
From: ········@gmail.com
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160667859.910872.177080@c28g2000cwb.googlegroups.com>
The idea (at least my idea) was not to put this stuff into a database
at all, but to create special-purpose compressions that could be more
rapidly scanned to extract desired statistics. Disk space is free, so
you can do a large number of such analyses, continually leaving interim
results in files. I showed a trivial example of such a compression.

If you know what calculations you want to do, this is the most
efficient way to do it. If you don't know what calculations you want to
do, you should be exploring a much smaller amount of the data (in
memory, if you like) until you do know what calculations you want to do.
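
To make that concrete, here is a minimal sketch (mine, not from the
post) of the scan-and-tune style against the compacted file, assuming
the dotted no-dates form of data.lisp that nfconvert writes:

(defun mean-rating (&optional (path "data.lisp"))
  ;; Stream over the compacted file computing one statistic (the
  ;; global mean rating) without ever holding the dataset in memory.
  ;; Each entry is (movie-id (customer-id . rating) ...), i.e. the
  ;; form written when dates? is nil.
  (with-open-file (in path)
    (loop with sum = 0 and n = 0
          for entry = (read in nil nil)
          while entry
          do (loop for (nil . rating) in (rest entry)
                   do (incf sum rating) (incf n))
          finally (return (/ sum n)))))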
From: ······@mytrashmail.com
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160721120.313423.75460@i42g2000cwa.googlegroups.com>
········@gmail.com wrote:
> The idea (at least my idea) was not to put this stuff into a database
> at all, but to create special-purpose compressions that could be more
> rapidly scanned to extract desired statistics. Disk space is free, so
> you can do a large number of such analysis, continually leaving interim
> results in files.I showed a trivial example of such a compression.
>
> If you know what calculations you want to do, this is the most
> efficient way to do it. If you don't know what calculations you want to
> do, you should be exploring a much small amount of the data (in memory,
> if you like) until you do know what calculations you want to do.

This is similar to what I did.

Modification to your original code:

(defun nfconvert (&key (dates? nil))
  (with-open-file
      (oindex "index.txt" :direction :output :if-exists :supersede)
    (with-open-file
        (o "data.lisp" :direction :output :if-exists :supersede)
      (loop for file in (directory "download/training_set/*.*")
            sum 1 into j do
            (when (zerop (mod j 100)) (print j))
            (nfloaddirdata file o oindex dates?)))))

(defun nfloaddirdata (file o oindex dates?)
  (with-open-file
      (i file)
    (let ((headline (read-line i)))
      (format oindex "~a~%" (file-position o)) ... rest is the same

This creates an index (on movie id) allowing one to quickly load only
the examples he needs. The index can be kept in memory at all times
(around 100k).

If the average number of movies a customer has rated is moderate then,
depending on physical memory, the same approach can be taken to create
a customer->movies mapping that can also be kept in memory at all
times. With these indexes in place there is no need to use a database,
and iterating over customers and movies is extremely fast.
From: ········@gmail.com
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160722889.360890.292410@i3g2000cwc.googlegroups.com>
······@mytrashmail.com wrote:
> ········@gmail.com wrote:
> > The idea (at least my idea) was not to put this stuff into a database
> > at all, ...
> > If you don't know what calculations you want to
> > do, you should be exploring a much small amount of the data (in memory,
> > if you like) until you do know what calculations you want to do.
>
> This is similar to what i did.
> Modification to your original code:
> ...

Yes, an excellent additional trick!

To complete your thought, all one needs to do is:

  (file-position data-stream indexed-position)
  (read data-stream)

and you've got the data nearly instantly! In this manner you can get
hold of as many or as few of the entries as you like, in completely
random-access fashion.
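
Wrapped up as a function, a minimal sketch (mine; the index argument is
assumed to be a hash table mapping movie id to byte offset, such as the
one read-mov-idx builds in the follow-up below):

(defun movie-ratings (movie-id index data-stream)
  ;; INDEX maps a movie id to the byte offset of that movie's entry in
  ;; data.lisp; DATA-STREAM is an open input stream on data.lisp.
  (file-position data-stream (gethash movie-id index))
  (read data-stream))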

Thanks for the extension!
From: ······@mytrashmail.com
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160813074.661147.202280@h48g2000cwc.googlegroups.com>
> Yes, an excellent additional trick!
>
> To complete your thought, all one needs to do is:
>
>   (file-position data-stream indexed-position)
>   (read data-stream)
>
> and you've got the data nearly instantly! In this manner you can get
> hold of as many or as few of the entries as you like, in completely
> random-access fashion.
>
> Thanks for the extension!

To finish up on the subject this is the code for a second index
(customer->ratings).

(defun read-mov-idx ()
  (with-open-file (i "mov_index.txt")
    (let ((htable (make-hash-table)))
      (loop for res = (read i nil)
            sum 1 into k
            while res do
            (setf (gethash k htable) res))
      htable)))

(defun make-cust-idx (movidx)
  (let ((assoctab (make-hash-table)))
    (with-open-file (ifile "data.lisp")
      (loop for k being the hash-keys in movidx using (hash-value v) do
           (file-position ifile v)
           (let ((res (cdr (read ifile))))
             (loop for elem in res do
                  (let ((custid (first elem)))
                    (unless (gethash custid assoctab)
                      (setf (gethash custid assoctab)
                            (make-array 1 :element-type 'fixnum
                                        :fill-pointer 0 :adjustable t)))
                    (vector-push-extend k (gethash custid assoctab)))))))
    (with-open-file (ofidx "cust_index.txt" :direction :output
                           :if-exists :supersede)
      (with-open-file (ofcust "cust.lisp" :direction :output
                              :if-exists :supersede)
        (loop for k being the hash-keys in assoctab using (hash-value v) do
             (format ofidx "~A ~A~%" k (file-position ofcust))
             (format ofcust "~S~%" v))))))

It requires about 600 MB of memory to build the index, and once it's
done cust.lisp takes up 550 MB and cust_index.txt about 8.5 MB.

So with both indices in place and resident in memory, the total memory
requirement is about 8.6 MB for a very fast way to get at your data
with minimum time wasted, no SQL, and fully integrated with lisp ;-)

I'm sure someone could optimize this further if needed, to reduce the
memory used at build time.
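
The lookup side would then be symmetric. A hedged sketch (mine, not the
poster's) of loading cust_index.txt and fetching one customer's vector
of movie ids from cust.lisp:

(defun read-cust-idx ()
  ;; cust_index.txt holds "custid offset" pairs, one pair per line.
  (with-open-file (i "cust_index.txt")
    (let ((htable (make-hash-table)))
      (loop for custid = (read i nil)
            while custid
            do (setf (gethash custid htable) (read i)))
      htable)))

(defun customer-movies (custid custidx cust-stream)
  ;; CUST-STREAM is an open input stream on cust.lisp; the value read
  ;; back is the #(...) vector of movie ids written by make-cust-idx.
  (file-position cust-stream (gethash custid custidx))
  (read cust-stream))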
From: ········@gmail.com
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160829704.262426.324940@m73g2000cwd.googlegroups.com>
······@mytrashmail.com wrote:
> To finish up on the subject this is the code for a second index
> (customer->ratings).

Good implementation skills, bad negotiation skills! You should've asked
for another 10% from Remixer... :-)
From: remixer
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160885496.253102.287810@e3g2000cwe.googlegroups.com>
······@mytrashmail.com wrote:
> > [...]
>
> To finish up on the subject this is the code for a second index
> (customer->ratings).
>
> [code snipped; see upthread]
>
> So with both indices in place and resident in memory, the total memory
> requirement is about 8.6 MB for a very fast way to get at your data
> with minimum time wasted, no SQL, and fully integrated with lisp ;-)

Thanks, this is great. Is it right that the third entry in the movie
index does not correspond to the entry for movie-id=3, since the output
of directory is not sorted? I ended up doing something like this to
make sure that the movie index was accurate.

;; MovieIDs range from 1 to 17770 sequentially.

(defparameter *num-movies* 17770)

;; Training-set files are named like mv_0000026.txt.

(defun make-filename-for-movie (mid)
  ;; *trainingdir* (defined elsewhere) names the training_set directory.
  (format nil "~Amv_~7,'0D.txt" *trainingdir* mid))

(defun load-all-ratings-sequentially (o oindex dates?
                                      &optional (num-movies *num-movies*))
  ;; read-movie (defined elsewhere) is the per-movie-file loader.
  (dotimes (i num-movies)
    (read-movie (make-filename-for-movie (+ i 1)) o oindex dates?)))
From: Asbjørn Bjørnstad
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160728920.628368.42100@f16g2000cwb.googlegroups.com>
······@mytrashmail.com wrote:

> This is similar to what i did.
>
> Modification to your original code:
>
> (defun nfconvert (&key (dates? nil))
>   (with-open-file
>       (oindex "index.txt" :direction :output :if-exists :supersede)
>     (with-open-file
>         (o "data.lisp" :direction :output :if-exists :supersede)
>     (loop for file in (directory
>                        "download/training_set/*.*")
>        sum 1 into j do
>        (when (zerop (mod j 100)) (print j))
>        (nfloaddirdata file o oindex dates?)))))
>
> (defun nfloaddirdata (file o oindex dates?)
>   (with-open-file
>       (i file)
>     (let ((headline (read-line i)))
>       (format oindex "~a~%" (file-position o)) ... rest is the same
>
> This creates an index (on movie id) allowing one to quickly load only
> the examples he needs.
> Index can be kept in memory at all times (around 100k)

Another approach is, as a friend of mine likes to say:
"The filesystem is an excellent database."

Use one directory with one file per movie holding the "raw" data, just
modified for easy reading. One directory for user data (maybe a
directory tree, depending on how much data you want to generate for
each user and whether you want to keep different users' data in
separate files or group them together). One directory for statistics
about each movie? (Maybe just have a subdirectory for each movie and
keep different types of data in different files.) If you want to get
fancy, you can use symbolic links as foreign keys. And so on.

Basically, the directory structure becomes your index. Easy to
add/delete indexes, grab a subset of the data, etc.
-- 
 -asbjxrn
From: ········@gmail.com
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160751058.041998.249220@m7g2000cwm.googlegroups.com>
Asbjørn Bjørnstad wrote:
> Another approach is, as a friend of mine likes to say:
> "The filesystem is an excellent database."

Actually, the filesystem is a very poor database if you care about
speed (which you always do for large amounts of data). It's very slow
to walk directories and constantly open and close files. The method you
suggest would be hundreds of times slower than the single file +
file-position indexing method that has been developed in this thread.
(In fact, this is how the data starts out when you get it: as a set of
directories with tens of thousands of small files; much of the time
wasted in the initial compression phase is just opening and closing
those files!)
From: Asbjørn Bjørnstad
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160758787.177429.90160@e3g2000cwe.googlegroups.com>
········@gmail.com wrote:
> Asbjørn Bjørnstad wrote:
> > Another approach is, as a friend of mine likes to say:
> > "The filesystem is an excellent database."
>
> Actually, the filesystem is a very poor database if you care about
> speed (which you always do for large amounts of data).

Depends on what you compare it with, I think. This thread started
with the original poster complaining about processing 1 MB/min.
(Yes, reading from a database has much less overhead than
writing to it. I agree.)
And in my friend's defence, he knows databases have their place.
He's just a strong supporter of the keep-it-simple approach: if a
flat-file/filesystem solution is good enough, and ACID is irrelevant
(as in this case), why bother with the database?

> It's very slow
> to do directories and constantly open and close files. The method you
> suggest would be hundreds of times slower than the single file +
> fileposition indexing method that has been developed in this thread.

Hundreds of times slower? How did you get to that number?
Even if I really optimize the data processing, I don't get more
than a 3x speedup. (Not that that is insignificant.):

·······@valad:~/netflix/converted_set$ ls | wc
  17770   17770  195470
·······@valad:~/netflix/converted_set$ time ls | xargs cat > /dev/null

real    3m39.811s
user    0m0.580s
sys     0m10.881s
·······@valad:~/netflix/converted_set$ time ls | xargs head > /dev/null

real    2m21.596s
user    0m0.472s
sys     0m9.737s

One can increase the speed a bit by splitting the data into smaller
directories, but I only got a 10 sec speedup on this test when I
split the data into 10 subdirectories.

I think we kind of agree, actually. For this task a "real" database
is not needed. I myself am more comfortable with a bunch of smaller
files, for easier manipulation with unix file tools, but I haven't
looked closely at this problem, so maybe the added speed of a
single-file approach is needed.
-- 
 -asbjxrn
From: grackle
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160765255.229615.134380@k70g2000cwa.googlegroups.com>
Asbjørn Bjørnstad wrote:
> if a flat-file/filesystem solution is good enough, and ACID is
> irrelevant (as in this case), why bother with the database?

The best reason to use a database is so you don't have to rewrite the
parser for every language or tool you want to use.  It's very easy to
import database tables into R (r-project.org), for example.  

-David
From: John Thingstad
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <op.thdjg8nfpqzri1@pandora.upc.no>
On Fri, 13 Oct 2006 16:50:58 +0200, <········@gmail.com> wrote:

> Asbjørn Bjørnstad wrote:
>> Another approach is as a friend of mine likes to say:
>> "The filesystem is an exellent database."
>
> Actually, the filesystem is a very poor database if you care about
> speed (which you always do for large amounts of data). It's very slow
> to walk directories and constantly open and close files. The method you
> suggest would be hundreds of times slower than the single file +
> file-position indexing method that has been developed in this thread.
> (In fact, this is how the data starts out when you get it: as a set of
> directories with tens of thousands of small files; much of the time
> wasted in the initial compression phase is just opening and closing
> those files!)
>

I think I would use MySQL with ISAM tables for this.
It's fast and relatively trouble-free.
It compromises ACID for speed, but that is just an advantage here.


-- 
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
From: ········@gmail.com
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160586973.794328.295340@i3g2000cwc.googlegroups.com>
ps. on this...

K Livingston wrote:
> Why all the calls to subseq?  and calling position and then subseq and
> then presumably calling read (or read-from-string) on that...

Wrong presumption. I never read-from-string the numbers. I directly
lisp-i-form them into another file which is later read by the normal
read mechanism. (This may well internally use read-from-string, but it
would have to be done anyway upon re-read, unless you wanted to do some
sort of fancy encoding.) So on a balanced comparison my solution is at
least 30% faster than yours.
From: grackle
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160511435.880225.88120@m73g2000cwd.googlegroups.com>
remixer wrote:
> I should warn you, though, that
> someone has already achieved the netflix milestone, and the competition
> is over for all practical purposes, so the chances of you making
> anything are almost zero.

The contest is still wide open!  Three teams have improved on Netflix'
own technology, but the top team on the leaderboard has only achieved a
1.06% improvement.  The leaders are probably experts using mature
technology, so I'm guessing their scores will improve as they tweak
their algorithms for the Netflix data and then quickly plateau.

-David
From: ········@gmail.com
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160512960.785123.212030@i3g2000cwc.googlegroups.com>
> The contest is still wide open!  Three teams have improved on Netflix'
> own technology, but the top team on the leaderboard has only achieved a
> 1.06% improvement.  The leaders are probably experts using mature
> technology, so I'm guessing their scores will improve as they tweak
> their algorithms for the Netflix data and then quickly plateau.

I think that what one would need in order to do much better is to have
a more detailed description of the movies. Maybe this could be
screen-scraped from IMDB? (That would take a while, but might be worth
it.)
From: Pascal Bourguignon
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <878xjng881.fsf@thalassa.informatimago.com>
········@gmail.com writes:

>> The contest is still wide open!  Three teams have improved on Netflix'
>> own technology, but the top team on the leaderboard has only achieved a
>> 1.06% improvement.  The leaders are probably experts using mature
>> technology, so I'm guessing their scores will improve as they tweak
>> their algorithms for the Netflix data and then quickly plateau.
>
> I think that what one would need in order to do much better is to have
> a more detailed description of the movies. Maybe this could be
> screen-scraped from IMDB? (That would take a while, but might be worth
> it.)

IMDB databases are available in machine readable form.


-- 
__Pascal Bourguignon__                     http://www.informatimago.com/
You never feed me.
Perhaps I'll sleep on your face.
That will sure show you.
From: Sidney Markowitz
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <452c1cc6$0$96156$742ec2ed@news.sonic.net>
Pascal Bourguignon wrote, On 11/10/06 10:40 AM:
> IMDB databases are available in machine readable form.

http://us.imdb.com/Licensing/FAQ.html says

Q: How much does it cost?
A: Send pricing inquiries here [link omitted]

http://www.netflixprize.com/rules says

An eligible algorithm [...] must not require any third-party software or
licenses, payment on the part of Netflix


-- 
    Sidney Markowitz
    http://www.sidney.com
From: ········@gmail.com
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160522946.938638.63220@m73g2000cwd.googlegroups.com>
>...
> An eligible algorithm [...] must not require any third-party software or
> licenses, payment on the part of Netflix

This rule is obviously ridiculous. If we managed to improve the
algorithm by 1000% for a cost of, say, 5% of the improved profits (just
as an example), they'd be perfectly happy to pay for it. Indeed, it
would be a perfectly fine business proposition to pay any amount (< n
100)% of the profits attributable to the algorithm in order to get the
algorithm, since they'd still make the additional (- 100 n)%. Saying
that you aren't going to consider licensing is not business, it's ...
some sort of stupid misunderstanding of Open Sourceism, and it's
ridiculous.
From: ········@gmail.com
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160523245.045447.118730@i3g2000cwc.googlegroups.com>
I wonder why NetFlix doesn't give us this data to begin with? Obviously
they have it! Maybe THEY also use the IMDB database and so can't make
it public, in which case, the challenge is ridiculous since we don't
have the same information they do.
From: Ken Tilton
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <un%Wg.811$jj1.301@newsfe08.lga>
Sidney Markowitz wrote:
> Pascal Bourguignon wrote, On 11/10/06 10:40 AM:
> 
>>IMDB databases are available in machine readable form.
> 
> 
> http://us.imdb.com/Licensing/FAQ.html says
> 
> Q: How much does it cost?
> A: Send pricing inquiries here [link omitted]
> 
> http://www.netflixprize.com/rules says
> 
> An eligible algorithm [...] must not require any third-party software or
> licenses, payment on the part of Netflix
> 
> 

Brilliant! Dummies like Pascal make submissions that (a) solve the 
problem, (b) violate the rules, and (c) are obediently licensed 
openly. Netflix then (a) refuses to pay the million, (b) pays the 
negligible cost of IMDB or whatever, and (c) uses the now irrevocably 
open-licensed submission.

Perfect. Capitalism figures out how to abuse free (pwuahhaahaha) software.

Was I the only one wondering why Netflix had the open licensing 
condition on submissions? I ain't wondering any more.

PWUAAAAHAHAHAHAHAHHAHAA!

kenny (hoping RMS is hard at work on a submission)

-- 
Cells: http://common-lisp.net/project/cells/

"I'll say I'm losing my grip, and it feels terrific."
    -- Smiling husband to scowling wife, New Yorker cartoon
From: ········@gmail.com
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160587460.685135.205700@k70g2000cwa.googlegroups.com>
> kenny (hoping RMS is hard at work on a submission)

For the record -- and we've been down this path before -- it is a
common misunderstanding (on the part of nearly everyone) that open
source software should cost nothing.

From: "The Free Software Definition"
          http://www.gnu.org/philosophy/free-sw.html

"Free software is a matter of liberty, not price. To understand the
concept, you should think of free as in free speech, not as in free
beer. "

So, much as I agree with you (Kenny) that the open source movement is
painful and confused, one shouldn't blame RMS for something he didn't
say. It's NetFlix that is confused about how markets work, not RMS.
(RMS may well be confused as well, but not in this particular way.)
From: Ken Tilton
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <pPhXg.2853$lD3.274@newsfe12.lga>
········@gmail.com wrote:
>>kenny (hoping RMS is hard at work on a submission)
> 
> 
> For the record -- and we've been down this path before -- it is a
> common misunderstanding (on the part of nearly everyone) that open
> source software should cost nothing.
> 
> From: "The Free Software Definition"
>           http://www.gnu.org/philosophy/free-sw.html
> 
> "Free software is a matter of liberty, not price. To understand the
> concept, you should think of free as in free speech, not as in free
> beer. "
> 
> So, much as I agree with you (Kenny) that the open source movement is
> painful and  confused one shouldn't blame RMS for something he didn't
> say.

I refer you to the GNU Manifesto, where RMS concedes that yes, 
programmers will make less money (since all they can charge for is 
hourly support or consulting).

Sure, I am free to use GPL software and then (a) make my work freely 
available under the same license and (b) charge $1000 for it. I just 
might not get many people paying me for what they can download for free. 
(Which is why RMS conceded programmers might get paid no more than 
window washers or something.)

kt

-- 
Cells: http://common-lisp.net/project/cells/

"I'll say I'm losing my grip, and it feels terrific."
    -- Smiling husband to scowling wife, New Yorker cartoon
From: Pascal Bourguignon
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <87odsicsow.fsf@thalassa.informatimago.com>
Ken Tilton <·········@gmail.com> writes:

> Sidney Markowitz wrote:
>> Pascal Bourguignon wrote, On 11/10/06 10:40 AM:
>> 
>>>IMDB databases are available in machine readable form.
>> http://us.imdb.com/Licensing/FAQ.html says
>> Q: How much does it cost?
>> A: Send pricing inquiries here [link omitted]
>> http://www.netflixprize.com/rules says
>> An eligible algorithm [...] must not require any third-party
>> software or
>> licenses, payment on the part of Netflix
>> 
>
> Brilliant! Dummies like Pascal make submissions that (a) solve the
> problem and (b) violate the rules and (c) obediently are licensed
> openly. Netflix then (a) refuses to pay the million and (b) pays the
> negligible cost of IMDB or whatever and (c) uses the now irrevocably
> open-licensed submission.
>
> Perfect. Capitalism figures out how to abuse free (pwuahhaahaha) software.
>
> Was I the only one wondering why Netflix had the open licensing
> condition on submissions? I ain't wondering any more.
>
> PWUAAAAHAHAHAHAHAHHAHAA!


Well, if I had spent thousands of dollars to get a lisp implementation
while at the same time witnessing the others having fun with free lisp
implementations, I guess I'd be as despised as Kenny.


-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

"Our users will know fear and cower before our software! Ship it!
Ship it and let them flee like the dogs they are!"
From: Pascal Bourguignon
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <87mz83eoax.fsf@thalassa.informatimago.com>
Sidney Markowitz <······@sidney.com> writes:

> Pascal Bourguignon wrote, On 11/10/06 10:40 AM:
>> IMDB databases are available in machine readable form.
>
> http://us.imdb.com/Licensing/FAQ.html says

Wrong domain. Try imdb.org <<<

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

HANDLE WITH EXTREME CARE: This product contains minute electrically
charged particles moving at velocities in excess of five hundred
million miles per hour.
From: grackle
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160528710.977240.21130@m73g2000cwd.googlegroups.com>
Pascal Bourguignon wrote:
> Wrong domain. Try imdb.org <<<

imdb.org redirects me to imdb.com, where I can't find anything freely
available.  They even forbid bots without written permission.  Is there
an open project with a similar name, perhaps?

-David
From: Pascal Bourguignon
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <87bqojej33.fsf@thalassa.informatimago.com>
"grackle" <···········@gmail.com> writes:

> Pascal Bourguignon wrote:
>> Wrong domain. Try imdb.org <<<
>
> imdb.org redirects me to imdb.com, where I can't find anything freely
> available.  They even forbid bots without written permission.  Is there
> an open project with a similar name, perhaps?

Sorry, I remembered a time when they were different.  Anyways, you can
always download the files from here:

http://www.imdb.com/interfaces

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/
-----BEGIN GEEK CODE BLOCK-----
Version: 3.12
GCS d? s++:++ a+ C+++ UL++++ P--- L+++ E+++ W++ N+++ o-- K- w--- 
O- M++ V PS PE++ Y++ PGP t+ 5+ X++ R !tv b+++ DI++++ D++ 
G e+++ h+ r-- z? 
------END GEEK CODE BLOCK------
From: remixer
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160521069.828419.167970@b28g2000cwb.googlegroups.com>
grackle wrote:
> remixer wrote:
> > I should warn you, though, that
> > someone has already achieved the netflix milestone, and the competition
> > is over for all practical purposes, so the chances of you making
> > anything are almost zero.
>
> The contest is still wide open!  Three teams have improved on Netflix'
> own technology, but the top team on the leaderboard has only achieved a
> 1.06% improvement.  The leaders are probably experts using mature
> technology, so I'm guessing their scores will improve as they tweak
> their algorithms for the Netflix data and then quickly plateau.

So ·····@gmail, you are on for the 10% cut!
From: ········@gmail.com
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <1160522594.923983.88010@i3g2000cwc.googlegroups.com>
remixer wrote:
>...
> So ·····@gmail, you are on for 10% cut!

That's not my complete email address, but we can work that out in 2011
:-)
From: Sidney Markowitz
Subject: Re: massive data analysis with lisp
Date: 
Message-ID: <452c1a17$0$96153$742ec2ed@news.sonic.net>
remixer wrote, On 11/10/06 7:27 AM:
> I should warn you, though, that someone has already
> achieved the netflix milestone, and the competition
> is over for all practical purposes

I think you are reading the Leaderboard wrong
(http://www.netflixprize.com/leaderboard). The top bar lists the
qualifying score for the Grand Prize, 0.8563 (a 10% improvement). The
next line is the current leading team, who scored 0.9413 (a 1.06%
improvement). The next bar lists the qualifying score for this year's
Progress Prize, which is 0.9419 (a 1% improvement). So all that board
indicates is that one team is already in the running for the first
year's Progress Prize, and that only two other teams have achieved any
results better than the netflix baseline (but less than 1% better).
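
For the record, the arithmetic behind those percentages: improvement is
measured relative to the Cinematch baseline RMSE (0.9514; that figure
comes from the Netflix Prize rules, not from this thread).

(defun improvement (rmse &optional (baseline 0.9514))
  ;; Percent improvement in RMSE over the Cinematch baseline.
  (* 100 (/ (- baseline rmse) baseline)))

;; (improvement 0.8563) => ~10.0  ; Grand Prize threshold
;; (improvement 0.9413) => ~1.06  ; current leader
;; (improvement 0.9419) => ~1.0   ; Progress Prize threshold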

-- 
    Sidney Markowitz
    http://www.sidney.com