From: Bruce L. Lambert
Subject: array-total-size-limit
Date: 
Message-ID: <s7suhcbqoj862@corp.supernews.com>
In ACL 5.0, array-total-size-limit = 2^24 = 16777216. What can I do if I
want to store more than 16777216 values? My application is document
clustering, and I am trying to store an n x n interdocument similarity
matrix. As far as I can tell, I cannot do this when n > 4096. This seems
unreasonable. Am I missing something?

-bruce

From: Bruce L. Lambert
Subject: Re: array-total-size-limit
Date: 
Message-ID: <85o7eo$sos$2@newsx.cc.uic.edu>
Bruce L. Lambert <········@uic.edu> wrote in message
··················@corp.supernews.com...
> In ACL 5.0, array-total-size-limit = 2^24 = 16777216. What can I do if I
> want to store more than 16777216 values? My application is document
> clustering, and I am trying to store an n x n interdocument similarity
> matrix. As far as I can tell, I cannot do this when n > 4096. This seems
> unreasonable. Am I missing something?

Barry Margolin's strategy will certainly work, but with Bill Brodie's
(·······@panix.com) help, I found another approach. Below is the approach I
took. I have not thoroughly tested it for efficiency, but it appears to work
as one might suspect. Remember I am trying to store a large sparse matrix
without storing all the zero values. If I do store all the zeros I get into
the array-total-size-limit problem. I welcome your comments and
(constructive) criticisms.

-bruce

(in-package :user)

;;; *row-index* is a vector of row numbers ((integer 0 65535) elements).
(defparameter *row-index*
  (make-array 0
              :adjustable t
              :fill-pointer 0
              :element-type '(integer 0 65535)))

;;; *col-index* is a vector of vectors of column numbers,
;;; one inner vector per row recorded in *row-index*.
(defparameter *col-index*
  (make-array 0
              :adjustable t
              :fill-pointer 0
              :element-type '(array (integer 0 65535))))

;;; *contents* is a vector of vectors of values, parallel to *col-index*.
(defparameter *contents*
  (make-array 0
              :adjustable t
              :fill-pointer 0
              :element-type '(array (integer 0 65535))))

(defun brodie-aref (i j)
  "Accesses an entry from a compressed sparse matrix
data structure as designed by Bill Brodie ·······@panix.com."
  (let* ((i-index (position i *row-index*))
         (j-index (position j (aref *col-index* i-index))))
    (aref (aref *contents* i-index) j-index)))

(defun brodie-setf (i j val)
  "Places a new value in a Brodie-style sparse matrix data structure.
Assumes each (i j) pair is stored at most once."
  (let ((i-index (position i *row-index*)))
    (if (null i-index)
        ;; There's no entry in *row-index* for row i yet.
        (progn
          ;; Add i to *row-index*.
          (vector-push-extend i *row-index*)
          ;; Push j onto a new vector in *col-index*.
          (vector-push-extend
           (make-array 1
                       :adjustable t
                       :fill-pointer 1
                       :element-type '(integer 0 65535)
                       :initial-element j)
           *col-index*)
          ;; Push val onto a new vector in *contents*.
          (vector-push-extend
           (make-array 1
                       :adjustable t
                       :fill-pointer 1
                       :element-type '(integer 0 65535)
                       :initial-element val)
           *contents*))
        ;; Row i already exists: append j and val to its vectors.
        (progn
          (vector-push-extend j (aref *col-index* i-index))
          (vector-push-extend val (aref *contents* i-index))))))
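
For concreteness, a sketch of how these are meant to be used (assuming the
three vectors above start out empty; note that values, like indices, have to
fit in (integer 0 65535)):

(brodie-setf 5 17 42)   ; first entry for row 5
(brodie-setf 5 300 7)   ; another entry in the same row
(brodie-aref 5 17)      ; => 42
(brodie-aref 5 300)     ; => 7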
From: Barry Margolin
Subject: Re: array-total-size-limit
Date: 
Message-ID: <KJRf4.96$%%2.872@burlma1-snr2>
In article <············@newsx.cc.uic.edu>,
Bruce L. Lambert <········@uic.edu> wrote:
>Barry Margolin's strategy will certainly work, but with Bill Brodie's
>(·······@panix.com) help, I found another approach. Below is the approach I
>took. I have not thoroughly tested it for efficiency, but it appears to work
>as one might suspect. Remember I am trying to store a large sparse matrix
>without storing all the zero values.

If I had known that, I wouldn't have suggested my strategy; I actually
thought of suggesting that you use sparse matrix technology, but assumed
you would have said that your data was sparse if that was significant.  But
I couldn't "remember" it because you never said it.  Is this implicit in
the idea of a "similarity matrix", something I'm not really familiar with?

-- 
Barry Margolin, ······@bbnplanet.com
GTE Internetworking, Powered by BBN, Burlington, MA
*** DON'T SEND TECHNICAL QUESTIONS DIRECTLY TO ME, post them to newsgroups.
Please DON'T copy followups to me -- I'll assume it wasn't posted to the group.
From: ········@my-deja.com
Subject: Re: array-total-size-limit
Date: 
Message-ID: <85put1$o8g$1@nnrp1.deja.com>
In article <················@burlma1-snr2>,
  Barry Margolin <······@bbnplanet.com> wrote:
> In article <············@newsx.cc.uic.edu>,
> Bruce L. Lambert <········@uic.edu> wrote:
> >Barry Margolin's strategy will certainly work, but with Bill Brodie's
> >(·······@panix.com) help, I found another approach. Below is the approach I
> >took. I have not thoroughly tested it for efficiency, but it appears to work
> >as one might suspect. Remember I am trying to store a large sparse matrix
> >without storing all the zero values.
>
> If I had known that, I wouldn't have suggested my strategy; I actually
> thought of suggesting that you use sparse matrix technology, but assumed
> you would have said that your data was sparse if that was significant.  But
> I couldn't "remember" it because you never said it.  Is this implicit in
> the idea of a "similarity matrix", something I'm not really familiar with?
>
> --

An obvious point here -- similarity relations are normally symmetric -- and
on-axis have values of 1: (sym i i) = 1.

So you only want to store less than half the matrix anyway, which would tend
to favour Barry Margolin's approach.


Sent via Deja.com http://www.deja.com/
Before you buy.
From: Bruce L. Lambert
Subject: Re: array-total-size-limit
Date: 
Message-ID: <s81njkseeh633@corp.supernews.com>
Barry Margolin <······@bbnplanet.com> wrote in message
·····················@burlma1-snr2...
> In article <············@newsx.cc.uic.edu>,
> Bruce L. Lambert <········@uic.edu> wrote:
> >Barry Margolin's strategy will certainly work, but with Bill Brodie's
> >(·······@panix.com) help, I found another approach. Below is the approach I
> >took. I have not thoroughly tested it for efficiency, but it appears to work
> >as one might suspect. Remember I am trying to store a large sparse matrix
> >without storing all the zero values.
>
> If I had known that, I wouldn't have suggested my strategy; I actually
> thought of suggesting that you use sparse matrix technology, but assumed
> you would have said that your data was sparse if that was significant.  But
> I couldn't "remember" it because you never said it.  Is this implicit in
> the idea of a "similarity matrix", something I'm not really familiar with?

Sorry Barry. I should have been more explicit about my problem in my initial
post. Not all similarity matrices are sparse, but in information retrieval
(IR) applications, where the units being compared are 'documents' at some
level of abstraction, interdocument similarity matrices are almost always
sparse. In the domain I'm working in (document clustering), where the
'documents' are very short, roughly 70% of the values are zero. Most texts
on IR give methods for computing sim matrices without computing all the
zeros, but very few discuss how to store these big sparse matrices.

The approach Bill Brodie suggested to me was a revelation when I finally
understood how it worked. In some initial comparisons, it is giving me a
very significant space-efficiency boost compared to methods that store all
the zeros. I'll post some specific numbers in a day or so.

As for sim matrices being symmetric, this is a fact I'm well aware of, and
hence I only represent the nonzero elements of the N(N-1)/2 unique sim
values from an N x N matrix. In effect, I only store the 'lower left' half
of the matrix, excluding the main diagonal of self-similarities. When
accessing an element {i, j} from a matrix stored in this way, you just have
to use (max i j) as the row index and (min i j) as the column index. As
usual, there is a time penalty to pay in order to get the space savings.
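
(In code, that convention amounts to something like the following sketch --
the wrapper name is just for illustration, and brodie-aref is the accessor
from my earlier post:)

(defun lower-half-aref (i j)
  ;; Only the lower-left half is stored, so use the larger index as the
  ;; row and the smaller as the column.  The diagonal is not stored.
  (brodie-aref (max i j) (min i j)))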

Barry, if you know of some other sparse matrix techniques, I'd still
appreciate any pointers, just as I appreciate your other frequent
contributions to this group.

Thanks.

-Bruce
From: Bruce L. Lambert
Subject: Re: array-total-size-limit
Date: 
Message-ID: <s84739uneh684@corp.supernews.com>
My sparse matrix approach (described in my message of 1/14) is, perhaps not
surprisingly, much more compact but also much slower. My function
'brodie-aref', which I now call 'sparse-matrix-aref', calls position on the
row-index vector in order to find the right column-index vector and contents
vector; then it calls position on the column-index vector in order to find
the index in contents where the appropriate value resides. These calls to
position are slowing things down tremendously! Again, I probably shouldn't
be surprised. (So how slow is position? Is it O(n) for a sequence of length
n?)

I am, however, looking for a better, faster way. I thought of using
hash-tables to represent the row-indices. Each entry would have the
row-index as its key and the position as its value. Same thing for the
column indices. I'll have to give back some space to gain some speed, but
maybe this is a happy medium. The other alternative (which Bill Brodie
suggested in his original post way back when) was to store the row-indices
and column indices in binary search trees.
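
Roughly, the lookup side of the hash-table idea would look like this (an
untested sketch; the names are made up -- *row-positions* maps a row number
to its position in *col-index* and *contents*, and *col-positions* is a
vector holding one such table per stored row):

(defun sparse-matrix-aref-ht (i j)
  (let ((i-index (gethash i *row-positions*)))
    (when i-index
      (let ((j-index (gethash j (aref *col-positions* i-index))))
        (when j-index
          (aref (aref *contents* i-index) j-index))))))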

Any suggestions are appreciated.


-bruce

Bruce L. Lambert <········@uic.edu> wrote in message
·················@newsx.cc.uic.edu...
>
> Bruce L. Lambert <········@uic.edu> wrote in message
> ··················@corp.supernews.com...
> > In ACL 5.0, array-total-size-limit = 2^24 = 16777216. What can I do if I
> > want to store more than 16777216 values? My application is document
> > clustering, and I am trying to store an n x n interdocument similarity
> > matrix. As far as I can tell, I cannot do this when n > 4096. This seems
> > unreasonable. Am I missing something?
>
> Barry Margolin's strategy will certainly work, but with Bill Brodie's
> (·······@panix.com) help, I found another approach. Below is the approach I
> took. I have not thoroughly tested it for efficiency, but it appears to work
> as one might suspect. Remember I am trying to store a large sparse matrix
> without storing all the zero values. If I do store all the zeros I get into
> the array-total-size-limit problem. I welcome your comments and
> (constructive) criticisms.
>
> -bruce
From: Robert Monfera
Subject: Re: array-total-size-limit
Date: 
Message-ID: <3882245F.DCA57C18@fisec.com>
"Bruce L. Lambert" wrote:

> I am, however, looking for a better, faster way.

Yes, binary (logarithmic) search, at O(log2 n), would be tremendously faster
than linear search at O(n).  Hashing would also work; it should be even
faster (constant search time).  If you have enough memory, you can just
create a vector of size n which would contain vectors starting with size
(1- n) and ending with size 1.

Also, if the elements are integers that fit in 1 byte, you can use
:element-type '(unsigned-byte 8) - some implementations will make the
array 1/4 the size of a fixnum array.
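
For instance (an untested sketch storing only the lower-left half, indexed
with the (max i j)/(min i j) convention you described, and using 1-byte
elements):

(defun make-triangular-matrix (n)
  ;; Row i holds i elements, so entry (i j) with i > j lives at
  ;; (aref (aref matrix i) j).  The diagonal is not stored.
  (let ((rows (make-array n)))
    (dotimes (i n rows)
      (setf (aref rows i)
            (make-array i :element-type '(unsigned-byte 8)
                          :initial-element 0)))))

(defun triangular-aref (matrix i j)
  (aref (aref matrix (max i j)) (min i j)))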

If you lose speed because you miss values very often, you may consider an
array of bits just to record whether there is an element for a given pair of
documents in your sparse array - the size of this bit array will be very
small if your implementation is clever enough to store 1 bit in 1 bit.
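
For example (a sketch, laid out as a triangle like the value storage so that
no single bit array gets too big):

(defun make-presence-mask (n)
  (let ((rows (make-array n)))
    (dotimes (i n rows)
      (setf (aref rows i)
            (make-array i :element-type 'bit :initial-element 0)))))

;; (aref (aref mask (max i j)) (min i j)) is 1 iff a value is stored there.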

Maybe you have some special knowledge on the distribution of the data -
there may be documents or ranges where the matrix is locally dense,
accounting for much of the total number of elements.  In this case you
can split your representation into a dense and a sparse part.  If you
don't have dense ranges readily available, maybe you can sort the
documents so that such ranges are likely to form.

If you know what the spatial distribution of the elements will be (e.g.,
uniform or bell-shaped), you may construct a custom hashing function
yourself.  In all cases, the point is to make use of the special
knowledge you have about the data and the way it will be accessed.

Regards
Robert
From: Bernhard Pfahringer
Subject: Re: array-total-size-limit
Date: 
Message-ID: <85tdt2$4qsa$1@www.univie.ac.at>
In article <·············@corp.supernews.com>,
Bruce L. Lambert <········@uic.edu> wrote:
>
>I am, however, looking for a better, faster way. I thought of using
>hash-tables to represent the row-indices. Each entry would have the
>row-index as its key and the position as its value. Same thing for the
>column indices. I'll have to give back some space to gain some speed, but
>maybe this is a happy medium. The other alternative (which Bill Brodie
>suggested in his original post ay back when) was to store the row-indices
>and column indices in binary search trees.
>
>Any suggestions are appreciated.
>

You could try the following (sorry, undocumented) code.
Careful: there is no error checking; it could/should be added.

;;;
;;; (C) Bernhard Pfahringer 16-Jan-2000.
;;; This code is protected by the following license:
;;; LGPL, see http://www.gnu.org/copyleft/lesser.html for details.
;;;

(defstruct sparse-2d-array
  (dimensions nil)	    ;;; for out-of-range checking purposes
  (y-offset 0)
  (contents nil))


(defun my-make-sparse-2d-array (x-dim y-dim)
  (let ((x-offset (integer-length (1- x-dim)))
	(y-offset (integer-length (1- y-dim))))
    (make-sparse-2d-array :dimensions (list x-dim y-dim)
			  :y-offset y-offset
			  :contents (make-hash-table
				     :test
				     (if (typep (ash 1 (+ x-offset y-offset))
						'fixnum)
					 #'eq
					 #'eql)))))


(defun sparse-2d-aref (array x y)
  (let ((index (+ (ash x (sparse-2d-array-y-offset array))
		  y)))
    (gethash index
	     (sparse-2d-array-contents array)
	     0)))


(defun (setf sparse-2d-aref) (value array x y)
  (let ((index (+ (ash x (sparse-2d-array-y-offset array))
		  y)))
    (if (zerop value)
	(remhash index (sparse-2d-array-contents array))
	(setf (gethash index (sparse-2d-array-contents array)) value)))
  value)
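
For example, usage looks roughly like this (dimensions and values chosen
arbitrarily):

(let ((m (my-make-sparse-2d-array 100000 100000)))
  (setf (sparse-2d-aref m 70000 12345) 42)
  (list (sparse-2d-aref m 70000 12345)    ; => 42
        (sparse-2d-aref m 3 4)))          ; => 0, the default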

-- 
--------------------------------------------------------------------------
Bernhard Pfahringer, OeFAI           http://www.ai.univie.ac.at/~bernhard/
--------------------------------------------------------------------------
  iSteve: i is iCeo in iApple. You'll be iified. iResistance is inVain.
From: Barry Margolin
Subject: Re: array-total-size-limit
Date: 
Message-ID: <_kzf4.43$%%2.523@burlma1-snr2>
In article <·············@corp.supernews.com>,
Bruce L. Lambert <········@uic.edu> wrote:
>In ACL 5.0, array-total-size-limit = 2^24 = 16777216. What can I do if I
>want to store more than 16777216 values? My application is document
>clustering, and I am trying to store an n x n interdocument similarity
>matrix. As far as I can tell, I cannot do this when n > 4096. This seems
>unreasonable. Am I missing something?

Use an array of arrays.  So instead of (aref matrix x y) you would do (aref
(aref matrix x) y).  Performance won't be as good, but it will be better
than the performance of a program that doesn't run at all.
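
A sketch of what I mean (the element type is just for illustration):

(defun make-big-matrix (n)
  ;; A vector of N row vectors; no single array holds more than N elements,
  ;; so only N itself has to stay under ARRAY-TOTAL-SIZE-LIMIT.
  (let ((m (make-array n)))
    (dotimes (i n m)
      (setf (aref m i)
            (make-array n :element-type 'single-float
                          :initial-element 0.0)))))

(defun big-aref (m x y)
  (aref (aref m x) y))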

-- 
Barry Margolin, ······@bbnplanet.com
GTE Internetworking, Powered by BBN, Burlington, MA
*** DON'T SEND TECHNICAL QUESTIONS DIRECTLY TO ME, post them to newsgroups.
Please DON'T copy followups to me -- I'll assume it wasn't posted to the group.
From: Robert Monfera
Subject: Re: array-total-size-limit
Date: 
Message-ID: <3880B040.E9848A44@fisec.com>
"Bruce L. Lambert" wrote:
>
> In ACL 5.0, array-total-size-limit = 2^24 = 16777216. What can I do if I
> want to store more than 16777216 values? My application is document
> clustering, and I am trying to store an n x n interdocument similarity
> matrix. As far as I can tell, I cannot do this when n > 4096. This seems
> unreasonable. Am I missing something?

There were many constructive ideas for workarounds (like vectors of
vectors), but the reasons for the limitation were not addressed.  There
could be a number of reasons:

1. It is desirable for implementors and users that cells are addressable
in the fixnum range, which already limits the total size.

2. On some platforms, there may be memory allocation limits that do not
allow the allocation of more than 64MB.

The first one may be a relatively hard limit; both (possible)
constraints can be worked around, by implementors or by users, by
clustering smaller arrays or vectors.  It would be interesting
to know if there are other possible limits I am not aware of.

Regards
Robert
From: Michael L. Harper
Subject: Re: array-total-size-limit
Date: 
Message-ID: <3881515D.2E375055@alcoa.com>
I do not have ACL 5 documentation handy but you can build a new ACL base
image and increase this value in the process. Look under the doc section on
building images; there is a parameter that controls this.

Mike

"Bruce L. Lambert" wrote:

> In ACL 5.0, array-total-size-limit = 2^24 = 16777216. What can I do if I
> want to store more than 16777216 values? My application is document
> clustering, and I am trying to store an n x n interdocument similarity
> matrix. As far as I can tell, I cannot do this when n > 4096. This seems
> unreasonable. Am I missing something?
>
> -bruce
From: Robert Monfera
Subject: Re: array-total-size-limit
Date: 
Message-ID: <38816625.9B8D50F0@fisec.com>
Array size limits and heap limit are conceptually unrelated.  It is
advisable, however, to rebuild the ACL image with a bigger heap and
oldspace size to accommodate the big array(s), and to use :allocation
:old in make-array.
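
Something like this (ACL-specific keyword; n stands for whatever dimension
the rebuilt image allows):

(make-array (list n n)
            :element-type 'single-float
            :allocation :old)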

Robert

"Michael L. Harper" wrote:
>
> I do not have ACL 5 documentation handy but you can build a new ACL base
> image and increase this value in the process. Look under the doc section on
> building images; there is a parameter that controls this.
>
> Mike