How to represent a LARGE sparse matrix??

From: Bruce L. Lambert
Subject: How to represent a LARGE sparse matrix??
Date: Sat, 22 Feb 1997 00:00:00 +0000
Message-ID: <lambertb-2202971939540001@207.70.66.82>

Hi,

I have a need to represent a large sparse matrix of fixnums. At present I
use a hash-table to represent the matrix. Each key to the hash-table is a
dotted pair whose car is the row index and whose cdr is the column index.
This is okay for some of my experiments, but when the tasks grow larger,
and I need to represent millions of values, I get the dreaded "no more
room for lisp objects" error. (I'm using a 1995 release of CLISP by the
way, on a DEC 5000 running Ultrix 4.3).

How can I represent these large sparse matrices more space-efficiently.
Can I use some sort of tree structure? It's times like this where my lack
of formal training in computer science really makes my life difficult. Any
suggestions would be helpful.

-bruce

-- 
Bruce L. Lambert, Ph.D.
Dept. of Pharmacy Administration (M/C 871)
University of Illinois at Chicago 
833 S. Wood St., Rm. 241
Chicago, IL  60612-7231

http://ludwig.pmad.uic.edu/~bruce/

Re: How to represent a LARGE sparse matrix?? John Atwood
- Re: How to represent a LARGE sparse matrix?? Bill Brodie
- Re: How to represent a LARGE sparse matrix?? Tim Bradshaw
  - Re: How to represent a LARGE sparse matrix?? Bruce L. Lambert
Re: How to represent a LARGE sparse matrix?? Marco Antoniotti
Re: How to represent a LARGE sparse matrix?? Mark McConnell
Re: How to represent a LARGE sparse matrix?? Thomas A. Russ
Re: How to represent a LARGE sparse matrix?? Marco Antoniotti

From: John Atwood
Subject: Re: How to represent a LARGE sparse matrix??
Date: Sat, 01 Mar 1997 00:00:00 +0000
Message-ID: <5f852o$2f$1@news.nero.net>

>I have a need to represent a large sparse matrix of fixnums. At present I
>use a hash-table to represent the matrix. Each key to the hash-table is a
>dotted pair whose car is the row index and whose cdr is the column index.
>This is okay for some of my experiments, but when the tasks grow larger,
>and I need to represent millions of values, I get the dreaded "no more
>room for lisp objects" error. (I'm using a 1995 release of CLISP by the
>way, on a DEC 5000 running Ultrix 4.3).
>
>How can I represent these large sparse matrices more space-efficiently.
>Can I use some sort of tree structure? It's times like this where my lack
>of formal training in computer science really makes my life difficult. Any
>suggestions would be helpful.

Since no one else has answered, I'll give it a whack.  A tree structure
doesn't sound like the way to go, unless your data occurs in clumps, and
you use some kind of hamming code-like scheme.  Your approach sounds
pretty efficient to me.  How much RAM and swapspace do you have?  you
might just need to bump that up.  You'll probably get a better response
on the CLISP mailing list; you can find out where to subscribe in the
LISP FAQ:
http://www.cis.ohio-state.edu/hypertext/faq/usenet/lisp-faq/part4/faq.html
You might also look at the archives of that list and see if getting a
later version would help.

In the meantime, you might save some space by making your hash key a
symbol rather than a cons:  (make-symbol (format nil "~a,~a" num1 num2)) 
;though this seems pretty inefficient time-wise, there's probably a
better way to get a symbol from 2 numbers. Do you really need bignums? 
For the matrix indices as well as the values? When you're dealing with
millions of entries (and, if it's sparse, the size of the matrix must be
tens of millions), a byte or a word saved per entry can really add up. 
So if you could find an unboxed representation (saving the pointer
indirection) that might help a lot.  Look at the variables, 
> most-positive-fixnum
536870911
> most-negative-fixnum
-536870912

If you can guarantee that your data, either the keys or values, will be
within that range, you might use fixnums rather than bignums.

P.S. I don't know the DEC 5000, is it an Alpha? i.e. 64 bit? 
P.S.S. I also don't know CLISP, and your soln is imple. dependent, but
this'll give you something to read over the weekend.

John



-- 
--Office phone: 541-737-5583 (Batcheller 349) home: 757-8772
  Office mail:  303 Dearborn Hall, OSU, Corvallis, OR  97331
--Frames are evil; see why: http://wwwvoice.com/hatefrm.html

From: Bill Brodie
Subject: Re: How to represent a LARGE sparse matrix??
Date: Sat, 01 Mar 1997 00:00:00 +0000
Message-ID: <5farvg$1hn@panix.com>

John Atwood (·······@ice.CS.ORST.EDU) wrote:
: >I have a need to represent a large sparse matrix of fixnums. At present I
: >use a hash-table to represent the matrix. Each key to the hash-table is a
: >dotted pair whose car is the row index and whose cdr is the column index.
: >This is okay for some of my experiments, but when the tasks grow larger,
: >and I need to represent millions of values, I get the dreaded "no more
: >room for lisp objects" error. (I'm using a 1995 release of CLISP by the
: >way, on a DEC 5000 running Ultrix 4.3).

: ... You might save some space by making your hash key a
: symbol rather than a cons:  (make-symbol (format nil "~a,~a" num1 num2)) 
: ;though this seems pretty inefficient time-wise, there's probably a
: better way to get a symbol from 2 numbers. Do you really need bignums? 
: For the matrix indices as well as the values? When you're dealing with
: millions of entries (and, if it's sparse, the size of the matrix must be
: tens of millions), a byte or a word saved per entry can really add up. 
: So if you could find an unboxed representation (saving the pointer
: indirection) that might help a lot.

: If you can guarantee that your data, either the keys or values, will be
: within that range, you might use fixnums rather than bignums.

Interning a symbol takes a lot more space than allocating a cons cell.
Also, the use of fixnums vs. bignums is transparent to the
application:  if some of them are too big to be represented as
fixnums, the application will represent them, and only them, as
bignums.

One easy hack is to keep the original representation, but make the
hash key equal to (+ (lsh i 16) j) if both i and j are less than 32768
(assuming that fixnums are 30 bits long); that saves a cons cell for
all such entries.  (You have to tweak this a little to use the sign
bit when i takes a full 15 bits.)

From: Tim Bradshaw
Subject: Re: How to represent a LARGE sparse matrix??
Date: Mon, 03 Mar 1997 00:00:00 +0000
Message-ID: <ey34tet8856.fsf@grimsay.aiai.ed.ac.uk>

* John Atwood wrote:
> * Bruce L Lambert wrote:

>> I have a need to represent a large sparse matrix of fixnums. At present I
>> use a hash-table to represent the matrix. Each key to the hash-table is a
>> dotted pair whose car is the row index and whose cdr is the column index.

> In the meantime, you might save some space by making your hash key a
> symbol rather than a cons:  (make-symbol (format nil "~a,~a" num1 num2)) 
> ;though this seems pretty inefficient time-wise, there's probably a
> better way to get a symbol from 2 numbers. Do you really need bignums? 
> For the matrix indices as well as the values? When you're dealing with
> millions of entries (and, if it's sparse, the size of the matrix must be
> tens of millions), a byte or a word saved per entry can really add up. 
> So if you could find an unboxed representation (saving the pointer
> indirection) that might help a lot.  Look at the variables, 
>> most-positive-fixnum
> 536870911
>> most-negative-fixnum
> -536870912

If you're already consing a symbol as the hash key, the right thing to
do would be to keep the value as SYMBOL-VALUE of the symbol.  That way
you don't need a hashtable at all (other than the hashtable of
symbols...). But I would have thought that a hashtable is a good way of
doing it, as symbols are not that small (pointer for name, package,
plist, value, fn value + storage for name?  Which sounds to me like it
would be a lot bigger than your cons + hashtable entry).

How many objects are there?

--tim

From: Bruce L. Lambert
Subject: Re: How to represent a LARGE sparse matrix??
Date: Tue, 11 Mar 1997 00:00:00 +0000
Message-ID: <5g2vtn$fjv@nntp.interaccess.com>

>If you're already consing a symbol as the hash key, the right thing to
>do would be to keep the value as SYMBOL-VALUE of the symbol.  That way
>you don't need a hashtable at all (other than the hashtable of
>symbols...). But I would have thought that a hashtable is a good way of
>doing it, as symbols are not that small (pointer for name, package,
>plist, value, fn value + storage for name?  Which sounds to me like it
>would be a lot bigger than your cons + hashtable entry).

>How many objects are there?

Hi folks, 

I appreciate all the feedback about my sparse hash-table problem. Let me
give you a little more context. The sparse matrix is an interdocument
similarity matrix for a document clustering application. The sparse
matrix representation should (ideally, of course) be space efficient (see
below) and access-time efficient, because I need to iterate over its
elements many times to do the clustering. Analysis of a 4400 document
collection yielded about 2 million nonzero similarities. So, I was trying
to allocate a hash-table of :size 2000000. Here's what happened.

> (time (progn (setf foo (make-hash-table :size 2000000))))

*** - no more room for LISP objects
1. Break> (room)
33069720 ;
44152
1. Break> (lisp:shell "/usr/etc/pstat -s")
35468k allocated + 1044k reserved = 36512k used, 15284k available
1. Break> (lisp:bye)

Real time: 401.57315 sec.
Run time: 16.82 sec.
Space: 32012552 Bytes
GC: 2, GC time: 12.6 sec.
Bye.

As you can see, I have run into some memory limits, but they appear not
to be swap space limits. Marcus Daniels has been extremely helpful in my
attempts to overcome these limits, but we haven't quite succeeded yet. 
I'm running a 64 bit version of CLISP on a Sun Sparc 4, SunOS Release
4.1.4 (GENERIC). I think the machine has only 16MB of real memory, but
I'm not absolutely certain at this point. The Lisp stack overflows right
around 32MB.

Also,  in this CLISP implementation, hash-tables appear to use about 40
bytes per entry no matter what type the key is. For example:

> (time (make-hash-table :size 100))

Real time: 0.059696 sec.
Run time: 0.01 sec.
Space: 4152 Bytes
#S(HASH-TABLE EQL)
> (time (make-hash-table :size 100 :initial-contents '(((1 . 1) 1000))))

Real time: 0.001277 sec.
Run time: 0.0 sec.
Space: 4152 Bytes
#S(HASH-TABLE EQL ((1 . 1) 1000))

> (time (make-hash-table :size 100 :initial-contents '(("1 1" . 1000))))


Real time: 0.004736 sec.
Run time: 0.01 sec.
Space: 4152 Bytes
#S(HASH-TABLE EQL ("1 1" . 1000))

Thanks in advance for additional insight into these problems.

-bruce

From: Marco Antoniotti
Subject: Re: How to represent a LARGE sparse matrix??
Date: Mon, 03 Mar 1997 00:00:00 +0000
Message-ID: <s08vi78gb37.fsf@crawdad.ICSI.Berkeley.EDU>

I have been following this thread and I still have the follwing
question?

How do you code a sparse matrix in FORTRAN, or C, or COBOL, or Java?
It seemes to me that there should be *a lot* of literature by various
numerical analysts on the subject.

If anybody has pointers to such literature, please send me a message
and I will post a summary.

Thanks

-- 
Marco Antoniotti - Resistente Umano
===============================================================================

From: Mark McConnell
Subject: Re: How to represent a LARGE sparse matrix??
Date: Mon, 03 Mar 1997 00:00:00 +0000
Message-ID: <331B4733.528@math.okstate.edu>

Marco Antoniotti wrote:
> 
> How do you code a sparse matrix in FORTRAN, or C, or COBOL, or Java?
> It seemes to me that there should be *a lot* of literature by various
> numerical analysts on the subject.
> 
> If anybody has pointers to such literature, please send me a message
> and I will post a summary.

One source is _Direct Methods for Sparse Matrices_, by I. S. Duff,
A. M. Erisman, and J. K. Reid, in "Monographs on Numerical
Analysis", Clarendon Press, Oxford, 1986 [corrected 1989].

Or:
"Data Structures, Algorithms, and Software for Sparse Matrices,"
Iain S. Duff, in _Sparsity and its Appl'ns_, ed.
David J. Evans, Camb. U. Press, 1985, pp. 13-15.

Duff and Reid have been big names in this field for several
decades.  [Apologies to other big names I'm not naming--I don't
know this field].

I have missed any parts of this thread that dealt with the Lisp
aspects.  Since 1987 I personally have used a certain kind of sparse
matrix data structure adapted to Lisp.  I use only *integer* matrices,
which means I can't use much of what's in the numerical analysis
literature.  I need fast access to both row and column
operations, and have tried to write a sparse data structure
that provides it, with partial success.  You can look at my Lisp
code in
http://www.math.okstate.edu/~mmcconn/shh0297/sparsezmat.lisp

From: Thomas A. Russ
Subject: Re: How to represent a LARGE sparse matrix??
Date: Mon, 10 Mar 1997 00:00:00 +0000
Message-ID: <ymilo7vg38j.fsf@hobbes.isi.edu>

 >> * Bruce L Lambert wrote:
 > 
 >>> I have a need to represent a large sparse matrix of fixnums. At present I
 >>> use a hash-table to represent the matrix. Each key to the hash-table is a
 >>> dotted pair whose car is the row index and whose cdr is the column index.

One potential problem is that using a cons as a hash key means that you
need to use an EQUAL hash table.  Hashing for EQUAL is much more time
consuming than using an EQL hash table.  This can be particularly
painful when the hash table needs to be expanded, since that requires
rehashing all the entries!

A better approach would be to use nested hash tables.  The top one for
rows, and have each element be the entries for that row.  If it is
really sparse, you could potentially save time and possibly space by
using an alist instead of a hash table.  Depending on the
implementation, alists can be faster to use for tables with up to 20-100
elements.

-- 
Thomas A. Russ,  USC/Information Sciences Institute          ···@isi.edu

From: Marco Antoniotti
Subject: Re: How to represent a LARGE sparse matrix??
Date: Mon, 10 Mar 1997 00:00:00 +0000
Message-ID: <s08bu8r1uas.fsf@crawdad.ICSI.Berkeley.EDU>

···@ISI.EDU (Thomas A. Russ) writes:

> 
>  >> * Bruce L Lambert wrote:
>  > 
>  >>> I have a need to represent a large sparse matrix of fixnums. At present I
>  >>> use a hash-table to represent the matrix. Each key to the hash-table is a
>  >>> dotted pair whose car is the row index and whose cdr is the column index.
> 
> One potential problem is that using a cons as a hash key means that you
> need to use an EQUAL hash table.  Hashing for EQUAL is much more time
> consuming than using an EQL hash table.  This can be particularly
> painful when the hash table needs to be expanded, since that requires
> rehashing all the entries!

Why a EQUAL hash table should be much more slow (anybody has numbers
here?) that an EQL based one?  Should it not be slower just for the
factor that EQUAL is slower that EQL? (times the number of accesses?)
AFAIU the sparse hash table could be indexed over CONS pairs, which,
intuitively should not require more than

#I((time-of 'eql) * 2 + (overhead-of 'equal) + (overhead-of '%sxhash-cons))

where (overhead-of 'equal) and (overhead-of '%sxhas-cons) should not
be that big (numbers needed again per Common Lisp implementation) and
%sxhash-cons is considered not too exepensive either.

-- 
Marco Antoniotti - Resistente Umano
===============================================================================
International Computer Science Institute	| ·······@icsi.berkeley.edu
1947 Center STR, Suite 600			| tel. +1 (510) 643 9153
Berkeley, CA, 94704-1198, USA			|      +1 (510) 642 4274 x149
===============================================================================
	...it is simplicity that is difficult to make.
	...e` la semplicita` che e` difficile a farsi.
				Bertholdt Brecht