From: Robert Monfera
Subject: Memory storage efficiency
Date: 
Message-ID: <383A0C3F.B7309AFA@fisec.com>
Hello,

I need some help about how to best store data that resemble a database
table.  I am interested in comments for both ACL and LWW.

I've got well-behaving records - say, each record is made up of 6
fixnums. On the surface, it would make sense to represent a record as a
class instance, a structure instance or a vector, and the table as a
vector or a hash-table.

Normally I would not care about the overhead, but there could
potentially be millions of such records, so it is important that the
size of a table does not exceed (* 6 4 rownum) bytes.

The problem I am having is that the overhead of instances or vectors is
higher than the number of bytes stored.  This overhead serves no useful
purpose, because the table is homogeneous.

I could use a two-dimensional array of fixna or unsigned bytes, but then
I would loose the ability to use the built-in sort function - which is
not nearly as fast on unboxed values as boxed ones.  I need to be able
to efficiently sort based on any of the 6 numbers.

I am interested in ways to minimize overhead when storing vectors or
instances in vectors or hash-tables.  Alternatively, pointers to sorting
algorithms that are at least as efficient as the built-in sort function
(when it uses merge-sort), but work on arrays of numbers are
appreciated.  In both cases, I am wondering how to measure space
consumption of objects, besides using time.  (By the way, what's the
size of a cons cell in the above implementations?)

(One specialty here is that a possibly pre-sorted array is the object of
sorting.  For example, the array is pre-sorted by columns A,B,C,D,E,F,
and the requirement is to sort by e.g., C,A,B,D,F,E or sort _and_ group
by C,B,F only.  Again, I appreciate any pointers on relevant
algorithms.)

Thanks for your help,
Robert

From: Robert Monfera
Subject: Re: Memory storage efficiency
Date: 
Message-ID: <383A169F.31A5FD67@fisec.com>
... it seems that posting on the Usenet enlightens me about obvious
solutions.  In this case, I can prepare a vector of array row indices,
which is then easy to sort with the built-in sort function, referring
back to array column values for comparison.  Retrieval is as easy as
walking through the sorted row numbers and retrieving the rows.  The
advantage is that row numbers are small (guaranteed to be fixnums), so
there is not much data to move around.

I think that mergesort may be preferable because of the pre-ordering and
preferable worst-case scenario.  I still wonder though how can I best
realize the benefits of pre-ordering -

for example, when there are two colums:

A 1
A 2
A 3
B 2
B 5
C 1
C 4
C 6

Here the target is to order by the 2nd column primarily, and the 1st
column secondarily.  Merge-sort may give better performance if we use
the knowledge that we can 'cut' at the boundaries of A, B and C, and
merge from three sources.

Is there any published (or proprietary :-) algorithm to perform such a
specialised merge-sort?  I can probably figure it out, but there may be
some worst-case scenarios - like merge-sorting from so many sources as
rows (if all letters are the same), and there may be multi-column
complications.

Thanks,
Robert
From: Robert Monfera
Subject: Re: Memory storage efficiency
Date: 
Message-ID: <383A49C6.5D17EB84@fisec.com>
Re sorting:

I was overly optimistic - I thought comparison operators would work on
fixnums or unsigned bytes without boxing (is there a reason why they
don't?).  Because I cannot afford the 2x space and speed overhead of a
simple-array ("specialized" for t), it seems it leaves me no option. 
(Also, boolean and bit functions either do or require consing.)

I saw a posting from Duane, in which he writes that unboxed comparison
may be difficult to do in the general case.  I'd be happy with a
solution where I _need_ to declare that fixnums are used:

(defun sorti (array index column)
  "Sorts index (fixnum vector) based on column number of fixnum array."
  (declare (optimize (speed 3) (safety 0)))
  (time (sort index #'(lambda (rownum1 rownum2) 
                        (declare (type fixnum rownum1 rownum2))
                        (let ((foo (aref array rownum1 column))
                              (bar (aref array rownum2 column)))
                          (declare (type fixnum foo bar))
                          (< foo bar)))))
  nil)

(sorti a i 1)

LWW says:

1.8 seconds used.
Standard Allocation 12324304 bytes.
Fixlen Allocation 0 bytes.

(Runtime goes up to 8(!) seconds if index is preordered, so it probably
does not even use a mergesort.)

Why it is difficult to compare two fixnums, if it's perfectly possible
to subtract them.  I'd even settle with minusp - too bad that's boxing
too, and wants to use real instead of a fixnum!

Thanks for any suggestions in my challenge.
Robert
From: Pekka P. Pirinen
Subject: Re: Memory storage efficiency
Date: 
Message-ID: <ixaeo5krku.fsf@gaspode.cam.harlequin.co.uk>
Robert Monfera <·······@fisec.com> writes:
> I was overly optimistic - I thought comparison operators would work on
> fixnums or unsigned bytes without boxing (is there a reason why they
> don't?).  [...]
> (Also, boolean and bit functions either do or require consing.)

None of those box on fixnums.  I tried your SORTI function in LWW 4.1,
and it didn't cons anything either.  There must be something strange
about the way you initialize the arrays.

> I saw a posting from Duane, in which he writes that unboxed comparison
> may be difficult to do in the general case.

This is quite true.  SORT takes a functional argument, after all.  So
some advanced cross-function optimization would need to happen.  But
fixnums are never boxed.
-- 
Pekka P. Pirinen
Adaptive Memory Management Group, Harlequin Limited
If it's spam, it's a scam.  Don't do business with net abusers.