More efficient alternatives to (subseq) for accessing parts of a string array?

From: dpapathanasiou
Subject: More efficient alternatives to (subseq) for accessing parts of a string array?
Date: Thu, 12 Apr 2007 13:27:34 +0000
Message-ID: <1176384454.413064.108770@w1g2000hsg.googlegroups.com>

I'm using (read-sequence) within a (with-open-file) block to read a
plain text file into a one-dimensional array.

The file is large, but each line is a fixed width, so I can load any n
lines of the file by computing the size of the array (read-sequence)
fills to match (* n line-length).

Once in memory, I'd like to be able to pick out lists of the same
attribute that appears in each line of the file.

I've come up with this, and although it works logically, the
combination of (push) and (subseq) is creating a lot of consing.

Is there a more efficient way of doing this?

(defun get-attribute-vector (chunk attribute increment points-fn)
  (let ((line-count (/ (length chunk) increment)))
    (when (integerp line-count)
      (let ((v nil))
        (multiple-value-bind (a b)
            (funcall points-fn 0 attribute)
          (dotimes (i line-count)
            (push (subseq chunk (+ (* i increment) a) (+ (* i
increment) b)) v)))
        v))))

n.b.: chunk is the one-dimensional array returned by (read-sequence),
increment is the line size, and points-fn is the function which
calculates the (subseq) start and end points for a specific line
number and attribute definition (defined as parameters elsewhere).

Re: More efficient alternatives to (subseq) for accessing parts of a string array? dpapathanasiou
- Re: More efficient alternatives to (subseq) for accessing parts of a string array? Edi Weitz
  - Re: More efficient alternatives to (subseq) for accessing parts of a string array? dpapathanasiou
    - Re: More efficient alternatives to (subseq) for accessing parts of a string array? Edi Weitz
      - Re: More efficient alternatives to (subseq) for accessing parts of a string array? dpapathanasiou
- Re: More efficient alternatives to (subseq) for accessing parts of a string array? Pascal Bourguignon
  - Re: More efficient alternatives to (subseq) for accessing parts of a string array? dpapathanasiou
    - Re: More efficient alternatives to (subseq) for accessing parts of a string array? Pascal Bourguignon
  - cl-ppcre vs. search-string (was: More efficient alternatives to (subseq) for accessing parts of a string array?) Kirk Job Sluder
Re: More efficient alternatives to (subseq) for accessing parts of a string array? Rob Warnock
Re: More efficient alternatives to (subseq) for accessing parts of a string array? Tim Bradshaw

From: dpapathanasiou
Subject: Re: More efficient alternatives to (subseq) for accessing parts of a string array?
Date: Thu, 12 Apr 2007 15:49:55 +0000
Message-ID: <1176392995.541430.184300@w1g2000hsg.googlegroups.com>

I just realized that I could use (aref) instead of (subseq), like
this:

(defun get-attribute-vector (chunk attribute increment points-fn)
  (let ((line-count (/ (length chunk) increment)))
    (when (integerp line-count)
      (let ((v nil))
        (multiple-value-bind (a b)
            (funcall points-fn 0 attribute)
          (dotimes (i line-count)
            (push
             (with-output-to-string (s)
               (dotimes (j (- b a))
                 (format s "~A" (aref chunk (+ (* i increment) a
j)))))
             v)))
        v))))

But that still leaves me with (push) as a significant consing element
-- any other way to implement that logic?

The data sets I'm playing with are such that chunk is around 40 mb in
size, and while not super large, the increment (line length) value is
90 bytes, so this routine winds up calling push on the order of
450,000 times, so a non/less-consing alternative to (push) *will* make
a huge difference.

From: Edi Weitz
Subject: Re: More efficient alternatives to (subseq) for accessing parts of a string array?
Date: Thu, 12 Apr 2007 16:03:32 +0000
Message-ID: <u8xcx37hn.fsf@agharta.de>

On 12 Apr 2007 08:49:55 -0700, "dpapathanasiou" <···················@gmail.com> wrote:

> But that still leaves me with (push) as a significant consing
> element [...] so this routine winds up calling push on the order of
> 450,000 times, so a non/less-consing alternative to (push) *will*
> make a huge difference.

Have you actually profiled your code or is this just superstition?  I
can't believe that calling PUSH half a million times can be a problem.

  CL-USER 1 > (defun foo (n)
                (let (result)
                  (dotimes (i n)
                    (push i result))))
  FOO

  CL-USER 2 > (compile *)
  FOO
  NIL
  NIL

  CL-USER 3 > (time (foo 500000))
  Timing the evaluation of (FOO 500000)

  User time    =        0.100
  System time  =        0.000
  Elapsed time =        0.100
  Allocation   = 6003844 bytes
  0 Page faults
  NIL

  CL-USER 4 > (time (foo 500000))
  Timing the evaluation of (FOO 500000)

  User time    =        0.060
  System time  =        0.000
  Elapsed time =        0.071
  Allocation   = 6000000 bytes
  0 Page faults
  NIL

  CL-USER 5 > (time (foo 500000))
  Timing the evaluation of (FOO 500000)

  User time    =        0.060
  System time  =        0.000
  Elapsed time =        0.060
  Allocation   = 6000000 bytes
  0 Page faults
  NIL

-- 

Lisp is not dead, it just smells funny.

Real email: (replace (subseq ·········@agharta.de" 5) "edi")

From: dpapathanasiou
Subject: Re: More efficient alternatives to (subseq) for accessing parts of a string array?
Date: Thu, 12 Apr 2007 19:13:47 +0000
Message-ID: <1176405227.447837.310890@y80g2000hsf.googlegroups.com>

> Have you actually profiled your code or is this just superstition?  I
> can't believe that calling PUSH half a million times can be a problem.

Yes, here's a test example: chunk (aapl-q in this run) is 41,856,815
bytes and 459,965 lines:

CL-USER> (length aapl-q) ; chunk, the 1-d array filled by (read-
sequence)
41856815
CL-USER> (/ (length aapl-q) *quote-increment*) ; number of text lines
in chunk
459965

First, the (push) + (subseq) version:

CL-USER> (time (get-attribute-vector aapl-q 'EXCHANGE *quote-
increment* #'quote-subseq-points))
; Compiling LAMBDA NIL:
; Compiling Top-Level Form:

; Evaluation took:
;   21.06 seconds of real time
;   18.49 seconds of user run time
;   0.68 seconds of system run time
;   54,621,027,220 CPU cycles
;   [Run times include 1.51 seconds GC run time]
;   0 page faults and
;   526,626,696 bytes consed.
;


Switching to (aref) instead of (subseq) is better, but still
relatively slow:

CL-USER> (time (get-attribute-vector aapl-q 'EXCHANGE *quote-
increment* #'quote-subseq-points))
; Compiling LAMBDA NIL:
; Compiling Top-Level Form:

; Evaluation took:
;   4.81 seconds of real time
;   2.4 seconds of user run time
;   0.4 seconds of system run time
;   12,498,156,496 CPU cycles
;   [Run times include 1.18 seconds GC run time]
;   0 page faults and
;   355,390,792 bytes consed.
;

From: Edi Weitz
Subject: Re: More efficient alternatives to (subseq) for accessing parts of a string array?
Date: Thu, 12 Apr 2007 19:46:02 +0000
Message-ID: <uk5whz891.fsf@agharta.de>

On 12 Apr 2007 12:13:47 -0700, "dpapathanasiou" <···················@gmail.com> wrote:

>> Have you actually profiled your code or is this just superstition?
>> I can't believe that calling PUSH half a million times can be a
>> problem.
>
> Yes, here's a test example:
>
> [...]
>
> First, the (push) + (subseq) version:
>
> CL-USER> (time (get-attribute-vector aapl-q 'EXCHANGE *quote-increment* #'quote-subseq-points))
> ; Compiling LAMBDA NIL:
> ; Compiling Top-Level Form:
>
> ; Evaluation took:
> ;   21.06 seconds of real time
> ;   18.49 seconds of user run time
> ;   0.68 seconds of system run time
> ;   54,621,027,220 CPU cycles
> ;   [Run times include 1.51 seconds GC run time]
> ;   0 page faults and
> ;   526,626,696 bytes consed.
>
> Switching to (aref) instead of (subseq) is better, but still
> relatively slow:
>
> CL-USER> (time (get-attribute-vector aapl-q 'EXCHANGE *quote-increment* #'quote-subseq-points))
> ; Compiling LAMBDA NIL:
> ; Compiling Top-Level Form:
>
> ; Evaluation took:
> ;   4.81 seconds of real time
> ;   2.4 seconds of user run time
> ;   0.4 seconds of system run time
> ;   12,498,156,496 CPU cycles
> ;   [Run times include 1.18 seconds GC run time]
> ;   0 page faults and
> ;   355,390,792 bytes consed.

Measuring the overall time isn't /profiling/, and using AREF instead
of SUBSEQ has nothing to do with PUSH.  Note that the difference
between the two versions is more than 16 seconds, but the difference
in GC run time is only about a third of a second.

-- 

Lisp is not dead, it just smells funny.

Real email: (replace (subseq ·········@agharta.de" 5) "edi")

From: dpapathanasiou
Subject: Re: More efficient alternatives to (subseq) for accessing parts of a string array?
Date: Thu, 12 Apr 2007 21:16:19 +0000
Message-ID: <1176412579.409206.111160@n76g2000hsh.googlegroups.com>

> Measuring the overall time isn't /profiling/, and using AREF instead
> of SUBSEQ has nothing to do with PUSH.  Note that the difference
> between the two versions is more than 16 seconds, but the difference
> in GC run time is only about a third of a second.

Right, (subseq)/(aref) has nothing to do with (push) -- if you look
closely at the two versions, you'll notice they both use (push).

But, replacing (subseq) with (aref) *did* make a difference, and so
coming up with alternatives to using (push), e.g. using a vector
instead of a list, thereby eliminating the need for (push), etc.,
seemed the next logical step.

From: Pascal Bourguignon
Subject: Re: More efficient alternatives to (subseq) for accessing parts of a string array?
Date: Thu, 12 Apr 2007 16:04:58 +0000
Message-ID: <87odltk28l.fsf@voyager.informatimago.com>

"dpapathanasiou" <···················@gmail.com> writes:

> I just realized that I could use (aref) instead of (subseq), like
> this:
>
> (defun get-attribute-vector (chunk attribute increment points-fn)
>   (let ((line-count (/ (length chunk) increment)))
>     (when (integerp line-count)
>       (let ((v nil))
>         (multiple-value-bind (a b)
>             (funcall points-fn 0 attribute)
>           (dotimes (i line-count)
>             (push
>              (with-output-to-string (s)
>                (dotimes (j (- b a))
>                  (format s "~A" (aref chunk (+ (* i increment) a
> j)))))
>              v)))
>         v))))
>
> But that still leaves me with (push) as a significant consing element
> -- any other way to implement that logic?

WITH-OUTPUT-TO-STRING will probably cons more than subseq, and it +
FORMAT will take much more time.

What you can do, is to avoid copying the characters you already store
in chunk, by using displaced arrays.  See my NSUBSEQ in
http://darcs.informatimago.com/lisp/common-lisp/utility.lisp

Displaced arrays are like pointers inside arrays.  They take some
space to describe the displacement, but not to store the data (storage
is done in the original array).  If your attributes (subseqs) are
bigger than a few words (12 or 16 octets in most implementations I'd
guess), it may be worthwhile.  On the other hand, if you have unicode
characters, it might be worthwhile as soon as you have attributes of
more than 3 or 4 characters.

> The data sets I'm playing with are such that chunk is around 40 mb in
> size, and while not super large, the increment (line length) value is
> 90 bytes, so this routine winds up calling push on the order of
> 450,000 times, so a non/less-consing alternative to (push) *will* make
> a huge difference.

Of course, you have to allocate a vector or a list to collect the
attributes.  Since you have a lot of entries, a vector would take half
the space than a list, so you could start by counting them, allocate
the vector v and store the attributes.

Now, indexing into v, and going thru a displaced array takes some
time.  Since you seem to have fixed-length fields always at the same
position on the lines, it could be that it is not faster than indexing
directly the  chunk.  You could write a some function to index and use
these attributes directly in chunk, and if you need to extract them to
pass them to other lisp routines, use NSUBSEQ (or even SUBSEQ),
creating only short lived displaced arrays.

-- 
__Pascal Bourguignon__
http://www.informatimago.com
http://pjb.ogamita.org

From: dpapathanasiou
Subject: Re: More efficient alternatives to (subseq) for accessing parts of a string array?
Date: Thu, 12 Apr 2007 19:27:56 +0000
Message-ID: <1176406076.917287.157690@n76g2000hsh.googlegroups.com>

> WITH-OUTPUT-TO-STRING will probably cons more than subseq, and it +
> FORMAT will take much more time.

Well, the second version was better, so while (with-output-to-string)
and (format) are consing, they are doing less of it than (subseq)
alone (see my reply to Edi).

Most attributes are one-byte ascii characters, so I could, in those
cases, get away without using (with-output-to-string) and (format),
perhaps loading the result into another array or vector, instead of
pushing to a list.

> What you can do, is to avoid copying the characters you already store
> in chunk, by using displaced arrays.  See my NSUBSEQ inhttp://darcs.informatimago.com/lisp/common-lisp/utility.lisp
>
> Displaced arrays are like pointers inside arrays.  They take some
> space to describe the displacement, but not to store the data (storage
> is done in the original array).  If your attributes (subseqs) are
> bigger than a few words (12 or 16 octets in most implementations I'd
> guess), it may be worthwhile.  On the other hand, if you have unicode
> characters, it might be worthwhile as soon as you have attributes of
> more than 3 or 4 characters.

Thanks, I'll definitely take a closer look at that; at first glance,
it seems exactly what I need (i.e., since they leverage the data
already stored in chunk as opposed to re-copying it elsewhere).

> Of course, you have to allocate a vector or a list to collect the
> attributes.  Since you have a lot of entries, a vector would take half
> the space than a list, so you could start by counting them, allocate
> the vector v and store the attributes.
>
> Now, indexing into v, and going thru a displaced array takes some
> time.  Since you seem to have fixed-length fields always at the same
> position on the lines, it could be that it is not faster than indexing
> directly the  chunk.  You could write a some function to index and use
> these attributes directly in chunk, and if you need to extract them to
> pass them to other lisp routines, use NSUBSEQ (or even SUBSEQ),
> creating only short lived displaced arrays.

I'll have to find out which works better; lookup time is important,
but not at the expense of creating the index in the first place.

From: Pascal Bourguignon
Subject: Re: More efficient alternatives to (subseq) for accessing parts of a string array?
Date: Fri, 13 Apr 2007 07:05:40 +0000
Message-ID: <87fy74kb3v.fsf@voyager.informatimago.com>

"dpapathanasiou" <···················@gmail.com> writes:
>
> Most attributes are one-byte ascii characters, so I could, in those
> cases, get away without using (with-output-to-string) and (format),
> perhaps loading the result into another array or vector, instead of
> pushing to a list.

CHARACTER are not one-byte, usually, in today implementations.  Most
CL implementation nowadays handle unicode, and use 32 bits to store
one character. (21 for unicode, plus some for tags, plus some filler).

Some implementations (eg allegro or clisp compiled without unicode
support) come in two flavors, a 8-bit per character version and a
unicode version using more space.

If you choose to store only the Common Lisp standard characters (which
is the same set as the ASCII character set), your loss, you have space
to store some ideograms too.

If you want to be sure that  you use only 8 bits per character, either
use implementation  dependent features  (like specifying a  subtype of
character  for your  vectors,  type  declarations may  or  may not  do
something  in   your  implementation),  or   explicitely  encode  your
character into (unsigned-byte 8).

>> What you can do, is to avoid copying the characters you already store
>> in chunk, by using displaced arrays.  See my NSUBSEQ inhttp://darcs.informatimago.com/lisp/common-lisp/utility.lisp
>>
>> Displaced arrays are like pointers inside arrays.  They take some
>> space to describe the displacement, but not to store the data (storage
>> is done in the original array).  If your attributes (subseqs) are
>> bigger than a few words (12 or 16 octets in most implementations I'd
>> guess), it may be worthwhile.  On the other hand, if you have unicode
>> characters, it might be worthwhile as soon as you have attributes of
>> more than 3 or 4 characters.
>
> Thanks, I'll definitely take a closer look at that; at first glance,
> it seems exactly what I need (i.e., since they leverage the data
> already stored in chunk as opposed to re-copying it elsewhere).

One thing worth mentionning is that you can use a displaced array as a
window, the displacement can be modified dynamically, so you don't
need to allocate new displaced arrays, just reuse the same when you
want to look at different parts.  This is very handy when you want to
avoid consing.

-- 
__Pascal Bourguignon__
http://www.informatimago.com
http://pjb.ogamita.org

From: Kirk Job Sluder
Subject: cl-ppcre vs. search-string (was: More efficient alternatives to (subseq) for accessing parts of a string array?)
Date: Thu, 12 Apr 2007 18:20:21 +0000
Message-ID: <m2odlt1ml6.fsf_-_@blue.local>

On a related note, anyone have any advise on the relative performance
of ppcre without wildcards vs. search-string?  Is the additional
performance worth the mess of creating and switching between two
search functions?

-- 
Kirk Job Sluder

From: Rob Warnock
Subject: Re: More efficient alternatives to (subseq) for accessing parts of a string array?
Date: Fri, 13 Apr 2007 01:02:59 +0000
Message-ID: <EIWdnb82c5teS4PbnZ2dnUVZ_qOpnZ2d@speakeasy.net>

dpapathanasiou <···················@gmail.com> wrote:
+---------------
| I'm using (read-sequence) within a (with-open-file) block to read a
| plain text file into a one-dimensional array.
...
| Once in memory, I'd like to be able to pick out lists of the same
| attribute that appears in each line of the file.
+---------------

In addition to other advice you've been given, I've found
the CL functions SEARCH, MISMATCH, POSITION, and POSITION-IF
to be *very* useful for doing non-consing string parsing.

Then, having found the bounding indices for the tokens
you're looking for, you can use either/both READ-FROM-STRING
or/and PARSE-INTEGER with the :START & :END keyword args
to further avoid using SUBSEQ.


-Rob

p.s. Don't forget ":JUNK-ALLOWED T" in PARSE-INTEGER calls,
when appropriate.

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607

From: Tim Bradshaw
Subject: Re: More efficient alternatives to (subseq) for accessing parts of a string array?
Date: Fri, 13 Apr 2007 09:14:36 +0000
Message-ID: <1176455676.114014.137630@l77g2000hsb.googlegroups.com>

On Apr 12, 2:27 pm, "dpapathanasiou" <···················@gmail.com>
wrote:
> I'm using (read-sequence) within a (with-open-file) block to read a
> plain text file into a one-dimensional array.
>
> The file is large, but each line is a fixed width, so I can load any n
> lines of the file by computing the size of the array (read-sequence)
> fills to match (* n line-length).
>

Be a little careful in how you count lines.  In particular it's not
generally safe to assume that you can know the number of lines based
on the length of the file and the length of each line, because the
length of the file (as returned by any common OS) will be in octets
and the mapping on to characters and lines is not quite as simple as
you'd like.

The modern case of this is encodings which take more than one octet
per character.

The more traditional case is line-ending differences, and in
particular the differences between the normal Unix and DOS sequences.
DOS-encoded files have fewer lines naive Unix-based programs think
they have.  I've spent some time fixing bugs induced in Lisp libraries
due to just this Unix-centrism.

(Of course, you may know the encoding of your files, and/or you may be
reading line-by-line and know n from some other information.)

--tim