Iterating & Comparing String Data Sets Efficiently?

From: dpapathanasiou
Subject: Iterating & Comparing String Data Sets Efficiently?
Date: Fri, 01 Dec 2006 01:30:39 +0000
Message-ID: <1164936639.783408.219140@16g2000cwy.googlegroups.com>

I have data consisting of lists of strings, which I need to compare
versus each other, to find which strings each set has in common.

As an example, say I have this set, called my-data, composed of 5 lists
of strings:

> my-data

(("fancy" "ourselves" "creative" "DIY" "types" "mention" "cheap"
"furniture"
  "pile" "pillows" "roof" "tarp")
 ("shelling" "grade" "stuff" "folks" "dates" "breakfast" "sponsoring"
"tasty"
  "consisting" "shots")
 ("photographers" "saving" "pennies" "roof" "burns" "knees" "starting"
  "awfully")
 ("painful" "Tiny" "Dispatch" "Couches" "Valley" "Italian" "Chair"
"Pillows")
 ("Ladies" "breakfast" "posh" "lunches" "breed" "female" "charity"
"stuff"))

What I want to do is compare the first set -- (nth 0 my-data) -- versus
each of the other four sets and find out which string tokens they have
in common.

In this example, the answer for the first set -- (nth 0 my-data) -- is:

(NIL ; ignore itself
 NIL
 ("roof")      ; "roof" is common to both (nth 0) and (nth 2)
 ("Pillows") ; "pillows"             ''                       ''
(nth 3)
 NIL)

And so on for (nth 1 my-data), (nth 2 my-data), etc.

I've been using the (intersection) function wrapped inside a mapcar
iterator for this -- the real data, though different in content from
the example above, is structured exactly the same way: a list of lists
of strings.

Although it's logically correct, it's slow (the real data is quite
large), and so I've been thinking about ways of improving its
efficiency.

For example, instead of lists of strings, using hash tables, i.e.
iterate each hash, and look for where the keys (string) match, since
hash lookups are fast.

Also, I was wondering about more effective ways of storing large
amounts of string data in memory in the first place, e.g. perhaps as
lower-level arrays of chars or bits, but I'm not quite sure how to go
about that efficiently.

Any suggestions?

Re: Iterating & Comparing String Data Sets Efficiently? KevinZzz
Re: Iterating & Comparing String Data Sets Efficiently? Victor Kryukov
- Re: Iterating & Comparing String Data Sets Efficiently? dpapathanasiou
  - Re: Iterating & Comparing String Data Sets Efficiently? Victor Kryukov
    - Re: Iterating & Comparing String Data Sets Efficiently? dpapathanasiou
- Re: Iterating & Comparing String Data Sets Efficiently? Barry Margolin
Re: Iterating & Comparing String Data Sets Efficiently? dpapathanasiou
Re: Iterating & Comparing String Data Sets Efficiently? Rahul Jain

From: KevinZzz
Subject: Re: Iterating & Comparing String Data Sets Efficiently?
Date: Fri, 01 Dec 2006 03:37:00 +0000
Message-ID: <1164944220.202160.73600@80g2000cwy.googlegroups.com>

dpapathanasiou wrote:
> I have data consisting of lists of strings, which I need to compare
> versus each other, to find which strings each set has in common.
>
> As an example, say I have this set, called my-data, composed of 5 lists
> of strings:
>
> > my-data
>
> (("fancy" "ourselves" "creative" "DIY" "types" "mention" "cheap"
> "furniture"
>   "pile" "pillows" "roof" "tarp")
>  ("shelling" "grade" "stuff" "folks" "dates" "breakfast" "sponsoring"
> "tasty"
>   "consisting" "shots")
>  ("photographers" "saving" "pennies" "roof" "burns" "knees" "starting"
>   "awfully")
>  ("painful" "Tiny" "Dispatch" "Couches" "Valley" "Italian" "Chair"
> "Pillows")
>  ("Ladies" "breakfast" "posh" "lunches" "breed" "female" "charity"
> "stuff"))
>
> What I want to do is compare the first set -- (nth 0 my-data) -- versus
> each of the other four sets and find out which string tokens they have
> in common.
>
> In this example, the answer for the first set -- (nth 0 my-data) -- is:
>
> (NIL ; ignore itself
>  NIL
>  ("roof")      ; "roof" is common to both (nth 0) and (nth 2)
>  ("Pillows") ; "pillows"             ''                       ''
> (nth 3)
>  NIL)
>
> And so on for (nth 1 my-data), (nth 2 my-data), etc.
>
> I've been using the (intersection) function wrapped inside a mapcar
> iterator for this -- the real data, though different in content from
> the example above, is structured exactly the same way: a list of lists
> of strings.
>
> Although it's logically correct, it's slow (the real data is quite
> large), and so I've been thinking about ways of improving its
> efficiency.
>
> For example, instead of lists of strings, using hash tables, i.e.
> iterate each hash, and look for where the keys (string) match, since
> hash lookups are fast.
>
> Also, I was wondering about more effective ways of storing large
> amounts of string data in memory in the first place, e.g. perhaps as
> lower-level arrays of chars or bits, but I'm not quite sure how to go
> about that efficiently.
>
> Any suggestions?

D,

my two cent: create 'first-letter-&-word-length' representations of
your data-set (words)..

I'm possibly way off... but it'd make intersection much easier.. of
course, the cost/benefit ratio depends on your specific
data-set-'scenario'...

K

From: Victor Kryukov
Subject: Re: Iterating & Comparing String Data Sets Efficiently?
Date: Fri, 01 Dec 2006 04:02:33 +0000
Message-ID: <1164945753.650647.230460@79g2000cws.googlegroups.com>

dpapathanasiou wrote:
> I have data consisting of lists of strings, which I need to compare
> versus each other, to find which strings each set has in common.

> I've been using the (intersection) function wrapped inside a mapcar
> iterator for this -- the real data, though different in content from
> the example above, is structured exactly the same way: a list of lists
> of strings.
>
> Although it's logically correct, it's slow (the real data is quite
> large), and so I've been thinking about ways of improving its
> efficiency.

Hi dpapathanasiou,

As you may see below, intersection is not very efficient (at least in
SBCL) - intersection for two sets of length N and K is O(NK)
operations. If you would sort all sets upfront, and write your own
function to traverse lists in parallel, you would get O(max(N,K))
operations.

Just my 2 cents,
Victor.

(defun intersection (list1 list2
                     &key key (test #'eql testp) (test-not nil notp))
  #!+sb-doc
  "Return the intersection of LIST1 and LIST2."
  (declare (inline member))
  (when (and testp notp)
    (error ":TEST and :TEST-NOT were both supplied."))
  (let ((key (and key (%coerce-callable-to-fun key))))
    (let ((res nil))
      (dolist (elt list1)
        (if (with-set-keys (member (apply-key key elt) list2))
            (push elt res)))
      res)))

From: dpapathanasiou
Subject: Re: Iterating & Comparing String Data Sets Efficiently?
Date: Fri, 01 Dec 2006 14:44:22 +0000
Message-ID: <1164984261.989101.310750@n67g2000cwd.googlegroups.com>

Victor Kryukov wrote:
> operations. If you would sort all sets upfront, and write your own
> function to traverse lists in parallel, you would get O(max(N,K))
> operations.

Hi Victor,

Thanks for suggesting this -- we're using CMUCL here, but is it safe to
say that running (member) on a pre-sorted list is more efficient than
(intersection) in general, or will it vary from implementation to
implementation?

From: Victor Kryukov
Subject: Re: Iterating & Comparing String Data Sets Efficiently?
Date: Fri, 01 Dec 2006 15:12:16 +0000
Message-ID: <m2ac27zm3z.fsf@victor-kryukovs-computer.local>

Hello Denis,

"dpapathanasiou" <···················@gmail.com> writes:

> Victor Kryukov wrote:
>> operations. If you would sort all sets upfront, and write your own
>> function to traverse lists in parallel, you would get O(max(N,K))
>> operations.
>
> Thanks for suggesting this -- we're using CMUCL here, but is it safe to
> say that running (member) on a pre-sorted list is more efficient than
> (intersection) in general, or will it vary from implementation to
> implementation?

I wasn't mean to use the member - I believe even in the case of sorted
lists, that will lead to O(N*K) operations. What I mean is something
like the following, when you traverse both lists only *once* - e.g. if
we found that first elements of the lists are equal, we could safely
add first element to the sorted-intersection of the CDRs of the
initial lists. In that case, we have O(min(length(list1),
length(list2))) computations.

(defun sorted-intersection (list1 list2 &key (test #'eql) cmp)
  "Intersection for sorted lists. cmp should be the same function that
was used for sorting lists."
  (do ((result nil)
       (tail1 list1)
       (tail2 list2))
      ((or (null tail1) (null tail2)) result)
    (cond
      ((funcall test (car tail1) (car tail2))
       (push (car tail1) result)
       (setf tail1 (cdr tail1)
	     tail2 (cdr tail2)))
      ((funcall cmp (car tail1) (car tail2))
       (setf tail1 (cdr tail1)))
      (t
       (setf tail2 (cdr tail2))))))

(setf a '("this" "is" "a" "first" "list")
      b '("this" "list" "is" "a" "second" "one"))

(setf a (sort a #'string<) b (sort b #'string<))

CL-USER> (sorted-intersection a b :test #'equal :cmp #'string<)
("this" "list" "is" "a")

Regards,
Victor.

From: dpapathanasiou
Subject: Re: Iterating & Comparing String Data Sets Efficiently?
Date: Fri, 01 Dec 2006 15:21:41 +0000
Message-ID: <1164986501.270822.130970@l12g2000cwl.googlegroups.com>

Victor Kryukov wrote:

> I wasn't mean to use the member - I believe even in the case of sorted
> lists, that will lead to O(N*K) operations. What I mean is something
> like the following, when you traverse both lists only *once* - e.g. if
> we found that first elements of the lists are equal, we could safely
> add first element to the sorted-intersection of the CDRs of the
> initial lists. In that case, we have O(min(length(list1),
> length(list2))) computations.

Ah, thanks for the clarification, and I see what you mean: i.e. don't
fall into the "Schlemiel the painter" type of inefficiency
(http://discuss.fogcreek.com/techInterview/default.asp?cmd=show&ixPost=153).

From: Barry Margolin
Subject: Re: Iterating & Comparing String Data Sets Efficiently?
Date: Sat, 02 Dec 2006 04:21:26 +0000
Message-ID: <barmar-8C5665.23212601122006@comcast.dca.giganews.com>

In article <························@79g2000cws.googlegroups.com>,
 "Victor Kryukov" <··············@gmail.com> wrote:

> As you may see below, intersection is not very efficient (at least in
> SBCL) - intersection for two sets of length N and K is O(NK)
> operations.

Repeatedly calling MEMBER is not really the most efficient way to do 
this for large data sets.  A better way would be to create a hash table 
with the elements of one set as the keys, and then call GETHASH for the 
elements of the other set.  This is O(N+K).

Or, as someone else suggested, you can sort the two lists, and then walk 
them together.  The walk is O(min(N,K)), but the sorts will be something 
like O(NlogN)+O(KlogK).

-- 
Barry Margolin, ······@alum.mit.edu
Arlington, MA
*** PLEASE post questions in newsgroups, not directly to me ***
*** PLEASE don't copy me on replies, I'll read them in the group ***

From: dpapathanasiou
Subject: Re: Iterating & Comparing String Data Sets Efficiently?
Date: Fri, 01 Dec 2006 14:32:34 +0000
Message-ID: <1164983554.763533.266590@80g2000cwy.googlegroups.com>

Madhu wrote:
> More recently (in 1998) the data structure, Ternary Search Trees was
> described in Dr.Dobbs journal and claims to be appropriate. There is
> an elisp implementation in the `predictive' package.

Madhu,

Thanks so much for mentioning this.

If anyone else is interested, the guys who wrote that article have a
site dedicated to that concept
(http://www.cs.princeton.edu/~rs/strings/) and even though their DDJ
link is dead, I also found the original article as well
(http://www.ddj.com/dept/windows/184410528).

From: Rahul Jain
Subject: Re: Iterating & Comparing String Data Sets Efficiently?
Date: Sun, 03 Dec 2006 01:10:13 +0000
Message-ID: <878xhpwzre.fsf@nyct.net>

Madhu <·······@meer.net> writes:

> CLORB used to include a simple trie implementation which should be
> sufficient for most purposes. (I preferred to add a hash-table like
> API on top and functions to merge, compress, and manipulate it). This
> is sufficient for the biggest dictionaries in english, and the only
> caveat is to be careful with development environments (like slime)
> which barf on printing huge conses on error.

PAIP also uses tries for its unification algorithm. Maybe cl-unification
also has a trie implementation.

-- 
Rahul Jain
·····@nyct.net
Professional Software Developer, Amateur Quantum Mechanicist