Problem with search

From: tim
Subject: Problem with search
Date: Sat, 08 Mar 2008 00:51:19 +0000
Message-ID: <13t3oo75ca5nd04@corp.supernews.com>

I am trying to find an efficient search algorithm for this problem. I
hasten to add this is not a homework assignment.

The problem is variable name references in COBOL. Basically you can define
nested structures consisting of names (and attributes, which I have
omitted here). Names need not be unique. However when accessing a variable
you must specify a unique path, which may be a partial path.

Let's say a path a.b.c.d exists. You could refer to "d" as "a.d" or
"a.b.c.d" or "d", as long as these paths are unique. If a path "a.x.d" also
existed, you could not use "d" or "a.d" to access either variable because
this path is now ambiguous.

Here is some sample data and a simplistic solution.

(defparameter *vars*
  '((a) ;; a is a single variable
    (b) ;; b is a single variable
    (c (d) (e) (f)) ;; c is a variable made up of d, e and f
    (c (d) (f) (g)) ;; c is a variable made up of d, f and g
    (c (d) (f) (z)) ;; c is a variable made up of d, f and g
    (f) ;; f is a single variable
    (g) ;; g is a single variable
    (h (i (j (k (l (m) (n))))))) ;; h contains i, which contains j
                                  ;; which contains k which contains l
                                  ;; which contains m and n
  "Define the nested variable structures. At each level you have
  a list. A list is a variable symbol followed by the symbols
  for variables within it, if any. Note that multiple variables can have the
  same name. It is only an error if a reference is ambiguous.")

(defun find-a-var (var-component-list)
  (let ((sol (find-var-name nil var-component-list *vars*)))
    (if (> (length sol) 1)
        (format t "~%Ambiguous")
        (if (not sol)
            (format t "~%No solution")))
    (dolist (el sol)
      (format t "~%Solution ~S" el))
    (format t "~%")
    sol))
            

(defun find-var-name (acc search-name-left vars)
  (let ((soln-acc nil))
    (if (not search-name-left)
        (let ((soln (list (append (reverse acc) search-name-left vars))))
          soln)
        (if (not vars)
            nil
            (progn
              (dolist (var vars)
                (let
                   ((soln
                      (if (eq (first search-name-left) (first var))
                          (find-var-name (cons (first search-name-left) acc)
                                         (rest search-name-left)
                                         (rest var))
                          (find-var-name (cons (first var) acc)
                                         search-name-left
                                         (rest var)))))
                  (if soln (setf soln-acc (append soln-acc soln)))))
              soln-acc)))))

The problem is how to efficiently match a qualified variable name with the
known definitions.

What I am currently doing is using a hash table for the lowest level and
then going up the structure for each match to see if it matches
completely. This is inefficient if there are many lowest level variables
with the same name. 

In real programs, the same names tend to occur many times. The same
structure may be embedded in multiple different other structures. I have
seen programs where the same low level name occurs up to 100 times, so
my algorithm will be slow for such cases.

I could put all sub-paths in the hash table but this would result in a
very large hash table that would take a long time to build. The number of
subsets is 2^(n-1)+1 where n is the length of the path.

I was thinking of using my current algorithm but supplementing it with
a cache of paths that were already used. Assuming the programmer tends
to use the same name paths repeatedly, this might work reasonably well.

Another possible approach would use nested hash tables. I think the size
of the tables would order n^2 of the number of levels (which can be up to
50) and so the build time would be O(n^2).

Any suggestions about a faster algorithm would be appreciated. My ain is
that the algorithm should be O(N) where N is the length of the path
supplied.

Tim Josling

Re: Problem with search Gene
- Re: Problem with search Gene
  - Re: Problem with search tim
- Re: Problem with search tim
Re: Problem with search Scott Burson
- Re: Problem with search Ken Tilton
  - Re: Problem with search tim
    - Re: Problem with search Ken Tilton
      - Re: Problem with search Howard Brazee

From: Gene
Subject: Re: Problem with search
Date: Sat, 08 Mar 2008 03:42:23 +0000
Message-ID: <341681a0-dd2c-45b8-8ed2-1e1122764f16@60g2000hsy.googlegroups.com>

On Mar 7, 7:51 pm, tim <····@internet.com> wrote:
> I am trying to find an efficient search algorithm for this problem. I
> hasten to add this is not a homework assignment.
>
> The problem is variable name references in COBOL. Basically you can define
> nested structures consisting of names (and attributes, which I have
> omitted here). Names need not be unique. However when accessing a variable
> you must specify a unique path, which may be a partial path.
>
> Let's say a path a.b.c.d exists. You could refer to "d" as "a.d" or
> "a.b.c.d" or "d", as long as these paths are unique. If a path "a.x.d" also
> existed, you could not use "d" or "a.d" to access either variable because
> this path is now ambiguous.
>
> Here is some sample data and a simplistic solution.
>
> (defparameter *vars*
>   '((a) ;; a is a single variable
>     (b) ;; b is a single variable
>     (c (d) (e) (f)) ;; c is a variable made up of d, e and f
>     (c (d) (f) (g)) ;; c is a variable made up of d, f and g
>     (c (d) (f) (z)) ;; c is a variable made up of d, f and g
>     (f) ;; f is a single variable
>     (g) ;; g is a single variable
>     (h (i (j (k (l (m) (n))))))) ;; h contains i, which contains j
>                                   ;; which contains k which contains l
>                                   ;; which contains m and n
>   "Define the nested variable structures. At each level you have
>   a list. A list is a variable symbol followed by the symbols
>   for variables within it, if any. Note that multiple variables can have the
>   same name. It is only an error if a reference is ambiguous.")
>
> (defun find-a-var (var-component-list)
>   (let ((sol (find-var-name nil var-component-list *vars*)))
>     (if (> (length sol) 1)
>         (format t "~%Ambiguous")
>         (if (not sol)
>             (format t "~%No solution")))
>     (dolist (el sol)
>       (format t "~%Solution ~S" el))
>     (format t "~%")
>     sol))
>
> (defun find-var-name (acc search-name-left vars)
>   (let ((soln-acc nil))
>     (if (not search-name-left)
>         (let ((soln (list (append (reverse acc) search-name-left vars))))
>           soln)
>         (if (not vars)
>             nil
>             (progn
>               (dolist (var vars)
>                 (let
>                    ((soln
>                       (if (eq (first search-name-left) (first var))
>                           (find-var-name (cons (first search-name-left) acc)
>                                          (rest search-name-left)
>                                          (rest var))
>                           (find-var-name (cons (first var) acc)
>                                          search-name-left
>                                          (rest var)))))
>                   (if soln (setf soln-acc (append soln-acc soln)))))
>               soln-acc)))))
>
> The problem is how to efficiently match a qualified variable name with the
> known definitions.
>
> What I am currently doing is using a hash table for the lowest level and
> then going up the structure for each match to see if it matches
> completely. This is inefficient if there are many lowest level variables
> with the same name.
>
> In real programs, the same names tend to occur many times. The same
> structure may be embedded in multiple different other structures. I have
> seen programs where the same low level name occurs up to 100 times, so
> my algorithm will be slow for such cases.
>
> I could put all sub-paths in the hash table but this would result in a
> very large hash table that would take a long time to build. The number of
> subsets is 2^(n-1)+1 where n is the length of the path.
>
> I was thinking of using my current algorithm but supplementing it with
> a cache of paths that were already used. Assuming the programmer tends
> to use the same name paths repeatedly, this might work reasonably well.
>
> Another possible approach would use nested hash tables. I think the size
> of the tables would order n^2 of the number of levels (which can be up to
> 50) and so the build time would be O(n^2).
>
> Any suggestions about a faster algorithm would be appreciated. My ain is
> that the algorithm should be O(N) where N is the length of the path
> supplied.

First thanks for the exceptionally clear problem description.  Hope I
can be equally clear.

Processing in reverse as you suggest seems like the right approach.
After that, you are essentially building a finite state machine that
processes the dotted path in reverse order and accepts appropriate
sequences of ids.  For a single path a.b.c.d, the FSM has 5 states.

If I am reading your code correctly, the transition table looks like
this:

   a b c d
1        2
2* 5 4 3
3* 5 4
4* 5
5*

State 1 is the start state.  The asterisks mean that the machine is
accepting.  If you end up at an accepting state when the whole reverse
path is consumed, it's a valid variable.

As you continue to build the FSM by adding paths, states may change
from accepting to not when ambiguities are introduced.  If you add b.d
above, all stays the same except state 4 becomes non-accepting

You're right that there are 1 + N(N-1)/2 transitions, which is
O(N^2).  But the squared term is in length of the path, not the number
of paths that you have to incorporate.  Since defs will outnumber uses
and very long paths don't seem likely, this ought to be okay.

Now there are lots of ways to implement a FSM. One as you suggested is
to implement each state as a hash table for transitions and an
indicator of the state's accepting/non-accepting character.  You could
label accepting states with that path(s) that get(s) you there, for
example.

Or you could preprocess all the symbols to map each to a small
integer.  Then you can use a 2d array for the FSM as I did above.  The
array will be sparse, so you'd probably implement it with some kind of
encoded representation, which more or less takes you back to hash
tables.

From: Gene
Subject: Re: Problem with search
Date: Sat, 08 Mar 2008 06:15:09 +0000
Message-ID: <ebe6703d-bf88-4282-b2e4-aa931e16414a@h25g2000hsf.googlegroups.com>

On Mar 7, 10:42 pm, Gene <············@gmail.com> wrote:
> On Mar 7, 7:51 pm, tim <····@internet.com> wrote:
>
>
>
>
>
> > I am trying to find an efficient search algorithm for this problem. I
> > hasten to add this is not a homework assignment.
>
> > The problem is variable name references in COBOL. Basically you can define
> > nested structures consisting of names (and attributes, which I have
> > omitted here). Names need not be unique. However when accessing a variable
> > you must specify a unique path, which may be a partial path.
>
> > Let's say a path a.b.c.d exists. You could refer to "d" as "a.d" or
> > "a.b.c.d" or "d", as long as these paths are unique. If a path "a.x.d" also
> > existed, you could not use "d" or "a.d" to access either variable because
> > this path is now ambiguous.
>
> > Here is some sample data and a simplistic solution.
>
> > (defparameter *vars*
> >   '((a) ;; a is a single variable
> >     (b) ;; b is a single variable
> >     (c (d) (e) (f)) ;; c is a variable made up of d, e and f
> >     (c (d) (f) (g)) ;; c is a variable made up of d, f and g
> >     (c (d) (f) (z)) ;; c is a variable made up of d, f and g
> >     (f) ;; f is a single variable
> >     (g) ;; g is a single variable
> >     (h (i (j (k (l (m) (n))))))) ;; h contains i, which contains j
> >                                   ;; which contains k which contains l
> >                                   ;; which contains m and n
> >   "Define the nested variable structures. At each level you have
> >   a list. A list is a variable symbol followed by the symbols
> >   for variables within it, if any. Note that multiple variables can have the
> >   same name. It is only an error if a reference is ambiguous.")
>
> > (defun find-a-var (var-component-list)
> >   (let ((sol (find-var-name nil var-component-list *vars*)))
> >     (if (> (length sol) 1)
> >         (format t "~%Ambiguous")
> >         (if (not sol)
> >             (format t "~%No solution")))
> >     (dolist (el sol)
> >       (format t "~%Solution ~S" el))
> >     (format t "~%")
> >     sol))
>
> > (defun find-var-name (acc search-name-left vars)
> >   (let ((soln-acc nil))
> >     (if (not search-name-left)
> >         (let ((soln (list (append (reverse acc) search-name-left vars))))
> >           soln)
> >         (if (not vars)
> >             nil
> >             (progn
> >               (dolist (var vars)
> >                 (let
> >                    ((soln
> >                       (if (eq (first search-name-left) (first var))
> >                           (find-var-name (cons (first search-name-left) acc)
> >                                          (rest search-name-left)
> >                                          (rest var))
> >                           (find-var-name (cons (first var) acc)
> >                                          search-name-left
> >                                          (rest var)))))
> >                   (if soln (setf soln-acc (append soln-acc soln)))))
> >               soln-acc)))))
>
> > The problem is how to efficiently match a qualified variable name with the
> > known definitions.
>
> > What I am currently doing is using a hash table for the lowest level and
> > then going up the structure for each match to see if it matches
> > completely. This is inefficient if there are many lowest level variables
> > with the same name.
>
> > In real programs, the same names tend to occur many times. The same
> > structure may be embedded in multiple different other structures. I have
> > seen programs where the same low level name occurs up to 100 times, so
> > my algorithm will be slow for such cases.
>
> > I could put all sub-paths in the hash table but this would result in a
> > very large hash table that would take a long time to build. The number of
> > subsets is 2^(n-1)+1 where n is the length of the path.
>
> > I was thinking of using my current algorithm but supplementing it with
> > a cache of paths that were already used. Assuming the programmer tends
> > to use the same name paths repeatedly, this might work reasonably well.
>
> > Another possible approach would use nested hash tables. I think the size
> > of the tables would order n^2 of the number of levels (which can be up to
> > 50) and so the build time would be O(n^2).
>
> > Any suggestions about a faster algorithm would be appreciated. My ain is
> > that the algorithm should be O(N) where N is the length of the path
> > supplied.
>
> First thanks for the exceptionally clear problem description.  Hope I
> can be equally clear.
>
> Processing in reverse as you suggest seems like the right approach.
> After that, you are essentially building a finite state machine that
> processes the dotted path in reverse order and accepts appropriate
> sequences of ids.  For a single path a.b.c.d, the FSM has 5 states.
>
> If I am reading your code correctly, the transition table looks like
> this:
>
>    a b c d
> 1        2
> 2* 5 4 3
> 3* 5 4
> 4* 5
> 5*
>
> State 1 is the start state.  The asterisks mean that the machine is
> accepting.  If you end up at an accepting state when the whole reverse
> path is consumed, it's a valid variable.
>
> As you continue to build the FSM by adding paths, states may change
> from accepting to not when ambiguities are introduced.  If you add b.d
> above, all stays the same except state 4 becomes non-accepting
>
> You're right that there are 1 + N(N-1)/2 transitions, which is
> O(N^2).  But the squared term is in length of the path, not the number
> of paths that you have to incorporate.  Since defs will outnumber uses
> and very long paths don't seem likely, this ought to be okay.
>
> Now there are lots of ways to implement a FSM. One as you suggested is
> to implement each state as a hash table for transitions and an
> indicator of the state's accepting/non-accepting character.  You could
> label accepting states with that path(s) that get(s) you there, for
> example.
>
> Or you could preprocess all the symbols to map each to a small
> integer.  Then you can use a 2d array for the FSM as I did above.  The
> array will be sparse, so you'd probably implement it with some kind of
> encoded representation, which more or less takes you back to hash
> tables.- Hide quoted text -
>
> - Show quoted text -

Here.  I think this works.  Not well tested.

;;; A state is a list of two elements:
;;; 1. List of complete definitions that get you to the state.
;;; 2. A hash mapping symbols to next states.
;;; An FSM is its start state.
(defun build-fsm (scopes)
  (let ((fsm (list nil (make-hash-table))))
    (labels ((insert-paths (scopes rev-path)
	       (if scopes
		   (dolist (scope scopes)
		     (insert-paths (cdr scope)
				   (cons (car scope) rev-path)))
		   (insert-in-fsm rev-path fsm))))
      (insert-paths scopes nil))
    fsm))

(defun insert-in-fsm (rev-path fsm)
  (labels ((insert (rest-rev-path states)
	     ;; Show we can get to this state using path.
	     (push (reverse rev-path) (first (first states)))
	     (when rest-rev-path
	       (let ((next-state (gethash (car rest-rev-path)
					  (second (first states)))))
		 ;; If there is no next state, build one.
		 (when (null next-state)
		   (setq next-state (list nil (make-hash-table))))
		 ;; Connect the next state to the trailing ones
		 ;; we've already traversed.
		 (dolist (state states)
		   (setf (gethash (car rest-rev-path)
				  (second state))
			 next-state))
		 ;; Continue with rest of path.
		 (insert (rest rest-rev-path)
			 (cons next-state states))))))
    ;; First transition is unique, so build here.
    (let ((state-2 (gethash (car rev-path) (second fsm))))
      (when (null state-2)
	(setq state-2 (list nil (make-hash-table)))
	(setf (gethash (car rev-path) (second fsm)) state-2))
      ;; Build the rest.
      (insert (cdr rev-path) (list state-2)))))

(defparameter *fsm* (build-fsm *vars*))

(defun check-path (path &optional (fsm *fsm*))
  (labels ((check (rev-path fsm)
	     (cond ((null rev-path) (first fsm))
		   (t (let ((next-state (gethash (first rev-path)
						 (second fsm))))
			(if next-state
			    (check (rest rev-path) next-state)
			    :fail))))))
    (check (reverse path) fsm)))

CL-USER> (check-path '(c d))
((C D) (C D) (C D))
CL-USER> (check-path '(a))
((A))
CL-USER> (check-path '(h n))
((H I J K L N))
CL-USER> (check-path '(h j n))
((H I J K L N))
CL-USER> (check-path '(n))
((H I J K L N))
CL-USER> (check-path '(z))
((C Z))
CL-USER> (check-path '(y))
:FAIL

From: tim
Subject: Re: Problem with search
Date: Sat, 08 Mar 2008 09:53:44 +0000
Message-ID: <13t4oh8p97bnef8@corp.supernews.com>

On Fri, 07 Mar 2008 22:15:09 -0800, Gene wrote:

> Here.  I think this works.  Not well tested.
> 

Thanks Gene,
Tim Josling

From: tim
Subject: Re: Problem with search
Date: Sat, 08 Mar 2008 09:13:26 +0000
Message-ID: <13t4m5m691fuf1f@corp.supernews.com>

On Fri, 07 Mar 2008 19:42:23 -0800, Gene wrote:

>
> Processing in reverse as you suggest seems like the right approach.
> After that, you are essentially building a finite state machine that
> processes the dotted path in reverse order and accepts appropriate
> sequences of ids.  For a single path a.b.c.d, the FSM has 5 states.
>

Thanks - very insightful. this gives me some good ideas to work with. 

Tim Josling

From: Scott Burson
Subject: Re: Problem with search
Date: Sat, 08 Mar 2008 06:39:49 +0000
Message-ID: <5e9165ce-51e0-406f-8719-da5f0d25b02c@s12g2000prg.googlegroups.com>

On Mar 7, 4:51 pm, tim <····@internet.com> wrote:
> The problem is variable name references in COBOL.

Hello Tim,

I am looking for developers with both Cobol and Lisp experience.
There may be three in the world :)  Please contact me by email.

-- Scott

From: Ken Tilton
Subject: Re: Problem with search
Date: Mon, 10 Mar 2008 12:55:45 +0000
Message-ID: <47d52fd2$0$5610$607ed4bc@cv.net>

Scott Burson wrote:
> On Mar 7, 4:51 pm, tim <····@internet.com> wrote:
> 
>>The problem is variable name references in COBOL.
> 
> 
> Hello Tim,
> 
> I am looking for developers with both Cobol and Lisp experience.
> There may be three in the world :)  

Four!

http://smuglispweeny.blogspot.com/2008/03/kenny-and-firing-squad-episode-ii-cobol.html

kenny

-- 
http://smuglispweeny.blogspot.com/
http://www.theoryyalgebra.com/

"In the morning, hear the Way;
  in the evening, die content!"
                     -- Confucius

From: tim
Subject: Re: Problem with search
Date: Tue, 11 Mar 2008 01:52:35 +0000
Message-ID: <13tbpf39lq7u283@corp.supernews.com>

On Mon, 10 Mar 2008 08:55:45 -0400, Ken Tilton wrote:

> Scott Burson wrote:
>> On Mar 7, 4:51 pm, tim <····@internet.com> wrote:
>> 
>>>The problem is variable name references in COBOL.
>> 
>> 
>> Hello Tim,
>> 
>> I am looking for developers with both Cobol and Lisp experience.
>> There may be three in the world :)  
> 
> Four!
> 
> http://smuglispweeny.blogspot.com/2008/03/kenny-and-firing-squad-episode-ii-cobol.html
> 
> kenny
>

Yes, COBOL has macros (kind of).

Tim

From: Ken Tilton
Subject: Re: Problem with search
Date: Tue, 11 Mar 2008 02:23:27 +0000
Message-ID: <47d5ed2e$0$15178$607ed4bc@cv.net>

tim wrote:
> On Mon, 10 Mar 2008 08:55:45 -0400, Ken Tilton wrote:
> 
> 
>>Scott Burson wrote:
>>
>>>On Mar 7, 4:51 pm, tim <····@internet.com> wrote:
>>>
>>>
>>>>The problem is variable name references in COBOL.
>>>
>>>
>>>Hello Tim,
>>>
>>>I am looking for developers with both Cobol and Lisp experience.
>>>There may be three in the world :)  
>>
>>Four!
>>
>>http://smuglispweeny.blogspot.com/2008/03/kenny-and-firing-squad-episode-ii-cobol.html
>>
>>kenny
>>
> 
> 
> Yes, COBOL has macros (kind of).

<g> Well, I think C calls them macros, so I thought I could get away 
with it (kind of).

The whacky thing is that when I explored a port of my C version of the 
Algebra software to Lisp I hid from Antlr all my macros that were 
dereferencing pointers and handles (bless the Mac) so Antler thought 
they were function calls and ... can you say "accessors"? Had I decided 
to continue down that path I would have been able to do so as if 
pointers and handles never existed.

Meanwhile GvR says: No macros for you!

kenny

-- 
http://smuglispweeny.blogspot.com/
http://www.theoryyalgebra.com/

"In the morning, hear the Way;
  in the evening, die content!"
                     -- Confucius

From: Howard Brazee
Subject: Re: Problem with search
Date: Tue, 11 Mar 2008 16:58:01 +0000
Message-ID: <6gedt3hljbl3a6q0b6us0i51hq6gu6jhcf@4ax.com>

On Mon, 10 Mar 2008 22:23:27 -0400, Ken Tilton
<···········@optonline.net> wrote:

><g> Well, I think C calls them macros, so I thought I could get away 
>with it (kind of).

EasyTrieve macros are basically copy members.