Re: common utilities in CL

From: Gareth McCaughan
Subject: Re: common utilities in CL
Date: Tue, 22 Dec 1998 00:00:00 +0000
Message-ID: <8667b4qgdt.fsf@g.pet.cam.ac.uk>

Sam Steingold wrote:

> Someone mentioned that we need to re-implement everything in CL.
> Among the necessary utilities `patch' and `diff' were mentioned.
> The following toy code works fine with lisp code, i.e.,
>         (equalp x (patch y (diff y x))) ==> T
...
> comments welcome.

Hmm. This isn't, of course, much like "diff". I've hacked together
something that behaves much more like "diff" (it finds a maximal
common subtree and records, in effect, the differences between each
argument tree and this subtree); the downside is that, unlike Sam's
code, it's terribly inefficient (it uses, in most cases, both space
and time on the order of MN where M,N are the sizes of the trees
being compared). I can't see a simple way to apply the same technique
used in e.g. GNU diff (probably every diff on the planet, but GNU is
the only one I've read the code of) to reduce the space costs much.

So, now you have a choice between simple fast code that solves the
wrong problem (Sam's), or more complicated terribly inefficient code
that solves the right one (mine). Does anyone fancy trying to produce
the best of both worlds?

Here's the code. It's mostly untested, but the ideas are simple
enough that any bugs are probably shallow. (It's just doing
dynamic programming.) It's also entirely devoid of declarations
and suchlike.

---------------------------- code begins ----------------------------
;;;; Tree-based "diff"-like stuff.
;;; This is elegant but inefficient, especially on memory. To
;;; make it useful we need some heuristic for producing decent
;;; performance even on large-ish trees.
;;; Incidentally, the ideas in here would all work perfectly well
;;; for other languages, working with abstract syntax trees. Instead
;;; of just CAR and CDR we'd have N accessors, where N depends on
;;; what sort of node we're looking at.
;;; To use this really effectively for source code control we'd
;;; need some sensible way of representing comments.

;;; A 2d-hashtable hashes on two values. We implement it with nested
;;; hashtables. We don't distinguish between NIL values and non-values.

(defun make-2d-hashtable () (make-hash-table :test #'eql))

(defun gethash-2d (a b hash)
  (let ((ha (gethash a hash)))
    (if ha (gethash b ha) nil)))

(defun gethash-2d-setter (a b hash value)
  (let ((ha (gethash a hash)))
    (setf (gethash b (or ha (setf (gethash a hash)
                                  (make-hash-table :test #'eql))))
          value) ))
(defsetf gethash-2d gethash-2d-setter)

(defstruct patch
  type
  score
  other
  child)

;;; Find a maximal common subtree of A and B, and return something
;;; that describes the relationship between A and B in terms of it.
(defun treediff (a b)
  (let ((hash (make-2d-hashtable)))
    (labels
      ((mcs (a b)
         (let ((probe (gethash-2d a b hash)))
           (when probe (return-from mcs probe)))
         (setf (gethash-2d a b hash)
           (if (and (atom a) (atom b))
             ;; Both are atoms. This is therefore trivial.
             ;; (Except that EQUAL isn't quite right. It's
             ;; close enough.)
             (if (equal a b)
               (make-patch :type  :equal
                           :score 1
                           :other nil
                           :child nil)
               (make-patch :type  :change
                           :score 0
                           :other (cons a b)
                           :child nil))
             ;; At least one is not an atom. Try descending moves.
             (let ((best nil)
                   (best-score 0)
                   (best-type nil)
                   (best-other nil))
               ;; Because quite often our args are EQUAL, it's
               ;; best to try descending both first when it's possible:
               ;; it might lead to early termination.
               (when (and (consp a) (consp b))
                 (let ((x (mcs (car a) (car b)))
                       (y (mcs (cdr a) (cdr b))))
                   (when (>= (+ (patch-score x) (patch-score y))
                             best-score)
                     (setq best (cons x y)
                           best-score (+ (patch-score x)
                                         (patch-score y)
                                         1)
                           best-other nil
                           best-type
                             (if (and (eq (patch-type x) :equal)
                                      (eq (patch-type y) :equal))
                               :equal
                               :descend)))))
               ;; Otherwise, try just one.
               (unless (eq best-type :equal)
                 (when (consp a)
                   (let ((x (mcs (car a) b)))
                     (when (> (patch-score x) best-score)
                       (setq best x
                             best-type :car-a
                             best-other (cdr a)
                             best-score (patch-score x))))
                   (let ((x (mcs (cdr a) b)))
                     (when (> (patch-score x) best-score)
                       (setq best x
                             best-type :cdr-a
                             best-other (car a)
                             best-score (patch-score x)))) )
                 (when (consp b)
                   (let ((x (mcs a (car b))))
                     (when (> (patch-score x) best-score)
                       (setq best x
                             best-type :car-b
                             best-other (cdr b)
                             best-score (patch-score x))))
                   (let ((x (mcs a (cdr b))))
                     (when (> (patch-score x) best-score)
                       (setq best x
                             best-type :cdr-b
                             best-other (car b)
                             best-score (patch-score x)))) ) )
               ;; We now know which is our best option. Do the right thing.
               (make-patch :type  best-type
                           :score best-score
                           :other best-other
                           :child
                             (if (eq best-type :equal) nil best)) )) )))
      (mcs a b) )))

;;; We could avoid some duplication of code in what follows by
;;; having a function to reverse a patch. I think it's easier
;;; to duplicate code, and it saves consing too.

;;; Turn one side of a diff into the other.

(defun treepatch (orig patch)
  (ecase (patch-type patch)
    ((:equal)   orig)
    ((:change)  (cdr (patch-other patch)))
    ((:car-a)   (treepatch (car orig) (patch-child patch)))
    ((:cdr-a)   (treepatch (cdr orig) (patch-child patch)))
    ((:car-b)   (cons (treepatch orig (patch-child patch))
                      (patch-other patch)))
    ((:cdr-b)   (cons (patch-other patch)
                      (treepatch orig (patch-child patch))))
    ((:descend) (cons (treepatch (car orig) (car (patch-child patch)))
                      (treepatch (cdr orig) (cdr (patch-child patch)))))
))

(defun trunpatch (orig patch)
  (ecase (patch-type patch)
    ((:equal)   orig)
    ((:change)  (car (patch-other patch)))
    ((:car-b)   (trunpatch (car orig) (patch-child patch)))
    ((:cdr-b)   (trunpatch (cdr orig) (patch-child patch)))
    ((:car-a)   (cons (trunpatch orig (patch-child patch))
                      (patch-other patch)))
    ((:cdr-a)   (cons (patch-other patch)
                      (trunpatch orig (patch-child patch))))
    ((:descend) (cons (trunpatch (car orig) (car (patch-child patch)))
                      (trunpatch (cdr orig) (cdr (patch-child patch)))))
))

;;; A more compact and readable description of a patch. Throws away
;;; information: this sort of patch is no longer reversible.
;;; TODO: add some "optimised" forms: for instance, a chain of
;;; CDRs would become (:DELETE . n), and :cons nil (:change . x)
;;; would become :set-cdr x, and :cons (change . x) . nil would
;;; become :set-car x, and so on. In typical cases this could
;;; save quite a bit of space.

(defun to-list (patch)
  (ecase (patch-type patch)
    ((:equal)   nil)
    ((:change)  (cons :change (cdr (patch-other patch))))
    ((:car-a)   (cons :car    (to-list (patch-child patch))))
    ((:cdr-a)   (cons :cdr    (to-list (patch-child patch))))
    ((:car-b)   (cons (cons :add-cdr (patch-other patch))
                      (to-list (patch-child patch))))
    ((:cdr-b)   (cons (cons :add-car (patch-other patch))
                      (to-list (patch-child patch))))
    ((:descend) (list* :cons (to-list (car (patch-child patch)))
                             (to-list (cdr (patch-child patch))))) ))

;;; Ditto, but for the reversed patch.

(defun to-rlist (patch)
  (ecase (patch-type patch)
    ((:equal)   nil)
    ((:change)  (cons :change (car (patch-other patch))))
    ((:car-b)   (cons :car    (to-rlist (patch-child patch))))
    ((:cdr-b)   (cons :cdr    (to-rlist (patch-child patch))))
    ((:car-a)   (cons (cons :add-cdr (patch-other patch))
                      (to-rlist (patch-child patch))))
    ((:cdr-a)   (cons (cons :add-car (patch-other patch))
                      (to-rlist (patch-child patch))))
    ((:descend) (list* :cons (to-rlist (car (patch-child patch)))
                             (to-rlist (cdr (patch-child patch))))) ))

;;; Apply a patch in list form.

(defun ltreepatch (orig lpatch)
  (if lpatch
    (let ((type (car lpatch))
          (how  (cdr lpatch)))
      (if (atom type)
        (ecase type
          ((:change) how)
          ((:car)    (ltreepatch (car orig) how))
          ((:cdr)    (ltreepatch (cdr orig) how))
          ((:cons)   (cons (ltreepatch (car orig) (car how))
                           (ltreepatch (cdr orig) (cdr how)))))
        (let ((type   (car type))
              (addend (cdr type)))
          (ecase type
            ((:add-car) (cons addend (ltreepatch orig how)))
            ((:add-cdr) (cons (ltreepatch orig how) addend)))) ))
    orig))

;;; Given two 1-d arrays of forms, try to establish some sort of
;;; correspondence between the two. Return an alist of pairs (m . n)
;;; where each m,n is either an array index or NIL, and where
;;; every index in the first array appears as just one m, and
;;; every index in the second array appears as just one n. The
;;; idea is that form m in the first array should be a bit like
;;; form n in the second. This is only heuristic, and if we get
;;; it wrong the result is inefficiency, not incorrectness.
;;; At present it's really slow. We should make more use of
;;; hash tables.

(defun match-forms (a b)
  (let ((remaining-a
          (loop for x across a for i upfrom 0 collecting (cons i x)))
        (remaining-b
          (loop for x across b for i upfrom 0 collecting (cons i x)))
        (result nil))
    ;; We do a number of tests in succession. Firstly, we look
    ;; for things that are actually equal; then for things with
    ;; the same CAR and CADR (e.g. DEFUN FOO); then for things
    ;; with just the same CADR (DEFVAR FOO and DEFPARAMETER FOO);
    ;; then for things with just the same CAR (DEFUN FOO and
    ;; DEFUN BAR). Then we pair things up in order of appearance.
    ;; If anything's left we pair it with NIL.
    ;; Note that when I refer to CAR etc above, I mean CAR etc
    ;; of the form. That's CADR etc of our records.
    (flet ((car-and-cadr (x) (cons (cadr x) (caddr x)))
           (constant (x) (declare (ignore x)) nil))
      (dolist (key (list #'cdr #'car-and-cadr #'caddr #'cadr))
        (let ((new-ra nil))
          (dolist (x remaining-a)
            (let ((y (find (funcall key x) remaining-b
                           :test #'equal :key key)))
              (if y
                (progn
                  (push (cons (car x) (car y)) result)
                  (setq remaining-b (delete (car y) remaining-b :key #'car)))
                (push x new-ra))))
          (setq remaining-a new-ra)) ))
    ;; We've paired off everything plausible. Just add the things
    ;; with NIL.
    (dolist (x remaining-a)
      (push (cons (car x) nil) result))
    (dolist (x remaining-b)
      (push (cons nil (car x)) result))
    ;; Put these into a sensible order.
    (sort result #'< :key #'car)
))

;;; An array of all the forms in FILE.

(defun file-forms (file)                     
  (with-open-file (stream file :direction :input)
    (let ((v (make-array '(10) :fill-pointer 0 :adjustable t))
          (eof-object (cons nil nil)))       
      (loop                                  
        (let ((form (read stream nil eof-object)))
          (when (eq form eof-object) (return v))
          (vector-push-extend form v))))))   

;;; The differences between the forms in two files. Given as a list
;;; of items (m n lpatch) where m,n are numbers or NIL.

(defun file-diff (name1 name2 &key (verbose nil))
  (let ((forms1 (file-forms name1))
        (forms2 (file-forms name2))
        (result nil))
    (when verbose
      (format *error-output* "~&Forms read. Computing matchings.")
      (finish-output *error-output*))
    (let ((matchings (match-forms forms1 forms2)))
      (dolist (matching matchings)
        (let ((m (car matching))
              (n (cdr matching)))
          (when verbose
            (format *error-output* "~&  Diffing ~A,~A." m n)
            (finish-output *error-output*))
          (push (list m n (to-list (treediff
                                     (if m (aref forms1 m) nil)
                                     (if n (aref forms2 m) nil))))
                result))))
    (nreverse result)))

;;; Apply a file patch to a file.

(defun file-patch (in-name fpatch out-name &key (verbose nil))
  (let ((in-forms (file-forms in-name))
        (n2 (reduce #'max fpatch :key #'second)))
    (let ((out-forms (make-array (list n2) :initial-element nil)))
      (dolist (item fpatch)
        (let ((m (first item))
              (n (second item))
              (lpatch (third item)))
          (when verbose
            (if n
              (format *error-output* "~&Patching ~A -> ~A." m n)
              (format *error-output* "~&Ignoring ~A." m))
            (finish-output *error-output*))
          (when n
            (setf (aref out-forms n)
                (ltreepatch (if m (aref in-forms m) nil)
                            lpatch)))))
      (when verbose
        (format *error-output* "~&Writing output file."))
      (with-open-file (out out-name :direction :output)
        (loop for form across out-forms do
          (write form :stream out :readably t :circle t))) )))
----------------------------- code ends -----------------------------

-- 
Gareth McCaughan       Dept. of Pure Mathematics & Mathematical Statistics,
·····@dpmms.cam.ac.uk  Cambridge University, England.

Re: common utilities in CL Raymond Toy
- Re: common utilities in CL Gareth McCaughan
Re: common utilities in CL Kelly Murray
Re: common utilities in CL ············@mediaone.net
Re: common utilities in CL Gareth McCaughan

From: Raymond Toy
Subject: Re: common utilities in CL
Date: Tue, 22 Dec 1998 00:00:00 +0000
Message-ID: <4n90g0qbwj.fsf@rtp.ericsson.se>

>>>>> "Gareth" == Gareth McCaughan <·····@dpmms.cam.ac.uk> writes:

    Gareth> Sam Steingold wrote:
    >> Someone mentioned that we need to re-implement everything in CL.
    >> Among the necessary utilities `patch' and `diff' were mentioned.
    >> The following toy code works fine with lisp code, i.e.,
    >> (equalp x (patch y (diff y x))) ==> T
    Gareth> ...
    >> comments welcome.

    Gareth> Hmm. This isn't, of course, much like "diff". I've hacked together
    Gareth> something that behaves much more like "diff" (it finds a maximal
    Gareth> common subtree and records, in effect, the differences between each
    Gareth> argument tree and this subtree); the downside is that, unlike Sam's
    Gareth> code, it's terribly inefficient (it uses, in most cases, both space
    Gareth> and time on the order of MN where M,N are the sizes of the trees
    Gareth> being compared). I can't see a simple way to apply the same technique

In the CMU Lisp archives, you can find three "diff" programs:

lang/lisp/code/tools/src_cmp:

ilisp, mkant, and watt

I have not tried any of these.

Ray

From: Gareth McCaughan
Subject: Re: common utilities in CL
Date: Tue, 22 Dec 1998 00:00:00 +0000
Message-ID: <86g1a7eshz.fsf@g.pet.cam.ac.uk>

Raymond Toy wrote:

> In the CMU Lisp archives, you can find three "diff" programs:
> 
> lang/lisp/code/tools/src_cmp:
> 
> ilisp, mkant, and watt
> 
> I have not tried any of these.

"ilisp" is in Interlisp, and according to the docs it is "very
much dependent on Interlisp", and "in its current form, it
really isn't usable". I've only looked at it very briefly,
but it doesn't look to me as if it's doing quite what we're
discussing here.

"mkant" is line-by-line; it's basically a "diff" program written
in Lisp.

"watt" only looks at the top level of each s-expression. From
the comments at the start: "If what you *really* want is to
compare inside s-expressions, then you might need to call the
thing recursively in the test function. An ultimately general
minimal nested differential comparator is really hard to write:
I tried, and gave up after spending a week on it without managing
to decide what it should do, let alone how it should do it".

                               *

In view of that last comment, perhaps I should explain more
precisely what my code does; I thought it was sort of obvious,
but perhaps it's only obvious to people whose brains are identical
to mine.

What the original "diff" does is to identify a common subsequence
of two sequences of lines; it can then describe how to get from
one sequence to the other by saying, roughly: "delete all the stuff
in sequence 1 but not in the common subsequence, then add in all
the stuff in sequence 2 but not in the common subsequence". Ideally
you want the longest possible common subsequence, since that minimises
the amount you need to add and delete.

What my code does is to identify a common *subtree* of two trees,
and to proceed from there. I'd better say what I mean by "subtree".
I'll describe it in a more general setting than appears from the
code.

Consider a tree whose internal nodes are of sundry different kinds.
Two internal nodes of the same kind have the same number of children.
(The special case considered by the code I posted has only one kind
of internal node: the cons cell, with two children called "car" and
"cdr". If you wanted to use this for diffing, say, C programs then
you would want unary operators, binary operators, function applications,
|while| loops, etc all to be different kinds of nodes.)

Say that A is a subtree of B if one of the following holds. I'll
identify trees with their roots.

1. A is identical to B.
2. A is a subtree of one of B's children.
3. A and B have roots of the same kind, and each child of A is
   a subtree of the corresponding child of B.

Note that if you delete option 3 you get another notion that
has just as much right to be given the name "subtree", but isn't
any use for "diff"-like purposes.

(If there's only one kind of internal node, and it only ever has
two children, and the left child is always a leaf, then what you
have is basically a sequence; and the notions of "subtree" and
"subsequence" roughly coincide.)

One naive way to find a maximal (i.e., greatest possible number
of nodes) subtree of two trees is by dynamic programming: working
upwards from the leaves, you work out the biggest possible common
subtree of subtrees (in that *other* sense!) of A,B. This is just
a simple recursive calculation, plus memoising to make it take
quadratic time instead of exponential.

The trouble with this is that it uses a huge amount of memory and
a lot of time. You can avoid this with "diff" because there are
ways to arrange only to need to keep a small amount of the information
around at any one time; in this more general setting, I don't yet
see how to do the same. I bet it can be done, though.

-- 
Gareth McCaughan       Dept. of Pure Mathematics & Mathematical Statistics,
·····@dpmms.cam.ac.uk  Cambridge University, England.

From: Kelly Murray
Subject: Re: common utilities in CL
Date: Wed, 23 Dec 1998 00:00:00 +0000
Message-ID: <368155A4.51E41FC0@IntelliMarket.Com>

Hmm, I'd think just knowing a top-level form changed would be
sufficient, but perhaps not if you want to merge two changes?
The answer is clearly that one needs to actually use some
kind of lisp-based "cvs" in practice to see what is 
useful and needed.

-kelly murray

From: ············@mediaone.net
Subject: Re: common utilities in CL
Date: Mon, 28 Dec 1998 00:00:00 +0000
Message-ID: <36880a07.100234880@news.ne.mediaone.net>

Beware, the mkant sc.lsp module has numerous bugs and yields the wrong answers
on occasion.  Use in any production capacity is not recommended unless you
endeavor to test/debug sc.lsp.  (But it's a great learning tool, with nice
parameterizable algorithms for sync/match parts of the diff).

My attempt to send bugfixes to mkant didn't take, I post this merely because
I've been bitten by the bugs (and have since written a production diff from
scratch for my employer which doesn't have the bugs).  Still, it's a good
starting point, and the only "traditional" text file diff utility which seems to
be available in the (mostly) public domain.

··@gauss.muc.de (Matthias H�lzl) wrote:

>Gareth McCaughan <·····@dpmms.cam.ac.uk> writes:
>
>> Hmm. This isn't, of course, much like "diff". I've hacked together
>> something that behaves much more like "diff" (it finds a maximal
>> common subtree and records, in effect, the differences between each
>> argument tree and this subtree); the downside is that, unlike Sam's
>> code, it's terribly inefficient (it uses, in most cases, both space
>> and time on the order of MN where M,N are the sizes of the trees
>> being compared). I can't see a simple way to apply the same technique
>> used in e.g. GNU diff (probably every diff on the planet, but GNU is
>> the only one I've read the code of) to reduce the space costs much.
>> 
>> So, now you have a choice between simple fast code that solves the
>> wrong problem (Sam's), or more complicated terribly inefficient code
>> that solves the right one (mine). Does anyone fancy trying to produce
>> the best of both worlds?
>
>Isn't this what you are looking for?
>
>;;; Tue Dec 25 19:59:50 1990 by Mark Kantrowitz <·····@GLINDA.OZ.CS.CMU.EDU>
>;;; source-compare.lisp
>
>;;; ****************************************************************
>;;; Source Compare: A 'diff' Program for Lisp **********************
>;;; ****************************************************************
>;;; 
>;;; Source Compare is a common-lisp portable tool for comparing 
>;;; lisp source files, similar to the unix program 'diff'. Like diff
>;;; it can ignore case, whitespace, and blank lines. In addition,
>;;; it can also ignore certain classes of lisp comments. It runs in
>;;; average-case O(m+n) time.
>;;;
>;;; Written by Mark Kantrowitz, December 1990.
>;;; Address:   School of Computer Science
>;;;            Carnegie Mellon University
>;;;            Pittsburgh, PA 15213
>;;;
>;;; Copyright (c) 1990. All rights reserved.
>;;;
>;;; See general license below.
>
>Available from the CMU AI Repository:
>
>http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/lang/lisp/code/tools/src_cmp/mkant/0.html
>
>You have to contact Mark Kantrowitz about different licensing terms if
>you want to distribute changed versions, but it seems all right for
>personal use.
>
>Regards,
>
>  Matthias

D. Tenny
············@mediaone.net - no spam please

From: Gareth McCaughan
Subject: Re: common utilities in CL
Date: Thu, 31 Dec 1998 00:00:00 +0000
Message-ID: <86iuesihw4.fsf@g.pet.cam.ac.uk>

Matthias Ho"lzl wrote:

> Isn't this what you are looking for?
> 
> ;;; Tue Dec 25 19:59:50 1990 by Mark Kantrowitz <·····@GLINDA.OZ.CS.CMU.EDU>
> ;;; source-compare.lisp

No. It is a "diff" clone, roughly: it operates line by line.
We are discussing (I thought!) code for doing a diff-like
operation on s-expressions; or, more generally, on tree-like
structures. That is not what Mark Kantrowitz's code does.

-- 
Gareth McCaughan       Dept. of Pure Mathematics & Mathematical Statistics,
·····@dpmms.cam.ac.uk  Cambridge University, England.