From: Marcus Pearce
Subject: Floating point precision
Date: 
Message-ID: <97a94f1c.0304270755.6ea43b60@posting.google.com>
Hello all, 

A program that I'm working on accepts a set of sequences composed from
some alphabet and returns an equivalent list in which each element in
the sequence is replaced by a list containing that element and an
estimated probability.

For example, 

(setq *result-set* 
'(((58 0.03624038973463649d0) (58 0.4499580346809469d0)
   (60 0.11067220469636375d0) (61 0.007680705855160729d0)
   (63 0.011736749138387884d0) (60 0.11686834229685357d0)
   (58 0.13067316812927776d0) (60 0.2560477834797001d0)
   (60 0.1763231031485672d0) (58 0.11668750294580099d0)
   (63 0.030886080334030237d0) (63 0.208020441648773d0)
   (68 0.02154128520073633d0) (67 0.09767791956836545d0)
   (63 0.11147200340850706d0) (65 0.29152028333475843d0)
   (63 0.18786180424672286d0) (63 0.1992314266844911d0)
   (68 0.26022711999370074d0) (68 0.06448325082603756d0)
   (65 0.046097426398065265d0) (67 0.07043969643909134d0)
   (68 0.045928879399430106d0) (63 0.049207319603944435d0)
   (60 0.1284375076824982d0) (58 0.40939052050169755d0)
   (58 0.37834424041064524d0) (58 0.42691113282632465d0)
   (58 0.3764587424644504d0) (60 0.19141232408483044d0)
   (61 0.3251218781803747d0) (63 0.30721117552723354d0)
   (61 0.012975285101021767d0) (60 0.048841343641367364d0)
   (58 0.22313176271483137d0) (58 0.3924561930721872d0)
   (63 0.07718679922239646d0) (65 0.09747936356614831d0)
   (63 0.14389735824332642d0) (61 0.03242996802425825d0)
   (58 0.04728077096207272d0) (60 0.2069176707379784d0)
   (58 0.17062773916899343d0))
  ((62 0.06590672354135681d0) (67 0.057678636853679965d0)
    etc... ))

There are 15 sequences in this example and the alphabet is a subset of
the positive integers.

I now want to compute the average log liklihood of the elements
appearing
in this list of sequences using the following function: 

(defun log-liklihood (p)
  (- (log p 2)))

I've tried two ways of doing this in CMUCL 18d (x86):

	1. append the sequences and average the log liklihood over the
	   whole set of elements:

USER> (/ (reduce #'+ (mapcar #'(lambda (x)
                                 (reduce #'+ x
                                         :key #'(lambda (p)
                                                  (log-liklihood (nth 1 p)))))
                             *result-set*))
         848)
2.5096585288651085d0
USER> 

	2. compute the average log liklihood individually for each sequence
	   and then average the results:

USER> (/ (reduce #'+ (mapcar #'(lambda (x)
                                 (/ (reduce #'+ x
                                            :key #'(lambda (p)
                                                     (log-liklihood (nth 1 p))))
                                    (list-length x)))
                             *result-set*))
         15)
2.555111481517115d0
USER> 

For other reasons I would rather compute the log liklihood individually
for each sequence and then combine the results if necessary (approach
2). However, as far as I can tell, the extra division in the second
approach is reducing the accuracy of the computation. My question is
whether this is in fact the case and, if so, whether there is any way
to prevent the extra division from compromising accuracy.

Thanks for your help, 

Marcus 

PS. I also tried this in clisp 1997-05-03 (May 1997) (SunOS):

> (/ (reduce #'+ (mapcar #'(lambda (x)
                             (reduce #'+ x
                                     :key #'(lambda (p)
                                              (log-liklihood (nth 1 p)))))
                         *result-set*))
     848)
2.502023538329325d0
>
> (/ (reduce #'+ (mapcar #'(lambda (x)
                             (/ (reduce #'+ x
                                        :key #'(lambda (p)
                                                 (log-liklihood (nth 1 p))))
                                (list-length x)))
                         *result-set*))
     15)
2.3504468060761665d0
> 

Why the difference?

From: Gareth McCaughan
Subject: Re: Floating point precision
Date: 
Message-ID: <slrnbaocif.on.Gareth.McCaughan@g.local>
Marcus Pearce wrote:

> There are 15 sequences in this example and the alphabet is a subset of
> the positive integers. 
> 
> I now want to compute the average log liklihood of the elements
> appearing
> in this list of sequences using the following function: 
> 
> (defun log-liklihood (p)
>   (- (log p 2)))
> 
> I've tried two ways of doing this in CMUCL 18d (x86) 
> 
> 	1. append the sequences and average the log liklihood over the 
>            whole set of elements; 
> 
> USER> (/ (reduce #'+ (mapcar #'(lambda (x) (reduce #'+ x :key
> #'(lambda (p) (log-liklihood (nth 1 p))))) *result-set*)) 848)
> 2.5096585288651085d0
> USER> 
> 
> 	2. compute the average log liklihood individually for each sequence 
>            and then average the results. 
> 
> USER> (/ (reduce #'+ (mapcar #'(lambda (x) (/ (reduce #'+ x :key
> #'(lambda (p) (log-liklihood (nth 1 p)))) (list-length x)))
> *result-set*)) 15)
> 2.555111481517115d0
> USER> 

You're doing two completely different calculations.
Unless all your sequences are of the same length,
you shouldn't expect them to give the same answer.
Here's a simpler example of the same phenomenon.

Consider the lists (1 2 3), (4 6).

The average values of the two lists are 2 and 5;
the average of these is 3.5. The average value of
all the numbers in the two lists together is 16/5 = 3.2.
These two numbers are not equal, and there's no reason
why they should be.
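(A quick sketch of the arithmetic above at the REPL, together with the
length-weighted combination that does recover the pooled average. The
helper function names here are illustrative, not from the original
posts; Common Lisp keeps everything exact as rationals, so the two
quantities can be compared without any floating-point noise.)

```lisp
(defun average (list)
  (/ (reduce #'+ list) (length list)))

(defun pooled-average (lists)
  ;; average over all elements of all the lists taken together
  (average (reduce #'append lists)))

(defun average-of-averages (lists)
  ;; unweighted: treats every list equally, whatever its length
  (average (mapcar #'average lists)))

(defun weighted-average-of-averages (lists)
  ;; weight each per-list average by that list's length; this is
  ;; algebraically identical to POOLED-AVERAGE
  (/ (reduce #'+ (mapcar #'(lambda (l) (* (length l) (average l))) lists))
     (reduce #'+ (mapcar #'length lists))))

(pooled-average '((1 2 3) (4 6)))               ; => 16/5
(average-of-averages '((1 2 3) (4 6)))          ; => 7/2
(weighted-average-of-averages '((1 2 3) (4 6))) ; => 16/5
```

So the per-sequence averages can still be computed separately and then
combined, provided each is weighted by its sequence's length.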

PS. "likelihood" is spelt thus. :-)

-- 
Gareth McCaughan  ················@pobox.com
.sig under construc
From: Marcus Pearce
Subject: Re: Floating point precision
Date: 
Message-ID: <Pine.GSO.4.02A.10304272236060.4979-100000@vega.soi.city.ac.uk>
> You're doing two completely different calculations.
> Unless all your sequences are of the same length,
> you shouldn't expect them to give the same answer.

Of course I shouldn't! For some reason I had it in my head that an average
of averages is the same as the overall average regardless of length. Rest
assured that I've given it (my head that is) an appropriate number of
slaps for its stupidity.

> PS. "likelihood" is spelt thus. :-)

And an extra couple for being rubbish at spelling as well as maths. 

Marcus