Efficiency suggestions, bioinformatics context

From: Johann Hibschman
Subject: Efficiency suggestions, bioinformatics context
Date: Sun, 15 Dec 2002 03:54:48 +0000
Message-ID: <m24r9gey2f.fsf@physics.berkeley.edu>

Hi folks,

This is a vague question, but it must be vague, since it's a stab out
into an unexplored country.

What lisp implementations (with a reasonably free trial edition) are
numerically efficient enough to be competetive with C++?

So, why do I ask a silly thing like this?

Well, I like lisp.  I don't quite know why.  It's a linguistic thing,
somehow.  Lisp reads like natural language to me, more than other
things.  Things read well, if done well.  Of course, there's the
laundry list of features, but that's not why I'm considering it.

I'm now working for a company doing work that may be classed as
bioinformatics.  We do sequencing by hybridization, which involves a
lot of searching through parameter space.  What assumed sequence
would make the observed sequence most likely?  That sort of thing.

So.  There exists some god-awful C++ code for reading a lot of data
files.  (Image analysis results, chip maps, fixed probe sets,
labelled probe sets, that sort of thing.)  It then tries to analyze
the data, to determine the most likely sequence.

Right now, we're working on improving the prediction, to go from
knowing the sequence, to knowing the exact values of everything
observed.  This means out maximum likelihood algorithm is constantly
changing, and it's driving me crazy, trying to modify C++ code quickly
to test ideas.

I'm thinking of just taking a day or two, hiding from management, and
re-implementing the whole shebang in lisp, so I can test different
algorithms faster.  I'd like to have something to show that works
fast, and I've always been more of an academic than anything else, so
I don't quite know what would work best in this context.

The real problem is that we want to implement a complex prediction for
the amplitude of the signal generated by a given DNA sequence.  This
prediction function gets called a zillion times, when we're trying all
4^11 possible 11-base sequences to find the best match, so it must be
fast.  (Dumb, sure, but none of the other methods guarantee that they
find the best solution.)  C++ lets this core function be fast, since
it's stupid and literal.  Lisp, I've seen be fast in CMUCL on Suns,
but I don't know if any of the more commercial versions are as
well-optimized for numerics.

So, any hints?

Java linkage would be a plus, since the GUI will be implemented in
Java.  That decision I think is a good one, since we want to support
Linux, Windows, and Mac OS X.

Sigh.  Vague, I know.  I suppose I'm just looking for some kind of
validation for my hunch that this all would be much easier to do, and
fast enough, in lisp before I spend the time to implement it.

Cheers,

--Johann

Re: Efficiency suggestions, bioinformatics context Paolo Amoroso
- Re: Efficiency suggestions, bioinformatics context Johann Hibschman
Re: Efficiency suggestions, bioinformatics context Nicolas Neuss
- Re: Efficiency suggestions, bioinformatics context Thomas F. Burdick
- Re: Efficiency suggestions, bioinformatics context Johann Hibschman
Re: Efficiency suggestions, bioinformatics context Jules F. Grosse

From: Paolo Amoroso
Subject: Re: Efficiency suggestions, bioinformatics context
Date: Sun, 15 Dec 2002 16:09:09 +0000
Message-ID: <A6n8PWyTQXw0EuFRQp0445rWd2OR@4ax.com>

On 14 Dec 2002 19:54:48 -0800, Johann Hibschman
<······@physics.berkeley.edu> wrote:

> I'm now working for a company doing work that may be classed as
> bioinformatics.  We do sequencing by hybridization, which involves a

You may check BioLisp.org:

  http://www.biolisp.org


> I'm thinking of just taking a day or two, hiding from management, and
> re-implementing the whole shebang in lisp, so I can test different
> algorithms faster.  I'd like to have something to show that works

If you do, please let us know management's comments on or reactions to the
project--provided you have the liberty of discussing this with us, of
course.

Good luck and congratulations on your enthusiasm,


Paolo
-- 
EncyCMUCLopedia * Extensive collection of CMU Common Lisp documentation
http://www.paoloamoroso.it/ency/README

From: Johann Hibschman
Subject: Re: Efficiency suggestions, bioinformatics context
Date: Sun, 15 Dec 2002 17:08:49 +0000
Message-ID: <m2smwzmcpp.fsf@yahoo.com>

Paolo Amoroso <·······@mclink.it> writes:

> You may check BioLisp.org:
> 
>   http://www.biolisp.org

Thanks, I'll stare at the code there a bit; I looked at it a month
ago, but I lacked the context required to make sense of it.  I'm yet
another physicist turned bioinformatics programmer, so I'm still
having a great time crawling my way up the learning curve.  There's
nothing like jumping into a field where you know next to nothing to
get those brain cells firing.

Oh, and it's been a while since I posted to usenet, and the email
address in my initial post is stale.  It should be jhibschman at yahoo
dot com; hopefully this post will have the right value.  Apologies to
anyone who tried to reply and got a bounce.

(And, just to be clear, I am expected to be programming exploratorily
to try to solve the problem, so I'm not really "hiding from
management."  I'd just rather get a working prototype going before
having to explain lisp.)

--Johann

From: Nicolas Neuss
Subject: Re: Efficiency suggestions, bioinformatics context
Date: Mon, 16 Dec 2002 09:02:50 +0000
Message-ID: <8765tubakl.fsf@ortler.iwr.uni-heidelberg.de>

Johann Hibschman <······@physics.berkeley.edu> writes:

> The real problem is that we want to implement a complex prediction for
> the amplitude of the signal generated by a given DNA sequence.  This
> prediction function gets called a zillion times, when we're trying all
> 4^11 possible 11-base sequences to find the best match, so it must be
> fast.  (Dumb, sure, but none of the other methods guarantee that they
> find the best solution.)  C++ lets this core function be fast, since
> it's stupid and literal.  Lisp, I've seen be fast in CMUCL on Suns,
> but I don't know if any of the more commercial versions are as
> well-optimized for numerics.
> 
> So, any hints?
> 
> Java linkage would be a plus, since the GUI will be implemented in
> Java.  That decision I think is a good one, since we want to support
> Linux, Windows, and Mac OS X.
> 
> Sigh.  Vague, I know.  I suppose I'm just looking for some kind of
> validation for my hunch that this all would be much easier to do, and
> fast enough, in lisp before I spend the time to implement it.
> 
> Cheers,

Here is a brief summary of my experiences: if your loop is using data
in the cache, you have to accept that CMUCL is about a factor 2 slower
than C code.  On the other hand, for many applications memory access
is the limiting factor and the difference gets smaller (up to equal
speed).  ACL is somewhat slower than CMUCL in my tests, but not much.

If you are new to Lisp, it is improbable that your code will be that
fast at the beginning, though.  If it is slower than the factor 2
mentioned above, I suggest that you post some simplified version with
the innermost loop here (maybe together with a C++-version) and let
people help you.

Nicolas.

From: Thomas F. Burdick
Subject: Re: Efficiency suggestions, bioinformatics context
Date: Mon, 16 Dec 2002 19:55:19 +0000
Message-ID: <xcvu1hdyc0o.fsf@fallingrocks.OCF.Berkeley.EDU>

Nicolas Neuss <·············@iwr.uni-heidelberg.de> writes:

> Here is a brief summary of my experiences: if your loop is using
> data in the cache, you have to accept that CMUCL is about a factor 2
> slower than C code.  On the other hand, for many applications memory
> access is the limiting factor and the difference gets smaller (up to
> equal speed).

Weren't you also working on x86?  On RISC hardware, CMUCL does better.
I'm not saying it'll necessarily match C code, but I have seen it best
gcc's code for non-memory-bound computation.  Sun's Forte C compiler
still beats CMUCL, though.

-- 
           /|_     .-----------------------.                        
         ,'  .\  / | No to Imperialist war |                        
     ,--'    _,'   | Wage class war!       |                        
    /       /      `-----------------------'                        
   (   -.  |                               
   |     ) |                               
  (`-.  '--.)                              
   `. )----'

From: Johann Hibschman
Subject: Re: Efficiency suggestions, bioinformatics context
Date: Tue, 17 Dec 2002 05:24:04 +0000
Message-ID: <m2y96przez.fsf@yahoo.com>

Nicolas Neuss <·············@iwr.uni-heidelberg.de> writes:

> Here is a brief summary of my experiences: if your loop is using data
> in the cache, you have to accept that CMUCL is about a factor 2 slower
> than C code.  On the other hand, for many applications memory access
> is the limiting factor and the difference gets smaller (up to equal
> speed).  ACL is somewhat slower than CMUCL in my tests, but not much.

I suspect that, once I dispel the demons of foolish bulk copying of
large C++ data structures, I'll be able to accept a x2 loss in the
numerics.

I'll play with it and see what I can get.

> If you are new to Lisp, it is improbable that your code will be that
> fast at the beginning, though.  If it is slower than the factor 2
> mentioned above, I suggest that you post some simplified version with
> the innermost loop here (maybe together with a C++-version) and let
> people help you.

I've done some numerical optimization in CL, mostly using CMUCL, which
gives astoundingly good error messages, and had great luck.  That was
on Sun boxen, while I'm now on x86, so I expect some slowdowns, since
I gather CMUCL is better with plentiful registers.

Cheers,

-Johann

From: Jules F. Grosse
Subject: Re: Efficiency suggestions, bioinformatics context
Date: Tue, 17 Dec 2002 18:00:12 +0000
Message-ID: <17c920a6.0212171000.66ae7c0a@posting.google.com>

> 
> The real problem is that we want to implement a complex prediction for
> the amplitude of the signal generated by a given DNA sequence.  This
> prediction function gets called a zillion times, when we're trying all

...

> Java linkage would be a plus, since the GUI will be implemented in

...


JACOL (jacol.sf.net) links Common Lisp and Java.  It's for CLISP only
(the examples shows some Swing widget manipulation, IIRC).

Also, I don't have any benchmark or anedoctal evidence handy, but I've
always heard that CLISP's numerical functions are extremely fast, even
for a bytecoded, interpreted Lisp.  But, of course, I'm only -telling
things I've heard on the hopes that someone can corroborate (or not)
that information.

Thanks,
jls