On Sun, 30 Jul 2006 12:51:50 -0700, jkc wrote:
This has been solved following a discussion with rif via email. I was
working with emacs/slime and early on had loaded the old non-ATLAS BLAS
library /usr/lib/libcblas.a. All subsequent attempts to load the optimized
ATLAS library /usr/lib/blas/atlas/libcblas.so reported success but in fact
had no effect: the image was still linked against the old library.
I only obtained comparable results between the C++ and Lisp methods after
shutting down emacs and restarting it.
There is probably a way to *unload* a library, but I don't know what it is.
-jkc
jkc <·······@makewavs.com> writes:
> I can't see any reason why this would be so. Any ideas?
Can you run your test again, passing in an array of floating point
values allocated on the malloc heap instead? Is this an x86 Linux
system by any chance?
jkc <·······@makewavs.com> writes:
> I've been working with large complex matrices which led me to calling
> functions out of BLAS and ATLAS using FFI. I expected the overhead of
> calling these functions to be pretty minimal but was surprised to find it
> was instead more than 4 times slower for simple operations than had I
> programmed it in, for example, C++.
>
> I've included the two test codes that perform the same dimension complex
> matrix multiply along with their respective timings.
>
> I can't see any reason why this would be so. Any ideas?
One word: boxing.
--
__Pascal Bourguignon__ http://www.informatimago.com/
Wanna go outside.
Oh, no! Help! I got outside!
Let me back inside!
On Sun, 30 Jul 2006 23:51:22 +0200, Pascal Bourguignon wrote:
> jkc <·······@makewavs.com> writes:
>
>> I've been working with large complex matrices which led me to calling
>> functions out of BLAS and ATLAS using FFI. I expected the overhead of
>> calling these functions to be pretty minimal but was surprised to find
>> it was instead more than 4 times slower for simple operations than had
>> I programmed it in, for example, C++.
>>
>> I've included the two test codes that perform the same dimension
>> complex matrix multiply along with their respective timings.
>>
>> I can't see any reason why this would be so. Any ideas?
>
> One word: boxing.
I don't see how that could be. In order for the BLAS routine to work, it
has to be passed a hard pointer to floats in sequential memory. Are you
saying that (make-array 1000000 :element-type 'single-float) makes a
sequence of 1000000 pointers to adjacent locations?
And even if that were true, that process is outside my timing loop. It is
already allocated and filled when I make the call. All that should be
happening is passing an argument list before calling the foreign
procedure. Shouldn't it?
-jkc
jkc wrote:
> On Sun, 30 Jul 2006 23:51:22 +0200, Pascal Bourguignon wrote:
>
> > jkc <·······@makewavs.com> writes:
> >
> >> I've been working with large complex matrices which led me to calling
> >> functions out of BLAS and ATLAS using FFI. I expected the overhead of
> >> calling these functions to be pretty minimal but was surprised to find
> >> it was instead more than 4 times slower for simple operations than had
> >> I programmed it in, for example, C++.
> >>
> >> I've included the two test codes that perform the same dimension
> >> complex matrix multiply along with their respective timings.
> >>
> >> I can't see any reason why this would be so. Any ideas?
> >
> > One word: boxing.
>
> I don't see how that could be. In order for the BLAS routine to work, it
> has to be passed a hard pointer to floats in sequential memory. Are you
> saying that (make-array 1000000 :element-type 'single-float) makes a
> sequence of 1000000 pointers to adjacent locations?
>
> And even if that were true, that process is outside my timing loop. It is
> already allocated and filled when I make the call. All that should be
> happening is passing an argument list before calling the foreign
> procedure. Shouldn't it?
>
> -jkc
Yes. There is a little address arithmetic in sys:vector-sap, but no
copying.
Try using the same data in both computations. I have seen identical
floating point code vary in run time by a factor of 4 depending on the
magnitudes of data. It's possible that using integers for one test and
fractional values for the other could cause such behavior. The first
needs no normalization (for addition) in the FPU and the second does.
Gene wrote:
> jkc wrote:
> > On Sun, 30 Jul 2006 23:51:22 +0200, Pascal Bourguignon wrote:
> >
> > > jkc <·······@makewavs.com> writes:
> > >
> > >> I've been working with large complex matrices which led me to calling
> > >> functions out of BLAS and ATLAS using FFI. I expected the overhead of
> > >> calling these functions to be pretty minimal but was surprised to find
> > >> it was instead more than 4 times slower for simple operations than had
> > >> I programmed it in, for example, C++.
> > >>
> > >> I've included the two test codes that perform the same dimension
> > >> complex matrix multiply along with their respective timings.
> > >>
> > >> I can't see any reason why this would be so. Any ideas?
> > >
> > > One word: boxing.
> >
> > I don't see how that could be. In order for the BLAS routine to work, it
> > has to be passed a hard pointer to floats in sequential memory. Are you
> > saying that (make-array 1000000 :element-type 'single-float) makes a
> > sequence of 1000000 pointers to adjacent locations?
> >
> > And even if that were true, that process is outside my timing loop. It is
> > already allocated and filled when I make the call. All that should be
> > happening is passing an argument list before calling the foreign
> > procedure. Shouldn't it?
> >
> > -jkc
>
> Yes. There is a little address arithmetic in sys:vector-sap, but no
> copying.
>
> Try using the same data in both computations. I have seen identical
> floating point code vary in run time by a factor of 4 depending on the
> magnitudes of data. It's possible that using integers for one test and
> fractional values for the other could cause such behavior. The first
> needs no normalization (for addition) in the FPU and the second does.
I'm sorry. I should have said the small fractional values are less
likely to require normalization than the fixed point values.
jkc wrote:
> I can't see any reason why this would be so. Any ideas?
I translated your Lisp code to sbcl. All that was required was changing
the alien, c-call, system, and sys package names to sb-alien, sb-c-call,
sb-sys, and sb-sys respectively, changing def-alien-routine to
define-alien-routine, and load-foreign to load-shared-object.
On a MacBook with 2GHz Intel dual core, the C++ and the sbcl versions
took the same amount of time, around 1.8 seconds.
There was a typo in the code you pasted: a missing closing parenthesis at
the end of the let that is supposed to close the dotimes. That doesn't
explain the different results, since your code could not have compiled as
pasted, so I assume the typo is only in the posting.
In case it's a hint about the cause of the difference: the sbcl version
showed 0 conses, while yours showed 40.
--
Sidney Markowitz
http://www.sidney.com