From: jkc
Subject: Why isn't FFI nearly as efficient as native code?
Date: 
Message-ID: <pan.2006.07.30.19.51.49.882493@makewavs.com>
I've been working with large complex matrices which led me to calling
functions out of BLAS and ATLAS using FFI. I expected the overhead of
calling these functions to be pretty minimal but was surprised to find it
was instead more than 4 times slower for simple operations than had I
programmed it in, for example, C++.

I've included the two test codes that perform the same dimension complex
matrix multiply along with their respective timings.

I can't see any reason why this would be so. Any ideas?

-jkc


(1) The C++ code with timing:

#include <iostream>
extern "C" {
#include <cblas.h>
}
#include "sys/time.h"
using namespace std;

double urand() {
	return ((double)rand() / (double)RAND_MAX) * 2.0 - 1.0;
}

template <class T> class complexT
{
public:
  T re, im;
  complexT() {}
  complexT(T r, T i = 0) : re(r), im(i) {}
  complexT(const complexT& o) : re(o.re), im(o.im) {}
  complexT& operator=(const complexT& o) {
    re = o.re; im = o.im;
    return *this;
  }
  complexT operator*(const complexT& o) const {
    return complexT(re*o.re - im*o.im, re*o.im + im*o.re);
  }
  complexT operator+(const complexT& o) const {
    return complexT(re + o.re, im + o.im);
  }
  complexT operator-(const complexT& o) const {
    return complexT(re - o.re, im - o.im);
  }
  complexT& operator+=(const complexT& o) {
    re += o.re; im += o.im;
    return *this;
  }
};
typedef complexT<float>  fcplx;

class timer
{
private:
  timeval       start;
  timeval       current;
  struct timezone      tz;
public:
  timer() {
    tz.tz_minuteswest=0;
    tz.tz_dsttime=0;
    gettimeofday(&start,&tz);
  }
  void reset() {
    gettimeofday(&start,&tz);
  }
  double elapsed(bool reset=false) {
    gettimeofday(&current,&tz);
    double et=(current.tv_sec - start.tv_sec)
      + 1.e-6*(current.tv_usec - start.tv_usec);
    if (reset) start=current;
    return et;
  }
};

int main()
{
  const int N = 1000;
  fcplx *A = new fcplx[N*N];
  fcplx *X = new fcplx[N*N];
  fcplx *Y = new fcplx[N*N];
  fcplx One = 1;
  for (int k = 0; k < N*N; k++) {
    A[k] = 0;
    X[k] = fcplx(rand(), rand());
    Y[k] = fcplx(rand(), rand());
  }
  timer T;
  int k = N;
  cblas_cgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, N, N, k,
              &One, X, N, Y, N, &One, A, N);
  cout << "elapsed time= " << T.elapsed() << endl;
  delete[] A;
  delete[] X;
  delete[] Y;
  return 0;
}

// g++ -O3 -o atlas atlas.cpp -lcblas -latlas && ./atlas
// elapsed time= 3.64997


------------------------------------------------ 

(2) Here is a test using the FFI interface to the same function with timing:

(defpackage :gemm-alien
  (:use :cl :alien :c-call)
  (:export :run-tests))
(in-package :gemm-alien)
(load-foreign "/usr/lib/blas/atlas/libcblas.so") 
(declaim (inline cblas_cgemm))

(def-alien-routine "cblas_cgemm" c-call:void
  (order  int)             ; usually 101
  (transA int)             ; no = 111, trans = 112, conjtrans = 113
  (transB int)
  (dimM   int)
  (dimN   int)
  (dimK   int)             ; common dim: MxK * KxN = MxN
  (alpha  (* single-float))
  (A      (* single-float))
  (lda    int)
  (B      (* single-float))
  (ldb    int)
  (beta   (* single-float))
  (C      (* single-float))
  (ldc    int))

(defun surv ()
  (1- (random 2e0)))

(defun run-test-2 (N M K)
  (let* ((order 101) ; rowMajor = 101, colMajor = ?
	 (tA 111)    ; no = 111, trans=112, conjtrans=113 
         (tB 111)    ; no = 111, trans=112, conjtrans=113 
         (A (make-array (* 2 M K) 
		:element-type 'single-float 
		:initial-element 0e0)) 
	 (B (make-array (* 2 K N)
	 	:element-type 'single-float 
		:initial-element 0e0)) 
	 (C (make-array (* 2 M N) 
	 	:element-type 'single-float 
		:initial-element 0e0)) 
	 (alpha (make-array 2 
		:element-type 'single-float :initial-element 0e0))
	 (beta (make-array 2 
		:element-type 'single-float :initial-element 0e0)))
    (dotimes (i (* N M))
      (let ((ir (* 2 i)))
	(setf (aref A ir) (coerce (surv) 'single-float)) 
	(setf (aref B ir) (coerce (surv) 'single-float)))
    (setf (aref alpha 0) 1e0)
    (system:without-gcing
     (time
      (let ((a-addr (sys:vector-sap A))
	    (b-addr (sys:vector-sap B))
	    (c-addr (sys:vector-sap C))
	    (alpha-addr (sys:vector-sap alpha))
	    (beta-addr (sys:vector-sap beta)))
	(cblas-cgemm order tA tB M N K
		     alpha-addr a-addr K b-addr K beta-addr c-addr N))))))

(run-test-2 1000 1000 1000)
; Evaluation took:
;   26.84 seconds of real time
;   23.973497 seconds of user run time
;   0.164011 seconds of system run time
;   42,936,907,611 CPU cycles
;   0 page faults and
;   40 bytes consed.
;

From: jkc
Subject: Re: Why isn't FFI nearly as efficient as native code?
Date: 
Message-ID: <pan.2006.07.31.04.11.17.474392@makewavs.com>
On Sun, 30 Jul 2006 12:51:50 -0700, jkc wrote:

This has been solved following a discussion with rif via email. I was
working with emacs/slime and early on had loaded the old non-ATLAS BLAS
library /usr/lib/libcblas.a. All subsequent attempts to load the optimized
ATLAS library /usr/lib/blas/atlas/libcblas.so appeared to succeed, but in
fact did not: it was still linking against the old library.

I only obtained comparable results between the C++ and Lisp methods after
shutting down emacs and restarting it. 

There is probably a way to *unload* a library, but I don't know what it is. 

-jkc
From: Carl Shapiro
Subject: Re: Why isn't FFI nearly as efficient as native code?
Date: 
Message-ID: <ouyzmeqwp9f.fsf@panix3.panix.com>
jkc <·······@makewavs.com> writes:

> I can't see any reason why this would be so. Any ideas?

Can you run your test again, passing in an array of floating point
values allocated on the malloc heap instead?  Is this an x86 Linux
system by any chance?
From: Pascal Bourguignon
Subject: Re: Why isn't FFI nearly as efficient as native code?
Date: 
Message-ID: <8764heyd5x.fsf@thalassa.informatimago.com>
jkc <·······@makewavs.com> writes:

> I've been working with large complex matrices which led me to calling
> functions out of BLAS and ATLAS using FFI. I expected the overhead of
> calling these functions to be pretty minimal but was surprised to find it
> was instead more than 4 times slower for simple operations than had I
> programmed it in, for example, C++.
>
> I've included the two test codes that perform the same dimension complex
> matrix multiply along with their respective timings.
>
> I can't see any reason why this would be so. Any ideas?

One word: boxing.

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/
Wanna go outside.
Oh, no! Help! I got outside!
Let me back inside!
From: jkc
Subject: Re: Why isn't FFI nearly as efficient as native code?
Date: 
Message-ID: <pan.2006.07.30.22.12.29.761416@makewavs.com>
On Sun, 30 Jul 2006 23:51:22 +0200, Pascal Bourguignon wrote:

> jkc <·······@makewavs.com> writes:
> 
>> I've been working with large complex matrices which led me to calling
>> functions out of BLAS and ATLAS using FFI. I expected the overhead of
>> calling these functions to be pretty minimal but was surprised to find
>> it was instead more than 4 times slower for simple operations than had
>> I programmed it in, for example, C++.
>>
>> I've included the two test codes that perform the same dimension
>> complex matrix multiply along with their respective timings.
>>
>> I can't see any reason why this would be so. Any ideas?
> 
> One word: boxing.

I don't see how that could be. In order for the BLAS routine to work, it
has to be passed a hard pointer to floats in sequential memory. Are you
saying that (make-array 1000000 :element-type 'single-float) makes a
sequence of 1000000 pointers to adjacent locations?

And even if that were true, that process is outside my timing loop. It is
already allocated and filled when I make the call. All that should be
happening is passing an argument list before calling the foreign
procedure. Shouldn't it?

-jkc
From: Gene
Subject: Re: Why isn't FFI nearly as efficient as native code?
Date: 
Message-ID: <1154303770.485279.227580@m73g2000cwd.googlegroups.com>
jkc wrote:
> On Sun, 30 Jul 2006 23:51:22 +0200, Pascal Bourguignon wrote:
>
> > jkc <·······@makewavs.com> writes:
> >
> >> I've been working with large complex matrices which led me to calling
> >> functions out of BLAS and ATLAS using FFI. I expected the overhead of
> >> calling these functions to be pretty minimal but was surprised to find
> >> it was instead more than 4 times slower for simple operations than had
> >> I programmed it in, for example, C++.
> >>
> >> I've included the two test codes that perform the same dimension
> >> complex matrix multiply along with their respective timings.
> >>
> >> I can't see any reason why this would be so. Any ideas?
> >
> > One word: boxing.
>
> I don't see how that could be. In order for the BLAS routine to work, it
> has to be passed a hard pointer to floats in sequential memory. Are you
> saying that (make-array 1000000 :element-type 'single-float) makes a
> sequence of 1000000 pointers to adjacent locations?
>
> And even if that were true, that process is outside my timing loop. It is
> already allocated and filled when I make the call. All that should be
> happening is passing an argument list before calling the foreign
> procedure. Shouldn't it?
>
> -jkc

Yes.  There is a little address arithmetic in sys:vector-sap, but no
copying.

Try using the same data in both computations.  I have seen identical
floating point code vary in run time by a factor of 4 depending on the
magnitudes of data.  It's possible that using integers for one test and
fractional values for the other could cause such behavior.  The first
needs no normalization (for addition) in the FPU and the second does.
From: Gene
Subject: Re: Why isn't FFI nearly as efficient as native code?
Date: 
Message-ID: <1154316182.522781.244970@i3g2000cwc.googlegroups.com>
Gene wrote:
> jkc wrote:
> > On Sun, 30 Jul 2006 23:51:22 +0200, Pascal Bourguignon wrote:
> >
> > > jkc <·······@makewavs.com> writes:
> > >
> > >> I've been working with large complex matrices which led me to calling
> > >> functions out of BLAS and ATLAS using FFI. I expected the overhead of
> > >> calling these functions to be pretty minimal but was surprised to find
> > >> it was instead more than 4 times slower for simple operations than had
> > >> I programmed it in, for example, C++.
> > >>
> > >> I've included the two test codes that perform the same dimension
> > >> complex matrix multiply along with their respective timings.
> > >>
> > >> I can't see any reason why this would be so. Any ideas?
> > >
> > > One word: boxing.
> >
> > I don't see how that could be. In order for the BLAS routine to work, it
> > has to be passed a hard pointer to floats in sequential memory. Are you
> > saying that (make-array 1000000 :element-type 'single-float) makes a
> > sequence of 1000000 pointers to adjacent locations?
> >
> > And even if that were true, that process is outside my timing loop. It is
> > already allocated and filled when I make the call. All that should be
> > happening is passing an argument list before calling the foreign
> > procedure. Shouldn't it?
> >
> > -jkc
>
> Yes.  There is a little address arithmetic in sys:vector-sap, but no
> copying.
>
> Try using the same data in both computations.  I have seen identical
> floating point code vary in run time by a factor of 4 depending on the
> magnitudes of data.  It's possible that using integers for one test and
> fractional values for the other could cause such behavior.  The first
> needs no normalization (for addition) in the FPU and the second does.

I'm sorry.  I should have said the small fractional values are less
likely to require normalization than the fixed point values.
From: Sidney Markowitz
Subject: Re: Why isn't FFI nearly as efficient as native code?
Date: 
Message-ID: <44cd8ed9$0$34567$742ec2ed@news.sonic.net>
jkc wrote:
> I can't see any reason why this would be so. Any ideas?

I translated your Lisp code to sbcl. All that was required was changing
the alien, c-call, system, and sys package names to sb-alien, sb-c-call,
sb-sys, and sb-sys respectively, changing def-alien-routine to
define-alien-routine, and load-foreign to load-shared-object.

On a MacBook with 2GHz Intel dual core, the C++ and the sbcl versions
took the same amount of time, around 1.8 seconds.

There was a typo in the code you pasted, a missing close parenthesis at
the end of the let that is supposed to close the dotimes. That doesn't
explain the different results, as yours could not have compiled the way
you pasted it here, so I assume it was only wrong in the posting.

In case it's a hint about why the difference, the sbcl version showed 0
conses, while yours has 40 conses.


-- 
    Sidney Markowitz
    http://www.sidney.com