From: André Thieme
Subject: Porting ECL to video cards?
Date: 
Message-ID: <h040km$nai$1@news.eternal-september.org>
The GPUs in today's video cards can potentially outperform the classical
CPUs that we can buy from AMD and Intel, depending on the class of the
problem.
Could a Lisp benefit from running on such a GPU, at least for some tasks?
ECL is promoted as being very portable - could it, of all CLs, be the one
with the best chances of getting ported to video cards?
Is there a better way to make GPUs programmable in Lisp than porting an
implementation to run on them?


André
-- 

From: rvirding
Subject: Re: Porting ECL to video cards?
Date: 
Message-ID: <36e511af-ac0c-4402-98f1-01964c3ef247@t21g2000yqi.googlegroups.com>
On 2 June, 22:04, André Thieme <address.good.until.
············@justmail.de> wrote:
> The GPUs in today's video cards can potentially outperform the classical
> CPUs that we can buy from AMD and Intel, depending on the class of the
> problem.
> Could a Lisp benefit from running on such a GPU, at least for some tasks?
> ECL is promoted as being very portable - could it, of all CLs, be the one
> with the best chances of getting ported to video cards?
> Is there a better way to make GPUs programmable in Lisp than porting an
> implementation to run on them?
>
> André
> --

It's mainly floating-point and array operations that the GPU is especially
good at - all for producing high-quality graphics fast.

Robert
From: Paul Wallich
Subject: Re: Porting ECL to video cards?
Date: 
Message-ID: <h04h9g$495$1@reader1.panix.com>
rvirding wrote:
> On 2 June, 22:04, André Thieme <address.good.until.
> ············@justmail.de> wrote:
>> The GPUs in today's video cards can potentially outperform the classical
>> CPUs that we can buy from AMD and Intel, depending on the class of the
>> problem.
>> Could a Lisp benefit from running on such a GPU, at least for some tasks?
>> ECL is promoted as being very portable - could it, of all CLs, be the one
>> with the best chances of getting ported to video cards?
>> Is there a better way to make GPUs programmable in Lisp than porting an
>> implementation to run on them?
>>
>> André
>> --
> 
> It's mainly floating-point and array operations that the GPU is especially
> good at - all for producing high-quality graphics fast.

Well, array operations in the sense of pretty much anything done in
embarrassingly parallel fashion, preferably with a fairly small working
set of data. You have to think about what interesting lispy operations you
could recast in that form. It would also be pretty easy to implement
something like *Lisp, the Lisp for the Connection Machine, on a GPU -- a
single high-end board has rather more simultaneous bit-twiddling
capacity than a first-generation Connection Machine, for example.

But running on just the GPU seems like a sorta limiting idea, when one 
of Lisp's strengths is the ability to reflect on the content of a 
computation. Once you can emit the opcodes needed to move data to the 
GPU's memory space and the GPU opcodes to do whatever parallel 
operations you want, then it's a matter of writing the appropriate 
little language whose plumbing can divvy up the computing tasks as 
effectively as possible among the various resources you have (including 
multiple CPU cores in most modern machines). For instance, it would be 
great to write clean code that could not only take advantage of as many 
GPU cores as you've got but also do the arithmetic that told it when the 
data-movement costs exceeded the speed increase from doing the 
calculations on the GPU...
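
Back-of-the-envelope, that arithmetic could be as simple as this (the
numbers are placeholders you'd measure on your own hardware, and the
function name is made up):

   (defun worth-offloading-p (bytes flops &key (bus-bytes/sec 4e9)
                                               (gpu-flops/sec 5e11)
                                               (cpu-flops/sec 2e10))
     ;; the keyword defaults above are placeholder figures, not real specs
     ;; offload only if moving the data plus crunching it on the GPU
     ;; beats just doing the work on the CPU
     (< (+ (/ bytes bus-bytes/sec)
           (/ flops gpu-flops/sec))
        (/ flops cpu-flops/sec)))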

paul
From: Dimiter "malkia" Stanev
Subject: Re: Porting ECL to video cards?
Date: 
Message-ID: <4A26F972.4000107@mac.com>
André Thieme wrote:
> The GPUs in today's video cards can potentially outperform the classical
> CPUs that we can buy from AMD and Intel, depending on the class of the
> problem.
> Could a Lisp benefit from running on such a GPU, at least for some tasks?
> ECL is promoted as being very portable - could it, of all CLs, be the one
> with the best chances of getting ported to video cards?
> Is there a better way to make GPUs programmable in Lisp than porting an
> implementation to run on them?
> 
> 
> André

My understanding is that the GPU works by running the same algorithm on
different units of data - in a sense SIMD, though NVIDIA calls it SIMT
(Single Instruction, Multiple Threads).

But as soon as the data causes the algorithm's flow to change at
(if/while/cond) points, it can't execute them in parallel anymore.
I.e. it can execute things in parallel only as long as they follow the
same instruction address.

Look at the CUDA / OpenCL papers and see for yourself.

For example, if someone comes up with a way to run LZ-based compression
efficiently on CUDA / OpenCL, then he'll pretty much have done the first
step of implementing Common Lisp on top of it. LZ compression, being one
of the dwarfs (a state machine), is really hard to make parallel.

I wouldn't mind CFFI bindings for CUDA / OpenCL, though :) - then you
would finally have something that beats Jon Harrop's crappy raycaster
many times over, given that most of the performance wars are won by
specialized hardware (the GPU), and in the cases where it can't win,
Lisp would usually beat it.
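
Just to show the shape of it, a first CFFI stab could look like this
(completely untested - the library spellings and the Lisp-side names are
my guesses; only clGetPlatformIDs is a real OpenCL entry point):

   ;; load the OpenCL library (names differ per platform)
   (cffi:define-foreign-library opencl
     (:darwin (:framework "OpenCL"))
     (:windows "OpenCL.dll")
     (t (:default "libOpenCL")))
   (cffi:use-foreign-library opencl)

   ;; cl_int clGetPlatformIDs(cl_uint num_entries,
   ;;                         cl_platform_id *platforms,
   ;;                         cl_uint *num_platforms)
   (cffi:defcfun ("clGetPlatformIDs" %get-platform-ids) :int
     (num-entries :uint)
     (platforms :pointer)
     (num-platforms :pointer))

   (defun platform-count ()
     "How many OpenCL platforms does the driver report?"
     (cffi:with-foreign-object (n :uint)
       (%get-platform-ids 0 (cffi:null-pointer) n)
       (cffi:mem-ref n :uint)))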
From: Alex Mizrahi
Subject: Re: Porting ECL to video cards?
Date: 
Message-ID: <4a27dc56$0$90263$14726298@news.sunsite.dk>
 DmS> But as soon as the data causes the algorithm's flow to change at
 DmS> (if/while/cond) points, it can't execute them in parallel
 DmS> anymore.

you're wrong, conditionals can be used in "shaders".

as far as i understand, nVIDIA's 8xxx series and up have more-or-less
normal CPUs, except they have some memory access quirks and
limitations. 
From: Dimiter "malkia" Stanev
Subject: Re: Porting ECL to video cards?
Date: 
Message-ID: <4A280866.4050706@mac.com>
Alex Mizrahi wrote:
>  DmS> But as soon as the data causes the algorithm's flow to change at
>  DmS> (if/while/cond) points, it can't execute them in parallel
>  DmS> anymore.
> 
> you're wrong, conditionals can be used in "shaders".

Yes, I've stated it wrongly. You can have control flow, but you might
expect big penalties if two or more units (warps) diverge in execution.
If they don't, it's fine (for example, an "if" check for an out-of-bounds
array index should be okay).

The OpenCL implementation that I'm using right now is more or less
based on Nvidia CUDA. And the CUDA 2.2 manual, section 5.1.1.2, states
that[1]:

"5.1.1.2 Control Flow Instructions

Any flow control instruction (if, switch, do, for, while) can 
significantly impact the effective instruction throughput by causing 
threads of the same warp to diverge, that is, to follow different 
execution paths. If this happens, the different executions paths have to 
be serialized, increasing the total number of instructions executed for 
this warp. When all the different execution paths have completed, the 
threads converge back to the same execution path."

> as far as i understand, nVIDIA's 8xxx series and up have more-or-less
> normal CPUs, except they have some memory access quirks and
> limitations. 

My understanding is that that's not the case. A general CPU usually has
an out-of-order execution unit, while these small-but-numerous GPU cores
do not. A general CPU also comes with L1, L2, even L3 cache, and can
access memory almost anywhere (well, at least in your virtual address
space), while memory access on the NVIDIA hardware is different - you
have local memory (16 KB), then memory shared across many compute units,
then global memory shared with the host - and you have different
operations for each, plus memory that the device can only read (usually
textures and such). It looks more like the PS3/IBM SPU/SPE units, but
with smaller memory, and it lacks lots of operations (it can't initiate
I/O or DMA, for example).

One cool thing OpenCL has as a built-in operation is reading with
bilinear/trilinear/(bicubic?) interpolation from a 2D or 3D array -
think of (aref height-map 10.3 40.5) -> this would read from your
height-map, but interpolated between the values at (10,40), (10,41),
(11,40) and (11,41) with .3 and .5 weights.
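
On the CPU side that fractional AREF boils down to something like this
(BILERP-AREF is just a name I made up, untested):

   (defun bilerp-aref (array x y)
     ;; read ARRAY at fractional indices X and Y with bilinear interpolation
     (multiple-value-bind (ix fx) (floor x)
       (multiple-value-bind (iy fy) (floor y)
         (let ((a (aref array ix iy))            ; (10,40)
               (b (aref array ix (1+ iy)))       ; (10,41)
               (c (aref array (1+ ix) iy))       ; (11,40)
               (d (aref array (1+ ix) (1+ iy)))) ; (11,41)
           ;; blend along the second index, then along the first
           (+ (* (- 1 fx) (+ (* (- 1 fy) a) (* fy b)))
              (* fx       (+ (* (- 1 fy) c) (* fy d))))))))

So (bilerp-aref height-map 10.3 40.5) would give the weighted blend
described above.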

And of course, I would love to see whether any Lisp (or, for that matter,
any other language) is implementable on it, or at least can be used to
unleash more of its power (even through CFFI right now, with custom
CUDA/OpenCL code stored as strings, or even generated on the fly).
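
Generating the kernel source on the fly is trivial with FORMAT, e.g. (the
kernel here is just a made-up toy, though __kernel and get_global_id are
the real OpenCL spellings):

   (defun scale-kernel-source (factor)
     ;; return OpenCL C source for a kernel that scales a float vector
     (format nil "__kernel void scale(__global float *v) {
         int i = get_global_id(0);
         v[i] = v[i] * ~Ff;
     }" factor))

The resulting string then goes to clCreateProgramWithSource as usual.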

As for a real implementation, I guess because of the really small local
memory, and the fact that the compiler does not allow recursive
functions, you'd have to be somewhat inventive - for example, structure
it in continuation-passing style, I dunno...

[1] - 
http://developer.download.nvidia.com/compute/cuda/2_2/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.2.pdf 
page 77
From: Alex Mizrahi
Subject: Re: Porting ECL to video cards?
Date: 
Message-ID: <4a28fd4e$0$90271$14726298@news.sunsite.dk>
 DmS> Yes, I've stated it wrongly. You can have control flow, but you might
 DmS> expect big penalties if two or more units (warps) diverge in
 DmS> execution. If they don't, it's fine (for example, an "if" check for
 DmS> an out-of-bounds array index should be okay).

aha, ok. i've read the doc.. as i understand, if you diverge, instead of
running your code on 128 stream processors you will be running on only
16 (or 8?) of them in the worst case, as the documentation says that each
multiprocessor has 8 cores and different multiprocessors are independent.

 DmS> The OpenCL implementation that I'm using right now is more or less
 DmS> based on Nvidia CUDA. And the CUDA 2.2 manual, section 5.1.1.2,
 DmS> states that[1]:

i found chapter 4, which explains the hardware implementation, more
interesting.

ok, i think one should first decide what the use of this Lisp-on-GPU
would be: projects which are done without real goals and useful
applications rarely succeed.

if the GPU is efficient at running the same kernels in multiple threads,
then perhaps it only makes sense to use the GPU when you need to do
something in a massively parallel fashion. that is, you do not need to
run full CL on the GPU, but only accelerate the parts which are massively
parallel and can be run on the GPU. so we'd have not "CL for GPU" but
"CL accelerated by GPU". i think this might be very useful for numeric
stuff, and it is not that hard to implement this accelerator in CL --
well, at least easier than for most languages out there.
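
e.g. the thing you'd actually push to the GPU is a plain numeric kernel
like saxpy; the CL-visible function can keep an ordinary definition as a
fallback (gpu-saxpy is just an illustrative name, there is no GPU code
behind it here):

   (defun gpu-saxpy (a x y)
     ;; y <- a*x + y elementwise; this is the kind of loop you'd offload,
     ;; here it simply runs on the CPU as the fallback path
     (map-into y (lambda (xi yi) (+ (* a xi) yi)) x y))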

running lots of totally independent Lisp threads would be much harder,
and i'm not sure what the goal of that would be -- maybe it would be
easier for lazy programmers to optimize, but it doesn't look like a good
idea in general..



 DmS> My understanding is that that's not the case. A general CPU usually
 DmS> has an out-of-order execution unit,

um, it depends on what you call "a general CPU". if you mean the current
generation of Intel CPUs for desktops -- then yes. otherwise, many mobile
devices have ARM CPUs without any out-of-order execution.

but yes, i agree that stream processing units are not like general CPUs,
performance-wise. quote from nVIDIA documentation:

"For the purposes of correctness, the programmer can essentially ignore
the SIMT behavior; however, substantial performance improvements can be
realized by taking care that the code seldom requires threads in a warp
to diverge."



 DmS> being one of the dwarfs (a state machine), is really hard to make
 DmS> parallel.

can you elaborate on this? i dunno about LZ compression, but a general
state-machine algorithm can be just one line of code without any branches:

   state = machine_table[state][input[i++]];

and i do not see why you can't run it efficiently as a kernel which does
not diverge. of course, in real state machines it is not that simple --
the input might be more complex, the state might be complex, the table
might be too large and defined in an algorithmic way.
but i believe a bytecode-executing virtual machine can be made in this
way, which will execute some primitive bytecode by applying a very simple
kernel without any branching. while it would be sort of inefficient, you
can run multiple independent threads this way.
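
in Lisp terms the whole interpreter loop is just repeated table lookups --
a toy sketch (TABLE would be a 2D array indexed by state and input byte;
nothing GPU-specific about it):

   (defun run-machine (table input start-state)
     ;; one table lookup per input byte, no branching on the data itself
     (let ((state start-state))
       (loop for byte across input
             do (setf state (aref table state byte)))
       state))

the GPU version would just run many of these in parallel, one per thread.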

but if you get an 8x slowdown or more due to interpretation, it defeats
the purpose.

so, i'm afraid, if you want unrestricted Lisp code to be run in dozens of
threads in an efficient manner, you'd have to wait until Intel releases
Larrabee.
From: Alex Mizrahi
Subject: Re: Porting ECL to video cards?
Date: 
Message-ID: <4a2690e3$0$90267$14726298@news.sunsite.dk>
 AT> Is there a better way to make GPUs programmable in Lisp other than
 AT> porting an implementation to run on them?

there is a very interesting project from Intel called Larrabee -- basically,
they make a videocard/GPU out of lots (dozens) of x86 cores (with some
additional capabilities for intensive number crunching). these x86 cores
work and can be programmed in quite a usual way, so i'd expect porting to
Larrabee to be easier than porting to any other GPU out there.