The GPUs in today's video cards can potentially outperform the classical
CPUs that we can buy from AMD and Intel, depending on the class of the
problem.
Could a Lisp benefit from running on such a GPU, at least for some tasks?
ECL is promoted as being very portable - could it, of all CLs, be the one
with the best chance of getting ported to video cards?
Is there a better way to make GPUs programmable in Lisp other than
porting an implementation to run on them?
André
--
On 2 June, 22:04, André Thieme <address.good.until.
············@justmail.de> wrote:
> The GPUs in today's video cards can potentially outperform the classical
> CPUs that we can buy from AMD and Intel, depending on the class of the
> problem.
> Could a Lisp benefit from running on such a GPU, at least for some tasks?
> ECL is promoted as being very portable - could it, of all CLs, be the one
> with the best chance of getting ported to video cards?
> Is there a better way to make GPUs programmable in Lisp other than
> porting an implementation to run on them?
>
> André
> --
It's mainly floating point and array operations the GPU is especially
good at. All for producing high quality graphics fast.
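Something like this element-wise loop is the sort of thing meant -- a minimal
CL sketch of the workload shape (nothing GPU-specific in it):

;; y <- a*x + y over single-floats: the same operation applied to many
;; independent elements, which is exactly what GPUs are built for.
(defun saxpy! (a x y)
  (declare (type single-float a)
           (type (simple-array single-float (*)) x y))
  (dotimes (i (length x) y)
    (setf (aref y i) (+ (* a (aref x i)) (aref y i)))))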
Robert
rvirding wrote:
> On 2 June, 22:04, André Thieme <address.good.until.
> ············@justmail.de> wrote:
>> The GPUs in today's video cards can potentially outperform the classical
>> CPUs that we can buy from AMD and Intel, depending on the class of the
>> problem.
>> Could a Lisp benefit from running on such a GPU, at least for some tasks?
>> ECL is promoted as being very portable - could it, of all CLs, be the one
>> with the best chance of getting ported to video cards?
>> Is there a better way to make GPUs programmable in Lisp other than
>> porting an implementation to run on them?
>>
>> André
>> --
>
> It's mainly floating point and array operations the GPU is especially
> good at. All for producing high quality graphics fast.
Well, array operations in the sense of pretty much anything done in an
embarrassingly parallel fashion, preferably with a fairly small working
set of data. You have to think what interesting lispy operations you
could recast in that form. It would also be pretty easy to implement
something like *Lisp, the Lisp for the Connection Machine, on a GPU -- a
single high-end board has rather more simultaneous bit-twiddling
capacity than a first-generation Connection Machine, for example.
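To make that concrete, a minimal sketch of what a *Lisp-flavoured elementwise
operation might look like on the host side -- *map here is purely
illustrative, not an existing library:

;; Apply FN elementwise across "parallel variables" (here just vectors).
;; On a GPU backend each element would become one thread; this fallback
;; is plain serial CL.
(defun *map (fn &rest pvars)
  (apply #'map 'vector fn pvars))

;; e.g. (*map (lambda (a b) (* 0.5 (+ a b))) #(1.0 2.0 3.0) #(4.0 5.0 6.0))
;;      => #(2.5 3.5 4.5)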
But running on just the GPU seems like a sorta limiting idea, when one
of Lisp's strengths is the ability to reflect on the content of a
computation. Once you can emit the opcodes needed to move data to the
GPU's memory space and the GPU opcodes to do whatever parallel
operations you want, then it's a matter of writing the appropriate
little language whose plumbing can divvy up the computing tasks as
effectively as possible among the various resources you have (including
multiple CPU cores in most modern machines). For instance, it would be
great to write clean code that could not only take advantage of as many
GPU cores as you've got but also do the arithmetic that told it when the
data-movement costs exceeded the speed increase from doing the
calculations on the GPU...
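That arithmetic could start out as crude as this -- all the constants below
are made-up placeholders, not measurements:

;; Back-of-the-envelope offload decision: only ship work to the GPU when the
;; estimated transfer time plus GPU time beats the CPU time.
(defparameter *pcie-bytes-per-sec* 5e9)  ; assumed host<->GPU bandwidth
(defparameter *cpu-flops* 2e10)          ; assumed CPU throughput
(defparameter *gpu-flops* 2e11)          ; assumed GPU throughput

(defun offload-to-gpu-p (bytes-to-move flops-of-work)
  "Crude cost model: is (transfer + GPU compute) cheaper than CPU compute?"
  (let ((transfer (/ bytes-to-move *pcie-bytes-per-sec*))
        (cpu-time (/ flops-of-work *cpu-flops*))
        (gpu-time (/ flops-of-work *gpu-flops*)))
    (< (+ transfer gpu-time) cpu-time)))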
paul
From: Dimiter "malkia" Stanev
Subject: Re: Porting ECL to video cards?
Message-ID: <4A26F972.4000107@mac.com>
André Thieme wrote:
> The GPUs in today's video cards can potentially outperform the classical
> CPUs that we can buy from AMD and Intel, depending on the class of the
> problem.
> Could a Lisp benefit from running on such a GPU, at least for some tasks?
> ECL is promoted as being very portable - could it, of all CLs, be the one
> with the best chance of getting ported to video cards?
> Is there a better way to make GPUs programmable in Lisp other than
> porting an implementation to run on them?
>
>
> André
My understanding is that the GPU works by running the same algorithm on
different units of data - in a sense SIMD, though NVIDIA calls
it SIMT (Single Instruction Multiple Threads).
But as soon as the data causes the algorithm flow to diverge at
(if/while/cond) points, then it can't execute them in parallel anymore.
I.e. it can execute things in parallel as long as they follow the same
instruction address.
Look up the CUDA / OpenCL papers and see for yourself.
For example, if someone comes up with a way to run an LZ-based
compression efficiently on CUDA / OpenCL - then he'll pretty much have done
the first step of implementing Common Lisp on top of it. LZ compression,
being one of the dwarfs (a state machine), is really hard to make
parallel.
I wouldn't mind, though, CFFI bindings for CUDA / OpenCL :) - And finally you
would have something that would beat Jon Harrop's crappy raycaster
many times over, realizing that most of the performance wars are won by
specialized hardware (GPU), and in the cases where it can't win, Lisp would
usually beat it.
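Just to give a taste, a minimal CFFI sketch that loads OpenCL and counts the
available platforms - only the very first call of the API, and take the
per-OS library names as a guess:

;; Load the OpenCL library and expose clGetPlatformIDs via CFFI.
(cffi:define-foreign-library opencl
  (:darwin (:framework "OpenCL"))
  (:windows "OpenCL.dll")
  (t (:default "libOpenCL")))
(cffi:use-foreign-library opencl)

;; cl_int clGetPlatformIDs(cl_uint num_entries,
;;                         cl_platform_id *platforms,
;;                         cl_uint *num_platforms)
(cffi:defcfun ("clGetPlatformIDs" cl-get-platform-ids) :int
  (num-entries :uint)
  (platforms :pointer)
  (num-platforms :pointer))

(defun opencl-platform-count ()
  "Ask the driver how many OpenCL platforms are installed."
  (cffi:with-foreign-object (n :uint)
    (cl-get-platform-ids 0 (cffi:null-pointer) n)
    (cffi:mem-ref n :uint)))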
DmS> But as soon as the data causes the algorithm flow to change
DmS> (if/while/cond) points, then it can't execute them in parallel
DmS> anymore.
you're wrong, conditionals can be used in "shaders".
as far as i understand, nVIDIA's 8xxx series and up have more-or-less
normal CPUs, except they have some memory access quirks and
limitations.
From: Dimiter "malkia" Stanev
Subject: Re: Porting ECL to video cards?
Message-ID: <4A280866.4050706@mac.com>
Alex Mizrahi wrote:
> DmS> But as soon as the data causes the algorithm flow to change
> DmS> (if/while/cond) points, then it can't execute them in parallel
> DmS> anymore.
>
> you're wrong, conditionals can be used in "shaders".
Yes, I've stated it wrongly. You can have control flow, but you might
expect big penalties if two or more units (warps) diverge in execution.
If they don't diverge (for example, an "if" check for an out-of-bounds
array index), it should be okay.
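The benign kind of "if" I mean is just a guard like the one in this little
OpenCL kernel, held as a string the way you would feed it to
clCreateProgramWithSource:

;; Every work-item takes the same branch except the few past the end of the
;; array, so the warp hardly ever diverges.
(defparameter *scale-kernel-source* "
__kernel void scale(__global float *data, const float factor, const int n)
{
    int i = get_global_id(0);
    if (i < n)              /* out-of-bounds guard: cheap, rarely diverges */
        data[i] = data[i] * factor;
}
")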
The OpenCL implementation that I'm using right now is more or less
based on Nvidia CUDA. And the CUDA 2.2 manual paragraph 5.1.1.2 states
that[1]:
"5.1.1.2 Control Flow Instructions
Any flow control instruction (if, switch, do, for, while) can
significantly impact the effective instruction throughput by causing
threads of the same warp to diverge, that is, to follow different
execution paths. If this happens, the different execution paths have to
be serialized, increasing the total number of instructions executed for
this warp. When all the different execution paths have completed, the
threads converge back to the same execution path."
> as far as i understand, nVIDIA's 8xxx series and up have more-or-less
> normal CPUs, except they have some memory access quirks and
> limitations.
My understanding is that's not the case. A general CPU usually has an out
of order execution unit, while these small but numerous GPU cores do not. A
general CPU also comes with L1, L2, even L3 cache memory, and can have
access to memory almost anywhere (well, at least in your virtual address
space), while the memory access on the NVIDIA hardware is different - you
have local memory (16kb), then memory shared across many compute units, then
global memory shared with the host - and you have different operations for
each, plus memory that the device can only read from (usually textures and
such).
It looks more like the PS3/IBM SPU/SPE units, but with smaller memory,
and it lacks lots of operations (it can't initiate DMA I/O, for example).
One cool thing OpenCL has as a built-in operation is reading with
bilinear/trilinear/(bicubic?) interpolation from a 2D or 3D array -
think about (aref height-map 10.3 40.5) -> e.g. this would read that entry
from your height-map but interpolated between the values at (10,40),
(10,41), (11,40) and (11,41) with .3 and .5 weights.
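In plain CL that lookup would behave roughly like this software sketch (the
texture unit does the same blend in hardware):

;; Bilinear lookup: (interpolated-aref hm 10.3 40.5) blends the four
;; neighbouring samples with weights 0.3 and 0.5.
(defun interpolated-aref (array x y)
  (multiple-value-bind (x0 fx) (floor x)
    (multiple-value-bind (y0 fy) (floor y)
      (let ((x1 (min (1+ x0) (1- (array-dimension array 0))))
            (y1 (min (1+ y0) (1- (array-dimension array 1)))))
        (+ (* (- 1 fx) (- 1 fy) (aref array x0 y0))
           (* (- 1 fx) fy       (aref array x0 y1))
           (* fx       (- 1 fy) (aref array x1 y0))
           (* fx       fy       (aref array x1 y1)))))))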
And of course, I would love to see whether any Lisp (or, for that matter,
any other language) is implementable on it, or at least can be used to
unleash more of its power (even through CFFI right now, with custom
CUDA/OpenCL code stored as strings, or even generated on the fly).
As for a real implementation, I guess because of the really small local
memory, and the fact that the compiler does not allow recursive
functions, you have to be somehow inventive - for example, construct it in
continuation-passing style, I dunno...
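For instance, the usual dodge of replacing the recursion with an explicit
stack (a poor man's continuation) - recursive tree-sum versus a flattened
version:

;; Natural recursive version.
(defun tree-sum (tree)
  (if (atom tree)
      (or tree 0)
      (+ (tree-sum (car tree)) (tree-sum (cdr tree)))))

;; Same computation, no recursion: pending subtrees live on an explicit stack.
(defun tree-sum-iter (tree)
  (let ((stack (list tree)) (total 0))
    (loop until (null stack)
          do (let ((node (pop stack)))
               (cond ((null node))
                     ((atom node) (incf total node))
                     (t (push (car node) stack)
                        (push (cdr node) stack)))))
    total))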
[1] -
http://developer.download.nvidia.com/compute/cuda/2_2/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.2.pdf
page 77
DmS> Yes, I've stated it wrongly. You can have control flow, but you might
DmS> expect big penalties if two or more units (warps) diverge in
DmS> execution. If they don't diverge (for example, an "if" check for an
DmS> out-of-bounds array index), it should be okay.
aha, ok. i've read the doc.. as i understand, if you diverge, instead of
running your code on 128 stream processors you will be running only
on 16 (or 8?) of them in the worst case, as the documentation says that each
multiprocessor has 8 cores and different multiprocessors are independent.
DmS> The OpenCL implementation that I'm using right now is more or less
DmS> based on Nvidia CUDA. And the CUDA 2.2 manual paragraph 5.1.1.2
DmS> states that[1]:
i found chapter 4 which explains hardware implementation more interesting.
ok, i think first one should think about what the use of this Lisp-on-GPU
will be: projects which are done without real goals and useful uses rarely
succeed.
if the GPU is efficient at running the same kernels in multiple threads, then
perhaps it only makes sense to use the GPU when you need to do something in a
massively parallel fashion. that is, you do not need to run full CL on the
GPU, but only accelerate parts which are massively parallel and can be run
on the GPU.
so we'd have not "CL for GPU" but "CL accelerated by GPU". i think this
might be very useful for numeric stuff, and it is not that hard to implement
this accelerator in CL -- well, at least easier than for most languages out
there.
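the front end could be as thin as a map-like entry point that falls back to
plain CL when there is no device -- everything below is hypothetical, just
the shape of it:

;; Hypothetical "CL accelerated by GPU" front end: the caller writes ordinary
;; Lisp, and gpu-map either hands the work to a device backend or quietly
;; falls back to plain MAP.
(defvar *gpu-backend* nil
  "When bound to a function of (fn vector), runs the map on the device.")

(defun gpu-map (fn vector)
  (if *gpu-backend*
      (funcall *gpu-backend* fn vector)
      (map 'vector fn vector)))

;; numeric code stays ordinary CL:
;; (gpu-map (lambda (x) (* x x)) #(1.0 2.0 3.0))  => #(1.0 4.0 9.0)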
running lots of totally independent Lisp threads would be much harder,
and i'm not sure what is the goal of it -- maybe it would be easier for
lazy programmers to optimize, but it doesn't look like a good idea
in general..
DmS> My understanding is that's not the case. A general CPU usually has Out
DmS> of order execution unit,
um, it depends on what you call "a general CPU". if you mean the current
generation of Intel CPUs for desktops -- then yes. otherwise, many mobile
devices have ARM CPUs without any out-of-order execution.
but yes, i agree that stream processing units are not like general CPUs,
performance-wise. quote from nVIDIA documentation:
"For the purposes of correctness,
the programmer can essentially ignore the SIMT behavior; however,
substantial
performance improvements can be realized by taking care that the code seldom
requires threads in a warp to diverge."
DmS> being one of the dwarfs (a state machine), is really hard to make
DmS> parallel.
can you elaborate on this? i dunno about LZ compression, but a general
state machine algorithm can be just one line of code without any branches:
state = machine_table[state][input[i++]];
and i do not see why you can't run it efficiently as a kernel which does not
diverge.
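the same loop written in CL over a 2D transition table, for comparison:

;; One array lookup per input byte, no branching in the body, so in principle
;; many independent streams could each run this as one GPU thread.
(defun run-dfa (table input &optional (state 0))
  (declare (type (simple-array fixnum (* *)) table)
           (type (vector (unsigned-byte 8)) input))
  (loop for byte across input
        do (setf state (aref table state byte)))
  state)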
of course, in real state machines it is not that simple -- the input might
be more complex, the state might be complex, the table might be too large and
defined in an algorithmic way.
but i believe a byte-code executing virtual machine can be made in this way,
which will execute some primitive bytecode by applying a very simple kernel
without any branching.
while it would be sort of inefficient, you can run multiple independent
threads this way. but if you have an 8x slowdown or more due to
interpretation, it defeats its purpose.
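a toy of that "primitive bytecode" idea -- each step is one table lookup plus
a tiny handler; the opcodes are illustrative only, nothing to do with any
real Lisp VM:

;; Toy stack-machine interpreter: the main loop is a single table dispatch
;; per step, so many independent instances could in principle run in lockstep.
(defstruct vm code (pc -1) stack done)

(defparameter *ops*
  (vector (lambda (vm)                                 ; 0: PUSH next byte
            (push (aref (vm-code vm) (incf (vm-pc vm))) (vm-stack vm)))
          (lambda (vm)                                 ; 1: ADD top two
            (push (+ (pop (vm-stack vm)) (pop (vm-stack vm))) (vm-stack vm)))
          (lambda (vm) (setf (vm-done vm) t))))        ; 2: HALT

(defun run-vm (code)
  (let ((vm (make-vm :code code)))
    (loop until (vm-done vm)
          do (funcall (svref *ops* (aref code (incf (vm-pc vm)))) vm))
    (car (vm-stack vm))))

;; (run-vm #(0 2 0 3 1 2))  ; push 2, push 3, add, halt  => 5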
so, i'm afraid, if you want unrestricted Lisp code to be run in dozens of
threads in an efficient manner, you'd have to wait until Intel releases
Larrabee.
AT> Is there a better way to make GPUs programmable in Lisp other than
AT> porting an implementation to run on them?
there is a very interesting project from Intel called Larrabee -- basically,
they make a videocard/GPU out of lots (dozens) of x86 cores (having some
additional capabilities for intensive number crunching). these x86 cores
work and can be programmed in quite the usual way, so i'd expect porting to
Larrabee would be easier than porting to any other GPU out there.