From: George Neuner
Subject: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <mcqtr2dnk40tuqharskl0811g6mst82sfm@4ax.com>
On 29 Jan 2007 09:06:34 -0800, ·············@gmail.com"
<············@gmail.com> wrote:

>On Jan 27, 10:42 pm, George Neuner <·········@comcast.net> wrote:
>> Well ... I don't want to start a war here but it has to be said that
>> CL's explicit sequencing and guaranteed order of evaluation for
>> function arguments is a detriment to extracting expression level
>> parallelism.  Source level solutions such as threads, futures, PCALL
>> and so on are not enough to take advantage of massively parallel
>> hardware.
>
>No flame wars here, but I am interested in a good discussion :)
>
>I don't have enough experience on the kinds of applications where that 
>kind of parallelism is useful -- my parallel apps tend to be very 
>regular.  But I'm wondering whether automatic extraction of 
>parallelism will be that useful in the long run, as communication will 
>become increasingly the dominant cost.

Keep in mind first that we are talking about shared memory machines
and second that multi-core processors are much more closely coupled
than the parallel CPUs of previous generation machines.  Multi-cores
typically share 2nd level cache and have dedicated buses between the
cores for snooping 1st level caches.

WRT automatic extraction of parallelism, studies have shown that the
vast majority of programmers have trouble with parallel programming
using source level methods.  The sad fact is that most people find it
difficult to think about simultaneous tasks, are lousy at anticipating
multiple data access problems, and are equally bad at diagnosing and
fixing them.  Experience with threading doesn't seem to help those who
don't get it in the first place.
[This I can attest to first hand.  I worked for over a decade writing
threaded, real time, 24/7 apps on various platforms from bare metal to
RTOS to (honest to God!) Linux and Wintel boxes.  The company I worked
for had enormous trouble finding and keeping good developers.  It's a
really big deal when a bug in your app takes down a factory on the
overnight shift.]

We've already got quad-core general purpose CPUs and larger n-ways on
the horizon.  Obviously some of the power will go to running multiple
programs simultaneously (or in the case of Vista, just running the
operating system).  But the only reasonable way to speed up any single
program is to spread the computations over multiple cores.  The
question is how best to do that.  I can tell you first hand that
scheduling the execution units of a modern DSP at the source level is
tough enough - the coming multi-core CPUs will do a lot more, and
scheduling them effectively at the source level will quickly become
impossible.

Threading at the source level doesn't seem likely to go far.  Most
tasks are inherently serial at the source level - finding more than a
couple of things to do simultaneously is hard, and manually splitting
work for a dozen cores will be incredibly hard.  Similarly, PCALL loses its
attraction when you consider that to be effective there can be little
or no interaction between the parallel computations.  Additionally,
both threads and PCALL are sequenced explicitly by the programmer.

Futures are more interesting.  Most people think of futures in
connection with lazy programming and demand computation, but a future
could be aggressively evaluated in parallel prior to need whenever the
future's preconditions are met - so in effect a future could be
thought of as an implicitly defined, demand scheduled thread.
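That reading of a future - eagerly evaluated once its preconditions are met,
blocking only at the demand point - can be sketched in a few lines.  Python
here for concreteness (the thread's Lisps have no single standard API for
this); the function and the workload size are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def expensive(n):
    return sum(i * i for i in range(n))

# An eagerly-evaluated future: work starts as soon as the future is
# created (i.e. when its preconditions are met), not when the value
# is finally demanded.
with ThreadPoolExecutor() as pool:
    fut = pool.submit(expensive, 10_000)  # evaluation begins immediately
    # ... other work can proceed in parallel here ...
    result = fut.result()                 # demand point: blocks only if not done
```

The demand point is the only place the caller can stall, which is exactly the
"implicitly defined, demand scheduled thread" view.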

However, source level futures as in Lisp/Scheme are too coarse grained
and evaluation is under the control of the programmer.  Because they
are thought of mostly for lazy evaluation where the preconditions are
expected to be in place at the time of evaluation, futures are not
typically written in a way that is conducive to parallel evaluation
(though the compiler could supplement the future with the necessary
precondition tests).

A lazy language may create a future implicitly whenever a computation
is non-strict in its parameters.  These implicit futures theoretically
may be as fine grained as primitive operations (they aren't in current
FPLs).  However, the current crop of lazy languages implement a strict
serial evaluation model - futures are computed on demand only when
their values are needed.
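For contrast, the strict demand-only model those languages implement is just
a memoized thunk - the suspended computation runs at most once, and only when
forced.  A minimal sketch (Python for concreteness; names are illustrative):

```python
class Thunk:
    """Call-by-need: the suspended computation runs at most once,
    and only when the value is actually demanded (forced)."""
    def __init__(self, fn):
        self._fn = fn
        self._done = False
        self._val = None

    def force(self):
        if not self._done:
            self._val = self._fn()   # computed on demand only
            self._done = True
        return self._val

calls = 0
def work():
    global calls
    calls += 1
    return 2 ** 16

t = Thunk(work)   # nothing computed yet: calls == 0
a = t.force()     # first demand: runs the computation
b = t.force()     # memoized: no recomputation
```

An aggressive implementation would start `work` in parallel as soon as the
thunk's preconditions held, rather than waiting for `force`.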


>In scientific computing we've been using "source level solutions" for 
>quite some time now -- if you count source-to-source translators as 
>"source level solutions" ;P  That solution scales to thousands of 
>processors.  Scaling is partially a system question but also partially 
>an algorithm question; programmers have to be trained to think "in 
>parallel."

Yes, but scientific programming is a small percentage of all
programming and big regular, easily decoupled problems are a minority
even there.  Problems which involve significant communication are
already difficult to program effectively for large scale, separate
memory machines.


>Potential parallelism in a function call is only worth extracting if 
>the data are sufficiently large and already distributed across the 
>processors.  Otherwise the communication costs of scattering and 
>gathering the data won't justify parallelization.

The cost of moving computations in a tightly coupled, shared memory
system is quite low.  I don't have a link handy at the moment, but I
read a study last year indicating that striping loop iterations (as
opposed to unrolling them on a single processor) can already be
profitable on some multi-core systems.
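"Striping" here means interleaving iterations across workers rather than
handing each worker a contiguous chunk.  A minimal sketch, with Python
threads standing in for cores (this shows only the decomposition - whether
it is profitable depends on the cache behaviour discussed above):

```python
from concurrent.futures import ThreadPoolExecutor

def striped_map(fn, data, n_workers=4):
    # Worker k takes iterations k, k + n_workers, k + 2*n_workers, ...
    def worker(k):
        return [(i, fn(data[i])) for i in range(k, len(data), n_workers)]
    out = [None] * len(data)
    with ThreadPoolExecutor(n_workers) as pool:
        for part in pool.map(worker, range(n_workers)):
            for i, v in part:
                out[i] = v
    return out

squares = striped_map(lambda x: x * x, list(range(10)))
```

On real hardware the stride interacts with cache-line ownership, which is
why the same decomposition can win on one multi-core and lose on another.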

I can envision a distributed solution based on aggressive evaluation
of futures - with a runtime similar to parallel OPS5 but starting from
any language with call-by-need semantics.  My solution would work only
for a small number of cores (maybe 8..16), but it would scale a
program automatically within its limits.

George
--
for email reply remove "/" from address

From: Tim Bradshaw
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <1170150198.920886.74780@a34g2000cwb.googlegroups.com>
On Jan 30, 8:34 am, George Neuner <·········@comcast.net> wrote:
> On 29 Jan 2007 09:06:34 -0800, ·············@gmail.com"
>
> We've already got quad-core general purpose CPUs and larger n-ways on
> the horizon.

We already have 8-core CPUs in general use, outside of x86, and have 
had for around a year.

I think it's fairly clear that pretty much any vanilla 4-core CPU will 
be starving for memory almost all the time (in particular the kinds of 
things that intel and AMD will be pushing this year).
From: George Neuner
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <abn1s21dh2u5k1roh89rlm5gmdd04hdk33@4ax.com>
On 30 Jan 2007 01:43:18 -0800, "Tim Bradshaw" <··········@tfeb.org>
wrote:

>On Jan 30, 8:34 am, George Neuner <·········@comcast.net> wrote:
>> On 29 Jan 2007 09:06:34 -0800, ·············@gmail.com"
>>
>> We've already got quad-core general purpose CPUs and larger n-ways on
>> the horizon.
>
>We already have 8-core CPUs in general use, outside of x86, and have 
>had for around a year.

I would hardly call UltraSparc a "general use" CPU - it has what now?
0.05% of the market.  I concede that it is "general purpose".

The vast majority of multicore machines are dual or quad.


>I think it's fairly clear that pretty much any vanilla 4-core CPU will 
>be starving for memory almost all the time (in particular the kinds of 
>things that intel and AMD will be pushing this year).

I don't think it's clear at all.

Starvation characteristics depend on the cache performance (frequency
of stall) and the CPU pipeline depth (how long to recover) - there is
nothing "vanilla" about it.

Studies done in the early 90's showed that, with appropriate caching,
12-16 CPUs could be supported by a shared memory system without
starvation.

In the late 90's CPUs had made such a huge leap in performance
relative to memory that the number of CPUs that could be served
without starvation dropped to under 4.  But this is no longer the
case.

Memory has made a huge comeback in the past few years while core
speeds have remained stable.  A modern 2GHz core running over PC-6400
actually has more memory bandwidth available than did a 500MHz R3000
over FPM in 1990.
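A back-of-envelope check of that claim, in bandwidth per core cycle.  The
memory figures below are rough estimates supplied here, not numbers from the
post (and note that real R3000s clocked nowhere near 500MHz - more like
25-40MHz - which only strengthens the comparison):

```python
# Rough peak figures; estimates, not measurements.
pc6400_bw  = 6.4e9   # bytes/s: DDR2 PC2-6400 peak transfer rate
fpm_bw     = 0.15e9  # bytes/s: generous estimate for 1990-era FPM DRAM
modern_clk = 2.0e9   # the 2 GHz core in the comparison
old_clk    = 0.5e9   # the post's 500 MHz figure

modern_bpc = pc6400_bw / modern_clk  # bytes of memory bandwidth per cycle
old_bpc    = fpm_bw / old_clk
# roughly 3.2 bytes/cycle today vs 0.3 bytes/cycle then
```

So even per cycle, never mind in absolute terms, the modern core has about
an order of magnitude more bandwidth to play with.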

The impact of a stall is greater on the modern core, but multiple
issue mitigates this somewhat because only the dependent pipeline(s)
need be stalled - unrelated processing can continue.

I am not aware of any recent academic studies on starvation in closely
coupled shared memory systems - nearly everyone lost interest after the
"Memory Wall" paper was published in the mid 90's.

George
--
for email reply remove "/" from address
From: Tim Bradshaw
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <1170321875.953002.82490@l53g2000cwa.googlegroups.com>
On Jan 31, 7:26 pm, George Neuner <·········@comcast.net> wrote:

>
> I would hardly call UltraSparc a "general use" CPU - it has what now?
> 0.05% of the market.  I concede that it is "general purpose".

I dunno how many T1000s and T2000s Sun have sold, but it's a fair
number.  But why does this matter?  The issue is that you can go out
and buy one for not a huge amount of money, and you don't need some
weird OS for it (they'll run linux I'm sure, if you wanted to).

>
> I don't think it's clear at all.

I think it is, actually.

> Studies done in the early 90's showed that, with appropriate caching,
> 12-16 CPUs could be supported by a shared memory system without
> starvation.

Of course.  Indeed you can support many more processors than that, and
plenty of boxes do so.  The point is that you need to do relatively
fancy things to do that.  The interconnect on these machines costs
much more than the processors do.  If you spend no money on
interconnect then you'll starve for most interesting loads.

> I am not aware of any recent academic studies on starvation in closely
> coupled shared memory systems - nearly everyone lost interest after the
> "Memory Wall" paper was published in the mid 90's.

I don't think the people who actually build systems did.  Of course,
they're not academics.  Likewise the people who use such systems in
anger are fairly interested in this.  And likewise, they are not
academics.

--tim
From: ············@gmail.com
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <1170354041.750463.107080@k78g2000cwa.googlegroups.com>
On Feb 1, 1:24 am, "Tim Bradshaw" <··········@tfeb.org> wrote:
> Of course.  Indeed you can support many more processors than that, and
> plenty of boxes do so.  The point is that you need to do relatively
> fancy things to do that.  The interconnect on these machines costs
> much more than the processors do.  If you spend no money on
> interconnect then you'll starve for most interesting loads.

Yes, actually this is a problem on a couple of the clusters at NERSC
(www.nersc.gov).  For many jobs, the dual-core Opterons of the
Jacquard cluster don't have the memory bandwidth to feed both
processors; people actually end up wasting one of the processors by
running only one process per node.  On the Seaborg cluster (16
processor per node IBM SP), people rarely run 16 processes per node;
if they do they get terrible performance (though I think part of the
problem is that the OS tends to occupy one processor with its duties;
AIX is a notoriously heavy-weight OS).

The specialized boxen of which you speak like the modern Crays (e.g.
the X1 and ilk) and the SGI Altix have specialized hardware to support
a shared-memory programming model even though they have network
hardware to connect the nodes, just like any other cluster.

Intel's 80-core proposal looks an awful lot like a distributed memory
system, from what I understand, though I'm sure they will include the
hardware to support a shared-memory model.  In order to get sufficient
memory bandwidth to feed the cores, they split up the DRAM into per-
core components.  They were even proposing a 2-D grid as one possible
inter-core communication linkup.

mfh
From: Tim Bradshaw
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <eptbkv$jg8$1$830fa7b3@news.demon.co.uk>
On 2007-02-01 18:20:41 +0000, ·············@gmail.com" 
<············@gmail.com> said:

> The specialized boxen of which you speak like the modern Crays (e.g.
> the X1 and ilk) and the SGI Altix have specialized hardware to support
> a shared-memory programming model even though they have network
> hardware to connect the nodes, just like any other cluster.

Actually I was thinking mostly of bigger Sun kit, but it's the same 
deal. (And I suspect they can quite easily starve for many loads as 
well).

> 
> Intel's 80-core proposal looks an awful lot like a distributed memory
> system, from what I understand, though I'm sure they will include the
> hardware to support a shared-memory model.  In order to get sufficient
> memory bandwidth to feed the cores, they split up the DRAM into per-
> core components.  They were even proposing a 2-D grid as one possible
> inter-core communication linkup.

I'm sure that's correct.
From: George Neuner
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <jjn8s217vm0ikb1nhb54qu02f0tcodbsti@4ax.com>
On 1 Feb 2007 10:20:41 -0800, ·············@gmail.com"
<············@gmail.com> wrote:

>On Feb 1, 1:24 am, "Tim Bradshaw" <··········@tfeb.org> wrote:
>> Of course.  Indeed you can support many more processors than that, and
>> plenty of boxes do so.  The point is that you need to do relatively
>> fancy things to do that.  The interconnect on these machines costs
>> much more than the processors do.  If you spend no money on
>> interconnect then you'll starve for most interesting loads.
>

>The specialized boxen of which you speak like the modern Crays (e.g.
>the X1 and ilk) and the SGI Altix have specialized hardware to support
>a shared-memory programming model even though they have network
>hardware to connect the nodes, just like any other cluster.

Yes. They are ccNUMA designs - distributed memory that looks/acts like
it is shared.

George
--
for email reply remove "/" from address
From: George Neuner
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <cbs4s2pjjouhdlu954c9g7aghurur3rpef@4ax.com>
On 1 Feb 2007 01:24:36 -0800, "Tim Bradshaw" <··········@tfeb.org>
wrote:

>On Jan 31, 7:26 pm, George Neuner <·········@comcast.net> wrote:
>
>> Studies done in the early 90's showed that, with appropriate caching,
>> 12-16 CPUs could be supported by a shared memory system without
>> starvation.
>
>Of course.  Indeed you can support many more processors than that, and
>plenty of boxes do so.  The point is that you need to do relatively
>fancy things to do that.  The interconnect on these machines costs
>much more than the processors do.  If you spend no money on
>interconnect then you'll starve for most interesting loads.

Ok, I agree with you that the interconnect is important - specifically
the cache coherence system.  However, you seem to be assuming that the
interconnects of all these new multi-core chips are naively
implemented.  I am assuming the opposite for exactly the reasons you
give below - the chip vendors and their customers have a vested
interest in efficiency.  Additionally, putting the interconnect
on-chip should be much cheaper and significantly faster than the
equivalent mechanism for separate CPUs.

>
>> I am not aware of any recent academic studies on starvation in closely
>> coupled shared memory systems - nearly everyone lost interest after the
>> "Memory Wall" paper was published in the mid 90's.
>
>I don't think the people who actually build systems did.  Of course,
>they're not academics.  Likewise the people who use such systems in
>anger are fairly interested in this.  And likewise, they are not
>academics.
>

By "academic" I mean "less likely to be commercially biased".
Certainly the chip vendors have done research to guide their
architectural decisions, but their studies are generally not available
to the public.

George
--
for email reply remove "/" from address
From: Tim Bradshaw
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <1170410254.105223.228900@k78g2000cwa.googlegroups.com>
On Feb 1, 11:16 pm, George Neuner <·········@comcast.net> wrote:

> Ok, I agree with you that the interconnect is important - specifically
> the cache coherence system.  However, you seem to be assuming that the
> interconnects of all these new multi-core chips are naively
> implemented.

Not at all.  I'm assuming (to put it simple-mindedly) that there is
only so much you can do with the number of wires you can get out of a
single package, and (critically) with the interconnect available from
that package to main memory and the design of that memory system, and
that dealing with that problem is an expensive issue which has
ramifications far beyond the design of the package alone.

Think about something like disks.  If I produce a disk which can
transfer data four times as fast as my previous design, what does that
mean to you as an integrator?

> I am assuming the opposite for exactly the reasons you
> give below - the chip vendors and their customers have a vested
> interest in efficiency.

Well, actually they have a vested interest in *sales*.  Their
customers may be interested in efficiency.  However I suggest (back to
the basis of this thread) that on the desktop most customers have no
clue at all beyond counting cores and GHz.

> Additionally, putting the interconnect
> on-chip should be much cheaper and significantly faster than the
> equivalent mechanism for separate CPUs.

If only it were that simple.  Again think of the disk analogy: what
does a typical desktop disk system look like? What does a typical (non-
redundant, for the sake of argument) enterprise system of the same
capacity look like?  Why?

--tim
From: George Neuner
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <ou88s2l0q3m5lpmuqjhgnddv2fuaeobu2u@4ax.com>
On 2 Feb 2007 01:57:34 -0800, "Tim Bradshaw" <··········@tfeb.org>
wrote:

>On Feb 1, 11:16 pm, George Neuner <·········@comcast.net> wrote:
>
>> Ok, I agree with you that the interconnect is important - specifically
>> the cache coherence system.  However, you seem to be assuming that the
>> interconnects of all these new multi-core chips are naively
>> implemented.
>
>Not at all.  I'm assuming (to put it simple-mindedly) that there is
>only so much you can do with the number of wires you can get out of a
>single package, and (critically) with the interconnect available from
>that package to main memory and the design of that memory system, and
>that dealing with that problem is an expensive issue which has
>ramifications far beyond the design of the package alone.

That's true.  But again, mid-level desktops today are sporting memory
systems whose CPU-to-memory speed differential is back at 1990 levels.
High end units have relatively more memory bandwidth than
their old counterparts.  I persist in bringing up the 90's because
that's when most of the publicly available SMP performance studies
were done.

Now, the situation changes as you add CPUs, but the hit each CPU takes
is critically dependent on its stall characteristics, cache
performance, and the width of its memory path.  Using Intel's
numbers for the Pentium 4, a snoopy design running over PC6400 should
be able to support up to 8 2GHz CPUs.  As you noted before, there are
other, more complex, coherence systems that are more performant and
other CPUs better suited to multiprocessor use - I am deliberately
picking on a relatively poor choice.

The situation can improve with cores.  With separate CPUs, the on-chip
L1 caches can't be snooped - the snoop circuitry is on (or above) the
external L2 cache, requiring that the L1 cache be write-through so it
can be (indirectly) monitored and lengthening each coherent write
cycle because of the transmission delay in going off chip.  With
multiple cores, the snoop circuitry can be on-chip where it can
monitor the L1 caches directly and performance can be improved by
making the L1 caches write-back.


>Think about something like disks.  If I produce a disk which can
>transfer data four times as fast as my previous design, what does that
>mean to you as an integrator?

It means nothing unless I have to upgrade the bus and chip set to
handle it.


>> Additionally, putting the interconnect
>> on-chip should be much cheaper and significantly faster than the
>> equivalent mechanism for separate CPUs.
>
>If only it were that simple.  Again think of the disk analogy: what
>does a typical desktop disk system look like? What does a typical (non-
>redundant, for the sake of argument) enterprise system of the same
>capacity look like?  Why?

Are you kidding?  A typical desktop has connections for 4 SATA drives
with hardware RAID 0 and RAID 1, plus 4 ATA-133 drives and 2 floppies
- the fact that nobody uses all of it is irrelevant.  Leaving aside
hot swap capability, today's desktop disk systems don't look all that
much different from high performance rack systems.  Despite what you
may have heard, SOL storage systems are quite rare except in
supercomputer centers - most enterprise storage is on UltraSCSI or
IEEE 1394 (Firewire), both of which are only moderately more
performant than 3Gb/s SATA.  

No drives can sustain transfers at anywhere near the speed of any of
these interfaces.  Not to mention that the correcting RAID systems
used in enterprise storage are slightly less performant than
non-correcting ones.


I come from a high performance background ... I have worked on and
with multiprocessors for many years - mostly SMP, but I've also worked
with DM-SIMD (CM-2) and DM-MIMD (CM-5) and with clusters.  In one of
my former lives, I was part of a team that designed and programmed a
proprietary image processing board using a DSP (used as a FP
microcontroller) to configure and sequence 4 FPGA processing elements
with symmetric point-to-point links, running over a dual ported, 320
MB/s _sustained_ throughput memory system built out of 100MHz SDRAM.

I do know something of the subject ... I'm not just regurgitating the
hype from PC Week.

George
--
for email reply remove "/" from address
From: Marc Battyani
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <DsKdnXh45Y8k91nYnZ2dnUVZ8ternZ2d@giganews.com>
"George Neuner" <·········@comcast.net> wrote

> That's true.  But again, mid-level desktops today are sporting memory
> systems whose CPU-to-memory speed differential is back at 1990 levels.
> High end units have relatively more memory bandwidth than
> their old counterparts.  I persist in bringing up the 90's because
> that's when most of the publicly available SMP performance studies
> were done.

Memory bandwidth has improved and will continue to improve, but latency
is getting worse, and memory performance is pathetic when you don't
access it sequentially or when you mix reads with writes.

[...]
> I come from a high performance background ... I have worked on and
> with multiprocessors for many years - mostly SMP, but I've also worked
> with DM-SIMD (CM-2) and DM-MIMD (CM-5) and with clusters.  In one of
> my former lives, I was part of a team that designed and programmed a
> proprietary image processing board using a DSP (used as a FP
> microcontroller) to configure and sequence 4 FPGA processing elements
> with symmetric point-to-point links, running over a dual ported, 320
> MB/s _sustained_ throughput memory system built out of 100Mhz SDRAM.

Yep, Lisp + FPGA is a really cool mix for HPC.
Anyway, as usual, each kind of hardware (CPU, n-cores, memory systems,
etc.) performs well on some problems and very badly on others.  This
seems trivial to say, but currently HPC means grids of standard CPUs,
and given the exponential growth of grid systems in terms of core
count, it seems obvious that there is a problem here.

(Note that I'm rather biased on this as I'm currently designing an FPGA 
based HPC system with a Lisp to hardware compiler. :)

Marc
From: Tim Bradshaw
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <eq1tv7$25c$2$8302bc10@news.demon.co.uk>
On 2007-02-03 11:01:05 +0000, "Marc Battyani" 
<·············@fractalconcept.com> said:

> Memory bandwidth has improved and will continue to improve, but latency
> is getting worse, and memory performance is pathetic when you don't
> access it sequentially or when you mix reads with writes.

Right, precisely that.
From: Petter Gustad
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <7dwt2xzrv8.fsf@www.gratismegler.no>
"Marc Battyani" <·············@fractalconcept.com> writes:

> (Note that I'm rather biased on this as I'm currently designing an
> FPGA based HPC system with a Lisp to hardware compiler. :)

Cool. Are you using Lisp as your hardware description language? Do you
have your own simulator, or is that any CL compiler? Did you write
your own synthesis tool in CL? What about place and route?

Petter
-- 
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
From: Marc Battyani
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <096dnUMqA5XuLVrYRVnyuwA@giganews.com>
"Petter Gustad" <·············@gustad.com> wrote
> "Marc Battyani" <·············@fractalconcept.com> writes:
>
>> (Note that I'm rather biased on this as I'm currently designing an
>> FPGA based HPC system with a Lisp to hardware compiler. :)
>
> Cool. Are you using Lisp as your hardware description language? Do you
> have your own simulator, or is that any CL compiler? Did you write
> your own synthesis tool in CL? What about place and route?

Well, P&R is mostly Altera's and Xilinx's job. I generate the design as VHDL 
or EDIF and use the existing FPGA back-ends. This is the approach used by
most higher-level language compilers, SystemC or Impulse C for instance.

Marc
From: Petter Gustad
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <7dejp3ofdf.fsf@www.gratismegler.no>
"Marc Battyani" <·············@fractalconcept.com> writes:

> Well, P&R is mostly Altera's and Xilinx's job. I generate the design
> as VHDL or EDIF and use the existing FPGA back-ends.

So you generate VHDL and EDIF from CL? 

Petter
-- 
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
From: Marc Battyani
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <fJqdnW8Xu7NrlFTYnZ2dnUVZ8q-mnZ2d@giganews.com>
"Petter Gustad" <·············@gustad.com> wrote
> "Marc Battyani" <·············@fractalconcept.com> writes:
>
>> Well, P&R is mostly Altera's and Xilinx's job. I generate the design
>> as VHDL or EDIF and use the existing FPGA back-ends.
>
> So you generate VHDL and EDIF from CL?

Yes. I also have a C syntax so that people can type expressions in a more 
familiar way (for them ;-).

Marc
From: Tim Bradshaw
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <eq1tsq$25c$1$8302bc10@news.demon.co.uk>
On 2007-02-03 10:04:50 +0000, George Neuner <·········@comcast.net> said:

> 
> Are you kidding?  A typical desktop has connections for 4 SATA drives
> with hardware RAID 0 and RAID 1, plus 4 ATA-133 drives and 2 floppies
> - the fact that nobody uses all of it is irrelevant.

The point I was making, which you missed, is that a typical desktop 
disk system is a single huge disk, which can probably manage a few 
hundred IOPS.  The very best disk system given the interconnect above 
would probably be 4 drives in RAID 1 which should sustain twice that 
under good conditions.  A mid-range enterprise storage system, 
configured properly, might sustain 4,000-8,000 IOPS (I don't know what the 
best-case benchmarks are: a database system I used to manage sustained 
that, basically 24 hours a day).  It does that by having many more, 
smaller, disks, a lot of battery-backed cache (I dunno what big systems 
have, the one I got these figures from had 1GB) in the array &c &c.

>  Leaving aside
> hot swap capability, today's desktop disk systems don't look all that
> much different from high performance rack systems.

Oh yes, they do.

>  Despite what you
> may have heard, SOL storage systems are quite rare except in
> supercomputer centers - most enterprise storage is on UltraSCSI or
> IEEE 1394 (Firewire), both of which are only moderately more
> performant than 3Gb/s SATA.

I've never seen enterprise storage on firewire!  Almost everything I 
deal with is either fibre or SCSI for smaller systems.  SAS, I
suspect, will destroy parallel SCSI (or perhaps already has).

--tim
From: Juan R.
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <1171114507.997423.316740@v33g2000cwv.googlegroups.com>
On Feb 3, 11:04 am, George Neuner <·········@comcast.net> wrote:
> Now, the situation changes as you add CPUs, but the hit each CPU takes
> is critically dependent on its stall characteristics, cache
> performance, and the width of its memory path.  Using Intel's
> numbers for the Pentium 4, a snoopy design running over PC6400 should
> be able to support up to 8 2GHz CPUs.

With what improvement?  Running as a hypothetical single core at 6GHz?
Less?
From: Juan R.
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <1171114387.623278.311960@v33g2000cwv.googlegroups.com>
On Feb 2, 10:57 am, "Tim Bradshaw" <··········@tfeb.org> wrote:
> However I suggest (back to
> the basis of this thread) that on the desktop most customers have no
> clue at all beyond counting cores and GHz.

I agree.  I remember a friend of mine who was quite disturbed that my
Pentium 75 was faster than her Pentium 100.
From: Petter Gustad
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <7direkor07.fsf@www.gratismegler.no>
George Neuner <·········@comcast.net> writes:

> Ok, I agree with you that the interconnect is important - specifically
> the cache coherence system.  However, you seem to be assuming that the
> interconnects of all these new multi-core chip are naively
> implemented.  I am assuming the opposite for exactly the reasons you
> give below - the chip vendors and their customers have a vested
> interest in efficiency.  Additionally, putting the interconnect
> on-chip should be much cheaper and significantly faster than the
> equivalent mechanism for separate CPUs.

The data rates you achieve on-chip are superior to what you get
externally, especially over cables.

However, a majority of the current multi-core CPUs use snooping
for cache coherency.  Each CPU has to snoop/listen on the shared bus
for accesses to cached data.  This does not scale very well.  Distributed
cache coherency protocols with distributed directories scale much
better, but are more complex.

This is an interesting thread. Are there any parallel Lisps in use,
especially NUMA aware implementations?  I once tried to get NetCLOS
running without very much success. What about the *Lisp which ran on
the Connection Machine, is it still available?

Petter
-- 
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
From: ············@gmail.com
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <1170435966.846946.248890@j27g2000cwj.googlegroups.com>
On Feb 2, 1:20 am, Petter Gustad <·············@gustad.com> wrote:
> running without very much success. What about the *Lisp which ran on
> the Connection Machine, is it still available?

I haven't looked at *Lisp but I have looked at C* (another data-
parallel language for the Connection Machine).  It looks somewhat
primitive as compared with more modern data-parallel C-like
languages.

*Lisp could be interesting from a language design perspective, but
it's from the early days of parallel computing (so the performance
models were different) and the CM libraries probably aren't in use
anymore (so having the source code won't help a whole lot).

I'm thinking we'll want an incremental approach -- get some
message-passing into CL ASAP, then start focusing on language design.
The smartest language designers in the world are working on the design
problem and it's taking them a loooooong time ;P

Message passing needs at least the following:

1. Pickling (Python-style) -- so that we can feed stuff into
MPI_Send, etc.
2. A smart interface to MPI 1.1-level functions.

#1 is partly done (cl-pickle? I forget what it's called).  Pickling
closures is thorny, especially if you're working on a heterogeneous
cluster (if you compile to machine code on machine X and try to run it
on machine Y).  Full support for that would need heavy CL compiler
support.

Once you have pickling, #2 is an exercise in CFFI and CLOS.
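The Python-style pickling of step 1 is easy to sketch, closure trouble
included (a minimal Python illustration of the idea, not the CL
interface under discussion):

```python
import pickle

# Plain data structures round-trip cleanly through a byte buffer,
# which is what an MPI_Send wrapper would transmit.
message = {"rank": 0, "payload": [1, 2, (3.0, "four")]}
buf = pickle.dumps(message)          # serialize to bytes
assert pickle.loads(buf) == message  # deserialize on the receiving side

# Closures, however, do not pickle: their compiled code is tied to the
# process that created them -- the thorny heterogeneous-cluster case
# mentioned above, which needs compiler support to solve properly.
def make_adder(n):
    return lambda x: x + n

try:
    pickle.dumps(make_adder(5))
except Exception as e:
    print("closure pickling failed:", type(e).__name__)
```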

Actually I was at a conference this week where a mathematician was
using Lisp and PVM for parallel code.  His Lisp is custom and
simplified, not a CL.  He actually uses his custom garbage collector
to do the work of a pickler (it knows how to traverse data structures
and compact them) -- not a bad idea if you want to commit to an open-
source CL with hooks into the GC.

mfh
From: Fred Gilham
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <u7abzwxuf6.fsf@snapdragon.csl.sri.com>
The *Lisp simulator (starsim) is on the Franz examples web site:

  http://examples.franz.com/index.html

I have, at one time or another, gotten it to work under CMUCL.
J. P. Massar, the author, updated it some during the last few years.

-- 
Fred Gilham                                  ······@csl.sri.com
Our original rights freed us from the state's power. Nowadays, most
alleged "rights" increase the state's power over us. -- Joseph Sobran
From: Petter Gustad
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <7d1wl522kz.fsf@www.gratismegler.no>
·············@gmail.com" <············@gmail.com> writes:

> 2. A smart interface to MPI 1.1 - level functions.

Isn't MPI ported to CL, even through some kind of FFI yet?


Petter
-- 
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
From: ············@gmail.com
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <1170697257.968524.193630@a34g2000cwb.googlegroups.com>
On Feb 4, 10:35 am, Petter Gustad <·············@gustad.com> wrote:
> ·············@gmail.com" <············@gmail.com> writes:
> > 2. A smart interface to MPI 1.1 - level functions.
>
> Isn't MPI ported to CL, even through some kind of FFI yet?

Yes:  for example, GCL comes with an MPI interface:

http://www.gnu.org/software/gcl/

Here is the original paper:

http://citeseer.ist.psu.edu/star95starmpi.html

and there have been other projects, as googling "common lisp mpi" will
reveal.  I think ParGCL has been maintained more actively; we might do
well to start there and port the ideas to other CL systems.

If we link to MPI, we have the option of either pickling (which incurs
the memory cost of a character buffer for pickling/unpickling) or
send/recv in place (which requires that arrays hold unboxed objects
and be laid out in memory so they can be passed as a C pointer).  In
place is useful if we're sending a large amount of data.

I need to look at ParGCL more closely to see their approach --

mfh
From: Marcus Breiing
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <epn4ek$p2o$1@chessie.cirr.com>
George Neuner <·········@comcast.net> writes:

> I can envision a distributed solution based on aggressive evaluation
> of futures ... starting from any language with call-by-need
> semantics.

Programs written for lazy languages will typically serve up lots of
computations that are _never_ used. If you aggressively evaluate them
anyway, doing it in parallel without starving the original threads may
avoid nontermination, but could nevertheless be a horribly inefficient
use of those cores.

-- 
Marcus Breiing
(Cologne, Germany)
From: John Thingstad
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <op.tmy5koqcpqzri1@pandora.upc.no>
On Tue, 30 Jan 2007 10:51:11 +0100, Marcus Breiing  
<······@2007w04.mail.breiing.com> wrote:

> George Neuner <·········@comcast.net> writes:
>
>> I can envision a distributed solution based on aggressive evaluation
>> of futures ... starting from any language with call-by-need
>> semantics.
>
> Programs written for lazy languages will typically serve up lots of
> computations that are _never_ used. If you aggressively evaluate them
> anyway, doing it in parallel without starving the original threads may
> avoid nontermination, but could nevertheless be a horribly inefficient
> use of those cores.
>

Where is Duane Rettig?
His first project for Franz was developing an ACL version for a Cray (I
think; a supercomputer, anyway).
I learned about Amdahl's law from him (see Wikipedia).
What else can he teach me?
He should be in this discussion!
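Amdahl's law itself takes one line to state (a quick Python sketch,
my illustration rather than anything from the thread):

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Maximum speedup when only `parallel_fraction` of the runtime
    can be spread across `n_cores`; the rest stays serial."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_cores)

# Even with 95% of the work parallelizable, the serial 5% caps the
# speedup at 1/0.05 = 20x no matter how many cores you add.
print(round(amdahl_speedup(0.95, 4), 2))     # 3.48 on 4 cores
print(round(amdahl_speedup(0.95, 1024), 2))  # 19.64 on 1024 cores
```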

-- 
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
From: Paul Wallich
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <epo12e$73f$5@reader2.panix.com>
John Thingstad wrote:
> On Tue, 30 Jan 2007 10:51:11 +0100, Marcus Breiing 
> <······@2007w04.mail.breiing.com> wrote:
> 
>> George Neuner <·········@comcast.net> writes:
>>
>>> I can envision a distributed solution based on aggressive evaluation
>>> of futures ... starting from any language with call-by-need
>>> semantics.
>>
>> Programs written for lazy languages will typically serve up lots of
>> computations that are _never_ used. If you aggressively evaluate them
>> anyway, doing it in parallel without starving the original threads may
>> avoid nontermination, but could nevertheless be a horribly inefficient
>> use of those cores.
>>
> 
> Where is Duane Retting.
> His first project for Franz was developing a ACL version for Cray (I think,
> a supercomputer anyway).
> I learnt Amdahl's law from him. (wikipedia)
> What else can he teach me?
> He should be in this discussion!

The big question is whether inefficient use of cores is a bad thing. If 
bandwidth to big nonshared memory is the bottleneck, then it may be less 
important (except perhaps for power consumption) what those cores are 
doing. The ultimate question will be more about a programming model that 
gets you peak speed when it's possible.

paul
From: fortunatus
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <1170184669.496805.225930@v33g2000cwv.googlegroups.com>
Has anyone thought about implementing a Software
Transactional Memory system in Lisp?  I'm curious
about thoughts on whether it can be done at the
language level (with macros, etc) or whether under-
the-hood support would be needed from the
implementation...  assuming of course that the
implementation already provides threads.

This would add an (ATOMIC ...) form, for instance.
(Yes, I am inspired by this month's Queue magazine
article!  http://www.acmqueue.com/modules.php?
name=Content&pa=showpage&pid=444)
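To make the question concrete, here is roughly what such an
(ATOMIC ...) form has to do under the hood -- a toy optimistic-versioning
sketch in Python (my illustration; not cl-stm or any real STM design,
and a real one would track read/write sets per thread far more
carefully):

```python
import threading

class TVar:
    """A transactional cell: a value plus a version counter."""
    def __init__(self, value):
        self.value = value
        self.version = 0

_commit_lock = threading.Lock()

def atomic(txn):
    """Run `txn(read, write)` optimistically, retrying on conflict."""
    while True:
        reads = {}   # tvar -> version observed
        writes = {}  # tvar -> tentative new value
        def read(tv):
            if tv in writes:
                return writes[tv]
            reads[tv] = tv.version
            return tv.value
        def write(tv, val):
            writes[tv] = val
        result = txn(read, write)
        with _commit_lock:
            # Commit only if nothing we read was modified meanwhile.
            if all(tv.version == v for tv, v in reads.items()):
                for tv, val in writes.items():
                    tv.value = val
                    tv.version += 1
                return result
        # A concurrent commit invalidated our reads: retry.

counter = TVar(0)
def increment(read, write):
    write(counter, read(counter) + 1)

threads = [threading.Thread(
               target=lambda: [atomic(increment) for _ in range(1000)])
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter.value)  # 4000 -- no lost updates despite the races
```

The open question from the post stands: doing this portably at the
language level (macros around special variables) versus efficiently
(implementation hooks) are quite different projects.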

And how about thread dispatch onto these 8-way
cores?  Are implementations working with the
multicore SPARC chip in the Sun T1000, for
instance?
From: Eric Marsden
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <87zm80b1fs.fsf@free.fr>
>>>>> "de" == fortunatus  <··············@excite.com> writes:

  de> Has anyone thought about implementing a Software Transactional
  de> Memory system in Lisp? I'm curious about thoughts on whether it
  de> can be done at the language level (with macros, etc) or whether
  de> under- the-hood support would be needed from the
  de> implementation... assuming of course that the implementation
  de> already provides threads.

  <http://common-lisp.net/project/cl-stm/>
  
-- 
Eric Marsden
From: Tim Bradshaw
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <1170239362.638279.199070@h3g2000cwc.googlegroups.com>
On Jan 30, 5:59 pm, Paul Wallich <····@panix.com> wrote:

>
> The big question is whether inefficient use of cores is a bad thing.

It's a bad thing because it's making poor use of expensive silicon.
The only valid (as opposed to marketing) reason to put more cores on a
chip is that it is (or is hoped to be) the best way of getting more
performance out of the resources (silicon, in other words) available.
If the cores are starving for memory then a better use of the
resources would be to deal with that problem.

(This is similar to the reason why processors typically don't
speculatively execute both sides of a branch: the maximum possible
utilization of that is 50%, so it's generally better to predict and
then speculatively execute the branch you predict to be taken, which
can do much better than 50% on typical code.)

--tim
From: Rob Warnock
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <vtCdndXOhO27XV3YnZ2dnUVZ_hynnZ2d@speakeasy.net>
Tim Bradshaw <··········@tfeb.org> wrote:
+---------------
| Paul Wallich <····@panix.com> wrote:
| > The big question is whether inefficient use of cores is a bad thing.
| 
| It's a bad thing because it's making poor use of expensive silicon.
+---------------

And even poorer use of all that extra leakage current
[excess heat, a.k.a. wasted electricity]. As the cores get
smaller & smaller, the idle power becomes a larger and larger
fraction of the full-out running power. So fewer cores used
more efficiently means more computations/kW-hr than many
cores used less efficiently, even if the computations/hr
happens to be the same.


-Rob

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607
From: John Thingstad
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <op.tm008fu0pqzri1@pandora.upc.no>
On Wed, 31 Jan 2007 17:13:26 +0100, Rob Warnock <····@rpw3.org> wrote:

>
> And even poorer use of all that extra leakage current
> [excess heat, a.k.a. wasted electricity]. As the cores get
> smaller & smaller, the idle power becomes a larger and larger
> fraction of the full-out running power. So fewer cores used
> more efficiently means more computations/kW-hr than many
> cores used less efficiently, even if the computations/hr
> happens to be the same.
>

The Intel 940 quad core is actually more energy efficient than the
800 series. Partly the 65nm process uses less energy than the 90nm one;
partly it has better power management -- it shuts down parts of the
processor that are not in use.

Far worse are today's high-end GPUs.
A dual 8800 NVIDIA GPU uses 4 PCI power connectors, draws about 1 kW,
and produces heat like a toaster. You need a CPU fan, two GPU fans, a
fan to pull the air in, another to push it out, and a third to push it
past the GPUs. Not exactly quiet, then. More like having a fridge in
your office, except that it runs all the time.

-- 
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
From: Tim Bradshaw
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <1170263699.133916.273130@h3g2000cwc.googlegroups.com>
On Jan 31, 4:25 pm, "John Thingstad" <··············@chello.no> wrote:
>
> The Intel 940 quad core is actually more energy efficient than the
> 800 series. Partly the 65nm technology uses less energy than the 90nm.
> Partly better power management. It shuts down parts of the processor that
> are not in use.

It's not really safe to generalise from a single instance, of course.
Clearly as the number of cores goes up various tradeoffs have to be
made.  Of course, one thing that I don't think Rob mentioned is that
one thing you can do, *if* you have enough parallelism to exploit, is
to have a large number of cores each clocked relatively slowly.  Power
goes as some quite nasty power of clock speed (2? 4?), so you win this
way.  That's a big if, of course, but there are very commercially
important application areas which are essentially embarrassingly
parallel, the obvious one being web servers, and Sun, say, are
targeting this fairly well: I think that the current Niagara chips run
at 1.1 or 1.5 GHz or something (of course there have never been SPARCs
which were clocked very fast).
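The "many slow cores" arithmetic is worth making concrete.  A
back-of-envelope Python sketch (my numbers; it assumes dynamic power
scales roughly as f*V^2 with voltage scaled alongside frequency, giving
~f^3, and it ignores the leakage term raised elsewhere in this thread):

```python
def dynamic_power(freq, base_freq=1.0, base_power=1.0, exponent=3.0):
    """Dynamic power scaling: roughly f * V^2, with V ~ f, gives f^3."""
    return base_power * (freq / base_freq) ** exponent

# One core at 2.0 GHz vs four cores at 0.5 GHz: the same aggregate
# clock throughput *if* the workload parallelizes perfectly...
one_fast  = dynamic_power(2.0)
four_slow = 4 * dynamic_power(0.5)
print(one_fast, four_slow)  # 8.0 vs 0.5 -- a 16x dynamic-power win
```

That 16x is exactly why the "big if" matters: with no parallelism to
exploit, the four slow cores also run the job four times slower.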

As an aside: is the Intel quad core processor actually quad core?  The
only one I'm aware of is two dual-core processors strapped onto one
package.

--tim
From: George Neuner
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <8dr1s2hnv1gru2fpm9l5enusbfmn9pktq9@4ax.com>
On 31 Jan 2007 09:14:59 -0800, "Tim Bradshaw" <··········@tfeb.org>
wrote:

>
>As an aside: is the Intel quad core processor actually quad core?  The
>only one I'm aware of is two dual-core processors strapped onto one
>package.
>
>--tim

You're correct.  The real quad core chip is due late this year.

George
--
for email reply remove "/" from address
From: John Thingstad
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <op.tm1bjaa5pqzri1@pandora.upc.no>
On Wed, 31 Jan 2007 18:14:59 +0100, Tim Bradshaw <··········@tfeb.org>  
wrote:

> On Jan 31, 4:25 pm, "John Thingstad" <··············@chello.no> wrote:
>>
>> The Intel 940 quad core is actually more energy efficient than the
>> 800 series. Partly the 65nm technology uses less energy than the 90nm.
>> Partly better power management. It shuts down parts of the processor  
>> that
>> are not in use.
>
> It's not really safe to generalise from a single instance, of course.
> Clearly as the number of cores go up various tradeoffs have to be
> made.  Of course, one thing that I don't think Rob mentioned is that
> one thing you can do, *if* you have enough pllism to exploit, is to
> have a large number of cores each clocked relatively slowly.  Power
> goes as some quite nasty power of clock speed (2? 4?), so you win this
> way.  That's a big if, of course, but there are very commercially
> important application areas which are essentially embarrassingly
> parallel, the obvious one being web servers, and Sun, say, are
> targeting this fairly well: I think that the current Niagara chips run
> at 1.1 or 1.5 GHz or something (of course there never have been SPARCs
> which were clocked very fast).
>
> As an aside: is the Intel quad core processor actually quad core?  The
> only one I'm aware of is two dual-core processors strapped onto one
> package.
>
> --tim
>

http://www.intel.com/quad-core/index.htm?iid=homepage_quad-core

-- 
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
From: Tim Bradshaw
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <epqu9t$187$1$8300dec7@news.demon.co.uk>
On 2007-01-31 20:07:48 +0000, "John Thingstad" <··············@chello.no> said:
> http://www.intel.com/quad-core/index.htm?iid=homepage_quad-core
> 

That's the one I thought: two dual-core dies in one package.  It's 
basically a denser way of packaging a pair of dual-core CPUs: in 
particular I don't think they share much cache.

--tim
From: Rob Warnock
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <k5mdnWn97PoFIlzYnZ2dnUVZ_rylnZ2d@speakeasy.net>
Tim Bradshaw <··········@tfeb.org> wrote:
+---------------
| It's not really safe to generalise from a single instance, of course.
| Clearly as the number of cores go up various tradeoffs have to be
| made.  Of course, one thing that I don't think Rob mentioned is that
| one thing you can do, *if* you have enough pllism to exploit, is to
| have a large number of cores each clocked relatively slowly.  Power
| goes as some quite nasty power of clock speed (2? 4?), so you win
| this way.
+---------------

Actually, my point was that as transistor geometries get ever smaller,
leakage current has become a serious source of power dissipation
even when running at *zero* clock speed -- and it will continue to grow
worse, unless something changes drastically[1], becoming a larger and
larger percentage of the total power budget. [Some claim it's *already*
as much as 50% of the power!]


-Rob

[1] I'm not sure the recent announcements by Intel & IBM on "high-K"
    dielectrics count as a "drastic" change, though they certainly have
    the possibility of pushing back the "leakage wall" a little longer.

    http://www.computerworld.com/blogs/node/4462
    http://arstechnica.com/news.ars/post/20070127-8716.html
    http://www.betanews.com/article/IBM_Also_Reinvents_the_Transistor/1170200107
    http://www.playfuls.com/news_05981_IBM_AMD_and_Intel_Fight_for_Supremacy_in_45nm_Chip_Technology.html

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607
From: Tim Bradshaw
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <1170327149.712471.194990@p10g2000cwp.googlegroups.com>
On Feb 1, 10:23 am, ····@rpw3.org (Rob Warnock) wrote:
> Tim Bradshaw <··········@tfeb.org> wrote:

>
> Actually, my point was that as transistor geometries get ever smaller,
> the leakage current has become a serious source of power dissipation
> even when running at *zero* clock speed -- and will continue to grow
> even worse, unless something changes drastically[1], and become a
> larger & larger percentage of the total power budget. [Some claim
> it's *already* as much as 50% of the power!]
>

Yes, sorry, I agree with that of course.  The underlying issue must be
that throwing more cores at something is no more a solution than
ramping clock speed, or throwing cache at it.  You have to actually
measure the tradeoffs on loads that matter to you.

--tim
From: Vassil Nikolov
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <yy8vhctdy0ig.fsf@eskimo.com>
| Tim Bradshaw <··········@tfeb.org> wrote:
| +---------------
| | ...
| | one thing you can do, *if* you have enough pllism to exploit, is to
| | have a large number of cores each clocked relatively slowly.

  And of course, we know that along such lines we can have about ten
  billion processing units running at about 100 degrees Fahrenheit,
  with power consumption only in the double digits (in watts)...

  ---Vassil.


-- 
Our programs do not have bugs; it is just that the users' expectations
differ from the way they are implemented.
From: Paul Wallich
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <epqsg0$eql$1@reader2.panix.com>
Tim Bradshaw wrote:
> On Jan 30, 5:59 pm, Paul Wallich <····@panix.com> wrote:
> 
>> The big question is whether inefficient use of cores is a bad thing.
> 
> It's a bad thing because it's making poor use of expensive silicon.
> The only valid (as opposed to marketing) reason to out more cores on a
> chip is that it is (or is hoped to be) the best way of getting more
> performance out of the resources (silicon in other words) available.
> If the cores are starving for memory then a better use of the
> resources would be to deal with that problem.

If you can. Once you get off-chip, the cost of increasing memory 
bandwidth can make the cost of putting another few cores on the die look 
like small change. So the question is whether a bit more cache is going 
to do you better than more cores, and that's (imo) a tradeoff that 
should be looked at. If more cores or more functional units will buy you 
more speed, even though not as much more as you'd like, you may still be 
happy.

(I may have misspoken when I said "inefficient use of cores" -- by 
definition, inefficiency is worse than efficiency, but in many cases 
partial utilization of resources is going to be the best you can do. 
See, for example, the terrible inefficiency of RAM chips, where only a 
few transistors out of millions are doing anything useful on a given 
clock cycle.)

> (This is similar to the reason why processors typically don't
> speculatively execute both sides of a branch: the maximum possible
> utilization of that is 50%, so it's generally better to predict and
> then speculatively execute the branch you predict to be taken, which
> can do much better than 50% on typical code.)

Absolutely. Then take the example someone gave of striping certain loops 
across cores rather than unrolling them -- the prologue and epilogue 
sequences will almost certainly involve less-efficient utilization of 
cores than a single-core loop, but you'll still be getting better speed.

In the short run (if things can be properly handled) it seems that the 
kindness of strangers should (ha!) keep multiple cores occupied fairly 
well (currently 76 processes visible on this machine, albeit not all 
simultaneously contending for CPU). But in the longer run, yeah, we're 
going to need different notations and ways of handling parallelism at 
many different levels of granularity (duh!)

And applications that are computation-intensive but not embarrassingly 
parallel will probably be socially ostracized.

paul
From: George Neuner
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <f2s1s293hm13pvkul9fhgfpb40emht2421@4ax.com>
On 30 Jan 2007 10:51:11 +0100, Marcus Breiing
<······@2007w04.mail.breiing.com> wrote:

>George Neuner <·········@comcast.net> writes:
>
>> I can envision a distributed solution based on aggressive evaluation
>> of futures ... starting from any language with call-by-need
>> semantics.
>
>Programs written for lazy languages will typically serve up lots of
>computations that are _never_ used.

This is true.  However, I would make the case that the programmer is
ill equipped to enumerate the computations that _will_ be used in any
particular run of the program, and therefore source level methods can
only be partially effective.

>If you aggressively evaluate them anyway, doing it in parallel 
>without starving the original threads may avoid nontermination,

Backward chaining ensures that the data supporting a computation is
available when needed, but it is a strictly serial process.  Forward
chaining allows any partial computation to be performed whenever its
supporting data becomes available, possibly ahead of its need.

Futures from program paths that aren't taken may never be evaluated
because of unsatisfied preconditions.  Some useless computations will
slip through, though.  The ideal would be some method of deciding
which futures are guaranteed to be used and computing only those
futures in parallel - leaving everything else to be computed lazily.
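That split can be sketched with thread-pool futures (a Python stand-in
for the CL futures under discussion; the "guaranteed" set here is
hand-annotated, whereas the hard problem is of course inferring it
automatically):

```python
from concurrent.futures import ThreadPoolExecutor

def expensive(x):
    return x * x  # stands in for a costly subcomputation

pool = ThreadPoolExecutor(max_workers=4)

# Futures we can prove will be demanded: evaluated eagerly, in parallel.
guaranteed = {x: pool.submit(expensive, x) for x in (1, 2, 3)}

# Everything else stays lazy: a thunk, forced only if the program
# actually takes the path that needs it.
speculative = {x: (lambda x=x: expensive(x)) for x in (4, 5)}

# Suppose the path taken at runtime needs 2 (eager) and 5 (lazy):
print(guaranteed[2].result())  # 4  -- likely computed already
print(speculative[5]())        # 25 -- computed only now
pool.shutdown()
```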


>but could nevertheless be a horribly inefficient use of those 
>cores.

As opposed to running serial programs that leave them idle?  There is
a lot of inherent parallelism in programs which is currently not
exploited.  I agree, in principle, that maximal aggression is overkill
and a waste of resources ... but the current situation - serial
plodding with ad hoc programmer driven parallelization - is not
desirable either.

Do you have some other idea?

George
--
for email reply remove "/" from address
From: George Neuner
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <bc62s21aqat1d93sbiteqd57d6ki2tmup4@4ax.com>
On Wed, 31 Jan 2007 15:24:45 -0500, George Neuner
<·········@comcast.net> wrote:

>Backward chaining computations ensures that data supporting the
>computation is available when needed, but it is a strictly serial
>process.  Forward chaining allows any partial computation to be
>performed whenever its supporting data becomes available, possibly
>ahead of its need.

I should have said "serial in the general case".  Pure lazy languages
can certainly evaluate subproblem futures in parallel.  For languages
with side effects it is not clear that this can be done.

But even assuming that subproblems are evaluated in parallel, the best
a lazy language can do is to have runtime proportional to the longest
path.  A language with aggressive future evaluation has the potential
to do better by computing subproblems before they are needed.

George

--
for email reply remove "/" from address
From: Juan R.
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <1171114239.318125.325350@l53g2000cwa.googlegroups.com>
On Jan 30, 9:34 am, George Neuner <·········@comcast.net> wrote:
> Keep in mind first that we are talking about shared memory machines
> and second that multi-core processors are much more closely coupled
> than the parallel CPUs of previous generation machines.

On the desktop or in general?

> We've already got quad-core general purpose CPUs and larger n-ways on
> the horizon.  Obviously some of the power will go to running multiple
> programs simultaneously (or in the case of Vista, just running the
> operating system).  But the only reasonable way to speed up any single
> program is to spread the computations over multiple cores.

If the single program is parallelizable, yes; if it is not, then it
will probably run poorly.

I know of an industrial research group that predicted it would need
more than 3 years of supercomputer time for a liquid simulation run in
parallel, and contacted the Supercomputation Center people about
optimizing the code. They reduced the time to 2 months running on an
SVG with AMD Opteron _without_ parallelism, because they found the
simulation was not really parallelizable to any large degree.

> >In scientific computing we've been using "source level solutions" for
> >quite some time now -- if you count source-to-source translators as
> >"source level solutions" ;P  That solution scales to thousands of
> >processors.  Scaling is partially a system question but also partially
> >an algorithm question; programmers have to be trained to think "in
> >parallel."
>
> Yes, but scientific programming is a small percentage of all
> programming and big regular,

What do you mean by "big regular"?

> Problems which involve significant communication are
> already difficult to program effectively for large scale, separate
> memory machines.

Wasn't HPF (1993) specifically designed for large-scale scientific
computing on separate-memory machines?
From: ············@gmail.com
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <1171213624.131213.140180@v45g2000cwv.googlegroups.com>
On Feb 10, 5:30 am, "Juan R." <··············@canonicalscience.com>
wrote:
> I know of an industrial research predicting would need more than 3
> years of supercomputer time for a liquid simulation in parallel way
> and contacted with Supercomputation Center people for the optimization
> of code. They reduced time to 2 months runnig in a SVG with AMD
> Opteron _without_ parallelism because checked the simulation was not
> really parallelizable at large degree.

Hm, from my experience that suggests they were doing something wrong
in the algorithm.

> > Problems which involve significant communication are
> > already difficult to program effectively for large scale, separate
> > memory machines.
>
> Was not HPF 93 specifically designed for large scale scientific
> computing in separate memory machines?

HPF didn't quite take off as planned, perhaps because it tried to do
too much (though I'm not certain about that).  The next release of
Fortran (2008) will include Co-Array Fortran, which presents a
distributed shared-memory parallel model.

"Significant communication" should be replaced with "irregular
patterns of communication" :)

mfh
From: Juan R.
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <1171715803.261968.195120@p10g2000cwp.googlegroups.com>
On Feb 11, 6:07 pm, ·············@gmail.com" <············@gmail.com>
wrote:
> On Feb 10, 5:30 am, "Juan R." <··············@canonicalscience.com>
> wrote:
>
> > I know of an industrial research predicting would need more than 3
> > years of supercomputer time for a liquid simulation in parallel way
> > and contacted with Supercomputation Center people for the optimization
> > of code. They reduced time to 2 months runnig in a SVG with AMD
> > Opteron _without_ parallelism because checked the simulation was not
> > really parallelizable at large degree.
>
> Hm, from my experience that suggests they were doing something wrong
> in the algorithm.

What exactly do you mean?
From: ············@gmail.com
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <1171831264.620151.283380@s48g2000cws.googlegroups.com>
On Feb 17, 4:36 am, "Juan R." <··············@canonicalscience.com>
wrote:
> > > I know of an industrial research predicting would need more than 3
> > > years of supercomputer time for a liquid simulation in parallel way
> > > and contacted with Supercomputation Center people for the optimization
> > > of code. They reduced time to 2 months runnig in a SVG with AMD
> > > Opteron _without_ parallelism because checked the simulation was not
> > > really parallelizable at large degree.
>
> > Hm, from my experience that suggests they were doing something wrong
> > in the algorithm.
>
> What exactly do you mean?


1. Google "parallel Navier-Stokes solver" and observe the many hits ;P

2. If they chose to parallelize a discrete-state-machine-based
approach -- that involves mostly local communication, so they should
get pretty decent speedups, especially if they use domain
decomposition techniques.  In fact the *Lisp documentation includes a
simplified example, which I imagine they wouldn't include if it
couldn't be effectively parallelized on their machine (it looks bad
for marketing if your documented example is slower when run in
parallel ;P ).
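Domain decomposition with mostly local communication fits in a few
lines.  A serial Python stand-in (my illustration, not *Lisp or any CM
code): each "rank" is a slice of a 1-D array, and the halo exchange is
simulated by copying one ghost cell from each neighbour -- exactly the
data a real code would ship with MPI sends:

```python
def smooth_step(cells):
    """Reference: one 3-point averaging step over the whole domain."""
    n = len(cells)
    return [(cells[max(i - 1, 0)] + cells[i] + cells[min(i + 1, n - 1)]) / 3.0
            for i in range(n)]

def smooth_decomposed(cells, nparts):
    """Same step, computed subdomain by subdomain with ghost cells."""
    n = len(cells)
    size = n // nparts
    out = []
    for p in range(nparts):
        lo = p * size
        hi = ((p + 1) * size) if p < nparts - 1 else n
        # Halo exchange: one ghost cell per side (clamped at the edges).
        left = cells[lo - 1] if lo > 0 else cells[0]
        right = cells[hi] if hi < n else cells[n - 1]
        local = [left] + cells[lo:hi] + [right]
        out.extend((local[i - 1] + local[i] + local[i + 1]) / 3.0
                   for i in range(1, len(local) - 1))
    return out

data = [float(i * i % 7) for i in range(10)]
print(smooth_step(data) == smooth_decomposed(data, 3))  # True
```

The point of the exercise: each subdomain only ever touches its own
slice plus two ghost cells, so communication volume is O(1) per
subdomain per step regardless of subdomain size.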

mfh
From: Juan R.
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <1172332454.940265.325870@j27g2000cwj.googlegroups.com>
On Feb 18, 9:41 pm, ·············@gmail.com" <············@gmail.com>
wrote:
> On Feb 17, 4:36 am, "Juan R." <··············@canonicalscience.com>
> wrote:
>
> > > > I know of an industrial research predicting would need more than 3
> > > > years of supercomputer time for a liquid simulation in parallel way
> > > > and contacted with Supercomputation Center people for the optimization
> > > > of code. They reduced time to 2 months runnig in a SVG with AMD
> > > > Opteron _without_ parallelism because checked the simulation was not
> > > > really parallelizable at large degree.
>
> > > Hm, from my experience that suggests they were doing something wrong
> > > in the algorithm.
>
> > What exactly do you mean?
>
> 1. Google "parallel Navier-Stokes solver" and observe the many hits ;P

For a while I thought you were talking seriously.

> 2. If they chose to parallelize a discrete-state-machine-based
> approach -- that involves mostly local communication, so they should
> get pretty decent speedups, especially if they use domain
> decomposition techniques.  In fact the *Lisp documentation includes a
> simplified example, which I imagine they wouldn't include if it
> couldn't be effectively parallelized on their machine (it looks bad
> for marketing if your documented example is slower when run in
> parallel ;P ).

The simulation was presented at a recent conference. I think it is this
one:

[http://www.icpf.cas.cz/theory/liblice/2006/posters/Gonzales-Salgado_D_VLE_simulations_of_1-methyl-naphtalene.jpg]
From: ············@gmail.com
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <1172352791.903878.170930@j27g2000cwj.googlegroups.com>
On Feb 1, 10:20 am, ·············@gmail.com" <············@gmail.com>
wrote:
> > > > Hm, from my experience that suggests they were doing something wrong
> > > > in the algorithm.
>
> > > What exactly do you mean?
>
> > 1. Google "parallel Navier-Stokes solver" and observe the many hits ;P
>
> For a while I thought that you were talking seriously.

Oh, seeing the poster makes sense now, it's an MD code.  When you said
"liquid simulation" I thought you meant fluid flow (for which a
parallel Navier-Stokes solver would be useful).  But there are ways to
parallelize MD simulations.  Can one use tree codes for the
electrostatic forces?
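
For what the tree-code question refers to: the idea (Barnes-Hut-style)
is to replace distant clusters of charges with their total charge at
the charge-weighted center, cutting the pairwise electrostatic sum
from O(N^2) to roughly O(N log N).  A hedged 1D sketch in Python --
real MD codes use 3D octrees and higher multipole moments, and the
`theta` opening parameter here is an illustrative choice, not from any
particular package:

```python
# 1D tree code for the potential sum  V(x) = sum_j q_j / |x - x_j|.
# Far clusters are approximated by a monopole (total charge at the
# charge-weighted center); near clusters are opened recursively.

def build_tree(points):
    # points: list of (position, charge); leaf when a single point.
    if len(points) == 1:
        x, q = points[0]
        return {"x": x, "q": q, "size": 0.0, "children": []}
    xs = [x for x, _ in points]
    mid = (min(xs) + max(xs)) / 2
    left = [p for p in points if p[0] <= mid]
    right = [p for p in points if p[0] > mid]
    if not left or not right:            # coincident points: force a split
        left, right = points[:1], points[1:]
    children = [build_tree(left), build_tree(right)]
    q = sum(c["q"] for c in children)
    x = sum(c["q"] * c["x"] for c in children) / q
    return {"x": x, "q": q, "size": max(xs) - min(xs), "children": children}

def potential(node, x, theta=0.5):
    d = abs(x - node["x"])
    if not node["children"] or (d > 0 and node["size"] / d < theta):
        # Leaf, or well-separated cluster: monopole term.
        # d == 0 means the evaluation point itself: skip self-interaction.
        return node["q"] / d if d > 0 else 0.0
    return sum(potential(c, x, theta) for c in node["children"])

charges = [(float(i), 1.0) for i in range(16)]
tree = build_tree(charges)
direct = sum(q / abs(5.0 - x) for x, q in charges if x != 5.0)
# Monopole approximation stays within a few percent of the direct sum.
assert abs(potential(tree, 5.0) - direct) / direct < 0.05
```

Whether this pays off for a given liquid simulation depends on the
interaction range and dynamical regime, which is exactly the open
question in the thread.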

mfh
From: Juan R.
Subject: Re: Automatic parallelization - was Re: LISP Object Oriented?
Date: 
Message-ID: <1172416846.264046.154690@8g2000cwh.googlegroups.com>
On Feb 24, 10:33 pm, ·············@gmail.com" <············@gmail.com>
wrote:
> On Feb 1, 10:20 am, ·············@gmail.com" <············@gmail.com>
> wrote:
>
> > > > > Hm, from my experience that suggests they were doing something wrong
> > > > > in the algorithm.
>
> > > > What exactly do you mean?
>
> > > 1. Google "parallel Navier-Stokes solver" and observe the many hits ;P
>
> > For a while I thought that you were talking seriously.
>
> Oh, seeing the poster makes sense now, it's an MD code.

From the data I got from the Supercomputation Center, I think it is not
an MD code (MD = molecular dynamics). I have no further data.

> When you said
> "liquid simulation" I thought you meant fluid flow (for which a
> parallel Navier-Stokes solver would be useful).

The problem is not that you confused hydrodynamics with liquid
simulation; a liquid simulation can be done at the hydrodynamic level
(NS), at the mesoscopic level (kinetics), or at the molecular level...
The problem is that you seemed rather sure that the people at the
Supercomputation Center had not done their work, when you did not know
what they had done.

> But there are ways to
> parallelize MD simulations.

Sure, but is it an MD simulation? I do not know.

> Can one use tree codes for the
> electrostatic forces?

No idea, can one? For what dynamical regimes?