From: John Thingstad
Subject: optimation, a black art?
Date: 
Message-ID: <op.udoqxkzwut4oq5@pandora.alfanett.no>
I have lately been discovering the weird and wonderful world of modern  
micro processor architecture.
One of the best impulses came last week. (Can't find the thread thad  
suggested this one. sorry!)
http://www.agner.org/optimize/
As a preliminary I recommend Donald Knuth mmix
http://www-cs-faculty.stanford.edu/~knuth/mmix.html.
It is not rocket science to predict moder processor efficiency.
Most interesting insights.. Recommended!

--------------
John Thingstad

From: Vassil Nikolov
Subject: Re: optimation, a black art?
Date: 
Message-ID: <snzfxqrojva.fsf@luna.vassil.nikolov.name>
On Thu, 03 Jul 2008 00:18:46 +0200, "John Thingstad" <·······@online.no> said:
| ...
| http://www-cs-faculty.stanford.edu/~knuth/mmix.html.

  RISC is fun.  Much like what happens after getting through the
  looking glass [*], the return from a subroutine precedes the loading
  of the return value into the appropriate register...

  _________
  [*] "Hand it round first, and cut it afterwards."

  ---Vassil.


-- 
Peius melius est.  ---Ricardus Gabriel.
From: D Herring
Subject: Re: optimation, a black art?
Date: 
Message-ID: <yKCdnbos_PaH2PHVnZ2dnUVZ_oninZ2d@comcast.com>
John Thingstad wrote:
> I have lately been discovering the weird and wonderful world of modern 
> micro processor architecture.
> One of the best impulses came last week. (Can't find the thread thad 
> suggested this one. sorry!)
> http://www.agner.org/optimize/
> As a preliminary I recommend Donald Knuth mmix
> http://www-cs-faculty.stanford.edu/~knuth/mmix.html.
> It is not rocket science to predict moder processor efficiency.
> Most interesting insights.. Recommended!

See also
http://llvm.org/
A Lisp frontend might be a fun project, though layering tagged data 
onto their system might be difficult.

- Daniel
From: =?ISO-8859-15?Q?Pertti_Kellom=E4ki?=
Subject: Re: optimation, a black art?
Date: 
Message-ID: <g4kkfl$9gq$1@news.cc.tut.fi>
D Herring kirjoitti:
> See also
> http://llvm.org/
> A Lisp frontend might be a fun project, though layering tagged data onto 
> their system might be difficult.

I don't think it would pose too much trouble. Plus there is some
infrastructure for garbage collection as well, so a simple Lisp
compiler should be relatively easy to do. I'm not sure how much
of the optimization stuff in LLVM (constant propagation etc.)
would be relevant to a Lisp compiler though .
-- 
Pertti
From: George Neuner
Subject: Re: optimation, a black art?
Date: 
Message-ID: <3gpe74pojjhamdlesad5v81frfpi7bo5h1@4ax.com>
On Thu, 03 Jul 2008 00:18:46 +0200, "John Thingstad"
<·······@online.no> wrote:

>I have lately been discovering the weird and wonderful world of modern  
>micro processor architecture.
>One of the best impulses came last week. (Can't find the thread thad  
>suggested this one. sorry!)
>http://www.agner.org/optimize/
>As a preliminary I recommend Donald Knuth mmix
>http://www-cs-faculty.stanford.edu/~knuth/mmix.html.
>It is not rocket science to predict moder processor efficiency.
>Most interesting insights.. Recommended!
>
>--------------
>John Thingstad

I've done a lot of low level optimization for real time software.

Optimization at the assembler level has certainly gotten harder.  Most
architectures today have multiple pipelines and fast running
instructions may be completely subsumed by slower running ones.  Many
chip manufacturers no longer document the cycle count for instructions
because it is variable depending on the sequencing, so peephole
optimization becomes a cycle of trial and error.

George
--
for email reply remove "/" from address
From: Robert Maas, http://tinyurl.com/uh3t
Subject: Re: optimation, a black art?
Date: 
Message-ID: <rem-2008jul11-003@yahoo.com>
I've never done (at a practical level) any of the kinds of analysis
you describe below, but I believe I have a basic understanding of
the logic involved, so I'll offer some guesses:
> From: George Neuner <·········@/comcast.net>
> I've done a lot of low level optimization for real time software.
> Optimization at the assembler level has certainly gotten harder.
> Most architectures today have multiple pipelines and fast running
> instructions may be completely subsumed by slower running ones.

So presumably for each block of code, such as the body of a
subroutine or a clause of a conditional or loop, you have a matrix
showing which operations need to be completed before which other
operations can start, the essential nature of a PERT chart, right?

> Many chip manufacturers no longer document the cycle count for
> instructions because it is variable depending on the sequencing,
> so peephole optimization becomes a cycle of trial and error.

Computers are so fast nowadays that I imagine it's trivial, given
the dependency matrix, to generate all permutations of the
operations that are consistent with that matrix, then for each such
permutation generate the actual sequence of instructions following
that permutation in a block of memory, then for each such block
make a million copies and put a system-clock-read before the first
copy and after the last copy, then execute each such mega-block of
code and collect the clock-time-difference, repeat ten or twenty
times to get a good statistical sample, compare the statistics from
those various permutations, and report which sequences vie for
fastest, and the whole analysis and report generation can be
completed within a few seconds, with no operator intervention
needed from start to finish, right? So from your viewpoint, you
edit the matrix, click the permute-analyze button, wait a few
seconds as the progress bar scrolls from 0% to 100%, and up comes
the report showing you which permutation was selected and how it
compares with the other, and most of the time you don't even look
at the details of the report, you just click on the OK button to
automatically copy that permutation to your peephole optimization
database, and you can check off one more micro-algorithm-pattern
that has been optimized, right?

(Following to the previous poster whose ISP is ComCast:)
By the way, I believe the ComCast ad on TV regarding next February's
switch to digital TV broadcasting tells a blatant lie. It says that
if you use a rabbit ears or other antenna to get TV, it won't work
after the switchover date, so you'll need to subscribe to ComCast cable-TV
after the switchover date or cease watching TV.

In truth, I believe, correct me if I'm wrong: Whether you get the
signal from rabbit ears or rooftoop antenna is irrelevant. What is
important is whether you have a way to convert broadcast digital TV
to whatever your TV set needs. If you have a digital convertor box
between the rabbit ears or rooftop antenna and your analog TV set,
your system will *still* work despite the fact you're still using
those rabbit ears etc. Or you can get ComCast to do the conversion
for you and feed an analog signal to you over their CATV. Or you
might already be using a recently manufactured digital TV with your
rabbit ears or rooftop antenna, which doesn't need the conversion
in the first place.

Any of those three ways works. You don't *need* to abandon your
rabbit ears or rooftop antenna and switch to CATV. You can continue
using your rabbit ears or rooftop antenna if you pick either of the
other two choices, not ComCast/CATV.

I hate ComCast for multiple reasons:
- Flooding me with several ads per week at a time when I couldn't
   even afford the rent much less food or CATV, and consequently
   had to borrow $64,000 on credit cards just to pay the rent to
   stay alive, and get all my food from a food bank. I had already
   maxed out all but one of my credit cards and was just days from
   being homeless and suffering severe hypothemia before I was
   allowed into subsidized low-income housing where my SSI could
   pay the rent, and I was still getting flooded with ComCast ads
   almost daily in the snailmailbox.
- Allowing their gazillions of DSL users to have spambot-infected
   computers, flooding my mailbox with hundreds of spam per day,
   flooding just about everyone else who has an e-mail address with
   a similar amount of spam.
- Running these fraudulent scare-tactic ads about digital TV.
From: D Herring
Subject: Re: optimation, a black art?
Date: 
Message-ID: <F7SdnbQnz9xEj-XVnZ2dnUVZ_sninZ2d@comcast.com>
Robert Maas, http://tinyurl.com/uh3t wrote:

> I hate ComCast for multiple reasons:

And I have a few more to add to your list (including selective 
throttling, "power boost", prohibiting servers, not even trying to 
match the advertised performance, hounding those who "overuse their 
unlimited bandwidth", etc.); but they're the only valid option in my 
apartment complex.  I keep hoping for real competition.

Every time they do a TV advert, I hear "Comcast high-speed internet -- 
ts Craptastic!"  But it works reliably enough.

- Daniel
From: George Neuner
Subject: Re: optimation, a black art?
Date: 
Message-ID: <hj9g74dmoj9cliof4d06ujp2edl0c8qr5t@4ax.com>
On Fri, 11 Jul 2008 10:57:52 -0700,
··················@spamgourmet.com.remove (Robert Maas,
http://tinyurl.com/uh3t) wrote:

>I've never done (at a practical level) any of the kinds of analysis
>you describe below, but I believe I have a basic understanding of
>the logic involved, so I'll offer some guesses:
>> From: George Neuner <·········@/comcast.net>
>> I've done a lot of low level optimization for real time software.
>> Optimization at the assembler level has certainly gotten harder.
>> Most architectures today have multiple pipelines and fast running
>> instructions may be completely subsumed by slower running ones.
>
>So presumably for each block of code, such as the body of a
>subroutine or a clause of a conditional or loop, you have a matrix
>showing which operations need to be completed before which other
>operations can start, the essential nature of a PERT chart, right?

Only partially.  Trace [*] scheduling (also called Basic Block
scheduling or List scheduling) is analogous to PERTing - the problem
for the compiler is figuring out what instructions to include in each
of the traces.  Remember that the register set is finite and a fetch
from memory may take 50..10,000 times longer than fetching from a
register.  A fetch of a value not in a register should be placed far
enough in advance of the value's use so as not to stall the pipeline.
To do this the compiler has to estimate where the value lies in the
memory/cache hierarchy.

The end result is that what appears to be a simple logical instruction
sequence may really end up as many small sequences separated by
dozens, or even hundreds, of other instructions which are competing
for registers, functional units and memory bandwidth.


[*] A trace or basic block is a list of instructions to be executed
sequentially, with no internal branches and, at most, one branch or
jump at the end.


>> Many chip manufacturers no longer document the cycle count for
>> instructions because it is variable depending on the sequencing,
>> so peephole optimization becomes a cycle of trial and error.
>
>Computers are so fast nowadays that I imagine it's trivial, given
>the dependency matrix, to generate all permutations of the
>operations that are consistent with that matrix, then for each such
>permutation generate the actual sequence of instructions following
>that permutation in a block of memory, then for each such block
>make a million copies and put a system-clock-read before the first
>copy and after the last copy, then execute each such mega-block of
>code and collect the clock-time-difference, repeat ten or twenty
>times to get a good statistical sample, compare the statistics from
>those various permutations, and report which sequences vie for
>fastest, and the whole analysis and report generation can be
>completed within a few seconds, with no operator intervention
>needed from start to finish, right? So from your viewpoint, you
>edit the matrix, click the permute-analyze button, wait a few
>seconds as the progress bar scrolls from 0% to 100%, and up comes
>the report showing you which permutation was selected and how it
>compares with the other, and most of the time you don't even look
>at the details of the report, you just click on the OK button to
>automatically copy that permutation to your peephole optimization
>database, and you can check off one more micro-algorithm-pattern
>that has been optimized, right?

No it's not trivial.  Depending on the instruction set, for any
particular high level construct, there may be several templates for
implementing it.  When high level constructs are combined, the
combinatorial explosion quickly makes trying them all impractical.
Add a finite register set and memory fetch sequencing and the number
of interacting traces quickly becomes unmanageable.  Add cache
pollution from a multitasking OS and the problem becomes NP hard - ie.
essentially unsolvable.

Scheduling for a modern, multiple issue processor is, in fact, so
complicated that compilers usually cannot do an optimal job.  Almost
all modern CPUs support out-of-order execution - the CPU scans ahead
in the instruction stream looking for things to do to fill latency
holes left by the compiler.  The hardware figures out what can be done
and when based on the actual state of the pipelines at the time -
something the compiler can only estimate and then only for sequential
programming - not for multitasking.

I suggest you read up on modern chip architectures and get yourself a
good book on compilers.

George
--
for email reply remove "/" from address
From: Vend
Subject: Re: optimation, a black art?
Date: 
Message-ID: <12b45eb0-c79b-49e1-b4e3-67fa9c0abda5@59g2000hsb.googlegroups.com>
On 12 Lug, 06:41, George Neuner <·········@/comcast.net> wrote:
> On Fri, 11 Jul 2008 10:57:52 -0700,
> ··················@spamgourmet.com.remove (Robert Maas,
>
> http://tinyurl.com/uh3t) wrote:
> >I've never done (at a practical level) any of the kinds of analysis
> >you describe below, but I believe I have a basic understanding of
> >the logic involved, so I'll offer some guesses:
> >> From: George Neuner <·········@/comcast.net>
> >> I've done a lot of low level optimization for real time software.
> >> Optimization at the assembler level has certainly gotten harder.
> >> Most architectures today have multiple pipelines and fast running
> >> instructions may be completely subsumed by slower running ones.
>
> >So presumably for each block of code, such as the body of a
> >subroutine or a clause of a conditional or loop, you have a matrix
> >showing which operations need to be completed before which other
> >operations can start, the essential nature of a PERT chart, right?
>
> Only partially.  Trace [*] scheduling (also called Basic Block
> scheduling or List scheduling) is analogous to PERTing - the problem
> for the compiler is figuring out what instructions to include in each
> of the traces.  Remember that the register set is finite and a fetch
> from memory may take 50..10,000 times longer than fetching from a
> register.  A fetch of a value not in a register should be placed far
> enough in advance of the value's use so as not to stall the pipeline.
> To do this the compiler has to estimate where the value lies in the
> memory/cache hierarchy.
>
> The end result is that what appears to be a simple logical instruction
> sequence may really end up as many small sequences separated by
> dozens, or even hundreds, of other instructions which are competing
> for registers, functional units and memory bandwidth.
>
> [*] A trace or basic block is a list of instructions to be executed
> sequentially, with no internal branches and, at most, one branch or
> jump at the end.
>
>
>
> >> Many chip manufacturers no longer document the cycle count for
> >> instructions because it is variable depending on the sequencing,
> >> so peephole optimization becomes a cycle of trial and error.
>
> >Computers are so fast nowadays that I imagine it's trivial, given
> >the dependency matrix, to generate all permutations of the
> >operations that are consistent with that matrix, then for each such
> >permutation generate the actual sequence of instructions following
> >that permutation in a block of memory, then for each such block
> >make a million copies and put a system-clock-read before the first
> >copy and after the last copy, then execute each such mega-block of
> >code and collect the clock-time-difference, repeat ten or twenty
> >times to get a good statistical sample, compare the statistics from
> >those various permutations, and report which sequences vie for
> >fastest, and the whole analysis and report generation can be
> >completed within a few seconds, with no operator intervention
> >needed from start to finish, right? So from your viewpoint, you
> >edit the matrix, click the permute-analyze button, wait a few
> >seconds as the progress bar scrolls from 0% to 100%, and up comes
> >the report showing you which permutation was selected and how it
> >compares with the other, and most of the time you don't even look
> >at the details of the report, you just click on the OK button to
> >automatically copy that permutation to your peephole optimization
> >database, and you can check off one more micro-algorithm-pattern
> >that has been optimized, right?
>
> No it's not trivial.  Depending on the instruction set, for any
> particular high level construct, there may be several templates for
> implementing it.  When high level constructs are combined, the
> combinatorial explosion quickly makes trying them all impractical.
> Add a finite register set and memory fetch sequencing and the number
> of interacting traces quickly becomes unmanageable.  Add cache
> pollution from a multitasking OS and the problem becomes NP hard - ie.
> essentially unsolvable.
>
> Scheduling for a modern, multiple issue processor is, in fact, so
> complicated that compilers usually cannot do an optimal job.  Almost
> all modern CPUs support out-of-order execution - the CPU scans ahead
> in the instruction stream looking for things to do to fill latency
> holes left by the compiler.  The hardware figures out what can be done
> and when based on the actual state of the pipelines at the time -
> something the compiler can only estimate and then only for sequential
> programming - not for multitasking.
>
> I suggest you read up on modern chip architectures and get yourself a
> good book on compilers.

Yet the IA-64 architecture requires the compiler to explicitely
schedule the instructions, IIRC.

Most of the complication in modern x86 processors was due to the
binary backwards compatibility constraint: they expose a CISC
instruction set implemented on a RISC-like internal architecture.
From: George Neuner
Subject: Re: optimation, a black art?
Date: 
Message-ID: <g8sl74deq44p2e2cgupmhirrgqvbaiguqk@4ax.com>
On Sat, 12 Jul 2008 01:06:32 -0700 (PDT), Vend <······@virgilio.it>
wrote:

>On 12 Lug, 06:41, George Neuner <·········@/comcast.net> wrote:
>> On Fri, 11 Jul 2008 10:57:52 -0700,
>> ··················@spamgourmet.com.remove (Robert Maas,
>>
>> http://tinyurl.com/uh3t) wrote:
>> >I've never done (at a practical level) any of the kinds of analysis
>> >you describe below, but I believe I have a basic understanding of
>> >the logic involved, so I'll offer some guesses:
>> >> From: George Neuner <·········@/comcast.net>
>> >> I've done a lot of low level optimization for real time software.
>> >> Optimization at the assembler level has certainly gotten harder.
>> >> Most architectures today have multiple pipelines and fast running
>> >> instructions may be completely subsumed by slower running ones.
>>
>> >So presumably for each block of code, such as the body of a
>> >subroutine or a clause of a conditional or loop, you have a matrix
>> >showing which operations need to be completed before which other
>> >operations can start, the essential nature of a PERT chart, right?
>>
>> Only partially.  Trace [*] scheduling (also called Basic Block
>> scheduling or List scheduling) is analogous to PERTing - the problem
>> for the compiler is figuring out what instructions to include in each
>> of the traces.  Remember that the register set is finite and a fetch
>> from memory may take 50..10,000 times longer than fetching from a
>> register.  A fetch of a value not in a register should be placed far
>> enough in advance of the value's use so as not to stall the pipeline.
>> To do this the compiler has to estimate where the value lies in the
>> memory/cache hierarchy.
>>
>> The end result is that what appears to be a simple logical instruction
>> sequence may really end up as many small sequences separated by
>> dozens, or even hundreds, of other instructions which are competing
>> for registers, functional units and memory bandwidth.
>>
>> [*] A trace or basic block is a list of instructions to be executed
>> sequentially, with no internal branches and, at most, one branch or
>> jump at the end.
>>
>>
>>
>> >> Many chip manufacturers no longer document the cycle count for
>> >> instructions because it is variable depending on the sequencing,
>> >> so peephole optimization becomes a cycle of trial and error.
>>
>> >Computers are so fast nowadays that I imagine it's trivial, given
>> >the dependency matrix, to generate all permutations of the
>> >operations that are consistent with that matrix, then for each such
>> >permutation generate the actual sequence of instructions following
>> >that permutation in a block of memory, then for each such block
>> >make a million copies and put a system-clock-read before the first
>> >copy and after the last copy, then execute each such mega-block of
>> >code and collect the clock-time-difference, repeat ten or twenty
>> >times to get a good statistical sample, compare the statistics from
>> >those various permutations, and report which sequences vie for
>> >fastest, and the whole analysis and report generation can be
>> >completed within a few seconds, with no operator intervention
>> >needed from start to finish, right? So from your viewpoint, you
>> >edit the matrix, click the permute-analyze button, wait a few
>> >seconds as the progress bar scrolls from 0% to 100%, and up comes
>> >the report showing you which permutation was selected and how it
>> >compares with the other, and most of the time you don't even look
>> >at the details of the report, you just click on the OK button to
>> >automatically copy that permutation to your peephole optimization
>> >database, and you can check off one more micro-algorithm-pattern
>> >that has been optimized, right?
>>
>> No it's not trivial.  Depending on the instruction set, for any
>> particular high level construct, there may be several templates for
>> implementing it.  When high level constructs are combined, the
>> combinatorial explosion quickly makes trying them all impractical.
>> Add a finite register set and memory fetch sequencing and the number
>> of interacting traces quickly becomes unmanageable.  Add cache
>> pollution from a multitasking OS and the problem becomes NP hard - ie.
>> essentially unsolvable.
>>
>> Scheduling for a modern, multiple issue processor is, in fact, so
>> complicated that compilers usually cannot do an optimal job.  Almost
>> all modern CPUs support out-of-order execution - the CPU scans ahead
>> in the instruction stream looking for things to do to fill latency
>> holes left by the compiler.  The hardware figures out what can be done
>> and when based on the actual state of the pipelines at the time -
>> something the compiler can only estimate and then only for sequential
>> programming - not for multitasking.
>>
>> I suggest you read up on modern chip architectures and get yourself a
>> good book on compilers.
>
>Yet the IA-64 architecture requires the compiler to explicitely
>schedule the instructions, IIRC.

All chips require some scheduling - you can't just throw in a bag of
unordered instructions and expect a program to emerge.

The point is that what the compiler does amounts to a "suggested"
scheduling - the actual final execution order is left up to the chip
and may be different for each run of the program.


>Most of the complication in modern x86 processors was due to the
>binary backwards compatibility constraint: they expose a CISC
>instruction set implemented on a RISC-like internal architecture.

RISC is no less problematic - the MIPS architecture almost failed
initially because it depended too much on the compiler for scheduling
and did not include pipeline interlocks to delay operations when
operands were not ready.  Fortunately the designers realized their
mistake and included interlocking pipelines in subsequent generations
of the chip.

Intel made a similar mistake with the i860.  The i860 was VLIW with 3
pipelines and depended entirely on the compiler to find and pack
together non-conflicting instructions to be issued simultaneously.
The i860 was a failure.  The i960 was an integer-only version which
found a niche in embedded processing - removing the FPU pipeline made
it much easier to program than the i860.

Multiple issue RISCs like the m88K and PPC are quite difficult to
generate high quality code for.  They are somewhat easier than the x86
to generate adequate code for ... but when highly optimized code is
needed, all platforms are a bitch and the differences are only in the
degree.

George
--
for email reply remove "/" from address
From: Vend
Subject: Re: optimation, a black art?
Date: 
Message-ID: <8644c77e-389d-461f-91ec-c36c36f262a7@a1g2000hsb.googlegroups.com>
On 14 Lug, 09:00, George Neuner <·········@/comcast.net> wrote:
> On Sat, 12 Jul 2008 01:06:32 -0700 (PDT), Vend <······@virgilio.it>
> wrote:
>
>
>
> >On 12 Lug, 06:41, George Neuner <·········@/comcast.net> wrote:
> >> On Fri, 11 Jul 2008 10:57:52 -0700,
> >> ··················@spamgourmet.com.remove (Robert Maas,
>
> >>http://tinyurl.com/uh3t) wrote:
> >> >I've never done (at a practical level) any of the kinds of analysis
> >> >you describe below, but I believe I have a basic understanding of
> >> >the logic involved, so I'll offer some guesses:
> >> >> From: George Neuner <·········@/comcast.net>
> >> >> I've done a lot of low level optimization for real time software.
> >> >> Optimization at the assembler level has certainly gotten harder.
> >> >> Most architectures today have multiple pipelines and fast running
> >> >> instructions may be completely subsumed by slower running ones.
>
> >> >So presumably for each block of code, such as the body of a
> >> >subroutine or a clause of a conditional or loop, you have a matrix
> >> >showing which operations need to be completed before which other
> >> >operations can start, the essential nature of a PERT chart, right?
>
> >> Only partially.  Trace [*] scheduling (also called Basic Block
> >> scheduling or List scheduling) is analogous to PERTing - the problem
> >> for the compiler is figuring out what instructions to include in each
> >> of the traces.  Remember that the register set is finite and a fetch
> >> from memory may take 50..10,000 times longer than fetching from a
> >> register.  A fetch of a value not in a register should be placed far
> >> enough in advance of the value's use so as not to stall the pipeline.
> >> To do this the compiler has to estimate where the value lies in the
> >> memory/cache hierarchy.
>
> >> The end result is that what appears to be a simple logical instruction
> >> sequence may really end up as many small sequences separated by
> >> dozens, or even hundreds, of other instructions which are competing
> >> for registers, functional units and memory bandwidth.
>
> >> [*] A trace or basic block is a list of instructions to be executed
> >> sequentially, with no internal branches and, at most, one branch or
> >> jump at the end.
>
> >> >> Many chip manufacturers no longer document the cycle count for
> >> >> instructions because it is variable depending on the sequencing,
> >> >> so peephole optimization becomes a cycle of trial and error.
>
> >> >Computers are so fast nowadays that I imagine it's trivial, given
> >> >the dependency matrix, to generate all permutations of the
> >> >operations that are consistent with that matrix, then for each such
> >> >permutation generate the actual sequence of instructions following
> >> >that permutation in a block of memory, then for each such block
> >> >make a million copies and put a system-clock-read before the first
> >> >copy and after the last copy, then execute each such mega-block of
> >> >code and collect the clock-time-difference, repeat ten or twenty
> >> >times to get a good statistical sample, compare the statistics from
> >> >those various permutations, and report which sequences vie for
> >> >fastest, and the whole analysis and report generation can be
> >> >completed within a few seconds, with no operator intervention
> >> >needed from start to finish, right? So from your viewpoint, you
> >> >edit the matrix, click the permute-analyze button, wait a few
> >> >seconds as the progress bar scrolls from 0% to 100%, and up comes
> >> >the report showing you which permutation was selected and how it
> >> >compares with the other, and most of the time you don't even look
> >> >at the details of the report, you just click on the OK button to
> >> >automatically copy that permutation to your peephole optimization
> >> >database, and you can check off one more micro-algorithm-pattern
> >> >that has been optimized, right?
>
> >> No it's not trivial.  Depending on the instruction set, for any
> >> particular high level construct, there may be several templates for
> >> implementing it.  When high level constructs are combined, the
> >> combinatorial explosion quickly makes trying them all impractical.
> >> Add a finite register set and memory fetch sequencing and the number
> >> of interacting traces quickly becomes unmanageable.  Add cache
> >> pollution from a multitasking OS and the problem becomes NP hard - ie.
> >> essentially unsolvable.
>
> >> Scheduling for a modern, multiple issue processor is, in fact, so
> >> complicated that compilers usually cannot do an optimal job.  Almost
> >> all modern CPUs support out-of-order execution - the CPU scans ahead
> >> in the instruction stream looking for things to do to fill latency
> >> holes left by the compiler.  The hardware figures out what can be done
> >> and when based on the actual state of the pipelines at the time -
> >> something the compiler can only estimate and then only for sequential
> >> programming - not for multitasking.
>
> >> I suggest you read up on modern chip architectures and get yourself a
> >> good book on compilers.
>
> >Yet the IA-64 architecture requires the compiler to explicitely
> >schedule the instructions, IIRC.
>
> All chips require some scheduling - you can't just throw in a bag of
> unordered instructions and expect a program to emerge.

Yes. The point is that most modern superscalar processors expose an
interface of a sequential processor and internally schedule the
sequential instructions to multiple parallel execution units. Most of
the complications come form the need to ensure that the semantics of
the instruction sequence is preserved.

The IA-64 architecture does kind of the contrary: it exposes a fairly
large number of logical parallel execution units, which may be greater
than the number of physical execution units, and lets the compiler
explicit schedule instructions for these logical units and take care
of any dependancy constraints.

> The point is that what the compiler does amounts to a "suggested"
> scheduling - the actual final execution order is left up to the chip
> and may be different for each run of the program.
>
> >Most of the complication in modern x86 processors was due to the
> >binary backwards compatibility constraint: they expose a CISC
> >instruction set implemented on a RISC-like internal architecture.
>
> RISC is no less problematic - the MIPS architecture almost failed
> initially because it depended too much on the compiler for scheduling
> and did not include pipeline interlocks to delay operations when
> operands were not ready.  Fortunately the designers realized their
> mistake and included interlocking pipelines in subsequent generations
> of the chip.

If I remember correctly it still requires NOPs to fill the delays in
jumps, doesn't it?

> Intel made a similar mistake with the i860.  The i860 was VLIW with 3
> pipelines and depended entirely on the compiler to find and pack
> together non-conflicting instructions to be issued simultaneously.
> The i860 was a failure.  The i960 was an integer-only version which
> found a niche in embedded processing - removing the FPU pipeline made
> it much easier to program than the i860.
>
> Multiple issue RISCs like the m88K and PPC are quite difficult to
> generate high quality code for.  They are somewhat easier than the x86
> to generate adequate code for ... but when highly optimized code is
> needed, all platforms are a bitch and the differences are only in the
> degree.

An internal FPGA exposed to the programmer would be the optimal
solution maybe.
From: George Neuner
Subject: Re: optimation, a black art?
Date: 
Message-ID: <ea7o7457v666mecghhd1ndim3jst7qkbuq@4ax.com>
On Mon, 14 Jul 2008 03:20:30 -0700 (PDT), Vend <······@virgilio.it>
wrote:

>On 14 Lug, 09:00, George Neuner <·········@/comcast.net> wrote:
>> On Sat, 12 Jul 2008 01:06:32 -0700 (PDT), Vend <······@virgilio.it>
>> wrote:
>>
>> >Most of the complication in modern x86 processors was due to the
>> >binary backwards compatibility constraint: they expose a CISC
>> >instruction set implemented on a RISC-like internal architecture.
>>
>> RISC is no less problematic - the MIPS architecture almost failed
>> initially because it depended too much on the compiler for scheduling
>> and did not include pipeline interlocks to delay operations when
>> operands were not ready.  Fortunately the designers realized their
>> mistake and included interlocking pipelines in subsequent generations
>> of the chip.
>
>If I remember correctly it still requires NOPs to fill the delays in
>jumps, doesn't it?

AFAIK, no RISC ever _required_ NOPs, but it was left up to the
compiler to find useful instructions that could be executed in the
delay slot.

The real problem was that many early designs executed the delay slot
instructions regardless of whether the branch was taken.  Finding
useful instructions that would be valid in both cases proved to be
very difficult in general.  RISC designer responded by introducing
speculative execution and by allowing compilers to tag delay slot
instructions as being conditional or not.  Conditional instructions
would be executed but their results would be held until the branch
target was resolved and thrown away if the branch went against them.


>> Intel made a similar mistake with the i860.  The i860 was VLIW with 3
>> pipelines and depended entirely on the compiler to find and pack
>> together non-conflicting instructions to be issued simultaneously.
>> The i860 was a failure.  The i960 was an integer-only version which
>> found a niche in embedded processing - removing the FPU pipeline made
>> it much easier to program than the i860.
>>
>> Multiple issue RISCs like the m88K and PPC are quite difficult to
>> generate high quality code for.  They are somewhat easier than the x86
>> to generate adequate code for ... but when highly optimized code is
>> needed, all platforms are a bitch and the differences are only in the
>> degree.
>
>An internal FPGA exposed to the programmer would be the optimal
>solution maybe.

Long term that might be a good suggestion, but IMO in the short term
it would probably make things worse because few programmers can work
effectively in VHDL and current compilers from high level languages
into VHDL, or directly to configuration binaries, are still pretty bad
efficiency wise as compared to a hand design.


Years ago I worked on a compiler for a board level processor that
consisted of a DSP with attached FPGAs.  Knowing that compiling HL
code into VHDL was a losing proposition, we took a different approach.
We provided direct support for parallel array, matrix and convolution
ops (1,2 & 3D) in the compiler and implemented them with a library of
hand optimized configurations.  The programmer hardly needed to be
aware of the FPGAs - the compiler constructed configuration parameter
blocks and scheduled execution.  The FPGAs were interrupt devices so
the DSP program continued to run in parallel (we used the DSP for
control, serial integer code, and some floating point support because
the FPGAs at the time were too small for complicated FP codes).  FPGA
ops could be executed single step with return to the DSP code each
time, or could be chained together for serial execution.  In either
case, the compiler handled almost all the details behind the scenes.
There was only one issue for chained operation that we could not make
transparent and that was where the data buffers were allocated - all
buffers for an operation had to be in separate memory banks for
maximum speed and sometimes the programmer had to intervene to achieve
this.  We probably could have solved that, but we never got around to
it.

The project was canceled when MMX Pentiums became fast enough to do
real time vision applications with OTS hardware (around 2001).  Our
DSP/FPGA board was 10..50 times faster than Pentiums of that era, but
most of our applications didn't need such blinding speed and the
customers would not pay premium for custom hardware that would easily
outpace their requirements.  So the whole thing died.

AFAIK, only one other project has taken a similar approach - a
university effort from Stanford.  But IMNSHO, our system was better -
not to mention 10 years earlier and commercial.

George
[btw: I'm permitted to discuss general aspects of the system software
and compiler, but many particulars remain confidential and I have no
code available.]

--
for email reply remove "/" from address
From: Oisín Mac Fhearaí
Subject: Re: optimation, a black art?
Date: 
Message-ID: <1be8b8f8-5229-4abe-965e-b57f7329648b@b1g2000hsg.googlegroups.com>
On Jul 12, 5:41 am, George Neuner <·········@/comcast.net> wrote:
> No it's not trivial.  Depending on the instruction set, for any
> particular high level construct, there may be several templates for
> implementing it.  When high level constructs are combined, the
> combinatorial explosion quickly makes trying them all impractical.
> Add a finite register set and memory fetch sequencing and the number
> of interacting traces quickly becomes unmanageable.  Add cache
> pollution from a multitasking OS and the problem becomes NP hard - ie.
> essentially unsolvable.

You're forgetting that we often need to solve NP-hard problems (i.e.
n>2-satisfiability checking, travelling salesman, graph colouring).
Since compilers already use heuristics (an indispensable and often
overlooked tool) in their optimisations, there's no reason I can see
not to use further heuristics and, better yet, heuristic search
methods to tackle the exploding problems you mention.

Certainly, this is a kind of non-Markov problem, where a good portion
of the information is not available to the compiler like the RAM on
the target machine and of course all of the runtime information (how
much physical RAM is available? How many processes are waiting for CPU
on average right now?) - this is one of the areas where an optimising
virtual machine like the JVM has an advantage.

> Scheduling for a modern, multiple issue processor is, in fact, so
> complicated that compilers usually cannot do an optimal job.  Almost
> all modern CPUs support out-of-order execution - the CPU scans ahead
> in the instruction stream looking for things to do to fill latency
> holes left by the compiler.  The hardware figures out what can be done
> and when based on the actual state of the pipelines at the time -
> something the compiler can only estimate and then only for sequential
> programming - not for multitasking.

Yes, but the compiler doesn't need to do an optimal job. It just needs
to do a better job (if possible) than the code it started with.

Oisín
From: Vend
Subject: Re: optimation, a black art?
Date: 
Message-ID: <ffef3379-e401-4151-a762-a7ba7221f2c1@26g2000hsk.googlegroups.com>
On 12 Lug, 15:37, Oisín Mac Fhearaí <···········@gmail.com> wrote:
> On Jul 12, 5:41 am, George Neuner <·········@/comcast.net> wrote:
>
> > No it's not trivial.  Depending on the instruction set, for any
> > particular high level construct, there may be several templates for
> > implementing it.  When high level constructs are combined, the
> > combinatorial explosion quickly makes trying them all impractical.
> > Add a finite register set and memory fetch sequencing and the number
> > of interacting traces quickly becomes unmanageable.  Add cache
> > pollution from a multitasking OS and the problem becomes NP hard - ie.
> > essentially unsolvable.
>
> You're forgetting that we often need to solve NP-hard problems (i.e.
> n>2-satisfiability checking, travelling salesman, graph colouring).
> Since compilers already use heuristics (an indispensable and often
> overlooked tool) in their optimisations, there's no reason I can see
> not to use further heuristics and, better yet, heuristic search
> methods to tackle the exploding problems you mention.
>
> Certainly, this is a kind of non-Markov problem, where a good portion
> of the information is not available to the compiler like the RAM on
> the target machine and of course all of the runtime information (how
> much physical RAM is available? How many processes are waiting for CPU
> on average right now?) - this is one of the areas where an optimising
> virtual machine like the JVM has an advantage.

It's not just a matter of the machine details. Some information about
the program is just not available at compile-time while it's available
at run-time.

Consider conditional jump instruction for instance: in most
architectures, taking the jump takes more cycles than not taking it,
hence compilers should try to put the most frequently executed code in
the faster branch of the conditional. But which is the most frequently
executed branch of a conditional? In most cases it's impossible to
tell at compile-time.

Sometimes it's possible to use heuristics: in a conditional used to
control a loop construct is usually safe to assume that the branch
that goes inside the loop will be taken more frequently than the one
which exits the loop.

In other cases it's hard to tell: the most frequent branch might
depend on external input or on the result of potentially non-
terminating computations.
Programmers usually have an idea about the frequency of different
branches but as far as I know no language allows such clues to be
included. Some programmers list exceptional cases at the beginning of
a conditional statement while others list them at the end.

A virtual machine with a profiling interpeter and a JIT compiler
instead can benefit take statistics of branch frequency during
interpretation and then use them to optimize the code during
compilation.

<snip>
From: Oisín Mac Fhearaí
Subject: Re: optimation, a black art?
Date: 
Message-ID: <2a366b74-0335-4b4e-ac85-f21be836eef7@27g2000hsf.googlegroups.com>
On Jul 13, 12:38 am, Vend <······@virgilio.it> wrote:
> It's not just a matter of the machine details. Some information about
> the program is just not available at compile-time while it's available
> at run-time.

Indeed, I mentioned this in the post you quoted.

> Programmers usually have an idea about the frequency of different
> branches but as far as I know no language allows such clues to be
> included. Some programmers list exceptional cases at the beginning of
> a conditional statement while others list them at the end.

Some instruction sets (ARM?) include "branch-probably" style
instructions, but as you say, higher-level languages don't provide for
such hints.

However, some optimising compilers (I've seen some assembly output
from Intel C/C++ demonstrating it) actually will make guesses about
how many times a loop will be executed and will order branches
accordingly - for such architectures they could emit branch-probably
mnemonics as well.
For a good time, check out [0] for a description of some static branch
prediction heuristics for compilers.

Oisín

[0] http://www.crhc.uiuc.edu/IMPACT/ftp/conference/pact-98-branchpred.ps
From: George Neuner
Subject: Re: optimation, a black art?
Date: 
Message-ID: <jggn7418v0v95ahd4340h93tu74abqb7v2@4ax.com>
On Sat, 12 Jul 2008 06:37:03 -0700 (PDT), Ois�n Mac Fheara�
<···········@gmail.com> wrote:

>On Jul 12, 5:41�am, George Neuner <·········@/comcast.net> wrote:
>> No it's not trivial. �Depending on the instruction set, for any
>> particular high level construct, there may be several templates for
>> implementing it. �When high level constructs are combined, the
>> combinatorial explosion quickly makes trying them all impractical.
>> Add a finite register set and memory fetch sequencing and the number
>> of interacting traces quickly becomes unmanageable. �Add cache
>> pollution from a multitasking OS and the problem becomes NP hard - ie.
>> essentially unsolvable.
>
>You're forgetting that we often need to solve NP-hard problems (i.e.
>n>2-satisfiability checking, travelling salesman, graph colouring).
>Since compilers already use heuristics (an indispensable and often
>overlooked tool) in their optimisations, there's no reason I can see
>not to use further heuristics and, better yet, heuristic search
>methods to tackle the exploding problems you mention.

I'm not forgetting anything.  Yes, NP problems are routinely solved -
but with the realization that the solution is not likely to be
optimal.  Most solutions to very hard problems are only meant to be
"workable", not optimal.

The poster I was responding to seemed to think it is easy enough for a
compiler to try all instruction sequence permutations and select the
best one.  That is only true for short sequences and is not the
general case.  Beyond that, the compiler can only estimate the
resource usage of a hypothetical CPU - not a real one under load.


>Certainly, this is a kind of non-Markov problem, where a good portion
>of the information is not available to the compiler like the RAM on
>the target machine and of course all of the runtime information (how
>much physical RAM is available? How many processes are waiting for CPU
>on average right now?) - this is one of the areas where an optimising
>virtual machine like the JVM has an advantage.
>
>> Scheduling for a modern, multiple issue processor is, in fact, so
>> complicated that compilers usually cannot do an optimal job. �Almost
>> all modern CPUs support out-of-order execution - the CPU scans ahead
>> in the instruction stream looking for things to do to fill latency
>> holes left by the compiler. �The hardware figures out what can be done
>> and when based on the actual state of the pipelines at the time -
>> something the compiler can only estimate and then only for sequential
>> programming - not for multitasking.
>
>Yes, but the compiler doesn't need to do an optimal job. It just needs
>to do a better job (if possible) than the code it started with.

Exactly.  That is also the point of OoO execution - the compiler only
has to do a proper ordering of instructions rather than an optimal
one.

George
--
for email reply remove "/" from address
From: Oisín Mac Fhearaí
Subject: Re: optimation, a black art?
Date: 
Message-ID: <1260439d-97fb-4d8e-b44e-137cf90a68ca@j22g2000hsf.googlegroups.com>
On Jul 11, 6:57 pm, ··················@spamgourmet.com.remove (Robert
Maas, http://tinyurl.com/uh3t) wrote:
> > From: George Neuner <·········@/comcast.net>
> > I've done a lot of low level optimization for real time software.
> > Optimization at the assembler level has certainly gotten harder.
> > Most architectures today have multiple pipelines and fast running
> > instructions may be completely subsumed by slower running ones.
>
> So presumably for each block of code, such as the body of a
> subroutine or a clause of a conditional or loop, you have a matrix
> showing which operations need to be completed before which other
> operations can start, the essential nature of a PERT chart, right?
>
> > Many chip manufacturers no longer document the cycle count for
> > instructions because it is variable depending on the sequencing,
> > so peephole optimization becomes a cycle of trial and error.
>
> Computers are so fast nowadays that I imagine it's trivial, given
> the dependency matrix, to generate all permutations of the
> operations that are consistent with that matrix, then for each such
> permutation generate the actual sequence of instructions following
> that permutation in a block of memory, then for each such block
> make a million copies and put a system-clock-read before the first
> copy and after the last copy, then execute each such mega-block of
> code and collect the clock-time-difference, repeat ten or twenty
> times to get a good statistical sample, compare the statistics from
> those various permutations, and report which sequences vie for
> fastest, and the whole analysis and report generation can be
> completed within a few seconds, with no operator intervention
> needed from start to finish, right? So from your viewpoint, you
> edit the matrix, click the permute-analyze button, wait a few
> seconds as the progress bar scrolls from 0% to 100%, and up comes
> the report showing you which permutation was selected and how it
> compares with the other, and most of the time you don't even look
> at the details of the report, you just click on the OK button to
> automatically copy that permutation to your peephole optimization
> database, and you can check off one more micro-algorithm-pattern
> that has been optimized, right?

The better solution is to determine the actual behaviour of the target
system (it _will_ be documented somewhere, somehow, or maybe an
understanding can be reverse-engineered in the bizarre case that no
good information is available), and come up with a simulator that can
take some of your algorithms/bottlenecks and step through them,
visually displaying which instructions cause pipeline stalls, how and
where, and other hazards.

I had an idea a couple of years ago to try something along the lines
of what you suggest via genetic programming, with possible solutions
competing to find a relatively-minimum cycle count. You'd need a
couple of things though: the simulator which can run the subprograms
and return the cycle count, and a generator/mutator which can
construct valid programs which are provably semantically equivalent to
the hand-written original... (perhaps a series of small-scale
transformations, like those in peephole optimisation) no mean feat?
Anyway, it's the kind of over-the-top ridiculous idea that I decide
I'd love to do, but never really try because it's hard and I get
daunted :D

Oisín