Question for Cameron MacKinnon if he's around

From: David Buchan
Subject: Question for Cameron MacKinnon if he's around
Date: Wed, 12 Jan 2005 01:23:11 +0000
Message-ID: <MPG.1c4e460814abf06a9896a5@news.ca.inter.net>

Hi Cameron,

Sorry to bother you, but I wonder if I could ask you a
question. I recently posted on comp.lang.lisp and you
responded.

Here's the excerpt:
*******************
Subject: Re: Another newbie question:
if-elseif-elseif-elseif..etc.
From: Cameron MacKinnon <··········@clearspot.net>
Newsgroups: comp.lang.lisp

David Buchan wrote:
> In the C version of my program I have:
> 
> for (i=0; i<nchar; i=i+1) {
>     if (tmp[i]>=0 && tmp[i]<=31) {
>       dat[i]=tmp[i]+128;}
>     else if (tmp[i]>=32 && tmp[i]<=63) {
>       dat[i]=tmp[i];}
>     else if (tmp[i]>=64 && tmp[i]<=95) {
>       dat[i]=tmp[i]-64;}
>     else if (tmp[i]>=128 && tmp[i]<=159) {
>       dat[i]=tmp[i]+64;}
>     else if (tmp[i]>=160 && tmp[i]<=191) {
>       dat[i]=tmp[i]-64;}
>     else if (tmp[i]>=192 && tmp[i]<=223) {
>       dat[i]=tmp[i]-128;
>     }
>   }
> 
> I'd like to achieve the same thing in Common Lisp
> (Clisp).

If by "the same thing" you mean a slow program, others
here have pointed the way.

However, I'd suggest using essentially the same code
structure to populate an array of values to add to the character.
Then, while reading characters from the file, use the value of each
character as an offset into the array, fetch the array element and add 
it to
the character.

You see, on modern x86 processors (especially the
Pentium 4), mispredicting a branch costs a lot, and your loop is
likely to mispredict at least once per iteration.
*******************

I understand exactly what you are suggesting with
using an array to accomplish the task above, but I was
especially intrigued by your comment about speed.

Specifically, I didn't understand what you meant by,
"You see, on modern x86 processors (especially the
Pentium 4), mispredicting a branch costs a lot, and your loop is
likely to mispredict at least once per iteration."

When branching occurs due to a test (an if statement),
does it try test values until it gets a true result or
something? What is actually going on in the gubbins of
the processor?

Thanks,
Dave Buchan

Re: Question for Cameron MacKinnon if he's around Cameron MacKinnon
- Re: Question for Cameron MacKinnon if he's around David Buchan
- Re: Question for Cameron MacKinnon if he's around Alexander Kjeldaas

From: Cameron MacKinnon
Subject: Re: Question for Cameron MacKinnon if he's around
Date: Wed, 12 Jan 2005 05:17:57 +0000
Message-ID: <I4CdnR7loJ6YLnncRVn-2Q@golden.net>

David Buchan wrote:
> Hi Cameron,
> 
> Sorry to bother you, but I wonder if I could ask you a
> question. I recently posted on comp.lang.lisp and you
> responded.
> 
> Here's the excerpt:
> *******************
> Subject: Re: Another newbie question:
> if-elseif-elseif-elseif..etc.
> From: Cameron MacKinnon <··········@clearspot.net>
> Newsgroups: comp.lang.lisp
> 
> David Buchan wrote:
> 
>>In the C version of my program I have:
>>
>>for (i=0; i<nchar; i=i+1) {
>>    if (tmp[i]>=0 && tmp[i]<=31) {
>>      dat[i]=tmp[i]+128;}
>>    else if (tmp[i]>=32 && tmp[i]<=63) {
>>      dat[i]=tmp[i];}
>>    else if (tmp[i]>=64 && tmp[i]<=95) {
>>      dat[i]=tmp[i]-64;}
>>    else if (tmp[i]>=128 && tmp[i]<=159) {
>>      dat[i]=tmp[i]+64;}
>>    else if (tmp[i]>=160 && tmp[i]<=191) {
>>      dat[i]=tmp[i]-64;}
>>    else if (tmp[i]>=192 && tmp[i]<=223) {
>>      dat[i]=tmp[i]-128;
>>    }
>>  }
>>
>>I'd like to achieve the same thing in Common Lisp
>>(Clisp).
> 
> 
> If by "the same thing" you mean a slow program, others
> here have pointed the way.
> 
> However, I'd suggest using essentially the same code
> structure to populate an array of values to add to the character.
> Then, while reading characters from the file, use the value of each
> character as an offset into the array, fetch the array element and add 
> it to
> the character.
> 
> You see, on modern x86 processors (especially the
> Pentium 4), mispredicting a branch costs a lot, and your loop is
> likely to mispredict at least once per iteration.
> *******************
> 
> I understand exactly what you are suggesting with
> using an array to accomplish the task above, but I was
> especially intrigued by your comment about speed.
> 
> Specifically, I didn't understand what you meant by,
> "You see, on modern x86 processors (especially the
> Pentium 4), mispredicting a branch costs a lot, and your loop is
> likely to mispredict at least once per iteration."

After I posted, it occurred to me that what I meant to say was 
"...likely to mispredict at least once ON AVERAGE per iteration."

> When branching occurs due to a test (an if statement),
> does it try test values until it gets a true result or
> something? What is actually going on in the gubbins of
> the processor?

Most branches are binary, so if the CPU discovers that its prediction 
was wrong, it knows that the correct target must have been the other one.

When microprocessors were first becoming popular (around the 1970s), 
each instruction would be executed completely before the next one was 
started (or even fetched from memory). As CPUs became faster, newer 
designs began overlapping operations -- as instruction one was being 
executed, instruction two was being fetched from memory. Call this a two 
stage pipeline.

Let's look at these instructions:
|	add	$1, %eax	; add one to the eax register
|	je	ovflow		; overflow? branch to overflow routine
|	sub	$2, %ebx	; subtract two from ebx register

As the add instruction is being executed, the jump instruction is being 
fetched. As the conditional jump is being executed, the subtract 
instruction is being fetched (old CPUs always assumed that the branch 
would not be taken). But what if overflow occurs, and the jump 
instruction is taken? The CPU will have fetched the wrong instruction, 
and will have to spend a cycle (or more, in the case of a cache miss) 
doing nothing while the correct instruction is fetched from memory.

To solve this problem, enter branch prediction logic: Transistors in the 
CPU are dedicated to some basic heuristics about which branches are 
likely to be taken, and to keeping track of which branches have been 
taken in the past, all to avoid that wasted cycle due to what is now 
called a "mispredicted" branch. Intel introduced this with the Pentium, 
in 1993. Compiler writers structured the code that compilers produced so 
that it accorded with the branch predictor's heuristics.

But CPUs continued to get faster, and microprocessor designers began 
making longer and longer pipelines in their efforts to increase clock 
speed. Longer pipelines mean that a branch misprediction results in more 
useless work being discarded.

As the internals of x86 CPUs have become more and more baroque, their 
documentation has become less illuminating, to avoid giving away what 
are now considered trade secrets. But it is my understanding that the 
Pentium 4 has a 22 stage pipeline.

Think of an automobile assembly line: If a car takes one hour to build 
and a finished car comes off the line every minute, then there are sixty 
cars in progress at any given time. If we discover a flaw in the 
process, we discard those sixty cars and start over, and we're going to 
have to wait an hour, not a minute, for the next car to emerge. That's 
branch misprediction.

I've glossed over bits and pieces, and simplified here and there, but 
that's the gist of it. Unpredictable conditional branches are the kiss 
of death for performance on modern CPUs. There's no real reason why a 
Sufficiently Smart Compiler couldn't convert your original code to a 
table lookup, possibly with a range check, but I don't think current C 
compilers do this. Modern CPU manufacturers, Intel in particular, have 
designed CPUs which are very difficult to program efficiently by hand, 
requiring quite smart compilers, often beyond today's state of the art. 
I'm under the impression that that is one of the things which doomed 
Intel's recent effort (which was spelled 'Itanium' but pronounced 
'Itanic'). It sucks that we have to be aware of these issues because our 
compilers aren't smart enough, but that's the lot of a coder.

For more from Intel on the subject, try this:

http://www.google.com/search?q=site%3Aintel.com+%22branch+prediction%22

From: David Buchan
Subject: Re: Question for Cameron MacKinnon if he's around
Date: Wed, 12 Jan 2005 09:25:01 +0000
Message-ID: <MPG.1c4eb6f855e6141b9896a6@news.ca.inter.net>

In article <······················@golden.net>, ··········@clearspot.net 
says...
> David Buchan wrote:
> > Hi Cameron,
> > 
> > Sorry to bother you, but I wonder if I could ask you a
> > question. I recently posted on comp.lang.lisp and you
> > responded.
> > 
> > Here's the excerpt:
> > *******************
> > Subject: Re: Another newbie question:
> > if-elseif-elseif-elseif..etc.
> > From: Cameron MacKinnon <··········@clearspot.net>
> > Newsgroups: comp.lang.lisp
> > 
> > David Buchan wrote:
> > 
> >>In the C version of my program I have:
> >>
> >>for (i=0; i<nchar; i=i+1) {
> >>    if (tmp[i]>=0 && tmp[i]<=31) {
> >>      dat[i]=tmp[i]+128;}
> >>    else if (tmp[i]>=32 && tmp[i]<=63) {
> >>      dat[i]=tmp[i];}
> >>    else if (tmp[i]>=64 && tmp[i]<=95) {
> >>      dat[i]=tmp[i]-64;}
> >>    else if (tmp[i]>=128 && tmp[i]<=159) {
> >>      dat[i]=tmp[i]+64;}
> >>    else if (tmp[i]>=160 && tmp[i]<=191) {
> >>      dat[i]=tmp[i]-64;}
> >>    else if (tmp[i]>=192 && tmp[i]<=223) {
> >>      dat[i]=tmp[i]-128;
> >>    }
> >>  }
> >>
> >>I'd like to achieve the same thing in Common Lisp
> >>(Clisp).
> > 
> > 
> > If by "the same thing" you mean a slow program, others
> > here have pointed the way.
> > 
> > However, I'd suggest using essentially the same code
> > structure to populate an array of values to add to the character.
> > Then, while reading characters from the file, use the value of each
> > character as an offset into the array, fetch the array element and add 
> > it to
> > the character.
> > 
> > You see, on modern x86 processors (especially the
> > Pentium 4), mispredicting a branch costs a lot, and your loop is

That was superb.

Thank you.

Dave

From: Alexander Kjeldaas
Subject: Re: Question for Cameron MacKinnon if he's around
Date: Wed, 12 Jan 2005 11:26:40 +0000
Message-ID: <41e508f8@news.broadpark.no>

Cameron MacKinnon wrote:
> 
> To solve this problem, enter branch prediction logic: Transistors in the 
> CPU are dedicated to some basic heuristics about which branches are 
> likely to be taken, and to keeping track of which branches have been 
> taken in the past, all to avoid that wasted cycle due to what is now 
> called a "mispredicted" branch. Intel introduced this with the Pentium, 
> in 1993. Compiler writers structured the code that compilers produced so 
> that it accorded with the branch predictor's heuristics.

You also have the situation where the CPU detects that a branch 
mispredicts a lot.  In that case, Pentium Pro and later processors will, 
instead of choosing one branch over the other, execute _both_ branches 
at the same time, thus avoiding the pipeline stall. The throughput will 
be lower since some instructions will be executed and later discarded, 
but often the CPU has execution units available to lessen the overhead.

What I find interesting is that Intel and other introduced conditional 
move instructions, cmovxx, so that the programmer could avoid pipeline 
stalls.  But on newer processors, these conditional move instructions 
have been (temporarily?) obsoleted.  It is better to do a small jump 
because the conditinal move instruction introduces data dependencies 
between registers that the branching version of the same code does not 
have.

So the CPUs have tricks to avoid meltdown on unpredictable conditional 
branches as well.

astor