Hi Cameron,
Sorry to bother you, but I wonder if I could ask you a
question. I recently posted on comp.lang.lisp and you
responded.
Here's the excerpt:
*******************
Subject: Re: Another newbie question:
if-elseif-elseif-elseif..etc.
From: Cameron MacKinnon <··········@clearspot.net>
Newsgroups: comp.lang.lisp
David Buchan wrote:
> In the C version of my program I have:
>
> for (i=0; i<nchar; i=i+1) {
> if (tmp[i]>=0 && tmp[i]<=31) {
> dat[i]=tmp[i]+128;}
> else if (tmp[i]>=32 && tmp[i]<=63) {
> dat[i]=tmp[i];}
> else if (tmp[i]>=64 && tmp[i]<=95) {
> dat[i]=tmp[i]-64;}
> else if (tmp[i]>=128 && tmp[i]<=159) {
> dat[i]=tmp[i]+64;}
> else if (tmp[i]>=160 && tmp[i]<=191) {
> dat[i]=tmp[i]-64;}
> else if (tmp[i]>=192 && tmp[i]<=223) {
> dat[i]=tmp[i]-128;
> }
> }
>
> I'd like to achieve the same thing in Common Lisp
> (Clisp).
If by "the same thing" you mean a slow program, others
here have pointed the way.
However, I'd suggest using essentially the same code
structure to populate an array of values to add to the character.
Then, while reading characters from the file, use the value of each
character as an offset into the array, fetch the array element and add
it to
the character.
You see, on modern x86 processors (especially the
Pentium 4), mispredicting a branch costs a lot, and your loop is
likely to mispredict at least once per iteration.
*******************
I understand exactly what you are suggesting with
using an array to accomplish the task above, but I was
especially intrigued by your comment about speed.
Specifically, I didn't understand what you meant by,
"You see, on modern x86 processors (especially the
Pentium 4), mispredicting a branch costs a lot, and your loop is
likely to mispredict at least once per iteration."
When branching occurs due to a test (an if statement),
does it try test values until it gets a true result or
something? What is actually going on in the gubbins of
the processor?
Thanks,
Dave Buchan
David Buchan wrote:
> Hi Cameron,
>
> Sorry to bother you, but I wonder if I could ask you a
> question. I recently posted on comp.lang.lisp and you
> responded.
>
> Here's the excerpt:
> *******************
> Subject: Re: Another newbie question:
> if-elseif-elseif-elseif..etc.
> From: Cameron MacKinnon <··········@clearspot.net>
> Newsgroups: comp.lang.lisp
>
> David Buchan wrote:
>
>>In the C version of my program I have:
>>
>>for (i=0; i<nchar; i=i+1) {
>> if (tmp[i]>=0 && tmp[i]<=31) {
>> dat[i]=tmp[i]+128;}
>> else if (tmp[i]>=32 && tmp[i]<=63) {
>> dat[i]=tmp[i];}
>> else if (tmp[i]>=64 && tmp[i]<=95) {
>> dat[i]=tmp[i]-64;}
>> else if (tmp[i]>=128 && tmp[i]<=159) {
>> dat[i]=tmp[i]+64;}
>> else if (tmp[i]>=160 && tmp[i]<=191) {
>> dat[i]=tmp[i]-64;}
>> else if (tmp[i]>=192 && tmp[i]<=223) {
>> dat[i]=tmp[i]-128;
>> }
>> }
>>
>>I'd like to achieve the same thing in Common Lisp
>>(Clisp).
>
>
> If by "the same thing" you mean a slow program, others
> here have pointed the way.
>
> However, I'd suggest using essentially the same code
> structure to populate an array of values to add to the character.
> Then, while reading characters from the file, use the value of each
> character as an offset into the array, fetch the array element and add
> it to
> the character.
>
> You see, on modern x86 processors (especially the
> Pentium 4), mispredicting a branch costs a lot, and your loop is
> likely to mispredict at least once per iteration.
> *******************
>
> I understand exactly what you are suggesting with
> using an array to accomplish the task above, but I was
> especially intrigued by your comment about speed.
>
> Specifically, I didn't understand what you meant by,
> "You see, on modern x86 processors (especially the
> Pentium 4), mispredicting a branch costs a lot, and your loop is
> likely to mispredict at least once per iteration."
After I posted, it occurred to me that what I meant to say was
"...likely to mispredict at least once ON AVERAGE per iteration."
> When branching occurs due to a test (an if statement),
> does it try test values until it gets a true result or
> something? What is actually going on in the gubbins of
> the processor?
Most branches are binary, so if the CPU discovers that its prediction
was wrong, it knows that the correct target must have been the other one.
When microprocessors were first becoming popular (around the 1970s),
each instruction would be executed completely before the next one was
started (or even fetched from memory). As CPUs became faster, newer
designs began overlapping operations -- as instruction one was being
executed, instruction two was being fetched from memory. Call this a two
stage pipeline.
Let's look at these instructions:
| add $1, %eax ; add one to the eax register
| je ovflow ; overflow? branch to overflow routine
| sub $2, %ebx ; subtract two from ebx register
As the add instruction is being executed, the jump instruction is being
fetched. As the conditional jump is being executed, the subtract
instruction is being fetched (old CPUs always assumed that the branch
would not be taken). But what if overflow occurs, and the jump
instruction is taken? The CPU will have fetched the wrong instruction,
and will have to spend a cycle (or more, in the case of a cache miss)
doing nothing while the correct instruction is fetched from memory.
To solve this problem, enter branch prediction logic: Transistors in the
CPU are dedicated to some basic heuristics about which branches are
likely to be taken, and to keeping track of which branches have been
taken in the past, all to avoid that wasted cycle due to what is now
called a "mispredicted" branch. Intel introduced this with the Pentium,
in 1993. Compiler writers structured the code that compilers produced so
that it accorded with the branch predictor's heuristics.
But CPUs continued to get faster, and microprocessor designers began
making longer and longer pipelines in their efforts to increase clock
speed. Longer pipelines mean that a branch misprediction results in more
useless work being discarded.
As the internals of x86 CPUs have become more and more baroque, their
documentation has become less illuminating, to avoid giving away what
are now considered trade secrets. But it is my understanding that the
Pentium 4 has a 22 stage pipeline.
Think of an automobile assembly line: If a car takes one hour to build
and a finished car comes off the line every minute, then there are sixty
cars in progress at any given time. If we discover a flaw in the
process, we discard those sixty cars and start over, and we're going to
have to wait an hour, not a minute, for the next car to emerge. That's
branch misprediction.
I've glossed over bits and pieces, and simplified here and there, but
that's the gist of it. Unpredictable conditional branches are the kiss
of death for performance on modern CPUs. There's no real reason why a
Sufficiently Smart Compiler couldn't convert your original code to a
table lookup, possibly with a range check, but I don't think current C
compilers do this. Modern CPU manufacturers, Intel in particular, have
designed CPUs which are very difficult to program efficiently by hand,
requiring quite smart compilers, often beyond today's state of the art.
I'm under the impression that that is one of the things which doomed
Intel's recent effort (which was spelled 'Itanium' but pronounced
'Itanic'). It sucks that we have to be aware of these issues because our
compilers aren't smart enough, but that's the lot of a coder.
For more from Intel on the subject, try this:
http://www.google.com/search?q=site%3Aintel.com+%22branch+prediction%22
In article <······················@golden.net>, ··········@clearspot.net
says...
> David Buchan wrote:
> > Hi Cameron,
> >
> > Sorry to bother you, but I wonder if I could ask you a
> > question. I recently posted on comp.lang.lisp and you
> > responded.
> >
> > Here's the excerpt:
> > *******************
> > Subject: Re: Another newbie question:
> > if-elseif-elseif-elseif..etc.
> > From: Cameron MacKinnon <··········@clearspot.net>
> > Newsgroups: comp.lang.lisp
> >
> > David Buchan wrote:
> >
> >>In the C version of my program I have:
> >>
> >>for (i=0; i<nchar; i=i+1) {
> >> if (tmp[i]>=0 && tmp[i]<=31) {
> >> dat[i]=tmp[i]+128;}
> >> else if (tmp[i]>=32 && tmp[i]<=63) {
> >> dat[i]=tmp[i];}
> >> else if (tmp[i]>=64 && tmp[i]<=95) {
> >> dat[i]=tmp[i]-64;}
> >> else if (tmp[i]>=128 && tmp[i]<=159) {
> >> dat[i]=tmp[i]+64;}
> >> else if (tmp[i]>=160 && tmp[i]<=191) {
> >> dat[i]=tmp[i]-64;}
> >> else if (tmp[i]>=192 && tmp[i]<=223) {
> >> dat[i]=tmp[i]-128;
> >> }
> >> }
> >>
> >>I'd like to achieve the same thing in Common Lisp
> >>(Clisp).
> >
> >
> > If by "the same thing" you mean a slow program, others
> > here have pointed the way.
> >
> > However, I'd suggest using essentially the same code
> > structure to populate an array of values to add to the character.
> > Then, while reading characters from the file, use the value of each
> > character as an offset into the array, fetch the array element and add
> > it to
> > the character.
> >
> > You see, on modern x86 processors (especially the
> > Pentium 4), mispredicting a branch costs a lot, and your loop is
That was superb.
Thank you.
Dave
From: Alexander Kjeldaas
Subject: Re: Question for Cameron MacKinnon if he's around
Date:
Message-ID: <41e508f8@news.broadpark.no>
Cameron MacKinnon wrote:
>
> To solve this problem, enter branch prediction logic: Transistors in the
> CPU are dedicated to some basic heuristics about which branches are
> likely to be taken, and to keeping track of which branches have been
> taken in the past, all to avoid that wasted cycle due to what is now
> called a "mispredicted" branch. Intel introduced this with the Pentium,
> in 1993. Compiler writers structured the code that compilers produced so
> that it accorded with the branch predictor's heuristics.
You also have the situation where the CPU detects that a branch
mispredicts a lot. In that case, Pentium Pro and later processors will,
instead of choosing one branch over the other, execute _both_ branches
at the same time, thus avoiding the pipeline stall. The throughput will
be lower since some instructions will be executed and later discarded,
but often the CPU has execution units available to lessen the overhead.
What I find interesting is that Intel and other introduced conditional
move instructions, cmovxx, so that the programmer could avoid pipeline
stalls. But on newer processors, these conditional move instructions
have been (temporarily?) obsoleted. It is better to do a small jump
because the conditinal move instruction introduces data dependencies
between registers that the branching version of the same code does not
have.
So the CPUs have tricks to avoid meltdown on unpredictable conditional
branches as well.
astor