Type checking using the FPU

From: Peter Ammon
Subject: Type checking using the FPU
Date: Sun, 07 Dec 2003 20:45:28 +0000
Message-ID: <IjMAb.66890$se1.41225@newssvr25.news.prodigy.com>

I've heard of a method for fast dynamic type checking that uses the FPU.
This approach seems pretty novel and interesting to me. Can anyone
give a link or reference to provide more information?  Thanks.

-Peter

From: Rayiner Hashem
Subject: Re: Type checking using the FPU
Date: Tue, 09 Dec 2003 00:54:53 +0000
Message-ID: <a3995c0d.0312081654.e7ebd24@posting.google.com>

That wouldn't be a great idea. The FPU isn't a general purpose unit.
The FPU generally has higher latencies than the integer
unit (on the P4, it's 1/2 cycle for simple integer instructions vs. > 1
cycle for simple FPU instructions). It also tends to have a longer
pipeline, which further increases latency. Most CPUs even have a higher
cache latency for floating-point vs. integer loads. On the P4, an
integer load from the L1 cache takes 2 clock cycles, while a
floating-point load takes 6! All of this is because the FPU is
designed for streaming code. It performs best when you're performing
simple operations on a large amount of data, with few branches. In
contrast, type checking or generic dispatch is classic integer code.
It's a bunch of simple integer operations accessing data that is spread
out all over memory, with a large percentage of branches. The
integer units of current CPUs already perform poorly at this task*,
and the FPU is even worse at it.
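
To make "classic integer code" concrete, here is a minimal sketch of a
dynamic type check. The tagging scheme (a type tag in the low two bits of a
machine word) and every name in it are assumptions made for illustration,
not something taken from a particular implementation. The check itself is
one mask, one compare and one branch per operand.

```c
#include <stdint.h>

/* Hypothetical tagging scheme (an assumption, not from this thread):
   the low two bits of a word hold the type tag, the rest the payload. */
typedef uintptr_t value;

enum { TAG_MASK = 3, TAG_FIXNUM = 0, TAG_CONS = 1 };

/* Box a small integer by shifting it past the tag bits. */
static value make_fixnum(intptr_t n)
{
    return ((uintptr_t)n << 2) | TAG_FIXNUM;
}

/* The type check proper: a mask and a compare. */
static int is_fixnum(value v)
{
    return (v & TAG_MASK) == TAG_FIXNUM;
}

/* Unbox; the tag bits of a fixnum are zero, so the division is exact. */
static intptr_t fixnum_value(value v)
{
    return (intptr_t)v / 4;
}

/* Type-checked addition: returns 1 and stores the sum on success,
   returns 0 if either operand is not a fixnum. Overflow is ignored. */
static int checked_add(value a, value b, value *out)
{
    if (!is_fixnum(a) || !is_fixnum(b))
        return 0;
    *out = a + b;   /* both tags are 0, so the tagged words add directly */
    return 1;
}
```

With this layout, checked_add(make_fixnum(2), make_fixnum(3), &r) succeeds
and fixnum_value(r) is 5, while passing a word tagged TAG_CONS makes it
return 0.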

Besides, you've probably already got enough parallelism. Most current
CPUs have three or four integer units**. Most code has trouble keeping just
3 units busy. It is highly likely that any type-checking code will find
a free unit, especially since type checking depends only on simple
integer operations, and can likely be executed speculatively.

*: By my benchmarks, a simple mono-dispatch using a switch statement
takes over two dozen clock cycles on a P4.
**: The P4 only has two IUs, but they're double-clocked, so they work
as fast as 4.
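
For reference, a mono-dispatch through a switch statement has roughly the
following shape. This is only a sketch: the object layout, the type ids and
the function names are hypothetical, and it is not the code behind the
benchmark figure above. The load of the type field followed by a
data-dependent branch is the part that is hard on the pipeline.

```c
#include <stddef.h>

/* Hypothetical object layout: every object starts with an integer type id. */
enum type_id { TYPE_FIXNUM, TYPE_CONS, TYPE_STRING };

struct object {
    enum type_id type;
    size_t length;          /* used only by TYPE_STRING */
};

/* One implementation per type (sizes in words, made up for the example). */
static size_t size_fixnum(const struct object *o) { (void)o; return 1; }
static size_t size_cons(const struct object *o)   { (void)o; return 2; }
static size_t size_string(const struct object *o) { return 1 + o->length; }

/* Mono-dispatch: pick the implementation from the type of one argument. */
size_t object_size(const struct object *o)
{
    switch (o->type) {      /* load the type id, then branch on it */
    case TYPE_FIXNUM: return size_fixnum(o);
    case TYPE_CONS:   return size_cons(o);
    case TYPE_STRING: return size_string(o);
    }
    return 0;               /* unknown type id */
}
```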