With some more tweaking, LLVM comes within ~10% of GCC. Still, there's
a difference. Here's a test case that does a 128-bit x 128-bit ->
256-bit multiply.
;; mac: accum += zext(sext(x) * sext(y)), i.e. a 64x64 -> 128-bit
;; (signed) multiply accumulated into a 192-bit value
define private i192 @mac(i192 %accum, i64 %x, i64 %y) alwaysinline {
entry:
  %xx = sext i64 %x to i128
  %yy = sext i64 %y to i128
  %zz = mul i128 %xx, %yy
  %zzz = zext i128 %zz to i192
  %out = add i192 %accum, %zzz
  ret i192 %out
}
;; fullmul: schoolbook 2x2-limb multiply; each column is accumulated
;; with @mac and then shifted down 64 bits
define [4 x i64] @fullmul([2 x i64] %x, [2 x i64] %y) {
entry:
  %xlo = extractvalue [2 x i64] %x, 0
  %xhi = extractvalue [2 x i64] %x, 1
  %ylo = extractvalue [2 x i64] %y, 0
  %yhi = extractvalue [2 x i64] %y, 1
  ; column 0: x[0]*y[0]
  %m0 = call i192 @mac(i192 0, i64 %xlo, i64 %ylo)
  %o0 = trunc i192 %m0 to i64
  %out0 = insertvalue [4 x i64] undef, i64 %o0, 0
  ; column 1: carry + x[1]*y[0] + x[0]*y[1]
  %m1 = lshr i192 %m0, 64
  %m10 = call i192 @mac(i192 %m1, i64 %xhi, i64 %ylo)
  %m11 = call i192 @mac(i192 %m10, i64 %xlo, i64 %yhi)
  %o1 = trunc i192 %m11 to i64
  %out1 = insertvalue [4 x i64] %out0, i64 %o1, 1
  ; column 2: carry + x[1]*y[1]
  %m2 = lshr i192 %m11, 64
  %m20 = call i192 @mac(i192 %m2, i64 %xhi, i64 %yhi)
  %o2 = trunc i192 %m20 to i64
  %out2 = insertvalue [4 x i64] %out1, i64 %o2, 2
  ; column 3: remaining carry
  %m3 = lshr i192 %m20, 64
  %o3 = trunc i192 %m3 to i64
  %out3 = insertvalue [4 x i64] %out2, i64 %o3, 3
  ret [4 x i64] %out3
}
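For reference, here's a rough C++ rendering of the same routine -- a
sketch only, using the GCC/Clang unsigned __int128 extension for the
products and three 64-bit words for the i192 accumulator (mac192,
shift192, and fullmul_ref are names I've made up here). It mirrors the
IR's sext, which is why llc emits imulq below; a conventional unsigned
bignum multiply would zero-extend instead.

#include <stdint.h>

typedef unsigned __int128 u128;

/* mac192: acc += zext(sext(x) * sext(y)), with the i192 accumulator
   held as three little-endian 64-bit words -- mirrors @mac above. */
static void mac192(uint64_t acc[3], uint64_t x, uint64_t y) {
    u128 p = (u128)((__int128)(int64_t)x * (int64_t)y);
    u128 lo = (u128)acc[0] + (uint64_t)p;
    acc[0] = (uint64_t)lo;
    u128 mid = (u128)acc[1] + (uint64_t)(p >> 64) + (uint64_t)(lo >> 64);
    acc[1] = (uint64_t)mid;
    acc[2] += (uint64_t)(mid >> 64);
}

/* shift192: return the low word, then acc >>= 64 (the trunc/lshr pair). */
static uint64_t shift192(uint64_t acc[3]) {
    uint64_t out = acc[0];
    acc[0] = acc[1];
    acc[1] = acc[2];
    acc[2] = 0;
    return out;
}

void fullmul_ref(uint64_t out[4], const uint64_t x[2], const uint64_t y[2]) {
    uint64_t acc[3] = {0, 0, 0};
    mac192(acc, x[0], y[0]);          /* column 0 */
    out[0] = shift192(acc);
    mac192(acc, x[1], y[0]);          /* column 1 */
    mac192(acc, x[0], y[1]);
    out[1] = shift192(acc);
    mac192(acc, x[1], y[1]);          /* column 2 */
    out[2] = shift192(acc);
    out[3] = acc[0];                  /* column 3: remaining carry */
}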
The C driver is:

#include <stdint.h>

/* The prototype doesn't match the IR signature textually, but under the
   x86-64 SysV ABI the [4 x i64] return becomes an sret pointer in %rdi
   and the two [2 x i64] arguments arrive in %rsi:%rdx and %rcx:%r8,
   which is exactly what this five-argument declaration produces. */
extern void fullmul(uint64_t out[4], uint64_t a, uint64_t b,
                    uint64_t c, uint64_t d);

int main(int argc, char **argv) {
    uint64_t x[4] = {1, 2, 3, 4}, n;
    /* each iteration feeds the previous result back in, so the call
       can't be hoisted out of the timing loop */
    for (n = 0; n < 100000000; n++)
        fullmul(x, x[0], x[1], x[2], x[3]);
    return 0;
}
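To sanity-check llc's output against the reference above, something
like this works (again just a sketch: it assumes fullmul_ref from the
previous snippet, linked together with the compiled IR):

#include <stdint.h>
#include <stdio.h>

extern "C" void fullmul(uint64_t out[4], uint64_t a, uint64_t b,
                        uint64_t c, uint64_t d);
void fullmul_ref(uint64_t out[4], const uint64_t x[2], const uint64_t y[2]);

int main() {
    uint64_t x[2] = { 0x0123456789abcdefULL, 0xfedcba9876543210ULL };
    uint64_t y[2] = { 0xdeadbeefdeadbeefULL, 0x0000cafebabe1234ULL };
    uint64_t got[4], want[4];
    fullmul(got, x[0], x[1], y[0], y[1]);
    fullmul_ref(want, x, y);
    for (int i = 0; i < 4; i++)
        if (got[i] != want[i]) {
            printf("mismatch in word %d\n", i);
            return 1;
        }
    printf("ok\n");
    return 0;
}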
The assembly is:
        movq    %rdx, %r9
        movq    %r9, %rax
        imulq   %rcx
        movq    %rax, %r10
        movq    %rdx, %r11
        movq    %rsi, %rax
        imulq   %rcx
        movq    %rax, (%rdi)
        movq    %rdx, %rcx      ;;; extraneous
        addq    %r10, %rcx
        adcq    $0, %r11
        sbbq    %r10, %r10
        andq    $1, %r10
        movq    %rsi, %rax
        imulq   %r8
        addq    %rcx, %rax
        movq    %rax, 8(%rdi)
        movq    %rdx, %rcx      ;;; extraneous
        adcq    %r11, %rcx
        adcq    $0, %r10
        movq    %r9, %rax
        imulq   %r8
        addq    %rcx, %rax
        movq    %rax, 16(%rdi)
        adcq    %r10, %rdx
        movq    %rdx, 24(%rdi)
        ret
... which has two extraneous mov instructions (marked above). On my
laptop (1.66GHz Conroe), removing the extra movs and renaming registers
accordingly drops the time from 1.95s to 1.75s; about 0.4s of each
figure is loop overhead, so the movs cost roughly 13% of the actual
multiply time. The difference is less significant on newer hardware
(Nehalem).
It's hard to say what impact such extraneous movs have on a real
program. My current testing code uses 4x4 Montgomery multiplies, which
are longer but contain many more extraneous movs. So far I've only
tested the GCC version against the LLVM version, and the GCC version is
5-10% faster; this time the difference is more significant on newer
hardware, perhaps because the muls and adcs are less of a bottleneck
there. The comparison is further skewed by the fact that the GCC and
LLVM versions use slightly different algorithms, and the LLVM code
seems to optimize slightly better in other ways... so the extra movs
may be hurting it by more than 5-10%.
Cheers,
Mike
On 03/30/2011 11:24 AM, Mike Hamburg wrote:
> Hello all!
>
> I'm working on a library with bignum support, and I wanted to try LLVM
> as an apparently simpler and more portable system to my current design
> (a Haskell script which spits out mixed C and assembly). Porting the
> script to use the LLVM bindings instead of the current hack was pretty
> easy. But I have a few remaining questions:
>
> (1) Are bignums exposed to any higher-level language? It would be nice
> to be able to write all the code in C and compile it with Clang... but I
> need 256-bit integer support.
>
> (2) Is there a way to convince LLVM's register allocator to do the right
> thing on x86? I'm getting swaths of code like:
> movq 88(%rsp), %rax
> mulq 112(%rsp)
> movq %rax, %r15
> addq %r11, %r15
> movq %rdx, %r14
> adcq %rcx, %r14
> adcq $0, %r9
> (that's a 64x64 -> 128-bit multiply with 192-bit accumulate.) The
> problem is, %r11 and %rcx are dead here. It should have just added %rax
> and %rdx into them. This results in more movs, more spills, more code
> and less performance.
>
> (3) Is there a way to avoid this whole mess? I'm using a script to spit
> out ugly code in large part because GCC and Clang both refuse to unroll
> loops that look like
>
> int i, j;
> // accumulator uses inline asm to do the 64x64 -> 128 -> 192-bit
> // accumulate above
> accumulator acc(a[0], b[0]);
> tmp[0] = acc.shift();
> for (j = 1; j < n; j++) {
>     for (i = 0; i <= j; i++)
>         acc.mac(a[i], b[j-i]);
>     tmp[j] = acc.shift();
> }
>
> where n is a template parameter, and thus known at compile time. Is
> there some clang pass which will unroll this properly? -O3
> -funroll-loops doesn't seem to (in fact, -funroll-loops makes it
> worse... it tries to unroll the loops by a factor of 4 instead of
> completely unwinding them). Is there some opt pass which can fix it?
> This is more relevant if clang can do sane things with the registers
> (its performance is half that of GCC right now), but it'd be nice to know.
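
Re (3): for what it's worth, one workaround is to unroll by hand with
template recursion instead of runtime loops, so there's nothing left
for the unroller to miss. A sketch -- acc192 is a made-up, portable
stand-in for the inline-asm accumulator class, and mul_columns mirrors
the loop quoted above:

#include <stdint.h>

// acc192: hypothetical stand-in for the inline-asm accumulator
// (the real one does the 64x64 -> 128 -> 192-bit accumulate in asm).
struct acc192 {
    unsigned __int128 lo;   // low 128 bits of the accumulator
    uint64_t hi;            // top 64 bits
    acc192(uint64_t x, uint64_t y) : lo((unsigned __int128)x * y), hi(0) {}
    void mac(uint64_t x, uint64_t y) {
        unsigned __int128 p = (unsigned __int128)x * y;
        lo += p;
        hi += (lo < p);     // carry out of the low 128 bits
    }
    uint64_t shift() {      // return the low word, then acc >>= 64
        uint64_t out = (uint64_t)lo;
        lo = (lo >> 64) | ((unsigned __int128)hi << 64);
        hi = 0;
        return out;
    }
};

// inner<J,I>: acc.mac(a[I], b[J-I]) for I = J down to 0, fully expanded
// at compile time (the order of the macs doesn't affect the result).
template <int J, int I> struct inner {
    static void run(acc192 &acc, const uint64_t *a, const uint64_t *b) {
        acc.mac(a[I], b[J - I]);
        inner<J, I - 1>::run(acc, a, b);
    }
};
template <int J> struct inner<J, -1> {
    static void run(acc192 &, const uint64_t *, const uint64_t *) {}
};

// outer<N,J>: one fully-unrolled column per J, for J = 1 .. N-1.
template <int N, int J> struct outer {
    static void run(acc192 &acc, const uint64_t *a, const uint64_t *b,
                    uint64_t *tmp) {
        inner<J, J>::run(acc, a, b);
        tmp[J] = acc.shift();
        outer<N, J + 1>::run(acc, a, b, tmp);
    }
};
template <int N> struct outer<N, N> {
    static void run(acc192 &, const uint64_t *, const uint64_t *, uint64_t *) {}
};

// Same shape as the loop above, but with no loop structure at all.
template <int N>
void mul_columns(uint64_t *tmp, const uint64_t *a, const uint64_t *b) {
    acc192 acc(a[0], b[0]);
    tmp[0] = acc.shift();
    outer<N, 1>::run(acc, a, b, tmp);
}

Instantiating, say, mul_columns<4> should expand into straight-line
code with nothing left for the backend's unroller to get wrong.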
>
> Thanks for your time!
> -- Mike Hamburg