thr3ads.net - llvm dev - [LLVMdev] Optimizing math code [Feb 2014]

If this information is useful, please help other people find it:
Share via:

Michael Hamburg

2014-Feb-18 01:10 UTC

[LLVMdev] Optimizing math code

Hello LLVM-dev,

I’m writing some crypto math code in C, and trying to make it as simple and
portable as possible, so that there won’t be any unforeseen platform bugs. 
Ideally, I’d like the compiler to as much work as possible for me, and I’m
wondering if you know how to make clang (or gcc, for that matter) do some of
this.  So I have a couple of questions.  If this is the wrong list to ask for
LLVM+Clang questions, please point me to a better one :-)


First, addition.  I have multiprecision integer objects, and I’d like to add
them component-wise (likewise, subtract, negate, mask…).  For example:

struct mp {
	int limb[8];
} __attribute__((aligned(32))) ;

void add(struct mp *a, const struct mp *b, const struct mp *c) {
    for (int i=0; i<8; i++) a->limb[i] = b->limb[i] + c->limb[i];
}

When compiling this code on an architecture with a vector unit, I’d ideally like
to see it get vectorized.  With AVX2, the whole function can be done in 3
instructions (load, add, store), or as little as 1 instruction if it’s inlined
and the inputs are already in registers.  But clang doesn’t vectorize it.  Is
there a simple way to get it to vectorize portably?  I’d like to port to ARM
NEON as well as SSE and AVX, and I’d like to be compatible with GCC, so I’d
rather not use ext_vector_length if I can avoid it.  If I can’t avoid it, I can
cobble something together with a big vector_intrinsics.h file or something.

You might think that vectorizing a 4-long loop wouldn’t matter, but of course
field addition is inlined and called all the time.

GCC 4.7.2 vectorizes the above code, but bails to scalar code if a==b or a==c,
which isn’t necessary.  It doesn’t check this if you declare the arrays
restrict, but in fact the function is called with a=b=c, so restrict would be
wrong.


A second issue is multiplication.  Even in a trivial case, I can’t convince
either clang-3.3 or gcc-4.7.2 to unroll a nested loop with a fixed trip count:

void mul(struct foo *a, const struct foo *b, const struct foo *c) {
    for (int i=0; i<2; i++) {
      for (int j=0; j<=i; j++) {
      	a->limb[i] += b->limb[j] * c->limb[j-i];
      }
    }
}

On GCC, -funroll-loops makes the code much worse by duplicating the inner loop
body many times.  Clang doesn’t do that, but it still doesn’t unroll the inner
loop.  Is there a way to automatically unroll small, fixed, nested loops?

Thanks for your time,
— Mike

Stephen Checkoway

2014-Feb-18 02:38 UTC

head link

[LLVMdev] Optimizing math code

On Feb 17, 2014, at 8:10 PM, Michael Hamburg <mike at shiftleft.org>
wrote:
> First, addition.  I have multiprecision integer objects, and I’d like to
add them component-wise (likewise, subtract, negate, mask…).  For example:
> 
> struct mp {
> 	int limb[8];
> } __attribute__((aligned(32))) ;
> 
> void add(struct mp *a, const struct mp *b, const struct mp *c) {
>    for (int i=0; i<8; i++) a->limb[i] = b->limb[i] +
c->limb[i];
> }
If the struct represents an integer, shouldn't the add take the carry into
account?

-- 
Stephen Checkoway

Michael Hamburg

2014-Feb-18 03:13 UTC

head link

[LLVMdev] Optimizing math code

On Feb 17, 2014, at 6:38 PM, Stephen Checkoway <s at pahtak.org> wrote:
> 
> On Feb 17, 2014, at 8:10 PM, Michael Hamburg <mike at shiftleft.org>
wrote:
> 
>> First, addition.  I have multiprecision integer objects, and I’d like
to add them component-wise (likewise, subtract, negate, mask…).  For example:
>> 
>> struct mp {
>> 	int limb[8];
>> } __attribute__((aligned(32))) ;
>> 
>> void add(struct mp *a, const struct mp *b, const struct mp *c) {
>>   for (int i=0; i<8; i++) a->limb[i] = b->limb[i] +
c->limb[i];
>> }
> 
> If the struct represents an integer, shouldn't the add take the carry
into account?
Yes.  However, I’m using a redundant representation, with 56-bit digits in a
64-bit limb.  (The real code uses uint64_t’s instead of ints.)  That way up to 8
bits of carries can accumulate, and I can propagate them all at a more opportune
time.

Cheers,
— Mike

Nadav Rotem

2014-Feb-18 05:19 UTC

head link

[LLVMdev] Optimizing math code

On Feb 17, 2014, at 5:10 PM, Michael Hamburg <mike at shiftleft.org>
wrote:
>  Even in a trivial case, I can’t convince either clang-3.3 or

Clang3.3 is really old. Please try the latest LLVM. 
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140217/54584075/attachment.html>

Mike Hamburg

2014-Feb-18 06:16 UTC

head link

[LLVMdev] Optimizing math code

On Feb 17, 2014, at 9:19 PM, Nadav Rotem <nrotem at apple.com> wrote:
> 
> On Feb 17, 2014, at 5:10 PM, Michael Hamburg <mike at shiftleft.org>
wrote:
> 
>>  Even in a trivial case, I can’t convince either clang-3.3 or
> 
> 
> Clang3.3 is really old. Please try the latest LLVM. 
Same result on clang version 3.5 (trunk 201542).

-- Mike

llvm dev - Feb 2014 - [LLVMdev] Optimizing math code

[LLVMdev] Optimizing math code

[LLVMdev] Optimizing math code

[LLVMdev] Optimizing math code

[LLVMdev] Optimizing math code

[LLVMdev] Optimizing math code