thr3ads.net - llvm dev - [LLVMdev] Optimizing math code [Feb 2014]

If this information is useful, please help other people find it:
Share via:

Michael Hamburg

2014-Feb-18 03:13 UTC

[LLVMdev] Optimizing math code

On Feb 17, 2014, at 6:38 PM, Stephen Checkoway <s at pahtak.org> wrote:
> 
> On Feb 17, 2014, at 8:10 PM, Michael Hamburg <mike at shiftleft.org>
wrote:
> 
>> First, addition.  I have multiprecision integer objects, and I’d like
to add them component-wise (likewise, subtract, negate, mask…).  For example:
>> 
>> struct mp {
>> 	int limb[8];
>> } __attribute__((aligned(32))) ;
>> 
>> void add(struct mp *a, const struct mp *b, const struct mp *c) {
>>   for (int i=0; i<8; i++) a->limb[i] = b->limb[i] +
c->limb[i];
>> }
> 
> If the struct represents an integer, shouldn't the add take the carry
into account?
Yes.  However, I’m using a redundant representation, with 56-bit digits in a
64-bit limb.  (The real code uses uint64_t’s instead of ints.)  That way up to 8
bits of carries can accumulate, and I can propagate them all at a more opportune
time.

Cheers,
— Mike

David Chisnall

2014-Feb-18 08:43 UTC

head link

[LLVMdev] Optimizing math code

On 18 Feb 2014, at 03:13, Michael Hamburg <mike at shiftleft.org> wrote:
> Yes.  However, I’m using a redundant representation, with 56-bit digits in
a 64-bit limb.  (The real code uses uint64_t’s instead of ints.)  That way up to
8 bits of carries can accumulate, and I can propagate them all at a more
opportune time.
Two things to note here:

1) Your representation makes it very hard to vectorise.  If you were doing the
accumulate on every cycle, then you wouldn't even need to treat them as a
vector and they could simply be folded to the largest integer type that the
target supports, so if the vector registers can contain 128-bit integers that
would work too.

2) By deferring the carry calculation, there is a significant chance that you
are inserting a timing attack into your crypto code.  I'm finding it quite
difficult to think of ways that it is possible to compile code using this
representation that doesn't leak information about the key to sufficiently
patient attackers.  You might like to look up some of the problems with 64-bit
multiply on older ARM systems for an idea why - many of those theoretical
attacks are now practical for a remote attacker in a couple of hours.

David
(Who is glad he doesn't have to write crypto code)

Mike Hamburg

2014-Feb-18 17:28 UTC

head link

[LLVMdev] Optimizing math code

On Feb 18, 2014, at 12:43 AM, David Chisnall <David.Chisnall at
cl.cam.ac.uk> wrote:
> On 18 Feb 2014, at 03:13, Michael Hamburg <mike at shiftleft.org>
wrote:
> 
>> Yes.  However, I’m using a redundant representation, with 56-bit digits
in a 64-bit limb.  (The real code uses uint64_t’s instead of ints.)  That way up
to 8 bits of carries can accumulate, and I can propagate them all at a more
opportune time.
> 
> Two things to note here:
> 
> 1) Your representation makes it very hard to vectorise.  If you were doing
the accumulate on every cycle, then you wouldn't even need to treat them as
a vector and they could simply be folded to the largest integer type that the
target supports, so if the vector registers can contain 128-bit integers that
would work too.
On the contrary, if I were doing the accumulate on every cycle, it would be
nearly impossible to vectorize.  As it is, the code is easy to vectorize:
it's running fine with __attribute__((ext_vector_length(...))).  I just want
something more portable.  Like a "for" loop which is equivalent to a
vector op.

I'm not aware of any platform which supports 128-bit integer ops in its
vector registers... what target were you thinking of?
> 2) By deferring the carry calculation, there is a significant chance that
you are inserting a timing attack into your crypto code.  I'm finding it
quite difficult to think of ways that it is possible to compile code using this
representation that doesn't leak information about the key to sufficiently
patient attackers.
Redundant encodings are especially popular with implementors who are worried
about side-channel attacks.  This is because with a standard encoding, you
can't check if your carry chain is done without exposing yourself to a
timing attack.  So you have to assume that the carries propagate for as long as
possible.  Even if you're just adding 1, you have to propagate the carry all
the way.  But with a redundant encoding, it doesn't propagate at all.

Likewise, the reductions in my code aren't done by testing to see if
overflow is about to occur.  Instead, they're done by a static analysis (in
an external program) which computes when the numbers might overflow, and where
reductions should be placed to prevent this.  So I'm hoping that I can avoid
screwing this property up over a couple thousand lines.

The reductions themselves also take constant time.  They basically just shuffle
8 bits up to the next word, and do an add.  The resulting number still isn't
perfectly reduced -- that's much more expensive -- but the carry on each
limb is now at most 1.  This can also be vectorized, but it's more annoying
because of the shuffles involved.
>  You might like to look up some of the problems with 64-bit multiply on
older ARM systems for an idea why - many of those theoretical attacks are now
practical for a remote attacker in a couple of hours.
Yeah, I'll have to be careful about how things get reduced on ARM.  In
particular, I'm using PCMPEQD in some places, which is constant-time, but
the most obvious translation of it to a vectorless platform is not
constant-time.  And I'll have to make sure that if there are any 64-bit ops
left, they end up constant-time.  Likewise, I need to either make asm intrinsics
for any true 128-bit ops that I want to do on a 64-bit platform, or else find a
way to assure myself that they consistently emit ADC and not a jump.

On the other hand, I'm not targeting ARM platforms with a variable-time
multiply instruction, because that's a pain.

Cheers,
-- Mike

llvm dev - Feb 2014 - [LLVMdev] Optimizing math code

[LLVMdev] Optimizing math code

[LLVMdev] Optimizing math code

[LLVMdev] Optimizing math code