I suggest that LLVM needs intrinsics for add/sub with carry, e.g.

    declare {T, i1} @llvm.addc.T(T %a, T %b, i1 c)

The current multiprecision clang intrinsics example:

    void foo(unsigned *x, unsigned *y, unsigned *z)
    { unsigned carryin = 0;
      unsigned carryout;
      z[0] = __builtin_addc(x[0], y[0], carryin, &carryout);
      carryin = carryout;
      z[1] = __builtin_addc(x[1], y[1], carryin, &carryout);
      carryin = carryout;
      z[2] = __builtin_addc(x[2], y[2], carryin, &carryout);
      carryin = carryout;
      z[3] = __builtin_addc(x[3], y[3], carryin, &carryout);
    }

uses the LLVM intrinsic "llvm.uadd.with.overflow" and generates
horrible code that doesn't use the "adc" x86 instruction.

What is the current thinking on improving multiprecision arithmetic?
On Feb 15, 2017, at 2:22 PM, Bagel via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> I suggest that LLVM needs intrinsics for add/sub with carry, e.g.
>
>     declare {T, i1} @llvm.addc.T(T %a, T %b, i1 c)
>
> The current multiprecision clang intrinsics example:
>     void foo(unsigned *x, unsigned *y, unsigned *z)
>     { unsigned carryin = 0;
>       unsigned carryout;
>       z[0] = __builtin_addc(x[0], y[0], carryin, &carryout);
>       carryin = carryout;
>       z[1] = __builtin_addc(x[1], y[1], carryin, &carryout);
>       carryin = carryout;
>       z[2] = __builtin_addc(x[2], y[2], carryin, &carryout);
>       carryin = carryout;
>       z[3] = __builtin_addc(x[3], y[3], carryin, &carryout);
>     }
> uses the LLVM intrinsic "llvm.uadd.with.overflow" and generates
> horrible code that doesn't use the "adc" x86 instruction.
>
> What is the current thinking on improving multiprecision arithmetic?

Why do you think this requires new intrinsics instead of teaching the optimizer what to do with the existing intrinsics?

– Steve
On Wed, Feb 15, 2017 at 2:28 PM, Stephen Canon via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> Why do you think this requires new intrinsics instead of teaching the
> optimizer what to do with the existing intrinsics?

In general, it is harder to reason about memory. Also, you are forced to allocate memory for the carryout even if you are not interested in using it.
Stephen Canon via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> Why do you think this requires new intrinsics instead of teaching the
> optimizer what to do with the existing intrinsics?

IMO, as a multiprecision math library maker, the "teaching the optimizer what to do with the existing intrinsics" approach is much better as long as it can be made to work. If one is careful, MSVC does optimize its intrinsics into ADC instructions in a reasonable way, so I think it is probably doable.

(Below, all math is multiprecision.) There are actually a few different operations besides pure addition and subtraction, e.g. `a - (b >> 1)` instead of just `a - b`. Also consider that we sometimes want "a - b" to be side-channel free, and other times we'd rather have "a - b" be as fast as possible and optimized for the case where carries are unlikely to propagate (far). That's already three different operations, just for subtraction.

Cheers,
Brian
--
https://briansmith.org/
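
To make the `a - (b >> 1)` case concrete, here is a minimal sketch in portable C. The helper name `sub_half`, the 32-bit limb width, and the little-endian limb order are all assumptions made for illustration; none of them come from this thread or from any particular library. It makes no attempt at the side-channel-free or fast-path variants mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from any library): computes r = a - (b >> 1)
 * over n little-endian 32-bit limbs and returns the final borrow (0 or 1). */
static uint32_t sub_half(uint32_t *r, const uint32_t *a,
                         const uint32_t *b, size_t n)
{
    uint32_t borrow = 0;
    for (size_t i = 0; i < n; i++) {
        /* Limb i of (b >> 1): the low bit of the next limb shifts in on top. */
        uint32_t high = (i + 1 < n) ? b[i + 1] : 0;
        uint32_t half = (b[i] >> 1) | (high << 31);

        /* Subtract with borrow as two steps, each with its own underflow check. */
        uint32_t diff = a[i] - half;
        uint32_t b1 = a[i] < half;    /* borrow out of a[i] - half */
        uint32_t res = diff - borrow;
        uint32_t b2 = diff < borrow;  /* borrow out of diff - borrow */

        r[i] = res;
        borrow = b1 | b2;             /* at most one of b1, b2 can be set */
    }
    return borrow;
}
```

Each loop iteration is a subtract-with-borrow, so ideally the whole loop would map onto a chain of x86 "sbb" instructions; whether a given compiler actually does that is exactly the question in this thread.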
It takes two "llvm.uadd.with.overflow" instances to model the add-with-carry when there is a carry-in. Look at the IR generated by the example. I figured that the optimization of this would be difficult (else it would have already been done :-)). And would this optimization have to be done for every architecture?

On 02/15/2017 04:28 PM, Stephen Canon wrote:
> Why do you think this requires new intrinsics instead of teaching the
> optimizer what to do with the existing intrinsics?
>
> – Steve
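
The point about needing two overflow-checked additions per limb can be sketched in portable C. This is only an illustration of the shape of the computation, not the IR clang actually emits; the names `addc32` and `add4` and the use of `uint32_t` are assumptions introduced here.

```c
#include <stdint.h>

/* Hypothetical helper: add with carry-in, written as two unsigned adds,
 * each followed by its own overflow check. */
static uint32_t addc32(uint32_t a, uint32_t b, uint32_t carry_in,
                       uint32_t *carry_out)
{
    uint32_t partial = a + b;
    uint32_t c1 = partial < a;      /* overflow of a + b */
    uint32_t sum = partial + carry_in;
    uint32_t c2 = sum < partial;    /* overflow of partial + carry_in */
    *carry_out = c1 | c2;           /* the two overflows cannot both occur */
    return sum;
}

/* Four-limb addition in the style of the foo() example from this thread;
 * returns the carry out of the top limb. */
static uint32_t add4(const uint32_t *x, const uint32_t *y, uint32_t *z)
{
    uint32_t carry = 0;
    for (int i = 0; i < 4; i++)
        z[i] = addc32(x[i], y[i], carry, &carry);
    return carry;
}
```

A backend that wants to emit a single "adc" per limb has to recognize the pair of overflow checks joined by the OR as one carry chain, which is the pattern-matching burden being discussed.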