Hi all, After searching for a decent compiler backend for ages (google sometimes isn't helpful), I recently stumbled upon LLVM. Woot!! I work on bignum arithmetic (I'm a professional mathematician) and have recently decided to switch from developing GPL'd bignum code to BSD licensed code. (See mpir.org which I contributed to for a while - a fork of GMP). Please bear with me if these are stupid questions (I am a mathematician and have only a glancing knowledge of compiler internals - though not quite zero knowledge). Looking through the language reference, I see you have an int type which can be any number of bits. On x86_64, however, I see that mul only "works" for up to 128 bits. In all cases it appears to return only the low half of the product. For addition, I still haven't understood the documentation well enough to know whether a carry out from an int64 + int64 addition can be retrieved, e.g. through a trap, and how to do ADC. There are also some other primitives which might be useful for bignum.... Obviously many non-C languages have optimised bignum capabilities these days, and you do NOT get good performance by writing such primitives in C!! Anyhow, to my questions: a) What plans are there to support addition, subtraction, multiplication, division, shifting, logical not and other logic ops for >= 64/128 bits (machine word size, whatever)? The documentation mentions integers of millions of bits.... b) Would it be desirable to add primitives for retrieving both the low and high half of a multiplication and optimisations to combine where wide multiplications are available on the chip? c) I want to ask something about retrieving the carry or borrow from an addition or subtraction and using it in an ADC or SBB..... not sure what to ask..... d) Are you guys at all interested in supporting languages that provide multiprecision arithmetic? I have a vague notion that I'd like to sign on to help with that (and may be able to bring some other experienced devels with me if there was some interest). I did see APInt and understand the necessity of such a library for constant folding etc. but my questions aren't specifically about that (though perhaps I can help out there too, if anything still remains to be done....) Specifically I am talking about providing so-called "native" bignum support, i.e. highly optimised, riding on the bare metal, type bignum stuff. Please understand this is just an initial email to enquire about current status, plans, etc, so I can gauge how my existing expertise might or might not be of use. Bill Hart. P.S: I'm excited about finding LLVM. It is without a doubt one of the most outstanding Open Source projects I have encountered to date!! But you already knew that.
On Fri, Jun 11, 2010 at 10:37 AM, Bill Hart <goodwillhart at googlemail.com> wrote:> a) What plans are there to support addition, subtraction, > multiplication, division, shifting, logical not and other logic ops > for >= 64/128 bits (machine word size, whatever)? The documentation > mentions integers of millions of bits....Pretty much everything besides muliplication and division should work at any width; the result is generally not especially efficient, though.> b) Would it be desirable to add primitives for retrieving both the low > and high half of a multiplication and optimisations to combine where > wide multiplications are available on the chip?This should already work where applicable; for example, the following C code gives an appropriately optimized result on x86-32: int a(int x, int y) { return ((long long)x)*y >> 32; }> c) I want to ask something about retrieving the carry or borrow from > an addition or subtraction and using it in an ADC or SBB..... not sure > what to ask.....LLVM automatically expands wide addition/subtraction with ADC/SBB. There's no other general way to write something which optimizes to an ADC/SBB, at least at the moment. What do you want here?> d) Are you guys at all interested in supporting languages that provide > multiprecision arithmetic? I have a vague notion that I'd like to sign > on to help with that (and may be able to bring some other experienced > devels with me if there was some interest).By that, you mean languages that provide arbitrary-width integers? You can already support that by lowering such operations to calls to a multi-precision library instead of IR operations in your frontend. It would be an interesting experiment to write a pass which converts IR operations which are too wide to calls to a multi-precision library, though. -Eli
On Fri, Jun 11, 2010 at 3:28 PM, Bill Hart <goodwillhart at googlemail.com> wrote:> Hi Eli, > > On 11 June 2010 22:44, Eli Friedman <eli.friedman at gmail.com> wrote: >> On Fri, Jun 11, 2010 at 10:37 AM, Bill Hart <goodwillhart at googlemail.com> wrote: >>> a) What plans are there to support addition, subtraction, >>> multiplication, division, shifting, logical not and other logic ops >>> for >= 64/128 bits (machine word size, whatever)? The documentation >>> mentions integers of millions of bits.... >> >> Pretty much everything besides muliplication and division should work >> at any width; the result is generally not especially efficient, >> though. > > OK, I only tried multiplication last night when I first had a decent > fiddle (been marking exams, so haven't had much time to really get > dirty hands, sorry). > > Of course efficiency is imperative for bignum stuff as it underpins > absolutely behemoth projects like Sage (sagemath.org). > Every cycle counts, so to speak. One cycle per limb difference in > multiplication makes a nearly 40% difference in absolutely everything > for literally many thousands of developers!Right; improvements are definitely welcome here; you might want to take a look at what we currently generate here, though, to see what improvements are most necessary.>>> b) Would it be desirable to add primitives for retrieving both the low >>> and high half of a multiplication and optimisations to combine where >>> wide multiplications are available on the chip? >> >> This should already work where applicable; for example, the following >> C code gives an appropriately optimized result on x86-32: >> >> int a(int x, int y) { return ((long long)x)*y >> 32; } >> > > Right. I'll take a closer look at that, but I don't think it is what I mean. > > I'm specifically talking about x86_64 and returning the high 64 bits > of a 64x64 bit multiply. I don't think this is done, or done > efficiently in LLVM currently. Or I might be missing something.Should work the same as the previous example; try the following on x86-64: long long a(long long x, long long y) { return ((__int128_t)x)*y >> 64; }> A basic operation in bignum stuff is addmul_1, i.e. multiply a > multiple precision integer with a single "limb" (machine word) and add > the resulting multiple precision result to another bignum. This is not > best achieved by doing lots of wide multiplies, dealing with carries, > then adding. On AMD x86_64 we get 2.5 cycles per limb in direct > assembly for addmul_1 (actually achieving the maximum retirement rate > for macro ops on AMD), but more like 10 cycles per limb with C. Even > the assembly version of addmul_1 is not efficient enough for a > multiplication basecase though, where you essentially need the whole > basecase multiplication done in assembly! > > It is this assembly basecase which I would like to be able to code > directly in LLVM assembly.Hmm... might be interesting, but you'll likely get much better practical results by just>>> c) I want to ask something about retrieving the carry or borrow from >>> an addition or subtraction and using it in an ADC or SBB..... not sure >>> what to ask..... >> >> LLVM automatically expands wide addition/subtraction with ADC/SBB. >> There's no other general way to write something which optimizes to an >> ADC/SBB, at least at the moment. What do you want here? > > I should take a look at what happens for a multiprecision > addition/subtraction before I answer that one. I merely assumed that > this also only worked up to 128 bits on x86_64 as per multiplication.It should work at any width (although once you get to extremely wide integers, the number of spills makes the code get messy). Also, this isn't something we do at the moment, but it's would be possible to teach the backend to transform code using the i1 output of llvm.uadd.with.overflow into an ADC, etc.>> >>> d) Are you guys at all interested in supporting languages that provide >>> multiprecision arithmetic? I have a vague notion that I'd like to sign >>> on to help with that (and may be able to bring some other experienced >>> devels with me if there was some interest). >> >> By that, you mean languages that provide arbitrary-width integers? > > Right. E.g. Haskell, python, ECL, many others. > >> You can already support that by lowering such operations to calls to a >> multi-precision library instead of IR operations in your frontend. > > Indeed that is what they do.... currently. :-) > > I work on the bignum libraries though, not language front ends. So, I > want to know how to implement a bignum library using LLVM. At the > moment, I can't get close enough to the machine to do anything > efficient. These libraries usually have large amounts of assembly, not > C. But LLVM is so capable, I think we've been doing this wrong all > along. Instead of writing assembly for each machine, we should be > using LLVM assembly code and writing part of the bignum library on the > "other side" of this interface. > > If you have time, check out the cminusminus language reference. They > had a bits1 type for flags, like carry from addition, and a mulhi and > mullo instruction. Now cminusminus never took off and died an untimely > death (largely because they wrote it in some bizarre ML in my > opinion), but from a bignum point of view, they got this one minor > detail right.Hmm... LLVM currently uses i1 pretty extensively, and you can express equivalents of mulhi/mullo as mentioned above.> There's a few ways this could be done though. I mean, imagine if LLVM > optimised around bignum calls by say breaking bignum calls up into > more basic primitives and rearranging and coallescing those, etc. I'm > not saying it *should* do that. It would be a multi-year effort, and > probably a large grant application to do it (I'm seriously toying with > the idea...). But I am interested to know what the existing intentions > of the LLVM devs are for supporting arbitrary precision stuff, as it > seemed, at least at first blush, as though something non-trivial may > have been planned all along. > > It interested me greatly when I noticed you had these intn types for > an arbitrary number of bits, and it almost made me think there was > some planning had gone into supporting arbitrary precision arithmetic > as basic instructions in the LLVM. Bizarre, but not stupid, perhaps.There really isn't any grand plan here; LLVM switched a while back from named integer types to a generalized integer type as part of a large type-system rewrite; the types were left unrestricted just because there wasn't any particular reason to restrict them. The support for such types has gradually improved since then. And most recently, we've started generating such types in the optimizers to allow transforming complex structures into SSA form. There have really only been sporadic queries about practical support for real multi-precision arithmetic, but we'd welcome any improvements. -Eli> The reason this is not stupid is that parallel architectures, > especially GPU like ones could well be set up to offer bignum > capabilities. There it really may make sense to offer them as a low > level primitive. But if there, then why not elsewhere. > > Sorry for the rambling. I just want to convey that I don't have some > specific set plan in mind. I'm really just interested in understanding > the original thought that went into these intn types, and why they are > there. I think that in order to really be of any help to LLVM, I need > to understand what the original design goals were in this regard. > >> It >> would be an interesting experiment to write a pass which converts IR >> operations which are too wide to calls to a multi-precision library, >> though. > > Yes, very interesting indeed!! > > Bill.
On 12 June 2010 00:51, Eli Friedman <eli.friedman at gmail.com> wrote:> On Fri, Jun 11, 2010 at 3:28 PM, Bill Hart <goodwillhart at googlemail.com> wrote: >> Hi Eli, >> >> On 11 June 2010 22:44, Eli Friedman <eli.friedman at gmail.com> wrote: >>> On Fri, Jun 11, 2010 at 10:37 AM, Bill Hart <goodwillhart at googlemail.com> wrote: >>>> a) What plans are there to support addition, subtraction, >>>> multiplication, division, shifting, logical not and other logic ops >>>> for >= 64/128 bits (machine word size, whatever)? The documentation >>>> mentions integers of millions of bits.... >>> >>> Pretty much everything besides muliplication and division should work >>> at any width; the result is generally not especially efficient, >>> though. >> >> OK, I only tried multiplication last night when I first had a decent >> fiddle (been marking exams, so haven't had much time to really get >> dirty hands, sorry). >> >> Of course efficiency is imperative for bignum stuff as it underpins >> absolutely behemoth projects like Sage (sagemath.org). >> Every cycle counts, so to speak. One cycle per limb difference in >> multiplication makes a nearly 40% difference in absolutely everything >> for literally many thousands of developers! > > Right; improvements are definitely welcome here; you might want to > take a look at what we currently generate here, though, to see what > improvements are most necessary.Roger that.> >>>> b) Would it be desirable to add primitives for retrieving both the low >>>> and high half of a multiplication and optimisations to combine where >>>> wide multiplications are available on the chip? >>> >>> This should already work where applicable; for example, the following >>> C code gives an appropriately optimized result on x86-32: >>> >>> int a(int x, int y) { return ((long long)x)*y >> 32; } >>> >> >> Right. I'll take a closer look at that, but I don't think it is what I mean. >> >> I'm specifically talking about x86_64 and returning the high 64 bits >> of a 64x64 bit multiply. I don't think this is done, or done >> efficiently in LLVM currently. Or I might be missing something. > > Should work the same as the previous example; try the following on x86-64: > long long a(long long x, long long y) { return ((__int128_t)x)*y >> 64; }Ah, I looked for this last night for some time. I tried uint128_t, u_int128_t, uint128, int128, everything you can imagine... except of course .... that. Should've read the manual. Damn, it does do this in one mul for unsigned multiplication! I feel embarrassed now. The stupid thing is I actually know this is what happens in gcc 4.4.x when you do something like the above in C. I should've thought of this. That totally solves my "problem".> >> A basic operation in bignum stuff is addmul_1, i.e. multiply a >> multiple precision integer with a single "limb" (machine word) and add >> the resulting multiple precision result to another bignum. This is not >> best achieved by doing lots of wide multiplies, dealing with carries, >> then adding. On AMD x86_64 we get 2.5 cycles per limb in direct >> assembly for addmul_1 (actually achieving the maximum retirement rate >> for macro ops on AMD), but more like 10 cycles per limb with C. Even >> the assembly version of addmul_1 is not efficient enough for a >> multiplication basecase though, where you essentially need the whole >> basecase multiplication done in assembly! >> >> It is this assembly basecase which I would like to be able to code >> directly in LLVM assembly. > > Hmm... might be interesting, but you'll likely get much better > practical results by just... using GMP? Right. Good plan. Been there, done that. Going to work on a project which for the time being is called bsdnt (virtually vaporware at this stage). Definitely don't want to just replicate GMP with a BSD license. We're interested in parallel code and lots of other goodies. Implementing on top of LLVM is (probably) perfect for this.> >>>> c) I want to ask something about retrieving the carry or borrow from >>>> an addition or subtraction and using it in an ADC or SBB..... not sure >>>> what to ask..... >>> >>> LLVM automatically expands wide addition/subtraction with ADC/SBB. >>> There's no other general way to write something which optimizes to an >>> ADC/SBB, at least at the moment. What do you want here? >> >> I should take a look at what happens for a multiprecision >> addition/subtraction before I answer that one. I merely assumed that >> this also only worked up to 128 bits on x86_64 as per multiplication. > > It should work at any width (although once you get to extremely wide > integers, the number of spills makes the code get messy). > > Also, this isn't something we do at the moment, but it's would be > possible to teach the backend to transform code using the i1 output of > llvm.uadd.with.overflow into an ADC, etc. > >>> >>>> d) Are you guys at all interested in supporting languages that provide >>>> multiprecision arithmetic? I have a vague notion that I'd like to sign >>>> on to help with that (and may be able to bring some other experienced >>>> devels with me if there was some interest). >>> >>> By that, you mean languages that provide arbitrary-width integers? >> >> Right. E.g. Haskell, python, ECL, many others. >> >>> You can already support that by lowering such operations to calls to a >>> multi-precision library instead of IR operations in your frontend. >> >> Indeed that is what they do.... currently. :-) >> >> I work on the bignum libraries though, not language front ends. So, I >> want to know how to implement a bignum library using LLVM. At the >> moment, I can't get close enough to the machine to do anything >> efficient. These libraries usually have large amounts of assembly, not >> C. But LLVM is so capable, I think we've been doing this wrong all >> along. Instead of writing assembly for each machine, we should be >> using LLVM assembly code and writing part of the bignum library on the >> "other side" of this interface. >> >> If you have time, check out the cminusminus language reference. They >> had a bits1 type for flags, like carry from addition, and a mulhi and >> mullo instruction. Now cminusminus never took off and died an untimely >> death (largely because they wrote it in some bizarre ML in my >> opinion), but from a bignum point of view, they got this one minor >> detail right. > > Hmm... LLVM currently uses i1 pretty extensively, and you can express > equivalents of mulhi/mullo as mentioned above.What is i1? Sorry for the really daft question. These short abbreviations are sometimes hard to look up.....> >> There's a few ways this could be done though. I mean, imagine if LLVM >> optimised around bignum calls by say breaking bignum calls up into >> more basic primitives and rearranging and coallescing those, etc. I'm >> not saying it *should* do that. It would be a multi-year effort, and >> probably a large grant application to do it (I'm seriously toying with >> the idea...). But I am interested to know what the existing intentions >> of the LLVM devs are for supporting arbitrary precision stuff, as it >> seemed, at least at first blush, as though something non-trivial may >> have been planned all along. >> >> It interested me greatly when I noticed you had these intn types for >> an arbitrary number of bits, and it almost made me think there was >> some planning had gone into supporting arbitrary precision arithmetic >> as basic instructions in the LLVM. Bizarre, but not stupid, perhaps. > > There really isn't any grand plan here; LLVM switched a while back > from named integer types to a generalized integer type as part of a > large type-system rewrite; the types were left unrestricted just > because there wasn't any particular reason to restrict them. The > support for such types has gradually improved since then. And most > recently, we've started generating such types in the optimizers to > allow transforming complex structures into SSA form.OK. Thanks. That's interesting to know. I'll see if I can now get a better understanding of the evolution of this.> > There have really only been sporadic queries about practical support > for real multi-precision arithmetic, but we'd welcome any > improvements.Sure. I'm mainly thinking of LLVM's stated aim of "offering support for other non-C languages", which is (a paraphrase of) what you have somewhere, listed in the open projects on the LLVM site. It occurs to me that bignum support is one thing that many languages require, but currently have to use GMP for. If your language is BSD licensed though, that is out of the question, hence some pretty poor bignum implementations out there (I mean relatively speaking performance-wise). There are obviously numerous ways we might use LLVM to aid development of "bsdnt". I'll keep exploring those options. It sounds like, for the time being, analysing existing code output and looking for ways to improve it on certain arches is perhaps one way we may be of assistance. By the way, the LLVM website is absolutely spectacular. The attention to detail is amazing. There were some things that I found oddly difficult to find, and I am still mystified why I googled for such a long time for something like LLVM (days of googling, not minutes) and didn't hit upon it. I wonder if I simply overlooked it because I thought it was "just" a JIT or virtual machine. Heh, ever thought of putting lots of terms like, "faster than C", "high level assembly language", "lower level than C", "modern compiler back end", "better than gcc's RTL", "similar to RTL", etc. on the site, to make things more visible to a certain class of google searches. (Better than gcc, faster than gcc, BSD licensed C compiler, beats gcc in the programming language shootout - heh, naughty me). :-) Like the /. article today. Congrats!! Bill.> > -Eli > >> The reason this is not stupid is that parallel architectures, >> especially GPU like ones could well be set up to offer bignum >> capabilities. There it really may make sense to offer them as a low >> level primitive. But if there, then why not elsewhere. >> >> Sorry for the rambling. I just want to convey that I don't have some >> specific set plan in mind. I'm really just interested in understanding >> the original thought that went into these intn types, and why they are >> there. I think that in order to really be of any help to LLVM, I need >> to understand what the original design goals were in this regard. >> >>> It >>> would be an interesting experiment to write a pass which converts IR >>> operations which are too wide to calls to a multi-precision library, >>> though. >> >> Yes, very interesting indeed!! >> >> Bill. >
On 12 June 2010 00:51, Eli Friedman <eli.friedman at gmail.com> wrote:> On Fri, Jun 11, 2010 at 3:28 PM, Bill Hart <goodwillhart at googlemail.com> wrote: >> Hi Eli, >> >> On 11 June 2010 22:44, Eli Friedman <eli.friedman at gmail.com> wrote: >>> On Fri, Jun 11, 2010 at 10:37 AM, Bill Hart <goodwillhart at googlemail.com> wrote: >>>> a) What plans are there to support addition, subtraction, >>>> multiplication, division, shifting, logical not and other logic ops >>>> for >= 64/128 bits (machine word size, whatever)? The documentation >>>> mentions integers of millions of bits.... >>> >>> Pretty much everything besides muliplication and division should work >>> at any width; the result is generally not especially efficient, >>> though. >> >> OK, I only tried multiplication last night when I first had a decent >> fiddle (been marking exams, so haven't had much time to really get >> dirty hands, sorry). >> >> Of course efficiency is imperative for bignum stuff as it underpins >> absolutely behemoth projects like Sage (sagemath.org). >> Every cycle counts, so to speak. One cycle per limb difference in >> multiplication makes a nearly 40% difference in absolutely everything >> for literally many thousands of developers! > > Right; improvements are definitely welcome here; you might want to > take a look at what we currently generate here, though, to see what > improvements are most necessary. >I think I have an example of why it is somehow important to be able to retrieve the carry and do an add with carry. Consider this short C program: #include <stdlib.h> #include <stdio.h> #define BITS 64 /**************************************** Types ****************************************/ typedef unsigned long ul; typedef __uint128_t ull; typedef ulong mp_size; typedef const ulong * mp_src; typedef ulong * mp_dst; typedef ulong * mp_ptr; /**************************************** Random routines ****************************************/ ull __randval = (ull) 13993185049168412078UL; const ull __randprime = (ull) 9223372036854775814UL * 2 + 1; const ull __randmult = 18148508189596611927UL; ul ul_randlimb(void) { __randval = (__randval * __randmult) % __randprime; return (ul) __randval; } /**************************************** Unsigned multiple precision routines ****************************************/ mp_ptr mp_init(mp_size n) { return malloc(n*sizeof(ul)); } static inline ul mp_add_nc(mp_dst r, mp_src a, mp_src b, mp_size n) { long i; ul cy; const __uint128_t v = (__uint128_t) a[0] + b[0]; r[0] = (ul) v; cy = v >> 64; for (i = 1; i < n; i++) { __uint128_t u = (__uint128_t) a[i] + b[i] + cy; r[i] = (ul) u; cy = u >> BITS; } return cy; } void mp_rand_n(mp_dst r, mp_size n) { mp_size i; for (i = 0; i < n; i++) r[i] = ul_randlimb(); } int main(void) { mp_ptr a, b, c; ul i; a = mp_init(1000); b = mp_init(1000); c = mp_init(1000); mp_rand_n(a, 1000); mp_rand_n(b, 1000); for (i = 0; i < 2400000; i++) mp_add_nc(c, a, b, 1000); return 0; } Ignore all of it except the mp_add_nc function. Now this runs at 4 cycles per int64 addition on AMD K10. If I fiddle a bit and loop unroll, I get 2.5 cycles. But optimal is actually 1.6 cycles. The part of the loop in question becomes: %tmp.i = add i64 %indvar.i, 1 ; <i64> [#uses=2] %22 = load i64* %scevgep.i, align 8 ; <i64> [#uses=1] %23 = zext i64 %22 to i128 ; <i128> [#uses=1] %24 = load i64* %scevgep3.i, align 8 ; <i64> [#uses=1] %25 = zext i64 %24 to i128 ; <i128> [#uses=1] %26 = zext i64 %cy.02.i to i128 ; <i128> [#uses=1] %27 = add i128 %23, %26 ; <i128> [#uses=1] %28 = add i128 %27, %25 ; <i128> [#uses=2] %29 = trunc i128 %28 to i64 ; <i64> [#uses=1] store i64 %29, i64* %scevgep4.i, align 8 %30 = lshr i128 %28, 64 ; <i128> [#uses=1] %31 = trunc i128 %30 to i64 ; <i64> [#uses=1] %exitcond = icmp eq i64 %tmp.i, 999 ; <i1> [#uses=1] In other words, it just extends everything to an i128 and adds. There's no way to tell it that it can add a[i], b[i] and cy with a single adc. (Well it could if the loop iteration wasn't messing with the carry flag). Indeed, this compiles to: .LBB1_7: # %bb.i # Parent Loop BB1_6 Depth=1 # => This Inner Loop Header: Depth=2 addq (%rbx,%rsi,8), %rdi movl $0, %r8d adcq $0, %r8 addq (%r14,%rsi,8), %rdi adcq $0, %r8 movq %rdi, (%r15,%rsi,8) incq %rsi cmpq $1000, %rsi # imm = 0x3E8 movq %r8, %rdi jne .LBB1_7 So it basically tries to keep track of the carry in %r8 instead of in the carry flag. As hinted, the other optimisation missed here, is that instead of comparing with $1000 it can start with %rsi at $-1000 and increment each iteration and simply do a jnz .LBB1_7 doing away with the cmpq. I've tried tricking it into doing this in every way I can think of, but without success so far. Bill.