I've been playing around with LLVM to write a backend for a rather "simple" (co-)processor. Assume that only three arithmetic instructions exist: ADD mod N, SUB mod N and MUL mod N. The modulus N is programmable and stored in a register. No ordinary arithmetic instructions are available. The word size is 256-bit. In other words, the following function, b + c mod N, corresponds to only one instruction on my target machine, given the modulus N set by another instruction to an auxiliary register. # this should translate to one instruction like: r3 = add r1, r2 # n is extended to 257-bit for simplicity define i256 @add(i256 %b, i256 %c, i257 %n) { %1 = zext i256 %b to i257 %2 = zext i256 %c to i257 %3 = add i257 %2, %1 %4 = icmp uge i257 %3, %n %5 = select i1 %4, i257 %n, i257 0 %6 = sub i257 %3, %5 %7 = trunc i257 %6 to i256 ret i256 %7 } The ultimate goal is to use LLVM to optimize and emit machine code for programs full of modular addition, subtraction and multiplication. For example, I might want off-the-shelf LLVM passes to optimize (a+b+c)-(a+c) mod p = b mod p for me. For a large syntax tree, say more than 10,000 nodes, I might even need common subexpression elimination. Of course, we can target the same program to ARM or X86 for performance comparison. The questions is: We have to recognize the patterns for modular arithmetic from optimized IR and translate into corresponding instructions. After investigation, I think LLVM can not do a good job on such a simple but unordinary instruction set. Is it possible that the patterns still recognizable after optimization? On the whole, is this viable using LLVM or is it better to write a compiler for this particular DSL from scratch? Thanks, Shang-Yi Yang -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20131115/d73e5845/attachment.html>
Hi, My personal opinion: Just to be sure I understand what you're considering: you want to write a backend that will produce optimized machine code for a device with modular arithmetic instructions (not simulate such a device on a standard CPU)? In which case, won't the same assumptions that are embodied in the transformations for the case of unsigned 2's complement arithmetic (in your case with 256 bit unsigned words) with wraparound result in correct code? I suspect two issues will come up: 1. The optimizations in LLVM which can reason about "useful for optimization" features of wrapping unsigned values are probably not particularly strong at the moment, since most interest tends to be in the case of optimizations that are valid in the case it's known there won't be a wraparound (eg, monotonically increasing of the result of adding elements). This is going to be just a lack of implementation rather than a fundamental issue. 2. It's not clear from your description whether changing the value in the "N register" is a very infrequent event or not. If it happens so infrequently you can get away with just putting compiler barriers before and after the change that would fit reasonably. On the other hand if it's changed frequently getting performance might require allowing the optimizer to move instructions around significantly enough that it's necessary to keep track of what the N register should be for each instruction. LLVM doesn't really have any built-in notion of this, so you're going to have to make significant changes to the compiler to do that. Whether this means writing your own compiler would be less work I don't know. Hope this helps, On Fri, Nov 15, 2013 at 9:15 AM, Shang-Yi Yang <ilway25 at gmail.com> wrote:> I've been playing around with LLVM to write a backend for a rather > "simple" (co-)processor. Assume that only three arithmetic instructions > exist: ADD mod N, SUB mod N and MUL mod N. The modulus N is programmable > and stored in a register. No ordinary arithmetic instructions are > available. The word size is 256-bit. > > In other words, the following function, b + c mod N, corresponds to only > one instruction on my target machine, given the modulus N set by another > instruction to an auxiliary register. > > # this should translate to one instruction like: r3 = add r1, r2 > # n is extended to 257-bit for simplicity > define i256 @add(i256 %b, i256 %c, i257 %n) { > %1 = zext i256 %b to i257 > %2 = zext i256 %c to i257 > %3 = add i257 %2, %1 > %4 = icmp uge i257 %3, %n > %5 = select i1 %4, i257 %n, i257 0 > %6 = sub i257 %3, %5 > %7 = trunc i257 %6 to i256 > ret i256 %7 > } > > The ultimate goal is to use LLVM to optimize and emit machine code for > programs full of modular addition, subtraction and multiplication. For > example, I might want off-the-shelf LLVM passes to optimize (a+b+c)-(a+c) > mod p = b mod p for me. For a large syntax tree, say more than 10,000 > nodes, I might even need common subexpression elimination. Of course, we > can target the same program to ARM or X86 for performance comparison. > > The questions is: We have to recognize the patterns for modular arithmetic > from optimized IR and translate into corresponding instructions. After > investigation, I think LLVM can not do a good job on such a simple but > unordinary instruction set. Is it possible that the patterns still > recognizable after optimization? On the whole, is this viable using LLVM or > is it better to write a compiler for this particular DSL from scratch? > > Thanks, > Shang-Yi Yang > > > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev > >-- cheers, dave tweed__________________________ high-performance computing and machine vision expert: david.tweed at gmail.com "while having code so boring anyone can maintain it, use Python." -- attempted insult seen on slashdot -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20131115/b0b68edd/attachment.html>
Thanks for your insightful suggestions. Yes, I am programming for a real device that does modular arithmetic (and only modular arithmetic). The modulus N is fixed during a single launch of a program. One way I could also come up with is to simply use add i256 %a, %b to represent a + b mod n, and let LLVM passes to reason about possible optimizations. However these are not semantically identical and it may produce a wrong result, though in a rather low possibility. Here is another issue I have been thinking these days. As far as I know, legalizing a DAG node may expand the node to fit into the ISA. What happens if a node is not representable by any combination of the instructions from the ISA? An extremely opposite situation is that, two nodes must have some particular semantic in the DAG so it can be translate to a single instruction. For example, if MUL a, b of the ISA actually does a * b * 12345. So only %1 = mul i256 %a, %b %2 = mul i256 %1, i256 12345 can be translated into a legal instruction (and they may be scattered around the DAG after some passes), but I think %1 = mul i256 %a, %b might fail to match. Is there any workaround? Shang-Yi 2013/11/16 David Tweed <david.tweed at gmail.com>> Hi, > > My personal opinion: Just to be sure I understand what you're considering: > you want to write a backend that will produce optimized machine code for a > device with modular arithmetic instructions (not simulate such a device on > a standard CPU)? In which case, won't the same assumptions that are > embodied in the transformations for the case of unsigned 2's complement > arithmetic (in your case with 256 bit unsigned words) with wraparound > result in correct code? I suspect two issues will come up: >> 1. The optimizations in LLVM which can reason about "useful for > optimization" features of wrapping unsigned values are probably not > particularly strong at the moment, since most interest tends to be in the > case of optimizations that are valid in the case it's known there won't be > a wraparound (eg, monotonically increasing of the result of adding > elements). This is going to be just a lack of implementation rather than a > fundamental issue. > > 2. It's not clear from your description whether changing the value in the > "N register" is a very infrequent event or not. If it happens so > infrequently you can get away with just putting compiler barriers before > and after the change that would fit reasonably. On the other hand if it's > changed frequently getting performance might require allowing the optimizer > to move instructions around significantly enough that it's necessary to > keep track of what the N register should be for each instruction. LLVM > doesn't really have any built-in notion of this, so you're going to have to > make significant changes to the compiler to do that. Whether this means > writing your own compiler would be less work I don't know. > > Hope this helps, > > > On Fri, Nov 15, 2013 at 9:15 AM, Shang-Yi Yang <ilway25 at gmail.com> wrote: > >> I've been playing around with LLVM to write a backend for a rather >> "simple" (co-)processor. Assume that only three arithmetic instructions >> exist: ADD mod N, SUB mod N and MUL mod N. The modulus N is programmable >> and stored in a register. No ordinary arithmetic instructions are >> available. The word size is 256-bit. >> >> In other words, the following function, b + c mod N, corresponds to only >> one instruction on my target machine, given the modulus N set by another >> instruction to an auxiliary register. >> >> # this should translate to one instruction like: r3 = add r1, r2 >> # n is extended to 257-bit for simplicity >> define i256 @add(i256 %b, i256 %c, i257 %n) { >> %1 = zext i256 %b to i257 >> %2 = zext i256 %c to i257 >> %3 = add i257 %2, %1 >> %4 = icmp uge i257 %3, %n >> %5 = select i1 %4, i257 %n, i257 0 >> %6 = sub i257 %3, %5 >> %7 = trunc i257 %6 to i256 >> ret i256 %7 >> } >> >> The ultimate goal is to use LLVM to optimize and emit machine code for >> programs full of modular addition, subtraction and multiplication. For >> example, I might want off-the-shelf LLVM passes to optimize (a+b+c)-(a+c) >> mod p = b mod p for me. For a large syntax tree, say more than 10,000 >> nodes, I might even need common subexpression elimination. Of course, we >> can target the same program to ARM or X86 for performance comparison. >> >> The questions is: We have to recognize the patterns for modular >> arithmetic from optimized IR and translate into corresponding instructions. >> After investigation, I think LLVM can not do a good job on such a simple >> but unordinary instruction set. Is it possible that the patterns still >> recognizable after optimization? On the whole, is this viable using LLVM or >> is it better to write a compiler for this particular DSL from scratch? >> >> Thanks, >> Shang-Yi Yang >> >> >> _______________________________________________ >> LLVM Developers mailing list >> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu >> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev >> >> > > > -- > cheers, dave tweed__________________________ > high-performance computing and machine vision expert: > david.tweed at gmail.com > "while having code so boring anyone can maintain it, use Python." -- > attempted insult seen on slashdot > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20131118/4740ba7a/attachment.html>