Chandler Carruth
2014-Jan-05 01:29 UTC
[LLVMdev] A question about everyone's favorite constructs: NSW and NUW
So, I know there are a lot of mixed feelings about NSW and NUW, but I'm running into a common pattern where they really help: void f(char *array, int i) { // do some stuff... g(array[i++]); // do some more stuff... g(array[i++]); // more of the same } So, this kind of code comes up pretty frequently. Unrolled loops with an int IV[1], encoding, decoding, compression, etc. What we'd like to do is take the sequence of N increments and turn them into N addressing modes with a constant offset. Then add a small constant at the end and test the flags if we're in a loop, etc. This is *much faster* than clobbering registers to do LEA or inc or add for each one (and yea, we do LEA and use lots of registers currently). The problem is that this is being compiled in 64-bit mode. So the index is 32-bits and gets sign extended prior to the GEP. So we end up with a lot of repetitions of: %a = add nsw i32 %i, N # for some immediate small integer N %b = sext i32 %i to i64 %ptr = gep inbounds i8* %base, i64 %b When we try to match this pattern in the backend to an x86 addressing mode we fail because we can't look through the sext. This is particularly frustrating because we *should* be able to look through the sext due to the NSW bit on the add, but we don't preserve NSW constraints into the DAG. How can we fix this? I see the following options: 1) Canonicalize the other way in the IR. This could take a few different forms: 1a) Widen math (where NSW or NUW allows) which is only used by an *ext feeding a GEP index. 1b) Widen all math which is only used by an *ext that targets a legal integer size. 2) Add the NSW (and NUW) bits to the DAG, and query them in the address mode matching to match across (sext (op nsw ...)) and (zext (op nuw ...)) 3) Add a complete hack to CodeGenPrepare (or similar) which does #1 as a non-canonical step in the IR. I dislike #3 intensely -- we really want to be able to combine on this stuff in a canonical way, so I want the pattern to either always be (gep ... (sext (op nsw ...))) or (gep ... (op nsw (sext ...))). Also, it feels like quite a hack around the lack of fidelity in the backend. I'm perfectly fine with either #1 or #2. Doing #1 would require a lot of benchmarking to make sure the different canonicalization doesn't regress lots of things, and probably a few fixes where it does regress things. These are likely to be easy to find by looking for patterns that match sext. Doing #2 requires a plumbing these flags through the DAG. I don't yet know how hard that would be, but it doesn't sound too bad. The current approach I would probably take is to add an SDNodeProperty for NSW and NUW, and nodes for operations with those flags. But that means generalizing everything that currently only looks at ADD. There is probably a better way to express this in the DAG, so if this is the direction folks thing LLVM should go, point me at any ideas you have about how to best implement it. Currently, I would naively go after #1 just because I know exactly how to do that. But it doesn't feel quite right and I wonder if we should really invest the time in making sure the backend is aware of the NSW and NUW flags. -Chandler [1]: Note that we could fix this inside of loop bodies when its an induction variable by widening the induction variable. But that doesn't fix in the general case. -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140104/81c75a9f/attachment.html>
Hal Finkel
2014-Jan-05 17:17 UTC
[LLVMdev] A question about everyone's favorite constructs: NSW and NUW
----- Original Message -----> From: "Chandler Carruth" <chandlerc at gmail.com> > To: "LLVM Developers Mailing List" <llvmdev at cs.uiuc.edu> > Sent: Saturday, January 4, 2014 7:29:07 PM > Subject: [LLVMdev] A question about everyone's favorite constructs: NSW and NUW > > > > So, I know there are a lot of mixed feelings about NSW and NUW, but > I'm running into a common pattern where they really help: > > > void f(char *array, int i) { > // do some stuff... > g(array[i++]); > > > // do some more stuff... > g(array[i++]); > > > // more of the same > } > > > So, this kind of code comes up pretty frequently. Unrolled loops with > an int IV[1], encoding, decoding, compression, etc. What we'd like > to do is take the sequence of N increments and turn them into N > addressing modes with a constant offset. Then add a small constant > at the end and test the flags if we're in a loop, etc. This is *much > faster* than clobbering registers to do LEA or inc or add for each > one (and yea, we do LEA and use lots of registers currently). > > > The problem is that this is being compiled in 64-bit mode. So the > index is 32-bits and gets sign extended prior to the GEP. So we end > up with a lot of repetitions of: > > > %a = add nsw i32 %i, N # for some immediate small integer N > %b = sext i32 %i to i64 > %ptr = gep inbounds i8* %base, i64 %b > > > When we try to match this pattern in the backend to an x86 addressing > mode we fail because we can't look through the sext. This is > particularly frustrating because we *should* be able to look through > the sext due to the NSW bit on the add, but we don't preserve NSW > constraints into the DAG. > > > How can we fix this? I see the following options: > > > 1) Canonicalize the other way in the IR. This could take a few > different forms: > 1a) Widen math (where NSW or NUW allows) which is only used by an > *ext feeding a GEP index. > 1b) Widen all math which is only used by an *ext that targets a legal > integer size.I'm in favor of having NSW/NUW information available at the SDAG level (and even at the MI level), but for other reasons. For this problem, I vote for solution 1b. My rationale is that this would solve a problem that I currently have in the PowerPC backend for 64-bit targets in a general way: On 64-bit PPC targets, return values are (zero or sign) extended to 64-bits. Here's a simple example that demonstrates the issue: define signext i32 @foo(i1 zeroext %a) #0 { entry: %cond = select i1 %a, i32 7, i32 12 ret i32 %cond } compiles into something like this: ... li r4, 7 li r3, 12 isel r3, r4, r3, 1 rldicl r3, r3, 0, 32 <-- this is a completely unnecessary extension blr Obviously, this affects cases other than when the extension is part of the return value (although the return value case is a pretty obvious code wart). One possible solution to this that I've thought about (although not yet experimented with) is 'promoting' all 32-bit operations to 64-bit ones up front. That, however, might lead to a bunch of unnecessary extensions of its own. I think that the 1b solution would nicely solve this problem without introducing unnecessary extensions. One issue that might come up is that 64-bit operations might be more expensive (either in latency or in throughput) than the corresponding 32-bit ones (I'm thinking specifically about multiplication and division here, but if we're spilling a lot, there could also be stack-size and bandwidth effects), and we might want some target input on that. In short, I would not be so hard on #3 ;) -- this is a legitimate problem that might not have a nice target-cost-independent solution, but I suspect that the benefits of using 1b as the new canonical form probably outweigh the negative side effects. -Hal> > > 2) Add the NSW (and NUW) bits to the DAG, and query them in the > address mode matching to match across (sext (op nsw ...)) and (zext > (op nuw ...)) > > > 3) Add a complete hack to CodeGenPrepare (or similar) which does #1 > as a non-canonical step in the IR. > > > > > I dislike #3 intensely -- we really want to be able to combine on > this stuff in a canonical way, so I want the pattern to either > always be (gep ... (sext (op nsw ...))) or (gep ... (op nsw (sext > ...))). Also, it feels like quite a hack around the lack of fidelity > in the backend. > > > I'm perfectly fine with either #1 or #2. Doing #1 would require a lot > of benchmarking to make sure the different canonicalization doesn't > regress lots of things, and probably a few fixes where it does > regress things. These are likely to be easy to find by looking for > patterns that match sext. > > > Doing #2 requires a plumbing these flags through the DAG. I don't yet > know how hard that would be, but it doesn't sound too bad. The > current approach I would probably take is to add an SDNodeProperty > for NSW and NUW, and nodes for operations with those flags. But that > means generalizing everything that currently only looks at ADD. > There is probably a better way to express this in the DAG, so if > this is the direction folks thing LLVM should go, point me at any > ideas you have about how to best implement it. > > > Currently, I would naively go after #1 just because I know exactly > how to do that. But it doesn't feel quite right and I wonder if we > should really invest the time in making sure the backend is aware of > the NSW and NUW flags. > > > -Chandler > > > [1]: Note that we could fix this inside of loop bodies when its an > induction variable by widening the induction variable. But that > doesn't fix in the general case. > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev >-- Hal Finkel Assistant Computational Scientist Leadership Computing Facility Argonne National Laboratory
Chandler Carruth
2014-Jan-07 01:44 UTC
[LLVMdev] A question about everyone's favorite constructs: NSW and NUW
On Sun, Jan 5, 2014 at 12:17 PM, Hal Finkel <hfinkel at anl.gov> wrote:> ----- Original Message ----- > > From: "Chandler Carruth" <chandlerc at gmail.com> > > To: "LLVM Developers Mailing List" <llvmdev at cs.uiuc.edu> > > Sent: Saturday, January 4, 2014 7:29:07 PM > > Subject: [LLVMdev] A question about everyone's favorite constructs: NSW > and NUW > > > > > > > > So, I know there are a lot of mixed feelings about NSW and NUW, but > > I'm running into a common pattern where they really help: > > > > > > void f(char *array, int i) { > > // do some stuff... > > g(array[i++]); > > > > > > // do some more stuff... > > g(array[i++]); > > > > > > // more of the same > > } > > > > > > So, this kind of code comes up pretty frequently. Unrolled loops with > > an int IV[1], encoding, decoding, compression, etc. What we'd like > > to do is take the sequence of N increments and turn them into N > > addressing modes with a constant offset. Then add a small constant > > at the end and test the flags if we're in a loop, etc. This is *much > > faster* than clobbering registers to do LEA or inc or add for each > > one (and yea, we do LEA and use lots of registers currently). > > > > > > The problem is that this is being compiled in 64-bit mode. So the > > index is 32-bits and gets sign extended prior to the GEP. So we end > > up with a lot of repetitions of: > > > > > > %a = add nsw i32 %i, N # for some immediate small integer N > > %b = sext i32 %i to i64 > > %ptr = gep inbounds i8* %base, i64 %b > > > > > > When we try to match this pattern in the backend to an x86 addressing > > mode we fail because we can't look through the sext. This is > > particularly frustrating because we *should* be able to look through > > the sext due to the NSW bit on the add, but we don't preserve NSW > > constraints into the DAG. > > > > > > How can we fix this? I see the following options: > > > > > > 1) Canonicalize the other way in the IR. This could take a few > > different forms: > > 1a) Widen math (where NSW or NUW allows) which is only used by an > > *ext feeding a GEP index. > > 1b) Widen all math which is only used by an *ext that targets a legal > > integer size. > > I'm in favor of having NSW/NUW information available at the SDAG level > (and even at the MI level), but for other reasons.Cool. Could you suggest how best to implement this (especially in the SDAG)? I don't like my ideas thus far.> For this problem, I vote for solution 1b. My rationale is that this would > solve a problem that I currently have in the PowerPC backend for 64-bit > targets in a general way: > > On 64-bit PPC targets, return values are (zero or sign) extended to > 64-bits. Here's a simple example that demonstrates the issue: > > define signext i32 @foo(i1 zeroext %a) #0 { > entry: > %cond = select i1 %a, i32 7, i32 12 > ret i32 %cond > } > > compiles into something like this: > > ... > li r4, 7 > li r3, 12 > isel r3, r4, r3, 1 > rldicl r3, r3, 0, 32 <-- this is a completely unnecessary extension > blr > > Obviously, this affects cases other than when the extension is part of the > return value (although the return value case is a pretty obvious code wart). > > One possible solution to this that I've thought about (although not yet > experimented with) is 'promoting' all 32-bit operations to 64-bit ones up > front. That, however, might lead to a bunch of unnecessary extensions of > its own. I think that the 1b solution would nicely solve this problem > without introducing unnecessary extensions. >It feels like there are way better solutions to this. Why can't you just fix this by looking for a sext used by ret in a function which is already signext? It seems like you could match this in the SDAG and fix it very directly?> > One issue that might come up is that 64-bit operations might be more > expensive (either in latency or in throughput) than the corresponding > 32-bit ones (I'm thinking specifically about multiplication and division > here, but if we're spilling a lot, there could also be stack-size and > bandwidth effects), and we might want some target input on that. >Oh, its worse than this. I'm no longer liking #1 at all. If we hoist the extension over math which has flags allowing us to do so safely, we actually lose significant information. The wider the type on which we have the NSW flag, the less information we have about the known value range of the inputs. While this may not have a significant impact in many benchmarks, it seems like a very worrisome change to make when picking the canonical form because we essentially say that we will *never* be able to preserve this information even if some day it matters a lot. =/ I'm increasingly interested in getting NSW to reach the SDAG so that we can actually do the extension when we know we have a fast 64-bit operation available in the form of an addressing mode. -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140106/cb153e0e/attachment.html>
Andrew Trick
2014-Jan-07 05:41 UTC
[LLVMdev] A question about everyone's favorite constructs: NSW and NUW
On Jan 4, 2014, at 5:29 PM, Chandler Carruth <chandlerc at gmail.com> wrote:> So, I know there are a lot of mixed feelings about NSW and NUW, but I'm running into a common pattern where they really help: > > void f(char *array, int i) { > // do some stuff... > g(array[i++]); > > // do some more stuff... > g(array[i++]); > > // more of the same > } > > So, this kind of code comes up pretty frequently. Unrolled loops with an int IV[1], encoding, decoding, compression, etc. What we'd like to do is take the sequence of N increments and turn them into N addressing modes with a constant offset. Then add a small constant at the end and test the flags if we're in a loop, etc. This is *much faster* than clobbering registers to do LEA or inc or add for each one (and yea, we do LEA and use lots of registers currently). > > The problem is that this is being compiled in 64-bit mode. So the index is 32-bits and gets sign extended prior to the GEP. So we end up with a lot of repetitions of: > > %a = add nsw i32 %i, N # for some immediate small integer N > %b = sext i32 %i to i64 > %ptr = gep inbounds i8* %base, i64 %b > > When we try to match this pattern in the backend to an x86 addressing mode we fail because we can't look through the sext. This is particularly frustrating because we *should* be able to look through the sext due to the NSW bit on the add, but we don't preserve NSW constraints into the DAG. > > How can we fix this? I see the following options: > > 1) Canonicalize the other way in the IR. This could take a few different forms: > 1a) Widen math (where NSW or NUW allows) which is only used by an *ext feeding a GEP index. > 1b) Widen all math which is only used by an *ext that targets a legal integer size. > > 2) Add the NSW (and NUW) bits to the DAG, and query them in the address mode matching to match across (sext (op nsw ...)) and (zext (op nuw ...)) > > 3) Add a complete hack to CodeGenPrepare (or similar) which does #1 as a non-canonical step in the IR.Here are some of my current prespectives that are *loosely* related to the issue. All FWIW. There's no need to debate them in this thread other than as they apply to the problem at hand. - I agree that promotion to 64-bit loses information and should only be done when target information tells us that it's good for codegen. - I personally have no problem using "indvars" as a codegen prepare pass for any noncanonical target-specific transforms that require SCEV. The only canonicalization that it usually accomplishes that is replacement of loop exit values. The sign/zero extend elimination (widening) and LFTR are really codegen optimizations that should be defered until the last round of loop opts. I don't see it as being any different in this respect than LSR and vectorization. Unfortunately we rerun canonical InstCombine after indvars, which can undo any target-specific goodness. But when that becomes a serious problem we should just fix that too. If we really need to replace loop exit values earlier before other canonical passes, we can split up indvars. - I know we have considered adding NSW to SelectionDag in the past and I'm not opposed to it in principle, but it was difficult to justify adding the complexity. I don't remember all the arguments against it now. There are some obvious issues: (1) it's not clear how to cleanly implement it. (2) I woud be concerned about relying on preservation of such a brittle attribute so late. It seems easy for target specific DAG combines to either drop or misuse the NSW flag. (3) Determining where to widen values may require global analysis. - I know that the problem of determining when values should be promoted to a wider type and which uses to replace doesn't have a general solution that is practical to implement. We can easily end up with more register pressure and extra copies that the coalescer is not smart about removing or optimizing. It seems wise to tackle a narrow subset of the problem (e.g. codegen of specific patterns that match addressing modes). Quentin probably has more insight here.> I dislike #3 intensely -- we really want to be able to combine on this stuff in a canonical way, so I want the pattern to either always be (gep ... (sext (op nsw ...))) or (gep ... (op nsw (sext ...))). Also, it feels like quite a hack around the lack of fidelity in the backend.I understand why this sort of hack is so terrible early in the IR pass pipeline (especially before inlining). But I think it is already the norm for anyone seriously optimizing for a specific target to manipulate the IR (here I do welcome others to confirm or refute this point). Since we can't do anything to change that, let's just declare that those codegen IR transforms are part of the "backend" and be done. Naturally, the more a target customizes codegen preparation, the less DAGCombine can be reused. That said, if NSW can be added to SelectionDAG without too many issues, then it may be a moot point. -Andy> > I'm perfectly fine with either #1 or #2. Doing #1 would require a lot of benchmarking to make sure the different canonicalization doesn't regress lots of things, and probably a few fixes where it does regress things. These are likely to be easy to find by looking for patterns that match sext. > > Doing #2 requires a plumbing these flags through the DAG. I don't yet know how hard that would be, but it doesn't sound too bad. The current approach I would probably take is to add an SDNodeProperty for NSW and NUW, and nodes for operations with those flags. But that means generalizing everything that currently only looks at ADD. There is probably a better way to express this in the DAG, so if this is the direction folks thing LLVM should go, point me at any ideas you have about how to best implement it. > > Currently, I would naively go after #1 just because I know exactly how to do that. But it doesn't feel quite right and I wonder if we should really invest the time in making sure the backend is aware of the NSW and NUW flags. > > -Chandler > > [1]: Note that we could fix this inside of loop bodies when its an induction variable by widening the induction variable. But that doesn't fix in the general case. > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Jim Grosbach
2014-Jan-07 19:16 UTC
[LLVMdev] A question about everyone's favorite constructs: NSW and NUW
On Jan 6, 2014, at 9:41 PM, Andrew Trick <atrick at apple.com> wrote:> > On Jan 4, 2014, at 5:29 PM, Chandler Carruth <chandlerc at gmail.com> wrote: > >> So, I know there are a lot of mixed feelings about NSW and NUW, but I'm running into a common pattern where they really help: >> >> void f(char *array, int i) { >> // do some stuff... >> g(array[i++]); >> >> // do some more stuff... >> g(array[i++]); >> >> // more of the same >> } >> >> So, this kind of code comes up pretty frequently. Unrolled loops with an int IV[1], encoding, decoding, compression, etc. What we'd like to do is take the sequence of N increments and turn them into N addressing modes with a constant offset. Then add a small constant at the end and test the flags if we're in a loop, etc. This is *much faster* than clobbering registers to do LEA or inc or add for each one (and yea, we do LEA and use lots of registers currently). >> >> The problem is that this is being compiled in 64-bit mode. So the index is 32-bits and gets sign extended prior to the GEP. So we end up with a lot of repetitions of: >> >> %a = add nsw i32 %i, N # for some immediate small integer N >> %b = sext i32 %i to i64 >> %ptr = gep inbounds i8* %base, i64 %b >> >> When we try to match this pattern in the backend to an x86 addressing mode we fail because we can't look through the sext. This is particularly frustrating because we *should* be able to look through the sext due to the NSW bit on the add, but we don't preserve NSW constraints into the DAG. >> >> How can we fix this? I see the following options: >> >> 1) Canonicalize the other way in the IR. This could take a few different forms: >> 1a) Widen math (where NSW or NUW allows) which is only used by an *ext feeding a GEP index. >> 1b) Widen all math which is only used by an *ext that targets a legal integer size. >> >> 2) Add the NSW (and NUW) bits to the DAG, and query them in the address mode matching to match across (sext (op nsw ...)) and (zext (op nuw ...)) >> >> 3) Add a complete hack to CodeGenPrepare (or similar) which does #1 as a non-canonical step in the IR. > > Here are some of my current prespectives that are *loosely* related to the issue. All FWIW. There's no need to debate them in this thread other than as they apply to the problem at hand. > > - I agree that promotion to 64-bit loses information and should only be done when target information tells us that it's good for codegen. > > - I personally have no problem using "indvars" as a codegen prepare pass for any noncanonical target-specific transforms that require SCEV. The only canonicalization that it usually accomplishes that is replacement of loop exit values. The sign/zero extend elimination (widening) and LFTR are really codegen optimizations that should be defered until the last round of loop opts. I don't see it as being any different in this respect than LSR and vectorization. Unfortunately we rerun canonical InstCombine after indvars, which can undo any target-specific goodness. But when that becomes a serious problem we should just fix that too. If we really need to replace loop exit values earlier before other canonical passes, we can split up indvars. > > - I know we have considered adding NSW to SelectionDag in the past and I'm not opposed to it in principle, but it was difficult to justify adding the complexity. I don't remember all the arguments against it now. There are some obvious issues: (1) it's not clear how to cleanly implement it. (2) I woud be concerned about relying on preservation of such a brittle attribute so late. It seems easy for target specific DAG combines to either drop or misuse the NSW flag. (3) Determining where to widen values may require global analysis. > > - I know that the problem of determining when values should be promoted to a wider type and which uses to replace doesn't have a general solution that is practical to implement. We can easily end up with more register pressure and extra copies that the coalescer is not smart about removing or optimizing. It seems wise to tackle a narrow subset of the problem (e.g. codegen of specific patterns that match addressing modes). Quentin probably has more insight here. > >> I dislike #3 intensely -- we really want to be able to combine on this stuff in a canonical way, so I want the pattern to either always be (gep ... (sext (op nsw ...))) or (gep ... (op nsw (sext ...))). Also, it feels like quite a hack around the lack of fidelity in the backend. > > I understand why this sort of hack is so terrible early in the IR pass pipeline (especially before inlining). But I think it is already the norm for anyone seriously optimizing for a specific target to manipulate the IR (here I do welcome others to confirm or refute this point). Since we can't do anything to change that, let's just declare that those codegen IR transforms are part of the "backend" and be done. Naturally, the more a target customizes codegen preparation, the less DAGCombine can be reused.Yes. That’s CodgeGenPrepare in a nutshell, really. I don’t think anyone claims CGP is a good thing, merely that with our existing infrastructure, it’s a necessary and effective thing. -Jim> > That said, if NSW can be added to SelectionDAG without too many issues, then it may be a moot point. > > -Andy > >> >> I'm perfectly fine with either #1 or #2. Doing #1 would require a lot of benchmarking to make sure the different canonicalization doesn't regress lots of things, and probably a few fixes where it does regress things. These are likely to be easy to find by looking for patterns that match sext. >> >> Doing #2 requires a plumbing these flags through the DAG. I don't yet know how hard that would be, but it doesn't sound too bad. The current approach I would probably take is to add an SDNodeProperty for NSW and NUW, and nodes for operations with those flags. But that means generalizing everything that currently only looks at ADD. There is probably a better way to express this in the DAG, so if this is the direction folks thing LLVM should go, point me at any ideas you have about how to best implement it. >> >> Currently, I would naively go after #1 just because I know exactly how to do that. But it doesn't feel quite right and I wonder if we should really invest the time in making sure the backend is aware of the NSW and NUW flags. >> >> -Chandler >> >> [1]: Note that we could fix this inside of loop bodies when its an induction variable by widening the induction variable. But that doesn't fix in the general case. >> _______________________________________________ >> LLVM Developers mailing list >> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu >> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev > > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140107/850166b8/attachment.html>