Robert Lougher
2014-Dec-02 19:23 UTC
[LLVMdev] Memset/memcpy: user control of loop-idiom recognizer
Hi,

In feedback from game studios a common issue is the replacement of loops with calls to memcpy/memset. These loops are often hand-optimised, and highly-efficient and the developers strongly want a way to control the compiler (i.e. leave my loop alone).

The culprit is of course the loop-idiom recognizer. This replaces any loop that looks like a memset/memcpy with calls. This affects loops with both a variable and a constant trip-count. The question is, does this make sense in all cases? Also, should the compiler provide a way to turn it off for certain types of loop, or on a loop individually? The standard answer is to use -fno-builtin but this does not provide fine-grain control (e.g. we may want the loop-idiom to recognise constant loops but not variable loops).

As an example, it could be argued that replacing constant loops always makes sense. Here the compiler knows how big the memset/memcpy is and can make an accurate decision. For small values the memcpy/memset will be expanded inline, while larger values will remain a call, but due to the size the overhead will be negligible.

On the other hand, the compiler knows very little about variable loops (the loop could be used primarily for copying 10 bytes or 10 Mbytes, the compiler doesn't know). The compiler will replace it with a call, but as it is variable it will not be expanded inline. In this case small values may see significant overhead in comparison to the original loop. The game studio examples all fall into this category.

The loop-idiom recognizer also has no notion of "quality" - it always assumes that replacing the loop makes sense. While it might be the case for a naive byte-copy, some of the examples we've seen have been carefully tuned.

So, to summarise, we feel that there's sufficient justification to add some sort of user-control. However, we do not want to suggest a solution, but prefer to start a discussion, and obtain opinions. So to start, how do people feel about:

- A switch to disable loop-idiom recognizer completely?

- A switch to disable loop-idiom recognizer for loops with variable trip count?

- A switch to disable loop-idiom recognizer for loops with constant trip count (can't see this being much use)?

- Per-function control of loop-idiom recognizer (which must work with LTO)?

Thanks for any feedback!
Rob.

--
Robert Lougher
SN Systems - Sony Computer Entertainment Group
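To make the two cases concrete, here is a minimal sketch of the loop shapes under discussion. The function names and the 16-byte size are invented for illustration and are not taken from any studio code. The first loop has a variable trip count and is the kind of loop the recognizer may turn into a memcpy call; the second has a constant trip count and may become a memset that is small enough to expand inline. Whether either replacement actually happens depends on the compiler version, target and flags, so inspect the generated assembly rather than taking this as given.

```c
#include <assert.h>
#include <stddef.h>

/* Variable trip count: n is only known at run time, so if the loop is
 * replaced, the result is an out-of-line memcpy call. */
void copy_bytes(unsigned char *restrict dst,
                const unsigned char *restrict src, size_t n) {
    assert(n == 0 || (dst != NULL && src != NULL));
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}

/* Constant trip count: the size is visible at compile time, so a
 * replacement memset can be expanded inline when it is small. */
enum { FIXED_SIZE = 16 }; /* illustrative size */

void clear_fixed(unsigned char *buf) {
    assert(buf != NULL);
    for (size_t i = 0; i < FIXED_SIZE; ++i)
        buf[i] = 0;
}
```

Building the whole file with -fno-builtin, the standard answer mentioned above, affects both functions at once, which is the lack of granularity being described.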
Reid Kleckner
2014-Dec-02 19:45 UTC
[LLVMdev] Memset/memcpy: user control of loop-idiom recognizer
What if we had a pragma or attribute that lowered down to metadata indicating that the variable length trip count was small?

Then backends could choose to lower short memcpys to an inlined, slightly widened loop. For example, 'rep movsq' on x86_64.

That seems nice from the compiler perspective, since it preserves the canonical form and we get the same kind of information from profiling. Then again, I can imagine most game dev users just want control and don't want to change their code.
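As a rough source-level picture of what an "inlined, slightly widened loop" could look like, the sketch below moves eight bytes per iteration and finishes with a byte tail. The pragma or attribute itself is hypothetical and is not shown; `copy_small` is an assumed name, and a real backend would make this choice during lowering (for instance by emitting 'rep movsq') rather than in C. An optimizer is also free to recognise this loop as a memcpy in turn.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy n bytes, 8 at a time, then a byte tail. */
void copy_small(unsigned char *restrict dst,
                const unsigned char *restrict src, size_t n) {
    assert(n == 0 || (dst != NULL && src != NULL));
    size_t i = 0;
    /* Fixed-size 8-byte memcpy calls are normally inlined as a single
     * load and store, and avoid unaligned-access undefined behaviour. */
    for (; i + 8 <= n; i += 8) {
        uint64_t word;
        memcpy(&word, src + i, sizeof word);
        memcpy(dst + i, &word, sizeof word);
    }
    /* Copy the remaining 0..7 bytes one at a time. */
    for (; i < n; ++i)
        dst[i] = src[i];
}
```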
Joerg Sonnenberger
2014-Dec-02 19:57 UTC
[LLVMdev] Memset/memcpy: user control of loop-idiom recognizer
On Tue, Dec 02, 2014 at 07:23:01PM +0000, Robert Lougher wrote:
> In feedback from game studios a common issue is the replacement of
> loops with calls to memcpy/memset. These loops are often
> hand-optimised, and highly-efficient and the developers strongly want
> a way to control the compiler (i.e. leave my loop alone).

I doubt that. If anything, it means the lowering of the intrinsic is bad, not that the transformation should not happen.

Joerg
Robert Lougher
2014-Dec-02 20:08 UTC
[LLVMdev] Memset/memcpy: user control of loop-idiom recognizer
On 2 December 2014 at 19:57, Joerg Sonnenberger <joerg at britannica.bec.de> wrote:
> On Tue, Dec 02, 2014 at 07:23:01PM +0000, Robert Lougher wrote:
>> In feedback from game studios a common issue is the replacement of
>> loops with calls to memcpy/memset. These loops are often
>> hand-optimised, and highly-efficient and the developers strongly want
>> a way to control the compiler (i.e. leave my loop alone).
>
> I doubt that. If anything, it means the lowering of the intrinsic is
> bad, not that the transformation should not happen.
>
> Joerg

Yes, that's why I talked about variable and constant trip-counts. For constant loops there generally isn't a problem, as they can be lowered inline (if small). Variable loops, however, get expanded into a library call.

Rob.
Philip Reames
2014-Dec-02 21:01 UTC
[LLVMdev] Memset/memcpy: user control of loop-idiom recognizer
On 12/02/2014 11:45 AM, Reid Kleckner wrote:
> What if we had a pragma or attribute that lowered down to metadata
> indicating that the variable length trip count was small?
>
> Then backends could choose to lower short memcpys to an inlined,
> slightly widened loop. For example, 'rep movsq' on x86_64.
>
> That seems nice from the compiler perspective, since it preserves the
> canonical form and we get the same kind of information from profiling.
> Then again, I can imagine most game dev users just want control and
> don't want to change their code.

I like this general idea. Here's another possibility...

We actually already have such a construct in the form of the expect builtins.
http://llvm.org/docs/BranchWeightMetadata.html#built-in-expect-instructions

One way to structure this would be:

    if (__builtin_expect(N < SmallSize, 1)) {
      // small loop here
    } else {
      // memcpy here
      // or unreachable if you're really brave
    }

I could see us failing to exploit this of course. :)
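A compilable version of that structure might look like the sketch below. It assumes a GCC- or Clang-compatible compiler for `__builtin_expect`; the name `copy_hinted` and the threshold `SMALL_SIZE = 32` are placeholders chosen for the example, not values anyone in the thread proposed. Nothing here stops the optimizer from rewriting the small loop as well, which is the "failing to exploit this" risk already noted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { SMALL_SIZE = 32 }; /* hypothetical cut-off, tune per workload */

void copy_hinted(unsigned char *restrict dst,
                 const unsigned char *restrict src, size_t n) {
    assert(n == 0 || (dst != NULL && src != NULL));
    if (__builtin_expect(n < SMALL_SIZE, 1)) {
        /* Expected path: short copy kept as an explicit loop. */
        for (size_t i = 0; i < n; ++i)
            dst[i] = src[i];
    } else {
        /* Unexpected path: large copy, call overhead is negligible. */
        memcpy(dst, src, n);
    }
}
```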
David Chisnall
2014-Dec-02 21:37 UTC
[LLVMdev] Memset/memcpy: user control of loop-idiom recognizer
On 2 Dec 2014, at 19:57, Joerg Sonnenberger <joerg at britannica.bec.de> wrote:
> On Tue, Dec 02, 2014 at 07:23:01PM +0000, Robert Lougher wrote:
>> In feedback from game studios a common issue is the replacement of
>> loops with calls to memcpy/memset. These loops are often
>> hand-optimised, and highly-efficient and the developers strongly want
>> a way to control the compiler (i.e. leave my loop alone).
>
> I doubt that. If anything, it means the lowering of the intrinsic is
> bad, not that the transformation should not happen.

I'd agree. On x86-64, however, memcpy is difficult. Some recent profiling shows that various different approaches using SSE instructions have around a 50% performance difference between Sandy Bridge, Ivy Bridge and Haswell, with different versions performing very differently (no idea what the variation is like between AMD chips).

Lowering memcpy in LLVM is particularly horrible as it's done in three different places, only one of which has anything that's a bit like a cost model. We can often generate a very efficient memcpy loop in the back end if we know that the data being copied is strongly aligned. For x86-64 (and our architecture), if the data is 256-bit aligned and known to be a multiple of 256 bits (or, even better, a multiple of a known multiple of 256 bits) then we can generate something that is likely to be significantly faster than a call to memcpy, but we often lose this information by the time we are doing the lowering.

The interface for target-specific lowering of memcpy is horribly convoluted (and assumes that memcpy is always in AS 0, even though the intrinsic supports multiple address spaces, but that's a different issue) and so some cleanup would make it possible to exploit some of this information a bit better. Ideally, I'd see it moved entirely to the back end (or a single flag saying 'expand this in the IR, I don't care about optimising it yet'), rather than having the back end trying to provide SelectionDAG with some things that it sometimes uses.

David
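One way a programmer can state the alignment and size facts described here at the source level is sketched below, assuming a GCC- or Clang-compatible compiler for `__builtin_assume_aligned`. The function name `copy_aligned_blocks` is made up for the example, and whether the information survives to the point where memcpy is lowered is exactly the open question raised above, so this is an illustration of the input, not a guarantee about the output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { BLOCK_BYTES = 32 }; /* 256 bits */

/* Copy n bytes where both pointers are 256-bit aligned and n is a
 * multiple of 256 bits. The asserts check the stated preconditions. */
void copy_aligned_blocks(void *restrict dst, const void *restrict src,
                         size_t n) {
    assert(n % BLOCK_BYTES == 0);
    assert((uintptr_t)dst % BLOCK_BYTES == 0);
    assert((uintptr_t)src % BLOCK_BYTES == 0);
    /* Tell the compiler about the alignment it cannot otherwise see. */
    unsigned char *d = __builtin_assume_aligned(dst, BLOCK_BYTES);
    const unsigned char *s = __builtin_assume_aligned(src, BLOCK_BYTES);
    /* One fixed-size 32-byte copy per iteration. */
    for (size_t i = 0; i < n; i += BLOCK_BYTES)
        memcpy(d + i, s + i, BLOCK_BYTES);
}
```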
Sean Silva
2014-Dec-05 06:49 UTC
[LLVMdev] Memset/memcpy: user control of loop-idiom recognizer
On Wed, Dec 3, 2014 at 4:23 AM, Robert Lougher <rob.lougher at gmail.com> wrote:
> Hi,
>
> In feedback from game studios a common issue is the replacement of
> loops with calls to memcpy/memset. These loops are often
> hand-optimised, and highly-efficient and the developers strongly want
> a way to control the compiler (i.e. leave my loop alone).

Please provide examples of such "hand-optimised, and highly-efficient" routines and test cases (and execution conditions) that demonstrate a performance improvement.

-- Sean Silva
Robert Lougher
2014-Dec-05 16:02 UTC
[LLVMdev] Memset/memcpy: user control of loop-idiom recognizer
On 5 December 2014 at 06:49, Sean Silva <chisophugis at gmail.com> wrote:
> On Wed, Dec 3, 2014 at 4:23 AM, Robert Lougher <rob.lougher at gmail.com>
> wrote:
>>
>> Hi,
>>
>> In feedback from game studios a common issue is the replacement of
>> loops with calls to memcpy/memset. These loops are often
>> hand-optimised, and highly-efficient and the developers strongly want
>> a way to control the compiler (i.e. leave my loop alone).
>
> Please provide examples of such "hand-optimised, and highly-efficient"
> routines and test cases (and execution conditions) that demonstrate a
> performance improvement.

This sounds like a cop-out, but we can't share customer code (even if we could get a small runnable example). But this is all getting beside the point. I discussed performance issues to try and justify why the user should have control. That was probably a mistake as it has subverted the conversation.

The blunt fact is that game developers don't like their loops being replaced and they want user control. The real conversation I wanted was what form should this user control take. To be honest, I am surprised at the level of resistance to giving users *any* control over their codegen.