thr3ads.net - llvm dev - [LLVMdev] Memset/memcpy: user control of loop-idiom recognizer [Dec 2014]

Robert Lougher

2014-Dec-02 19:23 UTC

[LLVMdev] Memset/memcpy: user control of loop-idiom recognizer

Hi,

In feedback from game studios a common issue is the replacement of
loops with calls to memcpy/memset.  These loops are often
hand-optimised and highly efficient, and the developers strongly want
a way to control the compiler (i.e. leave my loop alone).

The culprit is of course the loop-idiom recognizer.  This replaces any
loop that looks like a memset/memcpy with calls.  This affects loops
with both a variable and a constant trip-count.  The question is, does
this make sense in all cases?  Also, should the compiler provide a way
to turn it off for certain types of loop, or for individual loops?
The standard answer is to use -fno-builtin but this does not provide
fine-grain control (e.g. we may want the loop-idiom to recognise
constant loops but not variable loops).

As an example, it could be argued that replacing constant loops always
makes sense.  Here the compiler knows how big the memset/memcpy is and
can make an accurate decision.  For small values the memcpy/memset
will be expanded inline, while larger values will remain a call, but
due to the size the overhead will be negligible.

On the other hand, the compiler knows very little about variable loops
(the loop could be used primarily for copying 10 bytes or 10 Mbytes,
the compiler doesn't know).  The compiler will replace it with a call,
but as it is variable it will not be expanded inline.  In this case
small values may see significant overhead in comparison to the
original loop.  The game studio examples all fall into this category.
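To make the two cases concrete, here is a minimal sketch in C of the kind of loops being discussed. The function names and the 64-byte size are invented for illustration; whether a particular compiler version actually rewrites either loop depends on the optimisation level and on what it can prove about aliasing, so treat this as an illustration rather than a guaranteed reproduction.

```c
#include <assert.h>
#include <stddef.h>

/* Constant trip count: the compiler can see that exactly 64 bytes are
 * written, so a memset formed from this loop can be sized precisely. */
void clear_fixed(unsigned char *dst) {
    assert(dst != NULL);            /* debug-only sanity check */
    for (size_t i = 0; i < 64; ++i)
        dst[i] = 0;
}

/* Variable trip count: n is only known at run time, so a memcpy formed
 * from this loop would typically remain a library call. `restrict`
 * tells the compiler the buffers do not overlap. */
void copy_variable(unsigned char *restrict dst,
                   const unsigned char *restrict src, size_t n) {
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}
```

Comparing the optimised output of such a file built with and without -fno-builtin is one way to see whether the loops were replaced.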

The loop-idiom recognizer also has no notion of "quality" - it always
assumes that replacing the loop makes sense.  While it might be the
case for a naive byte-copy, some of the examples we've seen have been
carefully tuned.

So, to summarise, we feel that there's sufficient justification to add
some sort of user-control.  However, we do not want to suggest a
solution, but prefer to start a discussion, and obtain opinions.  So
to start, how do people feel about:

- A switch to disable loop-idiom recognizer completely?

- A switch to disable loop-idiom recognizer for loops with variable trip count?

- A switch to disable loop-idiom recognizer for loops with constant
trip count (can't see this being much use)?

- Per-function control of loop-idiom recognizer (which must work with LTO)?

Thanks for any feedback!
Rob.

--
Robert Lougher
SN Systems - Sony Computer Entertainment Group

Reid Kleckner

2014-Dec-02 19:45 UTC

What if we had a pragma or attribute that lowered down to metadata
indicating that the variable length trip count was small?

Then backends could choose to lower short memcpys to an inlined, slightly
widened loop. For example, 'rep movsq' on x86_64.

That seems nice from the compiler perspective, since it preserves the
canonical form and we get the same kind of information from profiling. Then
again, I can imagine most game dev users just want control and don't want
to change their code.
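For readers unsure what an "inlined, slightly widened loop" might look like, the following is a rough hand-written C equivalent. It is only a sketch: `copy_small` is a hypothetical helper, the 8-byte chunk size is an assumption, and a real backend would emit machine instructions directly rather than go through source code like this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy n bytes between non-overlapping buffers,
 * 8 bytes at a time, followed by a byte-wise tail. */
void copy_small(void *restrict dst, const void *restrict src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    assert(d != NULL && s != NULL);  /* debug-only sanity check */

    while (n >= 8) {
        uint64_t w;
        /* Fixed-size memcpy calls are usually compiled to a single
         * load and store, and avoid unaligned-access undefined behaviour. */
        memcpy(&w, s, sizeof w);
        memcpy(d, &w, sizeof w);
        d += 8;
        s += 8;
        n -= 8;
    }
    for (; n > 0; --n)               /* at most 7 trailing bytes */
        *d++ = *s++;
}
```

There is no call overhead for short lengths, which is the case the metadata would be describing.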

Joerg Sonnenberger

2014-Dec-02 19:57 UTC

On Tue, Dec 02, 2014 at 07:23:01PM +0000, Robert Lougher wrote:
> In feedback from game studios a common issue is the replacement of
> loops with calls to memcpy/memset.  These loops are often
> hand-optimised, and highly-efficient and the developers strongly want
> a way to control the compiler (i.e. leave my loop alone).

I doubt that. If anything, it means the lowering of the intrinsic is
bad, not that the transformation should not happen.

Joerg

Robert Lougher

2014-Dec-02 20:08 UTC

On 2 December 2014 at 19:57, Joerg Sonnenberger <joerg at britannica.bec.de> wrote:
> On Tue, Dec 02, 2014 at 07:23:01PM +0000, Robert Lougher wrote:
>> In feedback from game studios a common issue is the replacement of
>> loops with calls to memcpy/memset.  These loops are often
>> hand-optimised, and highly-efficient and the developers strongly want
>> a way to control the compiler (i.e. leave my loop alone).
>
> I doubt that. If anything, it means the lowering of the intrinsic is
> bad, not that the transformation should not happen.
>
> Joerg

Yes, that's why I talked about variable and constant trip-counts.  For
constant loops there generally isn't a problem, as they can be lowered
inline (if small).  Variable loops, however, get expanded into a
library call.

Rob.

Philip Reames

2014-Dec-02 21:01 UTC

On 12/02/2014 11:45 AM, Reid Kleckner wrote:
> What if we had a pragma or attribute that lowered down to metadata
> indicating that the variable length trip count was small?
>
> Then backends could choose to lower short memcpys to an inlined,
> slightly widened loop. For example, 'rep movsq' on x86_64.
>
> That seems nice from the compiler perspective, since it preserves the
> canonical form and we get the same kind of information from profiling.
> Then again, I can imagine most game dev users just want control and
> don't want to change their code.

I like this general idea.  Here's another possibility...

We actually already have such a construct in the form of the expect 
builtins. 
http://llvm.org/docs/BranchWeightMetadata.html#built-in-expect-instructions

One way to structure this would be:
if (__builtin_expect(N < SmallSize, 1)) {
   //small loop here
} else {
   // memcpy here
   // or unreachable if you're really brave
}

I could see us failing to exploit this of course.  :)
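A compilable version of that structure might look like the sketch below. The names `copy_hinted`, `LIKELY` and `SMALL_SIZE` (and the threshold of 16) are assumptions made for illustration; `__builtin_expect` itself is the GCC/Clang builtin referred to above, and the fallback macro simply drops the hint on other compilers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if defined(__GNUC__) || defined(__clang__)
#define LIKELY(x) __builtin_expect(!!(x), 1)
#else
#define LIKELY(x) (x)   /* no hint available: plain condition */
#endif

enum { SMALL_SIZE = 16 };   /* hypothetical "small copy" threshold */

void copy_hinted(unsigned char *restrict dst,
                 const unsigned char *restrict src, size_t n) {
    assert(dst != NULL && src != NULL);  /* debug-only sanity check */
    if (LIKELY(n < SMALL_SIZE)) {
        /* small loop here: expected to be the hot path */
        for (size_t i = 0; i < n; ++i)
            dst[i] = src[i];
    } else {
        /* memcpy here: call overhead is amortised over a larger copy */
        memcpy(dst, src, n);
    }
}
```

Nothing stops the optimiser from still turning the small loop into a call, which is the "failing to exploit this" risk.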


David Chisnall

2014-Dec-02 21:37 UTC

On 2 Dec 2014, at 19:57, Joerg Sonnenberger <joerg at britannica.bec.de> wrote:
> On Tue, Dec 02, 2014 at 07:23:01PM +0000, Robert Lougher wrote:
>> In feedback from game studios a common issue is the replacement of
>> loops with calls to memcpy/memset.  These loops are often
>> hand-optimised, and highly-efficient and the developers strongly want
>> a way to control the compiler (i.e. leave my loop alone).
>
> I doubt that. If anything, it means the lowering of the intrinsic is
> bad, not that the transformation should not happen.

I'd agree.  On x86-64, however, memcpy is difficult.  Some recent profiling
shows that various different approaches using SSE instructions have around a 50%
performance difference between Sandy Bridge, Ivy Bridge and Haswell, with
different versions performing very differently (no idea what the variation is
like between AMD chips).

Lowering memcpy in LLVM is particularly horrible as it's done in three
different places, only one of which has anything that's a bit like a cost
model.

We can often generate a very efficient memcpy loop in the back end if we know
that the data being copied is strongly aligned.  For x86-64 (and our
architecture), if the data is 256-bit aligned and known to be a multiple of 256
bits (or, even better, a multiple of a known multiple of 256 bits) then we can
generate something that is likely to be significantly faster than a call to
memcpy, but we often lose this information by the time we are doing the
lowering.
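As a source-level illustration of the kind of information involved, the sketch below states the alignment and size guarantees explicitly. `copy_aligned_blocks` and `ASSUME_ALIGNED_32` are hypothetical names; `__builtin_assume_aligned` is a GCC/Clang builtin, and whether a given backend turns this into the fast code path described above is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__GNUC__) || defined(__clang__)
#define ASSUME_ALIGNED_32(p) __builtin_assume_aligned((p), 32)
#else
#define ASSUME_ALIGNED_32(p) (p)   /* no hint available */
#endif

/* Copy n bytes where both pointers are 32-byte (256-bit) aligned and
 * n is a multiple of 32. The buffers must not overlap. */
void copy_aligned_blocks(void *restrict dst, const void *restrict src,
                         size_t n) {
    /* Check the promises made to the compiler (debug builds only). */
    assert((uintptr_t)dst % 32 == 0 && (uintptr_t)src % 32 == 0);
    assert(n % 32 == 0);

    unsigned char *d = ASSUME_ALIGNED_32(dst);
    const unsigned char *s = ASSUME_ALIGNED_32(src);

    /* Each fixed-size 32-byte memcpy can be lowered to aligned
     * vector loads and stores when the target supports them. */
    for (size_t i = 0; i < n; i += 32)
        memcpy(d + i, s + i, 32);
}
```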

The interface for target-specific lowering of memcpy is horribly convoluted (and
assumes that memcpy is always in AS 0, even though the intrinsic supports
multiple address spaces, but that's a different issue) and so some cleanup
would make it possible to exploit some of this information a bit better. 
Ideally, I'd see it moved entirely to the back end (or a single flag saying
'expand this in the IR, I don't care about optimising it yet'),
rather than having the back end trying to provide SelectionDAG with some things
that it sometimes uses.

David

Sean Silva

2014-Dec-05 06:49 UTC

On Wed, Dec 3, 2014 at 4:23 AM, Robert Lougher <rob.lougher at gmail.com>
wrote:
> Hi,
>
> In feedback from game studios a common issue is the replacement of
> loops with calls to memcpy/memset.  These loops are often
> hand-optimised, and highly-efficient and the developers strongly want
> a way to control the compiler (i.e. leave my loop alone).

Please provide examples of such "hand-optimised, and highly-efficient"
routines and test cases (and execution conditions) that demonstrate a
performance improvement.

-- Sean Silva

Robert Lougher

2014-Dec-05 16:02 UTC

On 5 December 2014 at 06:49, Sean Silva <chisophugis at gmail.com> wrote:
> On Wed, Dec 3, 2014 at 4:23 AM, Robert Lougher <rob.lougher at gmail.com> wrote:
>> Hi,
>>
>> In feedback from game studios a common issue is the replacement of
>> loops with calls to memcpy/memset.  These loops are often
>> hand-optimised, and highly-efficient and the developers strongly want
>> a way to control the compiler (i.e. leave my loop alone).
>
> Please provide examples of such "hand-optimised, and highly-efficient"
> routines and test cases (and execution conditions) that demonstrate a
> performance improvement.

This sounds like a cop-out, but we can't share customer code (even if
we could get a small runnable example).  But this is all getting
beside the point.  I discussed performance issues to try and justify
why the user should have control.  That was probably a mistake as it
has subverted the conversation.  The blunt fact is that game
developers don't like their loops being replaced and they want user
control.  The real conversation I wanted was what form should this
user control take.  To be honest, I am surprised at the level of
resistance to giving users *any* control over their codegen.
