thr3ads.net - llvm dev - [LLVMdev] Memset/memcpy: user control of loop-idiom recognizer [Dec 2014]


Alex Rosenberg

2014-Dec-02 22:18 UTC

[LLVMdev] Memset/memcpy: user control of loop-idiom recognizer

On Dec 3, 2014, at 6:12 AM, Eric Christopher <echristo at gmail.com> wrote:
>
>> On Tue Dec 02 2014 at 12:12:01 PM Robert Lougher <rob.lougher at gmail.com> wrote:
>> On 2 December 2014 at 19:57, Joerg Sonnenberger <joerg at britannica.bec.de> wrote:
>> > On Tue, Dec 02, 2014 at 07:23:01PM +0000, Robert Lougher wrote:
>> >> In feedback from game studios a common issue is the replacement of
>> >> loops with calls to memcpy/memset.  These loops are often
>> >> hand-optimised, and highly-efficient and the developers strongly want
>> >> a way to control the compiler (i.e. leave my loop alone).
>> >
>> > I doubt that. If anything, it means the lowering of the intrinsic is
>> > bad, not that the transformation should not happen.
>> >
>> > Joerg
>>
>> Yes, that's why I talked about variable and constant trip-counts.  For
>> constant loops there generally isn't a problem, as they can be lowered
>> inline (if small).  Variable loops, however, get expanded into a
>> library call.
>
> So the biggest problem is that you don't want a call and would prefer to
> have inline memcpy code everywhere or something else? If the memcpy isn't
> being lowered efficiently I'm curious as to what isn't being lowered well.

Our C library amplifies this problem by being in a dynamic library, so the call
has additional overhead, which for small trip counts swamps the copy/set.
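
For illustration only, here is a minimal sketch of the kind of loop under discussion. The function names are hypothetical and do not come from the thread. Both loops have a run-time trip count, so an optimiser that recognises loop idioms may replace each of them with a single library call (memset and memcpy respectively) rather than expanding them inline.

```c
#include <stddef.h>

/* Hypothetical helper: byte-fill loop with a run-time trip count.
   A loop-idiom recognizer may turn the whole loop into a memset call. */
void clear_bytes(unsigned char *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = 0;
}

/* Hypothetical helper: element-wise copy with a run-time trip count.
   `restrict` promises the buffers do not overlap, which makes the
   loop a candidate for replacement by a memcpy call. */
void copy_words(unsigned *restrict dst, const unsigned *restrict src, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}
```

When n is small at run time, the cost of reaching the library routine can exceed the cost of the few stores the loop would have performed, which is the overhead described above.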

Certainly, the lowering can be better across the many cases as discussed
elsewhere in this thread.

Game developers expect precise control and are surprised by this
canonicalization. They also don't have the compiler's frame of reference
as a basis for understanding issues like this.

Alex
> -eric
>  
>> Rob.
>> 
>> > _______________________________________________
>> > LLVM Developers mailing list
>> > LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>> > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Robert Lougher

2014-Dec-03 23:36 UTC


[LLVMdev] Memset/memcpy: user control of loop-idiom recognizer

On 2 December 2014 at 22:18, Alex Rosenberg <alexr at leftfield.org> wrote:
>
> Our C library amplifies this problem by being in a dynamic library, so the
> call has additional overhead, which for small trip counts swamps the
> copy/set.
>
I can't imagine we're the only platform (now or in the future) that
has comparatively slow library calls.  We had discussed some sort of
platform flag (has slow library calls) but this would be too late to
affect the loop-idiom.  However, it could affect lowering.  Following
on from Reid's earlier idea to lower short memcpys to an inlined,
slightly widened loop, we could expand into a guarded loop for small
values and a call?
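
A source-level sketch of the "guarded loop for small values and a call" shape, written in C purely to show the control flow. The name `guarded_copy` and the cut-off `SMALL_COPY_LIMIT` are assumptions made for this example; the thread does not propose a specific threshold, and the proposal is about what the compiler's lowering would emit, not about code users would write.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical cut-off; a real target would pick it from measurements. */
enum { SMALL_COPY_LIMIT = 32 };

/* Copies n bytes from src to dst. Small sizes take an inline byte loop
   and avoid the library call; larger sizes go to memcpy. */
void *guarded_copy(void *restrict dst, const void *restrict src, size_t n)
{
    if (n <= SMALL_COPY_LIMIT) {
        unsigned char *d = dst;
        const unsigned char *s = src;
        for (size_t i = 0; i < n; ++i)
            d[i] = s[i];
        return dst;
    }
    return memcpy(dst, src, n);
}
```

Written by hand like this, the inner loop is itself a candidate for idiom recognition, so a compiler may turn it back into a call. That is one reason the suggestion is to generate this shape during lowering, after the loop-idiom pass has run.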
> Game developers expect precise control and are surprised by this
> canonicalization. They also don't have the compiler's frame of reference as
> a basis for understanding issues like this.
>
Unfortunately this issue has now been noticed.  Whether or not we can
"get away" with fixing the performance issue without giving them the
control remains to be seen...

Rob.
> Alex
>
> -eric
>
>>
>> Rob.
>>

Robert Lougher

2014-Dec-04 02:21 UTC


[LLVMdev] Memset/memcpy: user control of loop-idiom recognizer

On 2 December 2014 at 22:18, Alex Rosenberg <alexr at leftfield.org> wrote:
> On Dec 3, 2014, at 6:12 AM, Eric Christopher <echristo at gmail.com> wrote:
>
> On Tue Dec 02 2014 at 12:12:01 PM Robert Lougher <rob.lougher at gmail.com>
> wrote:
>>
>> On 2 December 2014 at 19:57, Joerg Sonnenberger <joerg at britannica.bec.de>
>> wrote:
>> > On Tue, Dec 02, 2014 at 07:23:01PM +0000, Robert Lougher wrote:
>> >> In feedback from game studios a common issue is the replacement of
>> >> loops with calls to memcpy/memset.  These loops are often
>> >> hand-optimised, and highly-efficient and the developers strongly want
>> >> a way to control the compiler (i.e. leave my loop alone).
>> >
>> > I doubt that. If anything, it means the lowering of the intrinsic is
>> > bad, not that the transformation should not happen.
>> >
>> > Joerg
>>
>> Yes, that's why I talked about variable and constant trip-counts.  For
>> constant loops there generally isn't a problem, as they can be lowered
>> inline (if small).  Variable loops, however, get expanded into a
>> library call.
>>
>
> So the biggest problem is that you don't want a call and would prefer to
> have inline memcpy code everywhere or something else? If the memcpy isn't
> being lowered efficiently I'm curious as to what isn't being lowered well.
>
>
> Our C library amplifies this problem by being in a dynamic library, so the
> call has additional overhead, which for small trip counts swamps the
> copy/set.
>
> Certainly, the lowering can be better across the many cases as discussed
> elsewhere in this thread.
>
It's also worth mentioning that when the loop-idiom recognizer is
disabled the loop vectorizer steps in, and will vectorize the loop.

Rob.
>
> Alex

David Chisnall

2014-Dec-05 07:46 UTC


[LLVMdev] Memset/memcpy: user control of loop-idiom recognizer

On 3 Dec 2014, at 23:36, Robert Lougher <rob.lougher at gmail.com> wrote:
> On 2 December 2014 at 22:18, Alex Rosenberg <alexr at leftfield.org> wrote:
>>
>> Our C library amplifies this problem by being in a dynamic library, so the
>> call has additional overhead, which for small trip counts swamps the
>> copy/set.
>> 
> 
> I can't imagine we're the only platform (now or in the future) that
> has comparatively slow library calls.  We had discussed some sort of
> platform flag (has slow library calls) but this would be too late to
> affect the loop-idiom.  However, it could affect lowering.  Following
> on from Reid's earlier idea to lower short memcpys to an inlined,
> slightly widened loop, we could expand into a guarded loop for small
> values and a call?
I think the bug is not that we are recognising that the loop is memcpy, it's
that we're then generating an inefficient memcpy.  We do this for a variety
of reasons, some of which apply elsewhere.  One issue I hit a few months ago was
that the vectoriser doesn't notice whether unaligned loads and stores are
supported, so will happily replace two adjacent i32 align 4 loads followed by
two adjacent i32 align 4 stores with an i64 align 4 load followed by an i64
align 4 store, which more than doubles the number of instructions that the back
end emits.
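
To make the alignment point concrete, here is a rough C analogue of merging two adjacent 32-bit accesses into one 64-bit access. The function names are hypothetical. In both versions the pointers are only guaranteed 4-byte alignment; the second version moves the same bytes as a single 8-byte unit, which corresponds to an i64 access with align 4.

```c
#include <stdint.h>
#include <string.h>

/* Two adjacent 32-bit copies; each access needs only 4-byte alignment. */
void copy_pair_narrow(uint32_t *restrict dst, const uint32_t *restrict src)
{
    dst[0] = src[0];
    dst[1] = src[1];
}

/* The same 8 bytes moved as one 64-bit unit. memcpy is used so the
   under-aligned 64-bit access is well defined in C. */
void copy_pair_wide(uint32_t *restrict dst, const uint32_t *restrict src)
{
    uint64_t tmp;
    memcpy(&tmp, src, sizeof tmp);
    memcpy(dst, &tmp, sizeof tmp);
}
```

On a target with cheap unaligned 64-bit accesses the wide form can be a single load and store. On a target without them, the back end has to split or emulate the access, which is how the merged form can end up costing more instructions than the original pair.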

We expand memcpy and friends in several different places (in the IR in at least
one place, then in SelectionDAG, and then again in the back end, as I recall - I
remember playing whack-a-bug with this for a while as the lowering was
differently broken for our target in each place).  In SelectionDAG, we're
dealing with a single basic block, so we can't construct the loop.  In the
back end we've already lost a lot of high-level type information that would
make this easier.

I'd be in favour of consolidating the memcpy / memset / memmove expansion
into an IR pass that would take a cost model from the target.
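
As a sketch of what "take a cost model from the target" could mean, the C fragment below picks an expansion strategy from a small per-target description. Everything here is hypothetical: the struct, its fields, the enum and the function are invented for this illustration and are not LLVM APIs, and the thresholds in the usage are arbitrary.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-target cost model; none of these names are LLVM APIs. */
struct copy_cost_model {
    size_t max_unrolled_bytes;    /* largest constant size worth unrolling */
    size_t max_inline_loop_bytes; /* largest constant size worth an inline loop */
    bool   slow_library_calls;    /* e.g. libc reached via a dynamic library */
};

enum copy_expansion {
    EXPAND_UNROLLED,      /* straight-line loads and stores */
    EXPAND_INLINE_LOOP,   /* inline loop, no call */
    EXPAND_GUARDED_LOOP,  /* run-time size check: small loop, else call */
    EXPAND_LIBCALL        /* plain call to the library routine */
};

/* Chooses how to expand one copy. `size` is only meaningful when
   `size_is_constant` is true. */
enum copy_expansion choose_expansion(const struct copy_cost_model *m,
                                     bool size_is_constant, size_t size)
{
    if (size_is_constant) {
        if (size <= m->max_unrolled_bytes)
            return EXPAND_UNROLLED;
        if (size <= m->max_inline_loop_bytes)
            return EXPAND_INLINE_LOOP;
        return EXPAND_LIBCALL;
    }
    /* Run-time size: guard the call only where calls are expensive. */
    return m->slow_library_calls ? EXPAND_GUARDED_LOOP : EXPAND_LIBCALL;
}
```

The point of the sketch is only that a single decision site, fed by target data, could replace the separate expansions in the IR, SelectionDAG and the back end described above.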

David
