thr3ads.net - llvm dev - [LLVMdev] Memset/memcpy: user control of loop-idiom recognizer [Dec 2014]

David Chisnall

2014-Dec-05 07:46 UTC

[LLVMdev] Memset/memcpy: user control of loop-idiom recognizer

On 3 Dec 2014, at 23:36, Robert Lougher <rob.lougher at gmail.com> wrote:
> On 2 December 2014 at 22:18, Alex Rosenberg <alexr at leftfield.org> wrote:
>>
>> Our C library amplifies this problem by being in a dynamic library, so the
>> call has additional overhead, which for small trip counts swamps the
>> copy/set.
>>
>
> I can't imagine we're the only platform (now or in the future) that
> has comparatively slow library calls.  We had discussed some sort of
> platform flag (has slow library calls) but this would be too late to
> affect the loop-idiom.  However, it could affect lowering.  Following
> on from Reid's earlier idea to lower short memcpys to an inlined,
> slightly widened loop, we could expand into a guarded loop for small
> values and a call?

I think the bug is not that we are recognising that the loop is memcpy; it's
that we're then generating an inefficient memcpy.  We do this for a variety
of reasons, some of which apply elsewhere.  One issue I hit a few months ago was
that the vectoriser doesn't check whether unaligned loads and stores are
supported, so it will happily replace two adjacent i32 align 4 loads followed by
two adjacent i32 align 4 stores with an i64 align 4 load followed by an i64
align 4 store, which more than doubles the number of instructions that the back
end emits.
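
The legality check that seems to be missing can be sketched in a few lines of C. This is only an illustration of the decision, not LLVM code: `struct target_info`, its `fast_unaligned_access` field, and `should_widen` are hypothetical names invented for the sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical target description; not an LLVM API. */
struct target_info {
    bool fast_unaligned_access; /* can the target do cheap unaligned accesses? */
};

/*
 * Decide whether two adjacent accesses of `width` bytes, whose address is
 * known to be aligned to `align` bytes, should be merged into one access
 * of 2 * width bytes.
 */
static bool should_widen(const struct target_info *t, size_t width, size_t align)
{
    size_t wide = 2 * width;

    if (align >= wide)
        return true;                 /* the wide access is naturally aligned */
    return t->fast_unaligned_access; /* otherwise only if the target copes */
}
```

For the case above (two i32 accesses with align 4), `should_widen(&t, 4, 4)` is false on a target without fast unaligned accesses, so the pair would be left alone.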

We expand memcpy and friends in several different places (in the IR in at least
one place, then in SelectionDAG, and then again in the back end, as I recall - I
remember playing whack-a-bug with this for a while as the lowering was
differently broken for our target in each place).  In SelectionDAG, we're
dealing with a single basic block, so we can't construct the loop.  In the
back end we've already lost a lot of high-level type information that would
make this easier.

I'd be in favour of consolidating the memcpy / memset / memmove expansion
into an IR pass that would take a cost model from the target.
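
As a rough picture of what such an expansion could produce, here is the guarded form Robert describes, written as C source. It is a sketch under assumptions: `SMALL_COPY_THRESHOLD` stands in for a value the target's cost model would supply, and `copy_guarded` is a made-up name.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical threshold; in the proposal this would come from the target. */
#define SMALL_COPY_THRESHOLD 16

/* Copy n bytes from src to dst. The regions must not overlap. */
static void *copy_guarded(void *dst, const void *src, size_t n)
{
    if (n <= SMALL_COPY_THRESHOLD) {
        /* Short trip count: an inline loop avoids the call overhead. */
        unsigned char *d = dst;
        const unsigned char *s = src;
        for (size_t i = 0; i < n; ++i)
            d[i] = s[i];
        return dst;
    }
    /* Long trip count: the library routine amortises its call overhead. */
    return memcpy(dst, src, n);
}
```

Written as source like this, an optimising compiler may well recognise the inner loop as a memcpy again, which is the reason the expansion would have to happen in a lowering pass that runs after idiom recognition.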

David

Philip Reames

2014-Dec-05 18:08 UTC

On 12/04/2014 11:46 PM, David Chisnall wrote:
> I'd be in favour of consolidating the memcpy / memset / memmove
> expansion into an IR pass that would take a cost model from the target.

+1

It sounds like we might also be losing information about alignment in
the loop-idiom recognizer.  Or at least not using it when we lower.

_______________________________________________
LLVM Developers mailing list
LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Smith, Kevin B

2014-Dec-05 19:06 UTC

There are a large number of ways to lose information in translating loops into
memset/memcpy calls; alignment is one of them.
As previously mentioned, loop trip count is another.  Another is the size of
accesses.  For example, the loop may originally have been using
int64_t-sized copies.  This has a definite impact on what the best memset/memcpy
expansion is, because the loop effectively knows that
it always writes a multiple of 8 bytes, and does so in 8-byte chunks.  So
the fact that the number of bytes has some specific value property (such as the
lower 3 bits always being 0, another reason for having known bits and known bit
values :-)) should affect the lowering of such loops/calls, but probably doesn't.
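
To make the size-of-access point concrete, here is a minimal C sketch (the function name `copy_words` is invented for illustration). The loop moves whole 64-bit elements, so the byte count an equivalent memcpy call would receive always has its low 3 bits clear, a fact that is no longer visible once only a byte count is passed along.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Copy `count` 64-bit elements and return the number of bytes written,
 * i.e. the length an equivalent memcpy call would be given.
 */
static size_t copy_words(uint64_t *dst, const uint64_t *src, size_t count)
{
    for (size_t i = 0; i < count; ++i)
        dst[i] = src[i];

    size_t nbytes = count * sizeof(uint64_t);
    /* Known-bits fact carried by the loop: the low 3 bits are zero. */
    assert((nbytes & 7) == 0);
    return nbytes;
}
```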

Database folks often write their own copy routines for use in specific
instances, as do OSes, for example when they know they are clearing or copying
exactly one page on page-aligned boundaries.  They have very special
implementations of these, including some that use non-temporal hints so as not
to pollute the cache.

It is also worth pointing out that most loops have a very specific behavior in
the case of overlaps that is well-defined, and that memcpy does not.
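
A small C illustration of that difference (the helper name `forward_copy` is made up here): a forward element-by-element loop has a defined result even when the destination overlaps the source, and that result is not what memmove produces. Calling memcpy on overlapping regions is undefined behaviour.

```c
#include <stddef.h>

/* Forward byte-by-byte copy; well defined even if dst overlaps src. */
static void forward_copy(char *dst, const char *src, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}
```

Starting from a buffer `buf` holding "abcdef", `forward_copy(buf + 1, buf, 5)` leaves "aaaaaa", because each store feeds the next load. `memmove(buf + 1, buf, 5)` on the same input leaves "aabcde". A compiler may therefore only replace such a loop with memcpy when it can prove the two regions do not overlap.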

There are definitely good reasons why various knowledgeable users would not want
a compiler to perform such a transform on at least some of their loops.

Kevin Smith 

ehostunreach

2014-Dec-05 19:53 UTC

On Fri, Dec 5, 2014 at 6:08 PM, Philip Reames <listmail at philipreames.com> wrote:
> It sounds like we might also be losing information about alignment in the
> loop-idiom recognizer.  Or at least not using it when we lower.

The LoadCombine pass suffers from the same problem too. It produces unaligned
loads, and that's why it's not enabled by default.

I'm currently working on that problem (and trying to combine stores too), and I
suppose that once it's solved we will be able to draw important conclusions
about how to make the rest of the places in LLVM mentioned in this thread
_alignment-aware_.

I don't know a lot about the vectorizer yet, but I don't think that discovering
the alignment of adjacent loads/stores should be one of its main tasks. By
enabling a properly working LoadStoreCombine pass we could probably eliminate
the alignment problem in the vectorizer altogether.

Regards,
Vasileios Kalintiris
