thr3ads.net - llvm dev - [LLVMdev] Memset/memcpy: user control of loop-idiom recognizer [Dec 2014]

Smith, Kevin B

2014-Dec-06 04:05 UTC

[LLVMdev] Memset/memcpy: user control of loop-idiom recognizer

Hal,

I appreciate the clarification.  That is what I was expecting (that the
transformation uses intrinsics).  The Intel compiler does the same thing
internally: like LLVM, it converts the loop into an internal intrinsic,
not a plain library call.  Nevertheless, there are a huge number of ways (in
machine code) to write "the best" memory copy or memory set code if, as a
programmer, you are able to constrain the parameters in many of the ways I
was referring to.  Often, the loops that implement these operations have
those conditions programmed into them, but there is no real way to indicate
that to the compilation system.  That sometimes makes it very tricky (as Rob
is bringing up) for the lowering of these intrinsics to do as good a job as
the original loop did.  As a counterpoint, there are of course also many
cases where the compiler will do MUCH better than the original loop, and
that is why both the LLVM and Intel compilation systems have made the
effort to do this transformation.

I'm just trying to point out that the transformation from loop to intrinsic
is lossy in a number of ways and that, even if it weren't lossy, the number
of possible lowerings results in a huge search space for the best lowering.
I therefore think it is definitely worth considering what a reasonable way
might be to throttle the loop->intrinsic transformation based on some
IR-level hint coming from the programmer through the front end.
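
To make the "lossy" point concrete, here is a minimal sketch in C. It is illustrative only and not taken from any compiler; the function names are made up. The hand-written loop states in its source that it copies whole 8-byte words, while the call form it is conceptually rewritten into carries only a byte count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hand-written copy loop: the source itself says that whole 8-byte
 * words are moved, so the byte count is always a multiple of 8. */
void copy_words_loop(int64_t *restrict dst, const int64_t *restrict src,
                     size_t nwords)
{
    assert(nwords == 0 || (dst != NULL && src != NULL));
    for (size_t i = 0; i < nwords; ++i)
        dst[i] = src[i];
}

/* Conceptual result of idiom recognition: a single copy with a byte
 * count. That the count is a multiple of 8, and that the accesses were
 * 8 bytes wide, is no longer stated anywhere; a lowering would have to
 * rediscover it from the expression `nwords * sizeof(int64_t)`. */
void copy_words_memcpy(int64_t *restrict dst, const int64_t *restrict src,
                       size_t nwords)
{
    assert(nwords == 0 || (dst != NULL && src != NULL));
    memcpy(dst, src, nwords * sizeof(int64_t));
}
```

Both functions produce the same bytes for non-overlapping buffers; the difference is only in what the later stages of a compiler can still see.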

Kevin 

-----Original Message-----
From: Hal Finkel [mailto:hfinkel at anl.gov] 
Sent: Friday, December 05, 2014 5:45 PM
To: Smith, Kevin B
Cc: LLVM Developers Mailing List; Philip Reames; David Chisnall; Robert Lougher
Subject: Re: [LLVMdev] Memset/memcpy: user control of loop-idiom recognizer

----- Original Message -----
> From: "Kevin B Smith" <kevin.b.smith at intel.com>
> To: "Philip Reames" <listmail at philipreames.com>, "David Chisnall"
> <david.chisnall at cl.cam.ac.uk>, "Robert Lougher" <rob.lougher at gmail.com>
> Cc: "LLVM Developers Mailing List" <llvmdev at cs.uiuc.edu>
> Sent: Friday, December 5, 2014 1:06:14 PM
> Subject: Re: [LLVMdev] Memset/memcpy: user control of loop-idiom recognizer
> 
> There are a large number of ways to lose information in translating
> loops into memset/memcpy calls; alignment is one of them.
> As previously mentioned, loop trip count is another.  Another is the
> size of accesses.  For example, the loop may originally have been using
> int64_t-sized copies.  This has a definite impact on what the best
> memset/memcpy expansion is, because effectively the loop knows that
> it is always writing a multiple of 8 bytes, and does so in 8-byte
> chunks.  So the fact that the number of bytes has some specific value
> property (like the lower 3 bits always being 0, another reason for
> having known bits and known bit values :-)) should affect the
> lowering of such loops/calls, but probably doesn't.
Hi Kevin,

Just so everyone is on the same page, when we convert a loop to a memcpy
intrinsic, we're really talking about this:
http://llvm.org/docs/LangRef.html#llvm-memcpy-intrinsic -- and this intrinsic
carries alignment information. Now one problem is that it carries only one
alignment specifier, not separate ones for the source and destination, and we
may want to improve that. Nevertheless, I want everyone to understand that
we're not just transforming these loops into libc calls, but into
intrinsics, and the targets then control whether these are expanded, and how, or
turned into actual libc calls.
> 
> Database folks often write their own copy routines for use in
> specific instances, as do OSes, such as when they know they are
> clearing or copying exactly a page on a page-size boundary.  They have
> very specialized implementations of these, including some that use
> non-temporal hints so as not to pollute the cache.
I don't think we perform loop idiom recognition based on target-specific
intrinsics (such as those providing non-temporal stores).

 -Hal
> 
> It is also worth pointing out that most loops have a very specific
> behavior in the case of overlaps that is well-defined, and that
> memcpy does not.
> 
> There are definitely good reasons why various knowledgeable users
> would not want a compiler to perform such a transform on at least
> some of their loops.
> 
> Kevin Smith
> 
> -----Original Message-----
> From: llvmdev-bounces at cs.uiuc.edu
> [mailto:llvmdev-bounces at cs.uiuc.edu] On Behalf Of Philip Reames
> Sent: Friday, December 05, 2014 10:08 AM
> To: David Chisnall; Robert Lougher
> Cc: LLVM Developers Mailing List
> Subject: Re: [LLVMdev] Memset/memcpy: user control of loop-idiom
> recognizer
> 
> 
> On 12/04/2014 11:46 PM, David Chisnall wrote:
> > On 3 Dec 2014, at 23:36, Robert Lougher <rob.lougher at gmail.com>
> > wrote:
> >
> >> On 2 December 2014 at 22:18, Alex Rosenberg <alexr at leftfield.org>
> >> wrote:
> >>> Our C library amplifies this problem by being in a dynamic
> >>> library, so the call has additional overhead, which for small
> >>> trip counts swamps the copy/set.
> >>>
> >> I can't imagine we're the only platform (now or in the future)
> >> that has comparatively slow library calls.  We had discussed some
> >> sort of platform flag (has slow library calls) but this would be
> >> too late to affect the loop-idiom.  However, it could affect
> >> lowering.  Following on from Reid's earlier idea to lower short
> >> memcpys to an inlined, slightly widened loop, we could expand into
> >> a guarded loop for small values and a call?
> > I think the bug is not that we are recognising that the loop is
> > memcpy, it's that we're then generating an inefficient memcpy.  We
> > do this for a variety of reasons, some of which apply elsewhere.
> > One issue I hit a few months ago was that the vectoriser doesn't
> > notice whether unaligned loads and stores are supported, so will
> > happily replace two adjacent i32 align 4 loads followed by two
> > adjacent i32 align 4 stores with an i64 align 4 load followed by
> > an i64 align 4 store, which more than doubles the number of
> > instructions that the back end emits.
> >
> > We expand memcpy and friends in several different places (in the IR
> > in at least one place, then in SelectionDAG, and then again in the
> > back end, as I recall - I remember playing whack-a-bug with this
> > for a while as the lowering was differently broken for our target
> > in each place).  In SelectionDAG, we're dealing with a single
> > basic block, so we can't construct the loop.  In the back end
> > we've already lost a lot of high-level type information that would
> > make this easier.
> >
> > I'd be in favour of consolidating the memcpy / memset / memmove
> > expansion into an IR pass that would take a cost model from the
> > target.
> +1
> 
> It sounds like we might also be losing information about alignment in
> the loop-idiom recognizer.  Or at least not using it when we lower.
> >
> > David
> >
> >
> > _______________________________________________
> > LLVM Developers mailing list
> > LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> 
-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

Hal Finkel

2014-Dec-06 13:06 UTC

[LLVMdev] Memset/memcpy: user control of loop-idiom recognizer

----- Original Message -----
> From: "Kevin B Smith" <kevin.b.smith at intel.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "LLVM Developers Mailing List" <llvmdev at cs.uiuc.edu>, "Philip
> Reames" <listmail at philipreames.com>, "David Chisnall"
> <david.chisnall at cl.cam.ac.uk>, "Robert Lougher" <rob.lougher at gmail.com>
> Sent: Friday, December 5, 2014 10:05:49 PM
> Subject: RE: [LLVMdev] Memset/memcpy: user control of loop-idiom recognizer
> 
> Hal,
> 
> I appreciate the clarification.  That is what I was expecting (that
> the transformation uses intrinsics).  The Intel compiler does the same
> thing internally: like LLVM, it converts the loop into an internal
> intrinsic, not a plain library call.  Nevertheless, there are a huge
> number of ways (in machine code) to write "the best" memory copy or
> memory set code if, as a programmer, you are able to constrain the
> parameters in many of the ways I was referring to.  Often, the loops
> that implement these operations have those conditions programmed into
> them, but there is no real way to indicate that to the compilation
> system.  That sometimes makes it very tricky (as Rob is bringing up)
> for the lowering of these intrinsics to do as good a job as the
> original loop did.  As a counterpoint, there are of course also many
> cases where the compiler will do MUCH better than the original loop,
> and that is why both the LLVM and Intel compilation systems have made
> the effort to do this transformation.
> 
> I'm just trying to point out that the transformation from loop to
> intrinsic is lossy in a number of ways and that, even if it weren't
> lossy, the number of possible lowerings results in a huge search
> space for the best lowering.  I therefore think it is definitely
> worth considering what a reasonable way might be to throttle the
> loop->intrinsic transformation based on some IR-level hint coming
> from the programmer through the front end.
Hi Kevin,

I don't disagree, but if we can come up with a reasonable way of describing
this space, then using that description to hint the memcpy intrinsic might be
better than a binary recognize/don't-recognize switch. It is not yet clear
to me. Quickly, I can think of a few properties such a description would
need to cover:
 - Alignment (we currently provide one alignment, but the source and
   destination can have different alignments)
 - Direction (should the memory be traversed forward or backward)
 - Blocking factor and direction (how much memory should be loaded/stored
   "at a time", and in what order should those loads/stores be issued)
 - Load/store size (what data type was used for the individual loads/stores)
 - Cache hinting (if we do idiom recognition on target-specific intrinsics,
   we'd need to capture whether the stores were non-temporal, etc.)
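
As a rough illustration of what such a description could look like, the list above can be written down as a plain C record. This is a hypothetical sketch: none of these names exist in LLVM, and the expansion below honours only the direction and the load/store size while ignoring the other fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical hint record mirroring the list above; not an LLVM API. */
enum copy_direction { COPY_FORWARD, COPY_BACKWARD };

struct copy_hints {
    size_t src_align;         /* known alignment of the source, in bytes */
    size_t dst_align;         /* known alignment of the destination */
    enum copy_direction dir;  /* traversal order of the original loop */
    size_t block_bytes;       /* bytes loaded/stored "at a time" */
    size_t access_bytes;      /* width of each individual load/store */
    bool nontemporal;         /* original stores bypassed the cache */
};

/* Reference expansion that uses only `dir` and `access_bytes`.
 * Assumes the regions do not overlap and that n is a multiple of
 * access_bytes (checked by the assert). */
void copy_with_hints(void *dst, const void *src, size_t n,
                     const struct copy_hints *h)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t step = h->access_bytes;

    assert(step != 0 && n % step == 0);
    size_t chunks = n / step;
    for (size_t i = 0; i < chunks; ++i) {
        /* Visit chunks low-to-high or high-to-low, as hinted. */
        size_t k = (h->dir == COPY_FORWARD) ? i : chunks - 1 - i;
        memcpy(d + k * step, s + k * step, step);
    }
}
```

A real lowering would pick instructions from these fields rather than loop over `memcpy`; the point of the sketch is only that each bullet maps to one small, explicit field.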

Thanks again,
Hal
-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

Joerg Sonnenberger

2014-Dec-06 21:48 UTC

[LLVMdev] Memset/memcpy: user control of loop-idiom recognizer

On Sat, Dec 06, 2014 at 07:06:31AM -0600, Hal Finkel wrote:
>  - Direction (should the memory be traversed forward or backward)
I don't think that this makes sense for memset and memcpy. It does
matter for memmove.
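
A small sketch in C of why direction only matters once the regions may overlap (the helper names are made up for illustration). With a destination one byte above the source, a forward byte loop smears the first byte across the buffer, while a backward loop gives the same result as memmove:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte copy walking from low to high addresses. */
void copy_forward(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}

/* Byte copy walking from high to low addresses. */
void copy_backward(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    for (size_t i = n; i-- > 0; )
        dst[i] = src[i];
}
```

Starting from the bytes "abcdef", `copy_forward(buf + 1, buf, 5)` leaves "aaaaaa", whereas `copy_backward(buf + 1, buf, 5)` leaves "aabcde", which is what `memmove(buf + 1, buf, 5)` produces. For non-overlapping buffers the two loops are indistinguishable, and memcpy with overlapping buffers is undefined in C.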

Joerg
