thr3ads.net - llvm dev - [llvm-dev] [RFC] intrinsics for load/store-with-length semantics [Aug 2020]

Hussain Kadhem via llvm-dev

2020-Aug-27 09:52 UTC

[llvm-dev] [RFC] intrinsics for load/store-with-length semantics

We propose introducing two new intrinsics: llvm.variable.length.load and
llvm.variable.length.store.
We have implemented the infrastructure for defining and lowering these in this
phabricator patch: https://reviews.llvm.org/D86693

These represent the semantics of loading and storing a variable number of bytes
to a fixed-width register;
in effect, a masked load or store where the only active lanes are given by a
contiguous block.

There are a few reasons for separately representing this kind of operation, even
though as noted it can be represented by a subset of masked loads and stores.

- For targets that have separate hardware support for this kind of operation, it
makes it easier to generate an optimal lowering. We are currently working on
enabling this for some PowerPC subtargets. In particular, there are some targets
that support this kind of operation but not masked loads and stores in general,
including Power9.

- Scalarization of this pattern can be done using a number of branches
logarithmic in the width of the register, rather than the linear case for
general masked operations.

- Scalarized residuals of vectorized loops tend to employ these semantics
(tail-folding in particular), so this infrastructure can be used to make more
specific optimization decisions for lowering loop residuals. This also pulls out
the logic of how to represent and lower such semantics from the loop vectorizer,
allowing for better separation of concerns. Our group is currently working on
implementing some of these optimizations in the loop vectorizer.

- Representing these semantics using current masked intrinsics would require
introducing intermediate steps to generate the appropriate bitmasks, and then
detecting them during lowering. This introduces nontrivial complexity that we
want to avoid. If it isn't possible to detect all cases during lowering by
inspecting the AST, expensive runtime checks would then have to be introduced.

Please refer to the phabricator patch for our implementation, which includes
intrinsic definitions, new SDAG nodes, and support for type widening and
scalarization.
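
The claim about logarithmic scalarization can be illustrated with a small
reference model. This is only a sketch, not the lowering implemented in D86693:
the function names are hypothetical, the register is modelled as a Python list
of byte lanes, and filling inactive lanes with zero is an assumption (the thread
does not say what the intrinsic leaves in them). The linear version tests every
lane, as a general masked load must; the logarithmic version tests one bit of
the length per power-of-two chunk size.

```python
def length_load_linear(memory, base, length, width, fill=0):
    """Per-lane model: one test per lane, like a scalarized masked load."""
    result = []
    for lane in range(width):
        if lane < length:
            result.append(memory[base + lane])
        else:
            result.append(fill)  # assumed fill value for inactive lanes
    return result


def length_load_log(memory, base, length, width, fill=0):
    """Chunked model: one test per power-of-two chunk size.

    Returns the loaded lanes and the number of tests performed.
    """
    if width <= 0 or width & (width - 1):
        raise ValueError("width must be a power of two")
    if not 0 <= length <= width:
        raise ValueError("length must be in [0, width]")
    result = [fill] * width
    offset = 0
    tests = 0
    chunk = width
    while chunk >= 1:
        tests += 1
        if length & chunk:
            # Copy exactly `chunk` contiguous bytes; nothing past `length` is read.
            result[offset:offset + chunk] = memory[base + offset:base + offset + chunk]
            offset += chunk
        chunk //= 2
    return result, tests
```

For a 16-byte register the chunked version performs 5 tests (chunk sizes 16, 8,
4, 2, 1) regardless of the length, where the per-lane version performs 16.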

Eli Friedman via llvm-dev

2020-Aug-27 10:43 UTC

[llvm-dev] [RFC] intrinsics for load/store-with-length semantics

“The vectorizer needs this” seems like a fair reason to add it to the IR.

Pattern-matching an llvm.masked.load with an llvm.get.active.lane.mask operand
might not be that terrible?  If that works, I’d prefer to go with that because
we already have that codepath.  Otherwise, adding a new intrinsic seems okay.

There’s a possibility that we’ll want a version of llvm.masked.load that takes
both a length and a mask, eventually. See https://reviews.llvm.org/D57504 .  Not
completely sure how that should interact with this proposal.

-Eli
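
As a rough illustration of why that pattern match is plausible: per the LangRef,
lane i of llvm.get.active.lane.mask(base, n) is set when base + i < n (unsigned,
without wrapping), so the mask is always a contiguous prefix and the equivalent
length is n - base clamped to the vector width. The sketch below models this in
Python; the helper names are hypothetical, and a real lowering would match the
IR operands rather than inspect a runtime mask.

```python
def get_active_lane_mask(base, n, width):
    """Model of the mask: lane i is active when base + i < n."""
    return [base + lane < n for lane in range(width)]


def prefix_length(mask):
    """Return k if exactly the first k lanes are active, else None."""
    length = 0
    while length < len(mask) and mask[length]:
        length += 1
    if any(mask[length:]):
        return None  # an active lane after an inactive one: not a prefix
    return length


def equivalent_length(base, n, width):
    """Length a load-with-length would need to match the mask above."""
    return max(0, min(width, n - base))
```

For example, with base = 8, n = 11 and four lanes, the first three lanes are
active, which corresponds to a load-with-length of 3.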

Simon Moll via llvm-dev

2020-Aug-27 11:57 UTC

[llvm-dev] [RFC] intrinsics for load/store-with-length semantics

Hi Hussain,

Thanks for your patch! We are working on a similar extension (Vector
Predication) with the goal of enabling variable length on all SIMD operations.
We have a reference patch that includes variable-length load/store intrinsics.
You can find it here: https://reviews.llvm.org/D57504
Integer VP intrinsics are upstream and documented here:
https://llvm.org/docs/LangRef.html#vector-predication-intrinsics

The VP load/store intrinsics look like this:

    @llvm.vp.load(%ptr, %mask, %vector_length)
    @llvm.vp.store(%data, %ptr, %mask, %vector_length)

where the %vector_length argument specifies the variable length of the
operation. There is also a %mask, but you can simply pass an all-true constant
mask to disable it.
Now I wonder, would VP load/store cover your use case? It'd be great if your
patch, the scalarization logic, and TTI could be integrated into the framework
of VP intrinsics.

- Simon
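
A minimal model of how the two operands combine, assuming a lane takes part only
when its mask bit is set and its index is below %vector_length. The Python
function names mirror the intrinsics but are hypothetical, and representing a
disabled result lane as None is a modelling choice, not what the hardware or the
IR produces.

```python
def vp_load(memory, ptr, mask, vector_length):
    """Load lanes that are enabled by the mask and lie below vector_length."""
    return [
        memory[ptr + lane] if mask[lane] and lane < vector_length else None
        for lane in range(len(mask))
    ]


def vp_store(memory, data, ptr, mask, vector_length):
    """Store only the enabled lanes; other memory locations are untouched."""
    for lane in range(len(mask)):
        if mask[lane] and lane < vector_length:
            memory[ptr + lane] = data[lane]
```

With an all-true mask, vp_load(memory, ptr, [True] * 4, 2) touches only the
first two lanes, which is the load-with-length behaviour from the RFC.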


