Sanjay Patel via llvm-dev
2016-Mar-11 16:57 UTC
[llvm-dev] masked-load endpoints optimization
Thanks, Ashutosh.

Yes, either TTI or TLI could be used to limit the transform if we do it in
CGP rather than the DAG.

The real question I have is whether it is legal to read the extra memory,
regardless of whether this is a masked load or something else.

Note that the x86 backend already does this, so either my proposal is ok
for x86, or we're already doing an illegal optimization:

define <4 x i32> @load_bonus_bytes(i32* %addr1, <4 x i32> %v) {
  %ld1 = load i32, i32* %addr1
  %addr2 = getelementptr i32, i32* %addr1, i64 3
  %ld2 = load i32, i32* %addr2
  %vec1 = insertelement <4 x i32> undef, i32 %ld1, i32 0
  %vec2 = insertelement <4 x i32> %vec1, i32 %ld2, i32 3
  ret <4 x i32> %vec2
}

$ ./llc -o - loadcombine.ll
...
movups (%rdi), %xmm0
retq

On Thu, Mar 10, 2016 at 10:22 PM, Nema, Ashutosh <Ashutosh.Nema at amd.com> wrote:

> This looks interesting; the main motivation appears to be replacing a
> masked vector load with a general vector load followed by a select.
>
> We have observed that masked vector loads are in general expensive
> compared with a plain vector load.
>
> But if the first and last elements of a masked vector load are guaranteed
> to be accessed, then it can be transformed into a vector load.
>
> In opt this can be driven by TTI, where the benefit of this transformation
> should be checked.
>
> Regards,
> Ashutosh
>
> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of Sanjay Patel via llvm-dev
> Sent: Friday, March 11, 2016 3:37 AM
> To: llvm-dev
> Subject: [llvm-dev] masked-load endpoints optimization
>
> If we're loading the first and last elements of a vector using a masked
> load [1], can we replace the masked load with a full vector load?
>
> "The result of this operation is equivalent to a regular vector load
> instruction followed by a ‘select’ between the loaded and the passthru
> values, predicated on the same mask. However, using this intrinsic prevents
> exceptions on memory access to masked-off lanes."
>
> I think the fact that we're loading the endpoints of the vector guarantees
> that a full vector load can't have any different faulting/exception
> behavior on x86 and most (?) other targets. We would, however, be reading
> memory that the program has not explicitly requested.
>
> IR example:
>
> define <4 x i32> @maskedload_endpoints(<4 x i32>* %addr, <4 x i32> %v) {
>   ; load the first and last elements pointed to by %addr and shuffle those into %v
>   %res = call <4 x i32> @llvm.masked.load.v4i32(<4 x i32>* %addr, i32 4,
>       <4 x i1> <i1 1, i1 0, i1 0, i1 1>, <4 x i32> %v)
>   ret <4 x i32> %res
> }
>
> would become something like:
>
> define <4 x i32> @maskedload_endpoints(<4 x i32>* %addr, <4 x i32> %v) {
>   %vecload = load <4 x i32>, <4 x i32>* %addr, align 4
>   %sel = select <4 x i1> <i1 1, i1 0, i1 0, i1 1>, <4 x i32> %vecload, <4 x i32> %v
>   ret <4 x i32> %sel
> }
>
> If this isn't valid as an IR optimization, would it be acceptable as a DAG
> combine with a target hook to opt in?
>
> [1] http://llvm.org/docs/LangRef.html#llvm-masked-load-intrinsics
Sanjay Patel via llvm-dev
2016-Mar-14 17:06 UTC
[llvm-dev] masked-load endpoints optimization
I checked in a patch to do this transform for x86-only for now:
http://reviews.llvm.org/D18094 / http://reviews.llvm.org/rL263446
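For reference, here is a hedged sketch (not the actual regression test from
that commit) of a masked load that the transform should leave alone, because
the last element is masked off and those trailing bytes were never requested:

; lane 3 is masked off, so nothing guarantees the final 4 bytes are readable;
; this should remain a masked load
define <4 x i32> @maskedload_no_endpoint(<4 x i32>* %addr, <4 x i32> %v) {
  %res = call <4 x i32> @llvm.masked.load.v4i32(<4 x i32>* %addr, i32 4,
      <4 x i1> <i1 1, i1 1, i1 1, i1 0>, <4 x i32> %v)
  ret <4 x i32> %res
}

declare <4 x i32> @llvm.masked.load.v4i32(<4 x i32>*, i32, <4 x i1>, <4 x i32>)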
Zaks, Ayal via llvm-dev
2016-Apr-03 21:51 UTC
[llvm-dev] masked-load endpoints optimization
< The real question I have is whether it is legal to read the extra memory,
regardless of whether this is a masked load or something else.

If one is allowed to read from a given address, a reasonable(?) assumption is
that the aligned cache-line containing this address can be read. This should
help answer the question. Such an assumption may be interesting to consider
also in isSafeToLoadUnconditionally, btw.

I wonder in what situations one may know (at compile time) that both the
first and the last bit of a mask are on?

Thanks for rL263446,
Ayal.
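A worked instance of that cache-line assumption, using a 64-byte line size
purely for illustration (the argument only needs the vector to be no larger
than the cache-line/protection granule, as in condition 2 of the follow-up below):

  <4 x i32> load     : 16 bytes at [p, p+15]
  cache line         : 64 bytes (illustrative)
  lines touched      : at most 2, since 16 <= 64
  lane 0 readable    => the first line touched contains a readable byte
  lane 3 readable    => the last line touched contains a readable byte
  => every cache line covered by [p, p+15] is readable, so the full vector
     load cannot introduce a new fault under this assumption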
Sanjay Patel via llvm-dev
2016-Apr-04 17:11 UTC
[llvm-dev] masked-load endpoints optimization
On Sun, Apr 3, 2016 at 3:51 PM, Zaks, Ayal <ayal.zaks at intel.com> wrote:

> < The real question I have is whether it is legal to read the extra
> memory, regardless of whether this is a masked load or something else.
>
> If one is allowed to read from a given address, a reasonable(?) assumption
> is that the aligned cache-line containing this address can be read. This
> should help answer the question.

I started another thread with this question to also include cfe-dev:
http://lists.llvm.org/pipermail/llvm-dev/2016-March/096828.html

For reference, the necessary conditions to do the transform are at least these:

1. Both ends of the vector are used.
2. The vector is smaller than the granularity of the cacheline and of memory
   protection on the targeted architecture.
3. Not FP, or the arch doesn’t raise flags on FP loads (most don't).
4. Not volatile or atomic.

I have tried to meet all of those requirements for x86, so the transform is
still available in trunk. If I've missed a predicate, it should be considered
a bug.

> I wonder in what situations one may know (at compile time) that both the
> first and the last bit of a mask are on?

The main motivation was to make sure that all masked move operations were
optimally supported by the x86 backend, such that we could replace any
regular AVX vector load/store with a masked op (including 'all' and 'none'
masks) in source. This helps hand-coded, but possibly very templated, vector
source code to perform as well as specialized vector code. In theory, it
should also allow the auto-vectorizers to produce more efficient
prologue/epilogue loop code, but that has not been implemented yet AFAIK.

The "load doughnut" optimization was just something I noticed while handling
the expected patterns, so I thought I'd better throw it in too. :)
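To illustrate the 'all' and 'none' mask cases mentioned above (a sketch in
the style of the earlier examples, not code from the patch), the hope is
that these two shapes cost no more than a plain vector load and no load at
all, respectively:

; all lanes on: per the LangRef semantics quoted earlier, this is equivalent
; to a plain <4 x i32> load
define <4 x i32> @mask_all(<4 x i32>* %addr, <4 x i32> %v) {
  %res = call <4 x i32> @llvm.masked.load.v4i32(<4 x i32>* %addr, i32 4,
      <4 x i1> <i1 1, i1 1, i1 1, i1 1>, <4 x i32> %v)
  ret <4 x i32> %res
}

; no lanes on: no memory is accessed, and the result is just the passthru %v
define <4 x i32> @mask_none(<4 x i32>* %addr, <4 x i32> %v) {
  %res = call <4 x i32> @llvm.masked.load.v4i32(<4 x i32>* %addr, i32 4,
      <4 x i1> <i1 0, i1 0, i1 0, i1 0>, <4 x i32> %v)
  ret <4 x i32> %res
}

declare <4 x i32> @llvm.masked.load.v4i32(<4 x i32>*, i32, <4 x i1>, <4 x i32>)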