Sanjay Patel via llvm-dev
2016-Mar-10 22:06 UTC
[llvm-dev] masked-load endpoints optimization
If we're loading the first and last elements of a vector using a masked load [1], can we replace the masked load with a full vector load?

"The result of this operation is equivalent to a regular vector load instruction followed by a ‘select’ between the loaded and the passthru values, predicated on the same mask. However, using this intrinsic prevents exceptions on memory access to masked-off lanes."

I think the fact that we're loading the endpoints of the vector guarantees that a full vector load can't have any different faulting/exception behavior on x86 and most (?) other targets. We would, however, be reading memory that the program has not explicitly requested.

IR example:

define <4 x i32> @maskedload_endpoints(<4 x i32>* %addr, <4 x i32> %v) {
  ; load the first and last elements pointed to by %addr and shuffle those into %v
  %res = call <4 x i32> @llvm.masked.load.v4i32(<4 x i32>* %addr, i32 4, <4 x i1> <i1 1, i1 0, i1 0, i1 1>, <4 x i32> %v)
  ret <4 x i32> %res
}

would become something like:

define <4 x i32> @maskedload_endpoints(<4 x i32>* %addr, <4 x i32> %v) {
  %vecload = load <4 x i32>, <4 x i32>* %addr, align 4
  %sel = select <4 x i1> <i1 1, i1 0, i1 0, i1 1>, <4 x i32> %vecload, <4 x i32> %v
  ret <4 x i32> %sel
}

If this isn't valid as an IR optimization, would it be acceptable as a DAG combine with a target hook to opt in?

[1] http://llvm.org/docs/LangRef.html#llvm-masked-load-intrinsics
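To make the quoted equivalence concrete, here is a lane-by-lane simulation of the two forms (a sketch in Python, not LLVM code; the function names are made up for illustration):

```python
def masked_load(mem, mask, passthru):
    # llvm.masked.load semantics: enabled lanes read memory,
    # disabled lanes take the corresponding passthru element.
    return [mem[i] if mask[i] else passthru[i] for i in range(len(mask))]

def full_load_then_select(mem, mask, passthru):
    # The proposed replacement: one full vector load that reads
    # every lane (including masked-off ones), then a select.
    loaded = list(mem)
    return [loaded[i] if mask[i] else passthru[i] for i in range(len(mask))]

mask = [1, 0, 0, 1]          # endpoints enabled, as in the IR example
mem = [10, 20, 30, 40]
v = [1, 2, 3, 4]
assert masked_load(mem, mask, v) == full_load_then_select(mem, mask, v) == [10, 2, 3, 40]
```

The two forms always produce the same values; the only difference is which memory is touched, which is exactly the faulting question above.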
Nema, Ashutosh via llvm-dev
2016-Mar-11 05:22 UTC
[llvm-dev] masked-load endpoints optimization
This looks interesting; the main motivation appears to be replacing a masked vector load with a general vector load followed by a select.

Masked vector loads are in general expensive in comparison with a regular vector load. But if the first and last elements of a masked vector load are guaranteed to be accessed, then it can be transformed to a vector load.

In opt this can be driven by TTI, where the benefit of this transformation should be checked.

Regards,
Ashutosh
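A TTI-driven gate could look roughly like the following sketch (Python pseudocode with invented cost numbers; a real implementation would query TargetTransformInfo for the masked-load, load, and select costs on the actual vector type):

```python
# Hypothetical per-target instruction costs; real values would come
# from TTI cost queries, not these made-up numbers.
COST = {"masked_load": 4, "vector_load": 1, "select": 1}

def should_convert(mask):
    # Legality: the first and last lanes must be enabled so the full
    # vector load touches no memory the masked load could not touch.
    if not (mask[0] and mask[-1]):
        return False
    # Profitability: load + select must not cost more than the masked load.
    return COST["vector_load"] + COST["select"] <= COST["masked_load"]

assert should_convert([1, 0, 0, 1])      # endpoints enabled -> convert
assert not should_convert([0, 1, 1, 1])  # first lane off -> keep masked load
```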
Sanjay Patel via llvm-dev
2016-Mar-11 16:57 UTC
[llvm-dev] masked-load endpoints optimization
Thanks, Ashutosh. Yes, either TTI or TLI could be used to limit the transform if we do it in CGP rather than the DAG.

The real question I have is whether it is legal to read the extra memory, regardless of whether this is a masked load or something else.

Note that the x86 backend already does this, so either my proposal is ok for x86, or we're already doing an illegal optimization:

define <4 x i32> @load_bonus_bytes(i32* %addr1, <4 x i32> %v) {
  %ld1 = load i32, i32* %addr1
  %addr2 = getelementptr i32, i32* %addr1, i64 3
  %ld2 = load i32, i32* %addr2
  %vec1 = insertelement <4 x i32> undef, i32 %ld1, i32 0
  %vec2 = insertelement <4 x i32> %vec1, i32 %ld2, i32 3
  ret <4 x i32> %vec2
}

$ ./llc -o - loadcombine.ll
...
movups (%rdi), %xmm0
retq

On Thu, Mar 10, 2016 at 10:22 PM, Nema, Ashutosh <Ashutosh.Nema at amd.com> wrote:
> In opt this can be driven by TTI, where the benefit of this transformation
> should be checked.
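The no-new-faults argument can be checked mechanically: memory protection is page-granular, so every byte of the full 16-byte load lies on a page already touched by the 4-byte loads of the first or last element. A small sketch, assuming a 4 KiB page size for illustration:

```python
PAGE = 4096  # assumed page size; the argument only needs pages >= the vector size

def pages_touched(addr, size):
    # Set of page indices covered by a [addr, addr + size) access.
    return set(range(addr // PAGE, (addr + size - 1) // PAGE + 1))

def full_load_adds_no_pages(addr):
    # First element: 4 bytes at addr; last element: 4 bytes at addr + 12.
    endpoint_pages = pages_touched(addr, 4) | pages_touched(addr + 12, 4)
    # The full <4 x i32> load covers 16 bytes starting at addr.
    return pages_touched(addr, 16) <= endpoint_pages

# Holds for every address, aligned or not, including page-straddling ones.
assert all(full_load_adds_no_pages(a) for a in range(4080, 4112))
```

This is why accessing the endpoints is the key precondition: a mask that leaves an endpoint off (e.g. <i1 0, i1 1, i1 1, i1 1>) gives no such guarantee for the first lane's page.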