Nadav Rotem
2014-Oct-24 18:38 UTC
[LLVMdev] Adding masked vector load and store intrinsics
> On Oct 24, 2014, at 10:57 AM, Adam Nemet <anemet at apple.com> wrote:
>
>> On Oct 24, 2014, at 4:24 AM, Demikhovsky, Elena <elena.demikhovsky at intel.com> wrote:
>>
>> Hi,
>>
>> We would like to add support for masked vector loads and stores by introducing new target-independent intrinsics. The loop vectorizer will then be enhanced to optimize loops containing conditional memory accesses by generating these intrinsics for existing targets such as AVX2 and AVX-512. The vectorizer will first ask the target about the availability of masked vector loads and stores. The SLP vectorizer can potentially be enhanced to use these intrinsics as well.

I am happy to hear that you are working on this because it means that in the future we would be able to teach the SLP Vectorizer to vectorize types such as <3 x float>.

>> The intrinsics would be legal for all targets; targets that do not support masked vector loads or stores will scalarize them.

+1. I think that this is an important requirement.

> I do agree that we would like to have one IR node to capture these so that they survive until ISel and that their specific semantics can be expressed. However, can you discuss the other options (new IR instructions, target-specific intrinsics) and why you went with target-independent intrinsics?

I agree with the approach of adding target-independent masked memory intrinsics. One reason is that I would like to keep the vectorizers target-independent (and use the target transform info to query the backends). I oppose adding new first-level instructions because we would need to teach all of the existing optimizations about the new instructions, and considering the limited usefulness of masked operations it is not worth the effort.

> My intuition would have been to go with target-specific intrinsics until we have something solid implemented and then potentially turn this into native IR instructions as the next step (for other targets, etc.). I am particularly worried whether we really want to generate these for targets that don't have vector predication support.

Probably not, but this is a cost-benefit decision that the vectorizers would need to make.

> There is also the related question of vector-predicating any other instruction beyond just loads and stores, which AVX-512 supports. This is probably a smaller gain but should probably be part of the plan as well.
>
> Adam
>
>> The addressed memory will not be touched for masked-off lanes. In particular, if all lanes are masked off, no address will be accessed.
>>
>> call void @llvm.masked.store (i32* %addr, <16 x i32> %data, i32 4, <16 x i1> %mask)
>>
>> %data = call <8 x i32> @llvm.masked.load (i32* %addr, <8 x i32> %passthru, i32 4, <8 x i1> %mask)
>>
>> where %passthru is used to fill the elements of %data that are masked off (if any; it can be zeroinitializer or undef).
>>
>> Comments so far, before we dive into more details?
>>
>> Thank you.
>>
>> - Elena and Ayal
>>
>> ---------------------------------------------------------------------
>> Intel Israel (74) Limited
>>
>> This e-mail and any attachments may contain confidential material for
>> the sole use of the intended recipient(s). Any review or distribution
>> by others is strictly prohibited. If you are not the intended
>> recipient, please contact the sender and delete all copies.
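The semantics quoted above can be made concrete with a small executable model. This is a hypothetical Python sketch (plain lists standing in for vectors and memory, not LLVM IR or any actual lowering): masked-off lanes are neither read nor written, and masked-off result lanes come from the pass-through value.

```python
def masked_load(memory, base, passthru, mask):
    """Model of @llvm.masked.load: read only the active lanes;
    inactive result lanes are filled from passthru."""
    return [memory[base + lane] if active else p
            for lane, (active, p) in enumerate(zip(mask, passthru))]

def masked_store(memory, base, data, mask):
    """Model of @llvm.masked.store: write only the active lanes;
    memory for masked-off lanes is not touched."""
    for lane, active in enumerate(mask):
        if active:
            memory[base + lane] = data[lane]

mem = [10, 20, 30, 40]
loaded = masked_load(mem, 0, [0, 0, 0, 0], [True, False, True, False])
print(loaded)  # [10, 0, 30, 0]

masked_store(mem, 0, [-1, -2, -3, -4], [False, True, False, True])
print(mem)  # [10, -2, 30, -4]
```

Note that indexing `memory[base + lane]` is only evaluated for active lanes, mirroring the guarantee that no masked-off address is accessed (and hence cannot fault).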
On Oct 24, 2014, at 11:38 AM, Nadav Rotem <nrotem at apple.com> wrote:

> I agree with the approach of adding target-independent masked memory intrinsics. One reason is that I would like to keep the vectorizers target independent (and use the target transform info to query the backends). I oppose adding new first-level instructions because we would need to teach all of the existing optimizations about the new instructions, and considering the limited usefulness of masked operations it is not worth the effort.

Thanks, Nadav, that makes sense. Do you foresee any potential issues due to the limitation of what information can be attached to an intrinsic call vs. a store, e.g. alignment or alias info? I do remember from trying to optimize from-memory-broadcast intrinsics that the optimizers were pretty limited in dealing with intrinsics accessing memory.

Adam
----- Original Message -----
> From: "Adam Nemet" <anemet at apple.com>
> To: "Nadav Rotem" <nrotem at apple.com>
> Cc: dag at cray.com, llvmdev at cs.uiuc.edu
> Sent: Friday, October 24, 2014 2:03:24 PM
> Subject: Re: [LLVMdev] Adding masked vector load and store intrinsics
>
> Thanks, Nadav, that makes sense. Do you foresee any potential issues
> due to the limitation of what information can be attached to an
> intrinsic call vs. a store, e.g. alignment or alias info? I do
> remember from trying to optimize from-memory-broadcast intrinsics
> that the optimizers were pretty limited in dealing with intrinsics
> accessing memory.

This is, hopefully, a bit better now than it was in the past. Nevertheless, it would not be bad to improve our handling of these things in general. Alignment it has (as an explicit argument), and alias metadata should just work (except perhaps for TBAA, but that should be easy to fix).

-Hal

> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
dag at cray.com
2014-Oct-24 20:02 UTC
[LLVMdev] Adding masked vector load and store intrinsics
Nadav Rotem <nrotem at apple.com> writes:

> I oppose adding new first-level instructions because we would need to
> teach all of the existing optimizations about the new instructions,
> and considering the limited usefulness of masked operations it is not
> worth the effort.

Limited usefulness? It is quite the opposite. If we were starting from scratch on an IR, we'd want to have first-class mask support, with masks as an additional operand to nearly every IR instruction.

Given where we are, target-independent intrinsics seem like a good compromise, because, as you said, it would be a huge task to teach all of the existing LLVM code about a new instruction operand. With intrinsics, passes are conservative when they see an intrinsic they don't understand. We can teach passes about specific intrinsics as we find benefit in doing so.

-David
Pete Cooper
2014-Oct-24 20:40 UTC
[LLVMdev] Adding masked vector load and store intrinsics
> On Oct 24, 2014, at 11:38 AM, Nadav Rotem <nrotem at apple.com> wrote:
>
> I agree with the approach of adding target-independent masked memory intrinsics. One reason is that I would like to keep the vectorizers target independent (and use the target transform info to query the backends). I oppose adding new first-level instructions because we would need to teach all of the existing optimizations about the new instructions, and considering the limited usefulness of masked operations it is not worth the effort.

I agree with this. They should be target-independent.

However, what types should be supported here? I haven't looked in detail, but from memory I believe AVX-512 masks 32-bit values, and not bytes. Are we going to have an intrinsic which can handle any vector type, or just <n x 32-bit> vectors, even at first?

Also, given that the types of the vectors matter, it seems like we're going to need TTI anyway whenever we want to generate one of these, or else we'll end up generating an illegal version which has to be scalarised in the backend.

Thanks,
Pete
----- Original Message -----
> From: "Pete Cooper" <peter_cooper at apple.com>
> To: "Nadav Rotem" <nrotem at apple.com>
> Cc: dag at cray.com, llvmdev at cs.uiuc.edu
> Sent: Friday, October 24, 2014 3:40:10 PM
> Subject: Re: [LLVMdev] Adding masked vector load and store intrinsics
>
> However, what types should be supported here? I haven't looked in
> detail, but from memory I believe AVX-512 masks 32-bit values, and
> not bytes. Are we going to have an intrinsic which can handle any
> vector type, or just <n x 32-bit> vectors, even at first?

I think you're confusing the IR types with the backend types. At the IR level, the masks are <n x i1> (one boolean per vector lane). The backend may represent this with a different type, but that's true of comparison results generally (they're often represented with different types in the backend); we already deal with that. Regarding the pointer type, it is irrelevant; we'll just cast to it from whatever the ideal pointer type is.

-Hal

> Also, given that the types of the vectors matter, it seems like we're
> going to need TTI anyway whenever we want to generate one of these,
> or else we'll end up generating an illegal version which has to be
> scalarised in the backend.

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
dag at cray.com
2014-Oct-24 22:27 UTC
[LLVMdev] Adding masked vector load and store intrinsics
Pete Cooper <peter_cooper at apple.com> writes:

> However, what types should be supported here? I haven't looked in
> detail, but from memory I believe AVX-512 masks 32-bit values, and not
> bytes. Are we going to have an intrinsic which can handle any vector
> type, or just <n x 32-bit> vectors, even at first?

Eventually we should support at least f/i 8, 16, 32 and 64. We can start with f/i 32 and 64 for now, I think.

> Also, given that the types of the vectors matter, it seems like we're
> going to need TTI anyway whenever we want to generate one of these, or
> else we'll end up generating an illegal version which has to be
> scalarised in the backend.

Yep.

-David