Demikhovsky, Elena
2014-Oct-24 11:24 UTC
[LLVMdev] Adding masked vector load and store intrinsics
Hi, We would like to add support for masked vector loads and stores by introducing new target-independent intrinsics. The loop vectorizer will then be enhanced to optimize loops containing conditional memory accesses by generating these intrinsics for existing targets such as AVX2 and AVX-512. The vectorizer will first ask the target about availability of masked vector loads and stores. The SLP vectorizer can potentially be enhanced to use these intrinsics as well. The intrinsics would be legal for all targets; targets that do not support masked vector loads or stores will scalarize them. The addressed memory will not be touched for masked-off lanes. In particular, if all lanes are masked off no address will be accessed. call void @llvm.masked.store (i32* %addr, <16 x i32> %data, i32 4, <16 x i1> %mask) %data = call <8 x i32> @llvm.masked.load (i32* %addr, <8 x i32> %passthru, i32 4, <8 x i1> %mask) where %passthru is used to fill the elements of %data that are masked-off (if any; can be zeroinitializer or undef). Comments so far, before we dive into more details? Thank you. - Elena and Ayal --------------------------------------------------------------------- Intel Israel (74) Limited This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies. -------------- next part -------------- An HTML attachment was scrubbed... URL: <lists.llvm.org/pipermail/llvm-dev/attachments/20141024/d56ce6e1/attachment.html>
----- Original Message -----> From: "Elena Demikhovsky" <elena.demikhovsky at intel.com> > To: llvmdev at cs.uiuc.edu > Cc: dag at cray.com > Sent: Friday, October 24, 2014 6:24:15 AM > Subject: [LLVMdev] Adding masked vector load and store intrinsics > > > > Hi, > > We would like to add support for masked vector loads and stores by > introducing new target-independent intrinsics. The loop vectorizer > will then be enhanced to optimize loops containing conditional > memory accesses by generating these intrinsics for existing targets > such as AVX2 and AVX-512. The vectorizer will first ask the target > about availability of masked vector loads and stores. The SLP > vectorizer can potentially be enhanced to use these intrinsics as > well. > > The intrinsics would be legal for all targets; targets that do not > support masked vector loads or stores will scalarize them. > The addressed memory will not be touched for masked-off lanes. In > particular, if all lanes are masked off no address will be accessed. > > call void @llvm.masked.store (i32* %addr, <16 x i32> %data, i32 4, > <16 x i1> %mask) > > %data = call <8 x i32> @llvm.masked.load (i32* %addr, <8 x i32> > %passthru, i32 4, <8 x i1> %mask) > > where %passthru is used to fill the elements of %data that are > masked-off (if any; can be zeroinitializer or undef). > > Comments so far, before we dive into more details?For the stores, I think this is a reasonable idea. The alternative is to represent them in scalar form with a lot of control flow, and I think that expecting the backend to properly pattern match that after isel is not realistic. For the loads, I'm must less sure. Why can't we represent the loads as select(mask, load(addr), passthru)? It is true, that the load might get separated from the select so that isel might not see it (because isel if basic-block local), but we can add some code in CodeGenPrep to fix that for targets on which it is useful to do so (which is a more-general solution than the intrinsic anyhow). What do you think? Thanks again, Hal> > Thank you. > > - Elena and Ayal > > > > --------------------------------------------------------------------- > Intel Israel (74) Limited > > This e-mail and any attachments may contain confidential material for > the sole use of the intended recipient(s). Any review or distribution > by others is strictly prohibited. If you are not the intended > recipient, please contact the sender and delete all copies. > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu llvm.cs.uiuc.edu > lists.cs.uiuc.edu/mailman/listinfo/llvmdev >-- Hal Finkel Assistant Computational Scientist Leadership Computing Facility Argonne National Laboratory
Demikhovsky, Elena
2014-Oct-24 13:07 UTC
[LLVMdev] Adding masked vector load and store intrinsics
> For the loads, I'm must less sure. Why can't we represent the loads as select(mask, load(addr), passthru)? It is true, that the load might get separated from the select so that isel might not see it (because isel if basic-block local), but we can add some code in CodeGenPrep to fix that for targets on which it is useful to do so (which is a more-general solution than the intrinsic anyhow). What do you think?We generate the vector-masked-intrinsic on IR-to-IR pass. It is too far from instruction selection. We'll need to guarantee that all subsequent IR-to-IR passes will not break the sequence. And only for one or two specific targets. Then we'll keep the logic in type legalizer, which may split or extend operations. Then we are taking care in DAG-combine. In my opinion, this is just unsafe. - Elena -----Original Message----- From: Hal Finkel [mailto:hfinkel at anl.gov] Sent: Friday, October 24, 2014 15:50 To: Demikhovsky, Elena Cc: dag at cray.com; llvmdev at cs.uiuc.edu Subject: Re: [LLVMdev] Adding masked vector load and store intrinsics ----- Original Message -----> From: "Elena Demikhovsky" <elena.demikhovsky at intel.com> > To: llvmdev at cs.uiuc.edu > Cc: dag at cray.com > Sent: Friday, October 24, 2014 6:24:15 AM > Subject: [LLVMdev] Adding masked vector load and store intrinsics > > > > Hi, > > We would like to add support for masked vector loads and stores by > introducing new target-independent intrinsics. The loop vectorizer > will then be enhanced to optimize loops containing conditional memory > accesses by generating these intrinsics for existing targets such as > AVX2 and AVX-512. The vectorizer will first ask the target about > availability of masked vector loads and stores. The SLP vectorizer can > potentially be enhanced to use these intrinsics as well. > > The intrinsics would be legal for all targets; targets that do not > support masked vector loads or stores will scalarize them. > The addressed memory will not be touched for masked-off lanes. In > particular, if all lanes are masked off no address will be accessed. > > call void @llvm.masked.store (i32* %addr, <16 x i32> %data, i32 4, > <16 x i1> %mask) > > %data = call <8 x i32> @llvm.masked.load (i32* %addr, <8 x i32> > %passthru, i32 4, <8 x i1> %mask) > > where %passthru is used to fill the elements of %data that are > masked-off (if any; can be zeroinitializer or undef). > > Comments so far, before we dive into more details?For the stores, I think this is a reasonable idea. The alternative is to represent them in scalar form with a lot of control flow, and I think that expecting the backend to properly pattern match that after isel is not realistic. For the loads, I'm must less sure. Why can't we represent the loads as select(mask, load(addr), passthru)? It is true, that the load might get separated from the select so that isel might not see it (because isel if basic-block local), but we can add some code in CodeGenPrep to fix that for targets on which it is useful to do so (which is a more-general solution than the intrinsic anyhow). What do you think? Thanks again, Hal> > Thank you. > > - Elena and Ayal > > > > --------------------------------------------------------------------- > Intel Israel (74) Limited > > This e-mail and any attachments may contain confidential material for > the sole use of the intended recipient(s). Any review or distribution > by others is strictly prohibited. If you are not the intended > recipient, please contact the sender and delete all copies. > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu llvm.cs.uiuc.edu > lists.cs.uiuc.edu/mailman/listinfo/llvmdev >-- Hal Finkel Assistant Computational Scientist Leadership Computing Facility Argonne National Laboratory --------------------------------------------------------------------- Intel Israel (74) Limited This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies.
Das, Dibyendu
2014-Oct-24 13:19 UTC
[LLVMdev] Adding masked vector load and store intrinsics
This looks to be a reasonable proposal. However native instructions that support such masked ld/st may have a high latency ? Also, it would be good to state some workloads where this will have a positive impact. -dibyendu From: Demikhovsky, Elena [mailto:elena.demikhovsky at intel.com] Sent: Friday, October 24, 2014 06:24 AM Central Standard Time To: llvmdev at cs.uiuc.edu <llvmdev at cs.uiuc.edu> Cc: dag at cray.com <dag at cray.com> Subject: [LLVMdev] Adding masked vector load and store intrinsics Hi, We would like to add support for masked vector loads and stores by introducing new target-independent intrinsics. The loop vectorizer will then be enhanced to optimize loops containing conditional memory accesses by generating these intrinsics for existing targets such as AVX2 and AVX-512. The vectorizer will first ask the target about availability of masked vector loads and stores. The SLP vectorizer can potentially be enhanced to use these intrinsics as well. The intrinsics would be legal for all targets; targets that do not support masked vector loads or stores will scalarize them. The addressed memory will not be touched for masked-off lanes. In particular, if all lanes are masked off no address will be accessed. call void @llvm.masked.store (i32* %addr, <16 x i32> %data, i32 4, <16 x i1> %mask) %data = call <8 x i32> @llvm.masked.load (i32* %addr, <8 x i32> %passthru, i32 4, <8 x i1> %mask) where %passthru is used to fill the elements of %data that are masked-off (if any; can be zeroinitializer or undef). Comments so far, before we dive into more details? Thank you. - Elena and Ayal --------------------------------------------------------------------- Intel Israel (74) Limited This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies. -------------- next part -------------- An HTML attachment was scrubbed... URL: <lists.llvm.org/pipermail/llvm-dev/attachments/20141024/b1fd8842/attachment.html>
Demikhovsky, Elena
2014-Oct-24 13:36 UTC
[LLVMdev] Adding masked vector load and store intrinsics
I wrote a loop with conditional load and store and measured performance on AVX2, where masking support is very basic, relatively to AVX-512. I got 2x speedup with vpmaskmovd. The maskmov instruction is slower than one vector load or store, but much faster than 8 scalar memory operations and 8 branches. Usage of masked instructions on AVX-512 will give much more. There is no latency on target in comparison to the regular vector memop. - Elena From: Das, Dibyendu [mailto:Dibyendu.Das at amd.com] Sent: Friday, October 24, 2014 16:20 To: Demikhovsky, Elena; 'llvmdev at cs.uiuc.edu' Cc: 'dag at cray.com' Subject: Re: [LLVMdev] Adding masked vector load and store intrinsics This looks to be a reasonable proposal. However native instructions that support such masked ld/st may have a high latency ? Also, it would be good to state some workloads where this will have a positive impact. -dibyendu From: Demikhovsky, Elena [mailto:elena.demikhovsky at intel.com] Sent: Friday, October 24, 2014 06:24 AM Central Standard Time To: llvmdev at cs.uiuc.edu<mailto:llvmdev at cs.uiuc.edu> <llvmdev at cs.uiuc.edu<mailto:llvmdev at cs.uiuc.edu>> Cc: dag at cray.com<mailto:dag at cray.com> <dag at cray.com<mailto:dag at cray.com>> Subject: [LLVMdev] Adding masked vector load and store intrinsics Hi, We would like to add support for masked vector loads and stores by introducing new target-independent intrinsics. The loop vectorizer will then be enhanced to optimize loops containing conditional memory accesses by generating these intrinsics for existing targets such as AVX2 and AVX-512. The vectorizer will first ask the target about availability of masked vector loads and stores. The SLP vectorizer can potentially be enhanced to use these intrinsics as well. The intrinsics would be legal for all targets; targets that do not support masked vector loads or stores will scalarize them. The addressed memory will not be touched for masked-off lanes. In particular, if all lanes are masked off no address will be accessed. call void @llvm.masked.store (i32* %addr, <16 x i32> %data, i32 4, <16 x i1> %mask) %data = call <8 x i32> @llvm.masked.load (i32* %addr, <8 x i32> %passthru, i32 4, <8 x i1> %mask) where %passthru is used to fill the elements of %data that are masked-off (if any; can be zeroinitializer or undef). Comments so far, before we dive into more details? Thank you. - Elena and Ayal --------------------------------------------------------------------- Intel Israel (74) Limited This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies. --------------------------------------------------------------------- Intel Israel (74) Limited This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies. -------------- next part -------------- An HTML attachment was scrubbed... URL: <lists.llvm.org/pipermail/llvm-dev/attachments/20141024/41aaba88/attachment.html>
> Why can't we represent the loads as select(mask, load(addr), passthru)?This suggests masked-off lanes are free to speculatively load from memory. Whereas proposed semantics is that:> The addressed memory will not be touched for masked-off lanes. In > particular, if all lanes are masked off no address will be accessed.Ayal. -----Original Message----- From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] On Behalf Of Hal Finkel Sent: Friday, October 24, 2014 15:50 To: Demikhovsky, Elena Cc: dag at cray.com; llvmdev at cs.uiuc.edu Subject: Re: [LLVMdev] Adding masked vector load and store intrinsics ----- Original Message -----> From: "Elena Demikhovsky" <elena.demikhovsky at intel.com> > To: llvmdev at cs.uiuc.edu > Cc: dag at cray.com > Sent: Friday, October 24, 2014 6:24:15 AM > Subject: [LLVMdev] Adding masked vector load and store intrinsics > > > > Hi, > > We would like to add support for masked vector loads and stores by > introducing new target-independent intrinsics. The loop vectorizer > will then be enhanced to optimize loops containing conditional memory > accesses by generating these intrinsics for existing targets such as > AVX2 and AVX-512. The vectorizer will first ask the target about > availability of masked vector loads and stores. The SLP vectorizer can > potentially be enhanced to use these intrinsics as well. > > The intrinsics would be legal for all targets; targets that do not > support masked vector loads or stores will scalarize them. > The addressed memory will not be touched for masked-off lanes. In > particular, if all lanes are masked off no address will be accessed. > > call void @llvm.masked.store (i32* %addr, <16 x i32> %data, i32 4, > <16 x i1> %mask) > > %data = call <8 x i32> @llvm.masked.load (i32* %addr, <8 x i32> > %passthru, i32 4, <8 x i1> %mask) > > where %passthru is used to fill the elements of %data that are > masked-off (if any; can be zeroinitializer or undef). > > Comments so far, before we dive into more details?For the stores, I think this is a reasonable idea. The alternative is to represent them in scalar form with a lot of control flow, and I think that expecting the backend to properly pattern match that after isel is not realistic. For the loads, I'm must less sure. Why can't we represent the loads as select(mask, load(addr), passthru)? It is true, that the load might get separated from the select so that isel might not see it (because isel if basic-block local), but we can add some code in CodeGenPrep to fix that for targets on which it is useful to do so (which is a more-general solution than the intrinsic anyhow). What do you think? Thanks again, Hal> > Thank you. > > - Elena and Ayal > > > > --------------------------------------------------------------------- > Intel Israel (74) Limited > > This e-mail and any attachments may contain confidential material for > the sole use of the intended recipient(s). Any review or distribution > by others is strictly prohibited. If you are not the intended > recipient, please contact the sender and delete all copies. > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu llvm.cs.uiuc.edu > lists.cs.uiuc.edu/mailman/listinfo/llvmdev >-- Hal Finkel Assistant Computational Scientist Leadership Computing Facility Argonne National Laboratory _______________________________________________ LLVM Developers mailing list LLVMdev at cs.uiuc.edu llvm.cs.uiuc.edu lists.cs.uiuc.edu/mailman/listinfo/llvmdev --------------------------------------------------------------------- Intel Israel (74) Limited This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies.
dag at cray.com
2014-Oct-24 16:48 UTC
[LLVMdev] Adding masked vector load and store intrinsics
Hal Finkel <hfinkel at anl.gov> writes:> For the loads, I'm must less sure. Why can't we represent the loads as > select(mask, load(addr), passthru)?Because that does not specify the correct semantics. This formulation expects the load to happen before the mask is applied. The load could trap. The operation needs to be presented as an atomic unit. The same problem exists with any potentially trapping instruction (e.g. all floating point computations). The need for intrinsics goes way beyond loads and stores. -David
dag at cray.com
2014-Oct-24 17:20 UTC
[LLVMdev] Adding masked vector load and store intrinsics
"Demikhovsky, Elena" <elena.demikhovsky at intel.com> writes:> %data = call <8 x i32> @llvm.masked.load (i32* %addr, <8 x i32> > %passthru, i32 4, <8 x i1> %mask) > where %passthru is used to fill the elements of %data that are > masked-off (if any; can be zeroinitializer or undef).So %passthrough can *only* be undef or zeroinitializer? If that's the case it might make more sense to have two intrinsics, one that fills with undef and one that fills with zero. Using a general vector operand with a restriction on valid values seems odd and potentially misleading. Another option is to always fill with undef and require a select on top of the load to fill with zero. The load + select would be easily matchable to a target instruction. I'm trying to think beyond just AVX-512 to what other future architectures might want. It's not a given that future architectures will fill with zero *or* undef though those are the two most likely fill values. -David
dag at cray.com
2014-Oct-24 17:22 UTC
[LLVMdev] Adding masked vector load and store intrinsics
"Das, Dibyendu" <Dibyendu.Das at amd.com> writes:> This looks to be a reasonable proposal. However native instructions > that support such masked ld/st may have a high latency ? Also, it > would be good to state some workloads where this will have a positive > impact.Any significant vector workload will see a giant gain from this. The masked operations really shouldn't have any more latency. The time of the memory operation itself dominates. -David
On Oct 24, 2014, at 4:24 AM, Demikhovsky, Elena <elena.demikhovsky at intel.com> wrote:> Hi, > > We would like to add support for masked vector loads and stores by introducing new target-independent intrinsics. The loop vectorizer will then be enhanced to optimize loops containing conditional memory accesses by generating these intrinsics for existing targets such as AVX2 and AVX-512. The vectorizer will first ask the target about availability of masked vector loads and stores. The SLP vectorizer can potentially be enhanced to use these intrinsics as well. > > The intrinsics would be legal for all targets; targets that do not support masked vector loads or stores will scalarize them.I do agree that we would like to have one IR node to capture these so that they survive until ISel and that their specific semantics can be expressed. However, can you discuss the other options (new IR instructions, target-specific intrinsics) and why you went with target-independent intrinsics. My intuition would have been to go with target-specific intrinsics until we have something solid implemented and then potentially turn this into native IR instructions as the next step (for other targets, etc.). I am particularly worried whether we really want to generate these for targets that don’t have vector predication support. There is also the related question of vector predicating any other instruction beyond just loads and stores which AVX512 supports. This is probably a smaller gain but should probably be part of the plan as well. Adam> The addressed memory will not be touched for masked-off lanes. In particular, if all lanes are masked off no address will be accessed. > > call void @llvm.masked.store (i32* %addr, <16 x i32> %data, i32 4, <16 x i1> %mask) > > %data = call <8 x i32> @llvm.masked.load (i32* %addr, <8 x i32> %passthru, i32 4, <8 x i1> %mask) > > where %passthru is used to fill the elements of %data that are masked-off (if any; can be zeroinitializer or undef). > > Comments so far, before we dive into more details? > > Thank you. > > - Elena and Ayal > > > --------------------------------------------------------------------- > Intel Israel (74) Limited > > This e-mail and any attachments may contain confidential material for > the sole use of the intended recipient(s). Any review or distribution > by others is strictly prohibited. If you are not the intended > recipient, please contact the sender and delete all copies. >-------------- next part -------------- An HTML attachment was scrubbed... URL: <lists.llvm.org/pipermail/llvm-dev/attachments/20141024/00588841/attachment.html>
Smith, Kevin B
2014-Oct-24 17:58 UTC
[LLVMdev] Adding masked vector load and store intrinsics
> So %passthrough can *only* be undef or zeroinitializer?No, that wasn't the intent. %passthrough can be any other definition that is needed. Zero and undef were simply two possible values that illustrated some interesting behavior. Mapping of the %passthrough to the actual semantics of many vector instruction sets where the masked instructions leave the masked-off elements of the destination unchanged is done in a similar manner as three-address instructions are turned into two address instructions, by placing a copy as necessary so that dest and passthrough are in the same register. Kevin B. Smith -----Original Message----- From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] On Behalf Of dag at cray.com Sent: Friday, October 24, 2014 10:21 AM To: Demikhovsky, Elena Cc: llvmdev at cs.uiuc.edu Subject: Re: [LLVMdev] Adding masked vector load and store intrinsics "Demikhovsky, Elena" <elena.demikhovsky at intel.com> writes:> %data = call <8 x i32> @llvm.masked.load (i32* %addr, <8 x i32> > %passthru, i32 4, <8 x i1> %mask) > where %passthru is used to fill the elements of %data that are > masked-off (if any; can be zeroinitializer or undef).So %passthrough can *only* be undef or zeroinitializer? If that's the case it might make more sense to have two intrinsics, one that fills with undef and one that fills with zero. Using a general vector operand with a restriction on valid values seems odd and potentially misleading. Another option is to always fill with undef and require a select on top of the load to fill with zero. The load + select would be easily matchable to a target instruction. I'm trying to think beyond just AVX-512 to what other future architectures might want. It's not a given that future architectures will fill with zero *or* undef though those are the two most likely fill values. -David _______________________________________________ LLVM Developers mailing list LLVMdev at cs.uiuc.edu llvm.cs.uiuc.edu lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Nadav Rotem
2014-Oct-24 18:38 UTC
[LLVMdev] Adding masked vector load and store intrinsics
> On Oct 24, 2014, at 10:57 AM, Adam Nemet <anemet at apple.com> wrote: > > On Oct 24, 2014, at 4:24 AM, Demikhovsky, Elena <elena.demikhovsky at intel.com <mailto:elena.demikhovsky at intel.com>> wrote: > >> Hi, >> >> We would like to add support for masked vector loads and stores by introducing new target-independent intrinsics. The loop vectorizer will then be enhanced to optimize loops containing conditional memory accesses by generating these intrinsics for existing targets such as AVX2 and AVX-512. The vectorizer will first ask the target about availability of masked vector loads and stores. The SLP vectorizer can potentially be enhanced to use these intrinsics as well. >>I am happy to hear that you are working on this because it means that in the future we would be able to teach the SLP Vectorizer to vectorize types of <3 x float>.>> The intrinsics would be legal for all targets; targets that do not support masked vector loads or stores will scalarize them. >+1. I think that this is an important requirement.> I do agree that we would like to have one IR node to capture these so that they survive until ISel and that their specific semantics can be expressed. However, can you discuss the other options (new IR instructions, target-specific intrinsics) and why you went with target-independent intrinsics. >I agree with the approach of adding target-independent masked memory intrinsics. One reason is that I would like to keep the vectorizers target independent (and use the target transform info to query the backends). I oppose adding new first-level instructions because we would need to teach all of the existing optimizations about the new instructions, and considering the limited usefulness of masked operations it is not worth the effort.> My intuition would have been to go with target-specific intrinsics until we have something solid implemented and then potentially turn this into native IR instructions as the next step (for other targets, etc.). I am particularly worried whether we really want to generate these for targets that don’t have vector predication support.Probably not, but this is a cost-benefit decision that the vectorizers would need to make.> > There is also the related question of vector predicating any other instruction beyond just loads and stores which AVX512 supports. This is probably a smaller gain but should probably be part of the plan as well. > > Adam > >> The addressed memory will not be touched for masked-off lanes. In particular, if all lanes are masked off no address will be accessed. >> >> call void @llvm.masked.store (i32* %addr, <16 x i32> %data, i32 4, <16 x i1> %mask) >> >> %data = call <8 x i32> @llvm.masked.load (i32* %addr, <8 x i32> %passthru, i32 4, <8 x i1> %mask) >> >> where %passthru is used to fill the elements of %data that are masked-off (if any; can be zeroinitializer or undef). >> >> Comments so far, before we dive into more details? >> >> Thank you. >> >> - Elena and Ayal >> >> >> --------------------------------------------------------------------- >> Intel Israel (74) Limited >> >> This e-mail and any attachments may contain confidential material for >> the sole use of the intended recipient(s). Any review or distribution >> by others is strictly prohibited. If you are not the intended >> recipient, please contact the sender and delete all copies. >> >-------------- next part -------------- An HTML attachment was scrubbed... URL: <lists.llvm.org/pipermail/llvm-dev/attachments/20141024/8f9fe89d/attachment.html>
Tian, Xinmin
2014-Oct-24 18:48 UTC
[LLVMdev] Adding masked vector load and store intrinsics
Adam, yes, there are more stuff we need to consider, e.g. masked gather / scatter, masked arithmetic ops, ...etc. This proposal serves the first step which is an important, as a direction check w/ community. Xinmin Tian From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] On Behalf Of Adam Nemet Sent: Friday, October 24, 2014 10:58 AM To: Demikhovsky, Elena Cc: dag at cray.com; llvmdev at cs.uiuc.edu Subject: Re: [LLVMdev] Adding masked vector load and store intrinsics On Oct 24, 2014, at 4:24 AM, Demikhovsky, Elena <elena.demikhovsky at intel.com<mailto:elena.demikhovsky at intel.com>> wrote: Hi, We would like to add support for masked vector loads and stores by introducing new target-independent intrinsics. The loop vectorizer will then be enhanced to optimize loops containing conditional memory accesses by generating these intrinsics for existing targets such as AVX2 and AVX-512. The vectorizer will first ask the target about availability of masked vector loads and stores. The SLP vectorizer can potentially be enhanced to use these intrinsics as well. The intrinsics would be legal for all targets; targets that do not support masked vector loads or stores will scalarize them. I do agree that we would like to have one IR node to capture these so that they survive until ISel and that their specific semantics can be expressed. However, can you discuss the other options (new IR instructions, target-specific intrinsics) and why you went with target-independent intrinsics. My intuition would have been to go with target-specific intrinsics until we have something solid implemented and then potentially turn this into native IR instructions as the next step (for other targets, etc.). I am particularly worried whether we really want to generate these for targets that don't have vector predication support. There is also the related question of vector predicating any other instruction beyond just loads and stores which AVX512 supports. This is probably a smaller gain but should probably be part of the plan as well. Adam The addressed memory will not be touched for masked-off lanes. In particular, if all lanes are masked off no address will be accessed. call void @llvm.masked.store (i32* %addr, <16 x i32> %data, i32 4, <16 x i1> %mask) %data = call <8 x i32> @llvm.masked.load (i32* %addr, <8 x i32> %passthru, i32 4, <8 x i1> %mask) where %passthru is used to fill the elements of %data that are masked-off (if any; can be zeroinitializer or undef). Comments so far, before we dive into more details? Thank you. - Elena and Ayal --------------------------------------------------------------------- Intel Israel (74) Limited This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies. -------------- next part -------------- An HTML attachment was scrubbed... URL: <lists.llvm.org/pipermail/llvm-dev/attachments/20141024/d117864f/attachment.html>
dag at cray.com
2014-Oct-24 19:59 UTC
[LLVMdev] Adding masked vector load and store intrinsics
Adam Nemet <anemet at apple.com> writes:> I am particularly worried whether we really want to generate these for > targets that don’t have vector predication support.We almost certainly don't want to do that. Clang or whatever is generating LLVM IR will need to be aware of target vector capabilities. Still, legalization needs to be available to handle this situation if it arises.> There is also the related question of vector predicating any other > instruction beyond just loads and stores which AVX512 supports. This > is probably a smaller gain but should probably be part of the plan as > well.It's not a small gain, it is a *critical* thing to do. We have customers that always run with traps enabled and without masking, it severely limits what code can be vectorized. -David
Elena, As far as I can tell, consensus is strongly in favor. Please submit a patch :-) Thanks again, Hal ----- Original Message -----> From: "Elena Demikhovsky" <elena.demikhovsky at intel.com> > To: llvmdev at cs.uiuc.edu > Cc: dag at cray.com > Sent: Friday, October 24, 2014 6:24:15 AM > Subject: [LLVMdev] Adding masked vector load and store intrinsics > > > > Hi, > > We would like to add support for masked vector loads and stores by > introducing new target-independent intrinsics. The loop vectorizer > will then be enhanced to optimize loops containing conditional > memory accesses by generating these intrinsics for existing targets > such as AVX2 and AVX-512. The vectorizer will first ask the target > about availability of masked vector loads and stores. The SLP > vectorizer can potentially be enhanced to use these intrinsics as > well. > > The intrinsics would be legal for all targets; targets that do not > support masked vector loads or stores will scalarize them. > The addressed memory will not be touched for masked-off lanes. In > particular, if all lanes are masked off no address will be accessed. > > call void @llvm.masked.store (i32* %addr, <16 x i32> %data, i32 4, > <16 x i1> %mask) > > %data = call <8 x i32> @llvm.masked.load (i32* %addr, <8 x i32> > %passthru, i32 4, <8 x i1> %mask) > > where %passthru is used to fill the elements of %data that are > masked-off (if any; can be zeroinitializer or undef). > > Comments so far, before we dive into more details? > > Thank you. > > - Elena and Ayal > > > > --------------------------------------------------------------------- > Intel Israel (74) Limited > > This e-mail and any attachments may contain confidential material for > the sole use of the intended recipient(s). Any review or distribution > by others is strictly prohibited. If you are not the intended > recipient, please contact the sender and delete all copies. > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu llvm.cs.uiuc.edu > lists.cs.uiuc.edu/mailman/listinfo/llvmdev >-- Hal Finkel Assistant Computational Scientist Leadership Computing Facility Argonne National Laboratory
Demikhovsky, Elena
2014-Oct-25 11:22 UTC
[LLVMdev] Adding masked vector load and store intrinsics
> So %passthrough can *only* be undef or zeroinitializer?No, it can be any value including undef and zeroinitializer. We considered, while designing, zero and merge semantics and decided that merge semantics is better because it covers zero semantics if you use zeroinitializer in the %paththru. - Elena -----Original Message----- From: dag at cray.com [mailto:dag at cray.com] Sent: Friday, October 24, 2014 20:21 To: Demikhovsky, Elena Cc: llvmdev at cs.uiuc.edu; Zaks, Ayal; Nadav Rotem <nrotem at apple.com> (nrotem at apple.com); Chandler Carruth (chandlerc at google.com); Adam Nemet (anemet at apple.com) Subject: Re: Adding masked vector load and store intrinsics "Demikhovsky, Elena" <elena.demikhovsky at intel.com> writes:> %data = call <8 x i32> @llvm.masked.load (i32* %addr, <8 x i32> > %passthru, i32 4, <8 x i1> %mask) where %passthru is used to fill the > elements of %data that are masked-off (if any; can be zeroinitializer > or undef).So %passthrough can *only* be undef or zeroinitializer? If that's the case it might make more sense to have two intrinsics, one that fills with undef and one that fills with zero. Using a general vector operand with a restriction on valid values seems odd and potentially misleading. Another option is to always fill with undef and require a select on top of the load to fill with zero. The load + select would be easily matchable to a target instruction. I'm trying to think beyond just AVX-512 to what other future architectures might want. It's not a given that future architectures will fill with zero *or* undef though those are the two most likely fill values. -David --------------------------------------------------------------------- Intel Israel (74) Limited This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies.
Demikhovsky, Elena
2014-Oct-25 11:40 UTC
[LLVMdev] Adding masked vector load and store intrinsics
Thank you Hal, meanwhile, I implemented something quick to be sure that it works and estimate what pieces of LLVM code should be touched. I'll prepare a patch soon. - Elena -----Original Message----- From: Hal Finkel [mailto:hfinkel at anl.gov] Sent: Saturday, October 25, 2014 00:02 To: Demikhovsky, Elena Cc: dag at cray.com; llvmdev at cs.uiuc.edu Subject: Re: [LLVMdev] Adding masked vector load and store intrinsics Elena, As far as I can tell, consensus is strongly in favor. Please submit a patch :-) Thanks again, Hal ----- Original Message -----> From: "Elena Demikhovsky" <elena.demikhovsky at intel.com> > To: llvmdev at cs.uiuc.edu > Cc: dag at cray.com > Sent: Friday, October 24, 2014 6:24:15 AM > Subject: [LLVMdev] Adding masked vector load and store intrinsics > > > > Hi, > > We would like to add support for masked vector loads and stores by > introducing new target-independent intrinsics. The loop vectorizer > will then be enhanced to optimize loops containing conditional memory > accesses by generating these intrinsics for existing targets such as > AVX2 and AVX-512. The vectorizer will first ask the target about > availability of masked vector loads and stores. The SLP vectorizer can > potentially be enhanced to use these intrinsics as well. > > The intrinsics would be legal for all targets; targets that do not > support masked vector loads or stores will scalarize them. > The addressed memory will not be touched for masked-off lanes. In > particular, if all lanes are masked off no address will be accessed. > > call void @llvm.masked.store (i32* %addr, <16 x i32> %data, i32 4, > <16 x i1> %mask) > > %data = call <8 x i32> @llvm.masked.load (i32* %addr, <8 x i32> > %passthru, i32 4, <8 x i1> %mask) > > where %passthru is used to fill the elements of %data that are > masked-off (if any; can be zeroinitializer or undef). > > Comments so far, before we dive into more details? > > Thank you. > > - Elena and Ayal > > > > --------------------------------------------------------------------- > Intel Israel (74) Limited > > This e-mail and any attachments may contain confidential material for > the sole use of the intended recipient(s). Any review or distribution > by others is strictly prohibited. If you are not the intended > recipient, please contact the sender and delete all copies. > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu llvm.cs.uiuc.edu > lists.cs.uiuc.edu/mailman/listinfo/llvmdev >-- Hal Finkel Assistant Computational Scientist Leadership Computing Facility Argonne National Laboratory --------------------------------------------------------------------- Intel Israel (74) Limited This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies.
shahid shahid
2014-Oct-25 14:53 UTC
[LLVMdev] Adding masked vector load and store intrinsics
Hi Elena, Nice to see that your thinking are quite similar with mine. Do you plan to generate this intrinsic in Loop Vectorizer based on subtarget feature? If so, it would be better to let it generate here in target independent manner.Later on,during lowering, based on the availability of target support for masked ops you can decideeither to scalarize or generate the target masked ops instruction. Shahid On Friday, October 24, 2014 4:59 PM, "Demikhovsky, Elena" <elena.demikhovsky at intel.com> wrote: <!--#yiv1368802508 .yiv1368802508EmailQuote {margin-left:1pt;padding-left:4pt;border-left:#800000 2px solid;}-->Hi, We would like to add support for masked vector loads and stores by introducing new target-independent intrinsics. The loop vectorizer will then be enhanced to optimize loops containing conditional memory accesses by generating theseintrinsics for existing targets such as AVX2 and AVX-512. The vectorizer will first ask the target about availability of masked vector loads and stores. The SLP vectorizer can potentially be enhanced to use these intrinsics as well. The intrinsics would be legal for all targets; targets that do not support masked vector loads or stores will scalarize them.The addressed memory will not be touched for masked-off lanes. In particular, if all lanes are masked off no address will be accessed. call void @llvm.masked.store (i32* %addr, <16 x i32> %data, i32 4, <16 x i1> %mask) %data = call <8 x i32> @llvm.masked.load (i32* %addr, <8 x i32> %passthru, i32 4, <8 x i1> %mask) where %passthru is used to fill the elements of %data that are masked-off (if any; can be zeroinitializer or undef). Comments so far, before we dive into more details? Thank you. - Elena and Ayal --------------------------------------------------------------------- Intel Israel (74) LimitedThis e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies. _______________________________________________ LLVM Developers mailing list LLVMdev at cs.uiuc.edu llvm.cs.uiuc.edu lists.cs.uiuc.edu/mailman/listinfo/llvmdev -------------- next part -------------- An HTML attachment was scrubbed... URL: <lists.llvm.org/pipermail/llvm-dev/attachments/20141025/9d54a5f3/attachment.html>
Demikhovsky, Elena
2014-Oct-26 07:07 UTC
[LLVMdev] Adding masked vector load and store intrinsics
We may receive less optimal code on other targets as a result. User may want optimize a sequence of scalar instructions after vectorization did not pass. - Elena From: shahid shahid [mailto:shahid77c at yahoo.com] Sent: Saturday, October 25, 2014 17:53 To: Demikhovsky, Elena; llvmdev at cs.uiuc.edu Cc: dag at cray.com Subject: Re: [LLVMdev] Adding masked vector load and store intrinsics Hi Elena, Nice to see that your thinking are quite similar with mine. Do you plan to generate this intrinsic in Loop Vectorizer based on subtarget feature? If so, it would be better to let it generate here in target independent manner.Later on, during lowering, based on the availability of target support for masked ops you can decide either to scalarize or generate the target masked ops instruction. Shahid On Friday, October 24, 2014 4:59 PM, "Demikhovsky, Elena" <elena.demikhovsky at intel.com<mailto:elena.demikhovsky at intel.com>> wrote: Hi, We would like to add support for masked vector loads and stores by introducing new target-independent intrinsics. The loop vectorizer will then be enhanced to optimize loops containing conditional memory accesses by generating these intrinsics for existing targets such as AVX2 and AVX-512. The vectorizer will first ask the target about availability of masked vector loads and stores. The SLP vectorizer can potentially be enhanced to use these intrinsics as well. The intrinsics would be legal for all targets; targets that do not support masked vector loads or stores will scalarize them. The addressed memory will not be touched for masked-off lanes. In particular, if all lanes are masked off no address will be accessed. call void @llvm.masked.store (i32* %addr, <16 x i32> %data, i32 4, <16 x i1> %mask) %data = call <8 x i32> @llvm.masked.load (i32* %addr, <8 x i32> %passthru, i32 4, <8 x i1> %mask) where %passthru is used to fill the elements of %data that are masked-off (if any; can be zeroinitializer or undef). Comments so far, before we dive into more details? Thank you. - Elena and Ayal --------------------------------------------------------------------- Intel Israel (74) Limited This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies. _______________________________________________ LLVM Developers mailing list LLVMdev at cs.uiuc.edu<mailto:LLVMdev at cs.uiuc.edu> llvm.cs.uiuc.edu<llvm.cs.uiuc.edu> lists.cs.uiuc.edu/mailman/listinfo/llvmdev --------------------------------------------------------------------- Intel Israel (74) Limited This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies. -------------- next part -------------- An HTML attachment was scrubbed... URL: <lists.llvm.org/pipermail/llvm-dev/attachments/20141026/0a41fcc5/attachment.html>
Owen Anderson
2014-Oct-26 21:56 UTC
[LLVMdev] Adding masked vector load and store intrinsics
What is the motivation for using intrinsics versus adding new instructions? —Owen> On Oct 24, 2014, at 4:24 AM, Demikhovsky, Elena <elena.demikhovsky at intel.com> wrote: > > Hi, > > We would like to add support for masked vector loads and stores by introducing new target-independent intrinsics. The loop vectorizer will then be enhanced to optimize loops containing conditional memory accesses by generating these intrinsics for existing targets such as AVX2 and AVX-512. The vectorizer will first ask the target about availability of masked vector loads and stores. The SLP vectorizer can potentially be enhanced to use these intrinsics as well. > > The intrinsics would be legal for all targets; targets that do not support masked vector loads or stores will scalarize them. > The addressed memory will not be touched for masked-off lanes. In particular, if all lanes are masked off no address will be accessed. > > call void @llvm.masked.store (i32* %addr, <16 x i32> %data, i32 4, <16 x i1> %mask) > > %data = call <8 x i32> @llvm.masked.load (i32* %addr, <8 x i32> %passthru, i32 4, <8 x i1> %mask) > > where %passthru is used to fill the elements of %data that are masked-off (if any; can be zeroinitializer or undef). > > Comments so far, before we dive into more details? > > Thank you. > > - Elena and Ayal > > > --------------------------------------------------------------------- > Intel Israel (74) Limited > > This e-mail and any attachments may contain confidential material for > the sole use of the intended recipient(s). Any review or distribution > by others is strictly prohibited. If you are not the intended > recipient, please contact the sender and delete all copies. > > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu <mailto:LLVMdev at cs.uiuc.edu> llvm.cs.uiuc.edu <llvm.cs.uiuc.edu> > lists.cs.uiuc.edu/mailman/listinfo/llvmdev <lists.cs.uiuc.edu/mailman/listinfo/llvmdev>-------------- next part -------------- An HTML attachment was scrubbed... URL: <lists.llvm.org/pipermail/llvm-dev/attachments/20141026/3c561069/attachment.html>
Demikhovsky, Elena
2014-Oct-27 07:02 UTC
[LLVMdev] Adding masked vector load and store intrinsics
we just follow a common recommendation to start with intrinsics: llvm.org/docs/ExtendingLLVM.html - Elena From: Owen Anderson [mailto:resistor at mac.com] Sent: Sunday, October 26, 2014 23:57 To: Demikhovsky, Elena Cc: llvmdev at cs.uiuc.edu; dag at cray.com Subject: Re: [LLVMdev] Adding masked vector load and store intrinsics What is the motivation for using intrinsics versus adding new instructions? —Owen On Oct 24, 2014, at 4:24 AM, Demikhovsky, Elena <elena.demikhovsky at intel.com<mailto:elena.demikhovsky at intel.com>> wrote: Hi, We would like to add support for masked vector loads and stores by introducing new target-independent intrinsics. The loop vectorizer will then be enhanced to optimize loops containing conditional memory accesses by generating these intrinsics for existing targets such as AVX2 and AVX-512. The vectorizer will first ask the target about availability of masked vector loads and stores. The SLP vectorizer can potentially be enhanced to use these intrinsics as well. The intrinsics would be legal for all targets; targets that do not support masked vector loads or stores will scalarize them. The addressed memory will not be touched for masked-off lanes. In particular, if all lanes are masked off no address will be accessed. call void @llvm.masked.store (i32* %addr, <16 x i32> %data, i32 4, <16 x i1> %mask) %data = call <8 x i32> @llvm.masked.load (i32* %addr, <8 x i32> %passthru, i32 4, <8 x i1> %mask) where %passthru is used to fill the elements of %data that are masked-off (if any; can be zeroinitializer or undef). Comments so far, before we dive into more details? Thank you. - Elena and Ayal --------------------------------------------------------------------- Intel Israel (74) Limited This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies. _______________________________________________ LLVM Developers mailing list LLVMdev at cs.uiuc.edu<mailto:LLVMdev at cs.uiuc.edu> llvm.cs.uiuc.edu<llvm.cs.uiuc.edu> lists.cs.uiuc.edu/mailman/listinfo/llvmdev --------------------------------------------------------------------- Intel Israel (74) Limited This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies. -------------- next part -------------- An HTML attachment was scrubbed... URL: <lists.llvm.org/pipermail/llvm-dev/attachments/20141027/647272c8/attachment.html>
Apparently Analagous Threads
- [LLVMdev] Adding masked vector load and store intrinsics
- [LLVMdev] Adding masked vector load and store intrinsics
- [LLVMdev] Adding masked vector load and store intrinsics
- RFC: New intrinsics masked.expandload and masked.compressstore
- RFC: New intrinsics masked.expandload and masked.compressstore