Demikhovsky, Elena
2014-Dec-18 14:40 UTC
[LLVMdev] Indexed Load and Store Intrinsics - proposal
Hi,

Recent Intel architectures AVX-512 and AVX2 provide vector gather and/or scatter instructions. Gather/scatter instructions allow read/write access to multiple memory addresses. The addresses are specified using a base address and a vector of indices. We'd like vectorizers to tap this functionality, and propose to do so by introducing new intrinsics:

VectorValue = @llvm.sindex.load (BaseAddr, VectorOfIndices, Scale)
VectorValue = @llvm.uindex.load (BaseAddr, VectorOfIndices, Scale)
VectorValue = @llvm.sindex.masked.load (BaseAddr, VectorOfIndices, Scale, PassThruVal, Mask)
VectorValue = @llvm.uindex.masked.load (BaseAddr, VectorOfIndices, Scale, PassThruVal, Mask)

Semantics:
For i=0,1,...,N-1:
  if (Mask[i]) VectorValue[i] = *(BaseAddr + VectorOfIndices[i]*Scale);
  else VectorValue[i] = PassThruVal[i];

void @llvm.sindex.store (BaseAddr, VectorValue, VectorOfIndices, Scale)
void @llvm.uindex.store (BaseAddr, VectorValue, VectorOfIndices, Scale)
void @llvm.sindex.masked.store (BaseAddr, VectorValue, VectorOfIndices, Scale, Mask)
void @llvm.uindex.masked.store (BaseAddr, VectorValue, VectorOfIndices, Scale, Mask)

Semantics:
For i=0,1,...,N-1:
  if (Mask[i]) *(BaseAddr + VectorOfIndices[i]*Scale) = VectorValue[i];

VectorValue: any float or integer vector type.
BaseAddr: a pointer; may be zero if the full address is placed in the index.
VectorOfIndices: a vector of i32 or i64 signed or unsigned integer values.
Scale: a compile time constant 1, 2, 4 or 8.
VectorValue, VectorOfIndices and Mask must have the same vector width.

An indexed store instruction with complete or partial overlap in memory (i.e., two indices with the same or close values) will produce a result equivalent to serial scalar stores from least to most significant vector elements.

The new intrinsics are common to all targets, like the recently introduced masked load and store.

Examples:

<16 x float> @llvm.sindex.load.v16f32.v16i32 (i8* %ptr, <16 x i32> %index, i32 %scale)
<16 x float> @llvm.sindex.masked.load.v16f32.v16i32 (i8* %ptr, <16 x i32> %index, i32 %scale, <16 x float> %passthru, <16 x i1> %mask)
void @llvm.sindex.masked.store.v16f32.v16i64 (i8* %ptr, <16 x float> %value, <16 x i64> %index, i32 %scale, <16 x i1> %mask)

Comments?

Thank you.

- Elena
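For illustration, here is a sketch of how a vectorizer might use the proposed gather to vectorize the loop "for (i = 0; i < 16; i++) out[i] = base[idx[i]]" over float elements. The intrinsic does not exist yet, so the declaration and name mangling below are illustrative only:

    ; Illustrative only: the proposed signed-index gather, 16-wide.
    ; Scale is 4 because the elements are 4-byte floats.
    declare <16 x float> @llvm.sindex.load.v16f32.v16i32(i8*, <16 x i32>, i32)

    define <16 x float> @gather_example(i8* %base, <16 x i32>* %idxp) {
      %idx = load <16 x i32>* %idxp
      %v = call <16 x float> @llvm.sindex.load.v16f32.v16i32(i8* %base, <16 x i32> %idx, i32 4)
      ret <16 x float> %v
    }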
Hi Elena,

I think that in general this proposal makes sense and is consistent with discussions that we've had in the past. These new intrinsics can be very useful for vectorization. I have a few comments below.

> On Dec 18, 2014, at 6:40 AM, Demikhovsky, Elena <elena.demikhovsky at intel.com> wrote:
>
> VectorValue = @llvm.sindex.load (BaseAddr, VectorOfIndices, Scale)
> VectorValue = @llvm.uindex.load (BaseAddr, VectorOfIndices, Scale)
> VectorValue = @llvm.sindex.masked.load (BaseAddr, VectorOfIndices, Scale, PassThruVal, Mask)
> VectorValue = @llvm.uindex.masked.load (BaseAddr, VectorOfIndices, Scale, PassThruVal, Mask)

It looks like the proposed intrinsic is very specific to the x86 implementation of gather/scatter. Would it be possible to remove the PassThrough value from the intrinsic and define the masked-out value to be undef? You would still be able to pattern match it if you use a masked load + select. Can we remove the masked version of the intrinsic altogether and pattern match it using the non-masked version somehow? Can we infer the scale value based on the loaded element type?

> VectorValue: any float or integer vector type.

We should also support loading and storing pointer values.

> BaseAddr: a pointer; may be zero if the full address is placed in the index.
> VectorOfIndices: a vector of i32 or i64 signed or unsigned integer values.
> Scale: a compile time constant 1, 2, 4 or 8.

Why do we need to limit the scale values?

Thanks,
Nadav
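The gather + select pattern suggested above would look roughly like this (values hypothetical; David's reply below explains why the two forms are not equivalent when masked-off lanes can fault):

    ; Masked-out lanes are undef in the plain gather and are filled in by
    ; a select; a backend could pattern match the pair into a masked gather.
    %v   = call <16 x float> @llvm.sindex.load.v16f32.v16i32(i8* %base, <16 x i32> %idx, i32 4)
    %res = select <16 x i1> %mask, <16 x float> %v, <16 x float> %passthru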
dag at cray.com
2014-Dec-18 19:56 UTC
[LLVMdev] Indexed Load and Store Intrinsics - proposal
"Demikhovsky, Elena" <elena.demikhovsky at intel.com> writes:> Semantics: > For i=0,1,…,N-1: if (Mask[i]) {*(BaseAddr + VectorOfIndices[i]*Scale) > = VectorValue[i];} > VectorValue: any float or integer vector type. > BaseAddr: a pointer; may be zero if full address is placed in the > index. > VectorOfIndices: a vector of i32 or i64 signed or unsigned integer > values.What about the case of a gather/scatter where the BaseAddr is zero and the indices are pointers? Must we do a ptrtoint? llvm.org is down at the moment but I don't think we currently have a vector ptrtoint.> Scale: a compile time constant 1, 2, 4 or 8.This seems a bit too Intel-focused. Why not allow arbitrary scales? Or alternatively, eliminate the Scale and do a vector multiply on VectorOfIndices. It should be simple enough to write matching TableGen patterns. We do it now for the x86 memop stuff.> VectorValue, VectorOfIndices and Mask must have the same vector width.>From your example, you mean they must have the same number of vectorelements, not the same bit width, right? I'm used to "width" meaning a specific bit length and "vector length" meaning "number of elements." With that terminology, I think you mean they must have the same vector length.> An indexed store instruction with complete or partial overlap in > memory (i.e., two indices with same or close values) will provide the > result equivalent to serial scalar stores from least to most > significant vector elements.Yep, they must be ordered. Do we want to provide unordered scatters as well? Some (non-LLVM) targets have them. We don't need to add them right now but it's worth thinking about.> The new intrinsics are common for all targets, like recently > introduced masked load and store. > Examples: > <16 x float> @llvm.sindex.load.v16f32.v16i32 (i8 *%ptr, <16 x i32> > %index, i32 %scale) > <16 x float> @llvm.masked.sindex.load.v16f32.v16i32 (i8 *%ptr, <16 x > i32> %index, <16 x float> %passthru, <16 x i1> %mask) > void @llvm.sindex.store.v16f32.v16i64(i8* %ptr, <16 x float> %value, > <16 x 164> %index, i32 %scale, <16 x i1> %mask) > Comments?I think it's definitely a good idea to introduce them, but let's make them a little more target-neutral if we can. -David
dag at cray.com
2014-Dec-18 20:04 UTC
[LLVMdev] Indexed Load and Store Intrinsics - proposal
Nadav Rotem <nrotem at apple.com> writes:

> It looks like the proposed intrinsic is very specific to the x86
> implementation of gather/scatter. Would it be possible to remove the
> PassThrough value from the intrinsic and define the masked-out value
> to be undef? You would still be able to pattern match it if you use a
> masked load + select.

This makes sense to me.

> Can we remove the masked version of the intrinsic altogether and
> pattern match it using the non-masked version somehow?

I don't think that's possible. The masking needs to be atomic with the memory operation. A non-masked memory operation could fault. This:

    vec = masked_gather(base, indices, mask)

is not semantically equivalent to

    vec1 = unmasked_gather(base, indices)
    vec = select(mask, vec1, othervec)

The gather could fault. It is not sufficient to pattern-match this, because some pass before isel could look at this code, assume the gather doesn't fault, and rearrange the code in illegal ways.

> Can we infer the scale value based on the loaded element type?

No, I don't think we can do that. Consider the case where Base is zero and VectorOfIndices contains pointers. [As an aside, LLVM does indeed have vector ptrtoint, so we could always use that, though another intrinsic allowing vectors of pointers might be cleaner.] I think requiring a multiply of the VectorOfIndices before the gather/scatter is the most flexible course.

>> VectorValue: any float or integer vector type.
>
> We should also support loading and storing pointer values.

Yes, though ptrtoint could also be used here.

-David
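Since vector ptrtoint does exist, the zero-base form David mentions is already expressible; a minimal sketch, assuming the proposed intrinsic names:

    ; A <4 x float*> becomes a <4 x i64> of raw addresses, usable as
    ; VectorOfIndices with BaseAddr = null and Scale = 1.
    %addrs = ptrtoint <4 x float*> %ptrs to <4 x i64>
    %v = call <4 x float> @llvm.sindex.load.v4f32.v4i64(i8* null, <4 x i64> %addrs, i32 1)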
Demikhovsky, Elena
2014-Dec-18 21:38 UTC
[LLVMdev] Indexed Load and Store Intrinsics - proposal
Hi Nadav,

> It looks like the proposed intrinsic is very specific to the x86
> implementation of gather/scatter. Would it be possible to remove the
> PassThrough value from the intrinsic and define the masked-out value
> to be undef? You would still be able to pattern match it if you use a
> masked load + select.

[Demikhovsky, Elena] We have a PassThrough value in the masked load intrinsic. We want to be consistent across all of these intrinsics.

> Can we remove the masked version of the intrinsic altogether and
> pattern match it using the non-masked version somehow?

[Demikhovsky, Elena] On the contrary, we could remove the non-masked version and pattern match it using the masked one. Using non-masked + select is not safe for gather and is meaningless for scatter. That's why we added the masked load/store intrinsics.

> Can we infer the scale value based on the loaded element type?

[Demikhovsky, Elena] In that case we would need two different intrinsics: one with a non-zero base and a vector of indices (each index relative to the base), with an implicit scale based, as you say, on the element type; and a second one without base and without scale, just a vector of pointers.

- Elena
Demikhovsky, Elena
2014-Dec-18 21:52 UTC
[LLVMdev] Indexed Load and Store Intrinsics - proposal
Hi David,

> What about the case of a gather/scatter where the BaseAddr is zero and
> the indices are pointers? Must we do a ptrtoint? llvm.org is down at
> the moment, but I don't think we currently have a vector ptrtoint.

[Demikhovsky, Elena] From the site: "The 'ptrtoint' instruction converts the pointer or a vector of pointers value to the integer (or vector of integers) type ty2."

>> Scale: a compile time constant 1, 2, 4 or 8.
>
> This seems a bit too Intel-focused. Why not allow arbitrary scales? Or
> alternatively, eliminate the Scale and do a vector multiply on
> VectorOfIndices. It should be simple enough to write matching TableGen
> patterns. We do it now for the x86 memop stuff.

[Demikhovsky, Elena] As I wrote to Nadav, maybe two intrinsics would be more general. I'm just looking at the usage model. If the index is a pointer, scale = 1 and base = 0. If it is an index into an array, a scale covers all basic types from char to double. Do you think that two intrinsics would be less Intel-focused?
(1) a non-zero base and a vector of indices (each index relative to the base), with an implicit scale based on the element type;
(2) no base and no scale, just a vector of pointers.

- Elena
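A sketch of what the two variants could look like (the names and signatures here are hypothetical, extrapolated from the proposal):

    ; (1) Non-zero base plus relative indices; the scale is implied by the
    ;     pointee type of the base (float, so effectively 4).
    declare <16 x float> @llvm.sindex.masked.load.v16f32.v16i32(float* %base, <16 x i32> %index, <16 x float> %passthru, <16 x i1> %mask)

    ; (2) No base and no scale: each lane carries a full pointer.
    declare <16 x float> @llvm.vector.masked.load.v16f32(<16 x float*> %ptrs, <16 x float> %passthru, <16 x i1> %mask)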
Philip Reames
2014-Dec-21 18:24 UTC
[LLVMdev] Indexed Load and Store Intrinsics - proposal
On 12/18/2014 11:56 AM, dag at cray.com wrote:
> What about the case of a gather/scatter where the BaseAddr is zero and
> the indices are pointers? Must we do a ptrtoint? llvm.org is down at
> the moment, but I don't think we currently have a vector ptrtoint.

I would be opposed to any representation which required the introduction of ptrtoint casts by the vectorizer. If it were the only option available, I could be argued around, but I think we should try to avoid this.

More generally, I'm somewhat hesitant to represent a scatter with an explicit base and offsets at all. Why shouldn't the IR representation simply be a load from a vector of arbitrary pointers? The backend can pattern match the actual gather instructions it supports and scalarize the rest. The proposal being made seems very specific to the current generation of x86 hardware.

p.s. Where is the documentation for the existing masked load intrinsics? I can't find it with a quick search through the LangRef.

Philip
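The vector-of-pointers representation Philip suggests could reuse the vector GEP the IR already has; a minimal sketch, with a hypothetical gather intrinsic name (depending on the IR version, the scalar base may first need to be splatted into a vector):

    ; A vector GEP yields <16 x float*>; the gather is then just a masked
    ; load through arbitrary pointers, with no base/index/scale split.
    %ptrs = getelementptr float* %base, <16 x i32> %index
    %v = call <16 x float> @llvm.masked.gather.v16f32(<16 x float*> %ptrs, <16 x float> %passthru, <16 x i1> %mask)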
Hi Elena,

I think such intrinsics are very useful. Do you have any plan to upstream them?

Thanks,
-Hao

2014-12-18 22:40 GMT+08:00 Demikhovsky, Elena <elena.demikhovsky at intel.com>:
> Hi,
>
> Recent Intel architectures AVX-512 and AVX2 provide vector gather and/or
> scatter instructions.
> [snip]
Demikhovsky, Elena
2015-Mar-15 10:21 UTC
[LLVMdev] Indexed Load and Store Intrinsics - proposal
Hi Hao,

I started to upstream, and the second patch is stalled under review now.

- Elena

-----Original Message-----
From: Hao Liu [mailto:haoliuts at gmail.com]
Sent: Friday, March 13, 2015 05:56
To: Demikhovsky, Elena
Cc: llvmdev at cs.uiuc.edu
Subject: Re: [LLVMdev] Indexed Load and Store Intrinsics - proposal

Hi Elena,

I think such intrinsics are very useful. Do you have any plan to upstream them?

Thanks,
-Hao