Haidl, Michael via llvm-dev
2017-Aug-04 14:03 UTC
[llvm-dev] Status of llvm.experimental.vector.reduce.* intrinsics
I assume smaller types like <4 x i1> are getting zero extended to e.g. i8?

On 04.08.2017 at 15:58, Amara Emerson wrote:
> Actually for mask vectors of i1 values, you don't need to use
> reductions at all (although for SVE this is what we'll do). You can
> instead bitcast the vector value to an i8/i16/whatever and then
> compare against zero.
>
> Amara
>
> On 4 August 2017 at 14:55, Haidl, Michael via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
> > I am currently working on a transformation pass that transforms
> > masked.load and masked.store intrinsics to (hopefully) increase
> > performance on targets where masked.load and masked.store are not
> > legal. To check if the loads and stores are necessary at all I take
> > the mask for the masked operations and want to reduce it to a
> > single value. vector.reduce.or seemed very handy to do the job.
> >
> > I will take a look into the function you suggested. Maybe I can
> > come up with something that drives the development of these
> > intrinsics ahead.
> >
> > Cheers,
> > Michael
> >
> > On 04.08.2017 at 15:25, Amara Emerson wrote:
> > > Can you tell us what you're looking to do with the intrinsics?
> > >
> > > On all non-AArch64 targets the ExpandReductions pass will convert
> > > them to the shuffle pattern as you're seeing. That pass was
> > > written in order to allow experimentation of the effects of using
> > > reduction intrinsics at the IR level only, hence we convert into
> > > the shuffles very late in the pass pipeline.
> > >
> > > Since we haven't seen any adverse effects of representing the
> > > reductions as intrinsics at the IR level, I think in that respect
> > > the intrinsics have probably proven themselves to be stable.
> > > However the error you're seeing is because the AArch64 backend
> > > still expects to deal with only intrinsics it can *natively*
> > > support, and i1 is not a natively supported type for reductions.
> > > See the code in
> > > AArch64TargetTransformInfo.cpp:useReductionIntrinsic() for where
> > > we decide which reduction types we can support.
> > >
> > > For these cases, we need to implement more generic legalization
> > > support in order to either promote to a legal type, or in cases
> > > where the target cannot support it as a native operation at all,
> > > to expand it to a shuffle pattern as a fallback. Once we have all
> > > that in place, I think we're in a strong position to move to the
> > > intrinsic form as the canonical representation.
> > >
> > > FYI one of the motivating reasons for these to be introduced was
> > > to allow non power-of-2 vector architectures like SVE to express
> > > reduction operations.
> > >
> > > Amara
> > >
> > > On 4 August 2017 at 13:36, Haidl, Michael
> > > <michael.haidl at uni-muenster.de> wrote:
> > > > Hi Renato,
> > > >
> > > > just to make it clear, I didn't implement reductions on x86_64,
> > > > they just worked when I tried to lower an
> > > > llvm.experimental.vector.reduce.or.i1.v8i1 intrinsic. A shuffle
> > > > pattern is generated for the intrinsic.
> > > >
> > > >     vpshufd $78, %xmm0, %xmm1   # xmm1 = xmm0[2,3,0,1]
> > > >     vpor    %xmm1, %xmm0, %xmm0
> > > >     vpshufd $229, %xmm0, %xmm1  # xmm1 = xmm0[1,1,2,3]
> > > >     vpor    %xmm1, %xmm0, %xmm0
> > > >     vpsrld  $16, %xmm0, %xmm1
> > > >     vpor    %xmm1, %xmm0, %xmm0
> > > >     vpextrb $0, %xmm0, %eax
> > > >
> > > > However, on AArch64 I encountered an unreachable where codegen
> > > > does not know how to promote the i1 type. Since I am more
> > > > familiar with the midlevel I have to start digging into
> > > > codegen. Any hints where to start would be awesome.
> > > >
> > > > Cheers,
> > > > Michael
> > > >
> > > > On 04.08.2017 at 08:18, Renato Golin wrote:
> > > > > On 3 August 2017 at 19:48, Haidl, Michael via llvm-dev
> > > > > <llvm-dev at lists.llvm.org> wrote:
> > > > > > thank you for the clarification. I tested the intrinsics
> > > > > > on x86_64 and it seemed to work pretty well. Looking
> > > > > > forward to trying these intrinsics with the AArch64
> > > > > > backend. Maybe I find the time to look into codegen to get
> > > > > > these intrinsics out of experimental stage. They seem
> > > > > > pretty useful.
> > > > >
> > > > > In addition to Amara's point, it'd be good to have it working
> > > > > and default for other architectures before we can move out of
> > > > > experimental if we indeed intend to make it non-arch-specific
> > > > > (which we do).
> > > > >
> > > > > So, if you could share your code for the x86 port, that'd be
> > > > > great. But if you could help with the final touches on the
> > > > > code-gen part, that'd be awesome.
> > > > >
> > > > > cheers,
> > > > > --renato
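For readers following the quoted x86_64 output: the shuffle pattern is a log2-step reduction, where each shuffle/shift brings the upper half of the still-live lanes down and the following vpor folds it into the lower half. A rough plain C++ model of that idea is sketched below. It is not LLVM code and not the ExpandReductions implementation; the name reduceOrByHalving and the use of std::array<bool, N> for a <N x i1> mask are assumptions made only for illustration.

```cpp
#include <array>
#include <cstddef>

// Plain C++ model of a log2-step OR reduction over N lanes.
// Hypothetical helper for illustration; assumes N is a power of two.
template <std::size_t N>
bool reduceOrByHalving(std::array<bool, N> lanes) {
  static_assert(N != 0 && (N & (N - 1)) == 0,
                "model assumes a power-of-two lane count");
  // Each iteration corresponds to one shuffle + OR pair: the upper half of
  // the live lanes is ORed into the lower half, halving the live width.
  for (std::size_t width = N / 2; width >= 1; width /= 2) {
    for (std::size_t i = 0; i < width; ++i)
      lanes[i] = lanes[i] || lanes[i + width];
  }
  // After log2(N) steps lane 0 holds the OR of all original lanes.
  return lanes[0];
}
```

For an 8-lane mask this takes three steps (widths 4, 2, 1), matching the three vpor instructions in the quoted assembly. The power-of-two assumption is also why this style of expansion does not map directly onto non power-of-2 vector architectures such as SVE, which was one of the stated motivations for the intrinsics.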
Renato Golin via llvm-dev
2017-Aug-04 14:11 UTC
[llvm-dev] Status of llvm.experimental.vector.reduce.* intrinsics
It may not be related, but there was a talk at EuroLLVM regarding i1 types on x86 vector expansion with some pitfalls. I recommend having a look.

Is the AArch64 error an assert/internal one? If so, we may need better error handling...

On 4 Aug 2017 11:03 a.m., "Haidl, Michael via llvm-dev" <llvm-dev at lists.llvm.org> wrote:
> I assume smaller types like <4 x i1> are getting zero extended to
> e.g. i8?
Amara Emerson via llvm-dev
2017-Aug-04 14:20 UTC
[llvm-dev] Status of llvm.experimental.vector.reduce.* intrinsics
Bitcasting is only valid between types of the same size, so for a <4 x i1> mask you can bitcast to i4 and then directly do an icmp ne i4 %castval, 0 etc.

Amara

On 4 August 2017 at 15:03, Haidl, Michael <michael.haidl at uni-muenster.de> wrote:
> I assume smaller types like <4 x i1> are getting zero extended to
> e.g. i8?
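The same-size rule above can be illustrated with a small plain C++ model. This is only a sketch under stated assumptions: std::array<bool, N> stands in for a <N x i1> mask, std::bitset<N> stands in for the iN integer, and the helper names bitcastMask and anyLaneActive are hypothetical, not LLVM API. In real IR the position of each lane inside the integer depends on target endianness, which does not matter for a comparison against zero.

```cpp
#include <array>
#include <bitset>
#include <cstddef>

// Model of "bitcast <N x i1> %mask to iN": every lane becomes one bit of an
// N-bit integer, so source and destination have the same width by
// construction. Lane i is placed in bit i here (an assumption of the model).
template <std::size_t N>
std::bitset<N> bitcastMask(const std::array<bool, N>& lanes) {
  std::bitset<N> bits;
  for (std::size_t i = 0; i < N; ++i)
    bits[i] = lanes[i];
  return bits;
}

// Model of "icmp ne iN %castval, 0": true when at least one lane is set,
// which is the answer an OR reduction over the mask would give.
template <std::size_t N>
bool anyLaneActive(const std::array<bool, N>& lanes) {
  return bitcastMask(lanes).any();
}
```

In a pass that rewrites masked.load and masked.store, a check of this shape is what decides whether the memory operation can be skipped entirely: if no lane of the mask is active, nothing needs to be loaded or stored.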
Haidl, Michael via llvm-dev
2017-Aug-04 14:28 UTC
[llvm-dev] Status of llvm.experimental.vector.reduce.* intrinsics
Thanks, I already found that out the hard way ;) Now it works and looks nice and shiny.

Michael

On 04.08.2017 at 16:20, Amara Emerson wrote:
> Bitcasting is only valid between types of the same size, so for a
> <4 x i1> mask you can bitcast to i4 and then directly do an
> icmp ne i4 %castval, 0 etc.
>
> Amara