> On Dec 19, 2018, at 11:09 AM, Stephen Canon via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
>> On Dec 18, 2018, at 10:18 PM, Adam Nemet <anemet at apple.com> wrote:
>>
>>> I don’t understand this. What is the benefit of providing layout info to element-wise operations? This defeats the goal of having simple lowering and representation: you are encoding an ND vector form into the IR in a really ugly way, and this will cause a proliferation of intrinsics that are redundant with the core ops.
>>
>> The reason we need that information is so that, for example, we can lower an operation on a 3-element column into a 2-wide vector op plus a scalar op. This should be beneficial for power consumption: in the case of a 3x3 matrix with a single element of padding per column, rather than operating on 12 elements you’d operate on only 9 (vector ops consume more power than their scalar counterparts).
>>
>> That said, we should be able to remove these intrinsics in the long term. Once we have masking on the core ops in the IR, we should be able to express the same semantics without dedicated intrinsics.
>
> There may be some cases where this holds (maybe with 5x5 or something), but most of the time I would expect to get better power from doing a four-element vector op with one wasted lane than from doing two arithmetic ops (plus possibly extracts and inserts, depending on physical layout details).
>
> Explicit masking, or arranging for zero in the padding lanes, seems like a better way forward to me.
> – Steve

I spent some time chatting with Adam about this and have a better understanding of his concerns here. It seems to me that if masking intrinsics are the long-term solution we want, we should do that now (for add and sub) rather than building arbitrary matrix layout info into intrinsics, since a mask carries all the information that we actually need.

– Steve
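To make the trade-off concrete, here is a small IR sketch of the two lowerings being compared, assuming one 3-element matrix column held in lanes 0-2 of a <4 x float> with lane 3 as padding: a single padded 4-wide add versus a 2-wide add plus a scalar add. The function names and layout are illustrative only.

  ; Variant 1: one 4-wide add; the padding lane is computed but ignored.
  define <4 x float> @col_add_padded(<4 x float> %a, <4 x float> %b) {
    %sum = fadd <4 x float> %a, %b
    ret <4 x float> %sum
  }

  ; Variant 2: a 2-wide add on lanes 0-1 plus a scalar add on lane 2,
  ; with the extracts/inserts needed to split and reassemble the column.
  define <4 x float> @col_add_split(<4 x float> %a, <4 x float> %b) {
    %a.lo = shufflevector <4 x float> %a, <4 x float> undef, <2 x i32> <i32 0, i32 1>
    %b.lo = shufflevector <4 x float> %b, <4 x float> undef, <2 x i32> <i32 0, i32 1>
    %lo   = fadd <2 x float> %a.lo, %b.lo
    %a2   = extractelement <4 x float> %a, i32 2
    %b2   = extractelement <4 x float> %b, i32 2
    %e2   = fadd float %a2, %b2
    %wide = shufflevector <2 x float> %lo, <2 x float> undef, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>
    %res  = insertelement <4 x float> %wide, float %e2, i32 2
    ret <4 x float> %res
  }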
> On Dec 19, 2018, at 1:31 PM, Stephen Canon <scanon at apple.com> wrote:
>
> [...]
>
> I spent some time chatting with Adam about this and have a better understanding of his concerns here. It seems to me that if masking intrinsics are the long-term solution we want, we should do that now (for add and sub) rather than building arbitrary matrix layout info into intrinsics, since a mask carries all the information that we actually need.

I think that sounds like a reasonable compromise. We already have masked load/store intrinsics, so adding masked add and sub just follows that precedent. If the decision is later made to move masking onto the core operations, the new intrinsics would simply move as well.

So an add->multiply sequence for option B plus masking intrinsics would look like this:

%a = load <12 x float>, <12 x float>* %A, align 16
%b = load <12 x float>, <12 x float>* %B, align 16
%c = load <8 x float>, <8 x float>* %C, align 16

%add = call <12 x float> @llvm.masked.fadd(<12 x float> %a, <12 x float> %b,
         ; mask: where false, the element is taken from the passthrough operand
         <12 x i1> <i1 true, i1 true, i1 true, i1 false,
                    i1 true, i1 true, i1 true, i1 false,
                    i1 true, i1 true, i1 true, i1 false>,
         ; passthrough:
         <12 x float> <float undef, float undef, float undef, float undef,
                       float undef, float undef, float undef, float undef,
                       float undef, float undef, float undef, float undef>)

%mul = call <8 x float> @llvm.matrix.multiply(<12 x float> %add, <8 x float> %c,
         ; 3 x 3 times 3 x 2, column-major:
         i32 3, i32 3, i32 3, i32 2, i1 true)

store <8 x float> %mul, <8 x float>* %MUL, align 16
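Following the llvm.masked.load/store precedent, declarations for such intrinsics might look roughly like the sketch below. The overloaded names and the position of the passthrough operand are assumptions of this sketch, not a settled signature; only the masked load/store/gather/scatter intrinsics exist today.

  ; Hypothetical masked arithmetic intrinsics, modeled on the existing
  ; llvm.masked.load/store intrinsics: lanes whose mask bit is false take
  ; their value from the final passthrough operand.
  declare <12 x float> @llvm.masked.fadd.v12f32(<12 x float>, <12 x float>,
                                                <12 x i1>, <12 x float>)
  declare <12 x float> @llvm.masked.fsub.v12f32(<12 x float>, <12 x float>,
                                                <12 x i1>, <12 x float>)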
Adam Nemet via llvm-dev <llvm-dev at lists.llvm.org> writes:

>> I spent some time chatting with Adam about this and have a better
>> understanding of his concerns here. It seems to me that if having
>> masking intrinsics is the long-term solution we want, we should do
>> that now (for add and sub) rather than building arbitrary matrix
>> layout info into intrinsics, since a mask has all the information
>> that we actually need.
>
> I think that sounds like a reasonable compromise. We already have
> masked load/store intrinsics so adding add and sub just follows that
> precedent. If the decision is made to move masking to the core
> operations, the new intrinsics would just move as well.

How will existing passes be taught about the new intrinsics? For example, what would have to be done to teach instcombine about them?

Let's suppose every existing operation had an equivalent masked intrinsic. Would it be easier to teach all of the passes about the intrinsics, or to teach them about a mask operand on the existing Instructions? Likewise, would it be easier to teach isel about all of the intrinsics, or about a mask operand?

I honestly don't know the answers to these questions, but I think they are important to consider, especially if intrinsics are seen as a bridge to first-class IR support for masking.

-David
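One small instance of what would have to be re-taught: instcombine/InstSimplify already folds fadd X, -0.0 into X, but the same identity expressed through a masked intrinsic is just an opaque call until a pass is given dedicated matching code. A sketch, using the hypothetical @llvm.masked.fadd from the proposal above on a <4 x float> for brevity:

  ; Folded today: fadd X, -0.0 ==> X.
  %plain = fadd <4 x float> %x, <float -0.0, float -0.0, float -0.0, float -0.0>

  ; The masked equivalent computes the same value (active lanes add -0.0,
  ; inactive lanes take the passthrough %x), but instcombine would need
  ; explicit knowledge of the intrinsic to simplify the call to %x.
  %masked = call <4 x float> @llvm.masked.fadd.v4f32(
              <4 x float> %x,
              <4 x float> <float -0.0, float -0.0, float -0.0, float -0.0>,
              <4 x i1> %mask,
              <4 x float> %x)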
Hi,

On 12/19/18 11:07 PM, Adam Nemet via llvm-dev wrote:
> [...]
>
> I think that sounds like a reasonable compromise. We already have
> masked load/store intrinsics so adding add and sub just follows that
> precedent. If the decision is made to move masking to the core
> operations, the new intrinsics would just move as well.
> So an add->multiply for option B + masking intrinsics would look like
> this:
>
> [...]

We've started an RFC that proposes exactly this: https://reviews.llvm.org/D53613

The RFC proposes intrinsics that take a mask and an explicit vector length argument. The explicit vector length is aimed at RISC-V V and NEC SX-Aurora, and it can be legalized away for targets that do not support it (e.g. AVX512). We also propose a couple of new attributes that should help with function call vectorization.

I'll present this at the upcoming LLVM Social in Zurich on January 10th for people who are interested. I also talked a bit about this at the last DevMtg (from ~15:00 in https://youtu.be/BAZClv6nMxY).

- Simon

-- 
Simon Moll
Researcher / PhD Student
Compiler Design Lab (Prof. Hack)
Saarland University, Computer Science
Building E1.3, Room 4.31
Tel. +49 (0)681 302-57521 : moll at cs.uni-saarland.de
Fax. +49 (0)681 302-3065  : http://compilers.cs.uni-saarland.de/people/moll
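For a rough idea of the shape of such intrinsics, a sketch of an fadd carrying both a per-lane mask and an explicit vector length might look like the following. The name and exact signature here are assumptions of this sketch; the RFC linked above is the authoritative description.

  ; Sketch: fadd with a mask and an explicit vector length. Lanes at or
  ; beyond the vector length operand, and lanes whose mask bit is false,
  ; do not produce an active result.
  declare <8 x double> @llvm.evl.fadd.v8f64(<8 x double>, <8 x double>,
                                            <8 x i1>, i32)

  ; Example call: add the first %n lanes of %x and %y under mask %m.
  %r = call <8 x double> @llvm.evl.fadd.v8f64(<8 x double> %x, <8 x double> %y,
                                              <8 x i1> %m, i32 %n)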