Simon Pilgrim via llvm-dev
2019-Apr-05 08:47 UTC
[llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics
On 05/04/2019 09:37, Simon Pilgrim via llvm-dev wrote:
> On 04/04/2019 14:11, Sander De Smalen wrote:
>> Proposed change:
>>
>> ----------------------------
>>
>> In this RFC I propose changing the intrinsics for llvm.experimental.vector.reduce.fadd and llvm.experimental.vector.reduce.fmul (see options A and B). I also propose renaming the 'accumulator' operand to 'start value', because for fmul this is the start value of the reduction rather than a value into which the fmul reduction is accumulated.
>>
>> [Option A] Always using the start value operand in the reduction (https://reviews.llvm.org/D60261)
>>
>>   declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float %start_value, <4 x float> %vec)
>>
>> This means that if the start value is 'undef', the result will be undef, and all code creating such a reduction will need to ensure it has a sensible start value (e.g. 0.0 for fadd, 1.0 for fmul). When using 'fast' or 'reassoc' on the call it will be implemented using an unordered reduction; otherwise it will be implemented with an ordered reduction. Note that a new intrinsic is required to capture the new semantics. In this proposal the intrinsic is prefixed with a 'v2' for the time being, with the expectation that this will be dropped when we remove 'experimental' from the reduction intrinsics in the future.
>>
>> [Option B] Having separate ordered and unordered intrinsics (https://reviews.llvm.org/D60262).
>>
>>   declare float @llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(float %start_value, <4 x float> %vec)
>>   declare float @llvm.experimental.vector.reduce.unordered.fadd.f32.v4f32(<4 x float> %vec)
>>
>> This will mean that the behaviour is explicit from the intrinsic, and the use of 'fast' or 'reassoc' on the call has no effect on how that intrinsic is lowered. The ordered reduction intrinsic will take a scalar start-value operand, whereas the unordered reduction intrinsic will only take a vector operand.
>>
>> Both options auto-upgrade the IR to use the new (version of the) intrinsics. I'm personally slightly in favour of [Option B], because it better aligns with the definition of the SelectionDAG nodes and is more explicit in its semantics. We also avoid having to use an artificial 'v2'-like prefix to denote the new behaviour of the intrinsic.
>>
> Do we have any targets with instructions that can actually use the start value? TBH I'd be tempted to suggest we just make the initial extractelement/fadd/insertelement pattern a manual extra stage and avoid having that argument entirely.
>
>> Further efforts:
>>
>> ----------------------------
>>
>> Here is a non-exhaustive list of items I think work towards making the intrinsics non-experimental:
>>
>>   * Adding SelectionDAG legalization for the _STRICT reduction SDNodes. After some great work from Nikita in D58015, unordered reductions are now legalized/expanded in SelectionDAG, so if we add expansion in SelectionDAG for strict reductions this would make the ExpandReductionsPass redundant.
>>   * Better enforcing the constraints of the intrinsics (see https://reviews.llvm.org/D60260).
>>   * I think we'll also want to be able to overload the result operand based on the vector element type for the intrinsics having the constraint that the result type must match the vector element type, e.g. dropping the redundant 'i32' in:
>>       i32 @llvm.experimental.vector.reduce.and.i32.v4i32(<4 x i32> %a) => i32 @llvm.experimental.vector.reduce.and.v4i32(<4 x i32> %a)
>>     since i32 is implied by <4 x i32>. This would have the added benefit that LLVM would automatically check for the operands to match.
>>
> Won't this cause issues with overflow? Isn't the point of an add (or mul...) reduction of, say, <64 x i8> giving a larger (i32 or i64) result so we don't lose anything? I agree for bitop reductions it doesn't make sense though.
>
Sorry - I forgot to add: which asks the question - should we be considering signed/unsigned add/mul and possibly saturating reductions?
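For reference, call sites under the two options might look roughly as follows (a sketch only; the identity start value 0.0 and the names %vec, %acc and %r0-%r3 are illustrative, not taken from either patch):

  ; Option A: a single intrinsic; 'reassoc' (or 'fast') on the call permits an unordered lowering
  %r0 = call reassoc float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float 0.0, <4 x float> %vec)
  ; Option A: no fast-math flags, so this lowers as an ordered (strict) reduction starting from %acc
  %r1 = call float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float %acc, <4 x float> %vec)

  ; Option B: the ordering is explicit in the intrinsic name, independent of fast-math flags
  %r2 = call float @llvm.experimental.vector.reduce.unordered.fadd.f32.v4f32(<4 x float> %vec)
  %r3 = call float @llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(float %acc, <4 x float> %vec)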
Sander De Smalen via llvm-dev
2019-Apr-05 15:26 UTC
[llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics
Hi Simon,

Thanks for your feedback! See my comments inline.

> On 5 Apr 2019, at 09:47, Simon Pilgrim via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> On 05/04/2019 09:37, Simon Pilgrim via llvm-dev wrote:
>> On 04/04/2019 14:11, Sander De Smalen wrote:
>>> Proposed change:
>>>
>>> ----------------------------
>>>
>>> In this RFC I propose changing the intrinsics for llvm.experimental.vector.reduce.fadd and llvm.experimental.vector.reduce.fmul (see options A and B). I also propose renaming the 'accumulator' operand to 'start value', because for fmul this is the start value of the reduction rather than a value into which the fmul reduction is accumulated.
>>>
>>> [Option A] Always using the start value operand in the reduction (https://reviews.llvm.org/D60261)
>>>
>>>   declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float %start_value, <4 x float> %vec)
>>>
>>> This means that if the start value is 'undef', the result will be undef, and all code creating such a reduction will need to ensure it has a sensible start value (e.g. 0.0 for fadd, 1.0 for fmul). When using 'fast' or 'reassoc' on the call it will be implemented using an unordered reduction; otherwise it will be implemented with an ordered reduction. Note that a new intrinsic is required to capture the new semantics. In this proposal the intrinsic is prefixed with a 'v2' for the time being, with the expectation that this will be dropped when we remove 'experimental' from the reduction intrinsics in the future.
>>>
>>> [Option B] Having separate ordered and unordered intrinsics (https://reviews.llvm.org/D60262).
>>>
>>>   declare float @llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(float %start_value, <4 x float> %vec)
>>>   declare float @llvm.experimental.vector.reduce.unordered.fadd.f32.v4f32(<4 x float> %vec)
>>>
>>> This will mean that the behaviour is explicit from the intrinsic, and the use of 'fast' or 'reassoc' on the call has no effect on how that intrinsic is lowered. The ordered reduction intrinsic will take a scalar start-value operand, whereas the unordered reduction intrinsic will only take a vector operand.
>>>
>>> Both options auto-upgrade the IR to use the new (version of the) intrinsics. I'm personally slightly in favour of [Option B], because it better aligns with the definition of the SelectionDAG nodes and is more explicit in its semantics. We also avoid having to use an artificial 'v2'-like prefix to denote the new behaviour of the intrinsic.
>>
>> Do we have any targets with instructions that can actually use the start value? TBH I'd be tempted to suggest we just make the initial extractelement/fadd/insertelement pattern a manual extra stage and avoid having that argument entirely.

ARM SVE has the FADDA instruction for strict fadd reductions (see for example test/MC/AArch64/SVE/fadda.s). This instruction takes an explicit start-value operand. The reduction intrinsics were originally introduced for SVE, where we modelled the fadd/fmul reductions with this instruction in mind.

Just to clarify, is this what you are suggesting regarding extract/fadd/insert?

  %first = extractelement <4 x float> %input, i32 0
  %first.new = fadd float %start, %first
  %input.new = insertelement <4 x float> %input, float %first.new, i32 0
  %red = call float @llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(<4 x float> %input.new)

My only reservation here is that LLVM might obfuscate this code so that CodeGen couldn't easily match the extract/fadd/insert pattern, thus adding the extra fadd instruction. This could for example happen if the loop were rotated/pipelined to load the next iteration and do the first 'fadd' before the next iteration. In such a case, having the extra operand would be more descriptive.

>>> Further efforts:
>>>
>>> ----------------------------
>>>
>>> Here is a non-exhaustive list of items I think work towards making the intrinsics non-experimental:
>>>
>>>   * Adding SelectionDAG legalization for the _STRICT reduction SDNodes. After some great work from Nikita in D58015, unordered reductions are now legalized/expanded in SelectionDAG, so if we add expansion in SelectionDAG for strict reductions this would make the ExpandReductionsPass redundant.
>>>   * Better enforcing the constraints of the intrinsics (see https://reviews.llvm.org/D60260).
>>>   * I think we'll also want to be able to overload the result operand based on the vector element type for the intrinsics having the constraint that the result type must match the vector element type, e.g. dropping the redundant 'i32' in:
>>>       i32 @llvm.experimental.vector.reduce.and.i32.v4i32(<4 x i32> %a) => i32 @llvm.experimental.vector.reduce.and.v4i32(<4 x i32> %a)
>>>     since i32 is implied by <4 x i32>. This would have the added benefit that LLVM would automatically check for the operands to match.
>>
>> Won't this cause issues with overflow? Isn't the point of an add (or mul...) reduction of, say, <64 x i8> giving a larger (i32 or i64) result so we don't lose anything? I agree for bitop reductions it doesn't make sense though.
>
> Sorry - I forgot to add: which asks the question - should we be considering signed/unsigned add/mul and possibly saturating reductions?

The current intrinsics explicitly specify that: "The return type matches the element-type of the vector input".

This was done to avoid having explicit signed/unsigned add reductions, reasoning that zero- and sign-extension can be done on the input values to the reduction. We had a bit of debate on this internally, and it would come down to similar reasons as for the extra 'start value' operand to fadd reductions. I think we'd welcome the signed/unsigned variants, as they would be more descriptive and would safeguard the code from transformations that make it difficult to fold the sign/zero extend into the operation during CodeGen. The downside, however, is that for signed/unsigned add reductions both operations would be the same when the result type equals the element type.

Saturating vector reductions sound sensible, but are there any targets that implement these at the moment?
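To illustrate the zero/sign-extension point with the current element-type-matching intrinsics (a sketch; %bytes is a placeholder value, and the name mangling follows the '.and.i32.v4i32' example quoted above):

  ; widen the inputs first, then reduce at the wider type, so the i8 lanes cannot overflow
  %wide = zext <64 x i8> %bytes to <64 x i32>          ; or sext for a signed sum
  %sum = call i32 @llvm.experimental.vector.reduce.add.i32.v64i32(<64 x i32> %wide)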
Simon Pilgrim via llvm-dev
2019-Apr-07 13:56 UTC
[llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics
On 05/04/2019 16:26, Sander De Smalen wrote:
> Hi Simon,
>
> Thanks for your feedback! See my comments inline.
>
>> On 5 Apr 2019, at 09:47, Simon Pilgrim via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>
>> On 05/04/2019 09:37, Simon Pilgrim via llvm-dev wrote:
>>> On 04/04/2019 14:11, Sander De Smalen wrote:
>>>> Proposed change:
>>>> ----------------------------
>>>> In this RFC I propose changing the intrinsics for llvm.experimental.vector.reduce.fadd and llvm.experimental.vector.reduce.fmul (see options A and B). I also propose renaming the 'accumulator' operand to 'start value', because for fmul this is the start value of the reduction rather than a value into which the fmul reduction is accumulated.
>>>>
>>>> [Option A] Always using the start value operand in the reduction (https://reviews.llvm.org/D60261)
>>>>
>>>>   declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float %start_value, <4 x float> %vec)
>>>>
>>>> This means that if the start value is 'undef', the result will be undef, and all code creating such a reduction will need to ensure it has a sensible start value (e.g. 0.0 for fadd, 1.0 for fmul). When using 'fast' or 'reassoc' on the call it will be implemented using an unordered reduction; otherwise it will be implemented with an ordered reduction. Note that a new intrinsic is required to capture the new semantics. In this proposal the intrinsic is prefixed with a 'v2' for the time being, with the expectation that this will be dropped when we remove 'experimental' from the reduction intrinsics in the future.
>>>>
>>>> [Option B] Having separate ordered and unordered intrinsics (https://reviews.llvm.org/D60262).
>>>>
>>>>   declare float @llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(float %start_value, <4 x float> %vec)
>>>>   declare float @llvm.experimental.vector.reduce.unordered.fadd.f32.v4f32(<4 x float> %vec)
>>>>
>>>> This will mean that the behaviour is explicit from the intrinsic, and the use of 'fast' or 'reassoc' on the call has no effect on how that intrinsic is lowered. The ordered reduction intrinsic will take a scalar start-value operand, whereas the unordered reduction intrinsic will only take a vector operand.
>>>>
>>>> Both options auto-upgrade the IR to use the new (version of the) intrinsics. I'm personally slightly in favour of [Option B], because it better aligns with the definition of the SelectionDAG nodes and is more explicit in its semantics. We also avoid having to use an artificial 'v2'-like prefix to denote the new behaviour of the intrinsic.
>>>
>>> Do we have any targets with instructions that can actually use the start value? TBH I'd be tempted to suggest we just make the initial extractelement/fadd/insertelement pattern a manual extra stage and avoid having that argument entirely.
>>>
> ARM SVE has the FADDA instruction for strict fadd reductions (see for example test/MC/AArch64/SVE/fadda.s). This instruction takes an explicit start-value operand. The reduction intrinsics were originally introduced for SVE, where we modelled the fadd/fmul reductions with this instruction in mind.
>
> Just to clarify, is this what you are suggesting regarding extract/fadd/insert?
>
>   %first = extractelement <4 x float> %input, i32 0
>   %first.new = fadd float %start, %first
>   %input.new = insertelement <4 x float> %input, float %first.new, i32 0
>   %red = call float @llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(<4 x float> %input.new)
>
> My only reservation here is that LLVM might obfuscate this code so that CodeGen couldn't easily match the extract/fadd/insert pattern, thus adding the extra fadd instruction. This could for example happen if the loop were rotated/pipelined to load the next iteration and do the first 'fadd' before the next iteration. In such a case, having the extra operand would be more descriptive.

Yes, that was the IR I had in mind, but you're right in that it's probably useful for chained fadd reductions as well as the SVE-specific instruction. If we're getting rid of the fast-math 'undef' special case and we expect an 'identity' start value (fadd = 0.0f, fmul = 1.0f) that we can optimize away, then I've no objections.

>>>> Further efforts:
>>>> ----------------------------
>>>> Here is a non-exhaustive list of items I think work towards making the intrinsics non-experimental:
>>>>
>>>>   * Adding SelectionDAG legalization for the _STRICT reduction SDNodes. After some great work from Nikita in D58015, unordered reductions are now legalized/expanded in SelectionDAG, so if we add expansion in SelectionDAG for strict reductions this would make the ExpandReductionsPass redundant.
>>>>   * Better enforcing the constraints of the intrinsics (see https://reviews.llvm.org/D60260).
>>>>   * I think we'll also want to be able to overload the result operand based on the vector element type for the intrinsics having the constraint that the result type must match the vector element type, e.g. dropping the redundant 'i32' in:
>>>>       i32 @llvm.experimental.vector.reduce.and.i32.v4i32(<4 x i32> %a) => i32 @llvm.experimental.vector.reduce.and.v4i32(<4 x i32> %a)
>>>>     since i32 is implied by <4 x i32>. This would have the added benefit that LLVM would automatically check for the operands to match.
>>>
>>> Won't this cause issues with overflow? Isn't the point of an add (or mul...) reduction of, say, <64 x i8> giving a larger (i32 or i64) result so we don't lose anything? I agree for bitop reductions it doesn't make sense though.
>>>
>> Sorry - I forgot to add: which asks the question - should we be considering signed/unsigned add/mul and possibly saturating reductions?
>
> The current intrinsics explicitly specify that: "The return type matches the element-type of the vector input".
>
> This was done to avoid having explicit signed/unsigned add reductions, reasoning that zero- and sign-extension can be done on the input values to the reduction. We had a bit of debate on this internally, and it would come down to similar reasons as for the extra 'start value' operand to fadd reductions. I think we'd welcome the signed/unsigned variants, as they would be more descriptive and would safeguard the code from transformations that make it difficult to fold the sign/zero extend into the operation during CodeGen. The downside, however, is that for signed/unsigned add reductions both operations would be the same when the result type equals the element type.

An alternative would be that we limit the existing add/mul cases to the same result type (along with and/or/xor/smax/smin/umax/umin) and we add sadd/uadd/smul/umul extending reductions as well.

> Saturating vector reductions sound sensible, but are there any targets that implement these at the moment?

X86/SSE has the v8i16 HADDS/HSUBS horizontal signed saturation instructions, and X86/XOP has extend+horizontal-add/sub instructions (https://en.wikipedia.org/wiki/XOP_instruction_set).
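If such extending or saturating variants were ever added, the declarations might conceivably look like the following (purely hypothetical names and signatures, sketched here for illustration and not taken from any of the linked patches):

  ; hypothetical widening reductions: the result type is wider than the element type
  declare i32 @llvm.experimental.vector.reduce.uadd.i32.v64i8(<64 x i8> %a)
  declare i32 @llvm.experimental.vector.reduce.sadd.i32.v64i8(<64 x i8> %a)
  ; hypothetical saturating reduction: same-width result that saturates instead of wrapping
  declare i16 @llvm.experimental.vector.reduce.sadd.sat.i16.v8i16(<8 x i16> %a)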
Simon Moll via llvm-dev
2019-Apr-08 10:37 UTC
[llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics
Hi,

On 4/5/19 10:47 AM, Simon Pilgrim via llvm-dev wrote:
> On 05/04/2019 09:37, Simon Pilgrim via llvm-dev wrote:
>> On 04/04/2019 14:11, Sander De Smalen wrote:
>>> Proposed change:
>>>
>>> ----------------------------
>>>
>>> In this RFC I propose changing the intrinsics for llvm.experimental.vector.reduce.fadd and llvm.experimental.vector.reduce.fmul (see options A and B). I also propose renaming the 'accumulator' operand to 'start value', because for fmul this is the start value of the reduction rather than a value into which the fmul reduction is accumulated.
>>>

Note that the LLVM-VP proposal also changes the way reductions are handled in IR (https://reviews.llvm.org/D57504). This could be an opportunity to avoid the "v2" suffix issue: LLVM-VP moves the intrinsics to the "llvm.vp.*" namespace and we can fix the reduction semantics in the process.

Btw, if you are at EuroLLVM, there is a BoF at 2pm today on LLVM-VP.

>>> [Option A] Always using the start value operand in the reduction (https://reviews.llvm.org/D60261)
>>>
>>>   declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float %start_value, <4 x float> %vec)
>>>
>>> This means that if the start value is 'undef', the result will be undef, and all code creating such a reduction will need to ensure it has a sensible start value (e.g. 0.0 for fadd, 1.0 for fmul). When using 'fast' or 'reassoc' on the call it will be implemented using an unordered reduction; otherwise it will be implemented with an ordered reduction. Note that a new intrinsic is required to capture the new semantics. In this proposal the intrinsic is prefixed with a 'v2' for the time being, with the expectation that this will be dropped when we remove 'experimental' from the reduction intrinsics in the future.
>>>
>>> [Option B] Having separate ordered and unordered intrinsics (https://reviews.llvm.org/D60262).
>>>
>>>   declare float @llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(float %start_value, <4 x float> %vec)
>>>   declare float @llvm.experimental.vector.reduce.unordered.fadd.f32.v4f32(<4 x float> %vec)
>>>
>>> This will mean that the behaviour is explicit from the intrinsic, and the use of 'fast' or 'reassoc' on the call has no effect on how that intrinsic is lowered. The ordered reduction intrinsic will take a scalar start-value operand, whereas the unordered reduction intrinsic will only take a vector operand.
>>>
>>> Both options auto-upgrade the IR to use the new (version of the) intrinsics. I'm personally slightly in favour of [Option B], because it better aligns with the definition of the SelectionDAG nodes and is more explicit in its semantics. We also avoid having to use an artificial 'v2'-like prefix to denote the new behaviour of the intrinsic.
>>>
>> Do we have any targets with instructions that can actually use the start value? TBH I'd be tempted to suggest we just make the initial extractelement/fadd/insertelement pattern a manual extra stage and avoid having that argument entirely.
>>

NEC SX-Aurora has reduction instructions that take in a start value in a scalar register. We are hoping to upstream the backend: http://lists.llvm.org/pipermail/llvm-dev/2019-April/131580.html

>>> Further efforts:
>>>
>>> ----------------------------
>>>
>>> Here is a non-exhaustive list of items I think work towards making the intrinsics non-experimental:
>>>
>>>   * Adding SelectionDAG legalization for the _STRICT reduction SDNodes. After some great work from Nikita in D58015, unordered reductions are now legalized/expanded in SelectionDAG, so if we add expansion in SelectionDAG for strict reductions this would make the ExpandReductionsPass redundant.
>>>   * Better enforcing the constraints of the intrinsics (see https://reviews.llvm.org/D60260).
>>>   * I think we'll also want to be able to overload the result operand based on the vector element type for the intrinsics having the constraint that the result type must match the vector element type, e.g. dropping the redundant 'i32' in:
>>>       i32 @llvm.experimental.vector.reduce.and.i32.v4i32(<4 x i32> %a) => i32 @llvm.experimental.vector.reduce.and.v4i32(<4 x i32> %a)
>>>     since i32 is implied by <4 x i32>. This would have the added benefit that LLVM would automatically check for the operands to match.
>>>
>> Won't this cause issues with overflow? Isn't the point of an add (or mul...) reduction of, say, <64 x i8> giving a larger (i32 or i64) result so we don't lose anything? I agree for bitop reductions it doesn't make sense though.
>>
> Sorry - I forgot to add: which asks the question - should we be considering signed/unsigned add/mul and possibly saturating reductions?

-- 
Simon Moll
Researcher / PhD Student

Compiler Design Lab (Prof. Hack)
Saarland University, Computer Science
Building E1.3, Room 4.31

Tel. +49 (0)681 302-57521 : moll at cs.uni-saarland.de
Fax. +49 (0)681 302-3065  : http://compilers.cs.uni-saarland.de/people/moll
Sander De Smalen via llvm-dev
2019-Apr-10 12:59 UTC
[llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics
> On 8 Apr 2019, at 11:37, Simon Moll <moll at cs.uni-saarland.de> wrote:
>
> Hi,
>
> On 4/5/19 10:47 AM, Simon Pilgrim via llvm-dev wrote:
>> On 05/04/2019 09:37, Simon Pilgrim via llvm-dev wrote:
>>> On 04/04/2019 14:11, Sander De Smalen wrote:
>>>> Proposed change:
>>>> ----------------------------
>>>> In this RFC I propose changing the intrinsics for llvm.experimental.vector.reduce.fadd and llvm.experimental.vector.reduce.fmul (see options A and B). I also propose renaming the 'accumulator' operand to 'start value', because for fmul this is the start value of the reduction rather than a value into which the fmul reduction is accumulated.
>
> Note that the LLVM-VP proposal also changes the way reductions are handled in IR (https://reviews.llvm.org/D57504). This could be an opportunity to avoid the "v2" suffix issue: LLVM-VP moves the intrinsics to the "llvm.vp.*" namespace and we can fix the reduction semantics in the process.

Thanks for pointing that out, Simon. I think for now we should keep this proposal separate from LLVM-VP, as they serve different purposes and have different scope. But yes, we can easily rename the intrinsics again when the VP proposal lands.

> Btw, if you are at EuroLLVM, there is a BoF at 2pm today on LLVM-VP.
>
>>>> [Option A] Always using the start value operand in the reduction (https://reviews.llvm.org/D60261)
>>>>
>>>>   declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float %start_value, <4 x float> %vec)
>>>>
>>>> This means that if the start value is 'undef', the result will be undef, and all code creating such a reduction will need to ensure it has a sensible start value (e.g. 0.0 for fadd, 1.0 for fmul). When using 'fast' or 'reassoc' on the call it will be implemented using an unordered reduction; otherwise it will be implemented with an ordered reduction. Note that a new intrinsic is required to capture the new semantics. In this proposal the intrinsic is prefixed with a 'v2' for the time being, with the expectation that this will be dropped when we remove 'experimental' from the reduction intrinsics in the future.
>>>>
>>>> [Option B] Having separate ordered and unordered intrinsics (https://reviews.llvm.org/D60262).
>>>>
>>>>   declare float @llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(float %start_value, <4 x float> %vec)
>>>>   declare float @llvm.experimental.vector.reduce.unordered.fadd.f32.v4f32(<4 x float> %vec)
>>>>
>>>> This will mean that the behaviour is explicit from the intrinsic, and the use of 'fast' or 'reassoc' on the call has no effect on how that intrinsic is lowered. The ordered reduction intrinsic will take a scalar start-value operand, whereas the unordered reduction intrinsic will only take a vector operand.
>>>>
>>>> Both options auto-upgrade the IR to use the new (version of the) intrinsics. I'm personally slightly in favour of [Option B], because it better aligns with the definition of the SelectionDAG nodes and is more explicit in its semantics. We also avoid having to use an artificial 'v2'-like prefix to denote the new behaviour of the intrinsic.
>>>
>>> Do we have any targets with instructions that can actually use the start value? TBH I'd be tempted to suggest we just make the initial extractelement/fadd/insertelement pattern a manual extra stage and avoid having that argument entirely.
>>>
> NEC SX-Aurora has reduction instructions that take in a start value in a scalar register. We are hoping to upstream the backend: http://lists.llvm.org/pipermail/llvm-dev/2019-April/131580.html

Great, I think that, combined with the argument for chaining of ordered reductions (often inside vectorized loops) and two architectures (ARM SVE and SX-Aurora) taking a scalar start register, this is enough of an argument to keep the explicit operand for the ordered reductions.

>>>> Further efforts:
>>>> ----------------------------
>>>> Here is a non-exhaustive list of items I think work towards making the intrinsics non-experimental:
>>>>
>>>>   * Adding SelectionDAG legalization for the _STRICT reduction SDNodes. After some great work from Nikita in D58015, unordered reductions are now legalized/expanded in SelectionDAG, so if we add expansion in SelectionDAG for strict reductions this would make the ExpandReductionsPass redundant.
>>>>   * Better enforcing the constraints of the intrinsics (see https://reviews.llvm.org/D60260).
>>>>   * I think we'll also want to be able to overload the result operand based on the vector element type for the intrinsics having the constraint that the result type must match the vector element type, e.g. dropping the redundant 'i32' in:
>>>>       i32 @llvm.experimental.vector.reduce.and.i32.v4i32(<4 x i32> %a) => i32 @llvm.experimental.vector.reduce.and.v4i32(<4 x i32> %a)
>>>>     since i32 is implied by <4 x i32>. This would have the added benefit that LLVM would automatically check for the operands to match.
>>>
>>> Won't this cause issues with overflow? Isn't the point of an add (or mul...) reduction of, say, <64 x i8> giving a larger (i32 or i64) result so we don't lose anything? I agree for bitop reductions it doesn't make sense though.
>>>
>> Sorry - I forgot to add: which asks the question - should we be considering signed/unsigned add/mul and possibly saturating reductions?
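For illustration, the chaining case in a vectorized loop might look roughly like this, using Option B's proposed ordered intrinsic (a sketch; the function, loop structure and names are illustrative, and %n is assumed to be a multiple of 4):

  define float @chained_fadd(float* %base, i64 %n) {
  entry:
    br label %loop
  loop:
    %i   = phi i64   [ 0, %entry ],   [ %i.next, %loop ]
    %acc = phi float [ 0.0, %entry ], [ %acc.next, %loop ]
    %addr  = getelementptr float, float* %base, i64 %i
    %vaddr = bitcast float* %addr to <4 x float>*
    %v = load <4 x float>, <4 x float>* %vaddr, align 4
    ; the running scalar total feeds straight back in as the start value of the next ordered reduction
    %acc.next = call float @llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(float %acc, <4 x float> %v)
    %i.next = add i64 %i, 4
    %done = icmp uge i64 %i.next, %n
    br i1 %done, label %exit, label %loop
  exit:
    ret float %acc.next
  }

  declare float @llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(float, <4 x float>)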