I think you and I are talking about two different things. As far as Intel's vector function ABI is concerned, unless the programmer specifically says otherwise, given an OpenMP declare simd function, the compiler will deduce the VF from the HW vector register size and other aspects of the function signature. Of course, there can be different vector function ABIs for different targets. The Intel compiler's cost model uses the vector function VF as part of loop vectorization VF determination, so the two are tightly coupled.

A hypothetical vector target may vectorize such a vector function for a 4096-bit vector, with an explicit VF parameter of 20 also passed to it, to execute only the lower 20-element part of the whole thing.

I think this scenario answers Philip's question of why we need separate mask and VF parameters and why VF can't be conservatively deduced from the mask/mask compute.

From: Bruce Hoult [mailto:bruce at hoult.org]
Sent: Thursday, January 31, 2019 5:13 PM
To: Saito, Hideki <hideki.saito at intel.com>
Cc: Philip Reames <listmail at philipreames.com>; Robin Kruppe <robin.kruppe at gmail.com>; David Greene <dag at cray.com>; via llvm-dev <llvm-dev at lists.llvm.org>; Maslov, Sergey V <sergey.v.maslov at intel.com>; Topper, Craig <craig.topper at intel.com>
Subject: Re: [llvm-dev] [RFC] Vector Predication

On Thu, Jan 31, 2019 at 4:31 PM Saito, Hideki via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> >when we have a mask loaded from an external source (memory, function call boundary, etc...) and a short sequence of vector ops
>
> Mask value from function call parameter is common. OpenMP declare simd function does exactly that for the masked cases.

Such a mask is at the application level, not at the vector strip-mining loop level. As well as possibly being many times longer than the masks the hardware works with, it's likely to not even be in the format the hardware uses: different library APIs might pack a mask into bits, or use one mask element per byte, short, or int.
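[Editorial note: to make the ABI scenario Saito describes more concrete, here is a minimal C sketch of an OpenMP declare simd function and the kind of masked vector clone a compiler might generate for it. The function names, the caller loop, and the mangled signature in the final comment are illustrative assumptions only, not taken from the thread or from any particular compiler's output.]

  #include <stddef.h>

  /* Scalar function annotated for SIMD cloning.  With no simdlen() clause,
   * the vector function ABI lets the compiler pick the VF from the target's
   * register width and the argument types (e.g. 8 floats on a 256-bit
   * target). */
  #pragma omp declare simd inbranch
  float scale_add(float x, float y) {
      return 2.0f * x + y;
  }

  /* Caller-side loop; its vectorization factor is coupled to the VF the
   * ABI chose for the masked clone of scale_add(). */
  void apply(float * restrict out, const float * restrict a,
             const float * restrict b, const int * restrict cond, size_t n) {
  #pragma omp simd
      for (size_t i = 0; i < n; ++i) {
          if (cond[i])                    /* turns into the mask argument */
              out[i] = scale_add(a[i], b[i]);
      }
  }

  /* For a 256-bit x86 target, the masked clone would have a signature
   * roughly like
   *
   *     __m256 _ZGVdM8vv_scale_add(__m256 x, __m256 y, __m256 mask);
   *
   * There is no separate vector-length argument: the VF (8 here) is baked
   * into the signature, which is why the mask and the VF are distinct
   * pieces of information. */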
On 1/31/19 5:41 PM, Saito, Hideki wrote:
> I think you and I are talking about two different things.
>
> As far as Intel's vector function ABI is concerned, unless the
> programmer specifically says otherwise, given an OpenMP declare simd
> function, the compiler will deduce the VF from the HW vector register
> size and other aspects of the function signature. Of course, there can
> be different vector function ABIs for different targets. The Intel
> compiler's cost model uses the vector function VF as part of loop
> vectorization VF determination, so the two are tightly coupled.
>
> A hypothetical vector target may vectorize such a vector function for a
> 4096-bit vector, with an explicit VF parameter of 20 also passed to it,
> to execute only the lower 20-element part of the whole thing.
>
> I think this scenario answers Philip's question of why we need separate
> mask and VF parameters and why VF can't be conservatively deduced from
> the mask/mask compute.

I think this does come close, yes. There's still the question of just how common a short vectorized function of this form is in practice after inlining, but I can understand why being able to represent this cleanly/concisely would be useful.

My scheme would require that the mask->length computation code be inserted as essentially part of the prolog, and doing so might be reasonably expensive. On the other hand, if the vector length is already part of the ABI - which it sounds like this case is - inserting a bit of dummy code which enforces that the predicate mask only has bits set below VLen could be done w/ a simple shift/dec/and sequence. While the sequence itself would be dynamically useless, it would make it obvious what the vlen for the function was if it hadn't been expressed in the IR. Or alternatively, we could use the calling convention ABI detail to *assume* (and thus insert during SelectionDAG) the VLEN parameter's relation to the vector mask one.

My point in the above is not that this is obviously the right answer - it's not - simply that it probably could be made to work. As such, I don't think we should be automatically assuming we have to match the IR definition precisely to the hardware. Doing so is a recipe for over-fitting and a hard-to-maintain long-term design.

It's worth pointing out that including the vlen parameter in the intrinsic definitions creates exactly the opposite problem on a SIMD platform (i.e., we have to mask out the predicate based on the length when generating code).

Philip

p.s. Reminder, just playing devil's advocate. No strong opinions actually held. :)
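[Editorial note: for concreteness, the shift/dec/and sequence Philip describes might look like the following when written over a scalar 64-bit mask. This is a sketch with hypothetical names; a real target would apply the same idea to its native mask or predicate register type.]

  #include <stdint.h>

  /* Clamp a mask so that only bits below vlen can be set, using the
   * shift/dec/and idiom: shift (1 << vlen), decrement (- 1), and (mask &).
   * Dynamically this is a no-op when the incoming mask already respects
   * vlen, but it makes the mask/VLen relationship visible in the code.
   * Assumes vlen < 64; vlen == 64 would need special-casing. */
  static inline uint64_t clamp_mask_to_vlen(uint64_t mask, unsigned vlen) {
      uint64_t below_vlen = (UINT64_C(1) << vlen) - 1;
      return mask & below_vlen;
  }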
>how common a short vectorized function of this form is in practice after inlining

Doesn't have to be short. You can write 1000+ lines of masked vector code w/o any computation on the mask itself (e.g., the mask is read-only).

>inserting a bit of dummy code which enforces that the predicate mask only has bits set below VLen could be done w/ a simple shift/dec/and sequence

Essentially, do a dummy bit scan at the beginning of the function, use that as the dynamic VLen, and replace it with the real VLen parameter in CodeGen. Not clean, but conceptually, that might work. The rest of the discussion along this line I'll punt to the explicit vector length folks to justify.
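[Editorial note: read literally, that "dummy bit scan" prologue could be sketched like this in scalar C. It is an illustration under assumed names, not something proposed verbatim in the thread; a backend that really does receive a VLen parameter would be expected to recognize the pattern and substitute the actual parameter during CodeGen, as Saito suggests.]

  #include <stdint.h>

  /* Derive a dynamic VLen from the incoming mask with a bit scan: the
   * highest set bit plus one bounds the number of lanes that can be
   * active.  __builtin_clzll is a GCC/Clang builtin (undefined for 0,
   * hence the early return); portable code would need a fallback. */
  static inline unsigned vlen_from_mask(uint64_t mask) {
      if (mask == 0)
          return 0;
      return 64u - (unsigned)__builtin_clzll(mask);
  }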
Philip Reames <listmail at philipreames.com> writes:

> I think this does come close, yes. There's still the question of just
> how common a short vectorized function of this form is in practice
> after inlining

They are quite common for long vector-length machines (2048 bits or more). Not just functions but loops too.

-David