I think you and I are talking about two different things. As far as Intel's vector function ABI is concerned, unless the programmer specifically says otherwise, given an OpenMP declare simd function, the compiler will deduce the VF from the HW vector register size and other aspects of the function signature. Of course, there can be different vector function ABIs for different targets. The Intel compiler's cost model uses the vector function VF as part of loop vectorization VF determination, so the two are tightly coupled.

A hypothetical vector target may vectorize such a vector function for a 4096-bit vector, with an explicit VF parameter of 20 also passed to it, to execute only the lower 20-element part of the whole thing.

I think this scenario answers Philip's question of why we need separate mask and VF parameters and why VF can't be conservatively deduced from the mask/mask compute.

From: Bruce Hoult [mailto:bruce at hoult.org]
Sent: Thursday, January 31, 2019 5:13 PM
To: Saito, Hideki <hideki.saito at intel.com>
Cc: Philip Reames <listmail at philipreames.com>; Robin Kruppe <robin.kruppe at gmail.com>; David Greene <dag at cray.com>; via llvm-dev <llvm-dev at lists.llvm.org>; Maslov, Sergey V <sergey.v.maslov at intel.com>; Topper, Craig <craig.topper at intel.com>
Subject: Re: [llvm-dev] [RFC] Vector Predication

On Thu, Jan 31, 2019 at 4:31 PM Saito, Hideki via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> >when we have a mask loaded from an external source (memory, function call boundary, etc...) and a short sequence of vector ops
>
> Mask value from function call parameter is common. OpenMP declare simd function does exactly that for the masked cases.

Such a mask is at the application level, not at the vector strip-mining loop level. As well as possibly being many times longer than the masks the hardware works with, it's likely to not even be in the format the hardware uses: different library APIs might pack a mask into bits, or use one mask element per byte, short, or int.
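[Editorial note: to make the ABI scenario Saito describes more concrete, here is a minimal C sketch of an OpenMP declare simd function and the kind of masked vector clone a compiler might generate for it. The function names, the caller loop, and the mangled signature in the final comment are illustrative assumptions only, not taken from the thread or from any particular compiler's output.]

  #include <stddef.h>

  /* Scalar function annotated for SIMD cloning.  With no simdlen() clause,
   * the vector function ABI lets the compiler pick the VF from the target's
   * register width and the argument types (e.g. 8 floats on a 256-bit
   * target). */
  #pragma omp declare simd inbranch
  float scale_add(float x, float y) {
      return 2.0f * x + y;
  }

  /* Caller-side loop; its vectorization factor is coupled to the VF the
   * ABI chose for the masked clone of scale_add(). */
  void apply(float * restrict out, const float * restrict a,
             const float * restrict b, const int * restrict cond, size_t n) {
  #pragma omp simd
      for (size_t i = 0; i < n; ++i) {
          if (cond[i])                    /* turns into the mask argument */
              out[i] = scale_add(a[i], b[i]);
      }
  }

  /* For a 256-bit x86 target, the masked clone would have a signature
   * roughly like
   *
   *     __m256 _ZGVdM8vv_scale_add(__m256 x, __m256 y, __m256 mask);
   *
   * There is no separate vector-length argument: the VF (8 here) is baked
   * into the signature, which is why the mask and the VF are distinct
   * pieces of information. */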
On 1/31/19 5:41 PM, Saito, Hideki wrote:
> I think you and I are talking about two different things.
>
> As far as Intel's vector function ABI is concerned, unless the
> programmer specifically says otherwise, given an OpenMP declare simd
> function, the compiler will deduce the VF from the HW vector register
> size and other aspects of the function signature. Of course, there can
> be different vector function ABIs for different targets. The Intel
> compiler's cost model uses the vector function VF as part of loop
> vectorization VF determination, so the two are tightly coupled.
>
> A hypothetical vector target may vectorize such a vector function for a
> 4096-bit vector, with an explicit VF parameter of 20 also passed to it,
> to execute only the lower 20-element part of the whole thing.
>
> I think this scenario answers Philip's question of why we need separate
> mask and VF parameters and why VF can't be conservatively deduced from
> the mask/mask compute.

I think this does come close, yes. There's still the question of just how common a short vectorized function of this form is in practice after inlining, but I can understand why being able to represent this cleanly/concisely would be useful.

My scheme would require that the mask->length computation code be inserted as essentially part of the prolog, and doing so might be reasonably expensive. On the other hand, if the vector length is already part of the ABI - which it sounds like this case is - inserting a bit of dummy code which enforces that the predicate mask only has bits set below VLen could be done w/ a simple shift/dec/and sequence. While the sequence itself would be dynamically useless, it would make it obvious what the vlen for the function was if it hadn't been expressed in the IR. Or alternatively, we could use the calling convention ABI detail to *assume* (and thus insert during SelectionDAG) the VLEN parameter's relation to the vector mask one.

My point in the above is not that this is obviously the right answer - it's not - simply that it probably could be made to work. As such, I don't think we should be automatically assuming we have to match the IR definition precisely to the hardware. Doing so is a recipe for over-fitting and a hard-to-maintain long-term design.

It's worth pointing out that including the vlen parameter in the intrinsic definitions creates exactly the opposite problem on a SIMD platform (i.e., we have to mask out the predicate based on the length when generating code).

Philip

p.s. Reminder, just playing devil's advocate. No strong opinions actually held. :)
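[Editorial note: for concreteness, the shift/dec/and sequence Philip describes might look like the following when written over a scalar 64-bit mask. This is a sketch with hypothetical names; a real target would apply the same idea to its native mask or predicate register type.]

  #include <stdint.h>

  /* Clamp a mask so that only bits below vlen can be set, using the
   * shift/dec/and idiom: shift (1 << vlen), decrement (- 1), and (mask &).
   * Dynamically this is a no-op when the incoming mask already respects
   * vlen, but it makes the mask/VLen relationship visible in the code.
   * Assumes vlen < 64; vlen == 64 would need special-casing. */
  static inline uint64_t clamp_mask_to_vlen(uint64_t mask, unsigned vlen) {
      uint64_t below_vlen = (UINT64_C(1) << vlen) - 1;
      return mask & below_vlen;
  }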
>how common a short vectorized function of this form is in practice after inlining

Doesn't have to be short. You can write 1000+ lines of masked vector code w/o any computation on the mask itself (e.g., the mask is read-only).

>inserting a bit of dummy code which enforces that the predicate mask only has bits set below VLen could be done w/ a simple shift/dec/and sequence

Essentially, do a dummy bit scan at the beginning of the function, use that as the dynamic VLen, and replace it with the real VLen parameter in CodeGen. Not clean, but conceptually, that might work. The rest of the discussion along this line I'll punt to the explicit vector length folks to justify.
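[Editorial note: read literally, that "dummy bit scan" prologue could be sketched like this in scalar C. It is an illustration under assumed names, not something proposed verbatim in the thread; a backend that really does receive a VLen parameter would be expected to recognize the pattern and substitute the actual parameter during CodeGen, as Saito suggests.]

  #include <stdint.h>

  /* Derive a dynamic VLen from the incoming mask with a bit scan: the
   * highest set bit plus one bounds the number of lanes that can be
   * active.  __builtin_clzll is a GCC/Clang builtin (undefined for 0,
   * hence the early return); portable code would need a fallback. */
  static inline unsigned vlen_from_mask(uint64_t mask) {
      if (mask == 0)
          return 0;
      return 64u - (unsigned)__builtin_clzll(mask);
  }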
Philip Reames <listmail at philipreames.com> writes:

> I think this does come close, yes. There's still the question of just
> how common a short vectorized function of this form is in practice
> after inlining

They are quite common for long vector-length machines (2048 bits or more). Not just functions but loops too.

-David