thr3ads.net - llvm dev - [llvm-dev] [RFC] Vector Predication [Feb 2019]

Luke Kenneth Casson Leighton via llvm-dev

2019-Feb-01 07:52 UTC

[llvm-dev] [RFC] Vector Predication

---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68

On Thu, Jan 31, 2019 at 10:22 PM Jacob Lifshay <programmerjake at gmail.com> wrote:
>
> We're in the process of designing a RISC-V extension
> (http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-January/000433.html)
> that would have variable-length vectors of short vectors (1 to 4):
> <VL x <4 x float>>
> where each predicate bit masks out a whole short vector. We're using
> this extension to vectorize graphics code where variables in the
> pre-vectorization code are short vectors.
> So, vectorizing code like:
> for(int i = 0; i < 1000; i++)
> {
>     vec4 color = colors[i];
>     vec3 normal = normals[i];
>     color.rgb *= fmax(0.0, dot(normal, light_dir));
>     colors[i] = color;
> }
>
> I'm planning on passing already vectorized code into LLVM and using
> LLVM as a backend for optimization and JIT code generation.
>
> Do you think the EVL proposal would support an ISA like this as it's
> currently written (by pattern matching on predicate expansion and
> vector-length multiplication)?

whilst it may be tempting to suggest that a solution is to multiply up
the bits in the predicate (into groups of 3 or 4), the problem with
that is that if there are operations that require vec3 or vec4 as
operands interspersed with predicated operations that do not, that
realistically implies a need for two separate predicate registers,
otherwise cycles are wasted swapping predicates OR it implies that the
architecture *allows* two separate predicate registers to be selected.

 consequently, it would be much, much better to be able to have a
single bit of a predicate apply to the *entire* vec3 or vec4 type, on
each outer loop.
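
To make the difference concrete, here is a minimal C sketch. It is not from the thread: the function names and the fixed group size of 4 are assumptions made purely for illustration. expand_predicate shows the "multiply up the bits" approach, which produces one mask bit per lane, while scale_groups lets a single predicate bit govern a whole short vector.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Lanes per short vector, i.e. the 4 in <VL x <4 x float>> (assumed fixed here). */
#define GROUP 4

/* Approach A: expand each predicate bit into GROUP per-lane mask bits.
 * lane_mask must have room for vl * GROUP entries. */
static void expand_predicate(const bool *pred, size_t vl, bool *lane_mask)
{
    assert(pred != NULL && lane_mask != NULL);
    for (size_t i = 0; i < vl; i++)
        for (size_t j = 0; j < GROUP; j++)
            lane_mask[i * GROUP + j] = pred[i];
}

/* Approach B: one predicate bit applies to the entire short vector.
 * Each enabled group i is multiplied by scale[i]; disabled groups are untouched. */
static void scale_groups(float *data, const float *scale,
                         const bool *pred, size_t vl)
{
    assert(data != NULL && scale != NULL && pred != NULL);
    for (size_t i = 0; i < vl; i++) {
        if (!pred[i])
            continue;               /* whole vec4 masked out by a single bit */
        for (size_t j = 0; j < GROUP; j++)
            data[i * GROUP + j] *= scale[i];
    }
}
```

With approach A, code that mixes vec4 operations and scalar-per-element operations has to keep both the original and the expanded mask around; with approach B the same vl-entry predicate serves both.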

l.

Bruce Hoult via llvm-dev

2019-Feb-01 09:18 UTC

[llvm-dev] [RFC] Vector Predication

On Thu, Jan 31, 2019 at 11:53 PM Luke Kenneth Casson Leighton via
llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> ---
> crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68
>
> On Thu, Jan 31, 2019 at 10:22 PM Jacob Lifshay <programmerjake at gmail.com> wrote:
> >
> > We're in the process of designing a RISC-V extension
> > (http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-January/000433.html)
> > that would have variable-length vectors of short vectors (1 to 4):
> > <VL x <4 x float>>
> > where each predicate bit masks out a whole short vector. We're using
> > this extension to vectorize graphics code where variables in the
> > pre-vectorization code are short vectors.
> > So, vectorizing code like:
> > for(int i = 0; i < 1000; i++)
> > {
> >     vec4 color = colors[i];
> >     vec3 normal = normals[i];
> >     color.rgb *= fmax(0.0, dot(normal, light_dir));
> >     colors[i] = color;
> > }
> >
> > I'm planning on passing already vectorized code into LLVM and using
> > LLVM as a backend for optimization and JIT code generation.
> >
> > Do you think the EVL proposal would support an ISA like this as it's
> > currently written (by pattern matching on predicate expansion and
> > vector-length multiplication)?
>
> whilst it may be tempting to suggest that a solution is to multiply up
> the bits in the predicate (into groups of 3 or 4), the problem with
> that is that if there are operations that require vec3 or vec4 as
> operands interspersed with predicated operations that do not, that
> realistically implies a need for two separate predicate registers,
> otherwise cycles are wasted swapping predicates OR it implies that the
> architecture *allows* two separate predicate registers to be selected.
>
>  consequently, it would be much, much better to be able to have a
> single bit of a predicate apply to the *entire* vec3 or vec4 type, on
> each outer loop.

This situation can be handled easily in the standard RISC-V vector
extension. You'd do something like...

vsetvli t0, a0, vsew128,vnreg8,vdiv4

... to configure the vector unit to provide eight vector register
variables divided into a standard element width of 128 bits (some
instructions will widen or narrow one step to/from 64 bits or 256
bits), and then dividing each 128 bit element into 4 parts.

Arithmetic/logical/shift will happen on 32 bit elements, but
predication and loads and stores (including strided or scatter/gather)
will operate on 128 bit elements.

[I just made up "vnreg8" as an alias for the standard "vlmul4" because
"vlmul4,vdiv4" might look confusing. Either way it means to put 0b10
into bits [1:0] of the vtype CSR, specifying that the 32 vector
registers should be ganged into 8 groups, each 4x longer than standard,
because (I'm assuming) we need more than four vector registers in this
loop, but no more than eight.]
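
The arithmetic behind that configuration can be sketched in C. This is an illustration only: the helper names are made up, and treating vtype bits [1:0] as the log2 of the grouping factor is an assumption that merely agrees with the one data point given above (0b10 means 4x, hence 8 groups).

```c
#include <assert.h>
#include <stdint.h>

/* The vector unit has 32 architectural vector registers, as stated above. */
enum { NUM_VECTOR_REGS = 32 };

/* Assumption: vtype[1:0] holds log2 of the register grouping factor,
 * so 0b10 -> 4 registers per group ("vlmul4"). */
static unsigned lmul_from_vtype(uint32_t vtype)
{
    return 1u << (vtype & 0x3u);
}

/* Number of usable register groups: 32 registers ganged lmul at a time. */
static unsigned register_groups(uint32_t vtype)
{
    return NUM_VECTOR_REGS / lmul_from_vtype(vtype);
}

/* Width of the sub-elements that arithmetic operates on when each
 * standard element of sew_bits is divided into div parts ("vdiv"). */
static unsigned subelement_bits(unsigned sew_bits, unsigned div)
{
    assert(div != 0 && sew_bits % div == 0);
    return sew_bits / div;
}
```

For the example above, register_groups(0x2) gives the eight register variables, and subelement_bits(128, 4) gives the 32-bit elements that arithmetic works on, while predication and memory operations still see 128-bit elements.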

Jacob Lifshay via llvm-dev

2019-Feb-01 10:09 UTC

[llvm-dev] [RFC] Vector Predication

On Fri, Feb 1, 2019 at 1:19 AM Bruce Hoult <brucehoult at sifive.com> wrote:
>
> On Thu, Jan 31, 2019 at 11:53 PM Luke Kenneth Casson Leighton via
> llvm-dev <llvm-dev at lists.llvm.org> wrote:
> >
> > ---
> > crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68
> >
> > On Thu, Jan 31, 2019 at 10:22 PM Jacob Lifshay <programmerjake at gmail.com> wrote:
> > >
> > > We're in the process of designing a RISC-V extension
> > > (http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-January/000433.html)
> > > that would have variable-length vectors of short vectors (1 to 4):
> > > <VL x <4 x float>>
> > > where each predicate bit masks out a whole short vector. We're using
> > > this extension to vectorize graphics code where variables in the
> > > pre-vectorization code are short vectors.
> > > So, vectorizing code like:
> > > for(int i = 0; i < 1000; i++)
> > > {
> > >     vec4 color = colors[i];
> > >     vec3 normal = normals[i];
> > >     color.rgb *= fmax(0.0, dot(normal, light_dir));
> > >     colors[i] = color;
> > > }
> > >
> > > I'm planning on passing already vectorized code into LLVM and using
> > > LLVM as a backend for optimization and JIT code generation.
> > >
> > > Do you think the EVL proposal would support an ISA like this as it's
> > > currently written (by pattern matching on predicate expansion and
> > > vector-length multiplication)?
> >
> > whilst it may be tempting to suggest that a solution is to multiply up
> > the bits in the predicate (into groups of 3 or 4), the problem with
> > that is that if there are operations that require vec3 or vec4 as
> > operands interspersed with predicated operations that do not, that
> > realistically implies a need for two separate predicate registers,
> > otherwise cycles are wasted swapping predicates OR it implies that the
> > architecture *allows* two separate predicate registers to be selected.
> >
> >  consequently, it would be much, much better to be able to have a
> > single bit of a predicate apply to the *entire* vec3 or vec4 type, on
> > each outer loop.
>
> This situation can be handled easily in the standard RISC-V vector
> extension. You'd do something like...
>
> vsetvli t0, a0, vsew128,vnreg8,vdiv4
>
> ... to configure the vector unit to provide eight vector register
> variables divided into a standard element width of 128 bits (some
> instructions will widen or narrow one step to/from 64 bits or 256
> bits), and then dividing each 128 bit element into 4 parts.
>
> Arithmetic/logical/shift will happen on 32 bit elements, but
> predication and loads and stores (including strided or scatter/gather)
> will operate on 128 bit elements.
>
> [I just made up "vnreg8" as an alias for the standard "vlmul4" because
> "vlmul4,vdiv4" might look confusing. Either way it means to put 0b10
> into bits [1:0] of the vtype CSR specifying that the 32 vector
> registers should be ganged into 8 groups each 4x longer than standard
> because (I'm assuming) we need more than four vector registers in this
> loop, but no more than eight]

Neat! I did not know that about the V extension. So this sounds as though
the V extension would like support for <VL x <4 x float>>-style vectors
as well.

We are currently thinking of defining the extension in terms of a 16-bit
prefix that changes standard 32-bit instructions into vectorized 48-bit
instructions. This allows most current or future standard/non-standard
extensions, such as the B extension, to be vectorized, rather than having
to wait for vector versions of them to be added to the V extension (one
reason we are not using the V extension instead). Having a prefix rather
than, or in addition to, a layout configuration register allows
intermixing vector operations on different group/element sizes without
having to change the vector configuration every few instructions. We
would also be reserving a 32-bit opcode for compressed variants of the
most commonly used 48-bit prefixed instructions, similar in style to the
C extension. Having a prefix also allows (assuming we don't run out of
encoding space) larger register fields; we're planning on 128 integer and
128 fp registers.
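
As a rough illustration of the register-field point, here is a small C sketch. The bit layout is entirely hypothetical, since the prefix encoding has not been defined in this thread: it only shows that 128 registers need 7-bit register numbers, so a prefix would have to contribute 2 bits on top of the 5-bit register fields of a standard 32-bit RISC-V instruction.

```c
#include <assert.h>

enum {
    BASE_REG_BITS = 5,   /* register field width in standard 32-bit RISC-V */
    NUM_REGS      = 128  /* planned size of each of the int and fp files */
};

/* Smallest number of bits b such that 2^b >= n. */
static unsigned bits_for(unsigned n)
{
    unsigned b = 0;
    while ((1u << b) < n)
        b++;
    return b;
}

/* Hypothetical composition: the prefix supplies the high bits of the
 * register number and the base instruction supplies the low 5 bits. */
static unsigned extended_reg(unsigned prefix_hi, unsigned base_lo)
{
    unsigned extra_bits = bits_for(NUM_REGS) - BASE_REG_BITS;
    assert(prefix_hi < (1u << extra_bits));
    return (prefix_hi << BASE_REG_BITS) | (base_lo & 0x1Fu);
}
```

Under this assumed layout, extended_reg(3, 31) addresses register 127, the last of the 128.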

Btw, for anyone interested, feel free to join us over on libre-riscv-dev or
follow at https://www.crowdsupply.com/libre-risc-v/m-class (currently
somewhat out-of-date).

Jacob