thr3ads.net - llvm dev - [llvm-dev] [RFC] Vector Predication [Feb 2019]

Jacob Lifshay via llvm-dev

2019-Feb-01 10:09 UTC

[llvm-dev] [RFC] Vector Predication

On Fri, Feb 1, 2019 at 1:19 AM Bruce Hoult <brucehoult at sifive.com>
wrote:
> On Thu, Jan 31, 2019 at 11:53 PM Luke Kenneth Casson Leighton via
> llvm-dev <llvm-dev at lists.llvm.org> wrote:
> >
> > ---
> > crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68
> >
> > On Thu, Jan 31, 2019 at 10:22 PM Jacob Lifshay <programmerjake at gmail.com> wrote:
> > >
> > > We're in the process of designing a RISC-V extension
> > > (http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-January/000433.html)
> > > that would have variable-length vectors of short vectors (1 to 4 elements):
> > > <VL x <4 x float>>
> > > where each predicate bit masks out a whole short vector. We're using
> > > this extension to vectorize graphics code where variables in the
> > > pre-vectorization code are short vectors.
> > > So, vectorizing code like:
> > > for(int i = 0; i < 1000; i++)
> > > {
> > >     vec4 color = colors[i];
> > >     vec3 normal = normals[i];
> > >     color.rgb *= fmax(0.0, dot(normal, light_dir));
> > >     colors[i] = color;
> > > }
> > >
> > > I'm planning on passing already vectorized code into LLVM and using
> > > LLVM as a backend for optimization and JIT code generation.
> > >
> > > Do you think the EVL proposal would support an ISA like this as it's
> > > currently written (by pattern matching on predicate expansion and
> > > vector-length multiplication)?
> >
> > whilst it may be tempting to suggest that a solution is to multiply up
> > the bits in the predicate (into groups of 3 or 4), the problem with
> > that is that if there are operations that require vec3 or vec4 as
> > operands interspersed with predicated operations that do not, that
> > realistically implies a need for two separate predicate registers,
> > otherwise cycles are wasted swapping predicates OR it implies that the
> > architecture *allows* two separate predicate registers to be selected.
> >
> >  consequently, it would be much, much better to be able to have a
> > single bit of a predicate apply to the *entire* vec3 or vec4 type, on
> > each outer loop.
>
> This situation can be handled easily in the standard RISC-V vector
> extension. You'd do something like...
>
> vsetvli t0, a0, vsew128,vnreg8,vdiv4
>
> ... to configure the vector unit to provide eight vector register
> variables divided into a standard element width of 128 bits (some
> instructions will widen or narrow one step to/from 64 bits or 256
> bits), and then dividing each 128 bit element into 4 parts.
>
> Arithmetic/logical/shift will happen on 32 bit elements, but
> predication and loads and stores (including strided or scatter/gather)
> will operate on 128 bit elements.
>
> [I just made up "vnreg8" as an alias for the standard "vlmul4" because
> "vlmul4,vdiv4" might look confusing. Either way it means to put 0b10
> into bits [1:0] of the vtype CSR specifying that the 32 vector
> registers should be ganged into 8 groups each 4x longer than standard
> because (I'm assuming) we need more than four vector registers in this
> loop, but no more than eight]

Neat! I did not know that about the V extension. So this sounds as though
the V extension would like support for <VL x <4 x float>>-style vectors as
well.
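
To make the two predication styles discussed above concrete, here is a minimal Python sketch. It is only a model, not either ISA: `masked_scale` and `expand_mask` are hypothetical names, and leaving masked-off lanes unchanged is an assumption of the sketch. One predicate bit per <4 x float> group gives the same result as a predicate "multiplied up" to one bit per lane, but the grouped form needs only VL bits.

```python
def expand_mask(mask, group):
    """Replicate each predicate bit `group` times (the 'multiply up' approach)."""
    return [bit for bit in mask for _ in range(group)]


def masked_scale(values, scale, mask, group):
    """Scale a flat list of lanes where one mask bit covers `group` lanes.

    Masked-off groups are left unchanged; that is a modelling choice of
    this sketch, not something taken from either proposal.
    """
    assert len(values) == len(mask) * group
    out = list(values)
    for g, active in enumerate(mask):
        if active:
            for lane in range(g * group, (g + 1) * group):
                out[lane] = values[lane] * scale
    return out


# Three vec4 elements; the middle one is masked off.
lanes = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0]
mask = [True, False, True]

grouped = masked_scale(lanes, 2.0, mask, group=4)
per_lane = masked_scale(lanes, 2.0, expand_mask(mask, 4), group=1)
print(grouped == per_lane)
```

The per-lane form carries 12 predicate bits here and the grouped form 3, which is the bookkeeping the thread is trying to avoid duplicating when vec3/vec4 and scalar operations are interleaved.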

We are currently thinking of defining the extension in terms of a 16-bit
prefix that changes standard 32-bit instructions into vectorized 48-bit
instructions, allowing most future or current standard/non-standard
extensions to be vectorized, rather than having to wait for additional
extensions to have vector versions added to the V extension (one reason we
are not using the V extension instead), such as the B extension. Having a
prefix rather than, or in addition to, a layout configuration register
allows intermixing vector operations on different group/element sizes
without having to constantly change the vector configuration every few
instructions. We would also be reserving a 32-bit opcode for compressed
variants of the most commonly used 48-bit prefixed instructions, similar in
style to the C extension. Having a prefix also allows (assuming we don't
run out of encoding space) larger register fields, we're planning on 128
integer and 128 fp registers.

Btw, for anyone interested, feel free to join us over on libre-riscv-dev or
follow at https://www.crowdsupply.com/libre-risc-v/m-class (currently
somewhat out-of-date).

Jacob

Bruce Hoult via llvm-dev

2019-Feb-01 10:58 UTC

[llvm-dev] [RFC] Vector Predication

On Fri, Feb 1, 2019 at 2:09 AM Jacob Lifshay <programmerjake at gmail.com> wrote:
> Neat! I did not know that about the V extension. So this sounds as though
> the V extension would like support for <VL x <4 x float>>-style
> vectors as well.

Yes. In general, support for <VL x <M x iN>> where M is in {2,4,8} and
N could be as small as 1 though support for smaller than i8 is
optional. (no distinction is drawn between int and float in the vector
configuration -- that's up to the operations performed)

> We are currently thinking of defining the extension in terms of a 16-bit
> prefix that changes standard 32-bit instructions into vectorized 48-bit
> instructions, allowing most future or current standard/non-standard
> extensions to be vectorized, rather than having to wait for additional
> extensions to have vector versions added to the V extension (one reason we
> are not using the V extension instead), such as the B extension.

Do you mean instructions following the standard 48-bit encoding
scheme, that happen to contain a standard 32 bit instruction as a
payload?

> Having a prefix rather than, or in addition to, a layout configuration
> register allows intermixing vector operations on different group/element
> sizes without having to constantly change the vector configuration every
> few instructions.

No real difference. The standard RISC-V Vector extension is intended
to allow exactly those changes to the vector configuration every few
instructions. It's mostly the microcontroller people coming from
DSP/SIMD who want to do that, so it's up to them to make that
efficient on their cores -- they might even do macro-op fusion on it.
Big OoO/Supercomputer style code compiled from C/FORTRAN in general
doesn't want to do that kind of thing.

Example code that changes the configuration within a loop to do 16 bit
loads, 16x16->32 multiply, then 32 bit shift and store:

# Example: Load 16-bit values, widen multiply to 32b, shift 32b result
# right by 3, store 32b values.
loop:
    vsetvli a3, a0, vsew16,vlmul4  # vtype = 16-bit integer vectors
    vlh.v v4, (a1)          # Get 16b vector
      slli t1, a3, 1
      add a1, a1, t1        # Bump pointer
    vwmul.vs v8, v4, v1     # 32b in <v8--v15>

    vsetvli x0, a0, vsew32,vlmul8  # Operate on 32b values
    vsrl.vi v8, v8, 3
    vsw.v v8, (a2)          # Store vector of 32b
      slli t1, a3, 2        # Bytes stored: 4 per 32b element
      add a2, a2, t1        # Bump pointer
      sub a0, a0, a3        # Decrement count
      bnez a0, loop         # Any more?

(this example is probably only useful if 16x16->32 mul is
significantly faster than 32x32->32, otherwise you'd just load and
sign extend the 16 bit data into 32 bit elements)
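
For readers who do not read vector assembly, the loop can be modelled in a few lines of Python. This is a sketch under assumptions: `max_vl` stands in for whatever vector length the hardware grants through vsetvli, the multiplier is treated as a single scalar, and the data is counted in elements rather than bytes.

```python
def widen_mul_shift(src, scalar, max_vl):
    """Strip-mined model: 16-bit loads, 16x16->32 multiply, 32-bit >> 3."""
    out = []
    pos = 0
    remaining = len(src)
    while remaining > 0:
        vl = min(remaining, max_vl)        # vsetvli a3, a0, ...
        chunk = src[pos:pos + vl]          # vlh.v: load vl 16-bit values
        pos += vl                          # bump the source pointer by vl elements
        # vwmul: keep the low 32 bits of each signed product
        wide = [(x * scalar) & 0xFFFFFFFF for x in chunk]
        out.extend(w >> 3 for w in wide)   # vsrl.vi: logical shift right by 3
        remaining -= vl                    # sub a0, a0, a3
    return out


print(widen_mul_shift([8, 16, -8], 2, max_vl=2))
```

The result does not depend on `max_vl`, which is the point of the vsetvli-based loop: the same code runs unchanged on implementations with different vector register lengths.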

A note on vector register numbering. There are registers 0..31. If you
specify vlmul4 then only v0,v4,v8,v12,v16,v20,v24,v28 are valid
register numbers. If you specify vlmul8 then only v0,v8,v16,v24 are
valid.
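
The numbering rule can be stated as a small Python sketch: with a multiplier of LMUL, only register numbers that are multiples of LMUL name a group, and each group covers LMUL consecutive registers. The function names are illustrative, not part of any toolchain.

```python
def valid_vregs(lmul, num_regs=32):
    """Register numbers that may be used when registers are ganged by `lmul`."""
    if lmul not in (1, 2, 4, 8):
        raise ValueError("lmul must be 1, 2, 4 or 8")
    return list(range(0, num_regs, lmul))


def group_members(base, lmul):
    """Physical registers covered by the group named `base`."""
    if base % lmul != 0:
        raise ValueError("v%d is not a valid group base for lmul=%d" % (base, lmul))
    return list(range(base, base + lmul))


print(valid_vregs(4))        # group bases for vlmul4
print(valid_vregs(8))        # group bases for vlmul8
print(group_members(8, 8))   # the registers behind "v8" at vlmul8
```

This matches the comment in the loop above, where the widened result at vlmul8 occupies v8 through v15.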

Jacob Lifshay via llvm-dev

2019-Feb-01 11:45 UTC

[llvm-dev] [RFC] Vector Predication

On Fri, Feb 1, 2019 at 2:59 AM Bruce Hoult <brucehoult at sifive.com>
wrote:
> On Fri, Feb 1, 2019 at 2:09 AM Jacob Lifshay <programmerjake at gmail.com>
> wrote:
> > Neat! I did not know that about the V extension. So this sounds as
> > though the V extension would like support for <VL x <4 x float>>-style
> > vectors as well.
>
> Yes. In general, support for <VL x <M x iN>> where M is in {2,4,8} and
> N could be as small as 1 though support for smaller than i8 is
> optional. (no distinction is drawn between int and float in the vector
> configuration -- that's up to the operations performed)
>
> > We are currently thinking of defining the extension in terms of a 16-bit
> > prefix that changes standard 32-bit instructions into vectorized 48-bit
> > instructions, allowing most future or current standard/non-standard
> > extensions to be vectorized, rather than having to wait for additional
> > extensions to have vector versions added to the V extension (one reason we
> > are not using the V extension instead), such as the B extension.
>
> Do you mean instructions following the standard 48-bit encoding
> scheme, that happen to contain a standard 32 bit instruction as a
> payload?

Yes. We reuse the 2 LSB bits from the 32-bit instruction (since they are
constant) to allow for more prefix bits. An example prefix scheme (that
took the complexity waaay too far, we're working on that):
https://salsa.debian.org/Kazan-team/kazan/blob/0c5abb5d35b03c52a21a54d4002f76bcec6c5d1d/docs/Prefix%20Proposal.md
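
The bit budget behind that remark can be sketched in Python. The field placement below is purely hypothetical and is not the layout in the linked proposal; the only facts relied on are that a standard 32-bit RISC-V instruction has 0b11 in bits [1:0] and that the standard 48-bit length encoding fixes bits [5:0] to 0b011111. Dropping the two constant bits leaves 48 - 6 - 30 = 12 prefix bits instead of 10.

```python
LEN48_MARKER = 0b011111  # bits [5:0] of a standard 48-bit RISC-V encoding


def pack48(prefix12, insn32):
    """Hypothetical layout: [47:36] prefix, [35:6] insn32[31:2], [5:0] marker."""
    if not 0 <= insn32 < (1 << 32) or insn32 & 0b11 != 0b11:
        raise ValueError("not a standard 32-bit RISC-V encoding")
    if not 0 <= prefix12 < (1 << 12):
        raise ValueError("prefix must fit in 12 bits")
    payload = insn32 >> 2  # drop the constant low bits
    return (prefix12 << 36) | (payload << 6) | LEN48_MARKER


def unpack48(word48):
    """Recover the prefix and the original 32-bit instruction."""
    if word48 & 0b111111 != LEN48_MARKER:
        raise ValueError("not a 48-bit encoding")
    prefix12 = word48 >> 36
    payload = (word48 >> 6) & ((1 << 30) - 1)
    return prefix12, (payload << 2) | 0b11  # restore the constant low bits


ADD_A0_A1_A2 = 0x00C58533  # add a0, a1, a2
word = pack48(0xABC, ADD_A0_A1_A2)
print(hex(word), unpack48(word) == (0xABC, ADD_A0_A1_A2))
```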

> > Having a prefix rather than, or in addition to, a layout configuration
> > register allows intermixing vector operations on different group/element
> > sizes without having to constantly change the vector configuration every
> > few instructions.
>
> No real difference. The standard RISC-V Vector extension is intended
> to allow exactly those changes to the vector configuration every few
> instructions. It's mostly the microcontroller people coming from
> DSP/SIMD who want to do that, so it's up to them to make that
> efficient on their cores -- they might even do macro-op fusion on it.

Yeah, that works, but you need a larger instruction fetch bandwidth.

> Big OoO/Supercomputer style code compiled from C/FORTRAN in general
> doesn't want to do that kind of thing.

We're aiming for SIMT-style code (Vulkan Shaders) converted into
variable-length vector operations, so it's different than either
microcontroller or supercomputer styles.

Before vectorization, short vectors are used to represent:
- colors (RGBA)
- positions (XYZ)
- geometric vectors (XYZ)
- transformation matrices (4x4 or 4x3/3x4)
- positions in homogeneous coordinates (XYZW)
- and more.

The short vectors are used more as a grouping mechanism (like a struct or
class) rather than just a method of improving performance.

One problem with the V extension in this use case is that 3-element vectors
(pre-vectorization) are quite common, so if there were a mechanism to
natively support them, we could pack them tightly in registers and ALUs,
preventing a 25% performance loss.
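
The 25% figure is simple lane arithmetic, sketched here in Python under the assumption that a vec3 stored in a 4-wide slot leaves its fourth lane idle. `lane_usage` is an illustrative helper, not part of any proposal.

```python
def lane_usage(vl, width=3, slot=4):
    """Lanes doing useful work vs lanes occupied when vec3s sit in 4-wide slots."""
    used = vl * width
    occupied = vl * slot
    return used, occupied, 1 - used / occupied


used, occupied, idle = lane_usage(8)
print(used, occupied, idle)
```

With tight packing the same 8 vec3 values would occupy 24 lanes rather than 32.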

An example:
http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-January/000433.html
Relevant section reproduced for convenience:
struct VertexIn
{
    vec3 position;
    vec3 normal;
    vec4 color; // rgba
};
struct VertexOut
{
    vec4 position; // xyzw
    vec4 color;
};
VertexIn vertexes_in[];
VertexOut vertexes_out[];
vec3 light_dir;
float ambient, diffuse;
for(int i = 0; i < 1000; i++)
{
    // calculate vertex colors using
    // lambert's cos model and fixed ambient brightness
    vec3 n = vertexes_in[i].normal;
    vec3 l = light_dir;
    float dot = n.x * l.x + n.y * l.y + n.z * l.z;
    float brightness = max(dot, 0.0) * diffuse + ambient;
    vec4 c = vertexes_in[i].color;
    c.rgb *= brightness;
    vertexes_out[i].color = c;
    // orthographic projection
    vertexes_out[i].position = vec4(vertexes_in[i].position, 1.0);
}

vectorization produces:
for(int i = 0; i < 1000;)
{
    VL = setvl(1000 - i);
    vec3xVL n = load3xVL_strided(&vertexes_in[i].normal, sizeof(VertexIn));
    vec3 l = light_dir;
    vecVL dot = n.x * l.x + n.y * l.y + n.z * l.z;
    vecVL brightness = max(dot, 0.0) * diffuse + ambient;
    vec4xVL c = load4xVL_strided(&vertexes_in[i].color, sizeof(VertexIn));
    vec3xVL c_rgb = c.rgb;
    c_rgb *= brightness;
    c.rgb = c_rgb;
    store4xVL_strided(&vertexes_out[i].color, c, sizeof(VertexOut));
    vec4xVL p = 1.0;
    p.xyz = load3xVL_strided(&vertexes_in[i].position, sizeof(VertexIn));
    store4xVL_strided(&vertexes_out[i].position, p, sizeof(VertexOut));
    i += VL;
}
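
As a sanity check on the transformation, the shading part of both loops can be modelled in Python and compared. This is a sketch under assumptions: `max_vl` stands in for the value setvl would return, the strided loads are replaced by list slices, and the position copy is left out for brevity.

```python
def shade_scalar(normals, colors, light, diffuse, ambient):
    """One vertex per iteration, as in the original loop."""
    out = []
    for (nx, ny, nz), (r, g, b, a) in zip(normals, colors):
        dot = nx * light[0] + ny * light[1] + nz * light[2]
        brightness = max(dot, 0.0) * diffuse + ambient
        out.append((r * brightness, g * brightness, b * brightness, a))
    return out


def shade_strip_mined(normals, colors, light, diffuse, ambient, max_vl):
    """VL vertices per iteration, as in the vectorized loop."""
    out = []
    i = 0
    while i < len(normals):
        vl = min(len(normals) - i, max_vl)      # VL = setvl(count - i)
        n = normals[i:i + vl]                   # vec3xVL load
        c = colors[i:i + vl]                    # vec4xVL load
        dot = [nx * light[0] + ny * light[1] + nz * light[2] for nx, ny, nz in n]
        brightness = [max(d, 0.0) * diffuse + ambient for d in dot]
        # c.rgb *= brightness, alpha passes through untouched
        out.extend((r * s, g * s, b * s, a) for (r, g, b, a), s in zip(c, brightness))
        i += vl
    return out


normals = [(0.0, 0.0, 1.0), (0.0, 1.0, 0.0), (0.0, 0.0, -1.0)]
colors = [(1.0, 0.5, 0.25, 1.0)] * 3
light = (0.0, 0.0, 1.0)
reference = shade_scalar(normals, colors, light, 0.5, 0.25)
print(all(shade_strip_mined(normals, colors, light, 0.5, 0.25, vl) == reference
          for vl in (1, 2, 8)))
```

Note that `dot` and `brightness` are one value per vertex while `c` is four, which is why a predicate bit per whole short vector is the natural fit.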

Jacob Lifshay

Luke Kenneth Casson Leighton via llvm-dev

2019-Feb-02 00:29 UTC

[llvm-dev] [RFC] Vector Predication

On Friday, February 1, 2019, Jacob Lifshay <programmerjake at gmail.com>
wrote:

> Neat! I did not know that about the V extension. So this sounds as
> though the V extension would like support for <VL x <4 x float>>-style
> vectors as well.
>
> We are currently thinking of defining the extension in terms of a 16-bit
> prefix that changes standard 32-bit instructions into vectorized 48-bit
> instructions,

(and Compressed 16-bit ones into vectorised 32-bit, unmodified)

> allowing most future or current standard/non-standard extensions to be
> vectorized, rather than having to wait for additional extensions to have
> vector versions added to the V extension (one reason we are not using the V
> extension instead), such as the B extension.

(and there isn't enough free RISCV 32 bit opcode space to add vectorised
versions of xBitManip, nor any future custom or standard ops. So the
simplest most powerful way to move forward quickly and without disruption
is a prefix)

L.



-- 
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68
