On Fri, Feb 1, 2019 at 1:19 AM Bruce Hoult <brucehoult at sifive.com> wrote:

> On Thu, Jan 31, 2019 at 11:53 PM Luke Kenneth Casson Leighton via
> llvm-dev <llvm-dev at lists.llvm.org> wrote:
> >
> > ---
> > crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68
> >
> > On Thu, Jan 31, 2019 at 10:22 PM Jacob Lifshay <programmerjake at gmail.com> wrote:
> > >
> > > We're in-progress designing a RISC-V extension
> > > (http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-January/000433.html)
> > > that would have variable-length vectors of short vectors (1 to 4):
> > > <VL x <4 x float>>
> > > where each predicate bit masks out a whole short vector. We're using
> > > this extension to vectorize graphics code where variables in the
> > > pre-vectorization code are short vectors.
> > > So, vectorizing code like:
> > > for(int i = 0; i < 1000; i++)
> > > {
> > >     vec4 color = colors[i];
> > >     vec3 normal = normals[i];
> > >     color.rgb *= fmax(0.0, dot(normal, light_dir));
> > >     colors[i] = color;
> > > }
> > >
> > > I'm planning on passing already vectorized code into LLVM and using
> > > LLVM as a backend for optimization and JIT code generation.
> > >
> > > Do you think the EVL proposal would support an ISA like this as it's
> > > currently written (by pattern matching on predicate expansion and
> > > vector-length multiplication)?
> >
> > whilst it may be tempting to suggest that a solution is to multiply up
> > the bits in the predicate (into groups of 3 or 4), the problem with
> > that is that if there are operations that require vec3 or vec4 as
> > operands interspersed with predicated operations that do not, that
> > realistically implies a need for two separate predicate registers,
> > otherwise cycles are wasted swapping predicates OR it implies that the
> > architecture *allows* two separate predicate registers to be selected.
> >
> > consequently, it would be much, much better to be able to have a
> > single bit of a predicate apply to the *entire* vec3 or vec4 type, on
> > each outer loop.
>
> This situation can be handled easily in the standard RISC-V vector
> extension. You'd do something like...
>
>     vsetvli t0, a0, vsew128,vnreg8,vdiv4
>
> ... to configure the vector unit to provide eight vector register
> variables divided into a standard element width of 128 bits (some
> instructions will widen or narrow one step to/from 64 bits or 256
> bits), and then dividing each 128 bit element into 4 parts.
>
> Arithmetic/logical/shift will happen on 32 bit elements, but
> predication and loads and stores (including strided or scatter/gather)
> will operate on 128 bit elements.
>
> [I just made up "vnreg8" as an alias for the standard "vlmul4" because
> "vlmul4,vdiv4" might look confusing. Either way it means to put 0b10
> into bits [1:0] of the vtype CSR specifying that the 32 vector
> registers should be ganged into 8 groups each 4x longer than standard
> because (I'm assuming) we need more than four vector registers in this
> loop, but no more than eight]

Neat! I did not know that about the V extension. So this sounds as though the V extension would like support for <VL x <4 x float>>-style vectors as well.

We are currently thinking of defining the extension in terms of a 16-bit prefix that changes standard 32-bit instructions into vectorized 48-bit instructions, allowing most future or current standard/non-standard extensions to be vectorized, rather than having to wait for additional extensions to have vector versions added to the V extension (one reason we are not using the V extension instead), such as the B extension.

Having a prefix rather than, or in addition to, a layout configuration register allows intermixing vector operations on different group/element sizes without having to constantly change the vector configuration every few instructions.
We would also be reserving a 32-bit opcode for compressed variants of the most commonly used 48-bit prefixed instructions, similar in style to the C extension.

Having a prefix also allows (assuming we don't run out of encoding space) larger register fields; we're planning on 128 integer and 128 fp registers.

Btw, for anyone interested, feel free to join us over on libre-riscv-dev or follow at https://www.crowdsupply.com/libre-risc-v/m-class (currently somewhat out-of-date).

Jacob
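The two predication styles discussed in the quoted text (multiplying up predicate bits versus letting one bit cover a whole short vector) can be compared with a small Python model. This is only a sketch of the semantics: the function names `expand_predicate`, `masked_scale` and `masked_scale_expanded` are hypothetical and do not correspond to any ISA instruction or LLVM intrinsic.

```python
from typing import List


def expand_predicate(pred: List[bool], group: int) -> List[bool]:
    """Multiply up each predicate bit so it covers `group` scalar lanes."""
    return [bit for bit in pred for _ in range(group)]


def masked_scale(flat: List[float], pred: List[bool], group: int, k: float) -> List[float]:
    """Scale each enabled short vector by k.

    `flat` stores len(pred) short vectors of `group` elements back to back.
    One predicate bit enables or disables a whole short vector.
    """
    assert len(flat) == len(pred) * group
    out = list(flat)
    for v, enabled in enumerate(pred):
        if enabled:
            for e in range(group):
                out[v * group + e] = flat[v * group + e] * k
    return out


def masked_scale_expanded(flat: List[float], lane_pred: List[bool], k: float) -> List[float]:
    """Same operation, but with one predicate bit per scalar lane."""
    return [x * k if p else x for x, p in zip(flat, lane_pred)]
```

Both forms give the same result for a pure vec4 operation. The difference raised in the thread is that the expanded form needs a second, longer predicate whenever vec4 operations are mixed with ordinary per-element predicated operations.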
On Fri, Feb 1, 2019 at 2:09 AM Jacob Lifshay <programmerjake at gmail.com> wrote:

> Neat! I did not know that about the V extension. So this sounds as though
> the V extension would like support for <VL x <4 x float>>-style vectors
> as well.

Yes. In general, support for <VL x <M x iN>> where M is in {2,4,8} and N could be as small as 1 though support for smaller than i8 is optional. (no distinction is drawn between int and float in the vector configuration -- that's up to the operations performed)

> We are currently thinking of defining the extension in terms of a 16-bit
> prefix that changes standard 32-bit instructions into vectorized 48-bit
> instructions, allowing most future or current standard/non-standard
> extensions to be vectorized, rather than having to wait for additional
> extensions to have vector versions added to the V extension (one reason we
> are not using the V extension instead), such as the B extension.

Do you mean instructions following the standard 48-bit encoding scheme, that happen to contain a standard 32 bit instruction as a payload?

> Having a prefix rather than, or in addition to, a layout configuration
> register allows intermixing vector operations on different group/element
> sizes without having to constantly change the vector configuration every
> few instructions.

No real difference. The standard RISC-V Vector extension is intended to allow exactly those changes to the vector configuration every few instructions. It's mostly the microcontroller people coming from DSP/SIMD who want to do that, so it's up to them to make that efficient on their cores -- they might even do macro-op fusion on it.

Big OoO/Supercomputer style code compiled from C/FORTRAN in general doesn't want to do that kind of thing.

Example code that changes the configuration within a loop to do 16 bit loads, 16x16->32 multiply, then 32 bit shift and store:

    # Example: Load 16-bit values, widen multiply to 32b, shift 32b result
    # right by 3, store 32b values.

    loop:
        vsetvli a3, a0, vsew16,vlmul4   # vtype = 16-bit integer vectors
        vlh.v v4, (a1)                  # Get 16b vector
        slli t1, a3, 1                  # 2 bytes per 16-bit element
        add a1, a1, t1                  # Bump pointer
        vwmul.vs v8, v4, v1             # 32b in <v8--v15>

        vsetvli x0, a0, vsew32,vlmul8   # Operate on 32b values
        vsrl.vi v8, v8, 3
        vsw.v v8, (a2)                  # Store vector of 32b
        slli t1, a3, 2                  # 4 bytes per 32-bit element
        add a2, a2, t1                  # Bump pointer
        sub a0, a0, a3                  # Decrement count
        bnez a0, loop                   # Any more?

(this example is probably only useful if 16x16->32 mul is significantly faster than 32x32->32, otherwise you'd just load and sign extend the 16 bit data into 32 bit elements)

A note on vector register numbering. There are registers 0..31. If you specify vlmul4 then only v0,v4,v8,v12,v16,v20,v24,v28 are valid register numbers. If you specify vlmul8 then only v0,v8,v16,v24 are valid.
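For readers who want to check the arithmetic, here is a rough Python model of the loop and of the register-numbering rule. It is a sketch under simplifying assumptions: `vsetvli` is modelled as returning `min(remaining, max_vl)`, the helper names (`valid_vregs`, `widen_mul_shift`) are made up for illustration, and the scalar multiplier is passed in as a plain integer.

```python
def valid_vregs(lmul: int) -> list:
    """Register numbers usable as group bases when the 32 registers are ganged by lmul."""
    return list(range(0, 32, lmul))


def widen_mul_shift(src, scalar, max_vl):
    """Model: 16-bit loads, 16x16->32 multiply, logical shift right by 3, 32-bit stores.

    Returns (results, source bytes consumed, destination bytes written).
    """
    out = []
    src_bytes = 0
    dst_bytes = 0
    i = 0
    remaining = len(src)
    while remaining:
        vl = min(remaining, max_vl)            # assumed behaviour of vsetvli
        chunk = src[i:i + vl]                  # vlh.v: vl 16-bit elements
        prod = [(x * scalar) & 0xFFFFFFFF for x in chunk]  # widened to 32 bits
        out.extend(p >> 3 for p in prod)       # vsrl.vi on 32-bit elements
        src_bytes += vl * 2                    # 2 bytes per 16-bit element
        dst_bytes += vl * 4                    # 4 bytes per 32-bit element
        i += vl
        remaining -= vl                        # sub a0, a0, a3
    return out, src_bytes, dst_bytes
```

For example, `valid_vregs(4)` gives the eight register numbers listed above for vlmul4, and the two byte counters show why the source and destination pointers advance at different rates.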
On Fri, Feb 1, 2019 at 2:59 AM Bruce Hoult <brucehoult at sifive.com> wrote:

> On Fri, Feb 1, 2019 at 2:09 AM Jacob Lifshay <programmerjake at gmail.com>
> wrote:
> > Neat! I did not know that about the V extension. So this sounds as
> > though the V extension would like support for <VL x <4 x float>>-style
> > vectors as well.
>
> Yes. In general, support for <VL x <M x iN>> where M is in {2,4,8} and
> N could be as small as 1 though support for smaller than i8 is
> optional. (no distinction is drawn between int and float in the vector
> configuration -- that's up to the operations performed)
>
> > We are currently thinking of defining the extension in terms of a 16-bit
> > prefix that changes standard 32-bit instructions into vectorized 48-bit
> > instructions, allowing most future or current standard/non-standard
> > extensions to be vectorized, rather than having to wait for additional
> > extensions to have vector versions added to the V extension (one reason we
> > are not using the V extension instead), such as the B extension.
>
> Do you mean instructions following the standard 48-bit encoding
> scheme, that happen to contain a standard 32 bit instruction as a
> payload?

Yes. We reuse the 2 LSB bits from the 32-bit instruction (since they are constant) to allow for more prefix bits. An example prefix scheme (that took the complexity waaay too far, we're working on that):
https://salsa.debian.org/Kazan-team/kazan/blob/0c5abb5d35b03c52a21a54d4002f76bcec6c5d1d/docs/Prefix%20Proposal.md

> > Having a prefix rather than, or in addition to, a layout configuration
> > register allows intermixing vector operations on different group/element
> > sizes without having to constantly change the vector configuration every
> > few instructions.
>
> No real difference. The standard RISC-V Vector extension is intended
> to allow exactly those changes to the vector configuration every few
> instructions. It's mostly the microcontroller people coming from
> DSP/SIMD who want to do that, so it's up to them to make that
> efficient on their cores -- they might even do macro-op fusion on it.

Yeah, that works, but you need a larger instruction fetch bandwidth.

> Big OoO/Supercomputer style code compiled from C/FORTRAN in general
> doesn't want to do that kind of thing.

We're aiming for SIMT-style code (Vulkan Shaders) converted into variable-length vector operations, so it's different than either microcontroller or supercomputer styles.

Before vectorization, short vectors are used to represent:
- colors (RGBA)
- positions (XYZ)
- geometric vectors (XYZ)
- transformation matrices (4x4 or 4x3/3x4)
- positions in homogeneous coordinates (XYZW)
- and more.

The short vectors are used more as a grouping mechanism (like a struct or class) rather than just a method of improving performance.

One problem with the V extension in this use case is that 3-element vectors (pre-vectorization) are quite common, so if there were a mechanism to natively support them, we could pack them tightly in registers and ALUs, preventing a 25% performance loss.
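The 25% figure follows from padding each 3-element vector into a 4-element slot. A minimal Python check of that arithmetic is below; it assumes throughput is proportional to the number of occupied lanes, and the helper names are hypothetical.

```python
from fractions import Fraction


def lane_utilization(elems_per_vector: int, slot_width: int) -> Fraction:
    """Fraction of lanes doing useful work when each short vector occupies a slot."""
    return Fraction(elems_per_vector, slot_width)


def lanes_needed(n_vectors: int, elems_per_vector: int, slot_width=None) -> int:
    """Total lanes occupied; slot_width=None means the vectors are packed tightly."""
    width = slot_width if slot_width is not None else elems_per_vector
    return n_vectors * width
```

With vec3 data in 4-wide slots, `lane_utilization(3, 4)` is 3/4, so a quarter of the lanes are idle. For 1000 vectors, padding occupies 4000 lanes where tight packing occupies 3000.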
An example:
http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-January/000433.html

Relevant section reproduced for convenience:

    struct VertexIn
    {
        vec3 position;
        vec3 normal;
        vec4 color; // rgba
    };

    struct VertexOut
    {
        vec4 position; // xyzw
        vec4 color;
    };

    VertexIn vertexes_in[];
    VertexOut vertexes_out[];
    vec3 light_dir;
    float ambient, diffuse;

    for(int i = 0; i < 1000; i++)
    {
        // calculate vertex colors using
        // lambert's cos model and fixed ambient brightness
        vec3 n = vertexes_in[i].normal;
        vec3 l = light_dir;
        float dot = n.x * l.x + n.y * l.y + n.z * l.z;
        float brightness = max(dot, 0.0) * diffuse + ambient;
        vec4 c = vertexes_in[i].color;
        c.rgb *= brightness;
        vertexes_out[i].color = c;
        // orthographic projection
        vertexes_out[i].position = vec4(vertexes_in[i].position, 1.0);
    }

vectorization produces:

    for(int i = 0; i < 1000;)
    {
        VL = setvl(1000 - i);
        vec3xVL n = load3xVL_strided(&vertexes_in[i].normal, sizeof(VertexIn));
        vec3 l = light_dir;
        vecVL dot = n.x * l.x + n.y * l.y + n.z * l.z;
        vecVL brightness = max(dot, 0.0) * diffuse + ambient;
        vec4xVL c = load4xVL_strided(&vertexes_in[i].color, sizeof(VertexIn));
        vec3xVL c_rgb = c.rgb;
        c_rgb *= brightness;
        c.rgb = c_rgb;
        store4xVL_strided(&vertexes_out[i].color, c, sizeof(VertexOut));
        vec4xVL p = 1.0;
        p.xyz = load3xVL_strided(&vertexes_in[i].position, sizeof(VertexIn));
        store4xVL_strided(&vertexes_out[i].position, p, sizeof(VertexOut));
        i += VL;
    }

Jacob Lifshay
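As a sanity check on the transformation, the colour part of both loops can be modelled in Python and compared. This is a sketch only: `setvl` is assumed to return `min(remaining, max_vl)`, vectors are plain tuples, the strided loads become list slices, and the function names are invented for the illustration.

```python
def setvl(remaining: int, max_vl: int) -> int:
    """Assumed behaviour: grant as many elements as requested, up to max_vl."""
    return min(remaining, max_vl)


def shade_scalar(normals, colors, light_dir, ambient, diffuse):
    """One vertex per iteration, as in the pre-vectorization loop."""
    lx, ly, lz = light_dir
    out = []
    for (nx, ny, nz), (r, g, b, a) in zip(normals, colors):
        dot = nx * lx + ny * ly + nz * lz
        brightness = max(dot, 0.0) * diffuse + ambient
        out.append((r * brightness, g * brightness, b * brightness, a))
    return out


def shade_strip_mined(normals, colors, light_dir, ambient, diffuse, max_vl=4):
    """VL vertices per iteration, as in the vectorized loop."""
    lx, ly, lz = light_dir
    out = []
    i = 0
    total = len(normals)
    while i < total:
        vl = setvl(total - i, max_vl)
        n = normals[i:i + vl]                  # stands in for load3xVL_strided
        c = colors[i:i + vl]                   # stands in for load4xVL_strided
        dot = [nx * lx + ny * ly + nz * lz for (nx, ny, nz) in n]
        brightness = [max(d, 0.0) * diffuse + ambient for d in dot]
        # only rgb is scaled; alpha passes through unchanged
        out.extend((r * k, g * k, b * k, a) for (r, g, b, a), k in zip(c, brightness))
        i += vl
    return out
```

Both functions should return identical lists for any `max_vl` of at least 1, since the per-vertex arithmetic is the same and only the chunking differs.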
Luke Kenneth Casson Leighton via llvm-dev
2019-Feb-02 00:29 UTC
[llvm-dev] [RFC] Vector Predication
On Friday, February 1, 2019, Jacob Lifshay <programmerjake at gmail.com> wrote:

> Neat! I did not know that about the V extension. So this sounds as though
> the V extension would like support for <VL x <4 x float>>-style vectors as
> well.
>
> We are currently thinking of defining the extension in terms of a 16-bit
> prefix that changes standard 32-bit instructions into vectorized 48-bit
> instructions,

(and Compressed 16-bit ones into vectorised 32-bit, unmodified)

> allowing most future or current standard/non-standard extensions to be
> vectorized, rather than having to wait for additional extensions to have
> vector versions added to the V extension (one reason we are not using the V
> extension instead), such as the B extension.

(and there isn't enough free RISCV 32 bit opcode space to add vectorised versions of xBitManip, nor any future custom or standard ops. So the simplest most powerful way to move forward quickly and without disruption is a prefix)

L.

--
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68
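As a rough bit budget for the prefix idea discussed in this thread: the standard RISC-V length encoding marks a 48-bit instruction with bits [5:0] = 0b011111, and every 32-bit instruction has bits [1:0] = 0b11, which is why those two payload bits can be reused. The Python sketch below only does the subtraction; the function name and parameter names are hypothetical, and the actual field layout of the proposal is not assumed here.

```python
def prefix_bits(total_bits: int = 48,
                length_marker_bits: int = 6,
                payload_bits: int = 32,
                reused_payload_bits: int = 2) -> int:
    """Bits left for vector prefix fields in a 48-bit instruction.

    length_marker_bits: fixed low bits that mark the instruction as 48 bits long.
    reused_payload_bits: constant low bits of the embedded 32-bit instruction
    that need not be stored again.
    """
    stored_payload = payload_bits - reused_payload_bits
    return total_bits - length_marker_bits - stored_payload
```

Under these assumptions, reusing the two constant bits leaves 12 bits for prefix fields instead of 10.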